Experiencing partially degraded performance

Status page for the Harvard FAS Research Computing cluster and other resources.

Cluster Utilization (VPN and FASRC login required): Cannon | FASSE | Academic


Please scroll down to see details on any Incidents or maintenance notices.
Monthly maintenance occurs on the first Monday of the month (except holidays).

GETTING HELP
https://docs.rc.fas.harvard.edu | https://portal.rc.fas.harvard.edu | Email: rchelp@rc.fas.harvard.edu


The colors shown in the bars below were chosen to increase visibility for color-blind visitors.
If the background is dark and the colors appear muted, switch to light mode at the bottom of this page for higher contrast.

2023 MGHPCC power downtime and cluster updates June 5-8, 2023

Completed
Scheduled for June 05, 2023 at 12:00 PM – June 09, 2023 at 1:14 PM

Affects

Cannon Cluster
SLURM Scheduler - Cannon
Cannon Compute Cluster (Holyoke)
Boston Compute Nodes
GPU nodes (Holyoke)
FASSE Cluster
Updates
  • Completed
    June 09, 2023 at 1:14 PM

    Downtime complete.

    For full details of changes and additional notices, please see email or visit https://www.rc.fas.harvard.edu/blog/2023-downtime/

    Rocky 8 Transition Guide: https://docs.rc.fas.harvard.edu/kb/rocky-8-transition-guide/

  • Update (In progress)
    June 06, 2023 at 7:01 PM

    Network issue resolved.

    Earlier notice (now resolved): Core network updates were causing some additional interruptions to Boston data center resources; the FASRC portal, website, and docs were affected. ETA unknown, but not expected to be a long interruption.

  • In progress
    June 05, 2023 at 12:00 PM

    The annual multi-day power downtime at MGHPCC will take place June 5th-8th, with a return to full service at 9 AM on June 9th.

    During this downtime we will be performing major updates to the cluster's operating system and public partition time limits, along with significant changes to our software offerings and procedures. This will affect ALL cluster users and, as described below, may require you to modify your normal workflows and software chains. Test clusters for Cannon and FASSE (including OOD/VDI) are available. Due to the breadth of these changes, this downtime will also impact almost every service we provide, including most housed in Boston.

    • All running and pending jobs will be purged Monday morning, June 5th. Any jobs that were pending or did not finish will need to be re-submitted once the cluster is back up.

    • All login nodes will be down for the duration and will have their OS updated to Rocky 8. Once complete, the login process will not change, but the underlying operating system will. Rocky 8 test clusters for Cannon and FASSE are available NOW for you to test your workflows and prepare for these changes.

    • All storage at MGHPCC/Holyoke will be powered down during the outage. Boston storage servers will not be powered down but will likely be unavailable as there will be no nodes available to access them. Some Boston Samba shares may be accessible, but please assume they will also be affected at various times.

    OS Update - Rocky Linux

    Currently the cluster nodes (Cannon and FASSE), and indeed most of our infrastructure, are built on CentOS 7, the non-commercial rebuild of Red Hat Enterprise Linux. CentOS is being discontinued by Red Hat, and new development ceased at the end of 2021. As a result, we are moving to Rocky Linux 8, created by founders of the original CentOS project and backed by much of the HPC community. Given its wide adoption at other HPC sites, we are confident Rocky Linux is the right choice: we will be part of a large community where we can both find support and contribute back.

    Rocky Linux will be a major update with significant changes, in the same way that our move from CentOS 6 to CentOS 7 was. As such, there will be issues that affect some pipelines, software, or codes. Additionally, we will necessarily be revamping our software offerings and, at the same time, giving end-users more control over their own software with new build tools such as Spack.

    Rocky 8 Transition Guide: https://docs.rc.fas.harvard.edu/kb/rocky-8-transition-guide/

    Please DO expect to see SSH warnings when you log in again after June 8th, as all machines, including login nodes, will have been completely rebuilt. See the FAQ at: https://docs.rc.fas.harvard.edu/kb/rocky-8-transition-guide/#FAQ
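
    If your SSH client refuses to connect because of the changed host key, clearing the cached entry and accepting the new key on your next login resolves the warning. A minimal sketch (the hostname is an example; substitute the login host you normally use):

        # Remove the stale host key recorded before the rebuild (example hostname)
        ssh-keygen -R login.rc.fas.harvard.edu
        # Reconnect and accept the new host key when prompted
        ssh <username>@login.rc.fas.harvard.edu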

    Cannon and FASSE OOD (aka VDI)

    Regarding OOD/VDI: The Open OnDemand instances for Cannon (not FASSE) will involve a change to each user's settings and the location of the folder containing those settings and other OOD-specific files. This also means that values you had filled out in the OOD start form will not be retained. Form values will revert to the defaults.

    The new location in your Cannon home directory will be ~/.fasrcood

    Any "development mode" AKA "sandbox" apps that you have installed to ~/fasrc/dev (approximately 78 of you have created these) will no longer be visible through your OOD Dashboard, and will need to be moved to ~/.fasrcood/dev . The old folder at ~/fasrc will no longer be used and can be removed after June 8th.

    The location for FASSE settings in your home directory will not change and will remain at ~/.fasseood

    Public Partition Time Limits on Cannon

    During the summer of 2022 we did an analysis of job run times on the Cannon cluster with the goal of reassessing our existing partition time limits (which are 7 days). A reduced time limit has many benefits, such as reduced cluster fragmentation, lower wait times, and shorter times to drain nodes for service. This analysis found that over 95% of jobs complete within 3 days on all the public partitions excluding unrestricted.

    We will be changing all the public partitions on Cannon, excluding unrestricted, to a 3-day time limit. To accommodate the 5% of jobs that run longer than 3 days, we will be adding an intermediate partition for jobs that need to run between 3 and 14 days, with unrestricted handling the rest. New partitions owned by specific groups will also get a default 3-day time limit; existing group partitions are not impacted, and groups may change this default to suit their needs.
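
    For jobs that need more than 3 days, the request would look something like the sketch below. The partition name shown is illustrative only; check sinfo or the FASRC docs for the actual partition names once the change is live:

        #!/bin/bash
        # Sketch of a batch script requesting a run longer than the new 3-day public limit
        #SBATCH --job-name=long_job
        #SBATCH --partition=intermediate   # illustrative name for the new 3-14 day partition
        #SBATCH --time=7-00:00:00          # 7 days, within the 3-14 day window described above
        #SBATCH --ntasks=1

        ./my_long_running_program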

    Software Changes

    FASRC will reduce the number of precompiled software packages that it hosts, distilling these down to the essentials: compilers, commercial packages, apps needed for VDI, and the like.

    We will reduce our dependence on Lmod modules and provide end-users with expanded options for building and deploying the software they need. We will incorporate Spack to give users more power to build the software and versions they need. Those who came to FASRC from other HPC sites may recognize this as the norm at many sites these days.
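
    To give a flavor of the kind of user-level workflow Spack enables, here is a minimal sketch (the package name is an example; see the Rocky 8 Transition Guide for FASRC-specific setup):

        # Build a library and its dependencies in your own space, then use it
        spack install fftw     # compile the package from source (example package)
        spack find             # list what you have installed
        spack load fftw        # add it to your environment for the current session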

    There will be a Singularity image to allow those who need it to run a CentOS 7 image until they can transition away from it: https://docs.rc.fas.harvard.edu/kb/centos7-singularity/
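
    As a rough illustration of using that container, the command below runs a legacy binary inside a CentOS 7 environment. The image path and command are placeholders; see the linked page for the actual image location and supported usage:

        # Run a command inside the CentOS 7 Singularity container (placeholder paths)
        singularity exec /path/to/centos7.sif ./my_centos7_binary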

    If you load modules in your .bashrc or other startup scripts, you will likely need to modify them, as some modules will no longer be available or will have changed. This is why trying out the test cluster ahead of time is important. For example, software you have compiled yourself may need to be re-compiled.
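
    One way to keep a startup script from breaking while module names settle is to load modules conditionally. A minimal sketch for .bashrc (the module name and version are examples only):

        # Only load the module if it exists under the new Rocky 8 module tree
        if module avail gcc/12 2>&1 | grep -q "gcc/12"; then
            module load gcc/12
        fi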

    FASRC has a Rocky Linux test environment available where groups can begin to test their software, codes, jobs, etc. Please see the Rocky 8 Transition Guide for details.

    Please note that Anaconda environments will need to be rebuilt (but should also build faster).
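
    A common way to make that rebuild painless is to export the environment spec beforehand and recreate it afterwards. A minimal sketch (the environment and file names are examples):

        # Before the downtime: record the environment's packages without OS-specific build strings
        conda env export --name myenv --no-builds > myenv.yml
        # After the downtime: recreate the environment on Rocky 8
        conda env create --file myenv.yml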

    Regarding OOD/VDI: as noted in the "Cannon and FASSE OOD" section above, Cannon OOD settings will move to ~/.fasrcood and previously saved form values will revert to defaults; FASSE OOD is not affected and continues to use ~/.fasseood.

    Training, Office Hours, and Consultation

    Starting in April, FASRC will, in addition to our regular New User training and other upcoming classes, be offering training sessions on:

    From CentOS 7 to Rocky 8: How the new operating system will affect FASRC clusters

    FASRC has a Rocky Linux 8 test environment available up until the shutdown where groups can begin to test their software, codes, jobs, etc. This includes Cannon and FASSE users and OOD/VDI. Please see the Rocky 8 Transition Guide for details.