FAS Research Computing - Annual MGHPCC/Holyoke data center power downtime - May 21-24 2024 – Maintenance details

Status page for the Harvard FAS Research Computing cluster and other resources.

Cluster Utilization (VPN and FASRC login required): Cannon | FASSE | Academic


Please scroll down to see details on any Incidents or maintenance notices.
Monthly maintenance occurs on the first Monday of the month (except holidays).

GETTING HELP
https://docs.rc.fas.harvard.edu | https://portal.rc.fas.harvard.edu | Email: rchelp@rc.fas.harvard.edu


The colors shown in the bars below were chosen to increase visibility for color-blind visitors.
For higher contrast, switch to light mode at the bottom of this page if the background is dark and colors are muted.

Annual MGHPCC/Holyoke data center power downtime - May 21-24 2024

Completed
Scheduled for May 21, 2024 at 1:00 PM – May 24, 2024 at 9:50 PM

Affects

Cannon Cluster

Under maintenance from 1:00 PM to 9:50 PM

SLURM Scheduler - Cannon

Under maintenance from 1:00 PM to 9:50 PM

Cannon Compute Cluster (Holyoke)

Under maintenance from 1:00 PM to 9:50 PM

Boston Compute Nodes

Under maintenance from 1:00 PM to 9:50 PM

GPU nodes (Holyoke)

Under maintenance from 1:00 PM to 9:50 PM

seas_compute

Under maintenance from 1:00 PM to 9:50 PM

Updates
  • Completed
    May 24, 2024 at 9:50 PM

    2024 MGHPCC downtime complete

    DOWNTIME COMPLETE

    The annual multi-day power downtime at MGHPCC (https://www.rc.fas.harvard.edu/blog/2024-mghpcc-power-downtime/) is complete (with any exceptions noted below). Normal service resumes today (Friday May 24th) at 5pm.

    The cluster has been updated to Rocky Linux 8.9. Several network, InfiniBand, computer, and storage firmware updates were installed. Available security updates were also installed.

    CANNON NODES

    More than 90% of nodes are up and all partitions are enabled. If your specialty partition has a downed node, we will attend to this on Tuesday.

    FASSE OOD

    Some updates are still propagating. If your FASSE Open OnDemand/VDI session does not work initially, please wait or retry your job/session.

    POST-DOWNTIME SUPPORT

    If you have any further concerns or unanswered questions, please submit a help ticket (https://portal.rc.fas.harvard.edu/rcrt/submit_ticket) and we will do our best to respond quickly. Please bear in mind that it is a long weekend, so lingering issues may not be dealt with until Tuesday.

    Also, have a good long Memorial Day weekend!

    Thanks,

    FAS Research Computing

    https://www.rc.fas.harvard.edu/

    https://docs.rc.fas.harvard.edu/

    https://status.rc.fas.harvard.edu/

    rchelp@rc.fas.harvard.edu  

  • Update
    May 24, 2024 at 9:08 PM
    In progress

    Opening the cluster is currently delayed due to some lingering issues.

    We will re-open as soon as possible or update again at 6pm.

  • Update
    May 24, 2024 at 1:47 PM
    In progress

    Power work completed by facility. Currently on schedule for powerup and return to service. ETA 5pm.

  • In progress
    May 21, 2024 at 1:00 PM
    Maintenance is now in progress
  • Planned
    May 21, 2024 at 1:00 PM

    The 2024 MGHPCC data center annual power downtime will take place May 21-24, 2024.

    We will begin our shutdown on Tuesday May 21st and expect a return to service by 5PM Friday May 24th.

    - Jobs: Please plan ahead. All jobs still running on the morning of May 21st will be canceled and will need to be resubmitted after the downtime. Pending jobs will remain in the queue until the cluster returns to regular service on May 24th.

    - Access: The cluster, scheduler, login, and Open OnDemand (OOD) nodes will be unavailable for the duration of the downtime. New lab and account requests should wait until after the downtime.

    - Storage: All Holyoke storage will be powered down and unavailable for the duration of the downtime. Boston storage will remain online, but your ability to access it may be impacted and network changes may briefly affect its availability.
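
    As a planning aid for the Jobs item above, the following is a minimal sketch of how one might check whether a job's requested time limit would end before the shutdown. The cutoff time is an assumption: this notice only says "the morning of May 21st", so the 07:00 value below is a placeholder rather than an official time, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed cutoff: the notice only says "the morning of May 21st".
# 07:00 local time is a placeholder, not an official shutdown time.
ASSUMED_SHUTDOWN = datetime(2024, 5, 21, 7, 0)


def fits_before_shutdown(start, time_limit, shutdown=ASSUMED_SHUTDOWN):
    """Return True if a job starting at `start` with `time_limit` ends by `shutdown`."""
    return start + time_limit <= shutdown


def longest_safe_limit(start, shutdown=ASSUMED_SHUTDOWN):
    """Return the longest time limit that still ends by `shutdown` (zero if none)."""
    return max(shutdown - start, timedelta(0))


if __name__ == "__main__":
    start = datetime(2024, 5, 20, 9, 0)
    print(fits_before_shutdown(start, timedelta(hours=12)))
    print(longest_safe_limit(start))
```

    A job that fails this check would be among those canceled at shutdown and would need to be resubmitted after the downtime, so a shorter time limit or a later submission may be preferable.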

    Further details, an explanation for this year's change in scheduling, a visual timeline, and an overview of maintenance tasks can be found at:

    https://www.rc.fas.harvard.edu/blog/2024-mghpcc-power-downtime/

    Progress of the downtime will be posted here on our status page during the event. Note that you can subscribe to receive updates as they happen. Click Get Updates in the upper right.

    MAJOR TASK OVERVIEW

    • OS upgrade to Rocky Linux 8.9 - Point upgrade; no code rebuilds will be required. Nodes will switch from the system OFED to Mellanox OFED for improved performance.

    • InfiniBand (network) upgrades

    • BIOS updates (various)

    • Storage firmware updates

    • Network maintenance

    • Decommission old nodes (targets contacted)

    • Additional minor one-off updates and maintenance (cable swap, reboots, etc.)

    Thanks,

    FAS Research Computing

    https://www.rc.fas.harvard.edu/

    https://docs.rc.fas.harvard.edu/

    https://status.rc.fas.harvard.edu/