- Completed: May 24, 2024 at 9:50 PM
2024 MGHPCC downtime complete
DOWNTIME COMPLETE
The annual multi-day power downtime at MGHPCC (https://www.rc.fas.harvard.edu/blog/2024-mghpcc-power-downtime/) is complete (with any exceptions noted below). Normal service resumes today (Friday May 24th) at 5pm.
The cluster has been updated to Rocky Linux 8.9. Several network, InfiniBand, compute, and storage firmware updates were installed. Available security updates were also installed.
CANNON NODES
More than 90% of nodes are up and all partitions are enabled. If your specialty partition has a downed node, we will attend to it on Tuesday.
FASSE OOD
Some updates are still propagating. If your FASSE Open OnDemand/VDI session does not work initially, please wait or retry your job/session.
POST-DOWNTIME SUPPORT
If you have any further concerns or unanswered questions, please submit a help ticket (https://portal.rc.fas.harvard.edu/rcrt/submit_ticket) and we will do our best to respond quickly. Please bear in mind that it is a long weekend, so lingering issues may not be dealt with until Tuesday.
Also, have a good long Memorial Day weekend!
Thanks,
FAS Research Computing
https://www.rc.fas.harvard.edu/
https://docs.rc.fas.harvard.edu/
- Update (In progress): May 24, 2024 at 9:08 PM
Opening of the cluster is currently delayed due to some lingering issues.
We will re-open as soon as possible or update again at 6pm.
- Update (In progress): May 24, 2024 at 1:47 PM
Power work completed by facility. Currently on schedule for powerup and return to service. ETA 5pm.
- In progress: May 21, 2024 at 1:00 PM
Maintenance is now in progress.
- Planned: May 21, 2024 at 1:00 PM
The 2024 MGHPCC data center annual power downtime will take place May 21-24, 2024.
We will begin our shutdown on Tuesday May 21st and expect a return to service by 5PM Friday May 24th.
- Jobs: Please plan ahead. Any jobs still running on the morning of May 21st will be stopped and canceled, and will need to be resubmitted after the downtime. Pending jobs will remain in the queue until the cluster returns to regular service on May 24th.
- Access: The cluster, scheduler, login, and OoD nodes will be unavailable for the duration of the downtime. New lab and account requests should wait until after the downtime.
- Storage: All Holyoke storage will be powered down and unavailable for the duration of the downtime. Boston storage will remain online, but your ability to access it may be impacted and network changes may briefly affect its availability.
Further details, an explanation for this year's change in scheduling, a visual timeline, and an overview of maintenance tasks can be found at:
https://www.rc.fas.harvard.edu/blog/2024-mghpcc-power-downtime/
Progress of the downtime will be posted here on our status page during the event. Note that you can subscribe to receive updates as they happen. Click Get Updates in the upper right.
MAJOR TASK OVERVIEW
OS upgrade to Rocky Linux 8.9: this is a point upgrade, so no code rebuilds will be required. Nodes will switch from system OFED to Mellanox OFED for improved performance.
InfiniBand (network) upgrades
BIOS updates (various)
Storage firmware updates
Network Maintenance
Decommission old nodes (targets contacted)
Additional minor one-off updates and maintenance (cable swap, reboots, etc.)
Thanks,
FAS Research Computing
https://www.rc.fas.harvard.edu/