Experiencing partially degraded performance

Status page for the Harvard FAS Research Computing cluster and other resources.

Cluster Utilization (VPN and FASRC login required): Cannon | FASSE | Academic


Please scroll down to see details on any Incidents or maintenance notices.
Monthly maintenance occurs on the first Monday of the month (except holidays).

GETTING HELP
https://docs.rc.fas.harvard.edu | https://portal.rc.fas.harvard.edu | Email: rchelp@rc.fas.harvard.edu


The colors shown in the bars below were chosen to increase visibility for color-blind visitors.
If the background is dark and colors appear muted, switch to light mode at the bottom of this page for higher contrast.

Annual MGHPCC/Holyoke data center power downtime - May 21-24 2024
Scheduled for May 22, 2024 at 1:00 AM – May 24, 2024 at 9:00 PM (3 days)
  • Planned
    May 22, 2024 at 1:00 AM

    The 2024 MGHPCC data center annual power downtime will take place May 21-24, 2024.

    We will begin our shutdown on Tuesday May 21st and expect a return to service by 5PM Friday May 24th.

    - Jobs: Please plan ahead; any jobs still running on the morning of May 21st will be stopped and canceled and will need to be resubmitted after the downtime. Pending jobs will remain in the queue until the cluster returns to regular service on May 24th. (A quick way to check your current jobs is sketched below this list.)

    - Access: The cluster, scheduler, login, and Open OnDemand (OOD) nodes will be unavailable for the duration of the downtime. New lab and account requests should wait until after the downtime.

    - Storage: All Holyoke storage will be powered down and unavailable for the duration of the downtime. Boston storage will remain online, but network changes during the downtime may briefly affect your ability to access it.
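    As a quick way to check which of your jobs fall into these categories, here is a minimal sketch using standard Slurm client commands from a login node (your username is taken from the $USER environment variable; adjust as needed):

        # Jobs still running, which will be stopped and canceled when the downtime begins
        squeue -u $USER -t RUNNING

        # Jobs still pending, which will remain in the queue until the cluster returns to service
        squeue -u $USER -t PENDING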

    Further details, an explanation of this year's change in scheduling, a visual timeline, and an overview of the maintenance tasks can be found at:

    https://www.rc.fas.harvard.edu/blog/2024-mghpcc-power-downtime/

    Progress of the downtime will be posted here on our status page during the event. You can subscribe to receive updates as they happen by clicking Get Updates in the upper right.

    MAJOR TASK OVERVIEW

    • OS upgrade to Rocky 8.9 - point upgrade; no code rebuilds will be required. Switch from the system OFED to Mellanox OFED on nodes for improved performance.

    • InfiniBand (network) upgrades

    • BIOS updates (various)

    • Storage firmware updates

    • Network maintenance

    • Decommission old nodes (targets contacted)

    • Additional minor one-off updates and maintenance (cable swap, reboots, etc.)

    Thanks,

    FAS Research Computing

    https://www.rc.fas.harvard.edu/

    https://docs.rc.fas.harvard.edu/

    https://status.rc.fas.harvard.edu/

Cannon Cluster

Operational

SLURM Scheduler - Cannon

Operational

Cannon Compute Cluster (Holyoke)

Operational

Boston Compute Nodes

Operational

GPU nodes (Holyoke)

Operational

SEAS compute partition

Operational

FASSE Cluster

Operational

SLURM Scheduler - FASSE

Operational

FASSE Compute Cluster (Holyoke)

Operational

Kempner Cluster

Operational

Kempner Cluster CPU

Operational

Kempner Cluster GPU

Operational

Login Nodes

Operational

Login Nodes - Boston

Operational

Login Nodes - Holyoke

Operational

FASSE login nodes

Operational

VDI/OpenOnDemand

Operational

Cannon VDI (Open OnDemand)

Operational

FASSE VDI (Open OnDemand)

Operational

Storage

Degraded performance

Holyscratch01 (Global Scratch)

Degraded performance

Home Directory Storage - Boston

Operational

HolyLFS03 (Tier 0)

Operational

HolyLFS04 (Tier 0)

Operational

HolyLFS05 (Tier 0)

Operational

Holystore01 (Tier 0)

Operational

Holylabs

Operational

BosLFS02 (Tier 0)

Operational

Isilon Storage Boston (Tier 1)

Operational

Isilon Storage Holyoke (Tier 1)

Operational

CEPH Storage Boston (Tier 2)

Operational

Tape - (Tier 3)

Operational

Boston Specialty Storage

Operational

Holyoke Specialty Storage

Degraded performance

Samba Cluster

Operational

Globus Data Transfer

Operational

bosECS

Operational

holECS

Operational

Recent notices

May 15, 2024

May 14, 2024

May 13, 2024

Critical power supply work at MGHPCC May 13th - A subset of Cannon nodes will be idled
  • Completed
    May 13, 2024 at 7:15 PM

    This work has been completed and all 8A nodes are back in service.

  • In progress
    May 13, 2024 at 12:00 PM
    Maintenance is now in progress
  • Planned
    May 13, 2024 at 12:00 PM

    WHAT: Some nodes in row 8A will be idled at MGHPCC

    WHEN: May 13th 8am-4pm

    To avoid a future over-capacity situation, MGHPCC will be performing power supply work on May 13th from 8am-4pm. This includes sections of row 8A where some of our Cannon compute nodes are located. We will be idling half the nodes in pod 8A to allow the necessary power work.

    Unfortunately this work cannot be done during our upcoming outage. It is dictated by the availability of electricians and other resources outside the facility's control; otherwise it would have been included in the May 21-24 downtime.

    IMPACT

    Half the nodes in racks 8a22, 8a28, 8a30, and 8a32 will be down (~114 nodes, around 7% of total capacity). The work will also enable us to add more capacity for future purchases, so it is important that we allow this interruption.

    Impacted partitions are listed at the bottom of this notice; they include, but are not limited to, gpu, intermediate, sapphire, and hsph. A reservation to idle the nodes is already in place.

    Pending jobs in those partitions will take longer to start due to fewer available nodes during this time. Where possible, please include other partitions in your job scripts (a sketch follows below) and plan accordingly for any new or long-running jobs during that period: https://docs.rc.fas.harvard.edu/kb/running-jobs/#Slurm_partitions
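    As an illustration of listing more than one partition, here is a minimal sketch of a batch script. The partition names, resource requests, and program name below are placeholders only; substitute values appropriate to your work using the partition documentation linked above.

        #!/bin/bash
        #SBATCH --job-name=example_job            # placeholder job name
        #SBATCH --partition=shared,serial_requeue # comma-separated list; Slurm starts the job in whichever listed partition can run it first
        #SBATCH --ntasks=1
        #SBATCH --time=04:00:00                   # requested walltime
        #SBATCH --mem=4G

        # Replace with your actual application command
        ./my_program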

    Thanks for your understanding.

    FAS Research Computing

    https://www.rc.fas.harvard.edu

    https://status.rc.fas.harvard.edu

    Partitions with nodes in 8A whose capacity will be reduced during this maintenance:

    arguelles_delgado_gpu

    bigmem_intermediate

    bigmem

    blackhole_gpu

    eddy

    gershman

    gpu

    hejazi

    hernquist_ice

    hoekstra

    hsph

    huce_ice

    iaifi_gpu

    iaifi_gpu_priority

    intermediate

    itc_gpu

    joonholee

    jshapiro

    jshapiro_priority

    jshapiro_sapphire

    kempner

    kempner_dev

    kempner_h100

    kovac

    kozinsky_gpu

    kozinsky

    kozinsky_priority

    murphy_ice

    ortegahernandez_ice

    sapphire

    rivas

    seas_compute

    seas_gpu

    siag_gpu

    siag_combo

    siag

    sur

    test

    yao

    yao_priority

    zhuang

May 12, 2024

May 11, 2024

May 10, 2024

May 09, 2024
