21st May 2018

MGHPCC data center annual power downtime

We are currently in the midst of the annual power maintenance at our main data center, so most of our systems are powered down.

Please refer to this link for more information: https://www.rc.fas.harvard.edu/mghpcc-shutdown-2018

The cluster should be back online by 8pm on Wednesday, May 23rd.

UPDATE 5/23 11:30AM: Everything is proceeding as planned. Please be aware that some services may appear to be up at times, but do not resume normal use until we send the all-clear email and post the all-clear status here. Many services have dependencies that must be met before they are back to normal operation.

UPDATE 5/23 5:30PM: Final preparations for re-opening the scheduler are underway. All major operations and updates have been successful, and final cleanup, startup, and triage are in progress. A final reboot of all compute and login nodes will occur before the scheduler comes back up. We foresee no delay in returning to normal operation by or before 8PM. A final update will go out here and via email once the all-clear is given by all involved staff.

UPDATE 5/23 8:00PM: The scheduler has re-opened, and any lingering nodes still catching up will re-join their respective queues. An email will go out in a moment with the following:


Odyssey3, the newly upgraded CentOS 7 cluster, is now live. This includes all login and compute nodes (except NCF and ATLAS).

One issue that will affect almost all cluster users is the change in SSH key fingerprints. You will need to clear the old entries from your local SSH known_hosts file before you connect to Odyssey3 over SSH. Please see: https://www.rc.fas.harvard.edu/resources/faq/ssh-key-error/
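
If you just want the quick fix, the commands below are a minimal sketch for a typical Linux or macOS client; the hostname is only an example, so substitute the login host you actually use (the FAQ above has the full, supported instructions):

    # Remove the cached host key for the login host you connect to
    # (hostname below is illustrative -- use your actual login host)
    ssh-keygen -R odyssey.rc.fas.harvard.edu

    # Or start fresh by moving the whole known_hosts file aside;
    # it will be repopulated as you reconnect to hosts
    mv ~/.ssh/known_hosts ~/.ssh/known_hosts.bak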

We’ve made numerous upgrades, improvements, and additions, so please take a moment to read the following. Also be aware that some compute nodes will inevitably have small issues, such as missing mounts; please let us know right away if you find one.

WHAT YOU NEED TO KNOW ABOUT ODYSSEY3:
* SSH key error fix. Many of you will experience this, and the fix is easy: https://www.rc.fas.harvard.edu/resources/faq/ssh-key-error/
* CentOS 7 transition FAQ: https://www.rc.fas.harvard.edu/resources/faq/centos-7-transition-faq/
* CentOS 7 overview: https://www.rc.fas.harvard.edu/odyssey-3-the-next-generation/
* One exception to the CentOS 7 upgrade is NX/NoMachine, which is nearing end of life. Since it cannot be upgraded, there are some things to know: https://www.rc.fas.harvard.edu/resources/access-and-login/#NX_and_CentOS7
* Additional office hours tomorrow at 38 Oxford, and next week at both 38 Oxford and HSPH: https://www.rc.fas.harvard.edu/training/office-hours/
* Tickets submitted overnight tonight will be addressed tomorrow (Thursday) once staff have had time to rest up
* New service: We now fully support Singularity containers (a quick example is sketched after this list): https://www.rc.fas.harvard.edu/resources/documentation/software/singularity-on-odyssey/
* Self-service TensorFlow via pip install (see the sketch after this list): https://www.rc.fas.harvard.edu/tensorflow-on-odyssey/
* All versions of CUDA are installed globally: https://www.rc.fas.harvard.edu/resources/documentation/gpgpu-computing-on-odyssey/
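
Singularity use can be sketched roughly as below; this is not the official Odyssey recipe, just an illustration that assumes Singularity is on your path on a compute node, with an arbitrary ubuntu image (see the Singularity documentation page above for the supported workflow):

    # Run one command inside a container pulled on the fly from Docker Hub
    # (the ubuntu:16.04 image is only an illustration)
    singularity exec docker://ubuntu:16.04 cat /etc/os-release

    # Or open an interactive shell inside the same container
    singularity shell docker://ubuntu:16.04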
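
For TensorFlow and CUDA, a rough sketch might look like the following; the module names and versions are assumptions for illustration only, not the exact modules on Odyssey (the TensorFlow and GPGPU pages above have the supported recipe):

    # Load a Python and a CUDA module (names/versions here are examples only)
    module load python/3.6.3-fasrc01 cuda/9.0-fasrc01

    # Self-service install of TensorFlow into your home directory
    pip install --user tensorflow-gpu

    # Quick sanity check
    python -c "import tensorflow as tf; print(tf.__version__)"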

WHAT WE DID DURING THIS PERIOD:
* Powered off 12 PB of storage and thousands of compute nodes at MGHPCC and brought them all back up successfully
* Upgraded 2,000 compute nodes to more modern CentOS 7 for stability and security (except NCF and ATLAS)
* Upgraded login nodes to more modern CentOS 7 for stability, security, and environmental homogeneity 
* Upgraded firmware on 2,000 InfiniBand network adapters and the core InfiniBand Director Switch to ensure homogeneity and speed
* Upgraded firmware of network switches to prepare for the future and consolidate on current revisions
* Upgraded MGHPCC network core to 100Gbit and updated firewall for faster transfers, lower latency, and future expansion
* Prepped for new 100Gbit firewall installation in MGHPCC to reduce bottlenecks and increase overall performance
* Made major network routing changes so that the 60 Oxford decommission will not require additional downtime later
* Created a streamlined, detailed downtime plan with help from the HUIT PMO, resulting in more work done in less time

UPCOMING UPDATES IN FUTURE MAINTENANCE WINDOWS:
* 100Gbit firewall at MGHPCC and a cluster of Data Transfer Nodes for faster networking and data transfers
* New web-based login through Open OnDemand. See OSC's intro video for an example: https://youtu.be/DfK7CppI-IU
* Next Regal scratch 90-day retention cleanup will happen June 3rd, 2018
* Next regular monthly maintenance day, including a Regal retention run, is Monday, July 2nd, from 7am to 11am

Thanks,
FAS Research Computing
https://www.rc.fas.harvard.edu
https://status.rc.fas.harvard.edu