FAS Research Computing - Many nodes in 8A down - affects sapphire, test, bigmem, and other partitions – Incident details

Status page for the Harvard FAS Research Computing cluster and other resources.

Cluster Utilization (VPN and FASRC login required): Cannon | FASSE


Monthly maintenance occurs on the first Monday of the month (except holidays).

GETTING HELP
https://docs.rc.fas.harvard.edu | https://portal.rc.fas.harvard.edu | Email: rchelp@rc.fas.harvard.edu



Many nodes in 8A down - affects sapphire, test, bigmem, and other partitions

Resolved
Operational
Started about 1 year ago. Lasted about 13 hours.

Affected

  • Cannon Cluster: Operational from 2:30 PM to 3:44 AM
  • SLURM Scheduler - Cannon: Operational from 2:30 PM to 3:44 AM
  • Cannon Compute Cluster (Holyoke): Operational from 2:30 PM to 3:44 AM
  • Boston Compute Nodes: Operational from 2:30 PM to 3:44 AM
  • GPU nodes (Holyoke): Operational from 2:30 PM to 3:44 AM
  • FASSE Cluster: Operational from 2:30 PM to 3:44 AM

Updates
  • Resolved
    This incident has been resolved.
  • Investigating

    We are still unable to resolve the issue with these nodes and are working with the data center facility, our networking team, and our own staff to find a solution. The affected partitions (listed in the previous update below) will remain resource-constrained and will continue to be slow to start new jobs, or unable to queue them at all.


    If you are using a partition that cannot queue new jobs, please consider adding additional partitions to your job (an example batch script is included at the end of this page): https://docs.rc.fas.harvard.edu/kb/running-jobs/#Slurm_partitions

    Also, a reminder that the data center power downtime will begin Tuesday morning, so any new jobs requesting more than 3 days of runtime will not complete before the shutdown: https://www.rc.fas.harvard.edu/blog/2024-mghpcc-power-downtime/

  • Identified

    We are still working to identify the root cause and a resolution for these downed nodes.

    Partitions with one or more affected nodes (multiple nodes are affected unless marked with "(1)"):

    arguelles_delgado_gpu (1)
    hsph
    joonholee
    jshapiro_sapphire
    lichtmandce01
    bigmem
    gpu_requeue (1)
    intermediate
    sapphire
    serial_requeue (1)
    shared (1)
    test
    yao / yao_priority

    Use 'sinfo -p [partition name]' if you wish to see the down nodes in a particular partition (an example is shown at the end of this page).

  • Investigating

    We are currently investigating this incident. An unknown outage has downed many nodes in row 8A of our data center. More information to follow.

    This includes nodes from the sapphire, test, gpu, and other partitions.
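
Examples

A minimal sketch of a batch script that lists several partitions, as suggested in the updates above. Slurm accepts a comma-separated list in --partition and will start the job on the first listed partition that can schedule it. The partition names, resource requests, and program name below are illustrative only; adjust them for your own workload.

    #!/bin/bash
    # Request several partitions; Slurm starts the job on whichever listed
    # partition can schedule it first.
    #SBATCH --job-name=example_job
    #SBATCH --partition=sapphire,shared,serial_requeue   # comma-separated partition list
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=8G
    #SBATCH --time=0-12:00   # keep the time request short ahead of the Tuesday power downtime

    ./my_program             # hypothetical placeholder; replace with your actual command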
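To see which nodes are down in a particular partition, sinfo can filter by partition and by node state. The partition name below is only an example; substitute any of the partitions listed above.

    # Overall node states in the sapphire partition
    sinfo -p sapphire

    # Only nodes that are currently down or drained in that partition
    sinfo -p sapphire -t down,drained

    # Down/drained nodes cluster-wide, with the reason Slurm recorded for each
    sinfo -R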