9th July 2018

Network - MGHPCC Holyoke Power outage in Holyoke/MGHPCC

UPDATE 6PM: Power-up at MGHPCC has completed and the scheduler is again open for jobs. Please bear in mind that power was completely out at MGHPCC, so any jobs that were running at the time will have failed.

Storage at Holyoke is back online; however, Regal may show performance issues until we can replace some failing drives. Status will be updated here.

Thank you for your patience. We will continue to work with the data center management to determine the root cause and discuss what steps can be taken to ensure it does not happen again.

UPDATE: Powering up of most Holyoke storage is complete or nearly complete. Compute and other nodes are coming up, and any lingering issues are being dealt with. Current ETA is around 5pm if all continues to go as expected. Once the nodes are back in service, the scheduler (Slurm) can be restarted. Bear in mind that power to all nodes was lost, so any jobs that were running at the time will have failed.

11:45am: The emergency power-off has tripped at MGHPCC/Holyoke, meaning the entire data center has powered off. We are awaiting more information, but please be aware that this is a major outage and will affect almost every aspect of the cluster.

