22nd June 2018

Login Nodes Power outage at Boston data center

UPDATE: (6/23/18 3:10PM) - The all-clear email has gone out. Addendum: molspace is online. The contents follow:

Dear FASRC users,

Thank you for your patience. Power has been restored to our floor of the Boston data center and we have successfully powered up the vast majority of systems there. We have now re-opened the login nodes and scheduler to allow jobs to resume and new jobs to be submitted.

Please be aware that some jobs may have failed before the scheduler was paused on Friday, especially if they relied on storage that suddenly disappeared (home directories, lab storage, etc.). So far we're seeing very few failures, but there will undoubtedly be some.

The remaining storage systems that are not yet online are molspace [now up] and rcss6, which we are actively working on. Additionally, keithfs1 and bicepfs3 are down because of hardware failures and require mainboard replacements. If your storage appears unavailable to you and is not one of those listed above, please let us know.

We will continue to update the status page at http://fasrc.us with any additional information.

PS - Please note that this list will be closed again and will not accept replies. Please contact us as you normally would from here on out.

Thank you! FAS Research Computing https://www.rc.fas.harvard.edu

UPDATE: (6/23/18 2:15) - We are planning for a 3pm start of the cluster, login nodes, and most, if not all, storage and partitions. If that plan holds, an all-clear email will be sent out at that time. Check here for any update or change to that plan. FYI: Currently the following filesystems may require additional time or service: molspace, hutlab12, rcss6, as well as keithfs1 and bicepfs3 (which have hardware failures). We will also update this list at the all-clear. (list updated 2:550pm)

UPDATE: (6/23/18 1:15) Powerup is proceeding as expected. Given that many systems lost power abruptly, we are being cautious and ensuring that systems, especially storage, come up as cleanly as possible and that dependencies (mounts, etc.) are restored in order. It is crucial that we ensure lab storage and home directories are properly mounting across all data centers before an all-clear or restart of the scheduler. Next update at or before 2:15PM.

UPDATE: (6/23/18 12:05) Powerup has begun, but will take some time to fully complete and test. We are also being cautious about one leg of power that is being fed from an alternate source, and we want to confirm its stability before giving an all-clear once all systems are operational. Next update at 1:15PM.

UPDATE: (6/23/18 11am) Power is being restored. ETA for both legs is now (11am). RC staff is on-site to facilitate powering up all our systems in an orderly fashion. Please note that jobs already running (the scheduler and compute being in Holyoke) were paused last night in hopes that they can be resumed after power is restored. Because many jobs access home directories, some of them may have failed before we could pause everything. Once the all-clear email goes out and/or this page indicates we are back up, you will be able to check on the status of your jobs.
Next update at noon.

UPDATE: (6/22/18) Still no ETA. The incident involved the potential for fire in a UPS (power supply) room, which activated sprinklers in one or more UPS rooms. Restoring them involves drying the rooms and potentially re-routing power. We have no estimate of how long that might take. Please note that no RC assets were ever in danger of fire or water damage. This incident is limited to the building's power systems on our floor and at least one other floor.

MAJOR DATA CENTER OUTAGE

A power incident at our Boston data center (approximately 5:30pm 6/22/18) has resulted in a full power loss for services housed on our floor of the data center. This impacts the FASRC cluster (login nodes, home directories, lab storage, etc.) and also limits our normal ability to communicate with you, since our ticket system, status page, website, and listserv all reside there.

We are working with the data center management, but have no ETA for when power will be restored. Please let your staff and researchers know.

Thank you for your understanding, FAS Research Computing rchelp@rc.fas.harvard.edu

For issues not shown here, please contact FASRC via
https://portal.rc.fas.harvard.edu or email rchelp@rc.fas.harvard.edu