Under maintenance from 10:00 PM to 1:00 PM
- Completed: August 18, 2021 at 1:00 PM
Maintenance has ended.
- Update (In progress): August 13, 2021 at 6:56 PM
GPU upgrade fix identified and in progress. Will require time for changes and reboots to take effect.
See the incidents on the main Status Page for GPU and scratch, the two main lingering issues.
- Update (In progress): August 13, 2021 at 2:23 PM
We are continuing to restore compute nodes (and partitions) which did not come up cleanly.
GPU partitions are still unavailable and are the highest priority right now.
VDI remains unavailable until the GPU queues are working.
$scratch (/n/holyscratch01) is still in a degraded state and replacement hardware should arrive soon.
Next update at noon
- Update (In progress): August 13, 2021 at 5:12 AM
Thank you for your patience.
The main queues and their compute are coming online and the scheduler is open. However, please be aware that GPU queues require additional time and work and will not initially be available.
Some individual and owned nodes/queues may require additional attention in the morning. Please bear with us if your queue is not available or pending. We will work through these as quickly as possible.
We will post additional updates here in the morning.
HOLYSCRATCH01
One lingering issue is a hardware failure in one of the scratch storage subsystems. This will cause some files on $SCRATCH (/n/holyscratch01) to show as missing (you may see files marked ???). Dell has dispatched replacement hardware, and once it is in place all data will be visible again. No data will be lost; a portion of scratch's physical storage simply cannot be accessed by the server at the moment. If you find your scratch space is affected, you can continue to write new data, and the missing files will eventually become available. As always, you can also use local temporary scratch if you wish (see the sketch below): https://docs.rc.fas.harvard.edu/kb/cluster-storage/#LocalpernodeSharedScratch_Storage
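For users who want to shift work onto node-local scratch in the meantime, here is a minimal, unofficial Python sketch of the pattern: write intermediate output to a private directory on the local disk, then copy only the final results back to networked scratch. The /scratch mount point, the SCRATCH environment variable fallback, and the my_lab/my_user destination are illustrative assumptions; check the linked cluster-storage documentation for the actual local-scratch path on your nodes.

```python
import os
import shutil
import tempfile

# Assumption: node-local scratch is mounted at /scratch; confirm the real
# per-node path in the linked FASRC cluster-storage documentation.
LOCAL_SCRATCH_ROOT = "/scratch"

# Networked scratch ($SCRATCH, /n/holyscratch01), currently degraded but writable.
network_scratch = os.environ.get("SCRATCH", "/n/holyscratch01")

# Create a private working directory on the node-local disk.
workdir = tempfile.mkdtemp(prefix="myjob_", dir=LOCAL_SCRATCH_ROOT)

try:
    # Write intermediate output locally instead of to the degraded share.
    result_path = os.path.join(workdir, "results.txt")
    with open(result_path, "w") as fh:
        fh.write("intermediate results go here\n")

    # Copy only the final output back to networked scratch.
    # "my_lab/my_user" is a placeholder; use your own directory layout.
    dest_dir = os.path.join(network_scratch, "my_lab", "my_user")
    os.makedirs(dest_dir, exist_ok=True)
    shutil.copy2(result_path, dest_dir)
finally:
    # Node-local scratch is temporary; clean up after the job finishes.
    shutil.rmtree(workdir, ignore_errors=True)
```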
- Update (In progress): August 13, 2021 at 1:07 AM
ETA update. We are still experiencing issues which keep compute and the scheduler from being returned to service.
ETA unknown at this point. We are actively working the issue and will add another update when we have more.
- Update (In progress): August 12, 2021 at 4:17 PM
Due to blocking issues that are preventing us from bringing compute, some storage, and the scheduler back online, our return to normal operation will be delayed.
**Current ETA for full return to normal is 5PM.**
UPDATE ~8PM: See email. We appreciate your understanding.
Some components are back in service already, but please be cautious using them.
For more details please see our (new*) status page: https://status.rc.fas.harvard.edu
Thanks for your understanding,
FAS Research Computing
https://status.rc.fas.harvard.edu
*New Status Page: We have deployed a new, more fully featured status page. You will see a change in appearance, a reorganization of component groups, and added functionality, including the ability to subscribe to notifications of maintenance events and to subscribe to updates on all components or just the components that most concern you. Additionally, we can now provide better updates during events and automate some components.
- Update (In progress): August 12, 2021 at 2:36 PM
Blocking issues in storage, compute, and virtual infrastructure. New ETA 1:00PM
- Update (In progress): August 12, 2021 at 11:00 AM
Power-up sequence has begun.
- Update (In progress): August 11, 2021 at 1:00 PM
Network maintenance is now in progress.
- In progress: August 09, 2021 at 10:00 PM
FASRC: MGHPCC/Holyoke Annual Power Shutdown Aug 9-12, 2021
The annual MGHPCC data center power shutdown and maintenance will occur August 9th through August 12th. For the most up-to-date task list, see: https://www.rc.fas.harvard.edu/events/mghpcc-power-shutdown-2021/
SCHEDULE
- Power-down will begin at 6PM on August 9th. (NOTE: some jobs will be terminated at 9am due to rack shutdowns in 7C, see TASKS below)
- Power will be out that night and through the following day, August 10th. Note: Boston storage will be affected on August 10th. Boston login and VDI will be affected for the duration of the downtime. See Boston Data Center note below.
- Maintenance and network upgrades will occur on August 11th.
- Power-up ETA and expected return to service is noon on August 12th.
While this outage impacts all services and resources in the MGHPCC/Holyoke data center, please be aware that this can have a knock-on effect for some Boston services as well.
BOSTON DATA CENTER
Boston storage, login, and VDI WILL be affected on August 10th. Any additional Boston outages will be noted on our website closer to the date.
TASKS
- Nodes in Row 7C (Note: starts Aug 9th at 9AM): Jobs running on any node in the following racks will be terminated by 9AM to facilitate shutting down these racks for hardware changes and cooling shutoff: holy7c16, holy7c18, holy7c20, holy7c22, holy7c24, holy7c26. This will impact jobs in the following partitions: arguellesdelgado, davies, edwards, fasse, geophysics, giribet, hucecascade, hucecascadepriority, imasc, itccluster, kovac, ncf, ncfinteract, ncf_nrg, ortegahernandez, phelevan, shared (partial outage), test, unrestricted, xlin, zon. 36 new bigmem nodes (Intel Ice Lake, 64 cores, 512 GB) and 18 GPU nodes (4x NVIDIA A100) will be added in this row. The cooling shutdown for these racks is necessary so that Lenovo can install the new hardware.
- Login and compute OS upgrades from CentOS 7.8.2003 to CentOS 7.9.2009. Note: After the upgrade, SSH keys may change; see https://docs.rc.fas.harvard.edu/kb/ssh-key-error/ (an illustrative cleanup sketch follows after this list).
- Infiniband network upgrades
- SLURM master replacement
- Core and distribution equipment replacement
- Tier 1 (Isilon) storage firmware upgrades
- Network maintenance and upgrades: Major upgrades, replacing the 8-year-old distribution and core switches to support 2 x 100Gbps connectivity to campus and the Internet.
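Regarding the SSH key note in the OS upgrade task above: if your client reports a changed host key after the upgrade, the usual fix is to remove the stale entry from ~/.ssh/known_hosts (the linked KB article covers this), which the standard `ssh-keygen -R <hostname>` command does in one step. As an illustration only, the Python sketch below performs the same filtering by hand. The hostname shown is an example placeholder, and hashed known_hosts entries will not match by name, so prefer ssh-keygen -R in that case.

```python
import os
import shutil

# Example placeholder: replace with the host that triggers the key-change warning.
HOST = "login.rc.fas.harvard.edu"

known_hosts = os.path.expanduser("~/.ssh/known_hosts")
backup = known_hosts + ".bak"

# Keep a backup before modifying the file.
shutil.copy2(known_hosts, backup)

with open(known_hosts) as fh:
    lines = fh.readlines()

# Each known_hosts line begins with a comma-separated host field; drop the
# entries that name the upgraded host. (Hashed entries will not match by
# name -- use `ssh-keygen -R <host>` for those.)
kept = []
for line in lines:
    fields = line.split(None, 1)
    host_field = fields[0] if fields else ""
    if HOST in host_field:
        continue  # stale key for the upgraded host
    kept.append(line)

with open(known_hosts, "w") as fh:
    fh.writelines(kept)

print(f"Removed {len(lines) - len(kept)} entries for {HOST}; backup saved to {backup}")
```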