All systems are operational

About This Site

GETTING HELP

https://docs.rc.fas.harvard.edu | https://portal.rc.fas.harvard.edu/rcrt/submit_ticket | Email: rchelp@rc.fas.harvard.edu


Status page for the Harvard FAS Research Computing cluster and other resources.

Please scroll down to see details on any Incidents or maintenance notices.

NOTICE: Annual MGHPCC downtime August 9th 6pm - August 12th 12pm

Maintenance
FASRC: MGHPCC/Holyoke Annual Power Shutdown Aug 9-12, 2021

The annual MGHPCC data center power shutdown and maintenance will occur August 9th through August 12th.
For the most up-to-date task list, see: https://www.rc.fas.harvard.edu/events/mghpcc-power-shutdown-2021/

SCHEDULE

  • Power-down will begin at 6PM on August 9th. (NOTE: some jobs will be terminated at 9am that morning due to rack shutdowns in Row 7C; see TASKS below)
  • Power will be out that night and through the following day, August 10th. Note: Boston storage will be affected on August 10th.
    Boston login and VDI will be affected for the duration of the downtime.
    See Boston Data Center note below.
  • Maintenance and network upgrades will occur on August 11th.
  • Power-up ETA and expected return to service is noon on August 12th.

While this outage impacts all services and resources in the MGHPCC/Holyoke data center, please be aware that it can also have a knock-on effect on some Boston services.

BOSTON DATA CENTER
Boston storage, login, and VDI WILL be affected on August 10th.
Any additional Boston outages will be noted on our website closer to the date.

TASKS

  • Nodes in Row 7C (Note: starts Aug 9th 9am): Jobs running on any node in the following racks will be terminated by 9am to facilitate shutting down these racks for hardware changes/cooling shutoff: holy7c16, holy7c18, holy7c20, holy7c22, holy7c24, holy7c26.
    This will impact jobs in the following partitions: arguelles_delgado, davies, edwards, fasse, geophysics, giribet, huce_cascade, huce_cascade_priority, imasc, itc_cluster, kovac, cf, ncf_interact, ncf_nrg, ortegahernandez, phelevan, shared (partial outage), test, unrestricted, xlin, zon. To check whether your own jobs are affected, see the squeue sketch after this list.
    36 new bigmem nodes (Intel Ice Lake, 64 core, 512 GB) and 18 GPU nodes (4x NVIDIA A100) will be added in this row. Cooling shutdown to these racks is necessary in order for Lenovo to install the new hardware.
  • Login and compute OS upgrades
    from CentOS 7.8.2003 to CentOS 7.9.2009
    Note: After the upgrade, SSH host keys may change and you may see a host-key warning when reconnecting. See: https://docs.rc.fas.harvard.edu/kb/ssh-key-error/ and the known_hosts sketch after this list.
  • Infiniband network upgrades
  • SLURM master replacement
  • Core and distribution equipment replacement
  • Tier 1 (Isilon) storage firmware upgrades
  • Network maintenance and upgrades: Major upgrades, replacing the 8-year-old distribution and core switches to support 2 x 100 Gbps connectivity to campus and the Internet.
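
To see whether your own jobs would be affected by the Row 7C rack shutdown, one option is a quick check from a login node using the standard Slurm client (a minimal sketch; the partition list is copied from the Row 7C task above):

    # Show your running/pending jobs in the partitions served by racks holy7c16-holy7c26.
    # Jobs shown here that are still running at 9am on August 9th will be terminated
    # (the shared partition is only partially affected).
    squeue -u $USER \
        --partition=arguelles_delgado,davies,edwards,fasse,geophysics,giribet,huce_cascade,huce_cascade_priority,imasc,itc_cluster,kovac,cf,ncf_interact,ncf_nrg,ortegahernandez,phelevan,shared,test,unrestricted,xlin,zon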

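If you see a host-key warning after the OS upgrade, the usual fix is to remove the stale entry from your local known_hosts file and accept the new key on your next login. A minimal sketch, assuming you connect through a login host such as login.rc.fas.harvard.edu (substitute whatever host name you normally use):

    # Remove the old host key for the login node from ~/.ssh/known_hosts,
    # then reconnect and accept the new key when prompted.
    ssh-keygen -R login.rc.fas.harvard.edu
    ssh <username>@login.rc.fas.harvard.edu
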
Past Incidents

9th July 2021

Holyscratch01 (Global Scratch): slowness over LNET

holyscratch01 is under heavy load, making it slow over the LNET routers (anything connecting via Ethernet or the FDR fabric, which includes the login nodes). Nodes connected via the HDR fabric (test and shared partitions) appear to be fine, though a little slow at times.
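
A rough way to gauge whether scratch is responsive from wherever you are logged in is to time a simple metadata operation. A minimal sketch, assuming the filesystem is mounted at /n/holyscratch01 (substitute your own lab directory under it if you prefer):

    # Time a directory listing on scratch; multi-second response times suggest
    # the LNET slowness described above rather than a problem with your job.
    time ls -ld /n/holyscratch01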

ncf_bigmem down

Jobs currently cannot be started on the ncf_bigmem partition.
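
To check the partition's current state before resubmitting, the standard Slurm commands from a login node should show whether its nodes are back in service; a minimal sketch:

    # Show node states for the ncf_bigmem partition (look for "down" or "drain").
    sinfo -p ncf_bigmem
    # Show jobs waiting in the partition and the reason they are pending.
    squeue -p ncf_bigmem --state=PENDING -l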

For issues not shown here, please contact FASRC via
https://portal.rc.fas.harvard.edu or email rchelp@rc.fas.harvard.edu