4th June 2019

MGHPCC power shutdown
  • UPDATE 6/12 8:50PM: All maintenance complete, we are load testing scratchlfs for the next hour and will then open partitions in SLURM.

  • UPDATE 6/12 7:55PM: Maintenance on scratchlfs is nearly complete and we expect to be able to return the cluster to operation very shortly. We will update this page and send an all-clear email as soon as that is the case.

  • UPDATE 6/12 5PM: We are commencing final compute upgrades and initial power-up. However, we will remain in a holding pattern, as we do not have a firm ETA from the scratchlfs vendor on when their work will complete. We will update again at 8PM with either an all-clear or, if scratchlfs is not yet online, an updated ETA and an email to all users.

  • UPDATE 6/12 2PM: All internal maintenance is progressing as expected and on time. Completion of the vendor's work on scratchlfs is currently the only unknown. Next update at 5PM.

  • UPDATE 6/11 1:35PM: Lab share moves - dulacfs2, osheafs1, hoekstrafs4, and tzipermanfs2 have been moved and are back online.

Each year our primary data center, MGHPCC (Holyoke), performs a full power shutdown for electrical maintenance. This requires us to power down all FASRC systems at MGHPCC starting the evening before. This includes all compute and many storage systems. Some systems housed at our Boston data center may also be affected.

This period also gives us a window to fit in maintenance that would otherwise require shutting off various resources during normal operations. Note that this power event will mean the termination of all running jobs, as power to the entire facility will be out. Jobs cannot be suspended and resumed because nodes will be powered off.

NOTE: Shutdown of our systems prior to MGHPCC power-off begins Monday 6/10/19 6PM. Return to normal operation ETA June 12th 8PM.

SCHEDULE

  • June 10th - Evening Before, 6PM: All running jobs will be terminated and we will begin powering down all devices at MGHPCC/Holyoke.
  • June 11th - Day Of: Power will be OUT the entire day as MGHPCC performs their work. This will affect nearly the entire FASRC environment.
  • June 12th - Following Day: We will perform yearly maintenance tasks after power-up begins. We expect to be back to normal operations by approximately 8PM. Please note there will be NO office hours on Tuesday 6/11 (HCSPH) or Wednesday 6/12 (Main Campus).


WHAT IS AFFECTED

  • All resources in Holyoke will be affected for the duration of the event. This includes the compute cluster, scheduler, scratchlfs, storage, and other devices housed at MGHPCC/Holyoke.
  • Resources in Boston and Cambridge, including storage and network, will also be affected during yearly maintenance work. Please plan accordingly as all resources will be affected at some point during this event.
  • Software modules: IMPORTANT - SEE BELOW
  • NO office hours on Wednesday 6/12
  • The help ticket system will be updated on 6/11 and will be down periodically that day. Please check https://status.rc.fas.harvard.edu for its status on 6/11.

SOFTWARE MODULES - !!! IF YOU SUBMIT JOBS, PLEASE READ !!!

After June 12th, EasyBuild will be added to all user environments. This requires your attention: your job scripts may fail if module calls do not use full module names.

For best interoperability of EasyBuild-based modules with existing software modules, please use complete module names and versions to ensure the correct software modules are loaded in your user environment. Example: module load intel/17.0.4-fasrc01

If you are currently using "module load intel", it will load intel from the EasyBuild space and break your workflow.
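
For example, a minimal SLURM job script that pins full module names might look like the sketch below. The partition, time limit, and program name are illustrative placeholders, not FASRC defaults; only the module name comes from the example above.

    #!/bin/bash
    #SBATCH -p shared            # hypothetical partition; substitute your own
    #SBATCH -t 0-01:00           # 1 hour run time
    #SBATCH --mem=4G             # 4 GB of memory

    # Load the module by full name and version so the intended build
    # is selected, not whichever "intel" resolves first after 6/12.
    module load intel/17.0.4-fasrc01

    ./my_program                 # placeholder for your executable

Before pinning a name in your scripts, running "module avail intel" will list the versions available in your environment.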

COMPUTE OS UPDATE

During this event, once basic power is available to us, we will also upgrade all compute nodes to the latest CentOS minor release. No user-facing impact is expected after the upgrade.
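
If you would like to confirm which release a node is running after the upgrade, the standard release file is a quick check (a generic CentOS command, nothing FASRC-specific):

    # Print the installed release string, e.g. "CentOS Linux release 7.x ..."
    cat /etc/centos-release
    # The kernel version also changes with a minor OS update
    uname -r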

OTHER TASKS

  • LNET router rebuild and Lustre upgrade - Lustre (LFS) filesystems affected across the board
  • InfiniBand (fiber networking) updates
  • OS update on all compute nodes
  • Add EasyBuild modules to the default module path
  • CUDA driver upgrades, DGX-1 firmware updates
  • Physical move of several servers - Transparent to users once complete
  • Firewall upgrade - Network affected, transparent to users once complete
  • Re-cabling of various storage systems

RETURN TO NORMAL OPERATIONS

We will notify the community via our users email list when we are back to normal operations.

We will reflect the current status on our status page: https://status.rc.fas.harvard.edu

For details, see: https://www.rc.fas.harvard.edu/mghpcc-shutdown-2019

FAS Research Computing https://www.rc.fas.harvard.edu https://status.rc.fas.harvard.edu