Critical power supply work at MGHPCC May 13th - A subset of Cannon nodes will be idled

Completed
Scheduled for May 13, 2024 at 12:00 PM – 7:15 PM

Affects

Cannon Cluster
Cannon Compute Cluster (Holyoke)
GPU nodes (Holyoke)
FASSE Cluster
FASSE Compute Cluster (Holyoke)
Updates
  • Completed
    May 13, 2024 at 7:15 PM

    This work has been completed and all 8A nodes are back in service.

  • In progress
    May 13, 2024 at 12:00 PM

    Maintenance is now in progress.

  • Planned
    May 13, 2024 at 12:00 PM

    WHAT: Some nodes in row 8A will be idled at MGHPCC

    WHEN: May 13th 8am-4pm

    To avoid a future over-capacity situation, MGHPCC will be performing power supply work on May 13th from 8am to 4pm. This includes sections of row 8A, where some of our Cannon compute nodes are located. We will be idling half the nodes in pod 8A to allow the necessary power work.

    Unfortunately, this cannot be done during our upcoming outage. The work is dictated by the availability of electricians and other resources outside the facility's control; otherwise it would have been included in the May 21-24 downtime.

    IMPACT

    Half the nodes in racks 8a22, 8a28, 8a30, and 8a32 will be down (only ~114 nodes, around 7% of total capacity). This work will also enable us to add more capacity for future purchases, so it is important that we allow this interruption.

    Impacted partitions are listed at the bottom of this notice; they include, but are not limited to, gpu, intermediate, sapphire, and hsph. A reservation to idle the affected nodes is already in place.
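
    If you want to see how this affects a partition you use, standard Slurm queries will show the reservation and the node states. A minimal sketch, using gpu as an example partition from the list below (reserved or maintenance nodes typically report a state of resv or maint):

        # List active and upcoming reservations on the cluster
        scontrol show reservation

        # Summarize node counts and states for an impacted partition
        sinfo -p gpu

        # Show only the nodes currently held back by a reservation
        sinfo -p gpu --states=reserved,maint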

    Pending jobs in those partitions will take longer to start due to fewer available nodes during this time. Where possible, please use or include other partitions in your job scripts and plan accordingly for any new or long-running jobs during that period: https://docs.rc.fas.harvard.edu/kb/running-jobs/#Slurm_partitions
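
    One way to include other partitions is to give Slurm a comma-separated list: sbatch will start the job in whichever listed partition can run it first. A minimal sketch, with placeholder resource requests and a placeholder second partition (substitute partitions your account can actually submit to):

        #!/bin/bash
        #SBATCH --job-name=myjob                        # placeholder job name
        #SBATCH --partition=sapphire,other_partition    # alternatives; "other_partition" is a placeholder
        #SBATCH --ntasks=1
        #SBATCH --cpus-per-task=4                       # placeholder resource requests
        #SBATCH --mem=8G
        #SBATCH --time=02:00:00

        # your commands here

    Listing more than one partition lets the scheduler pick whichever has free nodes first, which is especially helpful while part of a partition is idled by a reservation.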

    Thanks for your understanding.

    FAS Research Computing

    https://www.rc.fas.harvard.edu

    https://status.rc.fas.harvard.edu

    Partitions with nodes in 8A whose capacity will be reduced during this maintenance:

    arguelles_delgado_gpu
    bigmem_intermediate
    bigmem
    blackhole_gpu
    eddy
    gershman
    gpu
    hejazi
    hernquist_ice
    hoekstra
    hsph
    huce_ice
    iaifi_gpu
    iaifi_gpu_priority
    intermediate
    itc_gpu
    joonholee
    jshapiro
    jshapiro_priority
    jshapiro_sapphire
    kempner
    kempner_dev
    kempner_h100
    kovac
    kozinsky_gpu
    kozinsky
    kozinsky_priority
    murphy_ice
    ortegahernandez_ice
    sapphire
    rivas
    seas_compute
    seas_gpu
    siag_gpu
    siag_combo
    siag
    sur
    test
    yao
    yao_priority
    zhuang