Experiencing partially degraded performance

Status page for the Harvard FAS Research Computing cluster and other resources.

Cluster Utilization (VPN and FASRC login required): Cannon | FASSE | Academic


Please scroll down to see details on any Incidents or maintenance notices.
Monthly maintenance occurs on the first Monday of the month (except holidays).

GETTING HELP
https://docs.rc.fas.harvard.edu | https://portal.rc.fas.harvard.edu | Email: rchelp@rc.fas.harvard.edu


The colors shown in the bars below were chosen to increase visibility for color-blind visitors.
If the background is dark and colors appear muted, switch to light mode at the bottom of this page for higher contrast.

Annual MGHPCC/Holyoke data center power downtime - May 21-24 2024
Scheduled for May 22, 2024 at 1:00 AM – May 24, 2024 at 9:00 PM (3 days)
  • Planned
    May 22, 2024 at 1:00 AM

    The 2024 MGHPCC data center annual power downtime will take place May 21-24, 2024.

    We will begin our shutdown on Tuesday May 21st and expect a return to service by 5PM Friday May 24th.

    - Jobs: Please plan ahead; any jobs still running on the morning of May 21st will be stopped and canceled and will need to be resubmitted after the downtime. Pending jobs will remain in the queue until the cluster returns to regular service on May 24th. (A quick way to check your current jobs is sketched below this list.)

    - Access: The cluster, scheduler, login, and Open OnDemand (OOD) nodes will be unavailable for the duration of the downtime. New lab and account requests should wait until after the downtime.

    - Storage: All Holyoke storage will be powered down and unavailable for the duration of the downtime. Boston storage will remain online, but network changes during the downtime may briefly affect your ability to access it.
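    As a quick way to check which of your jobs fall into these categories, here is a minimal sketch using standard Slurm client commands from a login node (your username is taken from the $USER environment variable; adjust as needed):

        # Jobs still running, which will be stopped and canceled when the downtime begins
        squeue -u $USER -t RUNNING

        # Jobs still pending, which will remain in the queue until the cluster returns to service
        squeue -u $USER -t PENDING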

    Further details, an explanation of this year's change in scheduling, a visual timeline, and an overview of the maintenance tasks can be found at:

    https://www.rc.fas.harvard.edu/blog/2024-mghpcc-power-downtime/

    Progress of the downtime will be posted here on our status page during the event. You can subscribe to receive updates as they happen by clicking Get Updates in the upper right.

    MAJOR TASK OVERVIEW

    • OS upgrade to Rocky 8.9 - point upgrade; no code rebuilds will be required. Switch from the system OFED to Mellanox OFED on nodes for improved performance.

    • InfiniBand (network) upgrades

    • BIOS updates (various)

    • Storage firmware updates

    • Network maintenance

    • Decommission old nodes (targets contacted)

    • Additional minor one-off updates and maintenance (cable swap, reboots, etc.)

    Thanks,

    FAS Research Computing

    https://www.rc.fas.harvard.edu/

    https://docs.rc.fas.harvard.edu/

    https://status.rc.fas.harvard.edu/

Cannon Cluster

Operational

SLURM Scheduler - Cannon

Operational

Cannon Compute Cluster (Holyoke)

Operational

Boston Compute Nodes

Operational

GPU nodes (Holyoke)

Operational

SEAS compute partition

Operational

FASSE Cluster

Operational

SLURM Scheduler - FASSE

Operational

FASSE Compute Cluster (Holyoke)

Operational

Kempner Cluster

Operational

Kempner Cluster CPU

Operational

Kempner Cluster GPU

Operational

Login Nodes

Operational

Login Nodes - Boston

Operational

Login Nodes - Holyoke

Operational

FASSE login nodes

Operational

VDI/OpenOnDemand

Operational

Cannon VDI (Open OnDemand)

Operational

FASSE VDI (Open OnDemand)

Operational

Storage

Degraded performance

Holyscratch01 (Global Scratch)

Degraded performance

Home Directory Storage - Boston

Operational

HolyLFS03 (Tier 0)

Operational

HolyLFS04 (Tier 0)

Operational

HolyLFS05 (Tier 0)

Operational

Holystore01 (Tier 0)

Operational

Holylabs

Operational

BosLFS02 (Tier 0)

Operational

Isilon Storage Boston (Tier 1)

Operational

Isilon Storage Holyoke (Tier 1)

Operational

CEPH Storage Boston (Tier 2)

Operational

Tape - (Tier 3)

Operational

Boston Specialty Storage

Operational

Holyoke Specialty Storage

Degraded performance

Samba Cluster

Operational

Globus Data Transfer

Operational

bosECS

Operational

holECS

Operational

Recent notices

May 15, 2024

May 14, 2024

May 13, 2024

Critical power supply work at MGHPCC May 13th - A subset of Cannon nodes will be idled
  • Completed
    May 13, 2024 at 7:15 PM

    This work has been completed and all 8A nodes are back in service.

  • In progress
    May 13, 2024 at 12:00 PM
    Maintenance is now in progress
  • Planned
    May 13, 2024 at 12:00 PM

    WHAT: Some nodes in row 8A will be idled at MGHPCC

    WHEN: May 13th 8am-4pm

    To avoid a future over-capacity situation, MGHPCC will be performing power supply work on May 13th from 8am-4pm. This includes sections of row 8A where some of our Cannon compute nodes are located. We will be idling half the nodes in pod 8A to allow the necessary power work.

    Unfortunately this work cannot be done during our upcoming outage. It is dictated by the availability of electricians and other resources outside the facility's control; otherwise it would have been included in the May 21-24 downtime.

    IMPACT

    Half the nodes in racks 8a22, 8a28, 8a30, and 8a32 will be down (~114 nodes, around 7% of total capacity). The work will also enable us to add more capacity for future purchases, so it is important that we allow this interruption.

    Impacted partitions are listed at the bottom of this notice; they include, but are not limited to, gpu, intermediate, sapphire, and hsph. A reservation to idle the nodes is already in place.

    Pending jobs in those partitions will take longer to start due to fewer available nodes during this time. Where possible, please include other partitions in your job scripts (a sketch follows below) and plan accordingly for any new or long-running jobs during that period: https://docs.rc.fas.harvard.edu/kb/running-jobs/#Slurm_partitions
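    As an illustration of listing more than one partition, here is a minimal sketch of a batch script. The partition names, resource requests, and program name below are placeholders only; substitute values appropriate to your work using the partition documentation linked above.

        #!/bin/bash
        #SBATCH --job-name=example_job            # placeholder job name
        #SBATCH --partition=shared,serial_requeue # comma-separated list; Slurm starts the job in whichever listed partition can run it first
        #SBATCH --ntasks=1
        #SBATCH --time=04:00:00                   # requested walltime
        #SBATCH --mem=4G

        # Replace with your actual application command
        ./my_program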

    Thanks for your understanding.

    FAS Research Computing

    https://www.rc.fas.harvard.edu

    https://status.rc.fas.harvard.edu

    Partitions with nodes in 8A whose capacity will be reduced during this maintenance:

    arguelles_delgado_gpu

    bigmem_intermediate

    bigmem

    blackhole_gpu

    eddy

    gershman

    gpu

    hejazi

    hernquist_ice

    hoekstra

    hsph

    huce_ice

    iaifi_gpu

    iaifi_gpu_priority

    intermediate

    itc_gpu

    joonholee

    jshapiro

    jshapiro_priority

    jshapiro_sapphire

    kempner

    kempner_dev

    kempner_h100

    kovac

    kozinsky_gpu

    kozinsky

    kozinsky_priority

    murphy_ice

    ortegahernandez_ice

    sapphire

    rivas

    seas_compute

    seas_gpu

    siag_gpu

    siag_combo

    siag

    sur

    test

    yao

    yao_priority

    zhuang

May 12, 2024

May 11, 2024

May 10, 2024

May 09, 2024
