FAS Research Computing - Many nodes in 8A down - affects sapphire, test, bigmem, and other partitions – Incident details

Status page for the Harvard FAS Research Computing cluster and other resources.

Cluster Utilization (VPN and FASRC login required): Cannon | FASSE


Monthly maintenance occurs on the first Monday of the month (except holidays).

GETTING HELP
https://docs.rc.fas.harvard.edu | https://portal.rc.fas.harvard.edu | Email: rchelp@rc.fas.harvard.edu



Many nodes in 8A down - affects sapphire, test, bigmem, and other partitions

Resolved
Operational
Started about 1 year ago. Lasted about 13 hours.

Affected

  • Cannon Cluster: Operational from 2:30 PM to 3:44 AM
  • SLURM Scheduler - Cannon: Operational from 2:30 PM to 3:44 AM
  • Cannon Compute Cluster (Holyoke): Operational from 2:30 PM to 3:44 AM
  • Boston Compute Nodes: Operational from 2:30 PM to 3:44 AM
  • GPU nodes (Holyoke): Operational from 2:30 PM to 3:44 AM
  • FASSE Cluster: Operational from 2:30 PM to 3:44 AM

Updates
  • Resolved
    This incident has been resolved.
  • Investigating

    We are still unable to resolve the issue with these nodes and are working with the data center facility, our networking team, and our own staff to find a solution. The affected partitions (listed in the previous update below) will remain resource-constrained and will continue to be slow to start new jobs, or unable to queue them at all.


    If you are using a partition that cannot queue new jobs, please consider adding additional partitions to your job (an example batch script is included at the end of this page): https://docs.rc.fas.harvard.edu/kb/running-jobs/#Slurm_partitions

    Also, a reminder that the data center power downtime will begin Tuesday morning, so any new jobs requesting more than 3 days of runtime will not complete before the shutdown: https://www.rc.fas.harvard.edu/blog/2024-mghpcc-power-downtime/

  • Identified

    We are still working to identify the root cause and a resolution for these downed nodes.

    Partitions with one or more affected nodes (multiple nodes are affected unless marked with "(1)"):

    arguelles_delgado_gpu (1)
    hsph
    joonholee
    jshapiro_sapphire
    lichtmandce01
    bigmem
    gpu_requeue (1)
    intermediate
    sapphire
    serial_requeue (1)
    shared (1)
    test
    yao / yao_priority

    Use 'sinfo -p [partition name]' if you wish to see the down nodes in a particular partition (an example is shown at the end of this page).

  • Investigating

    We are currently investigating this incident. An unknown outage has downed many nodes in row 8A of our data center. More information to follow.

    This includes nodes from the sapphire, test, gpu, and other partitions.
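
Examples

A minimal sketch of a batch script that lists several partitions, as suggested in the updates above. Slurm accepts a comma-separated list in --partition and will start the job on the first listed partition that can schedule it. The partition names, resource requests, and program name below are illustrative only; adjust them for your own workload.

    #!/bin/bash
    # Request several partitions; Slurm starts the job on whichever listed
    # partition can schedule it first.
    #SBATCH --job-name=example_job
    #SBATCH --partition=sapphire,shared,serial_requeue   # comma-separated partition list
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=8G
    #SBATCH --time=0-12:00   # keep the time request short ahead of the Tuesday power downtime

    ./my_program             # hypothetical placeholder; replace with your actual command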
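To see which nodes are down in a particular partition, sinfo can filter by partition and by node state. The partition name below is only an example; substitute any of the partitions listed above.

    # Overall node states in the sapphire partition
    sinfo -p sapphire

    # Only nodes that are currently down or drained in that partition
    sinfo -p sapphire -t down,drained

    # Down/drained nodes cluster-wide, with the reason Slurm recorded for each
    sinfo -R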