FAS Research Computing - Notice history

Status page for the Harvard FAS Research Computing cluster and other resources.

Cluster Utilization (VPN and FASRC login required): Cannon | FASSE | Academic


Please scroll down to see details on any Incidents or maintenance notices.
Monthly maintenance occurs on the first Monday of the month (except holidays).

GETTING HELP
https://docs.rc.fas.harvard.edu | https://portal.rc.fas.harvard.edu | Email: rchelp@rc.fas.harvard.edu


SLURM Scheduler - Cannon - Operational

Cannon Compute Cluster (Holyoke) - Operational

Boston Compute Nodes - Operational

GPU nodes (Holyoke) - Operational

seas_compute - Operational

SLURM Scheduler - FASSE - Operational

FASSE Compute Cluster (Holyoke) - Operational

Kempner Cluster CPU - Operational

Kempner Cluster GPU - Operational

Login Nodes - Boston - Operational

Login Nodes - Holyoke - Operational

FASSE login nodes - Operational

Cannon Open OnDemand/VDI - Operational

FASSE Open OnDemand/VDI - Operational

Netscratch (Global Scratch) - Operational

Holyscratch01 (Pending Retirement) - Operational

Home Directory Storage - Boston - Operational

HolyLFS06 (Tier 0) - Operational

HolyLFS04 (Tier 0) - Operational

HolyLFS05 (Tier 0) - Operational

Holystore01 (Tier 0) - Operational

Holylabs - Operational

BosLFS02 (Tier 0) - Operational

Isilon Storage Boston (Tier 1) - Operational

Isilon Storage Holyoke (Tier 1) - Operational

CEPH Storage Boston (Tier 2) - Operational

Tape (Tier 3) - Operational

Boston Specialty Storage - Operational

Holyoke Specialty Storage - Operational

Samba Cluster - Operational

Globus Data Transfer - Operational

bosECS - Operational

holECS - Operational

Notice history

Nov 2024

Some nodes in drain state
  • Resolved

    The remaining number of affected nodes is very low and well below our incident threshold (at any time, there will be some number of nodes out of commission on the cluster).

    We are closing this incident. Any remaining nodes will be brought back into service as they are assessed and cleared up. Some require reboots but must be fully drained of jobs first.

    Thanks,

    FAS Research Computing

  • Monitoring

    More nodes have been re-opened. Please note that this incident affected less than 10% of our total nodes, so many users will not have noticed any issues.

    The root cause appears to be the holyscratch01 issues; failing over a hung OST helped.

    Please move off of holyscratch01 as soon as possible and begin using the new netscratch instead. holyscratch01 will become read-only on Dec. 2nd, 2024.

    https://www.rc.fas.harvard.edu/blog/announcing-netscratch/

  • Update

    Please note that this also affects other nodes and is not limited to the test partition. We are working to return all nodes to service, but on many nodes we need to let running jobs finish and draining complete first.

  • Update

    Some test nodes re-opened. Working on remainder. No ETA.

  • Identified
    Many (if not most) nodes in the test partition are in a drain state. We are investigating and re-opening nodes as we can. More details to follow. (A command for checking node state is shown after this notice.)
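
    For reference, standard SLURM queries can show whether a partition still has drained nodes; the "test" partition name comes from this notice, and output details may vary with the SLURM version in use:

        sinfo -R -p test        # list drained/down nodes in the test partition with the admin-set reason
        sinfo -p test -t drain  # show only nodes currently in a drain state
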
holyscratch01 degraded - storage brick failover
  • Resolved
    Failover has completed.
  • Identified

    The storage brick OST2b needs to be failed over again; this work is in progress.

    Please move to the new Netscratch as soon as possible (a sample migration command is sketched after this notice): https://www.rc.fas.harvard.edu/blog/announcing-netscratch/
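
    As a rough sketch of that migration, a single rsync pass can copy a lab directory from the old scratch to the new one. The paths below are placeholders (assumed to follow a /n/holyscratch01/<lab> and /n/netscratch/<lab> layout) and should be adjusted to your actual lab and user directories:

        rsync -av --progress /n/holyscratch01/your_lab/your_user/ /n/netscratch/your_lab/your_user/

    Re-running the same command later transfers only files that have changed, so the copy can be repeated shortly before holyscratch01 goes read-only.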

Oct 2024

Sep 2024

Virtual Machine hypervisor down - Affects FASSE login/OOD
  • Resolved
    Resolving this incident. The hypervisor and all but one VM, which has a separate issue, are operational.
  • Update

    FASSE Open OnDemand and FASSE login services should be operational now.

  • Monitoring

    FASSE OOD is back up

    FASSE login nodes are still down

  • Identified

    One of the hypervisors managing virtual machines is down. We are working to bring it back up. This affects the FASSE login and FASSE OOD nodes and may degrade OpenAuth (two-factor).

    Affected hosts are:
    HOST -- STATUS

    dataverse-backup UNKNOWN

    demo2-l3-fs UNKNOWN

    enos-vote-l3-fs UNKNOWN

    fasselogin01 UNKNOWN

    fasselogin02 UNKNOWN

    frontier-squid02 UNKNOWN

    frontier-squid03 UNKNOWN

    frontier-squid04 UNKNOWN

    goel-adm24-l3-fs UNKNOWN

    goel-blind-l3-fs UNKNOWN

    goel-l3-fs UNKNOWN

    h-dev-fasseooda-01 UNKNOWN

    h-dev-fasseooda-lb01 UNKNOWN

    h-dev-fasseoodb-lb11 UNKNOWN

    h-fasseooda-01 UNKNOWN

    h-fasseooda-lb02 UNKNOWN

    h-fasseoodb-lb11 UNKNOWN

    h-fasseoodb-lb12 UNKNOWN

    h-fasseoodc-lb21 UNKNOWN

    h-fasseoodc-lb22 UNKNOWN

    h-qa-fasseooda-01 UNKNOWN

    h-qa-fasseooda-lb02 UNKNOWN

    holy-es-master01 UNKNOWN

    holy-es-master02 UNKNOWN

    holy-es-master03 UNKNOWN

    holynagios UNKNOWN

    kreindlerl3-fs UNKNOWN

    martin-su-l3-fs UNKNOWN

    mcconnell-l3-fs UNKNOWN

    openauth02 UNKNOWN

    shleifer-dsl3-fs UNKNOWN

    stock-solar-l3-fs UNKNOWN

    stopsack-l3-fs UNKNOWN

    xcat UNKNOWN
