FAS Research Computing - some nodes in drain state - Incident details

Status page for the Harvard FAS Research Computing cluster and other resources.
WINTER BREAK: Harvard and FASRC will be closed for winter break starting Sat. Dec. 21st, 2024, and will return on Jan. 2nd, 2025. We will monitor for critical issues; all other work will be deferred until we return.

Cluster Utilization (VPN and FASRC login required): Cannon | FASSE


Monthly maintenance occurs on the first Monday of the month (except holidays).

GETTING HELP
https://docs.rc.fas.harvard.edu | https://portal.rc.fas.harvard.edu | Email: rchelp@rc.fas.harvard.edu



some nodes in drain state

Resolved
Operational
Started about 2 months ago; lasted about 6 hours

Affected

Cannon Cluster
Cannon Compute Cluster (Holyoke)
Updates
  • Resolved

    The number of affected nodes remaining is very low, well below our incident threshold (at any time, some number of nodes on the cluster will be out of commission).

    We are closing this incident. Any remaining nodes will be brought back into service as they are assessed and cleared; some require reboots but must first be fully drained of running jobs.

    Thanks,

    FAS Research Computing

  • Monitoring

    More nodes are open again. Please note that fewer than 10% of our total nodes were affected, so many users will not have noticed any issues.

    The root cause appears to be issues with holyscratch01. Failing over a hung OST (Lustre object storage target) helped.

    Please move off of holyscratch01 as soon as possible and begin using the new netscratch instead. holyscratch01 will become read-only on Dec. 2nd, 2024. (A minimal migration sketch follows the updates at the bottom of this page.)

    https://www.rc.fas.harvard.edu/blog/announcing-netscratch/

  • Update

    Please note that this also affects nodes beyond the test partition. We are working to return all nodes to service, but on many of them we must first allow running jobs to complete and the nodes to finish draining.

  • Update

    Some test-partition nodes have re-opened. We are working on the remainder; no ETA yet.

  • Identified
    Many (if not most) nodes in the test partition are in drain state: no new jobs are scheduled on them while running jobs finish. We are investigating and re-opening nodes as we can. More details to follow. (A sketch for listing draining nodes follows these updates.)
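
For reference, below is a minimal sketch of how a user on a login node might list nodes in drain state together with the reason Slurm records for each. This is an illustration only, not an FASRC tool; it assumes Slurm's sinfo command is available on PATH.

#!/usr/bin/env python3
"""Illustrative sketch: list drained/draining nodes and the recorded reason.

Assumes Slurm's sinfo is on PATH (e.g. on a cluster login node).
"""

import subprocess


def draining_nodes():
    """Return (node, reason) pairs for nodes in DRAIN-related states."""
    out = subprocess.run(
        ["sinfo", "--states=drain", "--Node", "--noheader", "--format=%N|%E"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    pairs = []
    for line in out.splitlines():
        node, _, reason = line.partition("|")
        pairs.append((node.strip(), reason.strip()))
    return pairs


if __name__ == "__main__":
    for node, reason in draining_nodes():
        print(f"{node}: {reason}")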
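
And for users moving off of holyscratch01, a minimal sketch of an incremental copy to netscratch. This too is illustrative only, not an official FASRC migration tool: the /n/... paths are hypothetical placeholders for your own lab and user directories, and for large transfers a dedicated tool such as rsync is usually the better choice.

#!/usr/bin/env python3
"""Illustrative sketch: incremental copy from holyscratch01 to netscratch.

The source and destination paths are hypothetical placeholders; substitute
your own scratch directories. Files already present at the destination with
matching size and mtime are skipped, so the script can be re-run safely.
"""

import os
import shutil

SRC = "/n/holyscratch01/example_lab/example_user"  # hypothetical path
DST = "/n/netscratch/example_lab/example_user"     # hypothetical path


def already_copied(src_file, dst_file):
    """Treat a file as copied if size and mtime match at the destination."""
    if not os.path.exists(dst_file):
        return False
    s, d = os.stat(src_file), os.stat(dst_file)
    return s.st_size == d.st_size and int(s.st_mtime) == int(d.st_mtime)


def migrate(src_root, dst_root):
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        dst_dir = dst_root if rel == "." else os.path.join(dst_root, rel)
        os.makedirs(dst_dir, exist_ok=True)
        for name in filenames:
            src_file = os.path.join(dirpath, name)
            dst_file = os.path.join(dst_dir, name)
            if already_copied(src_file, dst_file):
                continue
            # copy2 preserves timestamps, so re-runs can skip finished files.
            shutil.copy2(src_file, dst_file)
            print(f"copied {src_file} -> {dst_file}")


if __name__ == "__main__":
    migrate(SRC, DST)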