FAS Research Computing - Cluster-wide storage issues affecting jobs and performance - Severe - WIP – Incident details

Status page for the Harvard FAS Research Computing cluster and other resources.



Monthly maintenance occurs on the first Monday of the month (except holidays).

GETTING HELP
https://docs.rc.fas.harvard.edu | https://portal.rc.fas.harvard.edu | Email: rchelp@rc.fas.harvard.edu



Cluster-wide storage issues affecting jobs and performance - Severe - WIP

Resolved
Operational
Started almost 2 years ago; lasted 4 days

Affected

The following components were each listed as operational from 9:02 PM to 3:48 PM:

  • Cannon Cluster
  • SLURM Scheduler - Cannon
  • Cannon Compute Cluster (Holyoke)
  • Boston Compute Nodes
  • GPU nodes (Holyoke)
  • Login Nodes

Updates
  • Resolved

    holylabs is once again stable and remounted across the cluster.
    We will resolve this incident once the maintenance period is over.

  • Update

    The issue with holylabs and the LNet routers is affecting other filesystems and the cluster at large. To provide some degree of stability on the cluster this weekend, we are temporarily unmounting holylabs, which means it will not be accessible. All data on holylabs is safe.

    We are continuing to work on this issue.

  • Identified

    We will be removing holylabs' ability to talk to the LNet routers (FDR) so that we can reclaim stability elsewhere. However, this means that holylabs will only be available to nodes that use HDR and do not rely on LNet, such as the compute nodes.

    This means that until a solution is found for the instability, holylabs will not be available on login nodes, OOD head nodes (where the GUI runs), or on data transfer nodes including Globus. A quick way to check whether holylabs is mounted on a particular node is sketched at the end of this page.

    We are also in contact with Lustre support for help resolving this.

    No ETA at this time.

  • Investigating

    This issue has returned, but it is not precisely the same as the earlier issue. A problem we do not yet fully understand is causing instability between the Lustre filesystems (holylabs being the most susceptible) and the LNet routers that connect them to certain hosts. We have rolled back our changes to the LNet routers, but we are still seeing holylabs become unusable, or only partially usable, from some hosts.

    At this time the knock-on effect impacts not just cluster nodes and jobs but also VDI/OOD and login nodes.

    Unfortunately, we do not have an ETA at this time, and much of the compute cluster remains unstable.
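
Checking whether holylabs is mounted on a node

A minimal sketch of how a job or interactive session can verify that holylabs is reachable from the node it is running on, for the period while the filesystem is only mounted on some hosts. The /n/holylabs mount point is an assumption; substitute your lab's actual holylabs path. Python is used here only because the check is the same on any host.

    # check_holylabs.py - minimal sketch; /n/holylabs is an assumed mount point
    import os
    import sys

    MOUNT_POINT = "/n/holylabs"  # assumption: replace with your lab's holylabs path

    # os.path.ismount() returns True only when the path is an active mount point,
    # so it distinguishes a mounted holylabs from a leftover empty directory.
    if os.path.ismount(MOUNT_POINT):
        print(f"{MOUNT_POINT} is mounted on this node")
    else:
        print(f"{MOUNT_POINT} is NOT mounted on this node; "
              "stage data elsewhere or wait for the remount", file=sys.stderr)
        sys.exit(1)

Running the check at the top of a batch script lets a job fail fast with a clear message instead of hanging on an unavailable filesystem.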