Experiencing partially degraded performance

Status page for the Harvard FAS Research Computing cluster and other resources.

Cluster Utilization (VPN and FASRC login required): Cannon | FASSE | Academic


Please scroll down to see details on any Incidents or maintenance notices.
Monthly maintenance occurs on the first Monday of the month (except holidays).

GETTING HELP
https://docs.rc.fas.harvard.edu | https://portal.rc.fas.harvard.edu | Email: rchelp@rc.fas.harvard.edu


The colors shown in the bars below were chosen to increase visibility for color-blind visitors.
For higher contrast, switch to light mode at the bottom of this page if the background is dark and colors are muted.

holyscratch01 degradation

Resolved
Operational
Started 4 months ago | Lasted about 23 hours

Affected

Cannon Cluster
SLURM Scheduler - Cannon
Login Nodes
Login Nodes - Boston
Login Nodes - Holyoke
FASSE login nodes
Updates
  • Resolved

    This incident has been resolved.

    holyscratch01 is performing normally again.

    Thanks for your patience.

  • Identified
    Update

    (message sent to cluster users)

    We are currently experiencing a problem with the scratch filesystem that is having a varied impact on users and jobs.

    We are in contact with the vendor, as this appears to be an underlying hardware issue affecting the distributed nature of the filesystem. To be more specific, one of the storage bricks making up the filesystem is inaccessible. This makes it impossible to predict which processes, files, and/or users will be impacted, as files are distributed randomly over the entire filesystem, including the impacted brick.

    At this time scratch is online, but performance is degraded and it is difficult to predict where problems will manifest for users. Login sessions may become stuck, ls of the scratch filesystem may hang, and read/write operations may hang intermittently.
    Globus endpoints may report errors or stall.

    Login nodes and OOD/VDI may also be affected.

    Note that the data itself is not lost, only temporarily inaccessible. Once the impacted brick is repaired, it will be put back into service.

    While we generally recommend holyscratch01 for production work, we suggest that users rely on other storage they have access to as temporary scratch space until holyscratch01 is repaired. Tier 0 and Tier 1 storage can support computation, as can holylabs (see Sketch 1 after the update timeline below). If you do so, please curtail your usage appropriately so that you do not adversely impact yourself or your neighbors on the filesystem.

    We will provide updates on our status page as this issue evolves. Please visit our status page at:
    https://fasrc.instatus.com/ (aka https://status.rc.fas.harvard.edu/)

    Thanks for your patience as we work to resolve this.
    FAS Research Computing
    https://www.rc.fas.harvard.edu/
    https://status.rc.fas.harvard.edu/

  • Identified
    Update

    We are continuing to work on a fix for this incident.

    This impacts general scratch usage. It may cause sessions to freeze or affect your ability to start jobs. It also affects Globus and, frequently, logins that touch scratch: if you have symlinks to scratch in your home directory or references to scratch in your .bashrc, it is a good idea to remove them (see Sketch 2 after the update timeline below).
    Login nodes and OOD/VDI may also be affected.

    We will post updates as we have them.

  • Identified

    holyscratch01 is stabilized, but 2 OSTs (object storage targets) are offline. The data on these OSTs will be unavailable while we work on this.

    This is also impacting the Slurm scheduler, as jobs are not exiting properly or are getting wedged on exit. This is causing new jobs to pend for a very long time before they can start.

    We are continuing to work on a fix for this incident.

  • Investigating

    holyscratch01 is having connectivity issues. We are investigating this problem.
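
Sketch 1 (referenced in the updates above): a minimal Python example of choosing a fallback temporary scratch directory while holyscratch01 is degraded. The candidate locations below (a LAB_STORAGE environment variable and ~/tmp) are placeholders rather than FASRC-defined paths; substitute Tier 0 / Tier 1 or holylabs storage your group actually has access to.

    import os
    import tempfile

    # Hypothetical fallback locations, checked in order of preference.
    # Replace these with Tier 0 / Tier 1 or holylabs storage your group has access to.
    CANDIDATES = [
        os.environ.get("LAB_STORAGE"),   # placeholder environment variable, not FASRC-defined
        os.path.expanduser("~/tmp"),     # home directory as a last resort
    ]

    def pick_scratch():
        """Return the first writable directory that can stand in for scratch."""
        for path in CANDIDATES:
            if path and os.path.isdir(path) and os.access(path, os.W_OK):
                return path
        return tempfile.gettempdir()

    # Create a per-job working directory under the chosen location.
    workdir = tempfile.mkdtemp(prefix="job_", dir=pick_scratch())
    print("Using temporary scratch:", workdir)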
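
Sketch 2 (referenced in the updates above): a minimal Python example that lists symlinks in your home directory and lines in your .bashrc that reference holyscratch01, so you can review and remove them. Matching on the substring "holyscratch01" is an assumption about how the scratch path appears in link targets and shell config; adjust as needed.

    import os

    SCRATCH_HINT = "holyscratch01"   # assumed substring of the scratch mount path
    home = os.path.expanduser("~")

    # Symlinks at the top level of the home directory whose target mentions scratch.
    for name in os.listdir(home):
        path = os.path.join(home, name)
        if os.path.islink(path) and SCRATCH_HINT in os.readlink(path):
            print("symlink:", path, "->", os.readlink(path))

    # Lines in .bashrc that mention scratch (cd commands, exports, aliases, etc.).
    bashrc = os.path.join(home, ".bashrc")
    if os.path.exists(bashrc):
        with open(bashrc) as fh:
            for lineno, line in enumerate(fh, start=1):
                if SCRATCH_HINT in line:
                    print(".bashrc:%d: %s" % (lineno, line.rstrip()))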