Experiencing partially degraded performance

Status page for the Harvard FAS Research Computing cluster and other resources.

Cluster Utilization (VPN and FASRC login required): Cannon | FASSE | Academic


Please scroll down to see details on any Incidents or maintenance notices.
Monthly maintenance occurs on the first Monday of the month (except holidays).

GETTING HELP
https://docs.rc.fas.harvard.edu | https://portal.rc.fas.harvard.edu | Email: rchelp@rc.fas.harvard.edu


The colors shown in the bars below were chosen to increase visibility for color-blind visitors.
For higher contrast, switch to light mode at the bottom of this page if the background is dark and colors are muted.

holyscratch01 degradation

Resolved
Operational
Started 4 months ago | Lasted about 23 hours

Affected

Cannon Cluster
SLURM Scheduler - Cannon
Login Nodes
Login Nodes - Boston
Login Nodes - Holyoke
FASSE login nodes
Updates
  • Resolved

    This incident has been resolved.

    holyscratch01 is performing normally again.

    Thanks for your patience.

  • Identified
    Update

    (message sent to cluster users)

    We are currently experiencing a problem with the scratch filesystem that is having a varied impact on users and jobs.

    We are in contact with the vendor, as this appears to be an underlying hardware issue affecting the distributed nature of the filesystem. To be more specific, one of the storage bricks making up the filesystem is inaccessible. This makes it impossible to predict which processes, files, and/or users will be impacted, as files are distributed randomly over the entire filesystem, including the impacted brick.

    At this time scratch is online, but performance is degraded and it is difficult to predict where problems will manifest for users. Login sessions may become stuck, ls of the scratch filesystem may hang, and read/write operations may hang intermittently.
    Globus endpoints may report errors or stall.

    Login nodes and OOD/VDI may also be affected.

    Note that the data itself is not lost, only temporarily inaccessible. Once the impacted brick is repaired, it will be put back into service.

    While we generally recommend holyscratch01 for production work, we suggest that users rely on other storage they have access to as temporary scratch space until holyscratch01 is repaired. Tier 0 and Tier 1 storage can support computation, as can holylabs (see Sketch 1 after the update timeline below). If you do so, please curtail your usage appropriately so that you do not adversely impact yourself or your neighbors on the filesystem.

    We will provide updates on our status page as this issue evolves. Please visit our status page at:
    https://fasrc.instatus.com/ (aka https://status.rc.fas.harvard.edu/)

    Thanks for your patience as we work to resolve this.
    FAS Research Computing
    https://www.rc.fas.harvard.edu/
    https://status.rc.fas.harvard.edu/

  • Identified
    Update

    We are continuing to work on a fix for this incident.

    This impacts general scratch usage. It may cause sessions to freeze or affect your ability to start jobs. It also affects Globus and, frequently, logins that touch scratch: if you have symlinks to scratch in your home directory or references to scratch in your .bashrc, it is a good idea to remove them (see Sketch 2 after the update timeline below).
    Login nodes and OOD/VDI may also be affected.

    We will post updates as we have them.

  • Identified

    holyscratch01 is stabilized, but 2 OSTs (object storage targets) are offline. The data on these OSTs will be unavailable while we work on this.

    This is also impacting the Slurm scheduler, as jobs are not exiting properly or are getting wedged on exit. This is causing new jobs to pend for a very long time before they can start.

    We are continuing to work on a fix for this incident.

  • Investigating

    holyscratch01 is having connectivity issues. We are investigating this problem.
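
Sketch 1 (referenced in the updates above): a minimal Python example of choosing a fallback temporary scratch directory while holyscratch01 is degraded. The candidate locations below (a LAB_STORAGE environment variable and ~/tmp) are placeholders rather than FASRC-defined paths; substitute Tier 0 / Tier 1 or holylabs storage your group actually has access to.

    import os
    import tempfile

    # Hypothetical fallback locations, checked in order of preference.
    # Replace these with Tier 0 / Tier 1 or holylabs storage your group has access to.
    CANDIDATES = [
        os.environ.get("LAB_STORAGE"),   # placeholder environment variable, not FASRC-defined
        os.path.expanduser("~/tmp"),     # home directory as a last resort
    ]

    def pick_scratch():
        """Return the first writable directory that can stand in for scratch."""
        for path in CANDIDATES:
            if path and os.path.isdir(path) and os.access(path, os.W_OK):
                return path
        return tempfile.gettempdir()

    # Create a per-job working directory under the chosen location.
    workdir = tempfile.mkdtemp(prefix="job_", dir=pick_scratch())
    print("Using temporary scratch:", workdir)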
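
Sketch 2 (referenced in the updates above): a minimal Python example that lists symlinks in your home directory and lines in your .bashrc that reference holyscratch01, so you can review and remove them. Matching on the substring "holyscratch01" is an assumption about how the scratch path appears in link targets and shell config; adjust as needed.

    import os

    SCRATCH_HINT = "holyscratch01"   # assumed substring of the scratch mount path
    home = os.path.expanduser("~")

    # Symlinks at the top level of the home directory whose target mentions scratch.
    for name in os.listdir(home):
        path = os.path.join(home, name)
        if os.path.islink(path) and SCRATCH_HINT in os.readlink(path):
            print("symlink:", path, "->", os.readlink(path))

    # Lines in .bashrc that mention scratch (cd commands, exports, aliases, etc.).
    bashrc = os.path.join(home, ".bashrc")
    if os.path.exists(bashrc):
        with open(bashrc) as fh:
            for lineno, line in enumerate(fh, start=1):
                if SCRATCH_HINT in line:
                    print(".bashrc:%d: %s" % (lineno, line.rstrip()))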