(message sent to cluster users)
We are currently experiencing a problem with the scratch filesystem that is affecting users and jobs to varying degrees.
We are in contact with the vendor, as this appears to be an underlying hardware issue affecting the distributed filesystem. Specifically, one of the storage bricks that make up the filesystem is inaccessible. Because files are distributed randomly across the entire filesystem, including the impacted brick, it is impossible to predict which processes, files, and/or users will be affected.
At this time scratch is online, but performance is degraded and it is difficult to predict where problems will manifest for users. Login sessions may become stuck, 'ls' of the scratch filesystem may hang, and read/write operations may hang persistently or intermittently.
Globus endpoints may report errors or stall.
Login nodes and Open OnDemand (OOD)/VDI sessions may also be affected.
To be clear, data is not lost, only temporarily inaccessible. Once the impacted brick is repaired, it will be returned to service.
While we generally recommend holyscratch01 for production work, we recommend that users rely on other storage they have access to as temporary scratch space until holyscratch01 is repaired. Tier 0 and Tier 1 storage, as well as holylabs, can support computation (see the sketch below). If you do use these alternatives, please curtail your usage appropriately so that you do not adversely impact yourself or your neighbors on those filesystems.
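For example, a minimal sketch of pointing a job's temporary directory at lab storage instead of holyscratch01 (the paths below are placeholders only; substitute your own lab's holylabs or Tier 0/Tier 1 allocation):

    # Hypothetical paths -- replace <your_lab> with your lab's actual allocation
    export TMPDIR=/n/holylabs/LABS/<your_lab>/Lab/scratch_tmp
    mkdir -p "$TMPDIR"
    cd "$TMPDIR"    # run I/O-heavy work here until holyscratch01 is repaired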
We will post updates as this issue evolves on our status page:
https://fasrc.instatus.com/ (aka https://status.rc.fas.harvard.edu/)
Thanks for your patience as we work to resolve this.
FAS Research Computing
https://www.rc.fas.harvard.edu/
https://status.rc.fas.harvard.edu/