At about 3:00pm today we started seeing issues with ScratchLFS being unreachable from some nodes. Shortly afterwards it became apparent that ScratchLFS is down across all of Odyssey. Our engineers are actively working on a solution to bring it back up and make it available to all users again. We have also notified our ScratchLFS vendor of the issue(s) and requested assistance. This page will be updated periodically as things progress.
EDIT @ 5:09pm: Our engineers have tried several fixes to bring ScratchLFS back to a functioning state, but none has yielded promising results. The root cause is not yet clear. The system continues to be down. We are continuing to work on bringing the system up, but at this time we do not have an ETA for when it will be available for use.
EDIT @ 5:30pm: We are pausing SLURM to try and work through the issues with ScratchLFS. SLURM may be paused intermittently as we work through the ScratchLFS outage. While SLURM is paused, new jobs cannot be submitted and SLURM will be unreachable for other commands.
EDIT @ 7:47pm: We are still working to identify a fix to bring ScratchLFS back into operation. At this time we have provided detailed information about the failure of mdt1 (and, to a lesser degree, of mdt0) to the vendor and are awaiting their response with a potential fix. In addition, we have decided to keep SLURM in a paused state due to this outage.
EDIT @ April 13th at 5:00pm: We found a fix to make ScratchLFS operational. We will continue to work on this next week to further diagnose the issue and work with the vendor on a better solution so that it does not happen again.