Our engineers worked through the night with help from the vendor to bring ScratchLFS back to an operational state. It has become apparent that the problem is a Lustre bug that has wedged the entire system into a continuous failed state. We pulled the crash logs from /scratch on that host, and the issue appears to involve the Distributed Namespace Environment (DNE) and possibly the Progressive File Layout (PFL) feature. Resolving it will require recovering all of the failed-over systems, which may be time consuming. In addition, the vendor identified several fibre cables that may have gone bad or otherwise need replacement. FASRC engineers are onsite at the datacenter in Holyoke today to swap out the needed parts.
SLURM will remain in a down state until ScratchLFS can be brought back into an operational state. We do not currently have an ETA for when that will be. Feel free to check back here for updates as the situation evolves.