8th August 2019

SLURM scheduler issues

UPDATE 8/19/19: Slurm has been stable throughout the weekend and today. Returning status to normal.

UPDATE 8/12/19: We continue to have stability issues with Slurm. As you may have seen, it has crashed a few more times over the weekend and overnight. If you receive errors submitting new jobs or interactive sessions, or while trying to view job information, please wait a few minutes and try again. We are continuing to work with the developers and our local experts to find the root cause. Thanks for your understanding.

UPDATE 8/9/19 6:45PM - Restart successful, but stability is expected to remain the same until we have a solution from the vendor or determine a root cause. We will monitor as best we can and restart the scheduler as necessary. If you receive errors submitting new jobs or interactive sessions, or while trying to view job information, please wait a few minutes and try again. If you receive errors over a long period or overnight, please do contact us.

UPDATE 8/9/19 6:35PM - Slurm has crashed again. We are attempting a full restart.

UPDATE 4:20 PM: (second update to correct expectations - prior update was prematurely optimistic) Updates complete and scheduler restarted. Slurm appears to be stable for the time being. This update may not alleviate the crashing issue, but if crashes do recur it will allow us to provide more detailed forensics to the developers. If you receive errors submitting new jobs or interactive sessions, or while trying to view job information, please wait a few minutes and try again. If the instability is not resolved by this update, we will continue to watch and restart as necessary, so the above advice will still apply. If you receive errors over a long period or overnight, please do contact us.

UPDATE 3:30PM: To: FASRC Cluster Users

As some of you have no doubt seen, we have had some instability issues with Slurm (the cluster's job scheduler) after the recent update. This has resulted in Slurm crashing or needing to be restarted. On your end, this manifests as timeouts, connection errors, and an inability to see up-to-date job information.

We are working with the Slurm developers on this issue, but please be aware that this will require occasionally restarting the scheduler. These restarts will begin at approximately 3:30 PM today and continue through the afternoon. Currently running jobs will not be affected, but your ability to schedule new jobs/sessions and view job statistics may be affected at times.

If you receive errors submitting new jobs or interactive sessions, or while trying to view job information, please wait a few minutes and try again. If the instability is not resolved by this update, we will continue to watch and restart as necessary, so the above advice will still apply.

We will update the status as it changes at https://status.rc.fas.harvard.edu/

Thank you for your patience. We will attempt to minimize the impact as much as possible.

FAS Research Computing | FAS Division of Sciences https://rc.fas.harvard.edu/ https://status.rc.fas.harvard.edu/

=====

The cluster scheduler (SLURM) is running, but it has been somewhat unstable since the latest upgrade and requires periodic restarts.

If you are getting Slurm connection errors or not seeing up-to-date job info, please wait a few minutes and try again. If this lasts more than about 15 minutes, then please notify us at that point.
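
For those who script their job submissions or monitoring, the sketch below shows one way to automate the "wait a few minutes and try again" advice. It is a minimal, illustrative example only, not an FASRC-provided tool: the script name, retry count, and two-minute wait are arbitrary assumptions, and it simply re-runs whatever Slurm client command you pass it (for example squeue or sbatch). If errors still persist beyond about 15 minutes, contact us as described above.

    #!/usr/bin/env python3
    # retry_slurm.py - hypothetical retry wrapper for transient Slurm errors.
    # Re-runs a Slurm client command (e.g. squeue, sbatch) a few times with a
    # pause between attempts, since errors caused by a scheduler restart
    # usually clear within a few minutes.
    import subprocess
    import sys
    import time

    def run_with_retries(cmd, attempts=5, wait_seconds=120):
        """Run `cmd` (a list of argument strings), retrying on nonzero exit."""
        returncode = 1
        for attempt in range(1, attempts + 1):
            returncode = subprocess.run(cmd).returncode
            if returncode == 0:
                return 0
            print(f"Attempt {attempt} failed (exit {returncode}); "
                  f"retrying in {wait_seconds} seconds...", file=sys.stderr)
            time.sleep(wait_seconds)
        return returncode

    if __name__ == "__main__":
        # Example: python3 retry_slurm.py squeue -u $USER
        sys.exit(run_with_retries(sys.argv[1:] or ["squeue"]))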

We are monitoring this as closely as possible, but overnight stalls/crashes may not be addressed as quickly as during normal hours.

For issues not shown here, please contact FASRC via
https://portal.rc.fas.harvard.edu or email rchelp@rc.fas.harvard.edu