SLURM Scheduler - Cannon - Operational
Cannon Compute Cluster (Holyoke) - Operational
Boston Compute Nodes - Operational
GPU nodes (Holyoke) - Operational
seas_compute - Operational
SLURM Scheduler - FASSE - Operational
FASSE Compute Cluster (Holyoke) - Operational
Kempner Cluster CPU - Operational
Kempner Cluster GPU - Operational
FASSE login nodes - Operational
Cannon Open OnDemand - Operational
FASSE Open OnDemand - Operational
Netscratch (Global Scratch) - Operational
Home Directory Storage - Boston - Operational
Tape - (Tier 3) - Operational
Holylabs - Operational
Isilon Storage Holyoke (Tier 1) - Operational
Holystore01 (Tier 0) - Operational
HolyLFS04 (Tier 0) - Operational
HolyLFS05 (Tier 0) - Operational
HolyLFS06 (Tier 0) - Operational
Holyoke Tier 2 NFS (new) - Operational
Holyoke Specialty Storage - Operational
holECS - Operational
Isilon Storage Boston (Tier 1) - Operational
BosLFS02 (Tier 0) - Operational
Boston Tier 2 NFS (new) - Operational
CEPH Storage Boston (Tier 2) - Operational
Boston Specialty Storage - Operational
bosECS - Operational
Samba Cluster - Operational
Globus Data Transfer - Operational
Notice history
April 2025
- Resolved
Cannon boslogin and FASSE login nodes are back up and operational.
All holylogin nodes are still down for repair. Please see our posted incident for more updates: https://status.rc.fas.harvard.edu/cm97gyay90013dturk7fxg5pb
We apologize for the unexpected disruption.
- Investigating
Due to a configuration error, all cluster login nodes are rebooting and are temporarily unavailable. Please save any work immediately.
- Resolved
Hardware has been repaired and Holyoke login nodes are back online. Thanks for your patience.
- Monitoring
Holylogin chassis repair during maintenance was unsuccessful and replacement parts have been ordered.
Holyoke login nodes (holylogin05-08) are down for hardware repair.
Only Boston login nodes are available (i.e., boslogin[05-08]).
If you have holylogin hard-coded in your scripts, please update to login.rc.fas.harvard.edu or boslogin.rc.fas.harvard.edu for the time being, which will redirect you to an available login node.
As always, the best method for obtaining a login node is using login.rc.fas.harvard.edu, which will pick a node for you.
If you require a login node in a specific data center, use boslogin.rc.fas.harvard.edu (Boston) or (once they are back in service) holylogin.rc.fas.harvard.edu (Holyoke).
See also: Command line access with Terminal (login nodes) – FASRC DOCS
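For scripts that still contain a hard-coded holylogin hostname, the substitution described above could be sketched roughly as follows. This is only an illustration: the function name `replace_holylogin` and the regular expression are assumptions made for this example, not a FASRC-provided tool, and any rewritten script should be reviewed by hand before use.

```python
import re

# Matches "holylogin", optionally followed by a two-digit node number
# (e.g. holylogin05) and/or the .rc.fas.harvard.edu domain.
# The pattern is an assumption for this sketch; adjust it to your scripts.
HOLYLOGIN = re.compile(r"\bholylogin(?:\d{2})?(?:\.rc\.fas\.harvard\.edu)?\b")

# Generic name from the notice; it picks an available login node.
GENERIC_LOGIN = "login.rc.fas.harvard.edu"


def replace_holylogin(text: str, target: str = GENERIC_LOGIN) -> str:
    """Return text with hard-coded holylogin hostnames replaced by target."""
    return HOLYLOGIN.sub(target, text)


if __name__ == "__main__":
    line = "ssh jharvard@holylogin05.rc.fas.harvard.edu"
    print(replace_holylogin(line))
```

Passing `target="boslogin.rc.fas.harvard.edu"` would pin the script to the Boston login nodes instead of the generic name.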
- Resolved
This incident was posted by mistake.
holylogin01-04 were replaced by holylogin05-08 some time back.
As always, the best method for obtaining a login node is using login.rc.fas.harvard.edu, which will pick a node for you.
Or if you require a login node in a specific data center, use boslogin.rc.fas.harvard.edu (Boston) or holylogin.rc.fas.harvard.edu (Holyoke).
See also: Command line access with Terminal (login nodes) – FASRC DOCS
- Investigating
Holylogin chassis repair during maintenance was unsuccessful and replacement parts have been ordered.
Audience:
All cluster users
Impact:
All holylogin** servers will be down until further notice.
Only Boston login nodes are available (i.e., boslogin[05-08]).
If you have holylogin hard-coded in your scripts, please update to login.rc.fas.harvard.edu or boslogin.rc.fas.harvard.edu for the time being, which will redirect you to an available login node.
Updates to follow as we have them.
March 2025
- Update: March 03, 2025 at 18:00
Maintenance has completed successfully.
- Completed: March 03, 2025 at 18:00
Maintenance has completed successfully.
- In progress: March 03, 2025 at 14:00
Maintenance is now in progress.
- Scheduled: March 03, 2025 at 14:00
PLEASE NOTE - New time window going forward - 9am-1pm
FASRC monthly maintenance will take place Monday March 3rd, 2025 from 9am-1pm
NOTICES
Training: Upcoming training from FASRC and other sources can be found on our Training Calendar at https://www.rc.fas.harvard.edu/upcoming-training/
Status Page: You can subscribe to our status to receive notifications of maintenance, incidents, and their resolution at https://status.rc.fas.harvard.edu/ (click Get Updates for options).
Upcoming holidays: Memorial Day - Monday, May 26
You can subscribe to our status page using the Get Updates button in the upper right
MAINTENANCE TASKS
Cannon cluster will be paused during this maintenance?: YES
FASSE cluster will be paused during this maintenance?: YES
Slurm Upgrade to 24.11.2 - Crucial Update
Audience: All cluster users
Impact: Jobs and the scheduler will be paused during this upgrade
Open OnDemand (OOD) reboots
Audience: All OOD users
Impact: All Open OnDemand (aka OOD/VDI/RCOOD) nodes will be rebooted
Login node reboots
Audience: Anyone logged into a FASRC Cannon or FASSE login node
Impact: Login nodes will be rebooted during this maintenance window
bos-Isilon firmware updates
Audience: bos-isilon users
Impact: No noticeable impact for storage users
Netscratch retention/cleanup ( https://docs.rc.fas.harvard.edu/kb/policy-scratch/ )
Audience: Cluster users
Impact: Files older than 90 days will be removed. Please note that retention cleanup can and does run at any time, not just during the maintenance window.
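To get a rough idea of which files might be affected by the 90-day cleanup, a listing along these lines could be used. It is a sketch under stated assumptions: it uses file modification time as the age criterion (the scratch policy linked above is the authority on which timestamp actually applies), the function name `files_older_than` is made up for this example, and the script only lists files and never deletes anything.

```python
import os
import stat
import time
from pathlib import Path

RETENTION_DAYS = 90  # from the notice: files older than 90 days are removed


def files_older_than(root, days=RETENTION_DAYS, now=None):
    """Yield regular files under root whose mtime is more than `days` days old.

    Assumption: age is measured by modification time (mtime).
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # 86400 seconds per day
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                st = path.lstat()
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode) and st.st_mtime < cutoff:
                yield path


if __name__ == "__main__":
    # Hypothetical usage: scan the current directory.
    for stale in files_older_than("."):
        print(stale)
```

Lowering `days` (for example to 80) gives an early warning list of files approaching the cutoff.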
Thank you,
FAS Research Computing
https://www.rc.fas.harvard.edu/
https://docs.rc.fas.harvard.edu/
February 2025
- Resolved
An emergency patch of the scheduler has resolved the Multiple Partition issue.
- Investigating
Since mid-January we've been seeing some strange issues with the scheduler which caused periodic stalls or unresponsiveness in the scheduler. We had hoped that the Slurm upgrade to 24.11.1 would resolve those issues due to various architecture changes in the communications backend. Unfortunately it did not; we have since opened an issue with SchedMD (our service vendor for the scheduler). This has since spiraled into finding several other issues with the scheduler which we are working to remediate. Below is a status report regarding these issues:
1. High Agent Load Stall (RESOLVED): This was reported in https://support.schedmd.com/show_bug.cgi?id=21975. The scheduler would stall due to being oversaturated with blocking requests. This turned out to be due to a new Slurm feature called stepmgr, which we had enabled to handle jobs with many steps. Unfortunately this feature also increased the load on the scheduler when array jobs exited at the same time, which caused the stall. Since we tend not to have many users that use many steps, we opted to disable the stepmgr function. This resolved the High Agent Load issue. Users that have many steps in their job may still turn on the stepmgr for their specific job by adding #SBATCH --stepmgr (https://slurm.schedmd.com/sbatch.html#OPT_stepmgr).
2. Scheduler Thrashing (MONITORING): We discovered this while working on the previous bug and continued to work on it in the same bug report: https://support.schedmd.com/show_bug.cgi?id=21975. Under high load, the scheduler would get into a thrashing state where it would effectively go heads down and ignore incoming requests in order to focus on scheduling jobs. To users this would look like the scheduler was unresponsive, as it was ignoring their requests to deal with higher priority traffic. To remediate this we increased the thread count for the scheduler and implemented a throttle to slow things down so that the scheduler could respond to all the requests without impacting scheduler throughput. This is in place now and appears to have resolved the issue. We are continuing to monitor the scheduler to tune this throttle.
3. --test-only requeue crash (RESOLVED): During this investigation we also ran into another bug reported by another group related to jobs that were submitted using --test-only that would in theory preempt other jobs (see: https://support.schedmd.com/show_bug.cgi?id=21997). This caused the scheduler to crash. Given the severity of the bug we emergency patched the scheduler on Feb 12th to resolve this issue.
4. Multiple Partition Jobs Labelled with Wrong Partition (IN PROGRESS): This is a new issue identified on 2/13 related to jobs that submit to multiple partitions at once (https://support.schedmd.com/show_bug.cgi?id=22076). When the job schedules it may run in one partition but be labelled as being in another. This can lead to job preemption issues as the jobs are labelled as being in partitions that cannot be preempted even though they were originally scheduled in partitions that could be. This was identified earlier by another group and SchedMD is working on a patch. Depending on the timing FASRC will either emergency patch the scheduler for this issue or wait for the formal release of 24.11.2. Note that this issue really only impacts preemption and the scheduler is working fine otherwise. If you see jobs that you think should be preempted but are not and are blocking your work please let us know and we will investigate.
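For users whose jobs run many steps (item 1 above), the per-job opt-in could look roughly like the batch script generated below. This is a minimal sketch: the helper `render_batch_script`, the job name, and the step commands are hypothetical, and only the `#SBATCH --stepmgr` directive comes from the notice itself.

```python
def render_batch_script(job_name, step_commands, use_stepmgr=True):
    """Build the text of an sbatch script that runs each command as an srun step."""
    lines = ["#!/bin/bash", f"#SBATCH --job-name={job_name}"]
    if use_stepmgr:
        # Per-job opt-in mentioned in the notice for jobs with many steps.
        lines.append("#SBATCH --stepmgr")
    for cmd in step_commands:
        # Each srun invocation inside the batch script is a separate job step.
        lines.append(f"srun {cmd}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical job with three steps; ./process and its arguments are placeholders.
    steps = [f"./process chunk_{i}" for i in range(3)]
    print(render_batch_script("many_steps", steps))
```

The printed text can be saved to a file and submitted with sbatch. Jobs with only a handful of steps most likely do not need the directive.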
Thank you for your patience as we work through these issues.

