Notice history

Operational

Apr 2025

Resolved
April 28, 2025 at 20:36 UTC
Starfish is once again accepting logins.
Investigating
April 28, 2025 at 14:25 UTC
Starfish is not allowing login due to a database issue. We have contacted the vendor and expect to resolve this soon.

Resolved
April 27, 2025 at 12:41 UTC
Most nodes have been returned to service, and both Cannon and FASSE are operable. Remaining down nodes will be remediated by FASRC staff over the next several days.
Investigating
April 27, 2025 at 11:02 UTC
At 1:05am EDT MGHPCC lost power to compute due to a cooling failure. The chiller has been restored and power is back as of 5:43am. As a result, all compute is powered off and all running jobs were requeued. FASRC staff are triaging the down hardware and will bring it back online when it is ready. Jobs will start once FASRC staff have confirmed that everything is in a good state.

Resolved
April 16, 2025 at 14:40 UTC
holylabs is back up
Identified
April 16, 2025 at 14:27 UTC
holylabs must be rebooted to resolve a stability/function issue.

Resolved
April 14, 2025 at 17:18 UTC
Cannon boslogin and FASSE login nodes are back up and operational.
All holylogin nodes are still down for repair. Please see our posted incident for more updates: https://status.rc.fas.harvard.edu/cm97gyay90013dturk7fxg5pb
We apologize for the unexpected disruption.
Investigating
April 14, 2025 at 16:58 UTC
Due to a configuration error, all cluster login nodes are rebooting and are temporarily unavailable. Please save any work immediately.

Resolved
April 14, 2025 at 18:48 UTC
Hardware has been repaired and holyoke login nodes are back online. Thanks for your patience.
Monitoring
April 9, 2025 at 15:15 UTC
Holylogin chassis repair during maintenance was unsuccessful and replacement parts have been ordered.
- holyoke login nodes (holylogin05-08) are down for hardware repair
- Only Boston login nodes are available (i.e., boslogin[05-08])
If you have holylogin hard-coded in your scripts, please update to login.rc.fas.harvard.edu or boslogin.rc.fas.harvard.edu for the time being, which will redirect you to an available login node.
As always, the best method for obtaining a login node is using login.rc.fas.harvard.edu, which will pick a node for you.
If you require a login node in a specific data center, use boslogin.rc.fas.harvard.edu (Boston) or (once they are back in service) holylogin.rc.fas.harvard.edu (Holyoke).
See also: Command line access with Terminal (login nodes) – FASRC DOCS
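The hostname advice above can be applied to existing scripts mechanically. The following is a minimal sketch, not an official FASRC tool: the function name `replace_holylogin`, the regular expression, and the sample username `jharvard` are assumptions made for illustration. It rewrites hard-coded holylogin hosts (with or without a node number or the domain suffix) to the generic login.rc.fas.harvard.edu name.

```python
import re

# Assumed naming pattern for this sketch: "holylogin", an optional node
# number, and an optional ".rc.fas.harvard.edu" suffix.
HOLYLOGIN_RE = re.compile(r"\bholylogin\d*(?:\.rc\.fas\.harvard\.edu)?\b")

# Generic name from the notice; it picks an available login node for you.
GENERIC_LOGIN = "login.rc.fas.harvard.edu"


def replace_holylogin(text: str, target: str = GENERIC_LOGIN) -> str:
    """Return text with hard-coded holylogin hosts replaced by target."""
    return HOLYLOGIN_RE.sub(target, text)


if __name__ == "__main__":
    # "jharvard" is a placeholder username, not a real account.
    script = (
        "ssh jharvard@holylogin05.rc.fas.harvard.edu hostname\n"
        "rsync -a data/ holylogin:/tmp/\n"
    )
    print(replace_holylogin(script), end="")
```

Pass `target="boslogin.rc.fas.harvard.edu"` if a script needs a Boston login node specifically. Review the rewritten text before saving it over the original script.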
Resolved
April 9, 2025 at 15:01 UTC
This incident was posted by mistake.

holylogin01-04 were replaced by holylogin05-08 some time back.

As always, the best method for obtaining a login node is using login.rc.fas.harvard.edu, which will pick a node for you.
If you require a login node in a specific data center, use boslogin.rc.fas.harvard.edu (Boston) or holylogin.rc.fas.harvard.edu (Holyoke).
See also: Command line access with Terminal (login nodes) – FASRC DOCS
Investigating
April 7, 2025 at 19:33 UTC
Holylogin chassis repair during maintenance was unsuccessful and replacement parts have been ordered.
Audience:
- All cluster users
Impact:
- All holylogin** servers will be down until further notice
- Only Boston login nodes are available (i.e., boslogin[05-08])
If you have holylogin hard-coded in your scripts, please update to login.rc.fas.harvard.edu or boslogin.rc.fas.harvard.edu for the time being, which will redirect you to an available login node.
Updates to follow as we have them.

Mar 2025

Resolved
May 14, 2025 at 14:13 UTC
This issue has not reoccurred and appears to be due to unique cluster state. We are resolving this but will continue to monitor for recurrence.
Identified
March 24, 2025 at 14:01 UTC
We are investigating an issue with the seas_gpu queue where backfill jobs sometimes are not pre-empted. We have filed a bug with SchedMD and are monitoring.

Starfish maintenance - Friday March 21, 2025 5pm-7pm

Completed
March 21, 2025 at 23:00 UTC
Maintenance has completed successfully
In progress
March 21, 2025 at 21:00 UTC
Maintenance is now in progress
Not started
March 21, 2025 at 21:00 UTC
Starfish will be under maintenance Friday March 21, 2025 5pm-7pm
Starfish will be inaccessible during this period.

Resolved
March 20, 2025 at 16:16 UTC
boslfs02 has been restarted
Identified
March 20, 2025 at 15:57 UTC
boslfs02 is experiencing issues. We are restarting the filesystem.

Open OnDemand update March 12th 2025, 8:30AM-11:30AM

Completed
March 13, 2025 at 03:30 UTC
Maintenance has completed successfully
In progress
March 13, 2025 at 00:30 UTC
Maintenance is now in progress
Not started
March 13, 2025 at 00:30 UTC
OOD will be updated to include a Metrics page, reachable from the Help menu
- Audience: Affects all production Open OnDemand nodes on FASSE and Cannon
- Impact: There will be minimal impact to users. Some users may encounter an error on the dashboard. Simply refresh the page a few times and the error should go away.
Logins and jobs on OOD should not be affected during this maintenance.

FASRC monthly maintenance - Monday March 3rd, 2025 from 9am-1pm

Update
March 3, 2025 at 18:00 UTC
Maintenance has completed successfully
Completed
March 3, 2025 at 18:00 UTC
Maintenance has completed successfully
In progress
March 3, 2025 at 14:00 UTC
Maintenance is now in progress
Not started
March 3, 2025 at 14:00 UTC
PLEASE NOTE - New time window going forward - 9am-1pm
FASRC monthly maintenance will take place Monday March 3rd, 2025 from 9am-1pm
NOTICES
- Training: Upcoming training from FASRC and other sources can be found on our Training Calendar at https://www.rc.fas.harvard.edu/upcoming-training/
- Status Page: You can subscribe to our status to receive notifications of maintenance, incidents, and their resolution at https://status.rc.fas.harvard.edu/ (click Get Updates for options).
- Upcoming holidays: Memorial Day - Monday, May 26
- You can subscribe to our status page using the Get Updates button in the upper right
MAINTENANCE TASKS
Cannon cluster will be paused during this maintenance?: YES
FASSE cluster will be paused during this maintenance?: YES
- Slurm Upgrade to 24.11.2 - Crucial Update
  Audience: All cluster users
  Impact: Jobs and the scheduler will be paused during this upgrade
- Open OnDemand (OOD) reboots
  Audience: All OOD users
  Impact: All Open OnDemand (aka OOD/VDI/RCOOD) nodes will be rebooted
- Login node reboots
  Audience: Anyone logged into a FASRC Cannon or FASSE login node
  Impact: Login nodes will be rebooted during this maintenance window
- bos-Isilon firmware updates
  Audience: bos-isilon users
  Impact: No noticeable impact for storage users
- Netscratch retention/cleanup ( https://docs.rc.fas.harvard.edu/kb/policy-scratch/ )
  Audience: Cluster users
  Impact: Files older than 90 days will be removed. Please note that retention cleanup can and does run at any time, not just during the maintenance window.
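
To see which of your files might fall under the 90-day retention window mentioned above, a read-only scan along these lines can help. This is a minimal sketch under stated assumptions: it uses file modification time (mtime) as the age criterion, which may differ from what the cleanup actually checks (the linked scratch policy is authoritative), and the function name `files_older_than` is hypothetical. It only lists files and deletes nothing.

```python
import time
from pathlib import Path

RETENTION_DAYS = 90  # retention window stated in the notice


def files_older_than(root, days=RETENTION_DAYS, now=None):
    """Yield regular files under root whose mtime is more than `days` days old.

    Assumption: age is judged by modification time. The real cleanup may
    use a different timestamp.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # seconds in a day
    for path in Path(root).rglob("*"):
        try:
            # Skip symlinks and directories; compare mtime to the cutoff.
            if path.is_symlink() or not path.is_file():
                continue
            if path.stat().st_mtime < cutoff:
                yield path
        except OSError:
            # The file vanished or is unreadable; ignore it and move on.
            continue
```

For example, `for p in files_older_than("/path/to/your/scratch/dir"): print(p)` prints candidates under a directory of your choosing; the path shown is a placeholder.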
Thank you,
FAS Research Computing
https://www.rc.fas.harvard.edu/
https://docs.rc.fas.harvard.edu/

Feb 2025

Resolved
February 25, 2025 at 15:02 UTC
The majority of nodes have been physically reseated. Only a handful of these remain and will be dealt with as their jobs clear. Closing this incident.
Identified
February 20, 2025 at 16:15 UTC
Approximately 100 compute nodes in row 7c (holy7cxxxxx) came back up at the wrong clock speed. This is a known issue with some older nodes that can occur after a power dip or loss. These nodes will need to be physically reset in the chassis to clear this issue.
Jobs are running on these nodes so they will need to be drained of jobs before we can do this. As such we are marking the cluster partially degraded.
However, please note that this is a fraction of the total nodes and should not impact the average user.
This affects some specific owned partitions:
doshi-velez - 2 of 2 nodes
shakhnovich - 2 of 3 nodes
blackhole/blackhole_priority - some % nodes
hernquist - some % nodes
pehlevan - some % nodes
seas_compute - some % nodes
tambe - some % nodes
shared - some % nodes

Resolved
February 20, 2025 at 14:42 UTC
Affected nodes have been restored. Jobs on the nodes that went down will automatically requeue.
Identified
February 20, 2025 at 14:08 UTC
Due to a power glitch at MGHPCC, some nodes are down. We are working to restore them. You may have noticed some jobs re-queued overnight.

Resolved
February 14, 2025 at 16:30 UTC
An emergency patch of the scheduler has resolved the Multiple Partition issue
Investigating
February 13, 2025 at 17:57 UTC
Since mid-January we have been seeing some strange issues with the scheduler that cause periodic stalls or unresponsiveness. We had hoped that the Slurm upgrade to 24.11.1 would resolve those issues due to various architecture changes in the communications backend. Unfortunately it did not, and we have since opened an issue with SchedMD (our service vendor for the scheduler). This has since spiraled into finding several other issues with the scheduler, which we are working to remediate. Below is a status report regarding these issues:

1. High Agent Load Stall (RESOLVED): This was reported to SchedMD (https://support.schedmd.com/show_bug.cgi?id=21975). The scheduler would stall because it was oversaturated with blocking requests. This turned out to be due to a new Slurm feature called stepmgr, which we had enabled to handle jobs with many steps. Unfortunately, this feature also increased the load on the scheduler when array jobs exited at the same time, which caused the stall. Since few of our users run jobs with many steps, we opted to disable the stepmgr function. This resolved the High Agent Load issue. Users that have many steps in their job may still turn on stepmgr for their specific job by adding #SBATCH --stepmgr (https://slurm.schedmd.com/sbatch.html#OPT_stepmgr).

2. Scheduler Thrashing (MONITORING): We discovered this while working on the previous bug and continued to work on it in the same bug report (https://support.schedmd.com/show_bug.cgi?id=21975). Under high load, the scheduler would get into a thrashing state where it would effectively go heads down and ignore incoming requests in order to focus on scheduling jobs. To users this would look like the scheduler was unresponsive, as it was ignoring their requests to deal with higher priority traffic. To remediate this we increased the thread count for the scheduler and implemented a throttle to slow things down so that the scheduler could respond to all requests without impacting scheduler throughput. This is in place now and appears to have resolved the issue. We are continuing to monitor the scheduler to tune this throttle.

3. --test-only requeue crash (RESOLVED): During this investigation we also ran into another bug reported by another group related to jobs that were submitted using --test-only that would in theory preempt other jobs (see: https://support.schedmd.com/show_bug.cgi?id=21997). This caused the scheduler to crash. Given the severity of the bug we emergency patched the scheduler on Feb 12th to resolve this issue.

4. Multiple Partition Jobs Labelled with Wrong Partition (IN PROGRESS): This is a new issue identified on 2/13 related to jobs that submit to multiple partitions at once (https://support.schedmd.com/show_bug.cgi?id=22076). When the job schedules it may run in one partition but be labelled as being in another. This can lead to job preemption issues as the jobs are labelled as being in partitions that cannot be preempted even though they were originally scheduled in partitions that could be. This was identified earlier by another group and SchedMD is working on a patch. Depending on the timing FASRC will either emergency patch the scheduler for this issue or wait for the formal release of 24.11.2. Note that this issue really only impacts preemption and the scheduler is working fine otherwise. If you see jobs that you think should be preempted but are not and are blocking your work please let us know and we will investigate.
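
For item 1 above, the per-job opt-in amounts to one extra directive in the batch script. The sketch below assembles such a script as text so the placement of `#SBATCH --stepmgr` is visible; the helper name `batch_script`, the job name, and the sample step commands are illustrative assumptions, not FASRC-provided tooling.

```python
def batch_script(commands, job_name="many-steps", use_stepmgr=True):
    """Build the text of a Slurm batch script that runs each command as a step.

    All names and defaults here are placeholders for illustration.
    """
    lines = ["#!/bin/bash", f"#SBATCH --job-name={job_name}"]
    if use_stepmgr:
        # Per-job opt-in described in the notice (sbatch --stepmgr option).
        lines.append("#SBATCH --stepmgr")
    # Each srun invocation inside the batch script is a separate job step.
    lines += [f"srun {cmd}" for cmd in commands]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical step commands; replace with your own workload.
    print(batch_script(["./step_a.sh", "./step_b.sh"]), end="")
```

The generated text can be saved to a file and submitted with sbatch. Jobs with only a few steps are unlikely to need the directive.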

Thank you for your patience as we work through these issues.

Resolved
February 14, 2025 at 16:30 UTC
After another patch of the scheduler, Slurm should be much more stable now
Update
February 7, 2025 at 18:19 UTC
We are still investigating this incident and are working with SchedMD to find a solution.
Investigating
February 7, 2025 at 15:17 UTC
We are currently investigating this incident.
Identified
February 7, 2025 at 15:16 UTC
We are continuing to work on a fix for this incident.
Investigating
February 7, 2025 at 15:06 UTC
We are currently investigating this incident.

Resolved
February 3, 2025 at 21:31 UTC
The scheduler is back to normal operation.
Investigating
February 3, 2025 at 21:01 UTC
The SLURM scheduler is stalling and we have reached out to SchedMD regarding this. The Cannon cluster will be degraded and jobs may pend for longer than usual.
