Notice history

Operational

Feb 2025

Resolved
February 25, 2025 at 3:02 PM
The majority of nodes have been physically reseated. Only a handful of these remain and will be dealt with as their jobs clear. Closing this incident.
Identified
February 20, 2025 at 4:15 PM
Approximately 100 compute nodes in row 7c (holy7cxxxxx) came back up at the wrong clock speed. This is a known issue with some older nodes that can occur after a power dip or loss. These nodes will need to be physically reset in the chassis to clear this issue.
Jobs are running on these nodes, so they will need to be drained of jobs before we can do this. As such, we are marking the cluster partially degraded.
However, please note that this is a fraction of the total nodes and should not impact the average user.
This affects some specific owned partitions:
doshi-velez - 2 of 2 nodes
shakhnovich - 2 of 3 nodes
blackhole/blackhole_priority - some % nodes
hernquist - some % nodes
pehlevan - some % nodes
seas_compute - some % nodes
tambe - some % nodes
shared - some % nodes

Resolved
February 20, 2025 at 2:42 PM
Affected nodes have been restored. Jobs on the nodes that went down will automatically requeue.
Identified
February 20, 2025 at 2:08 PM
Due to a power glitch at MGHPCC some nodes are down. We are working to restore. You may have noticed some jobs re-queued overnight.

Resolved
February 14, 2025 at 4:30 PM
An emergency patch of the scheduler has resolved the Multiple Partition issue.
Investigating
February 13, 2025 at 5:57 PM
Since mid-January the scheduler has been periodically stalling or becoming unresponsive. We had hoped that the Slurm upgrade to 24.11.1 would resolve this because of various architecture changes in the communications backend. Unfortunately it did not, and we have since opened an issue with SchedMD (our service vendor for the scheduler). That investigation has uncovered several other scheduler issues, which we are working to remediate. Below is a status report on these issues:

1. High Agent Load Stall (RESOLVED): The scheduler would stall because it was oversaturated with blocking requests (see: https://support.schedmd.com/show_bug.cgi?id=21975). This turned out to be due to a new Slurm feature called stepmgr, which we had enabled to handle jobs with many steps. Unfortunately this feature also increased the load on the scheduler when array jobs exited at the same time, which caused the stall. Since few of our users run jobs with many steps, we opted to disable the stepmgr function. This resolved the High Agent Load issue. Users that have many steps in their job may still turn on stepmgr for their specific job by adding #SBATCH --stepmgr (https://slurm.schedmd.com/sbatch.html#OPT_stepmgr).

2. Scheduler Thrashing (MONITORING): We discovered this while working on the previous bug and continued to work on it in the same bug report (https://support.schedmd.com/show_bug.cgi?id=21975). Under high load, the scheduler would get into a thrashing state where it would effectively go heads down and ignore incoming requests in order to focus on scheduling jobs. To users the scheduler appeared unresponsive, because it was ignoring their requests in order to deal with higher priority traffic. To remediate this we increased the thread count for the scheduler and implemented a throttle to slow things down so that the scheduler could respond to all requests without impacting scheduler throughput. This is in place now and appears to have resolved the issue. We are continuing to monitor the scheduler to tune this throttle.

3. --test-only requeue crash (RESOLVED): During this investigation we also ran into another bug reported by another group related to jobs that were submitted using --test-only that would in theory preempt other jobs (see: https://support.schedmd.com/show_bug.cgi?id=21997). This caused the scheduler to crash. Given the severity of the bug we emergency patched the scheduler on Feb 12th to resolve this issue.

4. Multiple Partition Jobs Labelled with Wrong Partition (IN PROGRESS): This is a new issue identified on 2/13 related to jobs that submit to multiple partitions at once (https://support.schedmd.com/show_bug.cgi?id=22076). When the job schedules it may run in one partition but be labelled as being in another. This can lead to job preemption issues as the jobs are labelled as being in partitions that cannot be preempted even though they were originally scheduled in partitions that could be. This was identified earlier by another group and SchedMD is working on a patch. Depending on the timing FASRC will either emergency patch the scheduler for this issue or wait for the formal release of 24.11.2. Note that this issue really only impacts preemption and the scheduler is working fine otherwise. If you see jobs that you think should be preempted but are not and are blocking your work please let us know and we will investigate.
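
For item 1 above, a job with many steps can opt back in to stepmgr on a per-job basis. The following is a minimal sketch, not a FASRC-provided template: the job name, resource requests, step count, and the run_step helper are hypothetical, and only the --stepmgr directive comes from the notice. The helper falls back to running the command directly when srun is unavailable, so the script can be dry-run outside a Slurm allocation.

```shell
#!/bin/bash
#SBATCH --job-name=many-steps   # hypothetical job name
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --stepmgr               # per-job opt-in to the step manager

# Launch one job step. Inside a Slurm allocation this goes through srun;
# elsewhere (e.g. a local dry run) the command is executed directly.
run_step() {
    if command -v srun >/dev/null 2>&1 && [ -n "${SLURM_JOB_ID:-}" ]; then
        srun --ntasks=1 "$@"
    else
        "$@"
    fi
}

# Hypothetical workload: 20 short steps run back to back.
steps_done=0
i=1
while [ "$i" -le 20 ]; do
    run_step echo "step $i" && steps_done=$((steps_done + 1))
    i=$((i + 1))
done
echo "completed $steps_done steps"
```

Jobs with only a handful of steps should not need the directive at all.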

Thank you for your patience as we work through these issues.

Resolved
February 14, 2025 at 4:30 PM
After another patch of the scheduler, Slurm should be much more stable now.
Update
February 07, 2025 at 6:19 PM
We are still investigating this incident and are working with SchedMD to find a solution.
Investigating
February 07, 2025 at 3:17 PM
We are currently investigating this incident.
Identified
February 07, 2025 at 3:16 PM
We are continuing to work on a fix for this incident.
Investigating
February 07, 2025 at 3:06 PM
We are currently investigating this incident.

Resolved
February 03, 2025 at 9:31 PM
The scheduler is back to normal operation.
Investigating
February 03, 2025 at 9:01 PM
The SLURM scheduler is stalling and we have reached out to SchedMD regarding this. The Cannon cluster will be degraded and jobs may pend for longer than usual.

Jan 2025

Resolved
February 03, 2025 at 3:19 PM
This incident has been resolved.
Investigating
January 31, 2025 at 4:03 PM
We continue to experience a number of outages and connection issues in various systems. We are currently investigating this incident.

Resolved
January 30, 2025 at 10:50 PM
OpenOnDemand/OOD is now operational.
Monitoring
January 30, 2025 at 10:36 PM
OpenOnDemand/OOD (rcood/fasseood) is still experiencing issues. No ETA.
The other components listed below should be functional again.

Services
- Citrix (rcapps, TMS)
- Spinal
- FIINE
- IFX (portal, approvals, billing)
- Software licenses
VMs
- MCZ
- SSBC Central
- Giza
- Asgard
- Peromyscus
- Seismology
- gbimaculatusgenome
- harvard-neutrinos
- Spasa
Investigating
January 30, 2025 at 10:06 PM
Several VMs and some services are currently degraded or down. This includes:
- Cannon OOD
- Citrix
- Spinal
- FIINE
- IFX
- Some software licenses
Group specific:
- MCZ
- SSBC Central
- Giza
- Asgard
- Peromyscus
- Seismology
- gbimaculatusgenome
- harvard-neutrinos
- Spasa
Continued network maintenance may result in VPN connections being dropped. We are working to rectify this and will continue to update.

Resolved
January 30, 2025 at 7:52 PM
The network issues have been resolved.
Identified
January 30, 2025 at 7:15 PM
Most services have been restored. VPN connectivity issues or lag may still exist for some users.
Networking expects to have this fully resolved very soon.
Investigating
January 30, 2025 at 5:51 PM
We are currently investigating this issue.
We've identified that some users are unable to connect to VPN.
~~OOD/OpenOnDemand access is affected.~~
~~Other symptoms are that the FASRC websites, portal.rc.fas.harvard.edu, and other internet-facing sites (coldfront, spinal, minilims, etc.) are not accessible.~~

SSH to/from nodes or login may be affected or laggy.
Networking is investigating.

Resolved
January 27, 2025 at 2:48 PM
Portal is operating normally.
Monitoring
January 23, 2025 at 6:32 PM
Portal is online, but requires brief maintenance before approvers can use it.
Investigating
January 23, 2025 at 4:40 PM
portal.rc.fas.harvard.edu is unavailable. We are currently investigating this issue.

Resolved
January 27, 2025 at 3:07 PM
Performance has remained stable over the weekend. Closing this issue.
Investigating
January 23, 2025 at 2:33 PM
Bosecs is currently in a degraded state while Dell technicians are doing maintenance.

Dec 2024

Resolved
December 13, 2024 at 4:02 PM
Jobs have cleared overnight and a fix for the high load appears to be working. We will monitor for any recurrence, but all appears well at this time.
Investigating
December 12, 2024 at 5:44 PM
Low priority jobs are not getting scheduled despite being at the top of the queue. We are currently investigating this incident and have reached out to SchedMD regarding this.
See https://support.schedmd.com/show_bug.cgi?id=21627

NESE tape maintenance Dec. 11, 2024

Completed
December 11, 2024 at 10:00 PM
Maintenance has completed successfully
In progress
December 11, 2024 at 1:00 PM
Maintenance is now in progress
Planned
December 11, 2024 at 1:00 PM
From Harvard University RC:
NESE is scheduling a hardware repair for 12/11 to one of the disk cache management servers for NESE Tape.
The scheduled repair will not interrupt the Tape service, but you will likely experience degraded performance when interacting with the archive on 12/11 during the maintenance.

This maintenance will not affect the NESE Disk service.

Resolved
December 10, 2024 at 5:16 PM
The scheduler load has returned to normal. A root cause was identified and remediated.
Identified
December 10, 2024 at 3:17 PM
The scheduler is experiencing a high load at the moment, which may manifest as slowness, timeouts, or general instability for job scheduling. We are actively working to mitigate this.

FASRC monthly maintenance - Monday December 2nd, 2024 7am-11am

Completed
December 02, 2024 at 4:00 PM
Maintenance has completed successfully
Update
December 02, 2024 at 12:52 PM
In progress
December 02, 2024 at 12:52 PM
Due to an urgent network issue which requires a restart of some network hardware, all jobs will need to be paused.
Interactive jobs and the ability to write to some storage may be interrupted.
In progress
December 02, 2024 at 12:00 PM
Maintenance is now in progress
Planned
December 02, 2024 at 12:00 PM
FASRC monthly maintenance will occur Monday December 2nd, 2024 from 7am-11am
IMPORTANT NOTICES
- holyscratch01 will be set to read-only during this maintenance and will be decommissioned February 1, 2025. Please move any needed scratch data to netscratch and begin using it instead if you have not done so already. The global $SCRATCH variable will be changed to /n/netscratch
- FASRC will be switching to the Harvard ServiceNow ticket system on Dec. 2nd. Our email addresses remain the same and no action is required on your part.
  Please do not re-open old/closed tickets after Dec. 2nd and instead create a new ticket.
- Cannon cluster: serial_requeue and gpu_requeue will be set to allow MPI/multinode jobs. Such jobs need to be able to handle preemption/being requeued.
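
Since jobs in serial_requeue and gpu_requeue can be preempted and requeued, it helps if a job can pick up where it left off. A minimal sketch, assuming a workload that can record its progress in a file: the checkpoint path, the number of work units, and the loop body are hypothetical placeholders, while --requeue, --open-mode=append, and the SLURM_RESTART_COUNT environment variable are standard Slurm features. MPI/multinode jobs would need their own application-level checkpointing in place of the counter used here.

```shell
#!/bin/bash
#SBATCH --partition=serial_requeue
#SBATCH --requeue              # allow Slurm to put the job back in the queue
#SBATCH --open-mode=append     # keep log output from earlier attempts

CKPT="${CKPT:-progress.ckpt}"  # hypothetical checkpoint file
TOTAL="${TOTAL:-5}"            # hypothetical number of work units

# Resume from the last completed unit if a checkpoint exists.
done_units=0
if [ -f "$CKPT" ]; then
    done_units=$(cat "$CKPT")
fi
echo "attempt ${SLURM_RESTART_COUNT:-0}: resuming after unit $done_units"

while [ "$done_units" -lt "$TOTAL" ]; do
    # ... one unit of real work would go here ...
    done_units=$((done_units + 1))
    # Write to a temp file and rename, so a preemption mid-write
    # cannot leave a truncated checkpoint behind.
    echo "$done_units" > "$CKPT.tmp" && mv "$CKPT.tmp" "$CKPT"
done
echo "all $TOTAL units complete"
```

If the job is preempted partway through, the requeued attempt reads the checkpoint and skips the units that already finished.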
Training: Upcoming training from FASRC and other sources can be found on our Training Calendar at https://www.rc.fas.harvard.edu/upcoming-training/
Status Page: You can subscribe to our status to receive notifications of maintenance, incidents, and their resolution at https://status.rc.fas.harvard.edu/ (click Get Updates for options).
Upcoming holidays: Thanksgiving Nov. 28th and 29th. Winter break Dec. 23rd through January 1st
MAINTENANCE TASKS
Cannon cluster will be paused during this maintenance?: NO
FASSE cluster will be paused during this maintenance?: NO
- Set /n/holyscratch01 scratch filesystem to read-only
  Audience: All cluster users
  Impact: Please adopt the new scratch filesystem /n/netscratch prior to Dec. 2nd. The $SCRATCH variable will move to /n/netscratch during this maintenance.
  Data on holyscratch01 will still be readable, but not writable, and will be fully decommissioned on Feb. 1, 2025.
- Switch ticketing system to ServiceNow. Our email addresses remain the same.
  Audience: All FASRC users
  Impact: All new tickets will go to Harvard's ServiceNow; our email remains the same. Existing tickets will get moved any time someone replies.
  NOTE: From Dec. 2nd on, please do not re-open any old tickets. Create a new one instead by emailing rchelp@rc.fas.harvard.edu
- Login node reboots
  Audience: Anyone logged into a FASRC Cannon or FASSE login node
  Impact: Login nodes will be rebooted during this maintenance window.
- Scratch cleanup ( https://docs.rc.fas.harvard.edu/kb/policy-scratch/ )
  Audience: Cluster users
  Impact: Files older than 90 days will be removed. Please note that retention cleanup can and does run at any time, not just during the maintenance window.
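
For the scratch change above, scripts that build paths from $SCRATCH should follow the move to /n/netscratch automatically, while hard-coded /n/holyscratch01 paths will start failing on write. A minimal sketch for catching the latter, under stated assumptions: the uses_old_scratch helper and the example_lab/run1 directory layout are hypothetical, and the fallback to /n/netscratch only applies when $SCRATCH is unset.

```shell
#!/bin/bash
# Return success (0) if the given path still points at the old scratch filesystem.
uses_old_scratch() {
    case "$1" in
        /n/holyscratch01|/n/holyscratch01/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Build working paths from $SCRATCH instead of hard-coding a filesystem.
SCRATCH_ROOT="${SCRATCH:-/n/netscratch}"
WORKDIR="$SCRATCH_ROOT/example_lab/run1"   # hypothetical layout

if uses_old_scratch "$WORKDIR"; then
    echo "warning: $WORKDIR is on holyscratch01, which is read-only" >&2
else
    echo "using $WORKDIR"
fi
```

The same helper can be pointed at any output path a job script defines before the job is submitted.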
Thank you,
FAS Research Computing
https://docs.rc.fas.harvard.edu/
https://www.rc.fas.harvard.edu/upcoming-training/
