FAS Research Computing - Notice history

Status page for the Harvard FAS Research Computing cluster and other resources.

Cluster Utilization (VPN and FASRC login required): Cannon | FASSE


Please scroll down to see details on any Incidents or maintenance notices.
Monthly maintenance occurs on the first Monday of the month (except holidays).

GETTING HELP
Documentation: https://docs.rc.fas.harvard.edu | Account Portal: https://portal.rc.fas.harvard.edu
Email: rchelp@rc.fas.harvard.edu | Support Hours


SLURM Scheduler - Cannon - Operational

Cannon Compute Cluster (Holyoke) - Operational

Boston Compute Nodes - Operational

GPU nodes (Holyoke) - Operational

seas_compute - Operational

SLURM Scheduler - FASSE - Operational

FASSE Compute Cluster (Holyoke) - Operational

Kempner Cluster CPU - Operational

Kempner Cluster GPU - Operational

FASSE login nodes - Operational

Cannon Open OnDemand/VDI - Operational

FASSE Open OnDemand/VDI - Operational

Netscratch (Global Scratch) - Operational

Home Directory Storage - Boston - Operational

Tape - (Tier 3) - Operational

Holylabs - Operational

Isilon Storage Holyoke (Tier 1) - Operational

Holystore01 (Tier 0) - Operational

HolyLFS04 (Tier 0) - Operational

HolyLFS05 (Tier 0) - Operational

HolyLFS06 (Tier 0) - Operational

Holyoke Tier 2 NFS (new) - Operational

Holyoke Specialty Storage - Operational

holECS - Operational

Isilon Storage Boston (Tier 1) - Operational

BosLFS02 (Tier 0) - Operational

Boston Tier 2 NFS (new) - Operational

CEPH Storage Boston (Tier 2) - Operational

Boston Specialty Storage - Operational

bosECS - Operational

Samba Cluster - Operational

Globus Data Transfer - Operational

Notice history

Apr 2026

FASRC monthly maintenance April 6th 2026 9am-1pm
Scheduled for April 06, 2026, 9am-1pm
  • Planned
    April 06, 2026 at 1:00 PM

    FASRC monthly maintenance will take place on April 6th 2026. Our maintenance tasks should be completed between 9am and 1pm.

    NOTICES:

    Cannon cluster will be paused during this maintenance: NO
    FASSE cluster will be paused during this maintenance: NO

    MAINTENANCE TASKS

    • two-factor.rc.fas.harvard.edu OpenAuth cut-over to new server

      • Audience: New accounts or anyone requesting an OpenAuth token

      • Impact: two-factor will be unavailable while moving to a new server

    • RStudio Server (Open OnDemand)

      • Audience: RStudio Server users on Cannon and FASSE

      • Impact: We will be decommissioning some versions of RStudio Server so we can properly maintain all production versions. Versions to be decommissioned:

        • R 4.1.3 (Bioconductor 3.14, RStudio 2022.02.0)

        • R 4.1.0 (Bioconductor 3.13, RStudio 1.4.1717)

        • R 4.0.3 (Bioconductor 3.12, RStudio 1.3.1093)

        • R 4.0.0 (Bioconductor 3.11, RStudio 1.3.1093)

      • If you use one of these versions, we recommend replacing it with the most recent version, R 4.4.2 (Bioconductor 3.20, RStudio 2024.12.0). You will need to reinstall any previously installed libraries for the new version.

    • Domain controller replacement

      • Audience: Internal

      • Impact: End users should not see any impact

    • OOD/Open OnDemand reboots

      • Audience: All OOD users; the OOD head nodes will be rebooted

      • Impact: Running sessions will not be affected

    • Login node reboots

      • Audience: All login node users

      • Impact: Login nodes will reboot during the maintenance window

    • Netscratch 90-day retention cleanup

      • Audience: All netscratch users

      • Impact: Files older than 90 days will be removed per our scratch policy. Please note that this cleanup can happen at any time, not just during maintenance. See the sketch below for one way to identify affected files.
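
    If you want to preview which of your files would fall under the 90-day policy before a cleanup runs, the short Python sketch below is one way to list them. It is illustrative only: the path /n/netscratch/your_lab is a placeholder, and judging age by modification time is an assumption; consult the FASRC scratch policy for the exact criteria.

    #!/usr/bin/env python3
    # Minimal sketch: list files older than 90 days under a netscratch directory.
    # The path is a placeholder and the use of modification time is an assumption.
    import os
    import time

    SCRATCH_DIR = "/n/netscratch/your_lab"   # hypothetical example path
    MAX_AGE_DAYS = 90
    cutoff = time.time() - MAX_AGE_DAYS * 24 * 3600

    for root, dirs, files in os.walk(SCRATCH_DIR):
        for name in files:
            path = os.path.join(root, name)
            try:
                mtime = os.lstat(path).st_mtime   # lstat: do not follow symlinks
            except OSError:
                continue                          # file vanished mid-scan; skip it
            if mtime < cutoff:
                print(path)

    Anything the script prints is a candidate for removal under the retention policy, so copy what you still need to longer-term storage.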

    Thank you,
    FAS Research Computing
    https://docs.rc.fas.harvard.edu/
    https://www.rc.fas.harvard.edu/

Mar 2026

Scheduler is degraded
  • Resolved

    This incident has been resolved. The scheduler is running normally.

  • Investigating

    The scheduler is in a degraded state due to thrashing. We are actively working to resolve this problem.

Network issues - Cluster degraded
  • Resolved

    This incident has been resolved by draining and rebooting any nodes with stuck mounts.

  • Monitoring

    Mounts to Holyoke Isilon (specifically /n/sw) are broken on numerous nodes across the cluster. We have a check rolling out to find these nodes so we can remediate them individually. Until remediated, the cluster will be in a degraded state. Running jobs may randomly die or fail as they hit nodes that have stale mounts.

    Running jobs will be risky for the next hour. After that, a large number of nodes will be closed while we wait for them to drain so that we can reboot them and fix the mounts.

    At this time we are unaware of any holy-isilon problems other than the effect this had on cluster nodes/running jobs. We will update should we identify any data storage concerns.
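
    For context, a stale NFS mount typically makes filesystem calls hang rather than fail quickly, which is why affected nodes have to be found by actively probing them. The Python sketch below is a hypothetical illustration of such a probe, not the actual FASRC health check; the default path and timeout are assumptions.

    #!/usr/bin/env python3
    # Hypothetical sketch of a stale-mount probe: statvfs() either hangs or
    # errors on a broken NFS mount, so run it in a child process with a timeout.
    import multiprocessing
    import os
    import sys

    def _probe(path):
        os.statvfs(path)            # hangs (or raises) if the mount is stale

    def mount_is_healthy(path, timeout=10):
        proc = multiprocessing.Process(target=_probe, args=(path,))
        proc.start()
        proc.join(timeout)
        if proc.is_alive():         # statvfs never returned: treat as stale
            proc.terminate()
            proc.join()
            return False
        return proc.exitcode == 0   # non-zero exit means statvfs raised an error

    if __name__ == "__main__":
        target = sys.argv[1] if len(sys.argv) > 1 else "/n/sw"   # assumed default
        print(f"{target}: {'OK' if mount_is_healthy(target) else 'STALE'}")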

  • Identified

    Mounts to Holyoke Isilon (specifically /n/sw) are broken on numerous nodes across the cluster. We have a check rolling out to find these nodes so we can remediate them individually. Until remediated, the cluster will be in a degraded state. Running jobs may randomly die or fail as they hit nodes that have stale mounts.

    Running jobs will be risky for the next hour. After that, a large number of nodes will be closed while we wait for them to drain so that we can reboot them and fix the mounts.

  • Investigating

    A network issue affecting storage critical to the cluster is causing instability. The cluster is currently in a degraded state as a result. We are looking into the problem. Updates to follow.

Feb 2026

Tape outage
  • Resolved

    This incident has been resolved. Normal tape operations are restored.

  • Monitoring

    The tape library outage has been extended further, to Wednesday March 4th at 9am, while we await a hardware replacement part due today. Data can still be uploaded to lab collections via Globus, but be mindful of the 10 TB buffer file limit. The outage affects both storing to and recalling from tape.

  • Identified

    NESE Tape Service is still working with IBM technical support to restore the inventory. The expected downtime has been extended until Tuesday March 3rd, 9am.
    Apologies for the inconvenience.

  • Investigating

    NESE Tape service will be down or operating with degraded service (no store or recall) Friday from 12 Noon EST until as late as Monday, 2 March at 9 AM.

    SUMMARY OF ISSUE:

    NESE Tape service is currently not able to store or recall files to and from tape due to vendor firmware issues in the IBM TS4500 tape library. The issue is related to the library robotics and cartridge database and we do NOT expect any data loss from this issue.

    The problem is apparently due to an issue with the inventory database introduced by a recent firmware update. The library can scrub and reconstruct this database by scanning the barcode labels on all of the cartridges to rebuild the inventory. The association of files in Globus with tapes is handled separately from the tape library and is not affected by the firmware update.

NESE tape maintenance Feb 19th 2026
  • Completed
    February 19, 2026 at 10:00 PM
    Maintenance has completed successfully
  • In progress
    February 19, 2026 at 1:00 PM
    Maintenance is now in progress
  • Planned
    February 19, 2026 at 1:00 PM

    From our partners at NESE. Details follow:

    We are installing four new tape frames, which will bring the tape system raw storage capacity to 253 petabytes.

    Service Affected: NESE Tape Service

    Maintenance Window: 8:00 AM - 5:00 PM (EST)

    • The tape service will be unavailable.

    • All upgrade activities are expected to be completed on the same day.

    NOTES:

    • Monitor the MGHPCC Slack #nese channel for status updates and announcements

    • Monitor https://nese.instatus.com/ for real-time updates on progress

    • Subscribe to https://nese.instatus.com/subscribe/email for updates and announcements
