Incident History

Operational

Apr 2026

OpenOnDemand maintenance

Completed
April 30, 2026 at 14:00 UTC
Maintenance has completed successfully.
In progress
April 30, 2026 at 12:00 UTC
Maintenance is now in progress.
Not started yet
April 30, 2026 at 12:00 UTC
At 8am on Thursday April 30th we will be upgrading from Open OnDemand version 4.0.7 to 4.1.4 on both the Cannon and FASSE clusters.
This is not expected to impact running jobs.
This upgrade adds the Jobs->Project Manager menu item and fixes an issue that affected access to the Clusters->Shell Access menu item when using Firefox.

Resolved
April 30, 2026 at 15:43 UTC
login.rc.fas.harvard.edu is responding normally. This incident was automatically resolved.
Detected
April 30, 2026 at 02:31 UTC
login.rc.fas.harvard.edu is not responding normally. This incident was automatically created.

Resolved
April 30, 2026 at 18:12 UTC
The cluster has been rebooted and all nodes, including login and OOD, have been patched.
The scheduler is re-opened and jobs which were preempted/requeued have priority for re-scheduling.
Some non-standard, lab-owned nodes may still require patching. The owners of these machines may be contacted about this.
Thank you for your patience. This is a global issue and is being addressed at centers everywhere.
Update
April 30, 2026 at 15:40 UTC
To mitigate this exploit we will need to restart -all nodes- on the cluster.
This will begin at 1PM and run until all nodes have restarted (no ETA).
This means any unfinished jobs will be terminated. There is no way to avoid this.

We will then validate the fix before re-opening the login and OOD nodes and the scheduler.
Next steps and updates will be posted here.
Update
April 30, 2026 at 14:44 UTC
We are developing a plan of attack to mitigate this exploit. Please know that this is a very serious issue and so we are treating it as such. Thank you for your understanding.

We are currently awaiting further information from the Red Hat/Fedora/Rocky community, but are building a plan in the meantime with the information we have. More details to follow as we can share them.

If you need to access storage (except scratch and home directories), Globus is still online and available. But again, login nodes and OOD are not available.
Identified
April 30, 2026 at 02:11 UTC
Due to a serious in-the-wild exploit which can compromise Fedora-based Linux distributions including Rocky, which is used on the cluster, we need to restrict access. All login and OOD nodes are shut down until a fix can be put in place. Jobs running on the cluster will continue running.
No ETA. There is no fix at this time. We will update our status page in the morning once we have more information or a fix to roll out.
This is a serious exploit and we do not take this measure lightly. Please follow this status page for updates and eventual resolution.

Resolved
April 29, 2026 at 16:32 UTC
Holylfs06 is accessible again.
This incident has been resolved.
Identified
April 29, 2026 at 16:00 UTC
Holylfs06 storage is down. We are investigating. More details as they are known.

Website security maintenance (www.rc and docs.rc) 4-28-26 1pm

Completed
April 28, 2026 at 17:16 UTC
Website maintenance has completed successfully.
In progress
April 28, 2026 at 17:00 UTC
Maintenance is now in progress.
Not started yet
April 28, 2026 at 17:00 UTC
Security updates are required for www.rc.fas.harvard.edu and docs.rc.fas.harvard.edu.
This work will take place today between 1pm and 2pm.
Both sites will be down for very short periods during the updates.

Mar 2026

Resolved
April 01, 2026 at 12:11 UTC
This incident has been resolved. The scheduler is running normally.
Detected
March 31, 2026 at 21:15 UTC
The scheduler is in a degraded state due to thrashing.
We are actively working to resolve this problem.

Resolved
March 31, 2026 at 16:44 UTC
This incident has been resolved. two-factor.rc.fas.harvard.edu is working normally again.
Detected
March 31, 2026 at 15:32 UTC
We are currently investigating this incident. Requesting a new token or re-requesting your token from two-factor is not currently working.

Resolved
March 31, 2026 at 15:00 UTC
This incident has been resolved.
Detected
March 25, 2026 at 14:30 UTC
We are currently investigating this incident.

Resolved
March 30, 2026 at 20:41 UTC
This incident has been resolved by draining and rebooting any nodes with stuck mounts.
Monitoring
March 25, 2026 at 14:31 UTC
Mounts to Holyoke Isilon (specifically /n/sw) are broken on numerous nodes across the cluster. We have a check rolling out to find these nodes so we can remediate them individually. Until remediated the cluster will be in a degraded state. Running jobs may randomly die or fail as they hit nodes that have stale mounts.
It will be risky to run jobs for the next hour and then, after that point, the cluster will have a large number of nodes closed waiting for them to drain so we can reboot them and fix the mounts.
At this time we are unaware of any holy-isilon problems other than the effect this had on cluster nodes/running jobs. We will update should we identify any data storage concerns.
Identified
March 25, 2026 at 14:10 UTC
Mounts to Holyoke Isilon (specifically /n/sw) are broken on numerous nodes across the cluster. We have a check rolling out to find these nodes so we can remediate them individually. Until remediated the cluster will be in a degraded state. Running jobs may randomly die or fail as they hit nodes that have stale mounts.
It will be risky to run jobs for the next hour and then, after that point, the cluster will have a large number of nodes closed waiting for them to drain so we can reboot them and fix the mounts.
Detected
March 25, 2026 at 13:34 UTC
A network issue affecting storage critical to the cluster is causing instability. The cluster is currently in a degraded state as a result. We are looking into the problem. Updates to follow.

Resolved
March 19, 2026 at 14:52 UTC
ColdFront is back up. Thank you for your patience.
Identified
March 19, 2026 at 12:58 UTC
ColdFront is down. We are working to bring it back up. The instance got replaced last night, but it had trouble configuring itself on the way up again.

Feb 2026

Resolved
March 04, 2026 at 15:09 UTC
This incident has been resolved. Normal tape operations are restored.
Monitoring
March 03, 2026 at 14:04 UTC
The tape library outage is further extended to Wednesday March 4th at 9am awaiting a hardware replacement part due today. Data can still be uploaded to lab collections via Globus, but be mindful of the 10 TB buffer file limit. The outage affects storage and recall from tape.
Identified
March 02, 2026 at 14:03 UTC
NESE Tape Service is still working with IBM technical support on restoring the inventory. The expected downtime is extended until Tuesday March 3rd, 9am.
Apologies for the inconvenience.
Detected
February 27, 2026 at 21:27 UTC
NESE Tape service will be down or operating with degraded service (no store and recall) Friday from 12 Noon EST until as late as Monday, 2 March at 9 AM.

SUMMARY OF ISSUE:

NESE Tape service is currently not able to store or recall files to and from tape due to vendor firmware issues in the IBM TS4500 tape library. The issue is related to the library robotics and cartridge database and we do NOT expect any data loss from this issue.

The problem apparently lies in the inventory database and is related to a recent firmware update. This database can be scrubbed and reconstructed by the library, which will scan the bar code labels on all the cartridges to rebuild the inventory. Association of files in Globus to tapes is handled separately from the tape library and is not affected by the firmware update.

Resolved
February 27, 2026 at 22:04 UTC
This incident has been resolved. The Starfish dashboard is available.
Detected
February 26, 2026 at 14:13 UTC
The Starfish dashboard is unavailable. We are currently investigating this issue with Starfish.

Starfish maintenance Feb 25, 2026 all day

Completed
February 26, 2026 at 14:00 UTC
Maintenance has completed successfully.
In progress
February 25, 2026 at 14:00 UTC
Maintenance is now in progress.
Not started yet
February 25, 2026 at 14:00 UTC
Starfish will be unavailable starting Wednesday, February 25th at 9AM until Thursday, February 26th at 9AM, for routine maintenance. The online dashboard will be inaccessible during this time.

Resolved
February 24, 2026 at 15:44 UTC
Openauth/radius is now operational. This update was created by an automated monitoring service.
Detected
February 24, 2026 at 15:39 UTC
Authentication issues with openauth/radius. This incident was created by an automated monitoring service.

NESE tape maintenance Feb 19th 2026

Completed
February 19, 2026 at 22:00 UTC
Maintenance has completed successfully.
In progress
February 19, 2026 at 13:00 UTC
Maintenance is now in progress.
Not started yet
February 19, 2026 at 13:00 UTC
From our partners at NESE. Details follow:
We are installing four new tape frames, which will bring the tape system raw storage capacity to 253 petabytes.
Service Affected: NESE Tape Service
Maintenance Window: 8:00 AM - 5:00 PM (EST)
- The tape service will be unavailable.
- All upgrade activities are expected to be completed on the same day.
NOTES:
- Monitor the MGHPCC Slack #nese channel for status updates and announcements
- Monitor https://nese.instatus.com/ for real-time updates on progress
- Subscribe to https://nese.instatus.com/subscribe/email for updates and announcements

Feb 2026 to Apr 2026

FAS Research Computing - Incident History