Incident history

Operational

July 2026

FASRC monthly maintenance Monday July 6th, 2026 9am-1pm

Completed
July 06, 2026 at 17:00 UTC
The maintenance completed successfully.
In progress
July 06, 2026 at 13:00 UTC
The maintenance is in progress.
Not yet started
June 26, 2026 at 13:45 UTC
FASRC monthly maintenance will take place on July 6th, 2026. Our maintenance tasks should be completed between 9am and 1pm.
Cannon cluster will be paused during this maintenance?: NO
FASSE cluster will be paused during this maintenance?: NO
NOTICES:
- Friday July 3rd is a university holiday (Independence Day observed)
- Training: Upcoming training from FASRC and other sources can be found on our Training Calendar at https://www.rc.fas.harvard.edu/upcoming-training/
- Status Page: You can subscribe to our status to receive notifications of maintenance, incidents, and their resolution at https://status.rc.fas.harvard.edu/ (click Get Updates for options).
- We'd love to hear success stories about your or your lab's use of FASRC. Submit your story here.
MAINTENANCE TASKS
- Domain controller replacement
  Audience: Internal
  Impact: None. End users should not see any impact.
- Reboot drained nodes in error state
  Audience: Cluster nodes with errors.
  Impact: These nodes will have been drained already in preparation. There will be no impact on jobs on the day, and the affected nodes will return to service in their respective partitions after the maintenance period.
- OOD/Open OnDemand reboots
  Audience: All OOD users, reboot of the head nodes.
  Impact: Running sessions will not be affected.
- Login node reboots
  Audience: All login node users.
  Impact: Login nodes will reboot during the maintenance window.
- Netscratch 90-day retention cleanup
  Audience: All netscratch users.
  Impact: Files older than 90 days will be removed per our scratch policy. Please note that this cleanup can happen at any time, not just during maintenance.
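To check ahead of a cleanup which of your files might fall under the 90-day retention rule, a minimal sketch along these lines may help. It assumes that file age is measured by modification time, which the notice above does not specify, and the function name `files_older_than` and any directory you pass to it are illustrative only. It only lists files and never deletes anything.

```python
import os
import time
from pathlib import Path

RETENTION_DAYS = 90  # retention period stated in the notice above


def files_older_than(root, days=RETENTION_DAYS, now=None):
    """Yield files under `root` whose modification time is more than `days` days old.

    Assumption: age is judged by mtime. The actual cleanup policy may use a
    different timestamp, so treat the result as an estimate.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # 86400 seconds per day
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                mtime = path.lstat().st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime < cutoff:
                yield path
```

For example, `for p in files_older_than("/path/to/your/scratch/dir"): print(p)` prints the candidates under a directory of your choosing (the path here is a placeholder).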
Thank you,
FAS Research Computing
https://docs.rc.fas.harvard.edu/
https://www.rc.fas.harvard.edu/

Resolved
July 02, 2026 at 14:54 UTC
The underlying cause has been identified and fixed. Logins should work as expected now.
This incident has been resolved.
Detected
July 02, 2026 at 14:20 UTC
We are currently investigating an issue where users cannot log into login nodes or cannot log in consistently.

June 2026

Resolved
June 29, 2026 at 04:31 UTC
This incident has been resolved. Fewer than 18 nodes remain in this state and will be cleaned up soon.
Monitoring
June 24, 2026 at 14:06 UTC
We have rolled out a fix and are currently monitoring the result.

Any nodes still in Kill Task Fail must fully drain of running jobs before being reopened, to ensure there are no orphaned processes.
PLEASE NOTE: This will take quite some time to fully resolve, as jobs are still running on these nodes, and the nodes cannot accept new jobs until cleared.
Current number of potentially impacted nodes: 18 (of an original 600; this number will update periodically)
Detected
June 24, 2026 at 13:40 UTC
We are currently investigating this incident. Due to a CVMFS mount issue, some nodes are being closed and labelled as "Kill Task Fail".
WIP

Resolved
June 20, 2026 at 14:09 UTC
This incident has been resolved.
Detected
June 19, 2026 at 19:00 UTC
holylfs06 storage may be slow or unresponsive. We are investigating this issue.

2026 MGHPCC power downtime June 15-18, 2026

Completed
June 18, 2026 at 21:15 UTC
The yearly power downtime at our Holyoke data center, MGHPCC, has completed.
The clusters and storage are back online and login nodes and OOD nodes are now available.
If you have an issue/need help, please send a ticket to rchelp@rc.fas.harvard.edu with details.
IMPORTANT NOTE: Tomorrow, June 19th, is a university holiday. FASRC staff will return Monday to address any lingering issues and any new tickets.
Update
June 18, 2026 at 20:45 UTC
Power-up is nearly complete, but a delay earlier in the day has us slightly behind.
New ETA is 6PM.
Update
June 18, 2026 at 12:16 UTC
MGHPCC has completed their maintenance and restored power to the facility.
FASRC will now begin the power-up process. Please be aware that this takes several hours.
We will update this status once complete.
NOTE: A reminder that tomorrow (Friday) is a university holiday.
In progress
June 15, 2026 at 13:00 UTC
Maintenance is now in progress.
Not yet started
June 15, 2026 at 13:00 UTC
The yearly power downtime at our Holyoke data center, MGHPCC, has been scheduled by the facility. This year's power downtime will take place from Monday June 15th to Thursday June 18th, 2026. There will be no June monthly maintenance as a result.
Since the facility will be powered down for two days this year, we will not be performing the usual maintenance tasks.
That said, maintenance will be performed on networking and other key infrastructure.
IMPORTANT NOTE: FASRC storage at both Holyoke and Boston will be affected and should not be expected to be available throughout the downtime. Please plan ahead accordingly.
- Monday June 15th - Power-down begins at 9AM
- Tuesday June 16th - Power out at MGHPCC
- Wednesday June 17th - Power out at MGHPCC
- Thursday June 18th - Expected return to full service by 5PM
- Friday June 19th - Please note that June 19th is a university holiday
For more detailed information and follow-up, please see:
https://www.rc.fas.harvard.edu/mghpcc-yearly-shutdown or this Status Page

Resolved
June 10, 2026 at 20:07 UTC
This incident has been resolved.
Detected
June 10, 2026 at 19:41 UTC
holylfs06 storage may be slow or unresponsive.
We are currently investigating this incident.

Resolved
June 09, 2026 at 19:38 UTC
The Portal website should be accessible for all now.
This incident has been resolved.
Detected
June 09, 2026 at 16:22 UTC
There is an SSL issue with the Portal which will cause an error for anyone attempting to connect.
Investigating.

May 2026

Resolved
June 01, 2026 at 14:45 UTC
Slurm crashed at 4:30 PM on Friday because a user ran a large sacct query against the Slurm database. This caused the database host to run out of memory, which crashed the scheduler. To prevent a recurrence, we are reducing the time range that users are permitted to query at one time to 7 days. Thus, if you need to cover a month, you would need to query in 7-day increments (four queries cover 28 days).

We do ask users to be judicious when querying the Slurm database. Only ask for the fields that you require. Please also ensure that any AI agents you have running limit their queries appropriately.
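One way to stay within the 7-day limit is to split a longer date range into consecutive windows and issue one sacct call per window. The sketch below is illustrative only: the helper names `query_windows` and `sacct_commands` are hypothetical, the default field list is just an example, and whether the limit treats the end time as inclusive is an assumption not confirmed by the notice. It builds sacct argument lists (using the standard `-S`/`-E` start and end options, `-X`, `-P`, and `--format`) but does not run them.

```python
from datetime import date, timedelta

MAX_WINDOW_DAYS = 7  # per-query limit stated in the notice above


def query_windows(start, end, max_days=MAX_WINDOW_DAYS):
    """Split the half-open range [start, end) into windows of at most max_days days."""
    windows = []
    cursor = start
    while cursor < end:
        stop = min(cursor + timedelta(days=max_days), end)
        windows.append((cursor, stop))
        cursor = stop
    return windows


def sacct_commands(start, end, fields=("JobID", "State", "Elapsed")):
    """Build one sacct argument list per window; nothing is executed here.

    Only the requested fields are included, in line with the request to
    keep queries small.
    """
    return [
        ["sacct", "-X", "-P",
         "-S", s.isoformat(), "-E", e.isoformat(),
         "--format=" + ",".join(fields)]
        for s, e in query_windows(start, end)
    ]
```

A 28-day span such as `query_windows(date(2026, 5, 1), date(2026, 5, 29))` yields four windows, while a full 31-day month yields five. Each argument list could then be passed to `subprocess.run`, ideally with a pause between calls.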
Identified
May 29, 2026 at 21:13 UTC
To temporarily stabilize the situation, we have reduced the maximum query time range for sacct and other Slurm commands to 1 day. We have filed a ticket with SchedMD to further analyze the issue.
The cluster is back up and the scheduler is accepting new jobs.
We will continue to monitor for emergencies over the weekend, and resume in-depth troubleshooting on Monday.
Detected
May 29, 2026 at 20:21 UTC
The Slurm scheduler is experiencing an error which is impacting jobs. The Cannon cluster will be inaccessible while we troubleshoot.
We are currently investigating this incident.

Resolved
May 26, 2026 at 14:20 UTC
This incident has been resolved. Login is again available for approvals/onboarding.
Detected
May 22, 2026 at 15:18 UTC
Account approvals through Portal are not available at this time.
We are currently investigating this incident.

Resolved
May 22, 2026 at 05:09 UTC
FASRC VPN (Cambridge) is back up. This incident was automatically resolved by Instatus monitoring.
Detected
May 22, 2026 at 04:50 UTC
FASRC VPN (Cambridge) is down at the moment. This incident was automatically created by Instatus monitoring.

Resolved
May 22, 2026 at 03:55 UTC
FASRC VPN (Cambridge) is back up. This incident was automatically resolved by Instatus monitoring.
Detected
May 22, 2026 at 03:35 UTC
FASRC VPN (Cambridge) is down at the moment. This incident was automatically created by Instatus monitoring.

Resolved
May 22, 2026 at 03:14 UTC
FASRC VPN (Cambridge) is back up. This incident was automatically resolved by Instatus monitoring.
Detected
May 22, 2026 at 03:05 UTC
FASRC VPN (Cambridge) is down at the moment. This incident was automatically created by Instatus monitoring.
