Notice history

Operational

Jul 2026

FASRC monthly maintenance Monday July 6th, 2026 9am-1pm

Completed
July 06, 2026 at 17:00 UTC
Maintenance completed successfully
In progress
July 06, 2026 at 13:00 UTC
Maintenance is now in progress
Not started yet
June 26, 2026 at 13:45 UTC
FASRC monthly maintenance will take place on July 6th, 2026. Our maintenance tasks should be completed between 9am and 1pm.
Cannon cluster will be paused during this maintenance?: NO
FASSE cluster will be paused during this maintenance?: NO
NOTICES:
- Friday July 3rd is a university holiday (Independence Day observed)
- Training: Upcoming training from FASRC and other sources can be found on our Training Calendar at https://www.rc.fas.harvard.edu/upcoming-training/
- Status Page: You can subscribe to our status to receive notifications of maintenance, incidents, and their resolution at https://status.rc.fas.harvard.edu/ (click Get Updates for options).
- We'd love to hear success stories about your or your lab's use of FASRC. Submit your story here.
MAINTENANCE TASKS
- Domain controller replacement
  Audience: Internal
  Impact: None. End users should not see any impact.
- Reboot drained nodes in error state
  Audience: Cluster nodes with errors.
  Impact: These nodes will have been drained already in preparation. No impact on jobs on the day and the affected nodes will return to service in their respective partitions after the maintenance period.
- OOD/Open OnDemand reboots
  Audience: All OOD users, reboot of the head nodes.
  Impact: Running sessions will not be affected.
- Login node reboots
  Audience: All login node users.
  Impact: Login nodes will reboot during the maintenance window.
- Netscratch 90-day retention cleanup
  Audience: All netscratch users.
  Impact: Files older than 90 days will be removed per our scratch policy. Please note that this cleanup can happen at any time, not just during maintenance.
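To see ahead of time which files a 90-day cleanup might touch, a small script along these lines can help. This is only a sketch: it assumes the 90-day age is measured from each file's modification time, which the notice does not state, and the function name `files_older_than` and its parameters are hypothetical. Check the scratch policy for the actual criterion before relying on the result.

```python
import os
import time

RETENTION_DAYS = 90  # retention period quoted in the notice


def files_older_than(root, days=RETENTION_DAYS, now=None):
    """Return sorted paths under root whose mtime is more than `days` days old.

    Assumption: age is judged by modification time. The real cleanup may
    use a different timestamp.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # seconds in a day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are judged by their own timestamp
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime < cutoff:
                stale.append(path)
    return sorted(stale)
```

Calling `files_older_than(path)` with your own scratch directory as `path` returns the candidate files. The script only lists files and never deletes anything.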
Thank you,
FAS Research Computing
https://docs.rc.fas.harvard.edu/
https://www.rc.fas.harvard.edu/

Resolved
July 02, 2026 at 14:54 UTC
The underlying cause has been identified and fixed. Logins should work as expected now.
This incident has been resolved.
Investigating
July 02, 2026 at 14:20 UTC
We are currently investigating an issue where users cannot log into login nodes or cannot log in consistently.

Jun 2026

Resolved
June 29, 2026 at 04:31 UTC
This incident has been resolved. Fewer than 18 nodes remain in this state and will be cleaned up soon.
Monitoring
June 24, 2026 at 14:06 UTC
We have rolled out a fix and are currently monitoring the result.

Any nodes still in Kill Task Fail will need to fully drain of running jobs first before being reopened to ensure there are no orphaned processes.
PLEASE NOTE: This will take quite some time to fully resolve as jobs are still running on these nodes. But the nodes cannot accept new jobs until cleared.
Current number of potentially impacted nodes: 18 (of an original 600; this number will be updated periodically)
Investigating
June 24, 2026 at 13:40 UTC
We are currently investigating this incident. Due to a CVMFS mount issue, some nodes are being closed and labelled as "Kill task fail".
WIP

Resolved
June 20, 2026 at 14:09 UTC
This incident has been resolved.
Investigating
June 19, 2026 at 19:00 UTC
holylfs06 storage may be slow or unresponsive. We are investigating this issue.

2026 MGHPCC power downtime June 15-18, 2026

Completed
June 18, 2026 at 21:15 UTC
The yearly power downtime at our Holyoke data center, MGHPCC, has completed.
The clusters and storage are back online and login nodes and OOD nodes are now available.
If you have an issue/need help, please send a ticket to rchelp@rc.fas.harvard.edu with details.
IMPORTANT NOTE: Tomorrow, June 19th, is a university holiday. FASRC staff will return Monday to address any lingering issues and any new tickets.
Update
June 18, 2026 at 20:45 UTC
Power-up is nearly complete, but a delay earlier in the day has us slightly behind.
New ETA is 6PM.
Update
June 18, 2026 at 12:16 UTC
MGHPCC has completed their maintenance and restored power to the facility.
FASRC will now begin the power-up process. Please be aware that this takes several hours.
We will update this status once complete.
NOTE: A reminder that tomorrow (Friday) is a university holiday.
In progress
June 15, 2026 at 13:00 UTC
Maintenance is now in progress
Not started yet
June 15, 2026 at 13:00 UTC
The yearly power downtime at our Holyoke data center, MGHPCC, has been scheduled by the facility. This year's power downtime will take place Monday, June 15th through Thursday, June 18th, 2026. There will be no June monthly maintenance as a result.
Since the facility will be powered down for two days this year, we will not be performing the usual maintenance tasks.
That said, maintenance will be performed on networking and other key infrastructure.
IMPORTANT NOTE: FASRC storage at both Holyoke and Boston will be affected and should not be expected to be available throughout the downtime. Please plan ahead accordingly.
- Monday June 15th - Power-down begins at 9AM
- Tuesday June 16th - Power out at MGHPCC
- Wednesday June 17th - Power out at MGHPCC
- Thursday June 18th - Expected return to full service by 5PM
- Friday June 19th - Please note that June 19th is a university holiday
For more detailed information and follow-up, please see:
https://www.rc.fas.harvard.edu/mghpcc-yearly-shutdown or this Status Page

Resolved
June 10, 2026 at 20:07 UTC
This incident has been resolved.
Investigating
June 10, 2026 at 19:41 UTC
holylfs06 storage may be slow or unresponsive.
We are currently investigating this incident.

Resolved
June 09, 2026 at 19:38 UTC
The Portal website should be accessible for all now.
This incident has been resolved.
Investigating
June 09, 2026 at 16:22 UTC
There is an SSL issue with the Portal which will cause an error for anyone attempting to connect.
Investigating.

May 2026

Resolved
June 01, 2026 at 14:45 UTC
Slurm crashed at 4:30 PM on Friday because a user ran a large sacct query against the Slurm database. This caused the database host to run out of memory, which crashed the scheduler. To prevent a recurrence, we are reducing the time range that users are permitted to query at one time to 7 days. Thus, to cover a month you would need to query in consecutive 7-day increments (four or five queries, depending on the month).

We ask users to be judicious when querying the Slurm database. Request only the fields you require. Please also ensure that any AI agents you have running limit their queries appropriately.
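One way to stay within the 7-day limit is to split a longer period into consecutive windows and run one narrow query per window. The sketch below only builds the sacct command lines and does not run them. The helper names `sacct_windows` and `sacct_command` are hypothetical, and the field list is just an example of requesting only a few fields.

```python
from datetime import date, timedelta

MAX_WINDOW_DAYS = 7  # per-query limit stated in the notice


def sacct_windows(start, end, max_days=MAX_WINDOW_DAYS):
    """Split the half-open range [start, end) into windows of at most max_days."""
    windows = []
    cursor = start
    while cursor < end:
        window_end = min(cursor + timedelta(days=max_days), end)
        windows.append((cursor, window_end))
        cursor = window_end
    return windows


def sacct_command(window_start, window_end, fields=("JobID", "State", "Elapsed")):
    """Build one sacct invocation limited to a single window and a few fields."""
    return [
        "sacct",
        "-X",            # one line per job allocation, no job steps
        "--parsable2",   # pipe-delimited output, easier to post-process
        f"--starttime={window_start.isoformat()}",
        f"--endtime={window_end.isoformat()}",
        "--format=" + ",".join(fields),
    ]


if __name__ == "__main__":
    # Example: cover May 2026 with windows of at most 7 days each.
    for w_start, w_end in sacct_windows(date(2026, 5, 1), date(2026, 6, 1)):
        print(" ".join(sacct_command(w_start, w_end)))
```

A job that runs across a window boundary may show up in more than one window, so results collected this way may need to be de-duplicated by job ID.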
Identified
May 29, 2026 at 21:13 UTC
To temporarily stabilize the situation, we have reduced the maximum query time for sacct and other Slurm commands to 1 day. We have filed a ticket with SchedMD to further analyze the issue.
The cluster is back up and the scheduler is accepting new jobs.
We will continue to monitor for emergencies over the weekend, and resume in-depth troubleshooting on Monday.
Investigating
May 29, 2026 at 20:21 UTC
The Slurm scheduler is experiencing an error which is impacting jobs. The Cannon cluster will be inaccessible while we troubleshoot.
We are currently investigating this incident.

Resolved
May 26, 2026 at 14:20 UTC
This incident has been resolved. Login is again available for approvals/onboarding.
Investigating
May 22, 2026 at 15:18 UTC
Account approvals through Portal are not available at this time.
We are currently investigating this incident.

Resolved
May 22, 2026 at 05:09 UTC
FASRC VPN (Cambridge) is back up. This incident was automatically resolved by Instatus monitoring.
Investigating
May 22, 2026 at 04:50 UTC
FASRC VPN (Cambridge) is down at the moment. This incident was automatically created by Instatus monitoring.

Resolved
May 22, 2026 at 03:55 UTC
FASRC VPN (Cambridge) is back up. This incident was automatically resolved by Instatus monitoring.
Investigating
May 22, 2026 at 03:35 UTC
FASRC VPN (Cambridge) is down at the moment. This incident was automatically created by Instatus monitoring.

Resolved
May 22, 2026 at 03:14 UTC
FASRC VPN (Cambridge) is back up. This incident was automatically resolved by Instatus monitoring.
Investigating
May 22, 2026 at 03:05 UTC
FASRC VPN (Cambridge) is down at the moment. This incident was automatically created by Instatus monitoring.
