Incident history

Degraded performance

Operational

June 2025

Resolved
June 20, 2025 at 18:27
Coldfront is back up. This incident has been resolved.
Detected
June 20, 2025 at 16:58
Coldfront may be inaccessible. We are currently investigating this incident.

Resolved
June 16, 2025 at 20:26
Allowed one week for the message to propagate. Closing this incident.
Identified
June 09, 2025 at 14:34
While attempting to correct the over-quota/extra data issue on holylabs, an error in the sync command deleted files created since the cluster reopened (6/5/25 9AM) in 54 lab directories. We see no evidence that any other lab directories were affected.

Because of the scale of the original cleanup, and because the error was discovered after the fact, these deleted files regrettably cannot be recovered.
A list of affected /n/holylabs lab directories follows. If your lab is not on the list, it has not been identified as affected by this error:
acc_lab
alvarez_lab
avillar_lab
barnett_lab
bertoldi_lab
bol_lab
brenner_lab
cgolden_lab
charbonneau_lab
charrison_lab
chetty_lab
cnelya_lab
dam_lab
doshi-velez_lab
eisenstein_lab
enos_lab
eps_preceptors
glassman_lab
hanson_lab
hekstra_lab
holbrook_lab
iaifi_lab
idreos_lab
iebecker_lab
imai_lab
jacobsen_lab
jialiu_lab
junweil_lab
kaxiras_lab
kdbrantley_lab
kempner_dev
king_lab
kiyoul_lab
konkle_lab
koumoutsakos_lab
kozinsky_lab
kramer_lab
maustern_lab
nliu_lab
pallais_lab
park_lab
pierce_lab
protopapas_lab
pslade_lab
shro_lab
sitanc_lab
smousavih_lab
sneel_lab
snyder_lab
sompolinsky_lab
tamano_lab
ylei_lab
zickler_lab
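For illustration only, here is a minimal Python sketch of how a mirroring cleanup can guard against this failure mode. It is not the command that was run on holylabs; the function name `plan_deletions`, the file names, and the in-memory data are all hypothetical. The idea is that any file on the destination that is missing from the source is a deletion candidate, but files modified after a cutoff (here, the 6/5/25 9AM reopening) are set aside for review instead of being removed.

```python
from datetime import datetime

def plan_deletions(source_paths, dest_mtimes, protect_after):
    """Split destination-only files into deletable and protected lists.

    source_paths: set of relative paths present on the source.
    dest_mtimes: dict mapping destination path -> last-modified datetime.
    protect_after: files modified at or after this time are never deleted.
    """
    to_delete, protected = [], []
    for path, mtime in sorted(dest_mtimes.items()):
        if path in source_paths:
            continue  # present on both sides, nothing to remove
        if mtime >= protect_after:
            protected.append(path)  # created after the cutoff: keep for review
        else:
            to_delete.append(path)
    return to_delete, protected

# Hypothetical data: cutoff is the cluster reopening time from the notice.
reopen = datetime(2025, 6, 5, 9, 0)
source = {"a.txt"}
dest = {
    "a.txt": datetime(2025, 5, 1),
    "stale.txt": datetime(2025, 5, 20),
    "new.txt": datetime(2025, 6, 6, 12, 0),
}

to_delete, protected = plan_deletions(source, dest, reopen)
print(to_delete)   # ['stale.txt']
print(protected)   # ['new.txt']
```

Printing the plan before acting on it serves the same purpose as a dry run: the deletion list can be inspected before anything is removed.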
Detected
June 09, 2025 at 14:07
We are currently investigating an issue on holylabs where some labs have noticed newly created files are missing.

We will update this incident with more info as soon as possible.

Resolved
June 06, 2025 at 15:50
During the downtime, holylabs was migrated to our new Vast filesystem. During this migration, data was synced from the old system to the new one, and a final sync was done at power-up while the systems were not available to end users.
We have since found an issue for some labs that will present itself in one of two ways (or both, for some):
1 - You are able to use your holylabs lab directory, but you notice files that were previously deleted.
2 - You cannot use your holylabs lab directory because of over-quota errors.
Our storage team is working to resolve this issue by temporarily suspending quotas on holylabs to prevent over-quota errors, and by removing the data that should not have remained after the final sync(s) so that lab directories get back below quota.
This work has already begun but may take several days to complete. Updates will be posted here.
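The notice does not state the exact cause, but one plausible mechanism for both symptoms is a multi-pass sync that copies new and changed files without propagating deletions. The Python sketch below models that assumption with plain dictionaries (path mapped to size); the `sync` function, file names, sizes, and quota are hypothetical and stand in for whatever tooling was actually used.

```python
def sync(source, dest, propagate_deletes=False):
    """Copy every source entry into dest; optionally drop dest-only entries."""
    dest.update(source)
    if propagate_deletes:
        for path in set(dest) - set(source):
            del dest[path]
    return dest

# Hypothetical lab directory: path -> size in GB, with a 1000 GB quota.
QUOTA_GB = 1000
old_fs = {"keep.dat": 100, "scratch.tmp": 900}
new_fs = {}

sync(old_fs, new_fs)          # initial bulk copy before the downtime
del old_fs["scratch.tmp"]     # a user deletes a file on the old system
old_fs["results.csv"] = 50    # and creates a new one
sync(old_fs, new_fs)          # final sync, deletions not propagated

stale = sorted(set(new_fs) - set(old_fs))
usage = sum(new_fs.values())
print(stale)              # ['scratch.tmp']
print(usage > QUOTA_GB)   # True
```

In this model the previously deleted file reappears on the new system (symptom 1) and pushes usage over quota (symptom 2). Rerunning the final pass with `propagate_deletes=True` removes it.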

June 5-6 MGHPCC pod 7c cooling updates - See partition list below

Completed
June 05, 2025 at 19:03
The work on row 7c is complete. Returning idled nodes to normal service.
In progress
June 05, 2025 at 11:00
Maintenance is now in progress
Not yet started
June 05, 2025 at 11:00
There will be additional scheduled maintenance at MGHPCC between June 5th and 6th.
As part of the work during the MGHPCC outage, one of the Cooling Distribution Units (CDUs) in Pod 7c will be replaced. This will allow for future expansion into this space.
This work will run from Thursday June 5th until the evening of Friday June 6th. This means nodes whose names begin with holy7c02, 04, 06, 08, 10, or 12 will not come back online after the outage and will remain down until this CDU update is complete.
This impacts the following partitions. If you are using one of those partitions please use the public sapphire partition while your equipment is being serviced. These nodes will be returned to service once the CDU work is complete:
- blackhole
- blackhole_priority
- davies
- desai
- eddy
- huce_cascade
- huce_cascade_priority
- huttenhower
- jacobsen2
- janson
- janson_cascade
- ke
- lukin
- nguyen
- seas_compute
- shared
- tambe
- vishwanath
- whipple
- xlin
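As a quick way to check whether a given node falls under the naming rule above, here is a small Python sketch. The prefix list comes from the notice; the function name `is_affected` and the full node names in the example are hypothetical.

```python
import re

# Two-digit identifiers from the notice: holy7c02, 04, 06, 08, 10, 12.
AFFECTED_IDS = {"02", "04", "06", "08", "10", "12"}
NODE_PATTERN = re.compile(r"^holy7c(\d{2})")

def is_affected(node_name):
    """Return True if the node name begins with one of the affected prefixes."""
    match = NODE_PATTERN.match(node_name)
    return bool(match) and match.group(1) in AFFECTED_IDS

# Hypothetical node names, for illustration only.
for name in ["holy7c04210", "holy7c24501", "holy8a24301"]:
    print(name, is_affected(name))
```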

2025 MGHPCC power downtime June 2-4, 2025

Completed
June 05, 2025 at 13:00
Maintenance has completed successfully
In progress
June 02, 2025 at 13:00
Maintenance is now in progress
Not yet started
June 02, 2025 at 13:00
The yearly power downtime at our Holyoke data center, MGHPCC, has been scheduled.
This year's power downtime will take place on Tuesday June 3, 2025.
This will require FASRC to begin shutdown of our systems beginning at 9AM on Monday, June 2nd.
We have worked to reduce the total outage time this year.
We will begin power-up on Wednesday June 4th with an expected return to full service by 9AM Thursday June 5th.
- Monday June 2nd - Power-down begins at 9AM
- Tuesday June 3rd - Power out at MGHPCC
- Wednesday June 4th - Maintenance tasks and then power-up begins
- Thursday June 5th - Expected return to full service by 9AM
Maintenance:
During this downtime, Holylabs (/n/holylabs) will move to new hardware.
Starfish, Coldfront, and the Portal will be unavailable during the downtime.
For more details including a graphical timeline, please see: https://www.rc.fas.harvard.edu/events/2025-mghpcc-power-downtime/
Updates will be posted here on our status page: https://status.rc.fas.harvard.edu/
Note that you can subscribe to receive updates as they happen. On the status page, click Get Updates.
Notices and reminders will also be sent to all users via our mailing lists.

May 2025

Resolved
May 30, 2025 at 19:59
This incident has been resolved.
Detected
May 30, 2025 at 16:33
Coldfront is undergoing maintenance and may experience service disruption during this process.
ETA for resolution is end of business on Friday.

Starfish upgrade Thursday, May 29th from 5PM-6PM

Completed
May 30, 2025 at 14:27
Maintenance has completed successfully.
Not yet started
May 29, 2025 at 21:00
Starfish upgrade Thursday, May 29th from 5PM-6PM. Starfish will be unavailable during that time.

Resolved
May 27, 2025 at 23:51
Maintenance complete.
Identified
May 27, 2025 at 18:28
Coldfront is in need of database maintenance. Expect return to service by 5PM.

MGHPCC power work 5/21 - 5/23 - Some partitions will be at half capacity

Completed
May 23, 2025 at 19:00
Maintenance has completed successfully
In progress
May 21, 2025 at 11:00
Maintenance is now in progress
Not yet started
May 21, 2025 at 11:00
The MGHPCC Holyoke data center will be performing power work on May 21st-23rd. This work will take out one half (or one 'side') of the power capacity for certain rows/racks, including our compute rows. Because of our power draw, one side does not provide enough power to keep each full rack running.
As such, we will be adding a reservation to idle half the nodes in the partitions listed below. A reservation will cause nodes to drain as jobs complete and stop scheduling new jobs on those nodes if they cannot be completed before the outage. This will allow us to idle and power down those nodes prior to the work and avoid potential blackout/brownout on those racks.
This will mean that these partitions will be up and available, but that half the nodes from each will be down (assuming an even number of nodes).
This work is part of an ongoing power capacity upgrade at MGHPCC. We expect this will be the last power work needed, and the facility will then provide enough additional power for future expansion as well as adding overhead for the current load.
The affected partitions are:
- arguelles_delgado
- bigmem_intermediate
- blackhole_gpu
- eddy
- gershman
- hejazi
- hernquist
- hoekstra
- huce_ice
- iaifi_gpu
- iaifi_gpu_requeue
- iaifi_priority
- jshapiro
- jshapiro_priority
- kempner
- kempner_requeue
- kempner_h100
- kempner_h100_priority
- kempner_h100_priority2
- kovac
- kozinsky
- kozinsky_gpu
- kozinsky_requeue
- ortegahernandez_ice
- rivas
- seas_compute
- seas_gpu
- siag_combo
- siag_gpu
- sur
- zhuang
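The draining behaviour described above comes down to one rule: a job may start on a reserved node only if its time limit ends before the reservation begins. The Python sketch below illustrates that rule only. It is not the scheduler's actual logic, and the function name `can_start` and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta

def can_start(now, time_limit, reservation_start):
    """Return True if a job starting now would finish before the reservation."""
    return now + time_limit <= reservation_start

# Hypothetical times: reservation assumed to begin May 21 at 11:00.
reservation_start = datetime(2025, 5, 21, 11, 0)
now = datetime(2025, 5, 20, 23, 0)  # 12 hours before the reservation

print(can_start(now, timedelta(hours=8), reservation_start))   # True
print(can_start(now, timedelta(hours=24), reservation_start))  # False
```

Under this rule, requesting a shorter time limit is what lets a job still fit on a node that is about to be idled.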

Resolved
May 06, 2025 at 14:30
We have reverted the changes to the Jupyter OOD app while we work on further development.
This incident is resolved.
Identified
May 06, 2025 at 13:56
The Jupyter app for OOD is failing to start for certain workflows that load older versions of python or modules. We are in the process of reverting Jupyter to fix the bug.

April 2025

Resolved
April 28, 2025 at 20:36
Starfish is once again accepting logins.
Detected
April 28, 2025 at 14:25
Starfish is not allowing login due to a database issue. We have contacted the vendor and expect to resolve this soon.

Resolved
April 27, 2025 at 12:41
Most nodes have been returned to service, and both Cannon and FASSE are operational. Remaining down nodes will be remediated by FASRC staff over the next several days.
Detected
April 27, 2025 at 11:02
At 1:05am EDT, MGHPCC lost power to compute due to a cooling failure. The chiller has been restored and power is back as of 5:43am. As a result, all compute was powered off and all running jobs were requeued. FASRC staff are triaging the down hardware and will bring it back online when it is ready. Jobs will start once FASRC staff have confirmed that everything is in a good state.

Resolved
April 16, 2025 at 14:40
holylabs is back up
Identified
April 16, 2025 at 14:27
holylabs must be rebooted to resolve a stability/function issue.

Resolved
April 14, 2025 at 17:18
Cannon boslogin and FASSE login nodes are back up and operational.
All holylogin nodes are still down for repair. Please see our posted incident for more updates: https://status.rc.fas.harvard.edu/cm97gyay90013dturk7fxg5pb
We apologize for the unexpected disruption.
Detected
April 14, 2025 at 16:58
Due to a configuration error, all cluster login nodes are rebooting and are temporarily unavailable. Please save any work immediately.

Resolved
April 14, 2025 at 18:48
Hardware has been repaired and Holyoke login nodes are back online. Thanks for your patience.
Monitoring
April 09, 2025 at 15:15
Holylogin chassis repair during maintenance was unsuccessful and replacement parts have been ordered.
- Holyoke login nodes (holylogin05-08) are down for hardware repair
- Only Boston login nodes are available (i.e., boslogin[05-08])
If you have holylogin hard-coded in your scripts, please update to login.rc.fas.harvard.edu or boslogin.rc.fas.harvard.edu for the time being, which will redirect you to an available login node.
As always, the best method for obtaining a login node is using login.rc.fas.harvard.edu, which will pick a node for you.
If you require a login node in a specific data center, use boslogin.rc.fas.harvard.edu (Boston) or (once they are back in service) holylogin.rc.fas.harvard.edu (Holyoke).
See also: Command line access with Terminal (login nodes) – FASRC DOCS
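For users who need to find hard-coded holylogin hostnames in their scripts, the following Python sketch shows one way to rewrite them. It assumes that hard-coded references look like `holylogin.rc.fas.harvard.edu` or a numbered variant such as `holylogin05.rc.fas.harvard.edu`. The numbered hostname form, the function name, and the sample script line are assumptions made for illustration.

```python
import re

# Matches holylogin.rc.fas.harvard.edu and assumed numbered variants.
HARDCODED = re.compile(r"\bholylogin\d*\.rc\.fas\.harvard\.edu\b")

def replace_hardcoded_login(text, replacement="login.rc.fas.harvard.edu"):
    """Return (new_text, count) with holylogin hostnames replaced."""
    return HARDCODED.subn(replacement, text)

# Hypothetical script line.
script = "ssh jharvard@holylogin05.rc.fas.harvard.edu 'squeue --me'"
new_script, count = replace_hardcoded_login(script)
print(count)       # 1
print(new_script)  # ssh jharvard@login.rc.fas.harvard.edu 'squeue --me'
```

Passing `replacement="boslogin.rc.fas.harvard.edu"` instead targets the Boston login nodes specifically.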
Resolved
April 09, 2025 at 15:01
This incident was posted by mistake.

holylogin01-04 were replaced by holylogin05-08 some time back.

As always, the best method for obtaining a login node is using login.rc.fas.harvard.edu, which will pick a node for you.
Or if you require a login node in a specific data center, use boslogin.rc.fas.harvard.edu (Boston) or holylogin.rc.fas.harvard.edu (Holyoke).
See also: Command line access with Terminal (login nodes) – FASRC DOCS
Detected
April 07, 2025 at 19:33
Holylogin chassis repair during maintenance was unsuccessful and replacement parts have been ordered.
Audience:
- All cluster users
Impact:
- All holylogin** servers will be down until further notice
- Only Boston login nodes are available (i.e., boslogin[05-08])
If you have holylogin hard-coded in your scripts, please update to login.rc.fas.harvard.edu or boslogin.rc.fas.harvard.edu for the time being, which will redirect you to an available login node.
Updates to follow as we have them.


FAS Research Computing - Incident history
