Notice history

Degraded performance

Operational

June 2025

Resolved
June 20, 2025 at 18:27
Coldfront is back up. This incident has been resolved.
Investigating
June 20, 2025 at 16:58
Coldfront may be inaccessible. We are currently investigating this incident.

Resolved
June 16, 2025 at 20:26
Allowed one week for the message to propagate. Closing this incident.
Identified
June 9, 2025 at 14:34
While attempting to correct the over-quota/extra data issue on holylabs, an error in the sync command deleted files created since the cluster reopened (6/5/25 9AM) in 54 lab directories. We see no evidence that any other lab directories were affected.

Due to the scale of the original cleanup, and because the error was discovered after the fact, these deleted files regrettably cannot be recovered.
The affected /n/holylabs lab directories are listed below. If your lab is not on the list, it has not been identified as affected by this error:
acc_lab
alvarez_lab
avillar_lab
barnett_lab
bertoldi_lab
bol_lab
brenner_lab
cgolden_lab
charbonneau_lab
charrison_lab
chetty_lab
cnelya_lab
dam_lab
doshi-velez_lab
eisenstein_lab
enos_lab
eps_preceptors
glassman_lab
hanson_lab
hekstra_lab
holbrook_lab
iaifi_lab
idreos_lab
iebecker_lab
imai_lab
jacobsen_lab
jialiu_lab
junweil_lab
kaxiras_lab
kdbrantley_lab
kempner_dev
king_lab
kiyoul_lab
konkle_lab
koumoutsakos_lab
kozinsky_lab
kramer_lab
maustern_lab
nliu_lab
pallais_lab
park_lab
pierce_lab
protopapas_lab
pslade_lab
shro_lab
sitanc_lab
smousavih_lab
sneel_lab
snyder_lab
sompolinsky_lab
tamano_lab
ylei_lab
zickler_lab
Investigating
June 9, 2025 at 14:07
We are currently investigating an issue on holylabs where some labs have noticed newly created files are missing.

We will update this incident with more info as soon as possible.

Resolved
June 6, 2025 at 15:50
During the downtime, holylabs was migrated to our new Vast filesystem. During this migration, data was synced from the old system to the new one, and a final sync was done at power-up while the systems were unavailable to end users.
We have since found an issue for some labs that presents itself in one of two ways (or both, for some):
1 - You are able to use your holylabs lab directory, but you notice files that were previously deleted
2 - You cannot use your holylabs lab directory because of over-quota errors
Our storage team is working to resolve this issue by temporarily suspending quotas on holylabs to prevent over-quota errors, and by removing the data that should not have remained after the final sync(s), so as to get lab directories back below quota.
This work has already begun but may take several days to complete. Updates will be posted here.

June 5-6 MGHPCC pod 7c cooling updates - See partition list below

Completed
June 5, 2025 at 19:03
The work on row 7c is complete. Returning idled nodes to normal service.
In progress
June 5, 2025 at 11:00
Maintenance is now in progress
Not started yet
June 5, 2025 at 11:00
There will be additional scheduled maintenance at MGHPCC between June 5th and 6th.
As part of the work during the MGHPCC outage, one of the Cooling Distribution Units (CDUs) in Pod 7c will be replaced. This will allow for future expansion into this space.
This work will run from Thursday June 5th until the evening of Friday June 6th. This means nodes whose names begin with holy7c02, 04, 06, 08, 10, or 12 will not come back online after the outage and will remain down until this CDU update is complete.
This impacts the following partitions. If you are using one of those partitions please use the public sapphire partition while your equipment is being serviced. These nodes will be returned to service once the CDU work is complete:
- blackhole
- blackhole_priority
- davies
- desai
- eddy
- huce_cascade
- huce_cascade_priority
- huttenhower
- jacobsen2
- janson
- janson_cascade
- ke
- lukin
- nguyen
- seas_compute
- shared
- tambe
- vishwanath
- whipple
- xlin
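
To check whether a particular node falls under this work, the name prefixes given in the notice can be tested directly. The following is a minimal sketch only: the full node names used in the loop are hypothetical, and the helper `is_affected` is not part of any FASRC tooling.

```python
# Node-name prefixes taken from the notice: holy7c02, 04, 06, 08, 10, 12.
AFFECTED_PREFIXES = tuple(f"holy7c{n:02d}" for n in (2, 4, 6, 8, 10, 12))


def is_affected(node_name: str) -> bool:
    """Return True if the node name begins with one of the affected prefixes."""
    return node_name.startswith(AFFECTED_PREFIXES)


# Hypothetical node names, for illustration only.
for name in ("holy7c02101", "holy7c03101", "holy8a24301"):
    print(name, is_affected(name))
```

Only the first of the three hypothetical names matches, because its prefix is on the list.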

2025 MGHPCC power downtime June 2-4, 2025

Completed
June 5, 2025 at 13:00
Maintenance has completed successfully
In progress
June 2, 2025 at 13:00
Maintenance is now in progress
Not started yet
June 2, 2025 at 13:00
The yearly power downtime at our Holyoke data center, MGHPCC, has been scheduled.
This year's power downtime will take place on Tuesday June 3, 2025.
This will require FASRC to begin shutdown of our systems beginning at 9AM on Monday, June 2nd.
We have worked to reduce the total outage time this year.
We will begin power-up on Wednesday June 4th with an expected return to full service by 9AM Thursday June 5th.
- Monday June 2nd - Power-down begins at 9AM
- Tuesday June 3rd - Power out at MGHPCC
- Wednesday June 4th - Maintenance tasks and then power-up begins
- Thursday June 5th - Expected return to full service by 9AM
Maintenance:
During this downtime, Holylabs (/n/holylabs) will move to new hardware.
Starfish, Coldfront, and the Portal will be unavailable during the downtime.
For more details including a graphical timeline, please see: https://www.rc.fas.harvard.edu/events/2025-mghpcc-power-downtime/
Updates will be posted here on our status page: https://status.rc.fas.harvard.edu/
Note that you can subscribe to receive updates as they happen. On the status page, click Get Updates.
Notices and reminders will also be sent to all users via our mailing lists.

May 2025

Resolved
May 30, 2025 at 19:59
This incident has been resolved.
Investigating
May 30, 2025 at 16:33
Coldfront is undergoing maintenance and may experience service disruption in the course of this process.
ETA for resolution is end of business on Friday.

Starfish upgrade Thursday, May 29th from 5PM-6PM

Completed
May 30, 2025 at 14:27
Maintenance has completed successfully.
Not started yet
May 29, 2025 at 21:00
Starfish upgrade Thursday, May 29th from 5PM-6PM. Starfish will be unavailable during that time.

Resolved
May 27, 2025 at 23:51
Maintenance complete.
Identified
May 27, 2025 at 18:28
Coldfront is in need of database maintenance. Expect return to service by 5PM.

MGHPCC power work 5/21 - 5/23 - Some partitions will be at half capacity

Completed
May 23, 2025 at 19:00
Maintenance has completed successfully
In progress
May 21, 2025 at 11:00
Maintenance is now in progress
Not started yet
May 21, 2025 at 11:00
The MGHPCC Holyoke data center will be performing power work on May 21st-23rd. This work will take out one half (or one 'side') of the power capacity for certain rows/racks, including our compute rows. Because of our power draw, one side does not provide enough power to keep each full rack running.
As such, we will be adding a reservation to idle half the nodes in the partitions listed below. A reservation will cause nodes to drain as jobs complete and stop scheduling new jobs on those nodes if they cannot be completed before the outage. This will allow us to idle and power down those nodes prior to the work and avoid potential blackout/brownout on those racks.
This will mean that these partitions will be up and available, but that half the nodes from each will be down (assuming an even number of nodes).
This work is part of an ongoing power capacity upgrade at MGHPCC. We expect this will be the last power work needed, and the facility will then provide enough additional power for future expansion as well as adding overhead for the current load.
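
The draining behavior described above can be illustrated with a small sketch. This is a simplified model of the rule as stated in the notice (a job is not scheduled on a reserved node if it cannot finish before the outage), not the scheduler's actual implementation; the function name and the times used are hypothetical.

```python
from datetime import datetime, timedelta


def can_start_before_reservation(
    now: datetime, time_limit: timedelta, reservation_start: datetime
) -> bool:
    """Simplified model: a job may start on a reserved node only if its
    full requested time limit ends at or before the reservation begins."""
    return now + time_limit <= reservation_start


# Hypothetical times, for illustration only.
reservation_start = datetime(2025, 5, 21, 9, 0)
now = datetime(2025, 5, 20, 9, 0)

# A 12-hour job would finish before the reservation starts.
print(can_start_before_reservation(now, timedelta(hours=12), reservation_start))
# A 3-day job would overlap the reservation, so it would not be scheduled there.
print(can_start_before_reservation(now, timedelta(days=3), reservation_start))
```

In this model, requesting a shorter time limit is what allows a job to fit on a node that is about to be idled.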
The affected partitions are:
- arguelles_delgado
- bigmem_intermediate
- blackhole_gpu
- eddy
- gershman
- hejazi
- hernquist
- hoekstra
- huce_ice
- iaifi_gpu
- iaifi_gpu_requeue
- iaifi_priority
- jshapiro
- jshapiro_priority
- kempner
- kempner_requeue
- kempner_h100
- kempner_h100_priority
- kempner_h100_priority2
- kovac
- kozinsky
- kozinsky_gpu
- kozinsky_requeue
- ortegahernandez_ice
- rivas
- seas_compute
- seas_gpu
- siag_combo
- siag_gpu
- sur
- zhuang

Resolved
May 6, 2025 at 14:30
We have reverted the changes to the Jupyter OOD app while we work on further development.
This incident is resolved.
Identified
May 6, 2025 at 13:56
The Jupyter app for OOD is failing to start for certain workflows that load older versions of Python or modules. We are in the process of reverting Jupyter to fix the bug.

April 2025

Resolved
April 28, 2025 at 20:36
Starfish is once again accepting logins.
Investigating
April 28, 2025 at 14:25
Starfish is not allowing login due to a database issue. We have contacted the vendor and expect to resolve this soon.

Resolved
April 27, 2025 at 12:41
Most nodes have been returned to service, and both Cannon and FASSE are operational. Remaining down nodes will be remediated by FASRC staff over the next several days.
Investigating
April 27, 2025 at 11:02
At 1:05am EDT MGHPCC lost power to compute due to a cooling failure. The chiller has been restored and power is back as of 5:43am. As a result, all compute is powered off and all running jobs were requeued. FASRC staff are triaging the down hardware and will bring it back online when it is ready. Jobs will start once FASRC staff have confirmed that everything is in a good state.

Resolved
April 16, 2025 at 14:40
holylabs is back up
Identified
April 16, 2025 at 14:27
holylabs must be rebooted to resolve a stability/function issue.

Resolved
April 14, 2025 at 17:18
Cannon boslogin and FASSE login nodes are back up and operational.
All holylogin nodes are still down for repair. Please see our posted incident for more updates: https://status.rc.fas.harvard.edu/cm97gyay90013dturk7fxg5pb
We apologize for the unexpected disruption.
Investigating
April 14, 2025 at 16:58
Due to a configuration error, all cluster login nodes are rebooting and are temporarily unavailable. Please save any work immediately.

Resolved
April 14, 2025 at 18:48
Hardware has been repaired and Holyoke login nodes are back online. Thanks for your patience.
Monitoring
April 9, 2025 at 15:15
Holylogin chassis repair during maintenance was unsuccessful and replacement parts have been ordered.
- Holyoke login nodes (holylogin05-08) are down for hardware repair
- Only Boston login nodes available (i.e., boslogin[05-08])
If you have holylogin hard-coded in your scripts, please update to login.rc.fas.harvard.edu or boslogin.rc.fas.harvard.edu for the time being, which will redirect you to an available login node.
As always, the best method for obtaining a login node is using login.rc.fas.harvard.edu, which will pick a node for you.
If you require a login node in a specific data center, use boslogin.rc.fas.harvard.edu (Boston) or (once they are back in service) holylogin.rc.fas.harvard.edu (Holyoke).
See also: Command line access with Terminal (login nodes) – FASRC DOCS
Resolved
April 9, 2025 at 15:01
This incident was posted by mistake.

holylogin01-04 were replaced by holylogin05-08 some time back.

As always, the best method for obtaining a login node is using login.rc.fas.harvard.edu, which will pick a node for you.
Or if you require a login node in a specific data center, use boslogin.rc.fas.harvard.edu (Boston) or holylogin.rc.fas.harvard.edu (Holyoke).
See also: Command line access with Terminal (login nodes) – FASRC DOCS
Investigating
April 7, 2025 at 19:33
Holylogin chassis repair during maintenance was unsuccessful and replacement parts have been ordered.
Audience:
- All cluster users
Impact:
- All holylogin** servers will be down until further notice
- Only Boston login nodes available (i.e., boslogin[05-08])
If you have holylogin hard-coded in your scripts, please update to login.rc.fas.harvard.edu or boslogin.rc.fas.harvard.edu for the time being, which will redirect you to an available login node.
Updates to follow as we have them.
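
For scripts with holylogin hard-coded, the substitution recommended above could be sketched as follows. This is an illustrative sketch rather than an official FASRC tool: the helper name `use_generic_login` is hypothetical, and the pattern assumes hostnames are written in full as `holylogin.rc.fas.harvard.edu`, optionally with a node number such as `holylogin05` (that numbered fully qualified form is an assumption).

```python
import re

# Assumed forms: holylogin.rc.fas.harvard.edu or holyloginNN.rc.fas.harvard.edu.
# Bare "holylogin" without the domain is deliberately not matched.
HOLYLOGIN = re.compile(r"\bholylogin\d*\.rc\.fas\.harvard\.edu\b")


def use_generic_login(script_text: str) -> str:
    """Replace hard-coded holylogin hostnames with the generic login address."""
    return HOLYLOGIN.sub("login.rc.fas.harvard.edu", script_text)


# Hypothetical script lines, for illustration only.
example = (
    "ssh someuser@holylogin.rc.fas.harvard.edu\n"
    "ssh someuser@boslogin.rc.fas.harvard.edu\n"
)
print(use_generic_login(example))
```

The boslogin line is left unchanged, since the notice states the Boston login nodes remain available.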

April 2025 to June 2025

FAS Research Computing - Notice history