Notice History

Operational

Jun 2025

Resolved
June 20, 2025 at 6:27 PM
Coldfront is back up. This incident has been resolved.
Investigating
June 20, 2025 at 4:58 PM
Coldfront may be inaccessible. We are currently investigating this incident.

Resolved
June 16, 2025 at 8:26 PM
Allowed one week for the message to propagate. Closing this incident.
Identified
June 09, 2025 at 2:34 PM
While attempting to correct the over-quota/extra data issue on holylabs, an error in the sync command caused the deletion of newly created files since the re-open of the cluster (6/5/25 9AM) for 54 lab directories. We see no evidence that any other lab directories were affected.

Due to the scale of the original cleanup, and because the error was discovered after the fact, these deleted files regrettably cannot be recovered.
A list of affected /n/holylabs lab directories follows. If your lab is not on the list, it has not been identified as affected by this error:
acc_lab
alvarez_lab
avillar_lab
barnett_lab
bertoldi_lab
bol_lab
brenner_lab
cgolden_lab
charbonneau_lab
charrison_lab
chetty_lab
cnelya_lab
dam_lab
doshi-velez_lab
eisenstein_lab
enos_lab
eps_preceptors
glassman_lab
hanson_lab
hekstra_lab
holbrook_lab
iaifi_lab
idreos_lab
iebecker_lab
imai_lab
jacobsen_lab
jialiu_lab
junweil_lab
kaxiras_lab
kdbrantley_lab
kempner_dev
king_lab
kiyoul_lab
konkle_lab
koumoutsakos_lab
kozinsky_lab
kramer_lab
maustern_lab
nliu_lab
pallais_lab
park_lab
pierce_lab
protopapas_lab
pslade_lab
shro_lab
sitanc_lab
smousavih_lab
sneel_lab
snyder_lab
sompolinsky_lab
tamano_lab
ylei_lab
zickler_lab
Investigating
June 09, 2025 at 2:07 PM
We are currently investigating an issue on holylabs where some labs have noticed newly created files are missing.

We will update this incident with more info as soon as possible.

Resolved
June 06, 2025 at 3:50 PM
During the downtime, holylabs was migrated to our new Vast filesystem. During this migration, data was synced from the old system to the new one, and a final sync was done at power-up while the systems were not available to end users.
We have since found an issue for some labs that will present itself in one of two ways (or both, for some):
1 - You are able to use your holylabs lab directory, but you notice files that were previously deleted.
2 - You cannot use your holylabs lab directory because of over-quota errors.
Our storage team is working to resolve this issue by temporarily suspending quotas on holylabs to prevent over-quota errors, and by removing the data that should not have remained after the final sync(s) so that lab directories get back below quota.
This work has already begun but may take several days to complete. Updates will be posted here.
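
The notice does not state the exact cause, but one plausible way for deleted files to reappear is a final sync that copies new files without propagating deletions. The sketch below simulates that scenario with plain Python sets; the function name `sync_without_delete` and the file names are hypothetical and exist only for illustration.

```python
def sync_without_delete(source, dest):
    """Copy everything in source into dest; never remove anything from dest.

    This models an ASSUMED sync behaviour, not the actual migration command.
    """
    return dest | source


# Hypothetical file names, for illustration only.
initial_source = {"a.dat", "b.dat", "c.dat"}

# First pass: the new filesystem starts empty and receives all files.
dest = sync_without_delete(initial_source, set())

# Between passes, a user deletes b.dat and creates d.dat on the old system.
final_source = (initial_source - {"b.dat"}) | {"d.dat"}

# Final pass: d.dat is copied, but b.dat is never removed from the new system.
dest = sync_without_delete(final_source, dest)

# Files present on the new system that no longer exist on the old one.
stale = dest - final_source
print(sorted(stale))
```

Under this assumption, the stale files both reappear in the directory and count against quota, which matches the two symptoms listed above.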

June 5-6 MGHPCC pod 7c cooling updates - See partition list below

Completed
June 05, 2025 at 7:03 PM
The work on row 7c is complete. Returning idled nodes to normal service.
In progress
June 05, 2025 at 11:00 AM
Maintenance is now in progress
Planned
June 05, 2025 at 11:00 AM
There will be additional scheduled maintenance at MGHPCC between June 5th and 6th.
As part of the work during the MGHPCC outage, one of the Cooling Distribution Units (CDUs) in Pod 7c will be replaced. This will allow for future expansion into this space.
This work will run from Thursday June 5th until the evening of Friday June 6th. This means nodes whose names begin with holy7c02, 04, 06, 08, 10, or 12 will not come back online after the outage and will remain down until this CDU update is complete.
This impacts the following partitions. If you are using one of those partitions please use the public sapphire partition while your equipment is being serviced. These nodes will be returned to service once the CDU work is complete:
- blackhole
- blackhole_priority
- davies
- desai
- eddy
- huce_cascade
- huce_cascade_priority
- huttenhower
- jacobsen2
- janson
- janson_cascade
- ke
- lukin
- nguyen
- seas_compute
- shared
- tambe
- vishwanath
- whipple
- xlin
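
The node-name rule in this notice (names beginning with holy7c02, 04, 06, 08, 10, or 12) can be checked with a short script. This is a minimal sketch: the helper `is_affected_node` and the sample hostnames are hypothetical, and it assumes the listed numbers expand to the prefixes holy7c02 through holy7c12.

```python
# Prefixes expanded from the notice: holy7c02, 04, 06, 08, 10, 12.
AFFECTED_PREFIXES = tuple(f"holy7c{n:02d}" for n in (2, 4, 6, 8, 10, 12))


def is_affected_node(hostname):
    """Return True if the node's short name begins with a listed prefix."""
    # Drop any domain suffix and normalise case before comparing.
    short_name = hostname.split(".", 1)[0].lower()
    return short_name.startswith(AFFECTED_PREFIXES)


# Hypothetical node names, for illustration only.
for name in ("holy7c02101", "holy7c03101", "holy7c12405.rc.fas.harvard.edu"):
    print(name, is_affected_node(name))
```

`str.startswith` accepts a tuple of prefixes, so no explicit loop over the six prefixes is needed.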

2025 MGHPCC power downtime June 2-4, 2025

Completed
June 05, 2025 at 1:00 PM
Maintenance has completed successfully
In progress
June 02, 2025 at 1:00 PM
Maintenance is now in progress
Planned
June 02, 2025 at 1:00 PM
The yearly power downtime at our Holyoke data center, MGHPCC, has been scheduled.
This year's power downtime will take place on Tuesday June 3, 2025.
This will require FASRC to begin shutdown of our systems beginning at 9AM on Monday, June 2nd.
We have worked to reduce the total outage time this year.
We will begin power-up on Wednesday June 4th with an expected return to full service by 9AM Thursday June 5th.
- Monday June 2nd - Power-down begins at 9AM
- Tuesday June 3rd - Power out at MGHPCC
- Wednesday June 4th - Maintenance tasks and then power-up begins
- Thursday June 5th - Expected return to full service by 9AM
Maintenance:
During this downtime, Holylabs (/n/holylabs) will move to new hardware.
Starfish, Coldfront, and the Portal will be unavailable during the downtime.
For more details including a graphical timeline, please see: https://www.rc.fas.harvard.edu/events/2025-mghpcc-power-downtime/
Updates will be posted here on our status page: https://status.rc.fas.harvard.edu/
Note that you can subscribe to receive updates as they happen. On the status page, click Get Updates.
Notices and reminders will also be sent to all users via our mailing lists.

May 2025

Resolved
May 30, 2025 at 7:59 PM
This incident has been resolved.
Investigating
May 30, 2025 at 4:33 PM
Coldfront is undergoing maintenance and may experience service disruption in the course of this process.
ETA for resolution is end of business on Friday.

Starfish upgrade Thursday, May 29th from 5PM-6PM

Completed
May 30, 2025 at 2:27 PM
Maintenance has completed successfully.
Planned
May 29, 2025 at 9:00 PM
Starfish upgrade Thursday, May 29th from 5PM-6PM. Starfish will be unavailable during that time.

Resolved
May 27, 2025 at 11:51 PM
Maintenance complete.
Identified
May 27, 2025 at 6:28 PM
Coldfront is in need of database maintenance. Expect return to service by 5PM.

MGHPCC power work 5/21 - 5/23 - Some partitions will be at half capacity

Completed
May 23, 2025 at 7:00 PM
Maintenance has completed successfully
In progress
May 21, 2025 at 11:00 AM
Maintenance is now in progress
Planned
May 21, 2025 at 11:00 AM
The MGHPCC Holyoke data center will be performing power work on May 21st-23rd. This work will take out one half (or one 'side') of the power capacity for certain rows/racks, including our compute rows. Because of our power draw, one side does not supply enough power to keep each full rack running.
As such, we will be adding a reservation to idle half the nodes in the partitions listed below. A reservation will cause nodes to drain as jobs complete and stop scheduling new jobs on those nodes if they cannot be completed before the outage. This will allow us to idle and power down those nodes prior to the work and avoid potential blackout/brownout on those racks.
This will mean that these partitions will be up and available, but that half the nodes from each will be down (assuming an even number of nodes).
This work is part of an ongoing power capacity upgrade at MGHPCC. We expect this will be the last power work needed, and the facility will then provide enough additional power for future expansion as well as adding overhead for the current load.
The affected partitions are:
- arguelles_delgado
- bigmem_intermediate
- blackhole_gpu
- eddy
- gershman
- hejazi
- hernquist
- hoekstra
- huce_ice
- iaifi_gpu
- iaifi_gpu_requeue
- iaifi_priority
- jshapiro
- jshapiro_priority
- kempner
- kempner_requeue
- kempner_h100
- kempner_h100_priority
- kempner_h100_priority2
- kovac
- kozinsky
- kozinsky_gpu
- kozinsky_requeue
- ortegahernandez_ice
- rivas
- seas_compute
- seas_gpu
- siag_combo
- siag_gpu
- sur
- zhuang

Resolved
May 06, 2025 at 2:30 PM
We have reverted the changes to the Jupyter OOD app while we work on further development.
This incident is resolved.
Identified
May 06, 2025 at 1:56 PM
The Jupyter app for OOD is failing to start for certain workflows that load older versions of Python or modules. We are in the process of reverting Jupyter to fix the bug.

Apr 2025

Resolved
April 28, 2025 at 8:36 PM
Starfish is once again accepting logins.
Investigating
April 28, 2025 at 2:25 PM
Starfish is not allowing login due to a database issue. We have contacted the vendor and expect to resolve this soon.

Resolved
April 27, 2025 at 12:41 PM
Most nodes have been returned to service, and both Cannon and FASSE are operational. The remaining down nodes will be remediated by FASRC staff over the next several days.
Investigating
April 27, 2025 at 11:02 AM
At 1:05 AM EDT, MGHPCC lost power to compute due to a cooling failure. The chiller has been restored and power is back as of 5:43 AM. As a result, all compute is powered off and all running jobs were requeued. FASRC staff are triaging the down hardware and will bring it back online when it is ready. Jobs will start once FASRC staff have confirmed that everything is in a good state.

Resolved
April 16, 2025 at 2:40 PM
holylabs is back up.
Identified
April 16, 2025 at 2:27 PM
holylabs must be rebooted to resolve a stability/function issue.

Resolved
April 14, 2025 at 5:18 PM
Cannon boslogin and FASSE login nodes are back up and operational.
All holylogin nodes are still down for repair. Please see our posted incident for more updates: https://status.rc.fas.harvard.edu/cm97gyay90013dturk7fxg5pb
We apologize for the unexpected disruption.
Investigating
April 14, 2025 at 4:58 PM
Due to a configuration error, all cluster login nodes are rebooting and are temporarily unavailable. Please save any work immediately.

Resolved
April 14, 2025 at 6:48 PM
Hardware has been repaired and Holyoke login nodes are back online. Thanks for your patience.
Monitoring
April 09, 2025 at 3:15 PM
Holylogin chassis repair during maintenance was unsuccessful and replacement parts have been ordered.
- Holyoke login nodes (holylogin05-08) are down for hardware repair
- Only Boston login nodes are available (i.e., boslogin[05-08])
If you have holylogin hard-coded in your scripts, please update to login.rc.fas.harvard.edu or boslogin.rc.fas.harvard.edu for the time being, which will redirect you to an available login node.
As always, the best method for obtaining a login node is using login.rc.fas.harvard.edu, which will pick a node for you.
If you require a login node in a specific data center, use boslogin.rc.fas.harvard.edu (Boston) or (once they are back in service) holylogin.rc.fas.harvard.edu (Holyoke).
See also: Command line access with Terminal (login nodes) – FASRC DOCS
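
For scripts with holylogin hard-coded, the substitution recommended above can be sketched as follows. This is an illustrative helper, not an FASRC tool: the function name `rewrite_login_hosts` is hypothetical, and the pattern assumes hosts appear either as a short name (holylogin, holylogin05) or fully qualified under rc.fas.harvard.edu.

```python
import re

# Assumed forms: "holylogin", "holylogin05", or either followed by
# ".rc.fas.harvard.edu". Other spellings are not handled.
HOLYLOGIN = re.compile(r"\bholylogin\d*(?:\.rc\.fas\.harvard\.edu)?\b")

# Round-robin login name recommended in the notice.
REPLACEMENT = "login.rc.fas.harvard.edu"


def rewrite_login_hosts(script_text):
    """Return script_text with holylogin hosts pointed at the generic login name."""
    return HOLYLOGIN.sub(REPLACEMENT, script_text)


# Hypothetical script lines, for illustration only.
sample = "ssh user@holylogin05.rc.fas.harvard.edu\nscp data.tar holylogin:/tmp"
print(rewrite_login_hosts(sample))
```

Review the rewritten text before overwriting any script, since the pattern only covers the hostname forms listed in the comment.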
Resolved
April 09, 2025 at 3:01 PM
This incident was posted by mistake.

holylogin01-04 were replaced by holylogin05-08 some time back.

As always, the best method for obtaining a login node is using login.rc.fas.harvard.edu, which will pick a node for you.
Or if you require a login node in a specific data center, use boslogin.rc.fas.harvard.edu (Boston) or holylogin.rc.fas.harvard.edu (Holyoke).
See also: Command line access with Terminal (login nodes) – FASRC DOCS
Investigating
April 07, 2025 at 7:33 PM
Holylogin chassis repair during maintenance was unsuccessful and replacement parts have been ordered.
Audience:
- All cluster users
Impact:
- All holylogin servers will be down until further notice
- Only Boston login nodes are available (i.e., boslogin[05-08])
If you have holylogin hard-coded in your scripts, please update to login.rc.fas.harvard.edu or boslogin.rc.fas.harvard.edu for the time being, which will redirect you to an available login node.
Updates to follow as we have them.

Apr 2025 to Jun 2025

FAS Research Computing - Notice History