History

Degraded performance

Operational

June 2025

Resolved
June 20, 2025 at 6:27 PM
Coldfront is back up. This incident has been resolved.
Investigating
June 20, 2025 at 4:58 PM
Coldfront may be inaccessible. We are currently investigating this incident.

Resolved
June 16, 2025 at 8:26 PM
Allowed one week for the message to propagate. Closing this incident.
Identified
June 09, 2025 at 2:34 PM
While attempting to correct the over-quota/extra data issue on holylabs, an error in the sync command deleted files newly created since the cluster reopened (6/5/25 9AM) in 54 lab directories. We see no evidence that any other lab directories were affected.

Because of the scale of the original cleanup, and because the error was discovered after the fact, these deleted files regrettably cannot be recovered.
A list of affected /n/holylabs lab directories follows. If your lab is not on the list, it has not been identified as affected by this error:
acc_lab
alvarez_lab
avillar_lab
barnett_lab
bertoldi_lab
bol_lab
brenner_lab
cgolden_lab
charbonneau_lab
charrison_lab
chetty_lab
cnelya_lab
dam_lab
doshi-velez_lab
eisenstein_lab
enos_lab
eps_preceptors
glassman_lab
hanson_lab
hekstra_lab
holbrook_lab
iaifi_lab
idreos_lab
iebecker_lab
imai_lab
jacobsen_lab
jialiu_lab
junweil_lab
kaxiras_lab
kdbrantley_lab
kempner_dev
king_lab
kiyoul_lab
konkle_lab
koumoutsakos_lab
kozinsky_lab
kramer_lab
maustern_lab
nliu_lab
pallais_lab
park_lab
pierce_lab
protopapas_lab
pslade_lab
shro_lab
sitanc_lab
smousavih_lab
sneel_lab
snyder_lab
sompolinsky_lab
tamano_lab
ylei_lab
zickler_lab
Investigating
June 09, 2025 at 2:07 PM
We are currently investigating an issue on holylabs where some labs have noticed newly created files are missing.

We will update this incident with more info as soon as possible.

Resolved
June 06, 2025 at 3:50 PM
During the downtime, holylabs was migrated to our new Vast filesystem. During this migration, data was synced from the old system to the new one, and a final sync was done at power-up while the systems were not available to end users.
We have since found an issue for some labs that will present itself in one of two ways (or both, for some):
1 - You are able to use your holylabs lab directory but you notice files that were previously deleted
2 - You cannot use your holylabs lab directory because of over-quota errors
Our storage team is working to resolve this issue by temporarily suspending quotas on holylabs to prevent over-quota errors, and by removing the data that should not have remained after the final sync(s) so that lab directories return below quota.
This work has already begun but may take several days to complete. Updates will be posted here.

June 5-6 MGHPCC pod 7c cooling updates - See partition list below

Completed
June 05, 2025 at 7:03 PM
The work on row 7c is complete. Returning idled nodes to normal service.
In progress
June 05, 2025 at 11:00 AM
Maintenance is now in progress
Scheduled
June 05, 2025 at 11:00 AM
There will be additional scheduled maintenance at MGHPCC between June 5th and 6th.
As part of the work during the MGHPCC outage, one of the Cooling Distribution Units (CDUs) in Pod 7c will be replaced. This will allow for future expansion into this space.
This work will run from Thursday June 5th until the evening of Friday June 6th. This means nodes whose names begin with holy7c02, 04, 06, 08, 10, or 12 will not come back online after the outage and will remain down until this CDU update is complete.
This impacts the following partitions. If you are using one of those partitions please use the public sapphire partition while your equipment is being serviced. These nodes will be returned to service once the CDU work is complete:
- blackhole
- blackhole_priority
- davies
- desai
- eddy
- huce_cascade
- huce_cascade_priority
- huttenhower
- jacobsen2
- janson
- janson_cascade
- ke
- lukin
- nguyen
- seas_compute
- shared
- tambe
- vishwanath
- whipple
- xlin
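
As a minimal sketch of the node-name rule in the notice above, the following Python checks whether a node name begins with one of the listed prefixes (holy7c02, 04, 06, 08, 10, 12). The function name and the sample node names are hypothetical and used only for illustration; the full node naming scheme is not stated in the notice.

```python
# Prefixes taken from the notice: holy7c02, 04, 06, 08, 10, 12.
AFFECTED_PREFIXES = tuple(f"holy7c{n:02d}" for n in (2, 4, 6, 8, 10, 12))


def is_affected(node_name: str) -> bool:
    """Return True if the node name begins with an affected prefix."""
    # str.startswith accepts a tuple and matches any of its members.
    return node_name.startswith(AFFECTED_PREFIXES)


# Hypothetical node names, for illustration only.
for name in ("holy7c02101", "holy7c03101", "holy8a24101"):
    print(name, is_affected(name))
```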

2025 MGHPCC power downtime June 2-4, 2025

Completed
June 05, 2025 at 1:00 PM
Maintenance has completed successfully
In progress
June 02, 2025 at 1:00 PM
Maintenance is now in progress
Scheduled
June 02, 2025 at 1:00 PM
The yearly power downtime at our Holyoke data center, MGHPCC, has been scheduled.
This year's power downtime will take place on Tuesday June 3, 2025.
This will require FASRC to begin shutdown of our systems beginning at 9AM on Monday, June 2nd.
We have worked to reduce the total outage time this year.
We will begin power-up on Wednesday June 4th with an expected return to full service by 9AM Thursday June 5th.
- Monday June 2nd - Power-down begins at 9AM
- Tuesday June 3rd - Power out at MGHPCC
- Wednesday June 4th - Maintenance tasks and then power-up begins
- Thursday June 5th - Expected return to full service by 9AM
Maintenance:
During this downtime, Holylabs (/n/holylabs) will move to new hardware.
Starfish, Coldfront, and the Portal will be unavailable during the downtime.
For more details including a graphical timeline, please see: https://www.rc.fas.harvard.edu/events/2025-mghpcc-power-downtime/
Updates will be posted here on our status page: https://status.rc.fas.harvard.edu/
Note that you can subscribe to receive updates as they happen. On the status page, click Get Updates.
Notices and reminders will also be sent to all users via our mailing lists.

May 2025

Resolved
May 30, 2025 at 7:59 PM
This incident has been resolved.
Investigating
May 30, 2025 at 4:33 PM
Coldfront is undergoing maintenance and may experience service disruption in the course of this process.
ETA for resolution is end of business on Friday.

Starfish upgrade Thursday, May 29th from 5PM-6PM

Completed
May 30, 2025 at 2:27 PM
Maintenance has completed successfully.
Scheduled
May 29, 2025 at 9:00 PM
Starfish upgrade Thursday, May 29th from 5PM-6PM. Starfish will be unavailable during that time.

Resolved
May 27, 2025 at 11:51 PM
Maintenance complete.
Identified
May 27, 2025 at 6:28 PM
Coldfront is in need of database maintenance. Expect return to service by 5PM.

MGHPCC power work 5/21 - 5/23 - Some partitions will be at half capacity

Completed
May 23, 2025 at 7:00 PM
Maintenance has completed successfully
In progress
May 21, 2025 at 11:00 AM
Maintenance is now in progress
Scheduled
May 21, 2025 at 11:00 AM
The MGHPCC Holyoke data center will be performing power work on May 21st-23rd. This work will take out one half (or one 'side') of the power capacity for certain rows/racks, including our compute rows. Because of our power draw, one side does not supply enough power to keep each full rack running.
As such, we will be adding a reservation to idle half the nodes in the partitions listed below. A reservation causes nodes to drain as jobs complete and stops new jobs from being scheduled on those nodes if they cannot be completed before the outage. This will allow us to idle and power down those nodes prior to the work and avoid potential blackout/brownout on those racks.
This means that these partitions will be up and available, but that half the nodes in each will be down (assuming an even number of nodes).
This work is part of an ongoing power capacity upgrade at MGHPCC. We expect this will be the last power work needed, and the facility will then provide enough additional power for future expansion as well as adding overhead for the current load.
The affected partitions are:
- arguelles_delgado
- bigmem_intermediate
- blackhole_gpu
- eddy
- gershman
- hejazi
- hernquist
- hoekstra
- huce_ice
- iaifi_gpu
- iaifi_gpu_requeue
- iaifi_priority
- jshapiro
- jshapiro_priority
- kempner
- kempner_requeue
- kempner_h100
- kempner_h100_priority
- kempner_h100_priority2
- kovac
- kozinsky
- kozinsky_gpu
- kozinsky_requeue
- ortegahernandez_ice
- rivas
- seas_compute
- seas_gpu
- siag_combo
- siag_gpu
- sur
- zhuang
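
The reservation behavior described above (new jobs are not scheduled on reserved nodes if they cannot finish before the outage) can be sketched roughly as a time comparison. This is a simplified illustration, not the scheduler's actual logic. The function name is hypothetical, the reservation start is assumed from the 11:00 AM May 21 timestamp on this notice, and the submission time is made up.

```python
from datetime import datetime, timedelta


def fits_before(start: datetime, time_limit: timedelta,
                reservation_start: datetime) -> bool:
    """Return True if a job starting at `start` with the requested
    time limit would end no later than the reservation start."""
    return start + time_limit <= reservation_start


# Assumed reservation start (naive local time), taken from the
# timestamp on this notice rather than from scheduler configuration.
reservation_start = datetime(2025, 5, 21, 11, 0)

# Hypothetical job start time, for illustration only.
now = datetime(2025, 5, 20, 11, 0)

print(fits_before(now, timedelta(hours=12), reservation_start))  # True
print(fits_before(now, timedelta(days=2), reservation_start))    # False
```

Under this simplified model, requesting a shorter time limit is what lets a job fit ahead of the reserved window.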

Resolved
May 06, 2025 at 2:30 PM
We have reverted the changes to the Jupyter OOD app while we work on further development.
This incident is resolved.
Identified
May 06, 2025 at 1:56 PM
The Jupyter app for OOD is failing to start for certain workflows that load older versions of Python or modules. We are in the process of reverting Jupyter to fix the bug.

April 2025

Resolved
April 28, 2025 at 8:36 PM
Starfish is once again accepting logins.
Investigating
April 28, 2025 at 2:25 PM
Starfish is not allowing login due to a database issue. We have contacted the vendor and expect to resolve this soon.

Resolved
April 27, 2025 at 12:41 PM
Most nodes have been returned to service; both Cannon and FASSE are operational. Remaining down nodes will be remediated by FASRC staff over the next several days.
Investigating
April 27, 2025 at 11:02 AM
At 1:05am EDT MGHPCC lost power to compute due to a cooling failure. The chiller has been restored and power is back as of 5:43am. As a result, all compute was powered off and all running jobs were requeued. FASRC staff are triaging the down hardware and will bring it back online when it is ready. Jobs will start once FASRC staff have confirmed that everything is in a good state.

Resolved
April 16, 2025 at 2:40 PM
holylabs is back up.
Identified
April 16, 2025 at 2:27 PM
holylabs must be rebooted to resolve a stability/function issue.

Resolved
April 14, 2025 at 5:18 PM
Cannon boslogin and FASSE login nodes are back up and operational.
All holylogin nodes are still down for repair. Please see our posted incident for more updates: https://status.rc.fas.harvard.edu/cm97gyay90013dturk7fxg5pb
We apologize for the unexpected disruption.
Investigating
April 14, 2025 at 4:58 PM
Due to a configuration error, all cluster login nodes are rebooting and are temporarily unavailable. Please save any work immediately.

Resolved
April 14, 2025 at 6:48 PM
Hardware has been repaired and Holyoke login nodes are back online. Thanks for your patience.
Monitoring
April 09, 2025 at 3:15 PM
Holylogin chassis repair during maintenance was unsuccessful and replacement parts have been ordered.
- Holyoke login nodes (holylogin05-08) are down for hardware repair
- Only Boston login nodes available (i.e., boslogin[05-08])
If you have holylogin hard-coded in your scripts, please update to login.rc.fas.harvard.edu or boslogin.rc.fas.harvard.edu for the time being, which will redirect you to an available login node.
As always, the best method for obtaining a login node is using login.rc.fas.harvard.edu, which will pick a node for you.
If you require a login node in a specific data center, use boslogin.rc.fas.harvard.edu (Boston) or (once they are back in service) holylogin.rc.fas.harvard.edu (Holyoke).
See also: Command line access with Terminal (login nodes) – FASRC DOCS
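
For scripts that hard-code a holylogin host, a minimal Python sketch along these lines could locate and rewrite the references. The regular expression, function names, and sample script contents are assumptions made for illustration; only the hostnames come from the notice above.

```python
import re

# Matches short names such as holylogin05 as well as the full
# holylogin.rc.fas.harvard.edu form. The pattern is an assumption
# about how the hostname might appear in scripts.
HOLYLOGIN = re.compile(r"\bholylogin\d*(?:\.rc\.fas\.harvard\.edu)?")


def find_hardcoded(text: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs that mention a holylogin host."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if HOLYLOGIN.search(line)
    ]


def use_generic_login(text: str,
                      target: str = "login.rc.fas.harvard.edu") -> str:
    """Replace every holylogin reference with the generic login name."""
    return HOLYLOGIN.sub(target, text)


# Hypothetical script contents, for illustration only.
script = """#!/bin/bash
rsync -a results/ user@holylogin05:/tmp/results/
ssh user@boslogin.rc.fas.harvard.edu hostname
"""

for number, line in find_hardcoded(script):
    print(number, line)
print(use_generic_login(script))
```

Reviewing the matches from `find_hardcoded` before applying the rewrite is the safer order, since a script may mention the hostname in a comment or log message where no change is wanted.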
Resolved
April 09, 2025 at 3:01 PM
This incident was posted by mistake.

holylogin01-04 were replaced by holylogin05-08 some time back.

As always, the best method for obtaining a login node is using login.rc.fas.harvard.edu, which will pick a node for you.
Or if you require a login node in a specific data center, use boslogin.rc.fas.harvard.edu (Boston) or holylogin.rc.fas.harvard.edu (Holyoke).
See also: Command line access with Terminal (login nodes) – FASRC DOCS
Investigating
April 07, 2025 at 7:33 PM
Holylogin chassis repair during maintenance was unsuccessful and replacement parts have been ordered.
Audience:
- All cluster users
Impact:
- All holylogin** servers will be down until further notice
- Only Boston login nodes available (i.e., boslogin[05-08])
If you have holylogin hard-coded in your scripts, please update to login.rc.fas.harvard.edu or boslogin.rc.fas.harvard.edu for the time being, which will redirect you to an available login node.
Updates to follow as we have them.

April 2025 to June 2025

FAS Research Computing - History
