Notice history

Operational

May 2024

Resolved
June 03, 2024 at 2:53 PM
Resolved
June 03, 2024 at 2:53 PM
NESE maintenance has resolved and is now back in service
Monitoring
May 30, 2024 at 3:39 PM
Monitoring
May 30, 2024 at 3:39 PM
A note from NESE. Their maintenance has been delayed by hardware issues. ETA now Monday 6/3
Dear All,

NESE Tape system upgrade is currently in progress. While IBM hardware
team works on TS4500 library tape frame expansion and IBM software team
works on ESS and Archive software and firmware upgrades, the work
progress has been slowed down due to unforeseen hardware issues.We now expect to bring the tape service back into production this Monday
morning. We apologize for any inconvenience caused by the delay.
Identified
May 29, 2024 at 4:58 PM
Identified
May 29, 2024 at 4:58 PM
Due to maintenance at our tape partner, NESE (Northeast Storage Exchange), access to tape allocations will be unavailable until at least late Thursday (5/30). Normal operations will resume by Friday (5/31).
If you continue to have issues with a Globus tape endpoint on Friday, please contact FASRC or NESE

Annual MGHPCC/Holyoke data center power downtime - May 21-24 2024

Completed
May 24, 2024 at 9:50 PM
Completed
May 24, 2024 at 9:50 PM
2024 MGHPCC downtime complete
DOWNTIME COMPLETE
The annual multi-day power downtime at MGHPCC (https://www.rc.fas.harvard.edu/blog/2024-mghpcc-power-downtime/) is complete (with any exceptions noted below). Normal service resumes today (Friday May 24th) at 5pm.
The cluster has been updated to Rocky Linux 8.9. Several network, InfiniBand, computer, and storage firmware updates were installed. Available security updates were also installed.
CANNON NODES
More than 90% of nodes are up and all partitions are enabled. If your specialty partition has a downed node, we will attend to this on Tuesday.
FASSE OOD
Some updates are still propagating. If your FASSE Open OnDemand/VDI session does not work initially, please wait or retry your job/session.
POST-DOWNTIME SUPPORT
If you have any further concerns or unanswered questions please submit a help ticket (https://portal.rc.fas.harvard.edu/rcrt/submit_ticket) and we will do our best to respond quickly. Please bear in mind it is a long weekend, so lingering issues may not be dealt with until Tuesday.
Also, have a good long Memorial Day weekend!
Thanks,
FAS Research Computing
https://www.rc.fas.harvard.edu/
https://docs.rc.fas.harvard.edu/
https://status.rc.fas.harvard.edu/
rchelp@rc.fas.harvard.edu
Update
May 24, 2024 at 9:08 PM
In progress
May 24, 2024 at 9:08 PM
We are currently delayed opening the cluster due to some lingering issues.
We will re-open as soon as possible or update again at 6pm.
Update
May 24, 2024 at 1:47 PM
In progress
May 24, 2024 at 1:47 PM
Power work completed by facility. Currently on schedule for powerup and return to service. ETA 5pm.
In progress
May 21, 2024 at 1:00 PM
In progress
May 21, 2024 at 1:00 PM
Maintenance is now in progress
Planned
May 21, 2024 at 1:00 PM
Planned
May 21, 2024 at 1:00 PM
The 2024 MGHPCC data center annual power downtime will take place May 21-24, 2024.
We will begin our shutdown on Tuesday May 21st and expect a return to service by 5PM Friday May 24th.
- Jobs: Please plan ahead as all still running jobs on the morning of May 21st will be stopped and canceled and will need to be resubmitted after the downtime. Pending jobs will remain in the queue until the cluster returns to regular service on May 24th.
- Access: The cluster, scheduler, login, and OoD nodes will be unavailable for the duration of the downtime. New lab and account requests should wait until after the downtime.
- Storage: All Holyoke storage will be powered down and unavailable for the duration of the downtime. Boston storage will remain online, but your ability to access it may be impacted and network changes may briefly affect its availability.
Further details, an explanation for this year's change in scheduling, a visual timeline, and a list of maintenance tasks overview can be found at:
https://www.rc.fas.harvard.edu/blog/2024-mghpcc-power-downtime/
Progress of the downtime will be posted here on our status page during the event. Note that you can subscribe to receive updates as they happen. Click Get Updates in the upper right.
MAJOR TASK OVERVIEW
- OS upgrade to Rocky 8.9 - Point upgrade, no code rebuilds will be required. Switch from system OFED to Mellanox OFED on nodes for improved performance
- Infiniband (network) upgrades
- BIOS updates (various)
- Storage firmware updates
- Network Maintenance
- Decommission old nodes (targets contacted)
- Additional minor one-off updates and maintenance (cable swap, reboots, etc.)
Thanks,
FAS Research Computing
https://www.rc.fas.harvard.edu/
https://docs.rc.fas.harvard.edu/
https://status.rc.fas.harvard.edu/

Resolved
May 20, 2024 at 5:25 PM
Resolved
May 20, 2024 at 5:25 PM
This incident has been resolved.
Identified
May 20, 2024 at 5:08 PM
Identified
May 20, 2024 at 5:08 PM
Delayed email is the root of this issue. We are working to resolve or work around this issue.
Investigating
May 20, 2024 at 4:30 PM
Investigating
May 20, 2024 at 4:30 PM
The ticketing system (which uses HUIT services) is experiencing lag. New ticket autoreplies and ticket responses may be delayed. We are currently investigating this incident.

Resolved
May 18, 2024 at 3:44 AM
Resolved
May 18, 2024 at 3:44 AM
This incident has been resolved.
Investigating
May 17, 2024 at 8:51 PM
Investigating
May 17, 2024 at 8:51 PM
We are still unable to resolve the issue with these nodes and are working with the facility, networking, and our staff to find a solution. The affected partitions (noted in previous update below) will be resource-constrained and continue to be slow or unable to queue new jobs.

If you are using a partition that cannot queue new jobs, please consider adding additional partitions to your job: https://docs.rc.fas.harvard.edu/kb/running-jobs/#Slurm_partitions

Also, a reminder that the data center power downtime will begin Tuesday morning. So any new jobs scheduled for more than 3 days will not complete: https://www.rc.fas.harvard.edu/blog/2024-mghpcc-power-downtime/
Identified
May 17, 2024 at 5:23 PM
Identified
May 17, 2024 at 5:23 PM
We are still working on the root cause and resolution for these downed nodes.
Partitions with one or more affected nodes [involves multiple nodes unless denoted as (1) ]:
arguelles_delgado_gpu (1)
hsph
joonholee
jshapiro_sapphire
lichtmandce01
bigmem
gpu_requeue (1)
intermediate
sapphire
serial_requeue (1)
shared (1)
test
yao / yao_priority
use 'sinfo -p [partition name]' if you wish to see the down nodes in particular queue
Investigating
May 17, 2024 at 2:30 PM
Investigating
May 17, 2024 at 2:30 PM
We are currently investigating this incident. An unknown outage has downed many nodes in row 8A of our data center. More information to follow.
Includes nodes from the sapphire, test, gpu, and other partitions.

Resolved
May 15, 2024 at 2:32 PM
Resolved
May 15, 2024 at 2:32 PM
holystore01 is back to normal operation. Please report any continued issues with holystore01 to FASRC.
Investigating
May 15, 2024 at 1:00 AM
Investigating
May 15, 2024 at 1:00 AM
We are currently investigating a hardware problem on holystore01. This may affect file access, conda/mamba environments, and some logins.

Apr 2024

Resolved
April 30, 2024 at 2:32 PM
Resolved
April 30, 2024 at 2:32 PM
The high I/O job has been identified. Scratch is back to normal operaiton.
Investigating
April 30, 2024 at 1:35 PM
Investigating
April 30, 2024 at 1:35 PM
Scratch performance is degraded.
holyscratch01 is under abnormally high I/O load.
We are working to determine which job is the cause.

Resolved
May 15, 2024 at 1:55 PM
Resolved
May 15, 2024 at 1:55 PM
All holynfs01 shares have been migrated or evacuated. Closing this incident. Holynfs01 remains online in read-only mode, but no shares rely on it now.
Investigating
April 26, 2024 at 7:54 PM
Investigating
April 26, 2024 at 7:54 PM
holynfs01 has experienced a hardware failure. We have set all shares to read-only mode while we retrieve data from this server.
If your group requires a writable space in the interim, please reach out and we will set one up for you or use scratch space for now.
This only affects a handful of NCF shares, most inactive. But we are notifying all to cast a wide net.

Resolved
April 25, 2024 at 4:15 PM
Resolved
April 25, 2024 at 4:15 PM
This incident has been resolved.
Investigating
April 25, 2024 at 3:28 PM
Investigating
April 25, 2024 at 3:28 PM
Due to stale file mounts for multiple groups, holylogin[01-04] will need to be rebooted at noon. Please save all work prior to the reboot.

Resolved
April 25, 2024 at 3:17 PM
Resolved
April 25, 2024 at 3:17 PM
gpu_test and FASSE GPU partitions are back. We are looking into replacement power supply units to mitigate this issue.
Investigating
April 25, 2024 at 1:42 PM
Investigating
April 25, 2024 at 1:42 PM
gpu_test is down again due to an IB switch failing
the gpu partition in FASSE is also down

Resolved
April 24, 2024 at 2:58 PM
Resolved
April 24, 2024 at 2:58 PM
gpu_test is now operational and can be used.
This incident has been resolved.
Investigating
April 24, 2024 at 2:15 PM
Investigating
April 24, 2024 at 2:15 PM
An InfiniBand switch has failed in the cabinet housing the gpu_test nodes.
Please use gpu_requeue or gpu in the meantime.
We will update this incident when we have an ETA.

Mar 2024

Resolved
March 21, 2024 at 12:00 PM
Resolved
March 21, 2024 at 12:00 PM
There was a brief disruption in the connectivity from MGHPCC to the outside world/Internet. During this period outside attempts to connect to services hosted in Holyoke would time out. Internal data center network remained normal and no jobs were affected.

Resolved
March 19, 2024 at 1:35 PM
Resolved
March 19, 2024 at 1:35 PM
After making some hardware and network changes, we monitored holylabs yesterday and overnight and have determined that it is stable again.
We will continue to monitor but currently believe this issue has been resolved.
Investigating
March 18, 2024 at 1:51 PM
Investigating
March 18, 2024 at 1:51 PM
Holylabs was again unstable over the weekend. We are actively working this issue and will update as soon as possible.
Update
March 15, 2024 at 1:31 PM
Update
March 15, 2024 at 1:31 PM
Holylabs locked up again overnight. We are investigating.
Monitoring
March 14, 2024 at 1:38 PM
Monitoring
March 14, 2024 at 1:38 PM
Holylabs became stuck overnight and was restarted. It is back up and we are monitoring.

Resolved
March 18, 2024 at 3:21 PM
Resolved
March 18, 2024 at 3:21 PM
holyscratch01 is currently stable. However, it continues to be under heavy utilization as a matter of course due to job loads.
There is a plan underway to replace holyscratch01, but no ETA at this time as we are still qualifying and planning purchasing. We will notify the community when an ETA is known. Until then please be aware that load on scratch may continue along this trend. Thanks for your understanding.
Monitoring
March 13, 2024 at 2:15 PM
Monitoring
March 13, 2024 at 2:15 PM
Performance has improved, but we are still monitoring some high loads across object storage units.
Investigating
March 12, 2024 at 7:54 PM
Investigating
March 12, 2024 at 7:54 PM
holyscratch01 performance degraded
We are currently investigating this incident.

Resolved
March 04, 2024 at 5:55 PM
Resolved
March 04, 2024 at 5:55 PM
VMs are returning to operation. If you receive an error (on Portal, Coldfront, etc) please wait a few minutes and try again.
Ticket system is online and starting to receive delayed emails.
Investigating
March 04, 2024 at 4:41 PM
Investigating
March 04, 2024 at 4:41 PM
The RT ticket system is currently offline due to a VM issue.
Any emails sent to the ticket system will eventually be delivered once it recovers, but until then expect delayed response.

Resolved
March 04, 2024 at 9:44 PM
Resolved
March 04, 2024 at 9:44 PM
This incident has been resolved.
Monitoring
March 04, 2024 at 4:21 PM
Monitoring
March 04, 2024 at 4:21 PM
Informational Notice
The Slurm upgrade to 23.11.4 was completed successfully during maintenance. However a complication with the automation of Slurm's cryptographic keys occurred during the upgrade which caused nodes to lose the ability to talk to the Slurm master. The Slurm master therefore viewed those nodes as down and requeued their jobs.
All jobs on Cannon and FASSE were requeued.
This is deeply regrettable but the chain of events which caused this could not be foreseen.
To check the status of your jobs, see the common Slurm commands at:
https://docs.rc.fas.harvard.edu/kb/convenient-slurm-commands/#Information_on_jobs
FAS Research Computing
https://docs.rc.fas.harvard.edu/
rchelp@rc.fas.harvard.edu

Mar 2024 to May 2024

FAS Research Computing - Notice history

Notice history

May 2024

2024 MGHPCC downtime complete

MAJOR TASK OVERVIEW

Apr 2024

Mar 2024