Notification History

June 2024

Resolved
June 27, 2024 at 18:40
Holylabs has finished rebooting and should be accessible. This incident has been resolved.
Investigating
June 27, 2024 at 18:28
Due to mounting and access issues, holylabs needs to be rebooted. It will be unavailable during this time. We will update once it is back up.

Resolved
June 26, 2024 at 17:52
This incident has been resolved. Both sites are working normally.
Investigating
June 26, 2024 at 17:46
https://www.rc.fas.harvard.edu/ and https://docs.rc.fas.harvard.edu/ are offline.
We are currently investigating this issue.

Resolved
June 24, 2024 at 15:11
This incident has been resolved.
Investigating
June 24, 2024 at 14:30
Global scratch (/n/holyscratch01) is currently experiencing instability.
We are currently investigating this incident.

MGHPCC Pod 8A Power Upgrade June 24 will idle some Cannon nodes

Completed
June 25, 2024 at 04:00
Maintenance has completed successfully.
In Progress
June 24, 2024 at 16:01
Maintenance is now in progress
Scheduled
June 24, 2024 at 04:01
MGHPCC will be performing power upgrades on Pod 8A in order to increase density and allow more nodes to be added in that Pod's rows. Similar to the May 13th work, this means that we will be idling half the nodes in 8A on two dates: June 17 and June 24th.
These are all-day events, meaning that the nodes in question will not be available for the full 24 hours of each day. The idling is done via reservations, so no jobs will be canceled, but nodes will be drained and users may notice their jobs pending longer than normal as the scheduler idles these nodes.
Where possible, please use or include other partitions in your job scripts and plan accordingly for any new or long-running jobs during that period: https://docs.rc.fas.harvard.edu/kb/running-jobs/#Slurm_partitions
This affects the Cannon cluster. FASSE is not affected.
Impacted partitions are:
arguelles_delgado_gpu
bigmem_intermediate
bigmem
blackhole_gpu
eddy
enos
gershman
gpu
hejazi
hernquist_ice
hoekstra
hsph
huce_ice
iaifi_gpu
iaifi_gpu_priority
iaifi_gpu_requeue
intermediate
itc_gpu
itc_gpu_requeue
joonholee
jshapiro
jshapiro_priority
jshapiro_sapphire
kempner
kempner_dev
kempner_h100
kempner_requeue
kempner_reservation
kovac
kozinsky
kozinsky_gpu
kozinsky_priority
kozinsky_requeue
murphy_ice
ortegahernandez_ice
sapphire
seas_compute
seas_gpu
siag
siag_combo
siag_gpu
sur
test
yao
yao_priority
zhuang
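
As a rough sketch of the advice above to include other partitions in a job script: Slurm's `--partition` option accepts a comma-separated list of partition names. The names `shared` and `serial_requeue` below are illustrative only (they are not in the impacted list, but whether your account can use them is an assumption), and `partition_list` and `my_job.sh` are hypothetical names introduced for this example.

```shell
#!/bin/bash
# Hypothetical helper: join partition names into the comma-separated
# form that Slurm's --partition option accepts.
partition_list() {
    local IFS=,
    echo "$*"
}

# Illustrative partitions only; substitute ones your account can use.
PARTITIONS=$(partition_list shared serial_requeue)

# Print the submit command instead of running it, so the sketch is safe
# to try on a machine without Slurm. my_job.sh is a placeholder name.
echo "sbatch --partition=${PARTITIONS} my_job.sh"
```

The same list can also be placed inside the job script itself as an `#SBATCH --partition=...` directive.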

Resolved
June 20, 2024 at 21:13
Coldfront is operating normally and the root cause has been determined and confirmed to be resolved.
Investigating
June 20, 2024 at 16:00
Coldfront is currently inaccessible to users. We are investigating this incident.

May 2024

Resolved
June 3, 2024 at 14:53
NESE maintenance has completed and the tape service is now back in service.
Monitoring
May 30, 2024 at 15:39
A note from NESE. Their maintenance has been delayed by hardware issues. ETA now Monday 6/3
Dear All,

NESE Tape system upgrade is currently in progress. While IBM hardware team works on TS4500 library tape frame expansion and IBM software team works on ESS and Archive software and firmware upgrades, the work progress has been slowed down due to unforeseen hardware issues. We now expect to bring the tape service back into production this Monday morning. We apologize for any inconvenience caused by the delay.
Identified
May 29, 2024 at 16:58
Due to maintenance at our tape partner, NESE (Northeast Storage Exchange), access to tape allocations will be unavailable until at least late Thursday (5/30). Normal operations will resume by Friday (5/31).
If you continue to have issues with a Globus tape endpoint on Friday, please contact FASRC or NESE.

Annual MGHPCC/Holyoke data center power downtime - May 21-24 2024

Completed
May 24, 2024 at 21:50
2024 MGHPCC downtime complete
DOWNTIME COMPLETE
The annual multi-day power downtime at MGHPCC (https://www.rc.fas.harvard.edu/blog/2024-mghpcc-power-downtime/) is complete (with any exceptions noted below). Normal service resumes today (Friday May 24th) at 5pm.
The cluster has been updated to Rocky Linux 8.9. Several network, InfiniBand, computer, and storage firmware updates were installed. Available security updates were also installed.
CANNON NODES
More than 90% of nodes are up and all partitions are enabled. If your specialty partition has a downed node, we will attend to this on Tuesday.
FASSE OOD
Some updates are still propagating. If your FASSE Open OnDemand/VDI session does not work initially, please wait or retry your job/session.
POST-DOWNTIME SUPPORT
If you have any further concerns or unanswered questions please submit a help ticket (https://portal.rc.fas.harvard.edu/rcrt/submit_ticket) and we will do our best to respond quickly. Please bear in mind it is a long weekend, so lingering issues may not be dealt with until Tuesday.
Also, have a good long Memorial Day weekend!
Thanks,
FAS Research Computing
https://www.rc.fas.harvard.edu/
https://docs.rc.fas.harvard.edu/
https://status.rc.fas.harvard.edu/
rchelp@rc.fas.harvard.edu
Update
May 24, 2024 at 21:08
Opening of the cluster is currently delayed due to some lingering issues.
We will re-open as soon as possible or update again at 6pm.
Update
May 24, 2024 at 13:47
Power work completed by facility. Currently on schedule for powerup and return to service. ETA 5pm.
In Progress
May 21, 2024 at 13:00
Maintenance is now in progress
Scheduled
May 21, 2024 at 13:00
The 2024 MGHPCC data center annual power downtime will take place May 21-24, 2024.
We will begin our shutdown on Tuesday May 21st and expect a return to service by 5PM Friday May 24th.
- Jobs: Please plan ahead, as all jobs still running on the morning of May 21st will be stopped and canceled and will need to be resubmitted after the downtime. Pending jobs will remain in the queue until the cluster returns to regular service on May 24th.
- Access: The cluster, scheduler, login, and OoD nodes will be unavailable for the duration of the downtime. New lab and account requests should wait until after the downtime.
- Storage: All Holyoke storage will be powered down and unavailable for the duration of the downtime. Boston storage will remain online, but your ability to access it may be impacted and network changes may briefly affect its availability.
Further details, an explanation for this year's change in scheduling, a visual timeline, and an overview of maintenance tasks can be found at:
https://www.rc.fas.harvard.edu/blog/2024-mghpcc-power-downtime/
Progress of the downtime will be posted here on our status page during the event. Note that you can subscribe to receive updates as they happen. Click Get Updates in the upper right.
MAJOR TASK OVERVIEW
- OS upgrade to Rocky 8.9 - Point upgrade, no code rebuilds will be required. Switch from system OFED to Mellanox OFED on nodes for improved performance
- InfiniBand (network) upgrades
- BIOS updates (various)
- Storage firmware updates
- Network Maintenance
- Decommission old nodes (targets contacted)
- Additional minor one-off updates and maintenance (cable swap, reboots, etc.)
Thanks,
FAS Research Computing
https://www.rc.fas.harvard.edu/
https://docs.rc.fas.harvard.edu/
https://status.rc.fas.harvard.edu/

Resolved
May 20, 2024 at 17:25
This incident has been resolved.
Identified
May 20, 2024 at 17:08
Delayed email is the root of this issue. We are working to resolve or work around this issue.
Investigating
May 20, 2024 at 16:30
The ticketing system (which uses HUIT services) is experiencing lag. New ticket autoreplies and ticket responses may be delayed. We are currently investigating this incident.

Resolved
May 18, 2024 at 03:44
This incident has been resolved.
Investigating
May 17, 2024 at 20:51
We are still unable to resolve the issue with these nodes and are working with the facility, networking, and our staff to find a solution. The affected partitions (noted in previous update below) will be resource-constrained and continue to be slow or unable to queue new jobs.

If you are using a partition that cannot queue new jobs, please consider adding additional partitions to your job: https://docs.rc.fas.harvard.edu/kb/running-jobs/#Slurm_partitions

Also, a reminder that the data center power downtime will begin Tuesday morning. So any new jobs scheduled for more than 3 days will not complete: https://www.rc.fas.harvard.edu/blog/2024-mghpcc-power-downtime/
Identified
May 17, 2024 at 17:23
We are still working on the root cause and resolution for these downed nodes.
Partitions with one or more affected nodes [multiple nodes are affected unless marked (1)]:
arguelles_delgado_gpu (1)
hsph
joonholee
jshapiro_sapphire
lichtmandce01
bigmem
gpu_requeue (1)
intermediate
sapphire
serial_requeue (1)
shared (1)
test
yao / yao_priority
Use 'sinfo -p [partition name]' if you wish to see the down nodes in a particular queue.
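
A minimal sketch of narrowing that `sinfo` query to problem nodes, assuming a standard Slurm installation: `-t` filters by node state and `-N` prints one line per node. The helper name `down_nodes_cmd` is hypothetical, and `test` is used only because it appears in the list above.

```shell
#!/bin/bash
# Hypothetical helper: build (but do not run) a sinfo command that lists
# down or draining nodes in one partition.
#   -p  selects the partition
#   -t  filters by node state
#   -N  prints one line per node
down_nodes_cmd() {
    echo "sinfo -p $1 -t down,drain -N"
}

# Print the command for the "test" partition; paste the printed line
# into a shell on a login node to run it.
down_nodes_cmd test
```

Printing the command rather than executing it keeps the sketch runnable on machines where Slurm is not installed.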
Investigating
May 17, 2024 at 14:30
We are currently investigating this incident. An unknown outage has downed many nodes in row 8A of our data center. More information to follow.
Includes nodes from the sapphire, test, gpu, and other partitions.

Resolved
May 15, 2024 at 14:32
holystore01 is back to normal operation. Please report any continued issues with holystore01 to FASRC.
Investigating
May 15, 2024 at 01:00
We are currently investigating a hardware problem on holystore01. This may affect file access, conda/mamba environments, and some logins.

April 2024

Resolved
April 30, 2024 at 14:32
The high I/O job has been identified. Scratch is back to normal operation.
Investigating
April 30, 2024 at 13:35
Scratch performance is degraded.
holyscratch01 is under abnormally high I/O load.
We are working to determine which job is the cause.

Resolved
May 15, 2024 at 13:55
All holynfs01 shares have been migrated or evacuated. Closing this incident. Holynfs01 remains online in read-only mode, but no shares rely on it now.
Investigating
April 26, 2024 at 19:54
holynfs01 has experienced a hardware failure. We have set all shares to read-only mode while we retrieve data from this server.
If your group requires a writable space in the interim, please reach out and we will set one up for you, or use scratch space for now.
This only affects a handful of NCF shares, most of them inactive, but we are notifying everyone to cast a wide net.

Resolved
April 25, 2024 at 16:15
This incident has been resolved.
Investigating
April 25, 2024 at 15:28
Due to stale file mounts for multiple groups, holylogin[01-04] will need to be rebooted at noon. Please save all work prior to the reboot.

Resolved
April 25, 2024 at 15:17
gpu_test and FASSE GPU partitions are back. We are looking into replacement power supply units to mitigate this issue.
Investigating
April 25, 2024 at 13:42
gpu_test is down again due to an InfiniBand (IB) switch failing.
The gpu partition in FASSE is also down.

Resolved
April 24, 2024 at 14:58
gpu_test is now operational and can be used.
This incident has been resolved.
Investigating
April 24, 2024 at 14:15
An InfiniBand switch has failed in the cabinet housing the gpu_test nodes.
Please use gpu_requeue or gpu in the meantime.
We will update this incident when we have an ETA.
