Notice History

Operational

July 2023

Resolved
July 25, 2023 at 16:08
The affected partitions are back up within acceptable operating percentages.
Identified
July 25, 2023 at 14:51
Numerous nodes are in a bad state (those marked 'drng', not those marked 'drain'). We are aware and are working to re-open these nodes as quickly as possible.

This includes but is not limited to nodes in: argelles-delgado, kozinsky, seas_compute, kempner, and other partitions.
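To see which nodes are in the 'drng' state mentioned above, the compact state column from sinfo can be filtered. The following is a minimal sketch, not an official RC tool: it assumes sinfo is on your PATH, and the function names and sample node names are hypothetical.

```python
import subprocess


def nodes_in_state(sinfo_text, state="drng"):
    """Return sorted node names whose compact Slurm state equals `state`.

    Expects one "<node> <state>" pair per line, as printed by
    `sinfo -h -N -o "%N %t"`. Trailing flag characters such as "*"
    (node not responding) are stripped before comparing.
    """
    nodes = set()
    for line in sinfo_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, node_state = parts
        if node_state.rstrip("*~#!%$@+^-") == state:
            nodes.add(name)  # a node is listed once per partition; dedupe
    return sorted(nodes)


def query_sinfo():
    """Run sinfo with no header, one line per node: name and compact state."""
    result = subprocess.run(
        ["sinfo", "-h", "-N", "-o", "%N %t"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Hypothetical sample output; on a cluster, pass query_sinfo() instead.
sample = "node001 drng\nnode002 drain\nnode003 drng*\nnode001 drng\n"
print(nodes_in_state(sample))
```

Only nodes whose state is exactly 'drng' are returned, so nodes already in 'drain' are left out, matching the distinction drawn in the notice.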

Resolved
July 20, 2023 at 19:48
Modules should be available again across Cannon and FASSE. If you have an open session, please log out and log in again to use modules.

This incident has been resolved.
Investigating
July 20, 2023 at 19:12
An unexpected bug has broken software module availability. Module commands such as module load, module spider, etc. may not work.

We are currently investigating this incident.

Resolved
July 18, 2023 at 15:35
This incident has been resolved.
Update
July 17, 2023 at 16:59
We have brought the server back online in a "workaround" state and will be continuing to investigate the root cause. Access should be normal.
Investigating
July 17, 2023 at 15:45
Lab Storage on holylabs file server is currently inaccessible. We are currently investigating this incident.

Resolved
July 13, 2023 at 17:01
This incident has been resolved.
Investigating
July 13, 2023 at 16:48
boslfs02 is currently inaccessible. We are investigating this incident.

Resolved
August 17, 2023 at 17:51
Firmware updates have resolved this issue.
Update
July 18, 2023 at 19:05
The prescribed driver update did not fix this issue.

We are working with Nvidia to find a fix. As a stop-gap, nodes which get stuck/flagged will be marked in Slurm for reboot and rebooted once empty of jobs.

If an A100 GPU host your lab owns is stuck in a bad state, please let us know and we will mark and reboot it.
Identified
July 12, 2023 at 14:44
A100s are open but still experiencing some instability. Infrequently, you may hit the issue we noted earlier.

We are continuing to work on a solution.
Investigating
July 07, 2023 at 15:00
An NVIDIA bug may be causing failures in A100 GPUs. The nvidia-smi command may be slow, or may output "ERR!" or "No GPUs are found".

Jobs on affected nodes may fail.

We are investigating this issue.
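A job could test for the symptoms described above before starting GPU work. The sketch below is illustrative only: the marker strings are copied from this notice and may not match the exact text nvidia-smi prints, and the function names and the 30-second timeout are assumptions.

```python
import subprocess

# Marker strings as quoted in the notice; the exact wording printed by
# nvidia-smi on an affected node is an assumption.
BAD_MARKERS = ("ERR!", "No GPUs are found")


def classify_output(output):
    """Return "bad" if the text contains a known failure marker, else "ok"."""
    for marker in BAD_MARKERS:
        if marker in output:
            return "bad"
    return "ok"


def check_gpu(timeout_s=30.0):
    """Run nvidia-smi and report "ok", "bad", "slow", or "missing"."""
    try:
        proc = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return "slow"  # the command did not finish within timeout_s seconds
    except FileNotFoundError:
        return "missing"  # nvidia-smi is not installed on this host
    return classify_output(proc.stdout + proc.stderr)


if __name__ == "__main__":
    print(check_gpu())
```

A batch script could call check_gpu() first and exit early on anything other than "ok", so the failure is visible before the real workload begins.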

June 2023

Resolved
June 26, 2023 at 18:50
This incident has been resolved.
Investigating
June 26, 2023 at 12:00
Due to a HUIT networking issue, mounting storage from wired campus connections may not work.

We are working with HUIT to resolve this.

In the meantime, you should still be able to mount storage while connected to the RC VPN.

Emergency maintenance 6/20/23 - Cannon Slurm Scheduler

Scheduled
June 20, 2023 at 11:00
WHAT
Emergency Cannon cluster scheduler maintenance Tuesday June 20th from 7am - 11am (actual end time estimated). Running jobs will be paused during maintenance.

Note: FASSE is unaffected by this.

WHY
As most of you know, there are on-going issues with the Cannon scheduler after the downtime and OS upgrades. This behavior did not show up in our earlier testing and only manifested once we were under load from user jobs again.

After investigation we found that the scheduler is oscillating between two states every hour. During even hours the scheduler would enter a high thread state where all the traffic would pile up, causing any queries to either take a long time or time out. During odd hours the scheduler would enter a low thread state and operate normally. During periods where the scheduler is responding, jobs can be submitted and should then run normally.

Unfortunately the root cause is still not known, and we are working closely with the vendor, SchedMD, to find a solution to return the cluster to stability. The version of Slurm that we are running is 22.05.7. While it has been very stable (we have been running on it for the past 5 months with no problems), it appears to be having issues at scale on Rocky 8.

There is a newer version of the scheduler available, 23.02.3. This new version has a host of improvements. While the specific issue described above is not listed as one of the items fixed in this version, we anticipate that the upgrade may fix this issue owing to the various code improvements contained therein.

Barring any solutions found over the long weekend, we intend to hold an emergency maintenance period on Tuesday June 20th starting at 7am. Running jobs will be paused. We will return Cannon to service as soon as we've completed work. Status of this maintenance will be tracked here on our status page.

We thank you for your patience and understanding. Our team is hard at work answering your tickets and making sure that the cluster is back to full health. Also, as a reminder, we hold additional Office Hours each week in June. See: https://www.rc.fas.harvard.edu/training/office-hours/

FAS Research Computing
https://www.rc.fas.harvard.edu/
https://docs.rc.fas.harvard.edu/
https://status.rc.fas.harvard.edu/
Completed
June 20, 2023 at 01:16
CANCELLED

Emergency maintenance has been cancelled. A solution to the scheduler issue has been found and Slurm is once again operating normally.

We thank you for your patience and understanding.

Resolved
June 20, 2023 at 01:28
This incident has been resolved.

A solution to the Slurm scheduler issue has been found and implemented. The scheduler is now operating normally.

Thank you for your patience and understanding.
Update
June 16, 2023 at 17:34
PLEASE SEE EMERGENCY MAINTENANCE NOTICE BELOW
Emergency maintenance Tuesday June 20th 7am-11am. Running jobs will be paused.

We are still working on finding the root cause of the scheduler slowness and timeouts. We are working with SchedMD and have sent diagnostic information for analysis.

At this time the scheduler is up but may become unresponsive to your commands at times. You can wait and retry them.

Jobs, once launched, should run as expected.

OOD/VDI are working, but job submission may also be affected.

If you receive salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified, waiting a moment and re-submitting your job again should result in a successful submission.

Addendum: It was determined that the issue is cyclical and you will have the best luck during odd hours (e.g. - after 1pm, before 2pm) and less success interfacing with the scheduler during even hours (e.g. - after 2pm, before 3pm).

We regret the impact this is having on your work. Updates will follow as we have them.
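The "wait and retry" advice in this update can be automated with a small wrapper. The following is a minimal sketch under stated assumptions: the function names, attempt count, and delays are illustrative, and the odd-hour check only reflects the cyclical pattern described in the addendum.

```python
import subprocess
import time
from datetime import datetime


def scheduler_likely_responsive(now):
    """True during odd hours (e.g. 13:00-13:59), per the addendum above."""
    return now.hour % 2 == 1


def run_with_retry(cmd, attempts=5, base_delay=2.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run cmd, retrying with exponential backoff while it exits non-zero.

    Returns the result of the last attempt. `runner` and `sleep` are
    injectable so the retry logic can be tested without a scheduler.
    """
    for attempt in range(attempts):
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if attempt < attempts - 1:
            # Waits base_delay, then 2x, 4x, ... between attempts.
            sleep(base_delay * 2 ** attempt)
    return result


if __name__ == "__main__":
    if not scheduler_likely_responsive(datetime.now()):
        print("Even hour: the scheduler may be slow to respond.")
    # "job.sh" is a placeholder for your own batch script.
    outcome = run_with_retry(["sbatch", "job.sh"])
    print(outcome.returncode, outcome.stdout, outcome.stderr)
```

Backing off between attempts avoids adding more load to a scheduler that is already struggling with request volume.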
Update
June 14, 2023 at 18:22
We are still working on finding the root cause of the slowness and are working with the vendor to troubleshoot.

At this time the scheduler is up but may become unresponsive to your commands at times. Jobs, once launched, should run as expected.
Update
June 13, 2023 at 18:50
Slurm is still being overwhelmed with large numbers of requests. VDI (timeouts/502 errors), interactive job connections, and job submissions are all impacted.

We are in contact with the Slurm developers to find a better resolution.
Monitoring
June 13, 2023 at 15:39
We have implemented updates and the scheduler appears stable and responsive.
We will continue to monitor and update this incident.
Investigating
June 13, 2023 at 14:19
We are currently investigating performance issues with the scheduler. This manifests as slowness when submitting jobs, job submissions failing some percentage of the time, and timeout errors (including OOD/VDI).

Also, some jobs may not exit cleanly, putting nodes into drain status.

We are working to resolve this issue. Updates to follow.

Resolved
June 13, 2023 at 15:46
This incident has been resolved. FASSE and Cannon OOD are both available.
Monitoring
June 12, 2023 at 20:59
Cannon and FASSE OOD are not fully operational because Slurm is slow. When Slurm is slow, apps wait too long to launch, which can result in 502 errors.
Identified
June 12, 2023 at 17:18
Cannon OOD is operational.

FASSE OOD is still experiencing issues and showing errors for new connections. Existing connections are working.
WIP
Investigating
June 12, 2023 at 15:11
We are currently investigating issues with OOD (OpenOnDemand aka VDI).

Resolved
June 12, 2023 at 15:55
Login to fasselogin is working. Additional login nodes are not in the rotation and will be added back once they are healthy again. But normal login is working.
Investigating
June 12, 2023 at 14:56
We are currently investigating an issue with FASSE login nodes (fasselogin) timing out or not connecting.

May 2023

No notices reported this month

May 2023 to July 2023

FAS Research Computing - Notice History
