Notice History

Operational

June 2023

Resolved
June 26, 2023 at 18:50
This incident has been resolved.
Investigating
June 26, 2023 at 12:00
Due to a HUIT networking issue, mounting storage from wired campus connections may not work.

We are working with HUIT to resolve this.

In the meantime, you should still be able to mount storage while connected to the RC VPN.

Emergency maintenance 6/20/23 - Cannon Slurm Scheduler

Scheduled
June 20, 2023 at 11:00
WHAT
Emergency Cannon cluster scheduler maintenance Tuesday June 20th from 7am - 11am (actual end time estimated). Running jobs will be paused during maintenance.

Note: FASSE is unaffected by this.

WHY
As most of you know, there are ongoing issues with the Cannon scheduler after the downtime and OS upgrades. This behavior did not show up in our earlier testing and only manifested once we were under load from user jobs again.

After investigation we found that the scheduler is oscillating between two states every hour. During even hours the scheduler would enter a high thread state where all the traffic would pile up, causing any queries to either take a long time or time out. During odd hours the scheduler would enter a low thread state and operate normally. During periods where the scheduler is responding, jobs can be submitted and should then run normally.

Unfortunately, the root cause is still not known, and we are working closely with the vendor, SchedMD, to find a solution to return the cluster to stability. The version of Slurm that we are running is 22.05.7. While it has been very stable (we have been running it for the past 5 months with no problems), it appears to be having issues at scale on Rocky 8.

There is a newer version of the scheduler available, 23.02.3. This new version has a host of improvements. While the specific issue described above is not listed as one of the items fixed in this version, we anticipate that the upgrade may fix this issue owing to the various code improvements contained therein.

Barring any solutions found over the long weekend, we intend to hold an emergency maintenance period on Tuesday June 20th starting at 7am. Running jobs will be paused. We will return Cannon to service as soon as we've completed work. Status of this maintenance will be tracked here on our status page.

We thank you for your patience and understanding. Our team is hard at work answering your tickets and making sure that the cluster is back to full health. Also, as a reminder, we hold additional Office Hours each week in June. See: https://www.rc.fas.harvard.edu/training/office-hours/

FAS Research Computing
https://www.rc.fas.harvard.edu/
https://docs.rc.fas.harvard.edu/
https://status.rc.fas.harvard.edu/
Completed
June 20, 2023 at 1:16
CANCELLED

Emergency maintenance has been cancelled. A solution to the scheduler issue has been found and Slurm is once again operating normally.

We thank you for your patience and understanding.

Resolved
June 20, 2023 at 1:28
This incident has been resolved.

A solution to the Slurm scheduler issue has been found and implemented. The scheduler is now operating normally.

Thank you for your patience and understanding.
Update
June 16, 2023 at 17:34
PLEASE SEE EMERGENCY MAINTENANCE NOTICE BELOW
Emergency maintenance Tuesday June 20th 7am-11am. Running jobs will be paused.

We are still working on finding the root cause of the scheduler slowness and timeouts. We are working with SchedMD and have sent diagnostic information for analysis.

At this time the scheduler is up but may become unresponsive to your commands at times. You can wait and retry them.

Jobs, once launched, should run as expected.

OOD/VDI are working, but job submission may also be affected.

If you receive "salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified", wait a moment and resubmit your job. The resubmission should succeed.
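
The wait-and-retry advice above can be automated. The following is a minimal sketch, not FASRC-provided tooling: the function name, the retry count, and the wait interval are all illustrative assumptions. It retries only when the error text quoted above appears in the command's output, and returns any other failure immediately.

```python
import subprocess
import time

# Error text quoted in the notice above; treated here as a transient failure.
TRANSIENT_MARKER = "Invalid account or account/partition combination specified"


def submit_with_retry(cmd, attempts=5, wait_seconds=30,
                      transient_markers=(TRANSIENT_MARKER,),
                      run=subprocess.run, sleep=time.sleep):
    """Run a submission command, retrying on known transient scheduler errors.

    The defaults for attempts and wait_seconds are assumptions for
    illustration, not recommended values. `run` and `sleep` are injectable
    so the logic can be exercised without a real scheduler.
    """
    result = None
    for attempt in range(1, attempts + 1):
        result = run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result  # submission accepted
        output = (result.stderr or "") + (result.stdout or "")
        if not any(marker in output for marker in transient_markers):
            return result  # a different error: do not retry blindly
        if attempt < attempts:
            sleep(wait_seconds)  # give the scheduler a moment to recover
    return result  # still failing after the final attempt
```

For example, `submit_with_retry(["sbatch", "job.sh"])` would resubmit a hypothetical `job.sh` up to five times. Callers should still check `returncode` on the returned result.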

Addendum: It was determined that the issue is cyclical. You will have the best luck interfacing with the scheduler during odd hours (e.g., after 1pm and before 2pm) and less success during even hours (e.g., after 2pm and before 3pm).
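
As a rough illustration of the odd/even pattern described in the addendum, the sketch below checks whether a given time falls in an odd clock hour and, if not, computes when the next odd hour begins. The function names are hypothetical, and the code assumes the timestamp is already in the cluster's local time.

```python
from datetime import datetime, timedelta


def scheduler_likely_responsive(now):
    """Return True during odd clock hours (e.g. 13:00-13:59 local time)."""
    return now.hour % 2 == 1


def next_odd_hour_start(now):
    """Return `now` if it is already in an odd hour, else the top of the next hour."""
    if scheduler_likely_responsive(now):
        return now
    # An even hour is always followed by an odd hour (23:xx is itself odd).
    return now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
```

A script could call `next_odd_hour_start(datetime.now())` and sleep until that time before submitting. This reflects only the pattern reported in this update and does not guarantee a successful submission.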

We regret the impact this is having on your work. Updates will follow as we have them.
Update
June 14, 2023 at 18:22
We are still working on finding the root cause of the slowness and are working with the vendor to troubleshoot.

At this time the scheduler is up but may become unresponsive to your commands at times. Jobs, once launched, should run as expected.
Update
June 13, 2023 at 18:50
Slurm is still being overwhelmed with large numbers of requests. VDI sessions (timeouts/502 errors), interactive job connections, and job submissions are all impacted.

We are in contact with the Slurm developers to find a better resolution.
Monitoring
June 13, 2023 at 15:39
We have implemented updates and the scheduler appears stable and responsive.
We will continue to monitor and update this incident.
Investigating
June 13, 2023 at 14:19
We are currently investigating performance issues with the scheduler. These manifest as slowness when submitting jobs, intermittent job submission failures, and timeout errors (including OOD/VDI).

Also, some jobs may not exit cleanly, putting nodes into drain status.

We are working to resolve this issue. Updates to follow.

Resolved
June 13, 2023 at 15:46
This incident has been resolved. FASSE and Cannon OOD are both available.
Monitoring
June 12, 2023 at 20:59
Cannon and FASSE OOD are not fully operational because Slurm is slow. When Slurm is slow, apps wait too long to launch, which can produce 502 errors.
Identified
June 12, 2023 at 17:18
Cannon OOD is operational.

FASSE OOD is still experiencing issues and showing errors for new connections. Existing connections are working.
WIP
Investigating
June 12, 2023 at 15:11
We are currently investigating issues with OOD (OpenOnDemand aka VDI).

Resolved
June 12, 2023 at 15:55
Login to fasselogin is working. Additional login nodes are not in the rotation and will be added back once they are healthy again. But normal login is working.
Investigating
June 12, 2023 at 14:56
We are currently investigating an issue with FASSE login nodes (fasselogin) timing out or not connecting.

May 2023

No notices for this month

April 2023

Resolved
April 22, 2023 at 17:25
This incident has been resolved.

Network admins found that one of the line cards on the Nexus core switch was not functioning correctly and was randomly dropping IPv4 traffic. The card will be replaced as soon as possible. In the meantime, it has been isolated and connectivity has returned to normal.

Thanks for your patience.
FASRC
Investigating
April 22, 2023 at 14:21
We are currently investigating this incident. The majority of services appear to be working normally, but reports of intermittent storage or service connectivity problems indicate an issue with networking or power at MGHPCC.

Updates to follow as we know more.

Resolved
April 10, 2023 at 0:18
As far as we are aware, Harvard's DNS issues are resolved.
Identified
April 08, 2023 at 18:50
Harvard's DNS (name resolution system which translates hostnames to IP addresses) continues to experience degradation and periodic outages. This issue may affect access to any hostname/address ending in harvard.edu.

While this should not affect jobs inside the cluster, it will impact other resources users are trying to access, including login nodes, VPN, websites, drive mapping, this status page, etc.

This is a university-wide issue and outside FASRC's control. We will continue to monitor the situation, but for more information please see Harvard's status page (which may also be inaccessible at times): https://status.huit.harvard.edu/

FASRC Monthly Maintenance April 3rd 2023 7am-11am

Completed
April 03, 2023 at 15:00
Maintenance has completed successfully
In progress
April 03, 2023 at 11:00
Maintenance is now in progress
Scheduled
April 03, 2023 at 11:00
FASRC Monthly Maintenance will occur April 3rd 2023 7am-11am
Please note that because of the June downtime, we will not hold monthly maintenance in May and July.

JUNE DOWNTIME AND OS/SOFTWARE CHANGES
The MGHPCC yearly power downtime will run June 5th-8th with return to full service the morning of Friday June 9th. For the timeline and initial details on the downtime, our switch to Rocky Linux, and changes to how we do software, please see: https://www.rc.fas.harvard.edu/blog/2023-downtime/

Also see our Rocky 8 Transition Guide for information on the OS and upcoming test cluster. We will also offer training and consulting in the coming months (TBA). https://docs.rc.fas.harvard.edu/kb/rocky-8-transition-guide/

Notices about the downtime with links to training, docs, and FAQs will commence in a few weeks.

TRAINING CALENDAR
Upcoming new user training and other sessions can be found on our training calendar: https://www.rc.fas.harvard.edu/upcoming-training/

APRIL 3rd MAINTENANCE
- Login node and VDI reboots
  -- Audience: Anyone logged into a login node or VDI/OOD node
  -- Impact: Login and VDI/OOD nodes will be rebooted during this maintenance window
- RCSMB server network changes
  -- Audience: Mounted Samba shares connected at "rcsmb.rc.fas.harvard.edu"
  -- Impact: None expected. Brief disconnects are possible but unlikely.
- Squid internet proxy update
  -- Audience: Any users/session/process using the internet proxy (https://docs.rc.fas.harvard.edu/kb/proxy-settings/)
  -- Impact: The proxy will be unavailable while updating
- Scratch cleanup ( https://docs.rc.fas.harvard.edu/kb/policy-scratch/ )
  -- Audience: Cluster users
  -- Impact: Files older than 90 days will be removed.
  planning for a scheduled maintenance during that time.
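
To preview which files a 90-day scratch cleanup might affect, something like the following sketch could be run against your own scratch directory. It is not the cleanup tool itself: the function name is hypothetical, and it assumes file age is judged by modification time, which this notice does not specify. See the scratch policy page linked above for the actual criteria.

```python
import os
import time

RETENTION_DAYS = 90  # retention period stated in the notice above


def files_older_than(root, days=RETENTION_DAYS, now=None):
    """List files under `root` whose modification time is more than `days` ago.

    Assumption: age is measured by mtime; the real policy may use a
    different timestamp.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # seconds in a day
    old = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_mtime < cutoff:
                    old.append(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return sorted(old)
```

The function only reports paths and never deletes anything, so it is safe to run before the maintenance window to decide what to copy elsewhere.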
Thanks,
FAS Research Computing
Department and Service Catalog: https://www.rc.fas.harvard.edu/
Documentation: https://docs.rc.fas.harvard.edu/
Status Page: https://status.rc.fas.harvard.edu/

April 2023 to June 2023

FAS Research Computing - Notice History
