SLURM Scheduler - Cannon - Operational
Cannon Compute Cluster (Holyoke) - Operational
Boston Compute Nodes - Operational
GPU nodes (Holyoke) - Operational
seas_compute - Operational
SLURM Scheduler - FASSE - Operational
FASSE Compute Cluster (Holyoke) - Operational
Kempner Cluster CPU - Operational
Kempner Cluster GPU - Operational
FASSE login nodes - Operational
Cannon Open OnDemand - Operational
FASSE Open OnDemand - Operational
Netscratch (Global Scratch) - Operational
Home Directory Storage - Boston - Operational
Tape - (Tier 3) - Operational
Holylabs - Operational
Isilon Storage Holyoke (Tier 1) - Operational
Holystore01 (Tier 0) - Operational
HolyLFS04 (Tier 0) - Operational
HolyLFS05 (Tier 0) - Operational
HolyLFS06 (Tier 0) - Operational
Holyoke Tier 2 NFS (new) - Operational
Holyoke Specialty Storage - Operational
holECS - Operational
Isilon Storage Boston (Tier 1) - Operational
BosLFS02 (Tier 0) - Operational
Boston Tier 2 NFS (new) - Operational
CEPH Storage Boston (Tier 2) - Operational
Boston Specialty Storage - Operational
bosECS - Operational
Samba Cluster - Operational
Globus Data Transfer - Operational
Notice History
June 2023
- Scheduled: June 20, 2023, 11:00
WHAT
Emergency Cannon cluster scheduler maintenance on Tuesday, June 20th, from 7am to 11am (the end time is an estimate). Running jobs will be paused during maintenance.
Note: FASSE is unaffected by this.
WHY
As most of you know, there are ongoing issues with the Cannon scheduler after the downtime and OS upgrades. This behavior did not show up in our earlier testing and only manifested once we were under load from user jobs again.
After investigation we found that the scheduler is oscillating between two states every hour. During even hours the scheduler enters a high thread state where all the traffic piles up, causing queries to either take a long time or time out. During odd hours the scheduler enters a low thread state and operates normally. During periods where the scheduler is responding, jobs can be submitted and should then run normally.
Unfortunately the root cause is still not known, and we are working closely with the vendor, SchedMD, to find a solution to return the cluster to stability. The version of Slurm that we are running is 22.05.7. It has been very stable (we have been running it for the past 5 months with no problems), but it appears to be having issues at scale on Rocky 8.
There is a newer version of the scheduler available, 23.02.3. This new version has a host of improvements. While the specific issue described above is not listed as one of the items fixed in this version, we anticipate that the upgrade may fix this issue owing to the various code improvements contained therein.
Barring any solutions found over the long weekend, we intend to hold an emergency maintenance period on Tuesday June 20th starting at 7am. Running jobs will be paused. We will return Cannon to service as soon as we've completed work. Status of this maintenance will be tracked here on our status page.
We thank you for your patience and understanding. Our team is hard at work answering your tickets and making sure that the cluster is back to full health. Also, as a reminder, we hold additional Office Hours each week in June. See: https://www.rc.fas.harvard.edu/training/office-hours/
FAS Research Computing
https://www.rc.fas.harvard.edu/
https://docs.rc.fas.harvard.edu/
https://status.rc.fas.harvard.edu/
- Completed: June 20, 2023, 1:16
CANCELLED
Emergency maintenance has been cancelled. A solution to the scheduler issue has been found and Slurm is once again operating normally.
We thank you for your patience and understanding.
May 2023
No notices for this month.
April 2023
- Resolved
As far as we are aware, Harvard's DNS issues are resolved.
- Identified
Harvard's DNS (name resolution system which translates hostnames to IP addresses) continues to experience degradation and periodic outages. This issue may affect access to any hostname/address ending in harvard.edu.
While this should not affect jobs inside the cluster, it will impact other resources users are trying to access including reaching login nodes, VPN, websites, drive mapping, this status page, etc.
This is a university-wide issue and outside FASRC's control. We will continue to monitor the situation, but for more information please see Harvard's status page (which may also be inaccessible at times): https://status.huit.harvard.edu/
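To tell a DNS failure apart from a problem with the service itself, a quick resolution check can help. The following is a minimal sketch, not an FASRC tool: `can_resolve` is a hypothetical helper name, and the `resolver` parameter exists only so the check can be exercised without network access.

```python
import socket


def can_resolve(hostname, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves to at least one address.

    `resolver` defaults to the standard library's socket.getaddrinfo;
    it is injectable (an assumption of this sketch) for offline testing.
    """
    try:
        return len(resolver(hostname, None)) > 0
    except socket.gaierror:
        # Name resolution failed: the DNS server is unreachable
        # or the name is unknown.
        return False
```

For example, `can_resolve("status.rc.fas.harvard.edu")` returning `False` while non-Harvard hostnames still resolve would be consistent with the outage described above.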
- Completed: April 03, 2023, 15:00
Maintenance has completed successfully
- In progress: April 03, 2023, 11:00
Maintenance is now in progress
- Scheduled: April 03, 2023, 11:00
FASRC Monthly Maintenance will occur on April 3rd, 2023, from 7am to 11am
Please note that because of the June downtime, we will not hold monthly maintenance in May and July.
JUNE DOWNTIME AND OS/SOFTWARE CHANGES
The MGHPCC yearly power downtime will run June 5th-8th with return to full service the morning of Friday June 9th. For the timeline and initial details on the downtime, our switch to Rocky Linux, and changes to how we do software, please see: https://www.rc.fas.harvard.edu/blog/2023-downtime/
Also see our Rocky 8 Transition Guide for information on the OS and upcoming test cluster. We will also offer training and consulting in the coming months (TBA). https://docs.rc.fas.harvard.edu/kb/rocky-8-transition-guide/
Notices about the downtime with links to training, docs, and FAQs will commence in a few weeks.
TRAINING CALENDAR
Upcoming new user training and other sessions can be found on our training calendar: https://www.rc.fas.harvard.edu/upcoming-training/
APRIL 3rd MAINTENANCE
Login node and VDI reboots
-- Audience: Anyone logged into a login node or VDI/OOD node
-- Impact: Login and VDI/OOD nodes will be rebooted during this maintenance window
RCSMB server network changes
-- Audience: Mounted Samba shares connected at "rcsmb.rc.fas.harvard.edu"
-- Impact: None expected. Brief disconnects are possible but unlikely.
Squid internet proxy update
-- Audience: Any users/session/process using the internet proxy (https://docs.rc.fas.harvard.edu/kb/proxy-settings/)
-- Impact: The proxy will be unavailable while updating
Scratch cleanup ( https://docs.rc.fas.harvard.edu/kb/policy-scratch/ )
-- Audience: Cluster users
-- Impact: Files older than 90 days will be removed.
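Users who want to see which of their scratch files fall outside the 90-day window before the cleanup runs could list them with something like the sketch below. It is illustrative only: `files_older_than` is a hypothetical helper, and it assumes file age is measured by modification time (mtime). The scratch policy linked above defines the criterion the cleanup actually applies.

```python
import os
import time

RETENTION_DAYS = 90  # retention window stated in the notice


def files_older_than(root, days=RETENTION_DAYS, now=None):
    """Return sorted paths under `root` whose mtime is more than `days` days old.

    Assumption: age is judged by modification time. The real cleanup
    may use a different timestamp; check the scratch policy.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # seconds in a day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime < cutoff:
                stale.append(path)
    return sorted(stale)
```

Calling `files_older_than("/path/to/your/scratch/dir")` returns the candidate paths without modifying anything, so it is safe to run as a dry check. The path shown is a placeholder.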
planning for a scheduled maintenance during that time.
Thanks,
FAS Research Computing
Department and Service Catalog: https://www.rc.fas.harvard.edu/
Documentation: https://docs.rc.fas.harvard.edu/
Status Page: https://status.rc.fas.harvard.edu/

