Monthly maintenance Dec 5th 2022 7am-11am

Completed

December 05, 2022 at 7:50 PM

Maintenance has completed successfully.

In progress

December 05, 2022 at 2:55 PM

Apologies. The maintenance event on the status page did not start automatically. Maintenance is already underway and will complete at 11am.

Planned

December 05, 2022 at 12:00 PM

NOTICES

GPUTEST AND REMOTEVIZ PARTITIONS
Due to failed nodes, the gputest partition is down to 2 nodes and the (single-node) remoteviz partition is currently down. We are working with the vendor to replace the hardware, but the issue is still unresolved and there is no ETA at this time. We will post updates and QoS changes on our status page when we have them: https://status.rc.fas.harvard.edu/cl8a94kcf17664hvoj8oksxanx

GLOBUS PERSONAL CLIENT - UPDATE BY DEC 17
If you are using the Globus Connect Personal client on your machine, please ensure you have updated and are running version 3.2 or greater by December 17th, 2022. You will not be able to use version 3.1 or below after that date. https://docs.globus.org/ca-update-2022/#globusconnectpersonal

HOLIDAY NOTICES NOVEMBER:
Office Hours will be held on 11/23 prior to the Thanksgiving break, but will run only from 12-2pm. FASRC staff will be unavailable Nov. 16th from 12-3pm for a staff event. Thur/Fri Nov. 24th and 25th are university holidays (Thanksgiving).

HOLIDAY NOTICES DECEMBER:
Office Hours will not be held on Dec. 21st and will resume Jan. 4th, 2023. Winter break runs Dec. 23 - Jan. 2nd. FASRC will monitor for emergencies during this time, but general questions/tickets will be held until we return on Jan. 3rd, 2023.

SLURM SCHEDULER UPDATE NOTES:
Because of a bug in previous versions of Slurm, this upgrade will create a situation where jobs launched on the previous version get stuck in the COMPLETING state until the node is rebooted (see: https://bugs.schedmd.com/show_bug.cgi?id=15078). This means that in the week(s) following the upgrade there will be rolling reboots of the nodes to clear these stuck jobs.

Users should be aware that any job stuck in the COMPLETING state will remain so until the node the job lives on is rebooted, and any node that is labelled COMPLETING will not be able to receive jobs until it is rebooted. This is due to a Slurm bug and has nothing to do with users' code or jobs, so users cannot do anything to clear this state faster. FASRC admins will reboot nodes as soon as they are clear of work to fix this issue.
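
For users who want to see whether any of their own jobs are affected, the following is a minimal sketch, not an official FASRC tool. It assumes `squeue` is on your PATH (i.e. you are on a login node) and uses the standard `--user`, `--states`, `--noheader`, and `--format` options; the function names `parse_squeue` and `completing_jobs` are made up for this example. It only reports stuck jobs; as noted above, nothing on the user side will clear them.

```python
import subprocess


def parse_squeue(output):
    """Parse squeue output lines of the form 'jobid|state|nodelist'."""
    jobs = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        job_id, state, nodes = (field.strip() for field in line.split("|", 2))
        jobs.append({"job_id": job_id, "state": state, "nodes": nodes})
    return jobs


def completing_jobs(user):
    """Return the given user's jobs that squeue reports as COMPLETING.

    Assumption: squeue is installed and the scheduler is reachable.
    """
    result = subprocess.run(
        [
            "squeue",
            "--user", user,
            "--states", "COMPLETING",
            "--noheader",
            "--format", "%i|%T|%N",  # job ID | state | node list
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_squeue(result.stdout)
```

Calling `completing_jobs("your_username")` returns a list of dictionaries, one per stuck job, with the node(s) that will need a reboot before the job clears.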

GENERAL MAINTENANCE

Slurm scheduler update (22.05.x) - See notes above
-- Audience: All cluster job users
-- Impact: See notes above. The scheduler and jobs will be paused during the upgrade.
Partition decommissioning
-- Audience: narayan and holymeissner partition users
-- Impact: These partitions will no longer be available
Domain controller DHCP updates
-- Audience: All users
-- Impact: No impact expected
Holyscratch01 firmware updates
-- Audience: All users of scratch
-- Impact: Scratch will be unavailable for short periods
Login node and VDI node reboots and firmware updates
-- Audience: Anyone logged into a login node or VDI/OOD node
-- Impact: Login and VDI/OOD nodes will be unavailable while updating and rebooting
Scratch cleanup ( https://docs.rc.fas.harvard.edu/kb/policy-scratch/ )
-- Audience: Cluster users
-- Impact: Files older than 90 days will be removed.

Reminder: Scratch 90-day file retention purges run regularly, not just during maintenance periods.
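
To get a rough idea of which of your scratch files might fall under the 90-day rule, here is a small sketch. It is an assumption that file age is judged by modification time; the actual purge criteria are defined in the scratch policy linked above, so treat the result as an estimate only. The function name `files_older_than` and the directory you pass in are hypothetical.

```python
import os
import time
from pathlib import Path

RETENTION_DAYS = 90  # retention period stated in this notice


def files_older_than(root, days=RETENTION_DAYS, now=None):
    """List files under root whose modification time is more than `days` old.

    Assumption: age is measured by mtime; the real purge may use other criteria.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # seconds in a day
    old = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                mtime = path.lstat().st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime < cutoff:
                old.append(path)
    return sorted(old)
```

For example, `files_older_than("/path/to/your/scratch/dir")` returns a sorted list of paths you may want to copy elsewhere before the next purge run.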

Updates on our status page: https://status.rc.fas.harvard.edu

Thanks!
FAS Research Computing
Department and Service Catalog: https://www.rc.fas.harvard.edu/
Documentation: https://docs.rc.fas.harvard.edu/
Status Page: https://status.rc.fas.harvard.edu/
