Notice history

Operational

Feb 2023

Mar 2023

Apr 2023

Operational

Apr 2023

Resolved
April 22, 2023 at 5:25 PM
Resolved
April 22, 2023 at 5:25 PM
This incident has been resolved.

Network admins found that one of the line cards on the Nexus core switch was not functioning correctly and dropping IPv4 traffic randomly. The card will be replaced as soon as possible. In the meantime, it has been isolated and connectivity has returned to normal

Thanks for your patience.
FASRC
Investigating
April 22, 2023 at 2:21 PM
Investigating
April 22, 2023 at 2:21 PM
We are currently investigating this incident. The majority of services appear to be working normally, but reports of random storage or service connectivity indicate an issue with networking or power at MGHPCC.

Updates to follow as we know more.

Resolved
April 10, 2023 at 12:18 AM
Resolved
April 10, 2023 at 12:18 AM
As far as we are aware, Harvard's DNS issues are resolved.
Identified
April 08, 2023 at 6:50 PM
Identified
April 08, 2023 at 6:50 PM
Harvard's DNS (name resolution system which translates hostnames to IP addresses) continues to experience degradation and periodic outages. This issue may affect access to any hostname/address ending in harvard.edu.

While this should not affect jobs inside the cluster, it will impact other resources users are trying to access including reaching login nodes, VPN, websites, drive mapping, this status page, etc.

This is a university-wide issue and outside FASRC's control. We will continue to monitor the situation, but for mire information please see Harvard's status page (which may also be inaccessible at times): https://status.huit.harvard.edu/

FASRC Monthly Maintenance April 3rd 2023 7am-11am

Completed
April 03, 2023 at 3:00 PM
Completed
April 03, 2023 at 3:00 PM
Maintenance has completed successfully
In progress
April 03, 2023 at 11:00 AM
In progress
April 03, 2023 at 11:00 AM
Maintenance is now in progress
Planned
April 03, 2023 at 11:00 AM
Planned
April 03, 2023 at 11:00 AM
FASRC Monthly Maintenance will occur April 3rd 2023 7am-11am
Please note that because of the June downtime, we will not hold monthly maintenance in May and July.

JUNE DOWNTIME AND OS/SOFTWARE CHANGES
The MGHPCC yearly power downtime will run June 5th-8th with return to full service the morning of Friday June 9th. For timeline and initially details on the downtime, our switch Rocky Linux, and changes to how we do software, please see: https://www.rc.fas.harvard.edu/blog/2023-downtime/

Also see our Rocky 8 Transition Guide for information on the OS and upcoming test cluster. We will also offer training and consulting in the coming months (TBA). https://docs.rc.fas.harvard.edu/kb/rocky-8-transition-guide/

Notices about the downtime with links to training, docs, and FAQs will commence in a few weeks.

TRAINING CALENDAR
Upcoming new user training and other sessions can be found on our training calendar: https://www.rc.fas.harvard.edu/upcoming-training/

APRIL 3rd MAINTENANCE
- Login node and VDI reboots
  -- Audience: Anyone logged into a a login node or VDI/OOD node
  -- Impact: Login and VDI/OOD nodes will rebooted during this maintenance window
- RCSMB server network changes
  -- Audience: Mounted Samba shares connected at "rcsmb.rc.fas.harvard.edu"
  -- Impact: None expected. The possibility of brief disconnects
  is possible but unlikely.
- Squid internet proxy update
  -- Audience: Any users/session/process using the internet proxy (https://docs.rc.fas.harvard.edu/kb/proxy-settings/)
  -- Impact: The proxy will be unavailable while updating
- Scratch cleanup ( https://docs.rc.fas.harvard.edu/kb/policy-scratch/ )
  -- Audience: Cluster users
  -- Impact: Files older than 90 days will be removed.
  planning for a scheduled maintenance during that time.
Thanks,
FAS Research Computing
Department and Service Catalog: https://www.rc.fas.harvard.edu/
Documentation: https://docs.rc.fas.harvard.edu/
Status Page: https://status.rc.fas.harvard.edu/

Mar 2023

Resolved
March 16, 2023 at 9:07 PM
Resolved
March 16, 2023 at 9:07 PM
The SSL cert issue has been fixed. Portal is accessible again. You may need to force-reload the page if you browser still complains (hold dorn Ctrl or Cmd while clicking reload).
Investigating
March 16, 2023 at 8:30 PM
Investigating
March 16, 2023 at 8:30 PM
Users are unable to access Portal.rc.fas.harvard.edu
We are currently investigating this problem.

Monthly Maintenance March 6th, 2023 7am-11am

Completed
March 06, 2023 at 5:06 PM
Completed
March 06, 2023 at 5:06 PM
Maintenance has completed successfully at 12:00PM.
In progress
March 06, 2023 at 4:30 PM
In progress
March 06, 2023 at 4:30 PM
Maintenance is still in progress as of 11:30AM as one of our vendors needs to complete some hardware work.

Access to the Slurm scheduler and all compute are still paused, but access to storage and other services has been restored.

We appreciate your patience.
Completed
March 06, 2023 at 4:00 PM
Completed
March 06, 2023 at 4:00 PM
Maintenance has completed successfully
In progress
March 06, 2023 at 12:00 PM
In progress
March 06, 2023 at 12:00 PM
Maintenance is now in progress
Planned
March 06, 2023 at 12:00 PM
Planned
March 06, 2023 at 12:00 PM
NOTICES

The annual MGHPCC power downtime will take place June 5th-8th, 2023
Calendar Event: https://www.rc.fas.harvard.edu/events/mghpcc-power-shutdown-2023/
Blog Post: https://www.rc.fas.harvard.edu/blog/2023-downtime/

GENERAL MAINTENANCE
- NOTE: All jobs will be paused during maintenance to reduce heat load and allow data center cooling maintenance to take place.
- Login node updates and reboots, VDI reboots
  Audience: VDI/OpenOnDemand users
  Impact: VDI will be unavailable during this and the above Slurm upgrade
- RCSMB (samba) Boston network changes
  Audience: RCSMB shares mounted out of Boston
  Impact: Could cause brief share disconnects during updates
- UPDATE: Nexus control plane supervisor switchover - ETA 5 minutes, short network disconnect while restarting
- Login node updates/reboot and VDI node reboots
  Audience: Anyone logged into a a login node or VDI/OOD node
  Impact: Login and VDI/OOD nodes will be unavailable while updating and rebooting
- Scratch cleanup ( https://docs.rc.fas.harvard.edu/kb/policy-scratch/ )
  Audience: Cluster users
  Impact: Files older than 90 days will be removed.
  Reminder: Scratch 90-day file retention purging runs occur regularly not just during maintenance periods.
SECURITY UPDATES
HUIT and the CIO Council have set a goal of reducing risk across all schools within the University. All schools are looking to reduce their outstanding vulnerability count 75% by June 2023. These numbers are based on HUIT security scans of our infrastructure.
We at FAS Research Computing are responsible for thousands of physical and virtual machines. To make progress in reducing our total open vulnerabilty count, we’re going to update internal and user facing systems as part of scheduled monthly maintenance windows and on a rolling basis outside of these windows. This will generally mean running OS and security updates as needed and rebooting these nodes when required.

This month, these hosts will get updates as part of our scheduled maintenance:
boslogin01 - boslogin04
holylogin01 - holylogin04
holydtn01 - holydtn04
xdmod4.rc.fas.harvard.edu
rchelp.rc.fas.harvard.edu (our ticket system)
rcsmtp.rc.fas.harvard.edu (our mail system)

Thanks!
FAS Research Computing
Department and Service Catalog: https://www.rc.fas.harvard.edu/
Documentation: https://docs.rc.fas.harvard.edu/
Status Page: https://status.rc.fas.harvard.edu/

Resolved
March 05, 2023 at 10:21 PM
Resolved
March 05, 2023 at 10:21 PM
Hardware maintenance was completed.
Identified
March 03, 2023 at 5:00 PM
Identified
March 03, 2023 at 5:00 PM
Due to an immediate need, we are doing some work on holylfs04 right now to replace some batteries in the controllers. Files that are on the impacted controllers will be inaccessible until they are back on line. The work should be completed later this afternoon.

Feb 2023

Resolved
February 16, 2023 at 2:44 PM
Resolved
February 16, 2023 at 2:44 PM
Normal VDI operation has resumed.
Investigating
February 16, 2023 at 2:42 PM
Investigating
February 16, 2023 at 2:42 PM
VDI is unresponsive or results in an error. We are currently investigating this issue.

Monthly Maintenance Feb. 6th, 2023 7am-11am

Completed
February 06, 2023 at 4:00 PM
Completed
February 06, 2023 at 4:00 PM
Maintenance has completed successfully
In progress
February 06, 2023 at 12:00 PM
In progress
February 06, 2023 at 12:00 PM
Maintenance is now in progress
Planned
February 06, 2023 at 12:00 PM
Planned
February 06, 2023 at 12:00 PM
NOTICES

GPU PARTITIONS
The gpu_test partition is back in service. Job limits are now 64 cores, 8 GPU's, and 750G of RAM. Users can run up to 2 jobs.

HOLIDAY NOTICE
February 20th is a university holiday (Presidents' Day)

GENERAL MAINTENANCE
- OnDemand Version upgrade to 2.0.29
  Audience: VDI/OpenOnDemand users
  Impact: VDI will be unavailable during this and the above Slurm upgrade
- Domain controller updates
  Audience: All cluster
  Impact: Could briefly impact some older systems, otherwise no impact expected
- Login node and VDI node reboots and firmware updates
  Audience: Anyone logged into a a login node or VDI/OOD node
  Impact: Login and VDI/OOD nodes will be unavailable while updating and rebooting
- Scratch cleanup ( https://docs.rc.fas.harvard.edu/kb/policy-scratch/ )
  Audience: Cluster users
  Impact: Files older than 90 days will be removed.
Reminder: Scratch 90-day file retention purging runs occur regularly not just during maintenance periods.

Thanks!
FAS Research Computing
Department and Service Catalog: https://www.rc.fas.harvard.edu/
Documentation: https://docs.rc.fas.harvard.edu/
Status Page: https://status.rc.fas.harvard.edu/

Feb 2023 to Apr 2023

FAS Research Computing - Notice history

Notice history

Apr 2023

Mar 2023

Feb 2023