Experiencing partially degraded performance

Status page for the Harvard FAS Research Computing cluster and other resources.

Cluster Utilization (VPN and FASRC login required): Cannon | FASSE | Academic


Please scroll down to see details on any Incidents or maintenance notices.
Monthly maintenance occurs on the first Monday of the month (except holidays).

GETTING HELP
https://docs.rc.fas.harvard.edu | https://portal.rc.fas.harvard.edu | Email: rchelp@rc.fas.harvard.edu



FASRC Monthly Maintenance - March 4th, 2024 7am-11am

Completed
Scheduled for March 04, 2024 at 12:00 PM – 4:15 PM

Affects

Cannon Cluster
SLURM Scheduler - Cannon
Cannon Compute Cluster (Holyoke)
Boston Compute Nodes
GPU nodes (Holyoke)
SEAS compute partition
Updates
  • Update
    March 04, 2024 at 4:19 PM (Completed)

    The Slurm upgrade to 23.11.4 was completed successfully during maintenance. However, a complication with the automation of Slurm's cryptographic keys occurred during the upgrade, which caused nodes to lose the ability to communicate with the Slurm master. The Slurm master therefore viewed those nodes as down and requeued their jobs.

    All jobs on Cannon and FASSE were requeued.

    This is deeply regrettable, but the chain of events that caused it could not have been foreseen.

    To check the status of your jobs, see the common Slurm commands at:

    https://docs.rc.fas.harvard.edu/kb/convenient-slurm-commands/#Information_on_jobs
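
    For a quick check from a login node, a couple of commonly used Slurm commands can help confirm whether a job was requeued and where it stands now (the job ID shown is only a placeholder; substitute your own):

        squeue -u $USER        # list your pending and running jobs
        sacct -j 12345678 --format=JobID,JobName,State,Elapsed    # show the recorded state of a specific job (e.g. REQUEUED, PENDING, RUNNING)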

    FAS Research Computing

    https://docs.rc.fas.harvard.edu/

    rchelp@rc.fas.harvard.edu

  • Completed
    March 04, 2024 at 4:15 PM
    Maintenance has completed successfully.
  • Update
    March 04, 2024 at 3:56 PM (In progress)

    Due to unforeseen issues with updates, maintenance will run long. ETA 11:15am.

  • In progress
    March 04, 2024 at 12:00 PM
    Maintenance is now in progress.
  • Planned
    March 04, 2024 at 12:00 PM

    FASRC monthly maintenance will take place Monday, March 4th, 2024, from 7am to 11am.

    NOTICES

    2024 training sessions are available. Topics so far include New User Training, Getting Started on FASRC with CLI, Getting Started on FASRC with OpenOnDemand, Installing and using software, GPU Computing, Advanced cluster usage, Parallel job workflows, and other advanced topics. To see current and future training sessions, view our calendar at: https://www.rc.fas.harvard.edu/upcoming-training/

    For these and other events such as Office Hours, view our entire events calendar at: https://www.rc.fas.harvard.edu/upcoming-events/

    TIP: Subscribe to receive status page notices at https://status.rc.fas.harvard.edu/ (click Get Updates for options).

    MAINTENANCE TASKS

    Will the Cannon cluster be paused during this maintenance? YES

    Will the FASSE cluster be paused during this maintenance? YES

    Slurm upgrade to v23.11.4

    -- Audience: All cluster users

    -- Impact: The scheduler will be unavailable during maintenance. This is likely to require the entire maintenance window. There is a small possibility that this may go over time. Please watch the Status Page for updates.

    -- Jobs will be paused to accommodate this upgrade

    InfiniBand NVIDIA MQM9700 firmware update to the latest version (3.11.3002)

    -- Audience: All cluster users

    -- Impact: No direct impact (this is networking firmware, not GPU firmware)

    -- Jobs will be paused to accommodate this update

    Internal services updates

    -- Audience: FASRC internal services, including foreman, atlantis, puppet, and terraform

    -- Impact: No user-facing impact

    Login node and OOD/VDI reboots

    -- Audience: Anyone logged into a login node or VDI/OOD node

    -- Impact: Login and VDI/OOD nodes will be rebooted during this maintenance window

    Scratch cleanup ( https://docs.rc.fas.harvard.edu/kb/policy-scratch/ )

    -- Audience: Cluster users

    -- Impact: Files older than 90 days will be removed. Please note that retention cleanup can run at any time, not just during the maintenance window.
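
    If you would like to see which of your scratch files are at risk before a cleanup pass runs, a check along the following lines can help (the path shown is a placeholder; substitute your lab's actual scratch directory):

        find /n/<scratch>/<lab>/<user> -type f -mtime +90    # list files not modified in the last 90 days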

    Thanks,

    FAS Research Computing

    Dept. Website: https://www.rc.fas.harvard.edu/

    Documentation: https://docs.rc.fas.harvard.edu/

    Status Page: https://status.rc.fas.harvard.edu/