FAS Research Computing - Job scheduling issue on Cannon/FASSE – Incident details

Status page for the Harvard FAS Research Computing cluster and other resources.

Cluster Utilization (VPN and FASRC login required): Cannon | FASSE


Please scroll down to see details on any Incidents or maintenance notices.
Monthly maintenance occurs on the first Monday of the month (except holidays).

GETTING HELP
https://docs.rc.fas.harvard.edu | https://portal.rc.fas.harvard.edu | Email: rchelp@rc.fas.harvard.edu



Job scheduling issue on Cannon/FASSE

Resolved
Major outage
Started 2 days ago
Lasted about 2 hours

Affected

Cannon Cluster

Major outage from 3:32 PM to 5:09 PM

SLURM Scheduler - Cannon

Major outage from 3:32 PM to 5:09 PM

FASSE Cluster

Major outage from 3:32 PM to 5:09 PM, Operational from 3:32 PM to 5:09 PM

SLURM Scheduler - FASSE

Major outage from 3:32 PM to 5:09 PM

FASSE Compute Cluster (Holyoke)

Operational from 3:32 PM to 5:09 PM

Updates
  • Resolved

    The root cause is related to a filesystem outage. The filesystem is available again and load has dropped. The clusters and Open OnDemand (OOD) are responding again. If one of your jobs failed during the outage, you will need to resubmit it.

  • Investigating

    Due to high traffic and a job completion storm on Slurm, new jobs are not being scheduled properly. Jobs can be submitted but may not start at this time.

    We are currently investigating this incident.
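
The Resolved update asks users to restart any job that failed. One way to find such jobs is to query Slurm accounting with `sacct` for the outage window and keep only jobs that ended in a failure state. The following is a minimal sketch, not an official FASRC procedure: the helper names and the choice of which states count as "failed" are assumptions made for illustration, and the start and end timestamps must be supplied by you because the incident date is not stated on this page.

```python
import subprocess

# Assumption: these are the states treated as "failed" for this sketch.
FAILED_STATES = ("FAILED", "NODE_FAIL", "TIMEOUT", "CANCELLED")


def build_sacct_command(user, start, end):
    """Build a sacct command listing one line per job for a time window."""
    return [
        "sacct",
        "-u", user,
        "-S", start,   # window start, e.g. YYYY-MM-DDTHH:MM
        "-E", end,     # window end, same format
        "-X",          # one line per job allocation, no job steps
        "-n",          # no header line
        "-P",          # pipe-delimited output without a trailing delimiter
        "-o", "JobID,JobName,State",
    ]


def parse_failed_jobs(sacct_output):
    """Return (job_id, job_name, state) tuples for jobs in a failed state."""
    jobs = []
    for line in sacct_output.splitlines():
        line = line.strip()
        if line.count("|") < 2:
            continue  # skip blank or malformed lines
        # JobID is the first field and State the last; the job name in
        # between may itself contain a "|" character.
        job_id, rest = line.split("|", 1)
        name, state = rest.rsplit("|", 1)
        # State can read "CANCELLED by <uid>", so compare the first word.
        words = state.split()
        if words and words[0] in FAILED_STATES:
            jobs.append((job_id, name, state))
    return jobs


def list_failed_jobs(user, start, end):
    """Run sacct (requires a Slurm login node) and return failed jobs."""
    result = subprocess.run(
        build_sacct_command(user, start, end),
        capture_output=True, text=True, check=True,
    )
    return parse_failed_jobs(result.stdout)
```

On a login node, calling `list_failed_jobs` with your username and timestamps covering 3:32 PM to 5:09 PM on the incident date would return the candidate jobs. Each one can then be resubmitted with `sbatch` using its original batch script.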