Experiencing partially degraded performance

Status page for the Harvard FAS Research Computing cluster and other resources.

Cluster Utilization (VPN and FASRC login required): Cannon | FASSE | Academic


Please scroll down to see details on any Incidents or maintenance notices.
Monthly maintenance occurs on the first Monday of the month (except holidays).

GETTING HELP
https://docs.rc.fas.harvard.edu | https://portal.rc.fas.harvard.edu | Email: rchelp@rc.fas.harvard.edu



Cannon Cluster

Operational

SLURM Scheduler - Cannon

Operational

Cannon Compute Cluster (Holyoke)

Operational

Boston Compute Nodes

Operational

GPU nodes (Holyoke)

Operational

SEAS compute partition

Operational

FASSE Cluster

Operational

SLURM Scheduler - FASSE

Operational

FASSE Compute Cluster (Holyoke)

Operational

Kempner Cluster

Operational

Kempner Cluster CPU

Operational

Kempner Cluster GPU

Operational

Login Nodes

Operational

Login Nodes - Boston

Operational

Login Nodes - Holyoke

Operational

FASSE login nodes

Operational

VDI/OpenOnDemand

Operational

Cannon VDI (Open OnDemand)

Operational

FASSE VDI (Open OnDemand)

Operational

Storage

Degraded performance

Holyscratch01 (Global Scratch)

Degraded performance

Home Directory Storage - Boston

Operational

HolyLFS03 (Tier 0)

Operational

HolyLFS04 (Tier 0)

Operational

HolyLFS05 (Tier 0)

Operational

Holystore01 (Tier 0)

Operational

Holylabs

Operational

BosLFS02 (Tier 0)

Operational

Isilon Storage Boston (Tier 1)

Operational

Isilon Storage Holyoke (Tier 1)

Operational

CEPH Storage Boston (Tier 2)

Operational

Tape (Tier 3)

Operational

Boston Specialty Storage

Operational

Holyoke Specialty Storage

Degraded performance

Samba Cluster

Operational

Globus Data Transfer

Operational

bosECS

Operational

holECS

Operational

Notice history

Aug 2023

Ceph instability - Affects Boston VMs (Virtual Machines) and Tier2 Ceph shares
  • Resolved

    The Ceph instability has been resolved. Ceph Tier2 shares, VDI, and VMs should be back to their normal state.

    If your VM, /net/fs-[labname] share, or VDI session is still impacted, please contact rchelp@rc.fas.harvard.edu

  • Identified

    The infrastructure behind Tier2 Ceph shares and VMs is unstable.
    This also affects VDI/OOD which relies on virtual machines.

    /net/fs-[labname] shares, new OOD/VDI sessions, and VMs are affected and may be inaccessible until this is resolved.

    Thanks for your patience.

FASRC Monthly maintenance August 7, 2023 7am-1pm *NOTE EXTENDED TIME*
  • Completed
    August 07, 2023 at 1:42 PM

    Due to a vendor error, we were unable to complete the holyscratch01 disk shelf replacement. We will work with the vendor to reschedule.

    All other maintenance tasks have completed.

  • In progress
    August 07, 2023 at 11:00 AM

    Maintenance is now in progress

  • Planned
    August 07, 2023 at 11:00 AM

    August maintenance will run August 7, 2023 from 7am-1pm.

    Please note the extended timeframe.
    See tasks section below for explanation.

    NOTICES

    • CentOS 7 Support EOL: We will be dropping support for CentOS 7 in September. If your machine or VM runs CentOS 7 and connects to Slurm, please contact FASRC to discuss options.

    • Test Partition Changes: We are adjusting the test partitions based on changing needs and increasing the maximum time limit from 8 to 12 hours. A reminder that this partition is not for running production jobs.
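
    As an illustration of the new limit, here is a minimal sbatch sketch requesting 12 hours on a test partition. The partition name "test", the resource amounts, and the program name are placeholder assumptions, not FASRC-prescribed values; check sinfo and the documentation for the actual partition names and limits.

      #!/bin/bash
      #SBATCH --partition=test       # placeholder partition name; verify with: sinfo -s
      #SBATCH --time=12:00:00        # new 12-hour maximum (previously 8 hours)
      #SBATCH --ntasks=1
      #SBATCH --mem=4G
      #SBATCH --job-name=short_test

      # Short test or debug runs only; production work belongs on the regular partitions.
      hostname
      ./my_test_program              # placeholder for the code you are testing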

    MAINTENANCE TASKS

    • holyscratch01 Disk Shelf Replacement (All Jobs Will Be Paused)
      -- Audience: All cluster and scratch users (Cannon and FASSE)
      -- Impact: Hardware issues with holyscratch01 necessitate the replacement of one of the disk shelves. As a result, all jobs and scratch will need to be paused for the duration.
      -- ETA: The swap itself is expected to take 3-4 hours, but pausing the cluster, vendor interactions, and allowing a margin for over-run require that we extend maintenance by 2 hours (7am-1pm).

    • Login node and OOD/VDI reboots
      -- Audience: Anyone logged into a login node or VDI/OOD node
      -- Impact: Login and VDI/OOD nodes will be rebooted during this maintenance window.

    • Scratch cleanup ( https://docs.rc.fas.harvard.edu/kb/policy-scratch/ )
      -- Audience: Cluster users
      -- Impact: Files older than 90 days will be removed.
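
    A hedged example of how you might identify files at risk before the cleanup runs; the path below is a placeholder pattern, not your actual scratch directory, so substitute your own lab and user directories.

      # List files under your scratch space that have not been modified in 90+ days.
      # /n/holyscratch01/<lab>/<user> is a placeholder; adjust to your actual layout.
      find /n/holyscratch01/<lab>/<user> -type f -mtime +90 -printf '%TY-%Tm-%Td %p\n' | sort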

    Thanks,
    FAS Research Computing
    Department and Service Catalog: https://www.rc.fas.harvard.edu/
    Documentation: https://docs.rc.fas.harvard.edu/
    Status Page: https://status.rc.fas.harvard.edu/

Jul 2023

Jun 2023

Emergency maintenance 6/20/23 - Cannon Slurm Scheduler
  • Planned
    June 20, 2023 at 11:00 AM

    WHAT
    Emergency Cannon cluster scheduler maintenance Tuesday June 20th from 7am - 11am (actual end time estimated). Running jobs will be paused during maintenance.

    Note: FASSE is unaffected by this.

    WHY
    As most of you know, there are ongoing issues with the Cannon scheduler following the downtime and OS upgrades. This behavior did not show up in our earlier testing and only manifested once we were under load from user jobs again.

    After investigation, we found that the scheduler is oscillating between two states every hour. During even hours the scheduler enters a high-thread state where all the traffic piles up, causing queries to either take a long time or time out. During odd hours the scheduler enters a low-thread state and operates normally. During periods when the scheduler is responding, jobs can be submitted and should then run normally.

    Unfortunately the root cause is still not known, and we are working closely with the vendor, SchedMD, to find a solution and return the cluster to stability. The version of Slurm we are running, 22.05.7, has been very stable (we have run it for the past 5 months with no problems) but appears to be having issues at scale on Rocky 8.

    There is a newer version of the scheduler available, 23.02.3. This new version has a host of improvements. While the specific issue described above is not listed as one of the items fixed in this version, we anticipate that the upgrade may fix this issue owing to the various code improvements contained therein.

    Barring any solutions found over the long weekend, we intend to hold an emergency maintenance period on Tuesday June 20th starting at 7am. Running jobs will be paused. We will return Cannon to service as soon as we've completed work. Status of this maintenance will be tracked here on our status page.

    We thank you for your patience and understanding. Our team is hard at work answering your tickets and making sure that the cluster is back to full health. Also, as a reminder, we hold additional Office Hours each week in June. See: https://www.rc.fas.harvard.edu/training/office-hours/

    FAS Research Computing
    https://www.rc.fas.harvard.edu/
    https://docs.rc.fas.harvard.edu/
    https://status.rc.fas.harvard.edu/

  • Completed
    June 20, 2023 at 1:16 AM

    CANCELLED

    Emergency maintenance has been cancelled. A solution to the scheduler issue has been found and Slurm is once again operating normally.

    We thank you for your patience and understanding.

Slurm scheduler slowness/delays
  • Resolved

    This incident has been resolved.

    A solution to the Slurm scheduler issue has been found and implemented. The scheduler is now operating normally.

    Thank you for your patience and understanding.

  • Monitoring
    Update

    PLEASE SEE EMERGENCY MAINTENANCE NOTICE BELOW
    Emergency maintenance Tuesday June 20th 7am-11am. Running jobs will be paused.


    We are still working on finding the root cause of the scheduler slowness and timeouts. We are working with SchedMD and have sent diagnostic information for analysis.

    At this time the scheduler is up but may become unresponsive to your commands at times. You can wait and retry them.

    Jobs, once launched, should run as expected.

    OOD/VDI are working, but job submission may also be affected.

    If you receive "salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified", waiting a moment and re-submitting your job should result in a successful submission.
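
    If you prefer not to retry by hand, a minimal retry sketch is shown below; the script name, retry count, and wait time are placeholder assumptions, not an FASRC-recommended procedure.

      #!/bin/bash
      # Retry a batch submission a few times if the scheduler is momentarily unresponsive.
      for attempt in 1 2 3 4 5; do
          if sbatch my_job.sbatch; then          # my_job.sbatch is a placeholder script
              echo "Submitted on attempt ${attempt}"
              break
          fi
          echo "Submission failed (attempt ${attempt}); waiting 60s before retrying..."
          sleep 60
      done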

    Addendum: It was determined that the issue is cyclical. You will have the best luck during odd hours (e.g., after 1pm and before 2pm) and less success interfacing with the scheduler during even hours (e.g., after 2pm and before 3pm).

    We regret the impact this is having on your work. Updates will follow as we have them.

  • Monitoring
    Update

    We are still working on finding the root cause of the slowness and are working with the vendor to troubleshoot.

    At this time the scheduler is up but may become unresponsive to your commands at times. Jobs, once launched, should run as expected.

  • Monitoring
    Update

    Slurm is still being overwhelmed with large numbers of requests. VDI timeouts/502 errors, interactive job connections, and job submissions are all impacted.

    We are in contact with the Slurm developers to find a better resolution.

  • Monitoring

    We have implemented updates and the scheduler appears stable and responsive.
    We will continue to monitor and update this incident.

  • Investigating

    We are currently investigating performance issues with the scheduler. This manifests as slowness when submitting jobs, job submissions failing some percentage of the time, and timeout errors (including OOD/VDI).

    Also, some jobs may not exit cleanly, putting nodes into drain status.

    We are working to resolve this issue. Updates to follow.
