FAS Research Computing - Notice History

All Systems Operational

Status page for the Harvard FAS Research Computing cluster and other resources.

Cluster Utilization (VPN and FASRC login required): Cannon | FASSE


Please scroll down to see details on any Incidents or maintenance notices.
Monthly maintenance occurs on the first Monday of the month (except holidays).

GETTING HELP
Documentation: https://docs.rc.fas.harvard.edu | Account Portal https://portal.rc.fas.harvard.edu
Email: rchelp@rc.fas.harvard.edu | Support Hours


The colors shown in the bars below were chosen to increase visibility for color-blind visitors.
For higher contrast, switch to light mode at the bottom of this page if the background is dark and colors are muted.

Operational

SLURM Scheduler - Cannon - Operational

Cannon Compute Cluster (Holyoke) - Operational

Boston Compute Nodes - Operational

GPU nodes (Holyoke) - Operational

seas_compute - Operational

Operational

SLURM Scheduler - FASSE - Operational

FASSE Compute Cluster (Holyoke) - Operational

Operational

Kempner Cluster CPU - Operational

Kempner Cluster GPU - Operational

Operational

FASSE login nodes - Operational

Operational

Cannon Open OnDemand - Operational

FASSE Open OnDemand - Operational

Operational

Netscratch (Global Scratch) - Operational

Home Directory Storage - Boston - Operational

Tape - (Tier 3) - Operational

Holylabs - Operational

Isilon Storage Holyoke (Tier 1) - Operational

Holystore01 (Tier 0) - Operational

HolyLFS04 (Tier 0) - Operational

HolyLFS05 (Tier 0) - Operational

HolyLFS06 (Tier 0) - Operational

Holyoke Tier 2 NFS (new) - Operational

Holyoke Specialty Storage - Operational

holECS - Operational

Isilon Storage Boston (Tier 1) - Operational

BosLFS02 (Tier 0) - Operational

Boston Tier 2 NFS (new) - Operational

CEPH Storage Boston (Tier 2) - Operational

Boston Specialty Storage - Operational

bosECS - Operational

Samba Cluster - Operational

Globus Data Transfer - Operational

Notice History

Jul 2023

Jun 2023

Emergency maintenance 6/20/23 - Cannon Slurm Scheduler
  • Scheduled
    June 20, 2023 at 11:00

    WHAT
    Emergency Cannon cluster scheduler maintenance Tuesday June 20th from 7am - 11am (actual end time estimated). Running jobs will be paused during maintenance.

    Note: FASSE is unaffected by this.

    WHY
    As most of you know, there are ongoing issues with the Cannon scheduler following the downtime and OS upgrades. This behavior did not show up in our earlier testing and only manifested once we were under load from user jobs again.

    After investigation, we found that the scheduler is oscillating between two states every hour. During even hours the scheduler enters a high thread state where all the traffic piles up, causing queries to either take a long time or time out. During odd hours the scheduler enters a low thread state and operates normally. During periods when the scheduler is responding, jobs can be submitted and should then run normally.

    Unfortunately the root cause is still not known, and we are working closely with the vendor, SchedMD, to find a solution that returns the cluster to stability. We are running Slurm 22.05.7. This version has been very stable (we have run it for the past 5 months with no problems), but it appears to be having issues at scale on Rocky 8.

    There is a newer version of the scheduler available, 23.02.3. This new version has a host of improvements. While the specific issue described above is not listed as one of the items fixed in this version, we anticipate that the upgrade may fix this issue owing to the various code improvements contained therein.

    Barring any solutions found over the long weekend, we intend to hold an emergency maintenance period on Tuesday June 20th starting at 7am. Running jobs will be paused. We will return Cannon to service as soon as we've completed work. Status of this maintenance will be tracked here on our status page.

    We thank you for your patience and understanding. Our team is hard at work answering your tickets and making sure that the cluster is back to full health. Also, a reminder that we hold additional Office Hours each week in June. See: https://www.rc.fas.harvard.edu/training/office-hours/

    FAS Research Computing
    https://www.rc.fas.harvard.edu/
    https://docs.rc.fas.harvard.edu/
    https://status.rc.fas.harvard.edu/

  • Completed
    June 20, 2023 at 01:16

    CANCELLED

    Emergency maintenance has been cancelled. A solution to the scheduler issue has been found and Slurm is once again operating normally.

    We thank you for your patience and understanding.

Slurm scheduler slowness/delays
  • Resolved

    This incident has been resolved.

    A solution to the Slurm scheduler issue has been found and implemented. The scheduler is now operating normally.

    Thank you for your patience and understanding.

  • Update

    PLEASE SEE EMERGENCY MAINTENANCE NOTICE BELOW
    Emergency maintenance Tuesday June 20th 7am-11am. Running jobs will be paused.


    We are still working on finding the root cause of the scheduler slowness and timeouts. We are working with SchedMD and have sent diagnostic information for analysis.

    At this time the scheduler is up but may become unresponsive to your commands at times. You can wait and retry them.

    Jobs, once launched, should run as expected.

    OOD/VDI are working, but job submission may also be affected.

    If you receive the error "salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified", waiting a moment and resubmitting your job should result in a successful submission.
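
    The wait-and-resubmit advice above can be automated for batch jobs. The following is a minimal sketch and not an official FASRC tool: the function name submit_with_retry, the attempt count, the delay, and the 60-second timeout are all illustrative assumptions. It matches only the message text quoted in this notice and treats any other failure as permanent.

```python
import subprocess
import time

# Message text quoted in the notice above; matched as a substring.
TRANSIENT_ERROR = "Invalid account or account/partition combination specified"


def submit_with_retry(cmd, attempts=5, delay=30.0,
                      run=subprocess.run, sleep=time.sleep):
    """Run a submission command, retrying on the transient error or a timeout.

    attempts, delay, and the 60 s timeout are illustrative values only.
    run and sleep are injectable so the logic can be tested without Slurm.
    """
    last_error = ""
    for attempt in range(1, attempts + 1):
        try:
            result = run(cmd, capture_output=True, text=True, timeout=60)
        except subprocess.TimeoutExpired:
            # The scheduler did not answer in time; treat as retryable.
            last_error = "timed out"
        else:
            if result.returncode == 0:
                return result.stdout.strip()
            last_error = result.stderr.strip()
            if TRANSIENT_ERROR not in last_error:
                # Any other failure is not the one described in the notice.
                raise RuntimeError(f"submission failed: {last_error}")
        if attempt < attempts:
            sleep(delay)
    raise RuntimeError(f"gave up after {attempts} attempts: {last_error}")
```

    For example, submit_with_retry(["sbatch", "job.sh"]) would resubmit a hypothetical job.sh up to five times, waiting 30 seconds between attempts.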

    Addendum: We have determined that the issue is cyclical. You will have the best luck during odd hours (e.g., between 1pm and 2pm) and less success interfacing with the scheduler during even hours (e.g., between 2pm and 3pm).
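
    The odd-hour pattern described in the addendum can be checked before sending commands to the scheduler. This is a rough sketch under stated assumptions: the helper names are hypothetical, the datetime passed in is assumed to be the cluster's local wall-clock time, and daylight-saving transitions are ignored.

```python
from datetime import datetime, timedelta


def scheduler_likely_responsive(now):
    """Return True during odd clock hours (e.g. 13:00 to 13:59)."""
    return now.hour % 2 == 1


def seconds_until_next_odd_hour(now):
    """Seconds to wait until the next odd hour begins; 0.0 if already in one."""
    if scheduler_likely_responsive(now):
        return 0.0
    # From an even hour, the next full hour is always odd.
    next_hour = now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    return (next_hour - now).total_seconds()
```

    A script could sleep for seconds_until_next_odd_hour(datetime.now()) before submitting jobs, rather than retrying throughout an even hour.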

    We regret the impact this is having on your work. Updates will follow as we have them.

  • Update

    We are still working on finding the root cause of the slowness and are working with the vendor to troubleshoot.

    At this time the scheduler is up but may become unresponsive to your commands at times. Jobs, once launched, should run as expected.

  • Update

    Slurm is still being overwhelmed with large numbers of requests. VDI sessions (timeouts/502 errors), interactive job connections, and job submissions are all impacted.

    We are in contact with the Slurm developers to find a better resolution.

  • Monitoring

    We have implemented updates and the scheduler appears stable and responsive.
    We will continue to monitor and update this incident.

  • Investigating

    We are currently investigating performance issues with the scheduler. This manifests as slowness when submitting jobs, job submissions failing some percentage of the time, and timeout errors (including OOD/VDI).

    Also, some jobs may not exit cleanly, putting nodes into drain status.

    We are working to resolve this issue. Updates to follow.

May 2023

No notices reported this month

May 2023 to Jul 2023