FAS Research Computing - Slurm scheduler slowness/delays – Incident details

Current status: partially degraded performance

Status page for the Harvard FAS Research Computing cluster and other resources.

Cluster Utilization (VPN and FASRC login required): Cannon | FASSE


Please scroll down to see details on any Incidents or maintenance notices.
Monthly maintenance occurs on the first Monday of the month (except holidays).

GETTING HELP
Documentation: https://docs.rc.fas.harvard.edu | Account Portal https://portal.rc.fas.harvard.edu
Email: rchelp@rc.fas.harvard.edu | Support Hours



Slurm scheduler slowness/delays

Resolved
Operational
Started about 3 years ago, lasted 6 days

Affected

Cannon Cluster

Operational from 2:19 PM to 1:28 AM

SLURM Scheduler - Cannon

Operational from 2:19 PM to 1:28 AM

Cannon Compute Cluster (Holyoke)

Operational from 2:19 PM to 1:28 AM

OpenOnDemand/OOD

Operational from 2:19 PM to 1:28 AM

Cannon Open OnDemand

Operational from 2:19 PM to 1:28 AM

Updates
  • Resolved

    This incident has been resolved.

    A solution to the Slurm scheduler issue has been found and implemented. The scheduler is now operating normally.

    Thank you for your patience and understanding.

  • Update

    PLEASE SEE EMERGENCY MAINTENANCE NOTICE BELOW
    Emergency maintenance Tuesday June 20th 7am-11am. Running jobs will be paused.


    We are still working on finding the root cause of the scheduler slowness and timeouts. We are working with SchedMD and have sent diagnostic information for analysis.

    At this time the scheduler is up but may intermittently become unresponsive to your commands. If a command hangs or times out, wait a moment and retry it.

    Jobs, once launched, should run as expected.

    OOD/VDI are working, but job submission may also be affected.

    If you receive the error "salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified", wait a moment and resubmit your job; the resubmission should succeed.

    Addendum: The issue has been found to be cyclical. You will have the best luck interacting with the scheduler during odd-numbered hours (e.g., after 1pm and before 2pm) and less success during even-numbered hours (e.g., after 2pm and before 3pm).

    We regret the impact this is having on your work. Updates will follow as we have them.

  • Update

    We are still working on finding the root cause of the slowness and are working with the vendor to troubleshoot.

    At this time the scheduler is up but may become unresponsive to your commands at times. Jobs, once launched, should run as expected.

  • Update

    Slurm is still being overwhelmed by large numbers of requests. VDI sessions (timeouts and 502 errors), interactive job connections, and job submissions are all impacted.

    We are in contact with the Slurm developers to find a better resolution.

  • Monitoring

    We have implemented updates and the scheduler appears stable and responsive.
    We will continue to monitor and update this incident.

  • Investigating

    We are currently investigating performance issues with the scheduler. Symptoms include slowness when submitting jobs, intermittent job submission failures, and timeout errors (including in OOD/VDI).

    Also, some jobs may not exit cleanly, putting nodes into drain status.

    We are working to resolve this issue. Updates to follow.
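
Several of the updates above advise waiting a moment and retrying when the scheduler does not respond or rejects a submission. For scripted workflows, that advice could be sketched roughly as follows. This is a minimal illustration in Python, not an FASRC-provided tool: the function name run_with_retry, the attempt count, the delay values, and the script name job.sh are all assumptions chosen for the example.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=5, base_delay=1.0, max_delay=60.0,
                   timeout=120, sleep=time.sleep, run=subprocess.run):
    """Run cmd, retrying with exponential backoff on failure or timeout.

    cmd is an argument list such as ["sbatch", "job.sh"] (hypothetical script).
    sleep and run are injectable so the logic can be exercised without Slurm.
    """
    delay = base_delay
    last_error = "no attempts made"
    for attempt in range(1, attempts + 1):
        try:
            result = run(cmd, capture_output=True, text=True, timeout=timeout)
            if result.returncode == 0:
                return result
            last_error = (result.stderr or "").strip()
        except subprocess.TimeoutExpired:
            # The command hung; treat it like any other failed attempt.
            last_error = "timed out after {} s".format(timeout)
        if attempt < attempts:
            # Wait before retrying, doubling the pause up to max_delay.
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise RuntimeError(
        "command failed after {} attempts: {}".format(attempts, last_error)
    )


# Example call (assumes sbatch is on PATH and job.sh exists):
#   result = run_with_retry(["sbatch", "job.sh"])
#   print(result.stdout)
```

Backing off between attempts, rather than retrying immediately in a tight loop, avoids adding further load to a scheduler that is already overwhelmed with requests.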