FAS Research Computing - Notice History

Experiencing partially degraded performance

Status page for the Harvard FAS Research Computing cluster and other resources.

Cluster Utilization (VPN and FASRC login required): Cannon | FASSE


Please scroll down to see details on any Incidents or maintenance notices.
Monthly maintenance occurs on the first Monday of the month (except holidays).

GETTING HELP
Documentation: https://docs.rc.fas.harvard.edu | Account Portal https://portal.rc.fas.harvard.edu
Email: rchelp@rc.fas.harvard.edu | Support Hours


The colors shown in the bars below were chosen to increase visibility for color-blind visitors.
For higher contrast, switch to light mode at the bottom of this page if the background is dark and colors are muted.

Degraded performance

SLURM Scheduler - Cannon - Degraded performance

Cannon Compute Cluster (Holyoke) - Degraded performance

Boston Compute Nodes - Degraded performance

GPU nodes (Holyoke) - Degraded performance

seas_compute - Degraded performance

Operational

SLURM Scheduler - FASSE - Operational

FASSE Compute Cluster (Holyoke) - Operational

Operational

Kempner Cluster CPU - Operational

Kempner Cluster GPU - Operational

Operational

FASSE login nodes - Operational

Operational

Cannon Open OnDemand - Operational

FASSE Open OnDemand - Operational

Operational

Netscratch (Global Scratch) - Operational

Home Directory Storage - Boston - Operational

Tape - (Tier 3) - Operational

Holylabs - Operational

Isilon Storage Holyoke (Tier 1) - Operational

Holystore01 (Tier 0) - Operational

HolyLFS04 (Tier 0) - Operational

HolyLFS05 (Tier 0) - Operational

HolyLFS06 (Tier 0) - Operational

Holyoke Tier 2 NFS (new) - Operational

Holyoke Specialty Storage - Operational

holECS - Operational

Isilon Storage Boston (Tier 1) - Operational

BosLFS02 (Tier 0) - Operational

Boston Tier 2 NFS (new) - Operational

CEPH Storage Boston (Tier 2) - Operational

Boston Specialty Storage - Operational

bosECS - Operational

Samba Cluster - Operational

Globus Data Transfer - Operational

Notice History

March 2025

FASRC monthly maintenance - Monday March 3rd, 2025 from 9am-1pm
  • Update
    March 3, 2025 at 18:00
    Maintenance has completed successfully
  • Completed
    March 3, 2025 at 18:00
    Maintenance has completed successfully
  • In progress
    March 3, 2025 at 14:00
    Maintenance is now in progress
  • Not started
    March 3, 2025 at 14:00

    PLEASE NOTE - New time window going forward - 9am-1pm

    FASRC monthly maintenance will take place Monday March 3rd, 2025 from 9am-1pm

    NOTICES

    • Training: Upcoming training from FASRC and other sources can be found on our Training Calendar at https://www.rc.fas.harvard.edu/upcoming-training/

    • Status Page: You can subscribe to our status to receive notifications of maintenance, incidents, and their resolution at https://status.rc.fas.harvard.edu/ (click Get Updates for options).

    • Upcoming holidays: Memorial Day - Monday, May 26

    • You can subscribe to our status page using the Get Updates button in the upper right

    MAINTENANCE TASKS
    Cannon cluster will be paused during this maintenance?: YES
    FASSE cluster will be paused during this maintenance?: YES

    • Slurm Upgrade to 24.11.2 - Crucial Update

      • Audience: All cluster users

      • Impact: Jobs and the scheduler will be paused during this upgrade

    • Open OnDemand (OOD) reboots

      • Audience: All OOD users

      • Impact: All Open OnDemand (aka OOD/VDI/RCOOD) nodes will be rebooted

    • Login node reboots

      • Audience: Anyone logged into a FASRC Cannon or FASSE login node

      • Impact: Login nodes will be rebooted during this maintenance window

    • bos-Isilon firmware updates

      • Audience: bos-isilon users

      • Impact: No noticeable impact for storage users

    • Netscratch retention/cleanup ( https://docs.rc.fas.harvard.edu/kb/policy-scratch/ )

      • Audience: Cluster users

      • Impact: Files older than 90 days will be removed. Please note that retention cleanup can and does run at any time, not just during the maintenance window.
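
For users who want to check which of their scratch files might fall outside the 90-day window before cleanup runs, a minimal Python sketch follows. It assumes age is measured by file modification time, which this notice does not specify; the scratch policy page linked above is the authoritative source. The function name `files_older_than` and its parameters are illustrative and not part of any FASRC tool.

```python
import os
import time
from pathlib import Path

RETENTION_DAYS = 90  # retention window stated in the notice


def files_older_than(root, days=RETENTION_DAYS, now=None):
    """Return files under `root` last modified more than `days` days ago.

    Assumption: age is judged by modification time (mtime). The actual
    cleanup may use a different timestamp.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # seconds in a day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                # lstat so that symlinks are judged by their own timestamp
                mtime = path.lstat().st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime < cutoff:
                stale.append(path)
    return sorted(stale)
```

Calling `files_older_than("/path/to/your/scratch/dir")` with your own directory returns a sorted list of candidate paths. The sketch only reports files and never deletes anything.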

    Thank you,
    FAS Research Computing
    https://www.rc.fas.harvard.edu/
    https://docs.rc.fas.harvard.edu/

February 2025

Slurm performance issues - detailed report
  • Resolved

    An emergency patch of the scheduler has resolved the Multiple Partition issue

  • Investigating

    Since mid-January we have seen the scheduler periodically stall or become unresponsive. We had hoped that the Slurm upgrade to 24.11.1 would resolve these issues because of various architecture changes in its communications backend. Unfortunately it did not, and we have since opened an issue with SchedMD (our service vendor for the scheduler). That investigation has uncovered several other scheduler issues, which we are working to remediate. Below is a status report on these issues:

    1. High Agent Load Stall (RESOLVED): This was reported in https://support.schedmd.com/show_bug.cgi?id=21975. The scheduler would stall because it was oversaturated with blocking requests. The cause turned out to be a new Slurm feature called stepmgr, which we had enabled to handle jobs with many steps. Unfortunately this feature also increased the load on the scheduler when array jobs exited at the same time, which caused the stall. Since few of our users run jobs with many steps, we opted to disable the stepmgr function. This resolved the High Agent Load issue. Users whose jobs have many steps may still turn on stepmgr for a specific job by adding #SBATCH --stepmgr (https://slurm.schedmd.com/sbatch.html#OPT_stepmgr).

    2. Scheduler Thrashing (MONITORING): We discovered this while working on the previous bug and continued to work on it in the same bug report: https://support.schedmd.com/show_bug.cgi?id=21975. Under high load, the scheduler would get into a thrashing state in which it effectively ignored incoming requests in order to focus on scheduling jobs. To users the scheduler appeared unresponsive, because it was setting their requests aside to deal with higher-priority traffic. To remediate this we increased the scheduler's thread count and implemented a throttle that slows things down so the scheduler can respond to all requests without impacting scheduler throughput. This is in place now and appears to have resolved the issue. We are continuing to monitor the scheduler to tune this throttle.

    3. --test-only requeue crash (RESOLVED): During this investigation we also ran into another bug reported by another group related to jobs that were submitted using --test-only that would in theory preempt other jobs (see: https://support.schedmd.com/show_bug.cgi?id=21997). This caused the scheduler to crash. Given the severity of the bug we emergency patched the scheduler on Feb 12th to resolve this issue.

    4. Multiple Partition Jobs Labelled with Wrong Partition (IN PROGRESS): This is a new issue identified on 2/13 related to jobs that submit to multiple partitions at once (https://support.schedmd.com/show_bug.cgi?id=22076). When the job schedules it may run in one partition but be labelled as being in another. This can lead to job preemption issues as the jobs are labelled as being in partitions that cannot be preempted even though they were originally scheduled in partitions that could be. This was identified earlier by another group and SchedMD is working on a patch. Depending on the timing FASRC will either emergency patch the scheduler for this issue or wait for the formal release of 24.11.2. Note that this issue really only impacts preemption and the scheduler is working fine otherwise. If you see jobs that you think should be preempted but are not and are blocking your work please let us know and we will investigate.


    Thank you for your patience as we work through these issues.

January 2025

Network issues affecting VPN, portal, and potentially other services
  • Resolved
    The network issues have been resolved.
  • Identified

    Most services restored. Some VPN connectivity issues or lag may still exist for some users.

    Networking expects to have this fully resolved very soon.

  • Investigating

    We are currently investigating this issue.

    We've identified that some users are unable to connect to VPN.

    OOD/OpenOnDemand access is affected.

    Other symptoms: the FASRC websites, portal.rc.fas.harvard.edu, and other internet-facing sites (coldfront, spinal, minilims, etc.) are not accessible.


    SSH to/from nodes or login may be affected or laggy.

    Networking is investigating.

Portal is partially unavailable
  • Resolved
    Portal is operating normally.
  • Monitoring

    Portal is online, but requires brief maintenance before approvers can use it.

  • Investigating

    portal.rc.fas.harvard.edu is unavailable. We are currently investigating this issue.

January 2025 to March 2025
