FAS Research Computing - Notice History

Partially Degraded Service

Status page for the Harvard FAS Research Computing cluster and other resources.

Cluster Utilization (VPN and FASRC login required): Cannon | FASSE


Please scroll down to see details on any Incidents or maintenance notices.
Monthly maintenance occurs on the first Monday of the month (except holidays).

GETTING HELP
Documentation: https://docs.rc.fas.harvard.edu | Account Portal https://portal.rc.fas.harvard.edu
Email: rchelp@rc.fas.harvard.edu | Support Hours



Degraded Performance

SLURM Scheduler - Cannon - Degraded Performance

Cannon Compute Cluster (Holyoke) - Degraded Performance

Boston Compute Nodes - Degraded Performance

GPU nodes (Holyoke) - Degraded Performance

seas_compute - Degraded Performance

Operational

SLURM Scheduler - FASSE - Operational

FASSE Compute Cluster (Holyoke) - Operational

Operational

Kempner Cluster CPU - Operational

Kempner Cluster GPU - Operational

Operational

FASSE login nodes - Operational

Operational

Cannon Open OnDemand - Operational

FASSE Open OnDemand - Operational

Operational

Netscratch (Global Scratch) - Operational

Home Directory Storage - Boston - Operational

Tape - (Tier 3) - Operational

Holylabs - Operational

Isilon Storage Holyoke (Tier 1) - Operational

Holystore01 (Tier 0) - Operational

HolyLFS04 (Tier 0) - Operational

HolyLFS05 (Tier 0) - Operational

HolyLFS06 (Tier 0) - Operational

Holyoke Tier 2 NFS (new) - Operational

Holyoke Specialty Storage - Operational

holECS - Operational

Isilon Storage Boston (Tier 1) - Operational

BosLFS02 (Tier 0) - Operational

Boston Tier 2 NFS (new) - Operational

CEPH Storage Boston (Tier 2) - Operational

Boston Specialty Storage - Operational

bosECS - Operational

Samba Cluster - Operational

Globus Data Transfer - Operational

Notice History

June 2024

FASRC websites unavailable
  • Resolved

    This incident has been resolved. Both sites are working normally.

  • Investigating

    https://www.rc.fas.harvard.edu/ and https://docs.rc.fas.harvard.edu/ are offline.

    We are currently investigating this issue.

MGHPCC Pod 8A Power Upgrade June 24 will idle some Cannon nodes
  • Completed
    June 25, 2024 at 04:00
    Maintenance has completed successfully
  • In progress
    June 24, 2024 at 16:01
    Maintenance is now in progress
  • Scheduled
    June 24, 2024 at 04:01

    MGHPCC will be performing power upgrades on Pod 8A to increase density and allow more nodes to be added in that Pod's rows. As with the May 13th work, we will be idling half the nodes in 8A on two dates: June 17th and June 24th.

    These are all-day events: the affected nodes will be unavailable for the full 24 hours of each day. The idling is done via reservations, so no jobs will be canceled, but nodes will be drained and jobs may pend longer than normal as the scheduler idles these nodes.

    Where possible, please use or include other partitions in your job scripts and plan accordingly for any new or long-running jobs during that period: https://docs.rc.fas.harvard.edu/kb/running-jobs/#Slurm_partitions
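
    As a rough illustration of listing more than one partition in a job script, here is a minimal sketch. It relies on Slurm's documented behavior that `--partition` accepts a comma-separated list and the job starts in whichever listed partition can run it first. The job name, resource requests, and the `report_partition` helper are hypothetical, and the partition names are only examples taken from elsewhere on this page; substitute partitions you actually have access to that are not in the impacted list.

```shell
#!/bin/bash
# Hypothetical job name, for illustration only.
#SBATCH --job-name=example_job
# Comma-separated list: Slurm uses whichever partition can start the job earliest.
# These names are examples; replace them with partitions available to you.
#SBATCH --partition=shared,serial_requeue
# Time limit in D-HH:MM format (assumed value).
#SBATCH --time=0-04:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G

# Print the partition the job actually landed in. Slurm sets
# SLURM_JOB_PARTITION inside a running job; outside Slurm it is unset.
report_partition() {
  echo "Job running in partition: ${SLURM_JOB_PARTITION:-unknown}"
}

report_partition
```

    Submitted with `sbatch`, the script's output shows which of the listed partitions was chosen, which can help confirm that the fallback is working during the maintenance window.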

    This affects the Cannon cluster. FASSE is not affected.

    Impacted partitions are:

    arguelles_delgado_gpu

    bigmem_intermediate

    bigmem

    blackhole_gpu

    eddy

    enos

    gershman

    gpu

    hejazi

    hernquist_ice

    hoekstra

    hsph

    huce_ice

    iaifi_gpu

    iaifi_gpu_priority

    iaifi_gpu_requeue

    intermediate

    itc_gpu

    itc_gpu_requeue

    joonholee

    jshapiro

    jshapiro_priority

    jshapiro_sapphire

    kempner

    kempner_dev

    kempner_h100

    kempner_requeue

    kempner_reservation

    kovac

    kozinsky

    kozinsky_gpu

    kozinsky_priority

    kozinsky_requeue

    murphy_ice

    ortegahernandez_ice

    sapphire

    seas_compute

    seas_gpu

    siag

    siag_combo

    siag_gpu

    sur

    test

    yao

    yao_priority

    zhuang

May 2024

NESE Tape unavailable due to maintenance, ETA now Monday
  • Resolved

    NESE maintenance is complete and the tape service is back in service.

  • Monitoring

    A note from NESE. Their maintenance has been delayed by hardware issues. ETA now Monday 6/3

    Dear All,

    NESE Tape system upgrade is currently in progress. While IBM hardware
    team works on TS4500 library tape frame expansion and IBM software team
    works on ESS and Archive software and firmware upgrades, the work
    progress has been slowed down due to unforeseen hardware issues.

    We now expect to bring the tape service back into production this Monday
    morning. We apologize for any inconvenience caused by the delay.

  • Identified

    Due to maintenance at our tape partner, NESE (Northeast Storage Exchange), access to tape allocations will be unavailable until at least late Thursday (5/30). Normal operations will resume by Friday (5/31).

    If you continue to have issues with a Globus tape endpoint on Friday, please contact FASRC or NESE.

Annual MGHPCC/Holyoke data center power downtime - May 21-24 2024
  • Completed
    May 24, 2024 at 21:50

    2024 MGHPCC downtime complete

    DOWNTIME COMPLETE

    The annual multi-day power downtime at MGHPCC (https://www.rc.fas.harvard.edu/blog/2024-mghpcc-power-downtime/) is complete (with any exceptions noted below). Normal service resumes today (Friday May 24th) at 5pm.

    The cluster has been updated to Rocky Linux 8.9. Several network, InfiniBand, computer, and storage firmware updates were installed. Available security updates were also installed.

    CANNON NODES

    More than 90% of nodes are up and all partitions are enabled. If your specialty partition has a downed node, we will attend to this on Tuesday.

    FASSE OOD

    Some updates are still propagating. If your FASSE Open OnDemand/VDI session does not work initially, please wait or retry your job/session.

    POST-DOWNTIME SUPPORT

    If you have any further concerns or unanswered questions please submit a help ticket (https://portal.rc.fas.harvard.edu/rcrt/submit_ticket) and we will do our best to respond quickly. Please bear in mind it is a long weekend, so lingering issues may not be dealt with until Tuesday.

    Also, have a good long Memorial Day weekend!

    Thanks,

    FAS Research Computing

    https://www.rc.fas.harvard.edu/

    https://docs.rc.fas.harvard.edu/

    https://status.rc.fas.harvard.edu/

    rchelp@rc.fas.harvard.edu  

  • Update
    May 24, 2024 at 21:08

    Opening the cluster is currently delayed due to some lingering issues.

    We will re-open as soon as possible or update again at 6pm.

  • Update
    May 24, 2024 at 13:47

    Power work completed by facility. Currently on schedule for powerup and return to service. ETA 5pm.

  • In progress
    May 21, 2024 at 13:00
    Maintenance is now in progress
  • Scheduled
    May 21, 2024 at 13:00

    The 2024 MGHPCC data center annual power downtime will take place May 21-24, 2024.

    We will begin our shutdown on Tuesday May 21st and expect a return to service by 5PM Friday May 24th.

    - Jobs: Please plan ahead, as all jobs still running on the morning of May 21st will be stopped and canceled and will need to be resubmitted after the downtime. Pending jobs will remain in the queue until the cluster returns to regular service on May 24th.

    - Access: The cluster, scheduler, login, and OOD nodes will be unavailable for the duration of the downtime. New lab and account requests should wait until after the downtime.

    - Storage: All Holyoke storage will be powered down and unavailable for the duration of the downtime. Boston storage will remain online, but your ability to access it may be impacted and network changes may briefly affect its availability.

    Further details, an explanation for this year's change in scheduling, a visual timeline, and an overview of maintenance tasks can be found at:

    https://www.rc.fas.harvard.edu/blog/2024-mghpcc-power-downtime/

    Progress of the downtime will be posted here on our status page during the event. Note that you can subscribe to receive updates as they happen. Click Get Updates in the upper right.

    MAJOR TASK OVERVIEW

    • OS upgrade to Rocky 8.9 - Point upgrade, no code rebuilds will be required. Switch from system OFED to Mellanox OFED on nodes for improved performance

    • InfiniBand (network) upgrades

    • BIOS updates (various)

    • Storage firmware updates

    • Network Maintenance

    • Decommission old nodes (targets contacted)

    • Additional minor one-off updates and maintenance (cable swap, reboots, etc.)

    Thanks,

    FAS Research Computing

    https://www.rc.fas.harvard.edu/

    https://docs.rc.fas.harvard.edu/

    https://status.rc.fas.harvard.edu/

Many nodes in 8A down - affects sapphire, test, bigmem, and other partitions
  • Resolved
    This incident has been resolved.
  • Investigating

    We are still unable to resolve the issue with these nodes and are working with the facility, networking, and our staff to find a solution. The affected partitions (noted in the previous update below) will be resource-constrained and will continue to be slow or unable to queue new jobs.

    If you are using a partition that cannot queue new jobs, please consider adding additional partitions to your job: https://docs.rc.fas.harvard.edu/kb/running-jobs/#Slurm_partitions

    Also, a reminder that the data center power downtime will begin Tuesday morning, so any new jobs scheduled to run for more than 3 days will not complete: https://www.rc.fas.harvard.edu/blog/2024-mghpcc-power-downtime/

  • Identified

    We are still working on the root cause and resolution for these downed nodes.

    Partitions with one or more affected nodes (multiple nodes are affected unless marked with (1)):

    arguelles_delgado_gpu (1)

    hsph

    joonholee

    jshapiro_sapphire

    lichtmandce01

    bigmem

    gpu_requeue (1)

    intermediate

    sapphire

    serial_requeue (1)

    shared (1)

    test

    yao / yao_priority

    Use 'sinfo -p [partition name]' if you wish to see the down nodes in a particular partition.
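
    For anyone who wants a slightly more targeted view than plain `sinfo -p`, the sketch below wraps two standard `sinfo` invocations in a helper. The function name `show_down_nodes` is hypothetical; the options used (`-p`, `-t`, `-N`, `-l`, `-R`) are standard `sinfo` options, but the exact output depends on the cluster's Slurm version and configuration.

```shell
#!/bin/bash
# Hypothetical helper: list unavailable nodes in one partition.
show_down_nodes() {
  local partition="${1:?usage: show_down_nodes PARTITION}"

  # sinfo only exists where Slurm is installed (e.g. a cluster login node).
  if ! command -v sinfo >/dev/null 2>&1; then
    echo "sinfo not found; run this on a cluster login node" >&2
    return 1
  fi

  # -t filters by node state; -N prints one line per node; -l adds detail.
  sinfo -p "$partition" -t down,drained,draining -N -l

  # -R lists the reason recorded for each down or drained node.
  sinfo -p "$partition" -R
}
```

    For example, `show_down_nodes sapphire` would list the down or drained nodes in the sapphire partition, followed by the reasons recorded for them.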

  • Investigating

    We are currently investigating this incident. An unknown outage has downed many nodes in row 8A of our data center. More information to follow.

    Includes nodes from the sapphire, test, gpu, and other partitions.
