FAS Research Computing - Notice history

Experiencing partially degraded performance

Status page for the Harvard FAS Research Computing cluster and other resources.

Cluster Utilization (VPN and FASRC login required): Cannon | FASSE


Please scroll down to see details on any Incidents or maintenance notices.
Monthly maintenance occurs on the first Monday of the month (except holidays).

GETTING HELP
Documentation: https://docs.rc.fas.harvard.edu | Account Portal https://portal.rc.fas.harvard.edu
Email: rchelp@rc.fas.harvard.edu | Support Hours



Degraded performance

SLURM Scheduler - Cannon - Degraded performance

Cannon Compute Cluster (Holyoke) - Degraded performance

Boston Compute Nodes - Degraded performance

GPU nodes (Holyoke) - Degraded performance

seas_compute - Degraded performance

Operational

SLURM Scheduler - FASSE - Operational

FASSE Compute Cluster (Holyoke) - Operational

Operational

Kempner Cluster CPU - Operational

Kempner Cluster GPU - Operational

Operational

FASSE login nodes - Operational

Operational

Cannon Open OnDemand/VDI - Operational

FASSE Open OnDemand/VDI - Operational

Operational

Netscratch (Global Scratch) - Operational

Home Directory Storage - Boston - Operational

Tape - (Tier 3) - Operational

Holylabs - Operational

Isilon Storage Holyoke (Tier 1) - Operational

Holystore01 (Tier 0) - Operational

HolyLFS04 (Tier 0) - Operational

HolyLFS05 (Tier 0) - Operational

HolyLFS06 (Tier 0) - Operational

Holyoke Tier 2 NFS (new) - Operational

Holyoke Specialty Storage - Operational

holECS - Operational

Isilon Storage Boston (Tier 1) - Operational

BosLFS02 (Tier 0) - Operational

Boston Tier 2 NFS (new) - Operational

CEPH Storage Boston (Tier 2) - Operational

Boston Specialty Storage - Operational

bosECS - Operational

Samba Cluster - Operational

Globus Data Transfer - Operational

Notice history

June 2025

holylabs - New data missing
  • Resolved

    Allowed one week for the message to propagate. Closing this incident.

  • Identified

    While attempting to correct the over-quota/extra data issue on holylabs, an error in the sync command deleted files created since the cluster reopened (6/5/25 9AM) in 54 lab directories. We see no evidence that any other lab directories were affected.

    Because of the size of the original cleanup, and because the error was discovered after the fact, these deleted files regrettably cannot be recovered.

    A list of the affected /n/holylabs lab directories follows. If your lab is not on this list, it has not been identified as affected by this error:
    acc_lab
    alvarez_lab
    avillar_lab
    barnett_lab
    bertoldi_lab
    bol_lab
    brenner_lab
    cgolden_lab
    charbonneau_lab
    charrison_lab
    chetty_lab
    cnelya_lab
    dam_lab
    doshi-velez_lab
    eisenstein_lab
    enos_lab
    eps_preceptors
    glassman_lab
    hanson_lab
    hekstra_lab
    holbrook_lab
    iaifi_lab
    idreos_lab
    iebecker_lab
    imai_lab
    jacobsen_lab
    jialiu_lab
    junweil_lab
    kaxiras_lab
    kdbrantley_lab
    kempner_dev
    king_lab
    kiyoul_lab
    konkle_lab
    koumoutsakos_lab
    kozinsky_lab
    kramer_lab
    maustern_lab
    nliu_lab
    pallais_lab
    park_lab
    pierce_lab
    protopapas_lab
    pslade_lab
    shro_lab
    sitanc_lab
    smousavih_lab
    sneel_lab
    snyder_lab
    sompolinsky_lab
    tamano_lab
    ylei_lab
    zickler_lab

  • Investigating

    We are currently investigating an issue on holylabs where some labs have noticed newly created files are missing.

    We will update this incident with more info as soon as possible.

2025 MGHPCC power downtime June 2-4, 2025
  • Completed
    June 05, 2025 at 13:00
    Maintenance has completed successfully
  • In progress
    June 02, 2025 at 13:00
    Maintenance is now in progress
  • Not started yet
    June 02, 2025 at 13:00

    The yearly power downtime at our Holyoke data center, MGHPCC, has been scheduled.
    This year's power downtime will take place on Tuesday, June 3, 2025.

    This will require FASRC to begin shutting down our systems at 9AM on Monday, June 2nd.
    We have worked to reduce the total outage time this year.
    We will begin power-up on Wednesday, June 4th, with an expected return to full service by 9AM Thursday, June 5th.

    • Monday June 2nd - Power-down begins at 9AM

    • Tuesday June 3rd - Power out at MGHPCC

    • Wednesday June 4th - Maintenance tasks and then power-up begins

    • Thursday June 5th - Expected return to full service by 9AM

    Maintenance:
    During this downtime, Holylabs (/n/holylabs) will move to new hardware.
    Starfish, Coldfront, and the Portal will be unavailable during the downtime.

    For more details including a graphical timeline, please see: https://www.rc.fas.harvard.edu/events/2025-mghpcc-power-downtime/

    Updates will be posted here on our status page: https://status.rc.fas.harvard.edu/
    Note that you can subscribe to receive updates as they happen. On the status page, click Get Updates.

    Notices and reminders will also be sent to all users via our mailing lists.

May 2025

MGHPCC power work 5/21 - 5/23 - Some partitions will be at half capacity
  • Completed
    May 23, 2025 at 19:00
    Maintenance has completed successfully
  • In progress
    May 21, 2025 at 11:00
    Maintenance is now in progress
  • Not started yet
    May 21, 2025 at 11:00

    The MGHPCC Holyoke data center will be performing power work on May 21st-23rd. This work will take out one half (or one 'side') of the power capacity for certain rows/racks, including our compute rows. Because of our power draw, one side does not supply enough power to keep each full rack running.

    As such, we will be adding a reservation to idle half the nodes in the partitions listed below. A reservation causes nodes to drain as jobs complete and prevents new jobs from being scheduled on those nodes if they cannot finish before the outage. This will allow us to idle and power down those nodes prior to the work and avoid a potential blackout/brownout on those racks.
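The scheduling rule described above can be sketched as a simple time check. This is a simplified model for illustration, not the scheduler's actual implementation: the function name is hypothetical, and the reservation start is assumed to equal the maintenance start time shown on this page (May 21, 2025 at 11:00, in the page's display time zone).

```python
from datetime import datetime, timedelta

# Assumption: the reservation begins at the maintenance start shown on this
# page (May 21, 2025 at 11:00, in the page's display time zone).
RESERVATION_START = datetime(2025, 5, 21, 11, 0)


def can_start_on_reserved_node(
    now: datetime,
    time_limit: timedelta,
    reservation_start: datetime = RESERVATION_START,
) -> bool:
    """Return True if a job starting now would end before the reservation."""
    return now + time_limit <= reservation_start


if __name__ == "__main__":
    now = datetime(2025, 5, 20, 11, 0)  # 24 hours before the reservation
    print(can_start_on_reserved_node(now, timedelta(hours=12)))  # True
    print(can_start_on_reserved_node(now, timedelta(days=3)))    # False
```

In practice this means a job with a shorter requested time limit is more likely to be placed on a node that is about to drain.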

    This will mean that these partitions will be up and available, but that half the nodes from each will be down (assuming an even number of nodes).

    This work is part of an ongoing power capacity upgrade at MGHPCC. We expect this to be the last power work needed; the facility will then provide enough additional power for future expansion as well as overhead for the current load.

    The affected partitions are:

    • arguelles_delgado

    • bigmem_intermediate

    • blackhole_gpu

    • eddy

    • gershman

    • hejazi

    • hernquist

    • hoekstra

    • huce_ice

    • iaifi_gpu

    • iaifi_gpu_requeue

    • iaifi_priority

    • jshapiro

    • jshapiro_priority

    • kempner

    • kempner_requeue

    • kempner_h100

    • kempner_h100_priority

    • kempner_h100_priority2

    • kovac

    • kozinsky

    • kozinsky_gpu

    • kozinsky_requeue

    • ortegahernandez_ice

    • rivas

    • seas_compute

    • seas_gpu

    • siag_combo

    • siag_gpu

    • sur

    • zhuang

April 2025

Login nodes temporarily down
  • Resolved

    Cannon boslogin and FASSE login nodes are back up and operational.

    All holylogin nodes are still down for repair. Please see our posted incident for more updates: https://status.rc.fas.harvard.edu/cm97gyay90013dturk7fxg5pb

    We apologize for the unexpected disruption.

  • Investigating

    Due to a configuration error, all cluster login nodes are rebooting and are temporarily unavailable. Please save any work immediately.

holylogin[05-08] down
  • Resolved
    Hardware has been repaired and the Holyoke login nodes are back online. Thanks for your patience.
  • Monitoring

    Holylogin chassis repair during maintenance was unsuccessful and replacement parts have been ordered.

    • Holyoke login nodes (holylogin05-08) are down for hardware repair

    • Only Boston login nodes are available (i.e., boslogin[05-08])

    If you have holylogin hard-coded in your scripts, please update to login.rc.fas.harvard.edu or boslogin.rc.fas.harvard.edu for the time being, which will redirect you to an available login node.

    As always, the best method for obtaining a login node is using login.rc.fas.harvard.edu, which will pick a node for you.

    If you require a login node in a specific data center, use boslogin.rc.fas.harvard.edu (Boston) or (once they are back in service) holylogin.rc.fas.harvard.edu (Holyoke).

    See also: Command line access with Terminal (login nodes) – FASRC DOCS
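For scripts that hard-code a Holyoke login host, the substitution recommended above could be automated along these lines. This is a minimal sketch: the function name and the example user name are hypothetical, and the numbered form (for example holylogin05.rc.fas.harvard.edu) is an assumed hostname pattern based on the node names listed in this notice.

```python
import re

# Matches holylogin.rc.fas.harvard.edu and, as an assumption, numbered
# variants such as holylogin05.rc.fas.harvard.edu.
_HOLYLOGIN = re.compile(r"\bholylogin(?:\d{2})?\.rc\.fas\.harvard\.edu\b")


def use_generic_login(text: str, replacement: str = "login.rc.fas.harvard.edu") -> str:
    """Replace hard-coded Holyoke login hosts with the generic login name."""
    return _HOLYLOGIN.sub(replacement, text)


if __name__ == "__main__":
    # "jharvard" is a placeholder user name used only for illustration.
    line = "ssh jharvard@holylogin.rc.fas.harvard.edu"
    print(use_generic_login(line))
```

Passing `replacement="boslogin.rc.fas.harvard.edu"` would pin the script to the Boston login nodes instead.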

  • Resolved

    This incident was posted by mistake.

    holylogin01-04 were replaced by holylogin05-08 some time back.

    As always, the best method for obtaining a login node is using login.rc.fas.harvard.edu, which will pick a node for you.

    Or if you require a login node in a specific data center, use boslogin.rc.fas.harvard.edu (Boston) or holylogin.rc.fas.harvard.edu (Holyoke).

    See also: Command line access with Terminal (login nodes) – FASRC DOCS

  • Investigating

    Holylogin chassis repair during maintenance was unsuccessful and replacement parts have been ordered.

    Audience:

    • All cluster users

    Impact:

    • All holylogin** servers will be down till further notice

    • Only Boston login nodes are available (i.e., boslogin[05-08])

    If you have holylogin hard-coded in your scripts, please update to login.rc.fas.harvard.edu or boslogin.rc.fas.harvard.edu for the time being, which will redirect you to an available login node.

    Updates to follow as we have them.

April 2025 to June 2025
