All systems are operational

About This Site

GETTING HELP

https://docs.rc.fas.harvard.edu | https://portal.rc.fas.harvard.edu/rcrt/submit_ticket | Email: rchelp@rc.fas.harvard.edu


Status page for the Harvard FAS Research Computing cluster and other resources.

Please scroll down to see details on any Incidents or maintenance notices.

Past Incidents

16th April 2021

CEPH Storage Boston (Tier 2): CEPH/VM performance issues

We are currently investigating a performance issue which affects our CEPH infrastructure. This will impact some fileshares (the scope is TBD) as well as many/most VMs.

  • CEPH storage and VMs are back online. The root cause of this issue has not been identified. We have determined how to mitigate this when it happens, but a permanent solution is still being investigated.

  • 15th April 2021

    SLURM Scheduler: Scheduler Thrashing

    Starting around noon on April 15th the scheduler has been periodically overwhelmed with traffic. We believe this is happening because jobs are exiting immediately after launch, which causes the scheduler to get stuck in a loop of scheduling, known as thrashing. We have reduced the scheduler's throughput in order to slow the churn. This means jobs will not go through as quickly as normal.

    As a reminder, please ensure jobs last for more than 10 minutes and do not exit immediately. Also be sure not to poll the scheduler incessantly for data about jobs. If you must poll for information, poll at most once a minute (preferably once every 10 minutes); a short polling sketch follows this incident's updates. The more queries the scheduler has to handle, the less time it has to schedule work and the more traffic it has to drop.

  • This issue has been resolved. Thanks for your patience.

  • We now have a better idea of this issue's behavior profile. It seems that every other hour, for an hour, the scheduler gets overwhelmed with traffic and becomes slow; after that hour has passed it returns to normal. We have opened a ticket with SchedMD: https://bugs.schedmd.com/show_bug.cgi?id=11397

    The scheduler will be in and out every other hour until this is fixed. Please be patient with it. No jobs should be lost, though the scheduler may be slow to respond during high-load periods.
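
    The polling guidance above can be followed from a script rather than a tight loop. Below is a minimal sketch, assuming Python 3 with squeue available on the PATH; the job ID and interval are illustrative placeholders, not values from this incident.

        import subprocess
        import time

        JOB_ID = "12345678"   # hypothetical job ID; substitute your own
        POLL_INTERVAL = 60    # seconds; 600 (every 10 minutes) is gentler still

        while True:
            # Ask only for this job's state, suppressing the header line.
            result = subprocess.run(
                ["squeue", "-h", "-j", JOB_ID, "-o", "%T"],
                capture_output=True, text=True,
            )
            state = result.stdout.strip()
            if not state:
                # The job has left the queue (completed, failed, or cancelled).
                print(f"Job {JOB_ID} is no longer queued or running.")
                break
            print(f"Job {JOB_ID} is {state}.")
            time.sleep(POLL_INTERVAL)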

  • 11th April 2021

    Virtual Infrastructure - Boston: CEPH/VM performance issues

    We are currently investigating a performance issue which affects our CEPH infrastructure. This will impact some fileshares (the scope is TBD) as well as many/most VMs.

  • Storage VMs and CEPH shares back online.

  • VMs (including VDI, RC ticket system, etc.) are operational.

    Storage VMs and CEPH storage coming back online now.

  • UPDATE: The following services are back in operation - VDI, two-factor.rc.fas.harvard.edu, and the FASRC ticket system.

  • CEPH storage nodes and some VMs are being paused to ensure the integrity of the cluster while we investigate.

    If your filesystem stops responding or complains of a mount issue, it will re-mount once the storage is back in service.

    Please note: This affects VDI, two-factor.rc.fas.harvard.edu, and the FASRC ticket system.

  • 31st March 2021

    HolyLFS (Tier 0): HOLYLFS - reboot required

    HOLYLFS is hung and needs to be rebooted in order to bring it back to normal. ETA 30 minutes.

  • HOLYLFS is back up and operational.

  • 29th March 2021

    Samba Cluster: Samba cluster issues

    The Samba cluster continues to experience performance issues. As of 4:30 PM on March 29th the cluster is in a major outage state. We are investigating. We hope to return the cluster to operation soon, but expect the performance issues to continue until the root cause is identified.

  • Cluster is now running on multiple nodes again. Thank you for your patience.

  • The samba cluster is back in service.

    Please do bear in mind that there is still an ongoing performance issue. Due to a recently introduced bug, the cluster is relying on one primary node, so performance for connected shares may be impacted.

  • Affected shares: andersonfs arlottafs bellonofs bertoldifs capassofs capellinifs1 chettylabfs dasch-smb denicfs dobbiefs dulacfs fischerfs friedmanfs girguisfs gradfs guenettefs hausmannfs haysfs hbsdoctoral hoekstrafs huhfs illumina jessicacohenfs jschwartzfs1 kahnefs keutschfs kominersfs kramerfs lichtmanfs liranhbsfs maggiorifs1 masonfs1 mczfs miratrixfs murrayfs1 murraylab nagaifs nallamalafs nmrfs olveczky pallaisfs1 pkelleyfs1 qbrc reecefs rjwoodfs rubinfs rushfs schnappfs seeberfs sequencing srivastavafs sunfs surbeckfs voolfs wagersfs1 whipplefs whitedfs yacobyfs yaufs yuanfs yufs1 zhuangfs

  • Cannon Cluster Compute: MGHPCC utility power loss - Affects compute, but not storage and other UPS-supplied resources

    MGHPCC (Holyoke) has lost a leg of utility power. This affects the rows where the Cannon compute nodes reside. It does not affect UPS-supplied resources like storage, VMs, etc.

    We are awaiting further information and power restoration.

    Additional Info: Compute nodes are not on UPS by design as part of the green computing initiative at MGHPCC. Putting compute on UPS requires much more power and investment. We will move to a new substation at MGHPCC in the coming months and, hopefully, this will mitigate the utility power issue.

  • Returning status to Operational.

  • The cluster is back in operation.

    A few other stand-alone systems are being addressed individually (including, but not limited to, SEAS DGX).

  • ETA for return to normal is by 3 PM. The email sent to the cluster users list follows:

    Shortly before noon today there was a utility power interruption at MGHPCC/Holyoke due to high winds, which shut down power to the Cannon cluster compute nodes. All running jobs will have died and will therefore be requeued. When the cluster is back to normal these jobs will start again.

    This event did not affect storage, VMs, or other services hosted at Holyoke which are on UPS power; only the cluster compute nodes were affected. Part of the green strategy for MGHPCC is that systems like compute nodes are not on redundant power.

    We are working to return the cluster to normal operation, which requires some rolling reboots and resets. Lab-owned nodes which do not have management interfaces (e.g., SeasDGX and a couple of others) will have to be addressed in person, so they may take longer.

    Best guess ETA for return to normal is by 3 PM. Check our status page for incident updates: https://status.rc.fas.harvard.edu

  • 5th March 2021

    Samba Cluster: Samba unavailable

    The Samba cluster is down as of 11:30 AM. This is affecting multiple users and shares that utilize mounted storage from different types of hosts. FASRC engineers are working towards a solution.

    In the meantime, please use FileZilla or another SFTP client: https://docs.rc.fas.harvard.edu/kb/sftp-file-transfer/. A scripted SFTP sketch follows this incident's updates.

  • The Samba cluster is back online. Only one of the four Samba nodes is online, so performance may be degraded, but all functionality is available.
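
    For scripted transfers while Samba is unavailable, SFTP can also be driven from code. The sketch below is illustrative only, assuming Python 3 with the third-party Paramiko library; the hostname, username, and paths are placeholders, and FASRC logins use two-factor authentication, so an interactive client such as FileZilla may be simpler in practice.

        import paramiko

        HOST = "login.rc.fas.harvard.edu"   # assumed SFTP endpoint; see the docs link above
        USER = "your_fasrc_username"        # placeholder username

        # Open an SSH connection using your local keys/agent, then start SFTP.
        client = paramiko.SSHClient()
        client.load_system_host_keys()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(HOST, username=USER)

        sftp = client.open_sftp()
        sftp.get("remote/results.csv", "results.csv")   # download (placeholder paths)
        sftp.put("input.dat", "remote/input.dat")       # upload (placeholder paths)
        sftp.close()
        client.close()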

  • For issues not shown here, please contact FASRC via
    https://portal.rc.fas.harvard.edu or email rchelp@rc.fas.harvard.edu