FAS Research Computing - A100 GPU issue - Incident details

Experiencing minor performance degradation

Status page for the Harvard FAS Research Computing cluster and other resources.

Cluster Utilization (VPN and FASRC login required): Cannon | FASSE


Please scroll down to see details on any Incidents or maintenance notices.
Monthly maintenance occurs on the first Monday of the month (except holidays).

GETTING HELP
Documentation: https://docs.rc.fas.harvard.edu | Account Portal: https://portal.rc.fas.harvard.edu
Email: rchelp@rc.fas.harvard.edu | Support Hours



A100 GPU issue

Resolved
Operational
Started about 3 years ago
Lasted about 1 month

Affected

Cannon Cluster

Operational from 3:00 PM to 5:51 PM

GPU nodes (Holyoke)

Operational from 3:00 PM to 5:51 PM

Updates
  • Resolved

    Firmware updates have resolved this issue.

  • Update

    The prescribed driver update did not fix this issue.

    We are working with NVIDIA to find a fix. As a stopgap, nodes that get stuck or flagged will be marked in Slurm for reboot and rebooted once they are empty of jobs.

    If an A100 GPU host that your lab owns is stuck in a bad state, please let us know and we will mark it for reboot.

  • Identified

    A100s are open but still experiencing some instability. Infrequently, you may hit the issue we noted earlier.

    We are continuing to work on a solution.

  • Investigating

    An NVIDIA bug may be causing failures in A100 GPUs. Symptoms: the nvidia-smi command is slow, or it outputs "ERR!" or "No GPUs are found".

    Jobs on affected nodes may fail.

    We are investigating this issue.
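
The symptoms listed in the Investigating update (a slow nvidia-smi, or output containing "ERR!" or "No GPUs are found") can be checked from inside a job before starting real work. The following is a minimal sketch, not an FASRC-provided tool: the function names `classify` and `check_gpu` are hypothetical, and the 10-second "slow" threshold and 60-second timeout are assumptions, since the report does not define how slow is too slow.

```python
import shutil
import subprocess
import time

# Failure strings quoted in the incident report.
ERROR_MARKERS = ("ERR!", "No GPUs are found")

# Assumption: the report only says "slow"; 10 seconds is an arbitrary cutoff.
SLOW_SECONDS = 10.0


def classify(output, elapsed, slow_seconds=SLOW_SECONDS):
    """Return 'error', 'slow', or 'ok' for a single nvidia-smi run."""
    if any(marker in output for marker in ERROR_MARKERS):
        return "error"
    if elapsed >= slow_seconds:
        return "slow"
    return "ok"


def check_gpu(timeout=60.0):
    """Run nvidia-smi once and classify its output and run time."""
    if shutil.which("nvidia-smi") is None:
        return "nvidia-smi not found"
    start = time.monotonic()
    try:
        proc = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # The command hung past the timeout; treat it as the "slow" symptom.
        return "slow"
    elapsed = time.monotonic() - start
    # Check both streams, since the error text may appear on either one.
    return classify(proc.stdout + proc.stderr, elapsed)
```

Calling `check_gpu()` at the start of a job on a GPU node returns one of the strings above. A result other than "ok" suggests the node may be in the bad state described in this incident and is worth reporting to rchelp@rc.fas.harvard.edu.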