A100 GPU issue

Status page for the Harvard FAS Research Computing cluster and other resources.

Cluster Utilization (VPN and FASRC login required): Cannon | FASSE

Please scroll down to see details on any Incidents or maintenance notices.
Monthly maintenance occurs on the first Monday of the month (except holidays).

GETTING HELP
Documentation: https://docs.rc.fas.harvard.edu | Account Portal https://portal.rc.fas.harvard.edu
Email: rchelp@rc.fas.harvard.edu | Support Hours


Resolved

Operational

Started: July 7, 2023 at 15:00 UTC
Resolved: August 17, 2023 at 17:51 UTC
Duration: about 1 month

Affected

Cannon Cluster

GPU nodes (Holyoke)

Updates

Resolved
August 17, 2023 at 17:51 UTC
Firmware updates have resolved this issue.
Update
July 18, 2023 at 19:05 UTC
The prescribed driver update did not fix this issue.

We are working with NVIDIA to find a fix. As a stopgap, nodes that get stuck or flagged will be marked in Slurm for reboot and rebooted once they are empty of jobs.

If an A100 GPU host that your lab owns is stuck in a bad state, please let us know and we will mark and reboot it.
Identified
July 12, 2023 at 14:44 UTC
A100s are open but still experiencing some instability. Infrequently, you may hit the issue we noted earlier.

We are continuing to work on a solution.
Detected
July 7, 2023 at 15:00 UTC
An NVIDIA bug may be causing failures in A100 GPUs. On affected nodes, the nvidia-smi command may be slow, or may output "ERR!" or "No GPUs are found".

Jobs on affected nodes may fail.

We are investigating this issue.
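
For users who want to check whether a node is showing the symptoms described in this notice, the following is a minimal, unofficial sketch. It runs nvidia-smi once and looks for the two strings quoted above. The function names, the 10-second "slow" threshold, and the 60-second timeout are assumptions made for illustration, not values provided by FASRC or NVIDIA.

```python
import subprocess
import time

# Symptom strings as quoted in this incident notice.
ERROR_MARKERS = ("ERR!", "No GPUs are found")


def classify_nvidia_smi(output, elapsed_seconds, slow_after=10.0):
    """Return the symptom labels matched by one nvidia-smi run.

    slow_after is an assumed threshold in seconds; adjust as needed.
    """
    symptoms = []
    if elapsed_seconds > slow_after:
        symptoms.append("slow")
    for marker in ERROR_MARKERS:
        if marker in output:
            symptoms.append(marker)
    return symptoms


def check_gpu(timeout=60.0):
    """Run nvidia-smi and report symptoms; timeout is an assumed limit."""
    start = time.monotonic()
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return ["timeout"]
    except FileNotFoundError:
        return ["nvidia-smi not found"]
    elapsed = time.monotonic() - start
    # Check both streams, since error text may go to either one.
    return classify_nvidia_smi(result.stdout + result.stderr, elapsed)


if __name__ == "__main__":
    print(check_gpu() or "no symptoms matched")
```

An empty result only means that none of the listed symptoms matched on that run; since the issue was reported as intermittent, it does not prove the node is healthy.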
