A100 GPU issue

Updates

Resolved
August 17, 2023 at 5:51 PM UTC
Firmware updates have resolved this issue.
Update
July 18, 2023 at 7:05 PM UTC
The prescribed driver update did not fix this issue.

We are working with NVIDIA to find a fix. As a stop-gap, nodes that get stuck or flagged will be marked in Slurm for reboot and rebooted once they are empty of jobs.

If an A100 GPU host your lab owns is stuck in a bad state, please let us know and we will mark and reboot it.
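
For administrators of similar clusters, the stop-gap above could be sketched roughly as follows. This is an illustration only, not the procedure actually used for this incident: it assumes Slurm's `scontrol reboot` subcommand with the `ASAP` option (which drains a node so it reboots once its running jobs finish), a `RebootProgram` configured in `slurm.conf`, and administrator privileges. The helper names and the node names in the usage note are hypothetical.

```python
import subprocess


def build_reboot_command(nodes, reason):
    """Build an scontrol command that reboots nodes once their jobs finish.

    Assumption: ASAP drains each node so no new jobs start, and
    nextstate=RESUME returns it to service after the reboot.
    """
    if not nodes:
        raise ValueError("at least one node name is required")
    return [
        "scontrol",
        "reboot",
        "ASAP",
        "nextstate=RESUME",
        f"reason={reason}",
        ",".join(nodes),
    ]


def mark_for_reboot(nodes, reason):
    """Run the command; needs Slurm admin rights and a configured RebootProgram."""
    subprocess.run(build_reboot_command(nodes, reason), check=True)
```

For example, `build_reboot_command(["gpu01", "gpu02"], "A100 stuck")` produces the argument list for a single `scontrol` call covering both (hypothetical) nodes.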
Identified
July 12, 2023 at 2:44 PM UTC
A100 nodes are open for jobs but are still experiencing some instability. You may occasionally hit the issue we noted earlier.

We are continuing to work on a solution.
Investigating
July 07, 2023 at 3:00 PM UTC
An NVIDIA bug may be causing failures in A100 GPUs. On affected nodes, the nvidia-smi command is slow or outputs "ERR!" or "No GPUs are found".

Jobs on affected nodes may fail.

We are investigating this issue.
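
If you want to check whether a node shows the symptoms described above before starting work, a minimal sketch along these lines may help. The marker strings are copied from this notice and may not match the exact wording of your driver version; the 10-second "slow" threshold, the 60-second timeout, and the function names are assumptions chosen for illustration.

```python
import subprocess
import time

# Marker strings taken from the symptoms listed in this notice.
ERROR_MARKERS = ("ERR!", "No GPUs are found")


def classify_nvidia_smi(output, elapsed_seconds, slow_after=10.0):
    """Classify one nvidia-smi run as 'error', 'slow', or 'ok'."""
    if any(marker in output for marker in ERROR_MARKERS):
        return "error"
    if elapsed_seconds > slow_after:  # threshold is an assumption
        return "slow"
    return "ok"


def check_node(timeout=60.0):
    """Run nvidia-smi once on the current host and classify the result."""
    start = time.monotonic()
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return "slow"  # the command did not finish within the timeout
    except FileNotFoundError:
        return "missing"  # nvidia-smi is not installed on this host
    elapsed = time.monotonic() - start
    return classify_nvidia_smi(result.stdout + result.stderr, elapsed)
```

Calling `check_node()` at the start of a job script and exiting early on anything other than `"ok"` is one way to avoid wasting allocation time on an affected node.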

FAS Research Computing - A100 GPU issue – Incident details