UPDATE 9:50PM: Replacement network modules have been installed and nodes have been resumed. Jobs are being scheduled and we are confident this issue is resolved. Thanks for your patience and understanding.
UPDATE 6PM: This issue has been narrowed down to an InfiniBand fiber switch. Nodes connected or adjacent to this switch are affected. Replacement parts are en route. Please note that the rest of the cluster is working; only certain nodes are unavailable, including quite a few GPU nodes.
We recently saw a number of compute nodes become unreachable by the scheduler, and the jobs running on them died.
We are investigating the cause and will post an update. If this has affected your jobs, check whether they have been re-queued and are pending.
If they are not, you will need to re-submit them, but it would be prudent to wait until we update this message before doing so.