Notification History

Operational

September 2023

Resolved
September 29, 2023 at 16:53
The Ceph instability has been resolved. Ceph Tier2 shares, VDI, and VMs should be back to their normal state.

If your VM, /net/fs-[labname] share, or VDI session is still impacted, please contact rchelp@rc.fas.harvard.edu
Identified
September 29, 2023 at 15:16
The infrastructure behind Tier2 Ceph shares and VMs is unstable.
This also affects VDI/OOD, which relies on virtual machines.

/net/fs-[labname] shares, new OOD/VDI sessions, and VMs are affected and may be inaccessible until this is resolved.

Thanks for your patience.

Resolved
September 25, 2023 at 11:36
The Ceph instability has been resolved. Ceph Tier2 shares, VDI, and VMs should be back to their normal state.

If your VM, /net/fs-[labname] share, or VDI session is still impacted, please contact rchelp@rc.fas.harvard.edu
Identified
September 25, 2023 at 5:19
The infrastructure behind Tier2 Ceph shares and VMs is unstable.
This also affects VDI/OOD, which relies on virtual machines.

/net/fs-[labname] shares, new OOD/VDI sessions, and VMs are affected and may be inaccessible until this is resolved.

Thanks for your patience.

Resolved
September 25, 2023 at 0:53
The Ceph instability has been resolved. Ceph Tier2 shares, VDI, and VMs should be back to their normal state.

If your VM, /net/fs-[labname] share, or VDI session is still impacted, please contact rchelp@rc.fas.harvard.edu
Identified
September 24, 2023 at 20:49
The infrastructure behind Tier2 Ceph shares and VMs is unstable.
This also affects VDI/OOD, which relies on virtual machines.

/net/fs-[labname] shares, new OOD/VDI sessions, and VMs are affected and may be inaccessible until this is resolved.

Thanks for your patience.

Resolved
September 23, 2023 at 13:12
The Ceph instability has been resolved. Ceph Tier2 shares, VDI, and VMs should be back to their normal state.

If your VM, /net/fs-[labname] share, or VDI session is still impacted, please contact rchelp@rc.fas.harvard.edu
Identified
September 23, 2023 at 5:31
The infrastructure behind Tier2 Ceph shares and VMs is unstable.
This also affects VDI/OOD, which relies on virtual machines.

/net/fs-[labname] shares, new OOD/VDI sessions, and VMs are affected and may be inaccessible until this is resolved.

Thanks for your patience.

Resolved
September 22, 2023 at 14:31
The Ceph instability has been resolved. Ceph Tier2 shares, VDI, and VMs should be back to their normal state.

If your VM, /net/fs-[labname] share, or VDI session is still impacted, please contact rchelp@rc.fas.harvard.edu
Identified
September 22, 2023 at 12:28
The infrastructure behind Tier2 Ceph shares and VMs is unstable.
This also affects VDI/OOD, which relies on virtual machines.

/net/fs-[labname] shares, new OOD/VDI sessions, and VMs are affected and may be inaccessible until this is resolved.

Thanks for your patience.

August 2023

Resolved
September 01, 2023 at 16:02
We have reverted the routing systems involved in causing this issue to their previous version/state. We will continue to investigate why this issue occurred and what will allow us to upgrade these systems at a later date.
Investigating
September 01, 2023 at 3:56
The issue with holylabs (and potentially other Lustre filesystems) has recurred. This may affect jobs and any process using holylabs and potentially other Lustre filesystems.

No ETA at this time.
Resolved
September 01, 2023 at 0:59
We have restored access to holylabs, and the cluster and jobs are no longer paused.

We have identified a root cause, which we will be working to remediate to prevent this issue in the future.
Identified
August 31, 2023 at 23:30
The scheduler and all jobs have been paused in order to reduce the load on holylabs.

We are continuing to work on a fix for this incident.
Investigating
August 31, 2023 at 18:00
The holylabs filesystem is currently down due to high load.

OOD, software, and modules are all functional, but if your workflow uses holylabs for storage, scripts, or jobs, it may hang or fail. Our engineers are investigating this issue further.

Resolved
August 26, 2023 at 13:59
The Ceph instability has been resolved. Ceph Tier2 shares, VDI, and VMs should be back to their normal state.

If your VM, /net/fs-[labname] share, or VDI session is still impacted, please contact rchelp@rc.fas.harvard.edu
Identified
August 26, 2023 at 2:02
The infrastructure behind Tier2 Ceph shares and VMs is unstable.
This also affects VDI/OOD, which relies on virtual machines.

/net/fs-[labname] shares, new OOD/VDI sessions, and VMs are affected and may be inaccessible until this is resolved.

Thanks for your patience.

Resolved
August 14, 2023 at 14:56
This incident has been resolved.
Investigating
August 11, 2023 at 15:48
holyscratch01 is currently in a degraded state. A group of improperly architected jobs is hammering the filesystem, which is impeding access for other users. We are in the process of identifying and stopping these jobs.

Until then, we recommend starting an interactive session and working from there as those will have the lowest impact. However, performance will still be slow until we are able to stop the problematic jobs.

FASRC Monthly maintenance August 7, 2023 7am-1pm *NOTE EXTENDED TIME*

Completed
August 07, 2023 at 13:42
Due to a vendor error, we were unable to complete the holyscratch01 disk shelf replacement. We will work with the vendor to reschedule.

All other maintenance tasks have completed.
In progress
August 07, 2023 at 11:00
Maintenance is now in progress.
Not yet started
August 07, 2023 at 11:00
August maintenance will run August 7, 2023 from 7am-1pm.

Please note the extended timeframe.
See tasks section below for explanation.

NOTICES
- CentOS 7 Support EOL: We will be dropping support for CentOS 7 in September. If your machine or VM runs CentOS 7 and connects with Slurm, please contact FASRC to discuss options.
- Test Partition Changes: We are changing test partitions based on changing needs and increasing the maximum time to 12 hours from 8 hours. A reminder that this partition is not for running jobs.

MAINTENANCE TASKS
- holyscratch01 Disk Shelf Replacement (All Jobs Will Be Paused)
  -- Audience: All cluster and scratch users - Cannon and FASSE
  -- Impact: Hardware issues with holyscratch01 necessitate the replacement of one of the disk shelves. As a result all jobs and scratch will need to be paused for the duration.
  -- ETA: This swap is expected to take 3-4 hours, but pausing the cluster, vendor interactions, and allowing a margin for over-run requires that we extend maintenance by 2 hours (7am-1pm)
- Login node and OOD/VDI reboots
  -- Audience: Anyone logged into a login node or VDI/OOD node
  -- Impact: Login and VDI/OOD nodes will be rebooted during this maintenance window
- Scratch cleanup ( https://docs.rc.fas.harvard.edu/kb/policy-scratch/ )
  -- Audience: Cluster users
  -- Impact: Files older than 90 days will be removed.
Thanks,
FAS Research Computing
Department and Service Catalog: https://www.rc.fas.harvard.edu/
Documentation: https://docs.rc.fas.harvard.edu/
Status Page: https://status.rc.fas.harvard.edu/
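
For the scratch cleanup task in the notice above, users may want to see in advance which of their files fall outside the 90-day window. The following is a minimal sketch, not FASRC's actual cleanup tool: the function name files_older_than is hypothetical, and it assumes age is measured by modification time, which the notice does not state. It only lists files and never deletes anything.

```python
import os
import time
from pathlib import Path

MAX_AGE_DAYS = 90  # retention window stated in the maintenance notice


def files_older_than(root, max_age_days=MAX_AGE_DAYS, now=None):
    """Yield files under root whose modification time is older than the window.

    Assumption: age is judged by mtime. The real cleanup policy may use a
    different timestamp; check the scratch policy page linked in the notice.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                if path.lstat().st_mtime < cutoff:
                    yield path
            except OSError:
                continue  # file vanished or is unreadable; skip it
```

Calling list(files_older_than("/path/to/your/scratch/dir")) with your own scratch directory would return the candidate files. The path here is a placeholder.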

Resolved
August 06, 2023 at 15:05
The Ceph instability has been resolved. Ceph Tier2 shares, VDI, and VMs should be back to their normal state.

If your VM, /net/fs-[labname] share, or VDI session is still impacted, please contact rchelp@rc.fas.harvard.edu
Identified
August 04, 2023 at 14:56
The infrastructure behind Tier2 Ceph shares and VMs is unstable.
This also affects VDI/OOD, which relies on virtual machines.

/net/fs-[labname] shares, new OOD/VDI sessions, and VMs are affected and may be inaccessible until this is resolved.

Thanks for your patience.

July 2023

Resolved
July 25, 2023 at 16:08
The affected partitions are back up within acceptable operating percentages.
Identified
July 25, 2023 at 14:51
Numerous nodes are in a bad state (marked 'drng', but not those marked 'drain'). We are aware and are working to re-open these nodes as quickly as possible.

This includes but is not limited to nodes in: argelles-delgado, kozinsky, seas_compute, kempner, and other partitions.

Resolved
July 20, 2023 at 19:48
Modules should be available again across Cannon and FASSE. If you have an open session, please log out and log in again to use modules.

This incident has been resolved.
Investigating
July 20, 2023 at 19:12
An unexpected bug has broken software module availability at this time. Module commands such as module load, module spider, etc., may not work.

We are currently investigating this incident.

Resolved
July 18, 2023 at 15:35
This incident has been resolved.
Update
July 17, 2023 at 16:59
We have brought the server back online in a "workaround" state and will continue to investigate the root cause. Access should be normal.
Investigating
July 17, 2023 at 15:45
Lab storage on the holylabs file server is currently inaccessible. We are investigating this incident.

Resolved
July 13, 2023 at 17:01
This incident has been resolved.
Investigating
July 13, 2023 at 16:48
boslfs02 is currently inaccessible. We are investigating this incident.

Resolved
August 17, 2023 at 17:51
Firmware updates have resolved this issue.
Update
July 18, 2023 at 19:05
The prescribed driver update did not fix this issue.

We are working with Nvidia to find a fix. As a stop-gap, nodes which get stuck/flagged will be marked in Slurm for reboot and rebooted once empty of jobs.

If an A100 GPU host your lab owns is stuck in a bad state, please let us know and we will mark and reboot it.
Identified
July 12, 2023 at 14:44
A100s are open but still experiencing some instability. Infrequently, you may hit the issue we noted earlier.

We are continuing to work on a solution.
Investigating
July 07, 2023 at 15:00
An NVIDIA bug may be causing failures in A100 GPUs. The nvidia-smi command is slow, or outputs "ERR!" or "No GPUs are found".

Jobs on affected nodes may fail.

We are investigating this issue.
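
The symptoms described in this notice can be checked for from inside a job or interactive session. The following is a rough sketch, not an official FASRC diagnostic: the function names and the 30-second threshold for "slow" are hypothetical, and the symptom strings are copied from the notice, so the exact wording nvidia-smi prints may differ.

```python
import subprocess

# Symptom strings quoted from the notice (assumption: the real
# nvidia-smi wording may differ slightly).
SYMPTOMS = ("ERR!", "No GPUs are found")


def classify_smi_output(text: str) -> str:
    """Return 'suspect' if the output contains a known symptom, else 'ok'."""
    return "suspect" if any(s in text for s in SYMPTOMS) else "ok"


def check_gpu(timeout_s: float = 30.0) -> str:
    """Run nvidia-smi once and classify the result.

    A timeout is treated as the 'slow' symptom. The 30-second default
    is an arbitrary illustrative threshold.
    """
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return "slow"
    except FileNotFoundError:
        return "nvidia-smi not found"
    return classify_smi_output(result.stdout + result.stderr)
```

If check_gpu() returns "slow" or "suspect" on an A100 node, the node may be affected by the issue described here.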

July 2023 to September 2023

FAS Research Computing - Notification History