21st May 2018

MGHPCC data center annual power downtime

We are currently in the midst of the annual power maintenance at our main data center, so most of our systems are powered down.

Please refer to this link for more information: https://www.rc.fas.harvard.edu/mghpcc-shutdown-2018

The cluster should be back online by 8pm on Wednesday, May 23rd.

UPDATE 5/23 11:30AM: Everything is proceeding as planned. Please be aware that some services may appear to be up at times, but do not resume normal use until we send the all-clear email and post the all-clear status here. Many services have dependencies that must be met before they are back to normal operation.

UPDATE 5/23 5:30PM: Final preparations for re-opening the scheduler are underway. All major operations and updates have been successful, and final cleanup, startup, and triage are in progress. A final reboot of all compute and login nodes will occur before the scheduler comes back up. We foresee no delay in returning to normal operation by or before 8PM. A final update will go out here and via email once the all-clear is given by all involved staff.

UPDATE 5/23 8:00PM: The scheduler has re-opened, and any lingering nodes still catching up will re-join their respective queues. An email will go out in a moment with the following:


Odyssey3, the newly upgraded CentOS 7 cluster, is now live. This includes all login and compute nodes (except NCF and ATLAS).

One issue that will affect almost all cluster users is the change in SSH key fingerprints. You will need to clear the old entries from your local SSH known_hosts file before you connect to Odyssey3 over SSH. Please see: https://www.rc.fas.harvard.edu/resources/faq/ssh-key-error/
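
If you just want the quick fix, the commands below are a minimal sketch for a typical Linux or macOS client; the hostname is only an example, so substitute the login host you actually use (the FAQ above has the full, supported instructions):

    # Remove the cached host key for the login host you connect to
    # (hostname below is illustrative -- use your actual login host)
    ssh-keygen -R odyssey.rc.fas.harvard.edu

    # Or start fresh by moving the whole known_hosts file aside;
    # it will be repopulated as you reconnect to hosts
    mv ~/.ssh/known_hosts ~/.ssh/known_hosts.bak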

We’ve made numerous upgrades, improvements, and additions, so please take a moment to read the following. Also be aware that some compute nodes will inevitably have small issues, such as missing mounts; please let us know right away if you find one.

WHAT YOU NEED TO KNOW ABOUT ODYSSEY3:
* SSH key error fix. Many of you will experience this, and the fix is easy: https://www.rc.fas.harvard.edu/resources/faq/ssh-key-error/
* CentOS 7 transition FAQ: https://www.rc.fas.harvard.edu/resources/faq/centos-7-transition-faq/
* CentOS 7 overview: https://www.rc.fas.harvard.edu/odyssey-3-the-next-generation/
* One exception to the CentOS 7 upgrade is NX/NoMachine, which is nearing end of life. Since it cannot be upgraded, there are some things to know: https://www.rc.fas.harvard.edu/resources/access-and-login/#NX_and_CentOS7
* Additional office hours tomorrow at 38 Oxford, and next week at both 38 Oxford and HSPH: https://www.rc.fas.harvard.edu/training/office-hours/
* Tickets submitted overnight tonight will be addressed tomorrow (Thursday) once staff have had time to rest up
* New service: We now fully support Singularity containers (a quick example is sketched after this list): https://www.rc.fas.harvard.edu/resources/documentation/software/singularity-on-odyssey/
* Self-service TensorFlow via pip install (see the sketch after this list): https://www.rc.fas.harvard.edu/tensorflow-on-odyssey/
* All versions of CUDA are installed globally: https://www.rc.fas.harvard.edu/resources/documentation/gpgpu-computing-on-odyssey/
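
Singularity use can be sketched roughly as below; this is not the official Odyssey recipe, just an illustration that assumes Singularity is on your path on a compute node, with an arbitrary ubuntu image (see the Singularity documentation page above for the supported workflow):

    # Run one command inside a container pulled on the fly from Docker Hub
    # (the ubuntu:16.04 image is only an illustration)
    singularity exec docker://ubuntu:16.04 cat /etc/os-release

    # Or open an interactive shell inside the same container
    singularity shell docker://ubuntu:16.04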
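
For TensorFlow and CUDA, a rough sketch might look like the following; the module names and versions are assumptions for illustration only, not the exact modules on Odyssey (the TensorFlow and GPGPU pages above have the supported recipe):

    # Load a Python and a CUDA module (names/versions here are examples only)
    module load python/3.6.3-fasrc01 cuda/9.0-fasrc01

    # Self-service install of TensorFlow into your home directory
    pip install --user tensorflow-gpu

    # Quick sanity check
    python -c "import tensorflow as tf; print(tf.__version__)"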

WHAT WE DID DURING THIS PERIOD:
* Powered off 12 PB of storage and thousands of compute nodes at MGHPCC and brought them all back up successfully
* Upgraded 2,000 compute nodes to more modern CentOS 7 for stability and security (except NCF and ATLAS)
* Upgraded login nodes to more modern CentOS 7 for stability, security, and environmental homogeneity 
* Upgraded firmware on 2,000 InfiniBand network adapters and the core InfiniBand Director Switch to ensure homogeneity and speed
* Upgraded firmware of network switches to prepare for the future and consolidate on current revisions
* Upgraded MGHPCC network core to 100Gbit and updated firewall for faster transfers, lower latency, and future expansion
* Prepped for new 100Gbit firewall installation in MGHPCC to reduce bottlenecks and increase overall performance
* Made major network routing changes so that the 60 Oxford decommission will not require additional downtime later
* Created a streamlined, detailed downtime plan with help from the HUIT PMO, resulting in more work done in less time

UPCOMING UPDATES IN FUTURE MAINTENANCE WINDOWS:
* 100Gbit firewall at MGHPCC and a cluster of Data Transfer Nodes for faster networking and data transfers
* New web-based login through Open OnDemand. See OSC's intro video for an example: https://youtu.be/DfK7CppI-IU
* Next Regal scratch 90-day retention cleanup will happen June 3rd, 2018
* Next regular monthly maintenance day, including a Regal retention run, is Monday, July 2nd, from 7am to 11am

Thanks,
FAS Research Computing
https://www.rc.fas.harvard.edu
https://status.rc.fas.harvard.edu