UW-Madison Machine Room Cooling Loss
Incident Report for OSG Consortium
Resolved
This incident has been resolved.
Posted Nov 20, 2021 - 03:45 UTC
Monitoring
Cooling returned to the data center at approximately 6:00PM Central. Over the last two hours, we've been inspecting the cluster worker nodes and restarting the infrastructure. As of 8:30PM Central, services are beginning to be restored. We expect the last services to be restored within the next 15 minutes.

The facilities team has notified us that there will be another short follow-up cooling outage, lasting up to 2 hours, at 7:00AM Central on Monday to finalize the maintenance performed today.
Posted Nov 20, 2021 - 03:01 UTC
Identified
We have restored the yum repo mirror list, which will allow yum installations and updates to succeed.
Posted Nov 19, 2021 - 18:49 UTC
Investigating
During maintenance on the cooling systems in the UW-Madison machine room, the temporary cooling system has failed to provide the expected capacity. This has caused several hosts to shut down automatically due to temperature alarms and has led to a number of unplanned service outages.

We are investigating whether anything can be brought back up safely; if not, the maintenance is expected to finish at 5PM Central today.
Posted Nov 19, 2021 - 14:23 UTC
This incident affected: Software Repositories (Yum Repos), Websites (Display, Topology), Hosted GlideinWMS (IGWN GWMS Frontend, JLAB GWMS Frontend, GLUEX GWMS Frontend, UCSD CMS GWMS Frontend, UCSD CMS VO Collector), and Hosted CEs (Hosted CE Infrastructure).