GRACC Degradation

Incident Report for OSG Consortium

Postmortem

Timeline (all times central):

9:35 AM - OSG’s alerting detected that GRACC was failing to respond to requests and notified operators.
10:12 AM - The OSG operator began debugging the GRACC outage and posted a downtime notice on the status page. At this time, GRACC was only returning partial results. The operator disabled access to GRACC’s frontend and replaced it with the GRACC Downtime page.
10:30 AM - GRACC database reached the “yellow” state of recovery, in which it returns full results. The GRACC Downtime page was removed.
11:40 AM - GRACC database reached “green” state, 100% recovered and functioning normally.

The underlying hosting provider, Anvil at the University of Nebraska, had a disk fail in its CEPH storage. The disk flipped rapidly between available and unavailable, which caused IO timeouts for GRACC. After CEPH marked the hard drive as bad, it aggressively recovered the data, which can cause further IO timeouts. Elasticsearch is sensitive to IO timeouts; they caused the cluster to reset and then scan all of the existing data to check for issues.

GRACC recovered to 100% without operator intervention, though the operator was monitoring progress and updating the status page. No data was lost.
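
The kind of recovery monitoring described above can be sketched as a small polling loop against Elasticsearch's cluster health endpoint (`GET /_cluster/health`), which reports a `status` of "red", "yellow", or "green". This is only an illustrative sketch, not the tooling the operator actually used: the function names, the polling interval, and the example URL are assumptions made for this example.

```python
import json
import time
from urllib.request import urlopen

# Rank the Elasticsearch health states so they can be compared.
# red: some primary shards unavailable, so results may be partial
# yellow: all primary shards available, replicas still recovering
# green: all primary and replica shards available
STATUS_RANK = {"red": 0, "yellow": 1, "green": 2}


def fetch_health(base_url):
    """Fetch and parse the cluster health document from Elasticsearch."""
    with urlopen(f"{base_url}/_cluster/health", timeout=10) as resp:
        return json.load(resp)


def wait_for_status(fetch, target="green", interval=30.0, max_polls=120,
                    sleep=time.sleep):
    """Poll fetch() until the cluster is at least as healthy as target.

    fetch is any zero-argument callable returning a dict with a "status"
    key. Returns the status that satisfied the target, or None if the
    cluster did not reach it within max_polls attempts.
    """
    for _ in range(max_polls):
        status = fetch()["status"]
        if STATUS_RANK[status] >= STATUS_RANK[target]:
            return status
        sleep(interval)
    return None


# Example usage (hypothetical URL, shown as a comment so nothing runs on import):
#   wait_for_status(lambda: fetch_health("http://localhost:9200"), target="yellow")
```

Waiting for "yellow" corresponds to the point at which the downtime page could be removed, and waiting for "green" corresponds to full recovery.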

Posted Jan 26, 2021 - 17:59 UTC

Resolved

GRACC has fully recovered and is in "green" state. Please report any issues you observe to support@opensciencegrid.org

Posted Jan 26, 2021 - 17:41 UTC

Monitoring

The GRACC backend database has reached "yellow" state. In "yellow", the database will return complete results, but the data is not fully replicated and therefore queries will have reduced performance. New data will be ingested and displayed.

We are continuing to monitor the recovery. The GRACC downtime page has been removed, but expect the frontend to be slower for approximately another hour.

Posted Jan 26, 2021 - 16:35 UTC

Identified

GRACC's backend data servers had a momentary network outage that caused the database to reset. It is currently verifying all data stored in the database and recovering. We turned on the downtime webpage on GRACC to prevent returning partial results while the cluster is recovering.

No data loss is expected.

Posted Jan 26, 2021 - 16:29 UTC

Investigating

Alerts for the OSG Accounting Service (GRACC) are showing that the backend data is in an unhealthy state. We are investigating the issue.

Posted Jan 26, 2021 - 16:15 UTC

This incident affected: Accounting (GRACC Frontend, GRACC Backend).