Timeline (all times central):
The underlying hosting provider, Anvil at the University of Nebraska, had a disk fail in its Ceph storage. The disk flipped rapidly between available and unavailable, which caused IO timeouts for GRACC. After Ceph marked the drive as bad, it aggressively recovered the data, which can cause further IO timeouts. Elasticsearch is sensitive to IO timeouts; they caused the cluster to reset and then scan all of the existing data to check for issues.
GRACC recovered to 100% without operator intervention; the operator monitored progress and updated the status page throughout. No data was lost.
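
For readers who want to follow a recovery like this one, the sketch below shows one way to watch progress through Elasticsearch's `_cluster/health` endpoint. It is an illustration only, not the tooling the GRACC operator used: the function names and the example base URL are assumptions, and the script assumes an unauthenticated HTTP endpoint. The fields it reads (`status`, `active_shards_percent_as_number`, `initializing_shards`, `unassigned_shards`) are part of the standard cluster health response.

```python
import json
import urllib.request


def fetch_cluster_health(base_url):
    """Fetch the cluster health document from an Elasticsearch node.

    base_url is a hypothetical address such as "http://localhost:9200".
    """
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=10) as resp:
        return json.load(resp)


def summarize_health(health):
    """Turn a cluster health document into a one-line progress summary."""
    pending = health["initializing_shards"] + health["unassigned_shards"]
    return (
        f"status={health['status']} "
        f"active_shards={health['active_shards_percent_as_number']:.1f}% "
        f"pending_shards={pending}"
    )


def is_recovered(health):
    """Treat the cluster as recovered once it is green with no pending shards."""
    return (
        health["status"] == "green"
        and health["initializing_shards"] == 0
        and health["unassigned_shards"] == 0
    )
```

Calling `summarize_health(fetch_cluster_health(base_url))` at intervals gives a line that could be copied into a status page update, and `is_recovered` gives a simple stopping condition for the polling loop.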