2024-04-10 Post Mortem

At approximately 16:36 UTC, Floofy.tech suffered a total outage of our Mastodon instance and related services. The outage was detected at 16:39:38 UTC by our status page, which notified the admins. Recovery took about an hour and involved scaling down our Redis Sentinel-controlled Redis replicas and rebooting our Kubernetes control plane and worker nodes. We fully recovered by 17:36:30 UTC.

Status Updates Posted

Interrupted Service

What happened?

A plan was made to add another worker node to the Kubernetes cluster, and it was executed by cloning an existing worker node. The new node booted immediately and attempted to join the cluster with the same configuration as the original, leaving the control plane confused about which worker node was the original. The new node was quickly shut down, but the control plane was left in an inconsistent state and failed to recognise that some pods were running, including our primary Redis instance. We use Redis Sentinel to watch Redis instances and fail over as needed, but this requires the Sentinels to communicate among themselves to reach a quorum. At this point we aren't totally certain what happened; however, pods were unable to communicate with pods scheduled on other nodes, meaning DNS queries either failed outright or, when they did resolve, the pods could not reach the internal IP that was returned. As a result, our Mastodon pods could not reach the database (we use pgbouncer to balance queries) or Redis. At the same time, because the Redis pods could not communicate with each other, they could not form a quorum and were unable to come back online, resulting in a deadlock.
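To make the quorum point concrete, here is a minimal sketch of how a client discovers the current master through Sentinel. It's written in Python with redis-py rather than whatever Mastodon uses internally, and the service hostname, master name, and port are placeholders rather than our real configuration; the point is that discovery depends on cluster DNS and on reaching the Sentinels, both of which were broken during the incident.

    # Minimal sketch of Sentinel-based master discovery using Python's redis-py.
    # The service hostname, master name, and port are placeholders, not our setup.
    from redis.sentinel import Sentinel, MasterNotFoundError

    sentinel = Sentinel(
        [("redis-sentinel.redis.svc.cluster.local", 26379)],  # resolved via cluster DNS
        socket_timeout=1.0,
    )

    try:
        # Step 1: ask Sentinel which instance is currently the master.
        host, port = sentinel.discover_master("mymaster")
        print(f"master is {host}:{port}")

        # Step 2: connect to the master it named and check it responds.
        master = sentinel.master_for("mymaster", socket_timeout=1.0)
        master.ping()
    except MasterNotFoundError:
        # What a client sees when the Sentinels cannot reach, or agree on, a master.
        print("no master known to Sentinel")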

How was it resolved?

We rebooted each node in turn, including our control plane nodes. Because we rely on Talos, this returned the nodes to a known good state, and they were gradually able to recover on their own. We also scaled down two of our three Redis instances to rule out a split brain and to remove any dependency on dead pods when bringing an instance up. Once we had one instance running and communicating with Mastodon, we brought up the other two, which promptly replicated from the master instance.
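For illustration, the kind of check we ran while bringing the replicas back looks roughly like the sketch below. It's a hedged example with placeholder names, not a transcript of our exact steps: ask Sentinel which master and replicas it currently knows about, then confirm it can reach enough Sentinels to authorise a failover.

    # Rough post-recovery check: does Sentinel see the master and both replicas,
    # and is quorum currently reachable? Names and addresses are placeholders.
    from redis import Redis
    from redis.sentinel import Sentinel

    SENTINEL_ADDR = ("redis-sentinel.redis.svc.cluster.local", 26379)

    sentinel = Sentinel([SENTINEL_ADDR], socket_timeout=1.0)
    print("master:", sentinel.discover_master("mymaster"))
    print("replicas:", sentinel.discover_slaves("mymaster"))

    # SENTINEL CKQUORUM reports whether enough Sentinels are reachable to
    # authorise a failover for the named master.
    sentinel_client = Redis(host=SENTINEL_ADDR[0], port=SENTINEL_ADDR[1], socket_timeout=1.0)
    print(sentinel_client.execute_command("SENTINEL", "CKQUORUM", "mymaster"))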

How will we avoid this?

We will avoid cloning existing nodes to create new ones. We've learned the impact of this the hard way, and it isn't something we want to repeat. We believe the impact stemmed entirely from that clone, and since it's a manual and rare action there isn't much else to take away from it. We also learned a bit about Redis Sentinel recovery: DNS failure can be a major issue and could result in a split brain if each Redis Sentinel instance decides it is the master. If that happens again, we'll scale back down to one replica and scale back up when it's judged safe.
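As a sketch of what "judged safe" could mean in practice, the example below (placeholder addresses and master name, not our real topology) asks each Sentinel independently which address it believes is the master; we would only scale the replicas back up once they all agree.

    # Hedged split-brain check: query every Sentinel directly and compare answers.
    # Addresses and the master name are illustrative only.
    from redis import Redis

    SENTINEL_ADDRS = [("10.0.0.11", 26379), ("10.0.0.12", 26379), ("10.0.0.13", 26379)]

    answers = set()
    for host, port in SENTINEL_ADDRS:
        client = Redis(host=host, port=port, socket_timeout=1.0)
        addr = client.execute_command("SENTINEL", "GET-MASTER-ADDR-BY-NAME", "mymaster")
        answers.add(tuple(addr))

    if len(answers) == 1:
        print(f"all Sentinels agree on master {answers.pop()}")
    else:
        print(f"disagreement (possible split brain): {answers}")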