2024-07-29 Post Mortem

At approximately 13:52 UTC, the Floofy.tech Mastodon instance went offline, returning a 503 error to all requests. Recovery took around 25 minutes, with the instance coming back online at approximately 14:15 UTC, and involved troubleshooting our Redis (now Redict) pods and building a custom Redict image for compatibility with the Bitnami Redis Helm chart.

Status Updates Posted

None

What happened?

A decision was made to update our Redis instances to ensure we were running the latest available version. This went smoothly, and, motivated by that success, a move to Redict was also attempted to see whether compatibility issues with the Redis charts provided by Bitnami had been fully resolved. When the pods restarted with the new container, they were marked as healthy by Kubernetes with a passing health check; however, the container logs showed that the scripts responsible for fully configuring the instance had failed to run, leaving a Redict instance that was technically running but only partially configured. An attempt to roll back was made, but the selected Redict image wrote its "Redis Database" (RDB) snapshot files in a newer schema version. Redis/Redict uses these snapshots to restore a database to a point in time, avoiding long "Append-Only File" (AOF) replays or replica syncing, and the older Redis codebase could not read the newer format correctly, so a rollback was out of the question. Since Mastodon relies heavily on Redis/Redict, this resulted in a total outage for the instance and its runners.
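
For context, a probe that only checks whether the Redis port is accepting connections will report the pod as healthy as soon as the server process starts, regardless of whether the chart's configuration scripts succeeded. The snippet below is a minimal, illustrative sketch of that kind of check, not a copy of our actual manifest:

    # Illustrative only: a probe like this passes once the port accepts
    # connections, even if the container's setup scripts have failed.
    readinessProbe:
      tcpSocket:
        port: 6379        # default Redis/Redict port
      initialDelaySeconds: 5
      periodSeconds: 10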

How was it resolved?

A quick investigation of the Bitnami-provided Redis images revealed some custom scripts that needed to be included in the Redict images for them to be compatible with the Helm chart. These were fetched and built into a custom Redict image, which was pushed up to a temporary registry to bring the service back up.
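
As a rough sketch of how the chart was pointed at the replacement image, the Bitnami Redis chart accepts an image override in its values file. The registry, repository, and tag below are placeholders rather than the values we actually used:

    # Hypothetical Helm values override pointing the Bitnami chart at a
    # custom Redict image; registry, repository, and tag are placeholders.
    image:
      registry: registry.example.com
      repository: floofy/redict-bitnami-compat
      tag: temp-hotfix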

How will we avoid this?

There were a few failures identified during this outage. The primary one is that the health checks defined in the Kubernetes manifests for our Redict pods are clearly insufficient for deciding whether the service is actually ready. We will be reviewing those health checks and making adjustments where we can. We will also be looking at the startup scripts that run when a pod comes up, ensuring that if they fail the error is properly surfaced and the pod is marked as failed, so that a bad image or update can't roll out to the other pods.
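
As a starting point, a readiness probe that actually queries the server, rather than just checking that the port is open, would catch at least part of this failure mode. The following is a sketch under the assumptions that redis-cli is available in the image and the instance does not require AUTH; it is not our final manifest:

    # Sketch of a stricter readiness probe: the pod is only marked ready
    # once the server responds to PING. Assumes redis-cli is present in
    # the image and no password is required.
    readinessProbe:
      exec:
        command:
          - sh
          - -c
          - redis-cli -p 6379 ping | grep -q PONG
      initialDelaySeconds: 10
      periodSeconds: 10
      failureThreshold: 3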