-------------------------------------------------
Title: Disaster day
Date: 2022-03-09
Device: Laptop
Mood: Exhausted
-------------------------------------------------

The last few weeks at work I've spent planning a
disaster recovery drill. Turns out I didn't need
to wait that long.

We had a huge outage at work yesterday. One of
our junior engineers needed to ship some nginx
configuration changes to a service which we use
just to terminate SSL and do some simple
redirects (e.g. domain.com to www.domain.com).
He'd written the configuration change, pushed it
through the PR process, and then was ready to
deploy.
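
For context, that service does roughly this sort
of thing (hostnames and paths here are
placeholders, not our actual config):

    # Rough sketch only: redirect the bare domain
    # to www and terminate SSL while doing so.
    server {
        listen 80;
        listen 443 ssl;
        server_name domain.com;
        ssl_certificate     /etc/ssl/domain.com.crt;
        ssl_certificate_key /etc/ssl/domain.com.key;
        return 301 https://www.domain.com$request_uri;
    }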

Unfortunately, due to a lack of documentation and
process, he ran the entire Ansible task list
against all the servers in the inventory, not
just the one he intended.
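
I'm still piecing together exactly what was run,
but the gist is the difference between something
like these two invocations (playbook and host
names are mine, not ours):

    # Hits every host the play targets:
    ansible-playbook site.yml

    # What was intended: scope the run to one box.
    ansible-playbook site.yml --limit edge-proxy-01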

Half-way through this, he realised his error, and
^C'd his way back to 'safety'. Unfortunately this
left the system in an inconsistent state, and some
of the nginx upgrades which we'd applied to other
parts of the estate got applied to our live
infrastructure (to be honest, I'm still a little
unclear on this part and why this happened). But
anyway, the live infrastructure was left with no
nginx running, and a set of configuration
directives which left it unable to start.

Unfortunately, I was AFK when this started, and
after my phone blew up with alerts, I got back to
my desk about 30 minutes after the incident began.
I took control and started to calm everyone down
and ask questions, but by this time some of the
useful output was lost, and the team on the
problem had already had their 'fuck it, let's
reboot' moment (sigh), and I had to rebuild a
lot of useful context by questioning the team
about EXACTLY the steps they took. After about 90
minutes I think we had a handle on the problem,
but it was still unclear what to do next.

I ended up splitting the engineers on the call
into two teams: one to work on the live
infrastructure to repair the nginx configurations,
and the other to start on the Plan Z directives,
which involve rebuilding the configuration from
scratch. Thankfully, the first team quickly
identified some modifications we could make to
the configuration files to get nginx running
again (note that the problem was never our
application containers, which were happily
running; we just couldn't route traffic to them).
Once we had that figured out, it was just a case
of manually fixing the configurations, testing,
and notifying customers.

The final outage in total was about 4 hours, which
is the worst I've dealt with in more than a
decade. I think we have a lot of learning to do
over the next few weeks while we recover. There's
obvious engineering work now to remove the
footguns in Ansible which led to this happening in
the first place -- those are easy to fix. But the
more complex problem is how to move our customers
over to a model where they can have hot failover
to a spare. That's new territory for us here (not
for me personally, though I've never led that
effort before).
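
On the Ansible footguns: one guard rail I have in
mind (just a sketch, nothing agreed yet) is making
the plays refuse to run without an explicit
target, something like:

    # Sketch only -- the variable, play, and role
    # names are made up. The play fails fast unless
    # a target is supplied explicitly, e.g.
    #   ansible-playbook nginx.yml -e target=edge-proxy-01
    - name: Deploy nginx configuration
      hosts: "{{ target | mandatory }}"
      become: true
      roles:
        - nginx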

I've spent most of today trying to reassure the
engineer in question that this wasn't anything to
take personally. I think all engineers have SOME
story about the time they killed production
systems; it's part of the journey for most. But he
is young, and I don't want to let this incident
weigh too heavily on him.

I'm glad this didn't happen next week when I'll be
on holiday! I doubt I could have contributed much
from my phone in the mountains.

--C