Something has just happened and you're looking at the smoldering wreckage of one of your production systems. Everyone's looking at you.
Step up and let everyone know that you've got this, and share as much as you can about what "this" might be.
Stop the Bleeding
Is the system still down? Get someone to bring it back up. Make sure that this is someone's primary focus.
You also need a quarterback. That's what we call it, anyways. Who's going to coordinate all the different paths of investigation?
If this is you, identifying those paths of investigation is part of this step. Make a list of owners and figure out the cadence for meetings amongst this team.
In those meetings you'll be getting a handle on the progress and viability of each approach. Investigate and remove blockers, and help the team decide when a fix has been rolled out and the crisis is over.
The work threads might look like this:
- John will see if we can roll back our last release. It included a change that might have caused this, but also DB schema changes.
- Alice will see about code changes to remove the problem.
- Lisa will reach out to see if impacted customers can work around the issue.
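If it helps to keep the threads somewhere more durable than a whiteboard, here is a minimal sketch of how they could be tracked. Everything in it is an assumption made for illustration: the `WorkThread` and `Status` names, the status values, and the rule in `crisis_over` that one rolled-out fix ends the crisis are all hypothetical, not part of any particular tool.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    # Hypothetical status values; adjust to whatever your team actually uses.
    INVESTIGATING = "investigating"
    BLOCKED = "blocked"
    FIX_ROLLED_OUT = "fix rolled out"
    ABANDONED = "abandoned"


@dataclass
class WorkThread:
    owner: str
    goal: str
    status: Status = Status.INVESTIGATING
    blocker: str = ""


def summarize(threads):
    """Return one status line per work thread, e.g. to read out in a check-in."""
    lines = []
    for t in threads:
        line = f"{t.owner}: {t.goal} [{t.status.value}]"
        if t.status is Status.BLOCKED and t.blocker:
            line += f" (blocked on: {t.blocker})"
        lines.append(line)
    return lines


def crisis_over(threads):
    """Assumed rule: the crisis is over once any thread has rolled out a fix."""
    return any(t.status is Status.FIX_ROLLED_OUT for t in threads)


threads = [
    WorkThread("John", "roll back the last release", Status.BLOCKED, "DB schema changes"),
    WorkThread("Alice", "code change to remove the problem"),
    WorkThread("Lisa", "customer workaround"),
]

for line in summarize(threads):
    print(line)
```

The same summary lines can be reused for status updates, so the quarterback only has to keep one list current.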
Something I often forget: think about who's going to get something out of being the quarterback. Maybe someone could use either the experience or the exposure?
Who needs to receive updates regarding the issue? The list might include:
- Your management.
- Impacted customers.
- Related teams.
Identify the communication cadence and the flow. Particularly for external communications, this may include who's going to draft them, who needs to review them (if anyone), and who will be sending them. My management always wants to know about the work threads.
You've got everything that you need. Keep checking on the work thread progress; that should churn along. Maybe one will finish or you'll think of a new one. Great!
Hopefully one of your swimlanes results in a win. You're done, call it a day!
Retro / Root Cause Analysis
Avoid blaming anyone or any team!
The goal is to identify where your process let you down, and how to improve the process so this doesn't happen again.