Incident management DnD
Any software system will eventually experience a degradation, an outage or a similar incident. Depending on the type of business you're in, this can lead to a loss of revenue and clients, or even have legal repercussions. That is why it's important to have a good incident management process for resolving incidents, reporting on their impact and preventing them from happening again.
In this blog post I'll describe my personal experience with some challenges of establishing an incident management process, and explain why I built a tabletop role-playing game, Deployments and Disasters, to deal with them.
My perspective
I work at Infobip. We are a tech company with over 300 developers and an additional 200 customer support engineers. Our developers are divided into teams of around 5 to 10 people. Teams own their services and are responsible for both building and maintaining them in production. Support engineers monitor key business metrics and maintain contact with clients. In total that's over 500 people, and each one of them can be called upon to participate in resolving an ongoing incident.
One benefit of this approach is that people most familiar with a given code-base will directly work on resolving the incident. Additionally, trained support personnel are in contact with clients thus allowing developers to focus on fixing the issue.
On the other hand, there are drawbacks to this approach. Not everyone will work equally well under pressure, different people have different levels of experience, each team might develop their own procedures, use different tools, etc. Most of these can be addressed with a fine-tuned incident management procedure and a set of common tools to back it up. Some good ideas include chatops, centralized logging and metrics, premade dashboards and alerting. I will not go into details on these things here.
Challenges to tackle
What I'd like to focus on instead are the following three challenges that emerge after the incident management procedure and all of the tools are in place:
- Everyone involved with incident management should be familiar with the procedure and the available tools. The more people are involved, the bigger this issue becomes. At some point even awareness of the process can become a problem.
- The incident management process involves different roles: customer support, programmers, sysadmins, database administrators, devops engineers, etc. They all need to work together, despite having different objectives at any given time during the incident.
- Many roles involved in incident management are technical. They view resolving the incident as their objective and are focused on detecting and removing the immediate cause of the issue. As a result they may not think of affected customers and thus miss out on opportunities to notify them of the impact, or even alleviate parts of it sooner.
Awareness of the procedure
Educating people about procedures and tools can generally be achieved with incident management training. Roughly speaking, there are 2 approaches to it:
- Simulating the incident realistically, with participants using actual tools, interacting with high fidelity data and directly applying their real life skills.
- Keeping the training abstract and basing it on gaming techniques. I've found success with adopting the mechanics of tabletop role-playing games.
Picking a game based approach has several advantages. For one, it avoids the prohibitive cost of recreating the data required to realistically simulate the incident. It also allows for addressing the other two challenges, namely empathy towards other roles and a customer centric mentality.
However, the killer feature of game based training is that it's fun, especially when compared to reading procedure documentation and how-to guides, or attending seminars. The benefits of this are twofold. First, it makes the exercise more memorable for the attendees. Second, it helps with organizing future sessions, as people are more interested in attending.
There's one additional benefit of the role-playing based approach. It turns the entire exercise into a structured storytelling experience. This structure provides a safe environment for all attendees to share their insights and knowledge with each other. The benefits are most noticeable with introverted players.
Empathy for other roles
At any one time during the incident, each role might have a different objective. For example, a support engineer needs to inform the clients of the exact impact of the incident. Programmers, on the other hand, need to identify the cause of the issue. In this situation support needs information about the client facing API from a development team that is focused on debugging the backend. This can create tension between those two roles.
One thing that games excel at is placing players into other people's shoes. In Deployments and Disasters I facilitate this by defining specific roles with unique mechanical characteristics. When starting the game session I make sure that players shuffle the roles so that they don't play the same one they have in the real world. For example, I encourage developers to play the role of customer support.
This has two benefits:
- Players get to experience what an incident looks like from the perspective of other roles. This builds empathy by making players go through the tough choices and strive for the hard to reach objectives that their colleagues usually face.
- It also encourages players to share their knowledge and practices. It reverses the real world dependencies between the roles. For example, if developers usually depend on database administrators for optimizing their databases, then inverting the roles will make admins more sympathetic towards the other role's needs.
Customer centric mindset
I'd like my developers to approach incident management with more of a customer centric mindset. Other teams, companies or situations may require some other adjustments. Fortunately, game mechanics are well suited for this.
In games, players regularly receive and accomplish arbitrary objectives. By carefully picking the stated objectives and mechanical incentives, game designers can influence the players' mindset.
In Deployments and Disasters I achieve this with a few rules:
- The main objective of the game session is to resolve the incident within a set number of turns, represented by an incident clock. At the beginning of the game players have 6 turns to resolve the issue. However, if they devote time to communicating the issues to the clients, their time doubles to a total of 12 turns.
- Clients are active actors in the game (controlled by the DM) and they can impact the state of the system. For example, they can escalate the problem by attempting to fix it themselves. Alternatively, they can be used to reveal valuable information and hints.
- During the course of the game some important (gold / platinum) clients can contact the players and ask for status updates or request special attention. This can be used to illustrate different types of clients.
- The incident scenario starts with only some clients impacted. Players can escalate the situation and spread the impact to other clients. Or they can proceed with caution and reduce the impact as they go along.
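To make the incident clock rule more concrete, here is a minimal Python sketch of how it could be tracked during a session. It is not part of the published rules: the class and method names are made up for illustration, and I'm assuming that informing the clients does not by itself consume a turn and that the bonus applies only once. If your table charges a turn for communication, call advance() as well.

```python
from dataclasses import dataclass


@dataclass
class IncidentClock:
    """Hypothetical helper that tracks the turns left to resolve the incident."""

    base_turns: int = 6             # starting budget, as stated in the rule
    elapsed: int = 0                # turns already played
    clients_informed: bool = False  # set once players communicate with clients

    @property
    def total_turns(self) -> int:
        # Communicating with clients doubles the budget (6 becomes 12).
        return self.base_turns * 2 if self.clients_informed else self.base_turns

    @property
    def remaining(self) -> int:
        return max(self.total_turns - self.elapsed, 0)

    def advance(self) -> None:
        """Play one turn; fails if the clock has already run out."""
        if self.remaining == 0:
            raise RuntimeError("The incident clock has run out")
        self.elapsed += 1

    def inform_clients(self) -> None:
        """Record that clients were informed (assumption: bonus applies once)."""
        self.clients_informed = True


clock = IncidentClock()
clock.advance()
clock.advance()
clock.inform_clients()
print(clock.remaining)  # 10 turns left out of 12
```

Deriving the total from a flag, rather than adding turns directly, keeps the doubling from stacking if players inform the clients more than once.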
Work so far
So far I've set up a basic set of rules and game mechanics for Deployments and Disasters, which you can find on GitHub. The game presented there is an early sample of a work in progress. One significant omission is the lack of incident scenarios. I've created one for the test runs I've played at work, but it is tightly coupled with our internal procedures and the custom tools we use. My plan is to create an example scenario with open-source tooling that anyone can use as a base for their exercise.
I've held two test training sessions at work and feedback was generally good. Players found the game entertaining, but also reported learning about new tools and procedures. I have yet to create additional scenarios, but there's interest in replaying the existing one with teams that haven't seen it yet. I'm also exploring ways of connecting the exercise with the employee evaluation and professional development programs that we have.
Feel free to use Deployments and Disasters as a base for building your own incident scenarios. Or stay tuned for future developments, as I will strive to publish example scenarios myself. You can watch the GitHub repo for updates, or follow this blog by:
- Subscribing to its RSS feed.
- Following @antolius@qua.name in the fediverse.
If you have any feedback, comments or improvement ideas you can send me a pull request, or just contact me at: