The ReleaseTEAM Blog: Here's what you need to know...
Incident Management for DevOps Teams
Part 1: a Primer
We are all faced with a series of great opportunities – brilliantly disguised as insoluble problems.
John W. Gardner
No matter how well you’ve planned, service interruptions or unplanned reductions in service quality will happen. Under ITIL (IT Infrastructure Library), these service interruptions are called “incidents,” and the process of returning services to operation quickly or detecting impending incidents before the service goes down is “Incident Management.” ITIL labels the unknown cause of an incident as a “problem.”
DevOps teams move fast. High-performing DevOps teams may release updates hourly. These fast releases help companies get to market quickly, but every release is another chance to introduce an incident. According to Atlassian’s DevOps Trends Survey 2020, 29% of DevOps teams reported outages or downtime, and 48% said that bugs are the most common issue they encounter with releases. The same speed, however, helps reduce the time to restore service: when changes ship in small, frequent increments, an incident is easier to trace to a recent release. In contrast, tracking an incident back to a specific cause is more difficult when teams have to untangle months of development under a waterfall development lifecycle.
Let’s take a high-level look at the steps involved in Incident Management:
- Prepare: The first step in incident management is planning for the unknown and developing a response plan.
- Detect: How and when will your teams become aware that a service interruption has occurred? Will implementing a tool help identify errors before they bring the system down or affect customers?
- Respond: Once an incident is detected, can it be corrected automatically through failovers? Can users correct the error through self-service? Can the service desk resolve the error independently, or does it need to pull in developers, vendors, and other resources? Has a more extreme incident reached the threshold for becoming a declared disaster that invokes the disaster recovery (DR) plan?
- Communicate: Keep customers and stakeholders informed for as long as service is interrupted.
- Resolve: After service is restored, the team still needs to conduct a postmortem to identify the cause and make corrections that prevent the incident from recurring.
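The questions in the response step amount to an escalation ladder, and it can help to write that ladder down before an incident happens. The Python sketch below is one possible way to encode it, not a prescribed ITIL procedure. Every name in it (`Incident`, `ResponsePath`, `choose_response`, and the boolean fields) is hypothetical and chosen for illustration, and it assumes the answers to each question are already known when the incident is triaged.

```python
from dataclasses import dataclass
from enum import Enum


class ResponsePath(Enum):
    """Hypothetical response paths, ordered from least to most disruptive."""
    AUTOMATIC_FAILOVER = "automatic failover"
    SELF_SERVICE = "user self-service"
    SERVICE_DESK = "service desk"
    ESCALATE = "escalate to developers or vendors"
    DECLARE_DISASTER = "declare disaster and invoke DR plan"


@dataclass
class Incident:
    """Illustrative incident record; the fields mirror the response questions."""
    summary: str
    failover_available: bool = False
    self_service_fix: bool = False
    service_desk_can_resolve: bool = False
    meets_disaster_threshold: bool = False


def choose_response(incident: Incident) -> ResponsePath:
    """Pick the least disruptive path that can restore service."""
    # A declared disaster overrides everything else (assumption: the
    # threshold is defined in advance, as part of the Prepare step).
    if incident.meets_disaster_threshold:
        return ResponsePath.DECLARE_DISASTER
    if incident.failover_available:
        return ResponsePath.AUTOMATIC_FAILOVER
    if incident.self_service_fix:
        return ResponsePath.SELF_SERVICE
    if incident.service_desk_can_resolve:
        return ResponsePath.SERVICE_DESK
    # Nothing cheaper applies, so pull in additional resources.
    return ResponsePath.ESCALATE


if __name__ == "__main__":
    incident = Incident("Login page returns errors", service_desk_can_resolve=True)
    print(choose_response(incident).value)  # prints "service desk"
```

Checking the disaster threshold first is a design choice in this sketch: it keeps a severe incident from being routed to a failover or the service desk just because those options happen to exist.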
In each of these Incident Management steps, a healthy DevOps organization cultivates a culture of zero blame. This zero-blame culture enables teams to cooperate and collaborate on quickly restoring service without fear of reprisal.
Evaluating and implementing tools for each of these steps can help teams detect incidents faster through automated monitoring, respond more quickly, collaborate more effectively on solutions, keep stakeholders informed throughout the process, and make adjustments that prevent recurrences.