The ReleaseTEAM Blog: Here's what you need to know...
Incident Management: Prepare and Plan
Part 2: Preparation Stage
“Before anything else, preparation is the key to success.”
Alexander Graham Bell
Last month we introduced Incident Management for DevOps teams. Unplanned service interruptions, or Incidents, will occur no matter how well you plan during your software development lifecycle. With proper Incident Management, however, you can quickly restore service, identify the cause of issues, and decide what action to take to prevent them in the future.
This month, let’s take a closer look at the preparation stage:
The preparation stage is about planning how your teams will identify Incidents and restore service. What steps and tools will be included in the Incident Management process? Furthermore, how will information about Incidents be tracked or shared with other teams to create workarounds, document known errors, or develop fixes?
Who will manage Incidents?
Before selecting a tool for managing Incidents, it’s important to document which team will receive and manage each Incident. In most companies, this is the Service Desk. For DevOps teams, user-reported Incidents will likely go to the Service Desk first, but automated alerts may be assigned to the team directly.
How will Incidents be detected?
Who (and what) can submit Incident reports? Incidents may originate from end users or may be triggered automatically by monitoring systems. What contact channels will the team support for reports from users: phone, email, web portal, or in person? If Incidents will come from monitoring tools, are those tools integrated with the system where Incidents are logged?
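One way to think about these questions is to picture a single record that every report ends up in, whatever its source. The following Python sketch is only an illustration: the `IncidentReport` fields, the channel names, and the `check` and `message` keys of the monitoring alert are assumptions, not the format of any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Contact channels for user reports (names are illustrative placeholders).
USER_CHANNELS = {"phone", "email", "web_portal", "in_person"}


@dataclass
class IncidentReport:
    """A single intake record, regardless of where the report came from."""
    summary: str
    channel: str
    reporter: str
    received_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    @property
    def automated(self) -> bool:
        # Reports created by a monitoring tool rather than a person.
        return self.channel == "monitoring"


def from_user(summary: str, channel: str, reporter: str) -> IncidentReport:
    """Accept a user report only through a supported contact channel."""
    if channel not in USER_CHANNELS:
        raise ValueError(f"unsupported contact channel: {channel}")
    return IncidentReport(summary, channel, reporter)


def from_alert(alert: dict) -> IncidentReport:
    """Convert a monitoring alert (hypothetical keys) into the same record."""
    return IncidentReport(
        summary=alert["message"],
        channel="monitoring",
        reporter=alert["check"],
    )
```

If both paths produce the same record, the team that receives Incidents can validate and classify them in one place, whether they arrived by phone or from an alert.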
How will Incidents be classified?
After receiving a possible Incident report, the team must validate that it is an Incident and not a Service Request. They then need to be able to classify the Incident type and log the Incident. What classification categories will be supported?
Incidents are usually prioritized by severity level. Your severity matrix will be unique to your environment, but most matrices follow a similar general format, ranking Incidents from most to least severe.
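As a rough illustration, one common way to build a severity matrix is to combine the impact of an Incident with its urgency. The sketch below assumes a three-step scale and four severity labels; the `Level` scale, the `SEV1` to `SEV4` names, and the scoring rule are hypothetical and should be replaced with whatever fits your environment.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale used for both impact and urgency."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def severity(impact: Level, urgency: Level) -> str:
    """Map an impact/urgency pair to an illustrative severity label."""
    score = impact + urgency  # ranges from 2 (low/low) to 6 (high/high)
    if score >= 6:
        return "SEV1"
    if score == 5:
        return "SEV2"
    if score == 4:
        return "SEV3"
    return "SEV4"


if __name__ == "__main__":
    # Print the full matrix, one row per impact level.
    for impact in Level:
        row = {urgency.name: severity(impact, urgency) for urgency in Level}
        print(f"{impact.name:<6} impact -> {row}")
```

Writing the matrix down as a simple lookup like this makes it easy to review with stakeholders before it is configured in a tool.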
If the initial team cannot resolve the Incident, what is the escalation process? Document which types of Incidents are escalated to which teams and how those teams will be notified. Will a handoff occur, or will Service Desk Level 1 remain the “face” of the ticket through resolution?
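The escalation questions above can be captured in a small routing table. This is a minimal sketch under stated assumptions: the Incident categories, team names, and notification channels are placeholders, and the `handoff` flag is just one way to record whether the Service Desk stays the point of contact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationRule:
    team: str        # team that takes over the technical work
    notify_via: str  # how that team is notified
    handoff: bool    # True: ownership moves; False: Service Desk stays the contact


# Hypothetical routing table; categories, teams, and channels are placeholders.
ESCALATION_RULES = {
    "database": EscalationRule("dba-on-call", "pager", handoff=False),
    "network": EscalationRule("network-ops", "pager", handoff=True),
    "application": EscalationRule("devops-team", "chat", handoff=False),
}

# Fallback for Incident types nobody documented in advance.
DEFAULT_RULE = EscalationRule("service-desk-level-2", "email", handoff=False)


def escalate(category: str) -> EscalationRule:
    """Return the documented rule, or the default for unknown categories."""
    return ESCALATION_RULES.get(category, DEFAULT_RULE)


def ticket_owner(rule: EscalationRule) -> str:
    """Who the user keeps talking to after the escalation."""
    return rule.team if rule.handoff else "service-desk-level-1"
```

For example, `ticket_owner(escalate("network"))` returns `"network-ops"` because that rule hands the ticket off, while an `"application"` Incident is worked by the DevOps team but stays with Service Desk Level 1 from the user's point of view.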
Several commercial tools are available to help companies implement Incident Management. Tool suites such as BMC Remedy, ServiceNow, or Atlassian Jira Service Management cover every aspect of IT Service Management (ITSM), including Incident Management.
However, there are quite a few ancillary tools that your team may need to detect, track, and resolve Incidents:
- Automated Monitoring. In a perfect world, Incidents are detected and corrected before they affect end users.
- Team Communications. How will teams communicate with each other to quickly restore service? What kind of alert systems are in place to notify offline team members?
- Customer Communications. How will the team communicate with the user or stakeholders? This can include phone systems, instant messaging, chat support, and more.
Incident Management is often the domain of the Service Desk, but DevOps teams need to understand how and when Incidents will be assigned to them, either automatically by a monitoring system or as part of an escalation hierarchy. The preparation stage is the time to plan out the roles, responsibilities, and tools that will help restore service quickly.