
The ReleaseTEAM Blog: Here's what you need to know...

Incident Management: Respond, Communicate, and Resolve

In the final installment of our “Incident Management for DevOps Teams” series, we examine the Respond, Communicate, and Resolve steps:
[Figure: infographic of the Incident Management cycle, with the Respond, Communicate, and Resolve phases highlighted]

Staff Response and Escalation

As part of the response, the on-call team receives and collects information about the incident from automated monitoring tools (or from a ticket, if a user raised the incident). After assessing that information, the team prioritizes the incident according to its severity and urgency. In part two of this series, we introduced the basic severity rating matrix:
[Figure: chart showing how incidents are classified]
A key benefit of having DevOps teams involved in Incident Management is their familiarity with the application code. This knowledge can help organizations estimate effort, parse monitoring information, and identify resolutions more quickly. In many cases, DevOps teams may need to collaborate with other teams, including Product Management and Engineering.
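
The prioritization step described in this section can be sketched in a few lines of Python. This is only an illustration: the three-step `Level` scale, the `Incident` fields, and the rule that adds severity and urgency together are all assumptions made for the example, not the matrix from part two of this series.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale; a lower number is more serious."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


@dataclass(frozen=True)
class Incident:
    title: str
    severity: Level  # how much of the system or user base is affected
    urgency: Level   # how quickly a response is needed


def priority(incident: Incident) -> int:
    """Return a priority from 1 (handle first) to 5 (handle last).

    Assumed rule: add the two ratings and shift so the worst case is 1.
    """
    return incident.severity + incident.urgency - 1


def triage(incidents: list[Incident]) -> list[Incident]:
    """Order the queue so the highest-priority incidents come first."""
    return sorted(incidents, key=priority)


queue = triage([
    Incident("Typo on the pricing page", Level.LOW, Level.LOW),
    Incident("Checkout API returning errors", Level.HIGH, Level.HIGH),
    Incident("Slow report export for one customer", Level.MEDIUM, Level.LOW),
])
for item in queue:
    print(f"P{priority(item)}: {item.title}")
```

In practice the rule would come from your organization's own matrix; the point of the sketch is that once severity and urgency are recorded as data, ordering the on-call queue becomes a simple, repeatable step.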

Communication with Customers and Stakeholders

It’s essential to have a communication plan for each of the severity levels above. A localized, low-urgency incident that does not affect the entire system or user base does not need the same communication plan as a system outage or a widespread security incident that affects the company and its customers. Establishing communication templates ahead of time streamlines the response, and preapproved wording helps avoid a PR misstep.

Status pages are an excellent customer-facing communication tool, particularly for commercial applications and websites with a large number of users. Customers can check a system status page before opening duplicate tickets, get an estimate of expected downtime, and follow the incident for updates. Social media channels are another external communication tool and can be used together with status pages.

For internal communications with teams like Product, Security, Engineering, and Customer Support, organizations may use chat tools like Slack or Microsoft Teams. For high-severity incidents, this may include setting up a dedicated channel to organize all communications about that incident. These tools provide a record of what each team member is working on and provide context, especially if the response spans more than one shift.

Throughout the response and resolution phases, provide regular updates to colleagues and customers. It’s better for them to hear that the team is still working on the issue than to hear nothing.
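
One way to picture a severity-based communication plan with prewritten templates is the small Python sketch below. Everything in it is hypothetical: the severity names, the channel lists in `COMMUNICATION_PLAN`, the wording of `UPDATE_TEMPLATE`, and the `build_updates` helper are placeholders for whatever your own plan specifies, and the sketch does not call any real status page or chat API.

```python
from string import Template

# Hypothetical plan: which channels to update at each severity level.
COMMUNICATION_PLAN = {
    "high": ["status page", "social media", "dedicated incident chat channel"],
    "medium": ["status page", "team chat"],
    "low": ["team chat"],
}

# Preapproved wording, written ahead of time; only the placeholders
# change from one incident to the next.
UPDATE_TEMPLATE = Template(
    "[$status] $service: $summary Next update in $next_update_minutes minutes."
)


def build_updates(severity, service, status, summary, next_update_minutes):
    """Return a (channel, message) pair for every channel in the plan."""
    message = UPDATE_TEMPLATE.substitute(
        service=service,
        status=status,
        summary=summary,
        next_update_minutes=next_update_minutes,
    )
    return [(channel, message) for channel in COMMUNICATION_PLAN[severity]]


for channel, message in build_updates(
    "high", "Checkout", "Investigating",
    "We are still working on the issue.", 30,
):
    print(f"{channel}: {message}")
```

Keeping the plan and the wording as data means both can be reviewed and approved before an incident, so the on-call team only fills in the blanks under pressure.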

Resolution, Post-Mortem, and Improvement

Restoring service quickly for a high-severity incident may depend on one or more workarounds. Once service is restored, examine what went wrong and determine how to incorporate a long-term fix. A post-mortem, or post-incident review, is also the time to look at improving the processes, documentation, and product to prevent similar incidents from recurring. Don’t forget to communicate with stakeholders and customers, letting them know that the company takes service interruptions seriously.

All teams experience unexpected service outages at some time. If you plan for incident management, communicate transparently and regularly, and incorporate what you’ve learned from the incident back into the product, your customers will continue to trust you.

More information: ReleaseTEAM can help organizations plan and implement Incident Management tools such as Atlassian Jira Software, Atlassian OpsGenie, and Atlassian StatusPage.
