The ReleaseTEAM Blog: Here's what you need to know...
Incident Management: Respond, Communicate, and Resolve
In this final installment of our “Incident Management for DevOps teams” series, we examine the Respond, Communicate, and Resolve steps.

Staff Response and Escalation
As part of the response, the on-call team receives and collects information about the incident from automated monitoring tools (or from a ticket if the incident is raised by a user). After receiving and assessing the information, the team prioritizes incidents according to severity and urgency. In part two of this series, we introduced the basic severity rating matrix used for this step.
A key benefit of having DevOps teams involved in Incident Management is their familiarity with the application code. This knowledge can help organizations estimate effort, parse monitoring information, and identify resolutions more quickly. In many cases, DevOps teams may need to collaborate with other teams, including Product Management and Engineering.
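To make the prioritization step concrete, here is a minimal Python sketch of ranking incidents by severity and urgency. It is not the matrix from part two of the series: the three-level scale, the `Level`, `Incident`, `priority`, and `triage` names, and the rule for combining the two ratings are all assumptions made for illustration, and a real team would substitute its own matrix.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale used for both severity and urgency."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Incident:
    title: str
    severity: Level  # assumed meaning: how much of the system or user base is affected
    urgency: Level   # assumed meaning: how quickly a response is needed


def priority(incident: Incident) -> int:
    """Return a priority from 1 (handle first) to 5 (handle last).

    Assumed rule: high severity combined with high urgency maps to 1,
    and each step down on either axis lowers the priority by one.
    """
    return 7 - (incident.severity + incident.urgency)


def triage(incidents: list[Incident]) -> list[Incident]:
    """Order the on-call queue so the highest-priority incidents come first."""
    return sorted(incidents, key=priority)


if __name__ == "__main__":
    queue = triage([
        Incident("Typo on help page", Level.LOW, Level.LOW),
        Incident("Checkout outage", Level.HIGH, Level.HIGH),
        Incident("Slow report export", Level.MEDIUM, Level.LOW),
    ])
    for incident in queue:
        print(f"P{priority(incident)}: {incident.title}")
```

Keeping the rule in one small function means the on-call team can change how severity and urgency combine without touching the code that sorts the queue.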
Communication with Customers and Stakeholders
For each severity level, it’s essential to have a communication plan. A localized, low-urgency incident that does not affect the entire system or user base does not require the same type of communication plan as a system outage or widespread security incident affecting the company and customer base. Establishing communication templates ahead of time can streamline responses, and preapproved wording helps avoid a PR misstep.
Status pages are an excellent customer-facing communication tool, particularly for commercial applications and websites with a large number of users. Customers can check a system status page before opening duplicate tickets, get an estimate of expected downtime, and follow the incident for updates. Social media channels are another external communication tool; status updates posted there can be used together with status pages.
For internal communications with teams like Product, Security, Engineering, and Customer Support, organizations may use chat tools like Slack or Microsoft Teams. For high-severity incidents, this may include setting up a dedicated channel to organize all communications about that incident. These tools provide a record of what each team member is working on and provide context, especially if the response spans more than one shift.
Throughout the response and resolution phases, provide regular updates to colleagues and customers. It’s better for them to hear that the team is still working on the issue than to hear nothing.
Resolution, Post-Mortem, and Improvement
Restoring service quickly for a high-severity incident may depend on one or more workarounds. Once service is restored, examine what went wrong and determine how to incorporate a long-term fix for the incident. A post-mortem, or post-incident review, is also the time to look at improving the processes, documentation, and product to prevent similar incidents from recurring. Don’t forget to communicate with stakeholders and customers, letting them know that the company takes service interruptions seriously.
All teams experience unexpected service outages at some time. If you plan for incident management, communicate transparently and regularly, and incorporate what you’ve learned from each incident back into making the product better, your customers will continue to trust you.
More information: ReleaseTEAM can help organizations plan and implement Incident Management tools such as Atlassian Jira Software, Atlassian OpsGenie, and Atlassian StatusPage.