A key benefit of having DevOps teams involved in Incident Management is their familiarity with the application code. This knowledge can help organizations estimate effort, parse monitoring information, and identify resolutions more quickly. In many cases, DevOps teams may need to collaborate with other teams, including Product Management and Engineering.
Communication with Customers and Stakeholders
For each of the severity levels above, it’s essential to have a communication plan. A localized, low-urgency incident that does not affect the entire system or user base does not require the same communication plan as a system outage or widespread security incident affecting the company and its customer base. Establishing communication templates ahead of time can streamline responses, and preapproved wording can help avoid a PR misstep.
Status pages are an excellent customer-facing communication tool, particularly for commercial applications and websites with a large number of users. Customers can check a system status page before opening duplicate tickets, get an estimate of expected downtime, and follow the incident for updates. Social media channels are another external communication tool and can be updated with the status alongside the status page.
For internal communications with teams like Product, Security, Engineering, and Customer Support, organizations may use chat tools like Slack or Microsoft Teams. For high severity incidents, this may include setting up a dedicated channel to organize all communications about that incident. These tools provide a record of what each team member is working on and provide context, especially if the response spans more than one shift.
Throughout the response and resolution phases, provide regular updates to colleagues and customers. It’s better for them to hear that the team is still working on the issue than to hear nothing.
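As a rough illustration of pairing each severity level with its own communication plan and preapproved template, here is a minimal Python sketch. The severity names ("low", "high"), channel names, update intervals, and message wording are all hypothetical placeholders, not values from this article or from any particular tool; a real plan would use your organization's own levels and approved text.

```python
from dataclasses import dataclass
from string import Template
from typing import List, Tuple


@dataclass(frozen=True)
class CommunicationPlan:
    """Where to post updates, how often, and which preapproved wording to use."""
    channels: Tuple[str, ...]
    update_interval_minutes: int
    template: Template


# Hypothetical severity levels and values, for illustration only.
PLANS = {
    "low": CommunicationPlan(
        channels=("internal_chat",),
        update_interval_minutes=240,
        template=Template("We are investigating a minor issue with $service."),
    ),
    "high": CommunicationPlan(
        channels=("internal_chat", "status_page", "social_media"),
        update_interval_minutes=30,
        template=Template(
            "We are investigating an outage affecting $service. "
            "Next update in $interval minutes."
        ),
    ),
}


def draft_updates(severity: str, service: str) -> List[Tuple[str, str]]:
    """Return (channel, message) pairs for one incident update."""
    plan = PLANS[severity]
    # Extra keyword arguments are ignored by templates that do not use them.
    message = plan.template.substitute(
        service=service, interval=plan.update_interval_minutes
    )
    return [(channel, message) for channel in plan.channels]


if __name__ == "__main__":
    for channel, message in draft_updates("high", "checkout"):
        print(f"[{channel}] {message}")
```

Keeping the wording in templates means the text can be reviewed and approved before an incident happens, while the per-severity plan decides which audiences see it and how often they should expect an update.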
Resolution, Post-Mortem, and Improvement
Restoring service quickly for a high severity incident may depend on one or more workarounds. Once service is restored, examine what went wrong and determine how to implement a long-term fix. A post-mortem, or post-incident review, is also the time to look at improving the processes, documentation, and product to prevent similar incidents from recurring. Don’t forget to communicate with stakeholders and customers, letting them know that the company takes service interruptions seriously.
All teams experience unexpected service outages at some time. By planning for incident management, communicating transparently and regularly, and incorporating what you’ve learned from the incident back into the product, you can keep your customers’ trust.
More information:
ReleaseTEAM can help organizations plan and implement Incident Management tools such as Atlassian Jira Software, Atlassian OpsGenie, and Atlassian StatusPage.