How To Establish a High Severity Incident Management Program

How To Establish a High Severity Incident Management Program

This is the first post in a three-part series on High Severity Incident (SEV) Management Programs. Check out part 2, Understanding The Role Of The Incident Manager On-Call (IMOC), and part 3, Understanding The Role Of The Technical Lead On-Call (TLOC).

Introduction

High severity incident management is the practice of recording, triaging, tracking, and assigning business value to problems that impact critical systems. The purpose of establishing a program is to enhance the customer experience by improving your infrastructure reliability and upskilling your team. In this guide, we will share how to establish and measure the success of your own high severity incident management program.

What is High Severity Incident Management?

The management of high severity incidents encompasses high severity incident (SEV) detection, diagnosis, mitigation, prevention, and closure. SEV prevention includes SEV review and SEV correlation.

What are SEVs?

SEV is a term used to refer to an incident, it is derived from the word severity.

What are common types of SEVs?

  • Availability Drop
  • Product Issue / Feature Broken
  • Data Loss
  • Security Risk

What are examples of SEVs?

SEVs are common across all industries. They impact our ability to catch flights, access our bank accounts, buy the latest game we are excited about and much more.

The following are examples of SEVs that have occurred:

More examples have been collected and shared by Outage Report.

What are SEV levels?

SEV levels are designed to be simple for everyone to quickly understand the amount of urgency required in a situation. Launching your high severity program with SEV levels is important because it will make it easier to inform everyone across your engineering team how you classify incidents.

SEV levels empower and educate your entire team to feel confident lodging SEVs and effectively prioritising them.

An example of SEV levels are described in the table below:

SEV LevelDescriptionTarget resolution timeWho is notified
SEV 0Catastrophic Service ImpactResolve within 15 minEntire company
SEV 1Critical Service ImpactResolve within 8 hoursTeams working on SEV & CTO
SEV 2High Service ImpactResolve within 24 hoursTeams working on SEV

How do your resolution times impact SLOs/SLAs?

There are three service level terms often used to measure the level of service that will be provided to customers. These terms are; service level indicators (SLIs), service level objectives (SLOs) and service level agreements (SLAs). Companies will often set an SLO which is higher than their SLA, for example the SLA provided to customers would be 99.99% but the internal unpublished SLO would be 99.999%.

An SLA level of 99.99 % uptime/availability gives the following periods of potential downtime/unavailability:

  • Daily: 8.6s
  • Weekly: 1m 0.5s
  • Monthly: 4m 23.0s
  • Yearly: 52m 35.7s

You can calculate the error budget you have available based on your SLO and SLA at uptime.is.

What is the full lifecycle of a SEV?

The full lifecycle of a SEV involves detection, diagnosis, mitigation, prevention and closure. There is a glossary of SEV terms just below.

null

When we are reporting on SEVs that have occurred in the past to look for trends and patterns, it is useful to understand the overall mean measurements for SEVs. For example, the mean time to detection (MTTD) for all SEVs. This measurement is useful to dive deeper and understand which types of systems or services take much longer to detect incidents for. Engineers will often take the time to tag incidents because it enables the creation of visualizations by cause and magnitude.

How are SEVs measured?

High Severity Incidents (SEVs) are measured by the availability error rate and total time of impact. We use the formula below to identify the number of customer requests which were impacted:

% loss * outage duration

This also lends itself to prioritization. For example, you can assign a value to each customer request and the result will be visibility of the most important requests. You may choose to assign a higher value to requests related to new product launches, e.g. the launch of a new TV series that is being streamed.

Revenue impact for Freemium products can occur due to outages impacting customer upgrade flows. For example, a Freemium photo storage service’s revenue could be impacted by a mail campaign outage. This could then result in campaign mail not sending when a user hits their storage quota and is prompted to upgrade.

Glossary of common industry SEV terms

The following acronyms are used to measure and track SEVs. This data is important to collect in your SEV tracking tool. It is best to fill in all data when the SEV has been resolved. When you store this data it can then be used to automatically generate monthly SEV metrics reports.

SEV TermDescription
TTITotal time of impact
TTDTime to detection
TTETime to engagement
TTRTime to resolution
TTPTime to prevention
TBFTime between failures

How do you create SEV levels for free and paid products?

When you are establishing your high severity incident management program it is important to consider the pricing model of your product. If your product pricing uses a freemium model you likely have different SEV levels than a product with subscription pricing and no free tier.

If your product has tiers such as business and enterprise with different costs you could classify requests for these products with different severity ratings. This is very useful if you have VIP customers who require higher levels of availability than the majority of your customers.

Example: SEV levels for data loss

Any SEV which involves loss of customer data should be classified as a SEV 0.

SEV LevelData Loss Impact
SEV 0Loss of customer data
SEV 1Loss of primary backup
SEV 2Loss of secondary backup

Example: SEV levels for a free or freemium product

The following diagram illustrates the way to measure SEV levels for a free or freemium product. For example, a 90% loss of customer requests * 10min is classified as a SEV 1 incident. This is considered a critical service impact but we do not classify it as catastrophic because of the length of the outage and insignificant impact on revenue.

null

The diagram below demonstrates a SEV 1 impacting WWW and API with no impact to client. This is classified as a critical service impact. We measure this SEV as 0.2% * 30 min (6) for WWW and 0.11% * 30 min (3.3) for API. The 99.99 SLA has been breached. In the example below the SLA is 99.99 for availability.

null

Example: SEV levels for a product with subscription or complex pricing

The following diagram illustrates the way to measure SEV levels for a product with subscription pricing. For example, a 90% loss of customer requests * 10min is classified as a SEV 0 incident. This is considered a catastrophic service impact due to the impact to revenue.

null

How do you ensure your team operates effectively during a SEV 0?

There are two roles which are important to establish when you create your high severity incident management program. These are critical to ensure your team can effectively diagnose, mitigate and prevent SEV 0s. These roles are the Incident Manager On-Call (IMOC) and the Tech Lead On-Call (TLOC).

Incident Manager On-Call (IMOC) / Call Leader

  • Small rotation of Engineering Leaders ( <10 )
  • One person is on-call in this role at any point in time.
  • Wide knowledge of services and engineering teams

Tech Lead On-Call (TLOC)

  • The engineer responsible for resolving the SEV
  • Deep knowledge of own service area
  • Deep knowledge of upstream and downstream dependencies

How do you set up IMOCs for success during SEV 0s?

IMOCs play a critical role in managing SEV 0s. It is important to train IMOCs before they do their first rotation because they are the sole person responsible for coordination and communication of the SEV 0 with the entire company.

IMOC training is best conducted in a one hour face-to-face training session with time for questions. It involves gaining an understanding of the following:

  • SEV levels
  • The full lifecycle of SEVs
  • Examples of previous SEV 0s that have occurred within your company
  • Access to an IMOC runbook with communications templates

To ensure IMOCs can be as effective as possible, the following practices should be in place:

  • A 24/7 rotation for IMOCs with one Primary IMOC and one Secondary IMOC on-call for a week
  • Automatically page the Primary IMOC for SEV 0s.
  • Set up automatic escalation to the Secondary IMOC if the Primary IMOC does not ack the page in 1min
  • A monthly sync for all IMOCs

When your IMOC is paged, the first thing they will need to do is look at the recorded SEV details and then open the Critical Services Dashboard. Next we will discuss how to categorize SEVs to enable the IMOC to quickly grok the impact of SEVs.

How do you categorise SEVs?

When establishing your high severity incident management program it is useful to be able to view requests split by different categories such as services or new products. For example, if your team is responsible for a website, API and desktop client, being able to determine the impact based on the requests for WWW, API and Client will be very useful. This will enable you to more quickly work through the full SEV lifecycle.

The dashboard below is an example of how your dashboard would appear during a SEV impacting WWW and API with no impact to Client. There is also a red line to assist in determining if the SLA has been breached. In the example below the SLA is 99.99 for availability.

null

How do you empower everyone in your company to record SEVs?

There is a significant benefit to enabling and empowering everyone in your company to have the understanding and confidence to record SEVs.

At the Toyota Motor Manufacturing plant in Kentucky they “pull the Andon cord” 5000 times per day. The Andon cord is a physical rope that follows the assembly line; it can be pulled at any time to stop the manufacturing line. When someone pulls the cord and the problem is serious enough, everyone stops what they’re doing until the problem is fixed.

To empower your team, set up tooling and automation to make it simple for everyone to record, discuss, recall, and correlate SEVs. Many teams are now distributed around the world with multiple offices. Giving your team an easy to use web-based tool to record SEVs will increase the likelihood your high severity incident management program will be successful.

We have created a simple Google Form for you to use with your engineering teams. This form will enable all engineers to lodge SEVs. You can create a copy of this RecordSEV tool Google Form. The RecordSEV tool can be seen below:

Critical Services Dashboard

How should you name SEVs?

To enable effective recall of SEVs and facilitate discussions it is useful to give SEVs descriptive names. A name generator that enables you to match a descriptive word and animal can be used to choose a name for the SEV. For example, “SEV 0 Cheeky Kangaroo”.

Descriptive SEV names will improve communication for your SEV program. Alternatively, SEVs can be named using numerical codes or timestamps.

What causes SEVs?

From industry analysis of SEVs we can classify the causes of SEVs as either technical issues or cultural issues. Some SEVs are caused by both technical and cultural issues. See the table below for examples:

Type of IssueExample
Technical IssueIncorrect configuration change
Cultural IssueLack of on-call training

What are some of the expected failures are you likely to experience?

Technical Issues

  • Dependency Failure
  • Region/Zone Failure
  • Provider Failure
  • Overheating
  • PDU failure
  • Network upgrades
  • Rack failures
  • Core Switch failures
  • Connectivity issues
  • Flaky DNS
  • Misconfigured machines
  • Bugs
  • Corrupt or unavailable backups

Cultural Issues

  • Lack of knowledge sharing
  • Lack of knowledge handover
  • Lack of on-call training
  • Lack of Chaos Engineering
  • Lack of a high severity incident management program
  • Lack of documentation and playbooks
  • Lack of alerts and pages
  • Lack of effective alerting thresholds
  • Lack of backup strategy

How do you prevent SEVs from repeating?

It is possible to prevent SEVs from occurring again but to do this you must be able to record, correlate, and track SEVs.

We recommend making sure you have the following in place:

  • Descriptive SEV names
  • A form for all employees to record SEVs, e.g. RecordSEV tool
  • A postmortem tradition
  • Tracking if SEVs repeat, e.g. it’s another “Cheeky Kangaroo”
  • Capturing, tracking and resolving SEV Action Items
  • Chaos Engineering
  • Observability tools

How can you use Chaos Engineering to empower your teams to prevent SEVs?

Chaos Engineering is the facilitation of controlled experiments to uncover weaknesses of your infrastructure and team. Having your entire engineering team routinely performing chaos experiments on your systems will help you identify problems before they become incidents, and along the way increase your confidence that your systems and teams can handle failure. When used alongside the prevention methods described above, chaos empowers your teams to proactively uncover the technical and cultural issues that impact your company.

Here is a list of resources you can use to further explore how Chaos Engineering can be reduced to reduce the severity and number of incidents at your company:

Conclusion

A high severity incident management program is an important subset of reliability engineering, focused on ensuring that a team is prepared to manage incidents. This might seem complex; however, it can greatly improve your customer experience and empower you to meet SLAs. This will also empower you to be better prepared for compliance and auditing events as they arise. By following this guide you will be able to establish your own high severity incident management program and measure its success.

Join the Chaos Engineering Slack to connect with 5,000+ SREs and engineers about establishing SLOs and reducing incidents.

Join the Chaos Engineering Slack

Connect with 5,000+ engineers who are building more reliable systems with Chaos Engineering.
Email
Last Updated:
Categories: SRE

Related

Avoid downtime. Use Gremlin to turn failure into resilience.

Gremlin empowers you to proactively root out failure before it causes downtime. See how you can harness chaos to build resilient systems by requesting a demo of Gremlin.

Get started