Handling Incidents

Operations workflow for handling incidents

Whenever an incident occurs, you should follow this process.

Create an issue

Always create an issue using the Incident template. The information you put in it initially does not need to be “complete”. Often it is just a simplified title and the error or problem that is occurring.

Determine criticality level

After issue creation, you need to determine the criticality level of the incident. This is done by locating the impacted items on the Customer Support Operations System Criticality Google sheet (internal access only).

In the event an incident is impacting multiple systems, use the highest criticality value.
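The "use the highest criticality value" rule can be sketched as a small helper. This is only an illustration: the ranking below assumes the four levels are ordered as they appear in the escalation table on this page (Mission Critical highest, Administrative lowest), and the names CRITICALITY_ORDER and highest_criticality are hypothetical, not part of any existing tooling.

```python
# Assumed ranking, highest criticality first (inferred from the order
# of the escalation table on this page; not an official definition).
CRITICALITY_ORDER = [
    "Mission Critical",
    "Business Critical",
    "Business Operational",
    "Administrative",
]


def highest_criticality(levels):
    """Return the highest criticality among the impacted systems."""
    levels = list(levels)
    if not levels:
        raise ValueError("at least one impacted system is required")
    # A lower index in CRITICALITY_ORDER means a higher criticality.
    return min(levels, key=CRITICALITY_ORDER.index)


# An incident touching an Administrative and a Business Critical system
# is handled as Business Critical.
print(highest_criticality(["Administrative", "Business Critical"]))
```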

Work to resolve the problem

Once the criticality level is set, work to resolve the problem. Comment on the issue frequently as you go so that your progress is documented.

Escalate

In some situations, you will not be able to resolve an incident quickly enough and will need to escalate it. The time between escalation levels depends on the criticality level of the incident:

Criticality level     Time to resolve  Escalate up to  Time to escalate to next level       Special notes
Mission Critical      1-2 hours        Level 4         10 minutes without a resolution path
Business Critical     2-4 hours        Level 3         30 minutes without a resolution path
Business Operational  24-48 hours      Level 2         8 hours without a resolution path    Move to level 3 if not resolved in time
Administrative        48-72 hours      Level 2         24 hours without a resolution path   Move to level 3 if not resolved in time

As an example, if you are working an Administrative level incident:

  • You have 24 hours to try to resolve it yourself
  • After 24 hours with no clear path to resolution, you must escalate to level 1
  • After 48 hours with no clear path to resolution, you must escalate to level 2
  • After 72 hours with no clear path to resolution, you must escalate to level 3. This is because it has now surpassed the documented time to resolve the incident

As another example, if you are working a Mission Critical level incident:

  • You have 10 minutes to try to resolve it yourself
  • After 10 minutes with no clear path to resolution, you must escalate to level 1
  • After 20 minutes with no clear path to resolution, you must escalate to level 2
  • After 30 minutes with no clear path to resolution, you must escalate to level 3
  • After 40 minutes with no clear path to resolution, you must escalate to level 4
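The timing rules above can be expressed as a short calculation. The following is a minimal sketch, not an official tool: the names ESCALATION_RULES and required_escalation_level are hypothetical, level 0 is used here to mean "still working it yourself", and it assumes that "after N hours" includes the moment N hours is reached. The "move to level 3" special note is modelled as taking effect once the upper bound of the time to resolve has passed, as in the Administrative example.

```python
from datetime import timedelta

# criticality -> (time between escalations,
#                 highest level reached by regular escalation,
#                 upper bound of time to resolve after which level 3
#                 is required, or None when no special note applies)
ESCALATION_RULES = {
    "Mission Critical": (timedelta(minutes=10), 4, None),
    "Business Critical": (timedelta(minutes=30), 3, None),
    "Business Operational": (timedelta(hours=8), 2, timedelta(hours=48)),
    "Administrative": (timedelta(hours=24), 2, timedelta(hours=72)),
}


def required_escalation_level(criticality, elapsed):
    """Return the escalation level due after `elapsed` time with no
    clear path to resolution (0 means no escalation is due yet)."""
    if elapsed < timedelta(0):
        raise ValueError("elapsed time cannot be negative")
    interval, max_level, overdue_after = ESCALATION_RULES[criticality]
    # One level per full interval, capped at the "escalate up to" level.
    level = min(elapsed // interval, max_level)
    # Special note: move to level 3 once the time to resolve is exceeded.
    if overdue_after is not None and elapsed >= overdue_after:
        level = max(level, 3)
    return level


print(required_escalation_level("Administrative", timedelta(hours=48)))
print(required_escalation_level("Mission Critical", timedelta(minutes=40)))
```

Keeping the rules in one table-like structure means a change to the documented times only has to be made in one place.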

Escalation levels

Level  Action
1      Post in team channel asking for assistance
2      Page Customer Support Operations Specialist oncall
3      Page Fullstack Engineer, Customer Support Operations oncall
4      Page Sr. Support Engineering Manager - Operations

Resolve the issue

Once you have resolved the incident, you need to resolve the issue you created. This is normally done by cross-linking MRs to the issue or by detailing what was done to fix the problem.

Once that is done, you can close the issue, unless an After Incident Review is required.

After Incident Review

For an After Incident Review, use the Customer Support Operations After Incident Reviews Google doc (internal only).

Make a duplicate of the Template tab, and then fill it in completely. You can use previous documents as an example of what is needed.

Make sure to notify the Sr. Support Engineering Manager - Operations of the document once you have completely filled it in.