DBRE Escalation Process

About This Page

This page outlines the DBRE team escalation process and guidelines for developing the rotation schedule for handling infrastructure incident escalations.

Expectation

The expectation for the DBRE engineer is to act as a database consultant and collaborate with the EOC who requested the on-call escalation to troubleshoot together. There is no expectation that the DBRE engineer is solely responsible for resolving the escalation.

Escalation Process

Scope of Process

  1. This process is designed for the following issues:
    1. GitLab.com S1 and S2 production incidents raised by the Engineer On Call (EOC), Development, and Security teams.
  2. This process is NOT a path to reach the DBRE team for non-urgent issues that the Development, Security, and Support teams run into. Such issues can be moved forward by:
    1. Labelling with team::Database Reliability and following the Reliability General Workflow
    2. Raising to the #g_infra_database_reliability Slack channel, or
    3. Asking in the #infrastructure-lounge Slack channel and mentioning the @dbre user group
  3. This process provides weekday coverage only.

Examples of Qualified Issues

  1. Production issue examples:
    1. GitLab.com: S1/S2 incidents such as a DB failover or degraded GitLab.com performance
    2. GitLab.com: a Severity 1 vulnerability that is being actively exploited, or has a high likelihood of being exploited, and puts the confidentiality, availability, and/or integrity of customer data in jeopardy.

Process Outline

NOTE: The DBRE providing support does not need to announce the beginning/end of their shift in #db_squad unless there is an active incident (check the channel's chat history to confirm whether there is one). This is because many engineers have very noisy notifications enabled for that channel, and such announcements are essentially false positives that make them check the channel unnecessarily.

Weekdays (UTC)

  1. Incidents will be escalated by the EOC or Incident Manager by notifying the DBRE team through the @dbre Slack handle; an eligible DBRE responds according to their working hours.
  2. During an incident, the available DBRE can pass the incident to another DBRE or a Reliability EM if they are urgently needed elsewhere.
  3. In timezones where we have only one DBRE, the DBRE can pass the incident to the available Reliability Engineering Manager, who will work to find someone (not necessarily a DBRE) who can help.

Escalation

  1. The EOC/IM notifies the on-call DBRE via the @dbre Slack handle, requesting that the DBRE join the incident Zoom/channel
  2. The DBRE acknowledges the ping and joins the incident channel and Zoom
  3. If the DBRE does not respond, the EOC/IM notifies the available Reliability EM
  4. The DBRE triages the issue and works towards a solution.
  5. If necessary, the DBRE reaches out for further help or a domain expert as needed.

In the event that no DBRE engineers respond to the ping, the EOC will then notify the Reliability Engineering Managers. They will need to find someone available and note this in the escalation thread. As an EM:

  1. Try to find someone available from the DBRE group
  2. If someone is available, leave a message in the thread acknowledging that the engineer will be looking into the issue

Weekends and Holidays (UTC)

The first iteration will only focus on weekdays.

First response time SLOs

OPERATIONAL EMERGENCY ISSUES ONLY

  1. GitLab.com: DBRE engineers provide an initial response (not a solution) in both the incident channel and the tracking issue within 15 minutes.

Relay Handover

  • Since the DBREs who are on call may change frequently, responsibility for being available rests with them.

  • In the instance of an ongoing escalation, no DBRE should finish their on-call duties until they have arranged for and confirmed that the DBRE taking over from them is present, or they have notified someone who is able to arrange a replacement. They do not have to find a replacement themselves, but they need confirmation from someone that a replacement will be found.

  • In the instance of an ongoing escalation being handed over to an incoming on-call DBRE, the current on-call DBRE summarizes the full context of ongoing issues, such as but not limited to

    • Current status
    • What was attempted
    • What to explore next, if there are any clues
    • Anything that helps bring the next on-call DBRE up to speed quickly

    These summary items should be in written format in the following locations:

    • Existing threads in respective Incident channel
    • Incident tracking issues

    This should be completed at the end of each shift to ensure a smooth handover.

  • For current Production incident issues and status, refer to the Production Incidents board.

  • If an incident is ongoing at the time of handover, the outgoing DBRE may prefer to remain on call for another shift. This is acceptable as long as the incoming DBRE agrees.

  • If you were involved in an incident that was mitigated during your shift, leave a note about your involvement in the incident issue and link to it in the respective incident Slack channel, indicating that you participated in the issue, as an informational hand-off to the future on-call DBRE.

Resources

Responding Guidelines

When responding to an incident, use the procedure below as a guideline to assist both yourself and the members requesting your assistance:

  1. Join the Incident Zoom - this can be found bookmarked in the #incident-management Slack Channel
  2. Join the appropriate incident Slack channel for all text-based communications - normally this is #incident-<ISSUE NUMBER>
  3. Work with the EOC to determine if a known code path is problematic
  • Should this knowledge be in your domain, continue working with the EOC to troubleshoot the problem
  • Should this be something you are unfamiliar with, attempt to determine code ownership by team - knowing this will enable us to see if we can bring an Engineer from that team into the Incident
  4. Work with the Incident Manager to ensure that the Incident issue is assigned to the appropriate Engineering Manager - if applicable

Shadowing An Incident Triage Session

Feel free to participate in any incident triaging call if you would like to have a few rehearsals of how it usually works. Simply watch out for active incidents in #incident-management and join the Situation Room Zoom call (link can be found in the channel) for synchronous troubleshooting. There is a nice blog post about the shadowing experience.

Replaying Previous Incidents

Situation Room recordings from previous incidents are available in this Google Drive folder (internal).

Shadowing A Whole Shift

To get an idea of what’s expected of an on-call DBRE and how often incidents occur, it can be helpful to shadow another shift. To do this, simply identify and contact the DBRE on-call to let them know you’ll be shadowing. During the shift, keep an eye on #incident-management for incidents and observe how the DBRE on-call follows the process if any arise.

Tips & Tricks of Troubleshooting

  1. How to Investigate a 500 error using Sentry and Kibana.
  2. Walkthrough of GitLab.com’s SLO Framework.
  3. Scalability documentation.
  4. Use Grafana and Kibana to look at PostgreSQL data to find the root cause.
  5. Use Grafana, Thanos, and Prometheus to troubleshoot API slowdown (see the sketch after this list).
  6. Let’s make 500s more fun.
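
As a quick companion to items 4 and 5 above, the sketch below shows one way to pull the same PromQL data that backs the Grafana dashboards directly from a Thanos/Prometheus HTTP API, which can be handy when iterating on a query during an incident. This is only an illustrative sketch: the endpoint URL and the example metric are hypothetical placeholders, not the actual internal values, and it is not a prescribed workflow.

```python
# Minimal sketch: run an instant PromQL query against a Thanos/Prometheus
# HTTP API and print the resulting series. The URL and the PromQL expression
# below are placeholders for illustration only.
import requests

THANOS_URL = "https://thanos.example.internal"  # hypothetical endpoint
PROMQL = "rate(pg_stat_database_xact_commit[5m])"  # example postgres_exporter metric


def instant_query(base_url: str, query: str) -> list:
    """Call the standard /api/v1/query endpoint and return the result list."""
    resp = requests.get(f"{base_url}/api/v1/query", params={"query": query}, timeout=30)
    resp.raise_for_status()
    payload = resp.json()
    if payload.get("status") != "success":
        raise RuntimeError(f"PromQL query failed: {payload}")
    return payload["data"]["result"]


if __name__ == "__main__":
    for series in instant_query(THANOS_URL, PROMQL):
        print(series["metric"], series["value"])
```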

Tools for Engineers

  1. Training videos of available tools
    1. Visualization Tools Playlist.
    2. Monitoring Tools Playlist.
    3. How to create Kibana visualizations for checking performance.
  2. Dashboard examples; more are available via the dropdown at the upper-left corner of any dashboard below
    1. Saturation Component Alert.
    2. Service Platform Metrics.
    3. SLAs.
    4. Web Overview.