DBO Escalation Process

About This Page

This page outlines the DBO team’s escalation process and guidelines for developing the rotation schedule for handling infrastructure incident escalations.

Expectation

The expectation for a DBO engineer is to act as a database consultant and collaborate with the Engineer On Call (EOC) who requested the on-call escalation to troubleshoot together. The DBO engineer is not expected to be solely responsible for resolving the escalation.

Escalation Process

Scope of Process

  1. This process is designed for the following issues:
    1. GitLab.com S1 and S2 production incidents raised by the Engineer On Call, Development, and Security teams.
  2. This process is NOT a path to reach the DBO team for non-urgent issues that the Development, Security, and Support teams run into. Such issues can be moved forward by:
    1. Labelling the issue with group::database operations and following the Reliability General Workflow,
    2. Raising it in the #g_database_operations Slack channel, or
    3. Asking in the #infrastructure-lounge Slack channel and mentioning the @dbre or @dbo user group.
  3. This process provides weekday coverage only.

Examples of qualifying issues

  1. Production issue examples:
    1. GitLab.com: an S1/S2 incident, such as a DB failover or degraded GitLab.com performance.
    2. GitLab.com: a Severity 1 vulnerability that is being actively exploited, or is highly likely to be exploited, and puts the confidentiality, availability, and/or integrity of customer data in jeopardy.

Process Outline

NOTE: The DBO providing support does not need to announce the beginning/end of their shift in #g_database_operations unless there is an active incident (check the channel's chat history to see whether one is in progress). Many engineers have very noisy notifications enabled for that channel, and such announcements are essentially false positives that make them check the channel unnecessarily.

Weekdays (UTC)

  1. Incidents are escalated by the EOC or Incident Manager, who notifies the DBO team through the @dbre or @dbo Slack handle; an eligible DBO responds according to their working hours.
  2. During incidents, the available DBO can pass the incident to another DBO or Reliability EM if they are urgently needed somewhere else.
  3. In timezones where we have only one person, the DBO can pass the incident to the available Reliability Engineering Manager, who will work to find someone (not necessarily a DBO) who can help.

Escalation

  1. The EOC/IM notifies the DBO on-call via the Slack handle @dbre or @dbo, requesting that the DBO join the incident Zoom/channel.
  2. The DBO acknowledges the ping and joins the incident channel and Zoom.
  3. If the DBO does not respond, the EOC/IM notifies the available Reliability EM.
  4. The DBO triages the issue and works towards a solution.
  5. If necessary, the DBO reaches out for further help or a domain expert as needed.

In the event that no DBO engineers respond to the ping, the EOC will then notify the Reliability Engineering Managers, who will need to find someone available and note this in the escalation thread. As an EM:

  1. Try to find someone available from the DBO group
  2. If the search is successful, leave a message in the thread acknowledging that the engineer will be looking into the issue

Weekends and Holidays (UTC)

The first iteration will only focus on weekdays.

First response time SLOs

OPERATIONAL EMERGENCY ISSUES ONLY

  1. GitLab.com: DBO engineers provide an initial response (not a solution) in both the incident channel and the tracking issue on a best-effort basis.

Relay Handover

  • Since the DBOs who are on call may change frequently, responsibility for being available rests with them.
  • If an escalation is ongoing, no DBO should finish their on-call duties until they have arranged for and confirmed that the DBO taking over from them is present, or until they have notified someone who is able to arrange a replacement. They do not have to find a replacement themselves, but they need confirmation from someone that a replacement will be found.
  • When an ongoing escalation is handed over to the incoming on-call DBO, the current on-call DBO summarizes the full context of the ongoing issues, such as but not limited to:
    • Current status
    • What was attempted
    • What to explore next, including any remaining clues
    • Anything that helps bring the next on-call DBO up to speed quickly

    These summary items should be written down in the following locations:
    • Existing threads in the respective incident channel
    • Incident tracking issues

    This should be completed at the end of the shift so the handover is smooth.

  • For current production incident issues and status, refer to the Production Incidents board.
  • If an incident is ongoing at the time of handover, the outgoing DBO may prefer to remain on call for another shift. This is acceptable as long as the incoming DBO agrees.
  • If you were involved in an incident that was mitigated during your shift, leave a note about your involvement in the incident issue and link to it in the respective incident Slack channel, indicating that you participated in the issue, as an informational hand-off to future on-call DBOs.

Resources

Responding Guidelines

When responding to an incident, use the procedure below as a guideline to assist both yourself and the members requesting your assistance:

  1. Join the Incident Zoom - this can be found bookmarked in the #incident-management Slack channel.
  2. Join the appropriate incident Slack channel for all text-based communications - normally this is #incident-<ISSUE NUMBER>.
  3. Work with the EOC to determine if a known code path is problematic.
    • If this is in your domain, continue working with the EOC to troubleshoot the problem.
    • If this is something you are unfamiliar with, attempt to determine code ownership by team - knowing this enables us to see if we can bring an Engineer from that team into the incident.
  4. Work with the Incident Manager to ensure that the incident issue is assigned to the appropriate Engineering Manager - if applicable.

Shadowing An Incident Triage Session

Feel free to participate in any incident triaging call if you would like to have a few rehearsals of how it usually works. Simply watch out for active incidents in #incident-management and join the Situation Room Zoom call (link can be found in the channel) for synchronous troubleshooting. There is a nice blog post about the shadowing experience.

Replaying Previous Incidents

Situation Room recordings from previous incidents are available in this Google Drive folder (internal).

Shadowing A Whole Shift

To get an idea of what’s expected of an on-call DBO and how often incidents occur, it can be helpful to shadow another shift. To do this, simply identify and contact the DBO on-call to let them know you’ll be shadowing. During the shift, keep an eye on #incident-management for incidents and observe how the DBO on-call follows the process if any arise.

Tips & Tricks of Troubleshooting

  1. How to Investigate a 500 error using Sentry and Kibana.
  2. Walkthrough of GitLab.com’s SLO Framework.
  3. Scalability documentation.
  4. Use Grafana and Kibana to look at PostgreSQL data to find the root cause (an illustrative query sketch follows this list).
  5. Use Grafana, Thanos, and Prometheus to troubleshoot API slowdown.
  6. Let’s make 500s more fun.
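
As a complement to the dashboard-based resources above, the sketch below is a minimal, non-authoritative illustration of the kind of pg_stat_activity check a DBO might run while looking for a root cause. It is not part of the official runbooks; the psycopg2 dependency, the placeholder connection string, and the 30-second threshold are assumptions made for this example only.

    # Illustrative sketch only: list sessions that have been running a query
    # for more than 30 seconds, a common first look at database load.
    # The DSN used below is a placeholder, not a real GitLab.com endpoint.
    import psycopg2

    LONG_RUNNING_SQL = """
        SELECT pid,
               now() - query_start AS duration,
               state,
               wait_event_type,
               left(query, 120)   AS query_snippet
        FROM pg_stat_activity
        WHERE state <> 'idle'
          AND now() - query_start > interval '30 seconds'
        ORDER BY duration DESC;
    """

    def long_running_queries(dsn):
        """Return (pid, duration, state, wait_event_type, query_snippet) rows."""
        with psycopg2.connect(dsn) as conn:
            with conn.cursor() as cur:
                cur.execute(LONG_RUNNING_SQL)
                return cur.fetchall()

    if __name__ == "__main__":
        # Placeholder read-only DSN for the replica under investigation.
        for row in long_running_queries("postgresql://readonly@replica.example.internal/gitlabhq_production"):
            print(row)

Similar information is usually available through the Grafana and Kibana dashboards linked above; querying directly is mainly useful as a cross-check or when the dashboards themselves are unavailable.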

Tools for Engineers

  1. Training videos of available tools
    1. Visualization Tools Playlist.
    2. Monitoring Tools Playlist.
    3. How to create Kibana visualizations for checking performance.
  2. Dashboard examples; more are available via the dropdown at the upper-left corner of any dashboard below
    1. Saturation Component Alert.
    2. Service Platform Metrics.
    3. SLAs.
    4. Web Overview.