DBRE Escalation Process
Note
We are using Slack, @dbre, for escalations.
About This Page
This page outlines the DBRE team escalation process and guidelines for developing the rotation schedule for handling infrastructure incident escalations.
Expectation
The DBRE engineer is expected to act as a database consultant and collaborate with the EOC who requested the on-call escalation to troubleshoot the issue together. The DBRE engineer is not solely responsible for resolving the escalation.
Escalation Process
Scope of Process
- This process is designed for the following issues:
  - GitLab.com S1 and S2 production incidents raised by the Engineer On Call, Development, and Security teams.
- This process is NOT a path to reach the DBRE team for non-urgent issues that the Development, Security, and Support teams run into. Such issues can be moved forward by:
  - Labelling with team::Database Reliability and following the Reliability General Workflow,
  - Raising to the #g_infra_database_reliability Slack channel, or
  - Asking in the #infrastructure-lounge Slack channel and assigning the @dbre user group.
- This process provides weekday coverage only.
Examples of qualified issues
- Production issue examples:
  - GitLab.com: S1/S2 incidents, a DB failover, or degraded GitLab.com performance.
  - GitLab.com: a Severity 1 vulnerability that is being actively exploited, or has a high likelihood of being exploited, and puts the confidentiality, availability, and/or integrity of customer data in jeopardy.
Process Outline
NOTE: The DBRE providing support does not need to announce the beginning/end of their shift in #db_squad unless there is an active incident happening (check the chat history of the channel to know if there is an active incident). This is because many engineers have very noisy notifications enabled for that channel, and such announcements are essentially false positives which make them check the channel unnecessarily.
Weekdays (UTC)
- Incidents will be escalated by the EOC or Incident Manager by notifying the DBRE team through the @dbre Slack handle; an eligible DBRE responds according to their working hours.
- During an incident, the available DBRE can pass the incident to another DBRE or Reliability EM if they are urgently needed somewhere else.
- In timezones where we have only one DBRE, the DBRE can pass the incident to the available Reliability Engineering Manager, who will work to find someone (not necessarily a DBRE) who can help.
Escalation
- The EOC/IM notifies the on-call DBRE via the @dbre Slack handle, requesting that the DBRE join the incident Zoom/channel.
- The DBRE acknowledges the ping and joins the incident channel and Zoom.
- If the DBRE does not respond, the EOC/IM notifies the available Reliability EM.
- The DBRE triages the issue and works towards a solution.
- If necessary, the DBRE reaches out for further help or a domain expert.
In the event that no DBRE engineers respond to the ping, the EOC will then notify the Reliability Engineering Managers. They will need to find someone available and note this in the escalation thread. As an EM:
- Try to find someone available from the DBRE group.
- If the search is successful, leave a message in the thread as an acknowledgement that the engineer will be looking into the issue.
Weekends and Holidays (UTC)
The first iteration will only focus on weekdays.
First response time SLOs
OPERATIONAL EMERGENCY ISSUES ONLY
- GitLab.com: DBRE engineers provide initial response (not solution) in both incident channel and the tracking issue within 15 minutes.
Relay Handover
- Since the DBREs who are on call may change frequently, responsibility for being available rests with them.
- In the instance of an ongoing escalation, no DBRE should finish their on-call duties until they have arranged for and confirmed that the DBRE taking over from them is present, or they have notified someone who is able to arrange a replacement. They do not have to find a replacement themselves, but they need confirmation from someone that a replacement will be found.
- In the instance of an ongoing escalation being handed over to another incoming on-call DBRE, the current on-call DBRE summarizes the full context of ongoing issues, such as but not limited to:
  - Current status
  - What was attempted
  - What to explore next, if there are any clues
  - Anything that helps bring the next on-call DBRE up to speed quickly
  These summary items should be written down in the following locations:
  - Existing threads in the respective incident channel
  - Incident tracking issues
  This shall be completed at the end of shifts to hand over smoothly.
- For current production incident issues and status, refer to the Production Incidents board.
- If an incident is ongoing at the time of handover, the outgoing DBRE may prefer to remain on-call for another shift. This is acceptable as long as the incoming DBRE agrees.
- If you were involved in an incident which was mitigated during your shift, leave a note about your involvement in the incident issue and link to it in the respective incident Slack channel, indicating that you participated in the issue as an informational hand-off to future on-call DBREs.
Resources
Responding Guidelines
When responding to an incident, use the procedure below as a guideline to assist both yourself and the members requesting your assistance.
- Join the Incident Zoom; this can be found bookmarked in the #incident-management Slack channel.
- Join the appropriate incident Slack channel for all text-based communications. Normally this is #incident-<ISSUE NUMBER>.
- Work with the EOC to determine if a known code path is problematic.
  - Should this knowledge be within your domain, continue working with the EOC to troubleshoot the problem.
  - Should this be something you are unfamiliar with, attempt to determine code ownership by team. Knowing this will enable us to see if we can bring an engineer from that team into the incident.
- Work with the Incident Manager to ensure that the incident issue is assigned to the appropriate Engineering Manager, if applicable.
Shadowing An Incident Triage Session
Feel free to participate in any incident triaging call if you would like to have a few rehearsals of how it usually works. Simply watch out for active incidents in #incident-management and join the Situation Room Zoom call (link can be found in the channel) for synchronous troubleshooting. There is a nice blog post about the shadowing experience.
Replaying Previous Incidents
Situation Room recordings from previous incidents are available in this Google Drive folder (internal).
Shadowing A Whole Shift
To get an idea of what’s expected of an on-call DBRE and how often incidents occur it can be helpful to shadow another shift. To do this simply identify and contact the DBRE on-call to let them know you’ll be shadowing. During the shift keep an eye on #incident-management for incidents and observe how the DBRE on-call follows the process if any arise.
Tips & Tricks of Troubleshooting
- How to Investigate a 500 error using Sentry and Kibana.
- Walkthrough of GitLab.com’s SLO Framework.
- Scalability documentation.
- Use Grafana and Kibana to look at PostgreSQL data to find the root cause (see also the query sketch after this list).
- Use Grafana, Thanos, and Prometheus to troubleshoot API slowdown.
- Related incident: 2019-11-27 Increased latency on API fleet.
- Let’s make 500s more fun
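To make the Grafana/Kibana bullet above more concrete, here is a minimal sketch of the kind of data those PostgreSQL troubleshooting dashboards are built on: a Python/psycopg2 query against pg_stat_activity for long-running queries. The connection string, the read-only user, and the 5-minute threshold are illustrative assumptions, not part of our tooling; in practice, start with the dashboards linked above.

```python
# Illustrative only: list long-running queries from pg_stat_activity.
# Connection details and thresholds are placeholders, not production settings.
import psycopg2


def long_running_queries(conn, min_seconds=300):
    """Return (pid, duration, state, query) for non-idle queries older than min_seconds."""
    sql = """
        SELECT pid,
               now() - query_start AS duration,
               state,
               query
        FROM pg_stat_activity
        WHERE state <> 'idle'
          AND query_start < now() - make_interval(secs => %s)
        ORDER BY duration DESC;
    """
    with conn.cursor() as cur:
        cur.execute(sql, (min_seconds,))
        return cur.fetchall()


if __name__ == "__main__":
    # Hypothetical read-only connection; replace with your own DSN.
    conn = psycopg2.connect("host=localhost dbname=gitlabhq_production user=readonly")
    for pid, duration, state, query in long_running_queries(conn):
        print(pid, duration, state, query[:80])
    conn.close()
```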
Tools for Engineers
- Training videos of available tools
- Dashboard examples; more are available via the dropdown at the upper-left corner of any dashboard below.