DBO Escalation Process
Note
We are using PagerDuty for escalations.

About This Page
This page outlines the DBO team’s incident escalation policy.
Shortcuts
- DBO PagerDuty schedule
- Slack x PD integration: /pd trigger @dbo-oncall
- Slack handles: @dbre or @dbo-oncall
- Slack channels: #g_database_operations
- group::database operations
- Production Incidents
SLO and Expectations
- DBO RESPONSE IS ON A BEST-EFFORT BASIS
- LOCAL TIMEZONE, WEEKDAY COVERAGE ONLY
- S1 / S2 INCIDENTS ONLY
- NB1: Due to limited staffing (for example, having only one person in the EMEA timezone), there will be times during the business day, across multiple timezones, when no one is available to respond. We understand the criticality of responding to S1/S2 incidents and will make every effort to respond adequately and in a timely manner, but given current staffing levels we are not adhering to a hard SLO at this point. Given this situation, it is also expected that schedules may change on an ad-hoc basis.
- NB2: DBO will join incidents as a subject matter expert in a consultative capacity; there should be no expectation that the DBO engineer is solely responsible for resolving the escalation. There may be times when DBO needs to escalate to other subject matter experts, such as the Database Framework (DBF) team, in order to make headway on the incident at hand.
-
Escalation Process
Scope and Qualifiers
- GitLab.com S1 and S2 production incidents raised by the Incident Manager On Call, Engineer On Call, and Security teams.
- NB1: GitLab Dedicated support is consultative at this point. The DBO team is currently not equipped to support Dedicated databases, lacking both access and training. This may change in the future; check back here for updates on this topic.
- NB2: Self-Managed support is discretionary and will be evaluated on a case-by-case basis.
- NB3: This process is NOT a path to reach the DBO team for non-urgent issues. For non-urgent issues, please create a Request for Help (RFH) issue using this Issue template.
- NB4: The on-shift DBO is responsible for coordinating warm handoffs during shift changes, especially when there is an ongoing, active incident.
-
Escalation
- EOC/IM, Development, or Security pages the DBO on-call via PagerDuty.
- DBO responds by acknowledging the page and joining the incident channel and Zoom.
- DBO triages the issue and works towards a solution.
- If necessary, DBO reaches out for further help or domain experts as needed.
- NB1: If the DBO on-call does not respond, the escalation path defined within PagerDuty ensues.
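The paging step above can be sketched against PagerDuty's Events API v2, which is what integrations like /pd trigger ultimately call. This is a minimal illustration, not the team's actual integration: the routing key, summary, and source values below are placeholders/assumptions, and the actual HTTP POST to https://events.pagerduty.com/v2/enqueue is omitted.

```python
import json

# Hypothetical routing key -- the real one lives in the DBO PagerDuty service.
ROUTING_KEY = "<integration-routing-key>"

def build_trigger_event(summary, source, severity="critical"):
    """Build a PagerDuty Events API v2 'trigger' payload.

    Sending it is a POST of this JSON to
    https://events.pagerduty.com/v2/enqueue (not performed here).
    """
    assert severity in {"critical", "error", "warning", "info"}
    return {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": summary,    # shows up as the incident title
            "source": source,      # affected system, e.g. a DB host (placeholder)
            "severity": severity,
        },
    }

event = build_trigger_event(
    summary="S1: example database incident on GitLab.com",
    source="example-db-host",
)
print(json.dumps(event, indent=2))
```

Acknowledging and resolving the page use the same payload shape with event_action set to "acknowledge" or "resolve".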
Resources
Responding Guidelines
When responding to an incident, use the procedure below as guidelines to assist both yourself and the members requesting your assistance:
- Join the Incident Zoom - this can be found bookmarked in the #incident-management Slack channel
- Join the appropriate incident Slack channel for all text-based communications - normally this is #incident-<ISSUE NUMBER>
- Work with the EOC to determine if a known code path is problematic
- Should the knowledge of this be in your domain, continue working with the EOC to troubleshoot the problem
- Should this be something you are unfamiliar with, attempt to determine code ownership by team - knowing this will enable us to see if we can bring an Engineer from that team into the incident
- Work with the Incident Manager to ensure that the Incident issue is assigned to the appropriate Engineering Manager - if applicable
Shadowing An Incident Triage Session
Feel free to participate in any incident triaging call if you would like to have a few rehearsals of how it usually works. Simply watch out for active incidents in #incident-management and join the Situation Room Zoom call (link can be found in the channel) for synchronous troubleshooting. There is a nice blog post about the shadowing experience.
Replaying Previous Incidents
Situation Room recordings from previous incidents are available in this Google Drive folder (internal).
Shadowing A Whole Shift
To get an idea of what’s expected of an on-call DBO and how often incidents occur, it can be helpful to shadow another shift. To do this, simply identify and contact the DBO on-call to let them know you’ll be shadowing. During the shift, keep an eye on #incident-management for incidents.
Tips & Tricks of Troubleshooting
- How to Investigate a 500 error using Sentry and Kibana.
- Walkthrough of GitLab.com’s SLO Framework.
- Scalability documentation.
- Use Grafana and Kibana to look at PostgreSQL data to find the root cause.
- Use Grafana, Thanos, and Prometheus to troubleshoot API slowdown.
- Related incident: 2019-11-27 Increased latency on API fleet.
- Let’s make 500s more fun
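As a companion to the Grafana/Kibana tips above, a common first check when dashboards point at PostgreSQL is the pg_stat_activity view. This is a generic sketch, not a GitLab.com runbook: the SQL uses standard PostgreSQL columns, while the helper function, its row shape, and the 60-second threshold are assumptions for illustration.

```python
# Standard pg_stat_activity columns; run the SQL via psql or any client.
LONG_RUNNING_SQL = """
SELECT pid, state, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY duration DESC;
"""

def flag_long_running(rows, threshold_seconds=60):
    """Filter fetched rows of (pid, state, duration_seconds, query)
    down to queries that have been running longer than the threshold."""
    return [row for row in rows if row[2] > threshold_seconds]

# Illustrative rows, as if fetched from the query above:
rows = [
    (101, "active", 185.0, "SELECT * FROM big_table ..."),
    (102, "active", 2.4, "COMMIT"),
]
print(flag_long_running(rows))  # only pid 101 exceeds the threshold
```

Long-running queries are only one signal; correlate with lock waits and replication lag on the dashboards before drawing conclusions.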
Tools for Engineers
- Training videos of available tools
- Dashboard examples; more are available via the dropdown list at the upper-left corner of any dashboard below