Database Help Workflow
This guide helps Reliability and Support engineers quickly find the help they need for database-related emergencies.
Note
This page is intended to help you find resources during emergencies. If this isn’t an emergency (an S1/S2 issue), instead ensure there are appropriately labeled issues so they can be triaged by the team.
Non-Emergencies
Label your issue to be triaged by the right team:
- For operational or configuration issues, label them ~"team::Database Reliability"
- For issues related to the packaged Postgres in Omnibus/Charts, label them ~"group::distribution"
- For application issues, label them for the team responsible for that feature.
- If you’re not sure, use the guide below to help identify the right team.
If the issue is blocking or you need to escalate:
- For application issues, Post a detailed message in the channel for the team that is responsible for the related feature.
- For operational or configuration issues, post in #g_infra_database_reliability (internal)
- For issues related to the packaged Postgres in Omnibus/Charts, post in #g_distribution (internal)
1. Start
If the emergency is related to configuration or operations, for example:
- Operating Postgres
- Connection errors
- Replication errors
- SSL errors
Proceed to step 2. Configuration or Operational Errors
If not, proceed to step 3. Application Errors
2. Configuration or Operational Errors
If the emergency is related to an ongoing incident on GitLab.com or a Dedicated customer, follow the DBRE Escalation Process.
If the emergency is related to a self-managed customer, reach out to the Distribution Team, which manages self-managed configuration, in #g_distribution (internal).
3. Application Errors
If the emergency is related to a single query, page, or endpoint, for example:
- A page (or class of pages) has a 500 and Sentry identifies it as a PG Timeout Error
- A long-running transaction is identified as coming from a Sidekiq worker
- A single query is identified as taking 10s on production and slowing down the site
- A migration is failing or has resulted in an emergency
Go to step 4. Single Source Issues
If not, go to 5. Widespread Issues
4. Single Source Issues
If the emergency is from an error that includes a feature category, go to 8. Reach out to a team based on feature category. As subject matter experts for a given feature, backend engineers are typically familiar with the database patterns involved and are best suited to solve issues related to their features, even when the issue is related to database actions.
If the emergency is related to a migration, see 6. Determine Migration Source
If the emergency is related to a Rails controller, Sidekiq worker, API endpoint, or background migration, determine the feature category using details in our feature categorization guide, then go to 8. Reach out to a team based on feature category
If you need assistance to identify the source, go to 9. Escalating assistance
5. Widespread Issues
If the application (or a major component, for example Sidekiq) is down or unresponsive due to what you believe to be a database-related incident, that’s an “all hands on deck” situation.
- Activate Development On-Call. While it may seem unnecessary, many backend developers are familiar enough with the application and database that they should be able to help isolate a source while trying to get database experts involved.
- Reach out in the #database, #g_database, and #g_infra_database_reliability channels (internal) for expert help using the @db-team (database capability) or @dbre (database reliability) group handles.
6. Determine Migration Source
The easiest way is using git. From the gitlab repository, run:

```
git log --first-parent {path/to/migration.rb}
```
That should give you an output that includes a link to the merge request where the migration was added.
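As a concrete sketch of the step above (the migration path here is hypothetical; substitute the file named in the failing migration’s error):

```shell
# Trace a migration file back to the mainline commit that introduced it.
# Run from a checkout of the gitlab repository.
MIGRATION=db/migrate/20240101000000_add_example_column.rb

# --first-parent follows only mainline (merge) commits, so the newest
# result is the merge commit whose message links the originating MR.
git log --first-parent --oneline -- "$MIGRATION"
```

The `--oneline` flag is optional; drop it to see the full commit message, which usually contains the merge request reference.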
If that doesn’t give a clear answer, you can look at the tables involved in the migration and take a guess at the team. See 7. Determine a source based on a Table Name
If it’s still unclear what team to contact, go to 9. Escalating assistance
7. Determine a source based on a Table Name
Each database table has a documentation file that can be used to determine a corresponding group.
- Look for the corresponding file named {table_name}.yml in https://gitlab.com/gitlab-org/gitlab/-/tree/master/db/docs
- In the file, find the list of related feature_categories
- Using the feature category, go to 8. Reach out to a team based on feature category
- If there is more than one category, pick one from the list and start with that team
- If it’s still unclear what team to contact, go to 9. Escalating assistance. Be sure to include details about the table that’s causing the issue, and why you believe it’s involved.
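If you are doing this lookup repeatedly during an incident, the extraction can be scripted. This is a minimal sketch that assumes the common db/docs layout (a `feature_categories:` key followed by `- item` entries); the sample document below is hypothetical:

```python
def feature_categories(doc_text: str) -> list[str]:
    """Extract the feature_categories list from a db/docs YAML file.

    Uses simple line-based parsing to avoid a YAML dependency; assumes
    the usual flat layout of a key followed by '- item' entries.
    """
    categories, in_block = [], False
    for line in doc_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("feature_categories:"):
            in_block = True
            continue
        if in_block:
            if stripped.startswith("- "):
                categories.append(stripped[2:].strip())
            elif stripped:  # the next top-level key ends the block
                in_block = False
    return categories

# Hypothetical example mirroring the documented file shape:
sample = """\
table_name: merge_requests
classes:
- MergeRequest
feature_categories:
- code_review_workflow
description: Stores merge requests
"""
print(feature_categories(sample))  # → ['code_review_workflow']
```

To run it against a real checkout, read `db/docs/{table_name}.yml` and pass its contents to the function.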
8. Reach out to a team based on feature category
Even if the emergency is related to the database, or the error message mentions the database, the best first step is to contact the team responsible for that area of the application. The easiest way to identify that team is by feature category.
- Using the feature category, check the corresponding group in the Category-Team mappings
- Reach out in that team’s Slack channel, and @mention the team’s manager for assistance
- If the team doesn’t respond, go to 9. Escalating assistance.
9. Escalating assistance
When escalating an emergency, be as specific and detailed as possible. Per communication guidelines, avoid acronyms whenever possible.
Always include:
- A link to the issue, Sentry error, incident, or Zendesk ticket
- The text of any error messages
- Links to any applicable charts
- Details about the query, migration, or issue
For Ongoing GitLab.com or Dedicated Incidents
- Activate Development On-Call
- If development on-call needs additional database expertise, reach out in #database
- If there’s no response within 15 minutes, or the request is urgent, tag @db-team (Application) or @dbre (Infrastructure/Operations) in a thread on the original message
- If there’s no response to the ping within 15 minutes, and the request is urgent, use Slack to find the phone number of the Database or DBRE manager and text or call them.
For Support Escalations
- File a request for help issue
- Reach out in #database, include a link to the request for help