Database

Database Reliability at GitLab

The Database Reliability Engineers (DBREs) are part of the Reliability Engineering teams that run GitLab.com. We care most about the database reliability aspects of the infrastructure and of GitLab as a product.

We strive to approach database reliability from a data-driven perspective as much as we can. To that end, we define Service Level Objectives below and document the service levels we currently aim to maintain for GitLab.com.

Database SLOs

We use Service Level Objectives (SLOs) to reason about the performance and reliability aspects of the database. We think of SLOs as “commitments by the architects and operators that guide the design and operations of the system to meet those commitments.” [1]

Backup and Recovery

In backup and recovery, there are three SLOs:

| SLO | Current level | Definition |
| --- | --- | --- |
| DB-DR-TTR | 8 hours | Maximum time to recovery from a full database backup in case of disaster. |
| DB-DR-RETENTION-MULTIREGIONAL | 7 days | The number of days we keep backups for recovery purposes in the Multi-regional storage class in GCS. |
| DB-DR-RETENTION-COLDLINE | From 8 to 90 days | The number of days we keep backups for recovery purposes in the Coldline storage class in GCS. |

The backup strategy is to take a daily snapshot of the full database (basebackup) and store it in Google Cloud Storage (GCS). Additionally, we archive the write-ahead log (WAL) data in GCS so that we can perform point-in-time recovery (PITR) starting from one of the basebackups. Read more on Disaster Recovery.

For DB-DR-TTR, we need to consider the worst-case scenario, in which the latest backup is 24 hours old. Recovery time therefore includes the time it takes to perform PITR, that is, to replay the archived WAL up to a certain point in time (right before the disaster).
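The worst case described above can be reasoned about as basebackup restore time plus WAL replay time. The following is a minimal sketch of that arithmetic. The function name, database size, WAL volume, and throughput figures are all hypothetical placeholders, not measured GitLab.com values.

```python
from datetime import timedelta

DB_DR_TTR = timedelta(hours=8)  # SLO level from the backup and recovery SLOs


def estimated_recovery_time(
    basebackup_bytes: int,
    restore_bytes_per_sec: float,
    wal_bytes_to_replay: int,
    replay_bytes_per_sec: float,
) -> timedelta:
    """Estimate recovery time as basebackup restore time plus WAL replay time."""
    restore_seconds = basebackup_bytes / restore_bytes_per_sec
    replay_seconds = wal_bytes_to_replay / replay_bytes_per_sec
    return timedelta(seconds=restore_seconds + replay_seconds)


# Hypothetical inputs: a 10 TiB basebackup and 2 TiB of WAL written over
# the 24 hours since that backup was taken (the worst case).
TIB = 1024 ** 4
MIB = 1024 ** 2
estimate = estimated_recovery_time(
    basebackup_bytes=10 * TIB,
    restore_bytes_per_sec=800 * MIB,
    wal_bytes_to_replay=2 * TIB,
    replay_bytes_per_sec=150 * MIB,
)
print("estimated recovery time:", estimate)
print("within DB-DR-TTR:", estimate <= DB_DR_TTR)
```

Plugging in real restore and replay rates from a recovery drill would show how much headroom remains under the 8 hour level.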

We are able to recover to any point in time within the retention periods given by the DB-DR-RETENTION-* SLOs above.
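As an illustration of how the two retention SLOs partition backups by age, here is a small sketch that maps a backup's age to the storage class expected to hold it. The function name and the exact handling of the day boundaries are assumptions. The returned strings are the GCS storage class identifiers.

```python
from typing import Optional

MULTIREGIONAL_LAST_DAY = 7   # DB-DR-RETENTION-MULTIREGIONAL: 7 days
COLDLINE_LAST_DAY = 90       # DB-DR-RETENTION-COLDLINE: from 8 to 90 days


def storage_class_for_backup(age_days: int) -> Optional[str]:
    """Return the storage class expected to hold a backup of the given age.

    Returns None when the backup is older than the retention window.
    """
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= MULTIREGIONAL_LAST_DAY:
        return "MULTI_REGIONAL"
    if age_days <= COLDLINE_LAST_DAY:
        return "COLDLINE"
    return None


for age in (0, 7, 8, 90, 91):
    print(age, storage_class_for_backup(age))
```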

High Availability

For GitLab.com we maintain availability above 99.95%. For the PostgreSQL database, we define the following SLOs:

| SLO | Level | Definition |
| --- | --- | --- |
| DB-HA-UPTIME | 99.9% | General database availability. |
| DB-HA-PERF | p99 < 200ms | 99th percentile of database query runtime is below this level. |
| DB-HA-LOSS | 60s | Maximum accepted data loss in the face of a primary failure. |

A DB-HA-UPTIME of 99.9% allows for roughly 45 minutes of downtime per month. Uptime means that the database cluster is available to serve queries from the application while maintaining the other database SLOs.
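The downtime figure follows directly from the availability target: 0.1% of a 30-day month is 43.2 minutes, which the text rounds to roughly 45. A minimal sketch of the calculation (the function name is illustrative):

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability: float, days_in_month: int = 30) -> float:
    """Return the downtime in minutes permitted by an availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * days_in_month * MINUTES_PER_DAY


print(round(allowed_downtime_minutes(0.999), 1))      # 30-day month
print(round(allowed_downtime_minutes(0.999, 31), 1))  # 31-day month
```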

We allocate a downtime budget of 45 minutes per month for planned downtimes, although we strive to keep downtime as low as possible. The downtime budget can be used to introduce change to the system. If the budget is used up (planned or unplanned), we stop introducing change and focus on availability (similar to SRE error budgets).
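The budget rule above can be expressed as a simple gate on change. This is a sketch only: the class and attribute names are hypothetical, and how downtime minutes are actually recorded is not specified here.

```python
from dataclasses import dataclass

MONTHLY_BUDGET_MINUTES = 45.0  # downtime budget per month


@dataclass
class DowntimeBudget:
    """Tracks downtime consumed in the current month (hypothetical model)."""

    planned_minutes: float = 0.0
    unplanned_minutes: float = 0.0

    @property
    def remaining_minutes(self) -> float:
        # Planned and unplanned downtime draw from the same budget.
        used = self.planned_minutes + self.unplanned_minutes
        return max(0.0, MONTHLY_BUDGET_MINUTES - used)

    def changes_allowed(self) -> bool:
        # Once the budget is used up, stop introducing change.
        return self.remaining_minutes > 0.0


budget = DowntimeBudget(planned_minutes=10.0, unplanned_minutes=20.0)
print(budget.remaining_minutes, budget.changes_allowed())
```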

As for DB-HA-PERF, 99% of queries should finish in under 200ms.
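To make the p99 criterion concrete, the sketch below checks a sample of query durations against the 200ms level. It uses the nearest-rank percentile definition, which is an assumption. A monitoring system may compute percentiles differently (for example, from histogram buckets).

```python
import math
from typing import Sequence

DB_HA_PERF_MS = 200.0  # DB-HA-PERF level


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Return the pct-th percentile of values using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_db_ha_perf(durations_ms: Sequence[float]) -> bool:
    """True if the 99th percentile of query durations is below the SLO level."""
    return percentile_nearest_rank(durations_ms, 99) < DB_HA_PERF_MS


# 99 fast queries and a single slow outlier still satisfy p99 < 200ms.
sample = [10.0] * 99 + [5000.0]
print(meets_db_ha_perf(sample))
```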

With DB-HA-LOSS we require an upper bound on replication lag. A write on the primary is considered at risk as long as it has not been replicated to a secondary (or to the PITR archive).
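One way to read this definition: a write stops being at risk once any secondary or the archive has received it, so the exposure is the lag of the most caught-up destination. The sketch below encodes that interpretation. The function names and the shape of the inputs are assumptions, and how lag is measured is left out.

```python
from typing import Mapping, Optional

DB_HA_LOSS_SECONDS = 60.0  # DB-HA-LOSS level


def data_loss_exposure_seconds(
    replica_lag_seconds: Mapping[str, float],
    archive_lag_seconds: Optional[float] = None,
) -> float:
    """Return how many seconds of writes exist only on the primary.

    The exposure is the smallest lag across all secondaries and the archive.
    With no destinations at all, the exposure is unbounded.
    """
    lags = list(replica_lag_seconds.values())
    if archive_lag_seconds is not None:
        lags.append(archive_lag_seconds)
    if not lags:
        return float("inf")
    return min(lags)


def meets_db_ha_loss(
    replica_lag_seconds: Mapping[str, float],
    archive_lag_seconds: Optional[float] = None,
) -> bool:
    """True if the current exposure is within the DB-HA-LOSS level."""
    exposure = data_loss_exposure_seconds(replica_lag_seconds, archive_lag_seconds)
    return exposure <= DB_HA_LOSS_SECONDS


# Hypothetical lag readings in seconds.
print(meets_db_ha_loss({"replica-1": 5.0, "replica-2": 120.0}, archive_lag_seconds=90.0))
```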

To make it easier to find your way around, a list of useful or important links follows.

The following tools can be very helpful for database specialists:

  • Postgres Checkup: detailed report about the status of the PostgreSQL database.
  • Private Grafana: for both application and system level performance data.
  • Performance Bar: type pb in GitLab and a bar with performance metrics will show up at the top of the page. This tool is especially useful for viewing the queries executed and their timings.
  • Sherlock: a tool similar to the performance bar but meant for development environments. Sherlock is able to show backtraces and the output of EXPLAIN ANALYZE for executed queries. Enable by starting Rails with env ENABLE_SHERLOCK=1 bundle exec rails s.
  • https://explain.depesz.com/ for visualizing the output of EXPLAIN ANALYZE.

Dashboards

The following (private) Grafana dashboards are important or useful for database specialists:

Documentation

Basically everything under https://docs.gitlab.com/ee/development/#databases, but the following guides in particular are important:

For various other development related guides refer to https://docs.gitlab.com/ee/development/.


  1. From “Database Reliability Engineering”, O’Reilly Media, 2017


Database: Disaster Recovery

Purpose

This page contains an overview of the disaster recovery strategy we have in place for the PostgreSQL database. In this context, a disaster means losing the main database cluster or parts of it (a DROP DATABASE-type incident).

The overview here is not complete and is going to be extended soon.

We base our strategy on PostgreSQL’s Point-in-Time Recovery (PITR) feature.

This means we ship daily snapshots and transaction logs (WAL) to external storage (the archive). Given a snapshot, we can replay the WAL until a certain point in time is reached (for example, right before the disaster happened).
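The first step of such a recovery is choosing which snapshot to start from: the most recent one taken at or before the recovery target, with WAL replayed from there up to the target. A minimal sketch of that selection follows. The function name is hypothetical, and snapshot times are assumed to be the times at which each snapshot completed.

```python
from datetime import datetime
from typing import Sequence


def pick_base_backup(snapshot_times: Sequence[datetime], target: datetime) -> datetime:
    """Return the latest snapshot taken at or before the recovery target."""
    candidates = [t for t in snapshot_times if t <= target]
    if not candidates:
        raise ValueError("no snapshot precedes the recovery target")
    return max(candidates)


# Hypothetical daily snapshots and a recovery target right before a disaster.
snapshots = [
    datetime(2024, 10, 26, 0, 0),
    datetime(2024, 10, 27, 0, 0),
    datetime(2024, 10, 28, 0, 0),
]
target = datetime(2024, 10, 27, 15, 30)

base = pick_base_backup(snapshots, target)
print("restore snapshot from:", base)
print("replay WAL for:", target - base)
```

Choosing the latest eligible snapshot keeps the amount of WAL to replay, and therefore the recovery time, as small as possible.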

Last modified October 29, 2024: Fix broken links (455376ee)