Database Operations Team

Mission

The mission of the Database Operations team at GitLab is to Build, Run, Own and Evolve the entire lifecycle of the PostgreSQL database engine for GitLab.com.

The team is focused on owning the reliability, scalability, performance and security of the database engine and its supporting services. Where appropriate, the team should build its services on top of Production Engineering::Foundations services and cloud vendor managed products, to reduce complexity, improve efficiency and deliver new capabilities more quickly.

The team uses Engineering Principles to guide the decisions it makes for its services. The team does not explicitly have any self-hosted product responsibilities, but we should contribute the lessons we learn from running the database engine for GitLab at scale in production back to the Product, Development and Support teams to improve the overall customer experience with GitLab. We should also collaborate with the Support teams when self-managed customers encounter complex database engine issues.

Team Members

Name | Role
Rick Mar | Engineering Manager, Database Reliability
Alexander Sosna | Senior Database Reliability Engineer
Biren Shah | Senior Database Reliability Engineer
Jon Sisson | Senior Site Reliability Engineer
Rafael Henchen | Senior Database Reliability Engineer

Ownership

Services

Systems and services we are primarily responsible for:

  • PostgreSQL Core
  • PostgreSQL High Availability and Load Balancing (e.g. Patroni, PgBouncer, Consul, PostgreSQL replication, etc.)
  • PostgreSQL Disaster Recovery (backup/restore and other techniques)
  • Database Observability (Prometheus instrumentation, workload analysis etc.)
  • Support & troubleshooting of GitLab applications, specifically related to their use of and interaction with the PostgreSQL ecosystem.

Systems or services explicitly not owned by us:

System name | Description | Owner and supported by | Extra info/Open questions
Redis | There are several use cases such as caching, rate-limiting, Sidekiq queueing. | Scalability Group | Redis Architecture
ClickHouse | | |
Data team systems | | Data team |
Self-managed databases | | Self-managed Support |

Workflow | Issue Labels, Weekly Issue Triage
Backlog | Current Milestone, Issue Backlog
Reaching us | #g_infra_database_reliability, @gitlab-org/reliability/database
Weekly Agenda | Weekly APAC and EMEA/AMER
Achievements | FY24 - Q1

DBRE Escalations

We have a detailed DBRE escalation process that provides escalation guidelines for handling database related production incidents.

OKRs

We use quarterly Objectives and Key Results to plan and measure our Key Performance Indicators (KPIs).

Performance indicators

We measure the value we contribute by using performance indicator metrics.

In addition to the Infrastructure Department’s KPIs for availability and performance of GitLab.com, the Database Operations team tracks the following:

  • Backup and Recovery SLOs
  • General database availability (uptime)
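
As a rough illustration of how an availability (uptime) indicator like the one above can be computed, here is a minimal sketch. The function names, the 30-day window and the 99.9% target are hypothetical values chosen for the example; they are not the team's actual SLO definitions or tooling.

```python
from datetime import timedelta


def availability_percent(window: timedelta, downtime: timedelta) -> float:
    """Return uptime as a percentage of the measurement window."""
    if window <= timedelta(0):
        raise ValueError("window must be positive")
    if downtime < timedelta(0) or downtime > window:
        raise ValueError("downtime must be between zero and the window length")
    # Dividing one timedelta by another yields a plain float ratio.
    return 100.0 * (1 - downtime / window)


def meets_slo(window: timedelta, downtime: timedelta, target_percent: float) -> bool:
    """Check measured availability against a target (hypothetical) SLO."""
    return availability_percent(window, downtime) >= target_percent


if __name__ == "__main__":
    # Hypothetical numbers: a 30-day window with 43 minutes of downtime.
    window = timedelta(days=30)
    downtime = timedelta(minutes=43)
    print(f"{availability_percent(window, downtime):.3f}%")
    print(meets_slo(window, downtime, target_percent=99.9))
```

With these example numbers, 43 minutes of downtime in 30 days stays just within a 99.9% target, while 44 minutes would not.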

Key Technical Skills

The team is made up of DBREs with varying levels of expertise in:

  • Supporting PostgreSQL in large production environments.
  • Infrastructure automation and configuration management, using tools such as Chef, Ansible and Terraform.
  • PostgreSQL internals, tuning & optimization, SQL and PL/pgSQL.

To help you find your way around, useful and important links are listed below.

The following tools can be helpful:

  • Postgres Checkup: detailed report about the status of the PostgreSQL database.
  • Private Grafana: for both application and system level performance data.
  • Performance Bar: type pb in GitLab and a bar with performance metrics will show up at the top of the page. This tool is especially useful for viewing the queries executed and their timings.

Dashboards

The following (private) Grafana dashboards are important or useful for database specialists:

Documentation


DBRE Escalation Process
This page outlines the DBRE team escalation process and guidelines for developing the rotation schedule for handling infrastructure incident escalations.