Database Operations Team
Mission
The mission of the Database Operations team at GitLab is to Build, Run and Own the entire lifecycle of the PostgreSQL database engine for GitLab.com.
The team owns the reliability, scalability, performance & security of the database engine and its supporting services. Where appropriate, the team should build its services on top of Reliability::Foundations services and cloud vendor managed products to reduce complexity, improve efficiency and deliver new capabilities more quickly.
The team uses Engineering Principles to guide the decisions it makes for its services. The team does not explicitly have any self-hosted product responsibilities, but we should contribute the lessons we learn from running the database engine for GitLab at scale in production back to the Product, Development & Support teams to improve the overall customer experience with GitLab. We should also collaborate with the Support teams when self-managed customers encounter complex database engine issues.
Team Members
Ownership
Services
Systems and services we are primarily responsible for:
- PostgreSQL Core
- PostgreSQL High Availability and Load Balancing (e.g. Patroni, PgBouncer, Consul, PostgreSQL Replication etc.)
- PostgreSQL Disaster Recovery (backup/restore and other techniques)
- Database Observability (Prometheus instrumentation, workload analysis etc.)
- Support & troubleshooting of GitLab applications, specifically related to their use of and interaction with the PostgreSQL ecosystem.
Systems or services explicitly not owned by us:
System name | Description | Owner and supported by | Extra info/Open questions |
---|---|---|---|
Redis | Used for several purposes, such as caching, rate limiting and Sidekiq queueing. | Scalability Group | Redis Architecture |
ClickHouse | | | |
Data team systems | | Data team | |
Self Managed databases | | Self managed Support | |
Useful Links
Workflow | Issue Labels Weekly Issue Triage |
Backlog | Current Milestone Issue Backlog |
Reaching us | #g_infra_database_reliability @gitlab-org/reliability/database |
Weekly Agenda | Weekly APAC and EMEA/AMER |
Achievements | FY24 - Q1 |
DBRE Escalations
We have a detailed DBRE escalation process that provides escalation guidelines for handling database related production incidents.
OKRs
We use quarterly Objectives and Key Results to plan and measure our Key Performance Indicators (KPIs).
Performance indicators
We measure the value we contribute by using performance indicator metrics.
In addition to the Infrastructure Department’s KPIs for availability and performance of GitLab.com, the Database Operations team tracks the following:
- Backup and Recovery SLOs
- General database availability (uptime)
Key Technical Skills
The team consists of DBREs with varying levels of expertise in:
- Supporting PostgreSQL in large production environments.
- Infrastructure automation and configuration management, using tools such as Chef, Ansible, Terraform, etc.
- PostgreSQL internals, tuning & optimization, SQL and PL/pgSQL.
Common Links
To help you find your way around, useful and important links are listed below.
Monitoring & Performance Related Tools
The following tools can be helpful:
- Postgres Checkup: detailed report about the status of the PostgreSQL database.
- Private Grafana: for both application and system level performance data.
- Performance Bar: type `pb` in GitLab and a bar with performance metrics will show up at the top of the page. This tool is especially useful for viewing the queries executed and their timings.
Dashboards
The following (private) Grafana dashboards are important or useful for database specialists:
- PostgreSQL Overview
- Patroni Overview
- Patroni CI Overview
- PgBouncer Overview
- PgBouncer CI Overview
- GitLab Triage
- PostgreSQL Bloat
- PostgreSQL Disk IO
- Host stats
- Tuple Statistics
Documentation
- What requires downtime?
- Adding database indexes
- Post Deployment Migrations
- Background Migrations
- SQL Migration Style Guide
- SQL Query Guidelines
- Infrastructure runbooks and documentation