Database Platform

This page contains information related to upcoming products, features, and functionality. It is important to note that the information presented is for informational purposes only. Please do not rely on this information for purchasing or planning purposes. The development, release, and timing of any products, features, or functionality may be subject to change or delay and remain at the sole discretion of GitLab Inc.
Status Authors Coach DRIs Owning Stage Created
wip alexander-sosna andrewn devops data_stores 2024-02-06

Current GitLab.com Architecture

GitLab.com uses 4 PostgreSQL clusters running on virtual machines in GCP managed by Terraform, Chef and Ansible for complex tasks. With Cells, we want to create significant more than 4 PostgreSQL clusters, if we want 50 Cells, we’ll have to have 50+ PostgreSQL clusters.

@startuml
component "Rails"
component "Container Registry"

database "main"
database "ci"
database "registry"
database "embedding"

component "pgbouncer" as pgbouncermain
component "pgbouncer" as pgbouncerci
component "pgbouncer" as pgbouncerregistry
component "pgbouncer" as pgbouncerembedding

"Rails" --> pgbouncermain
pgbouncermain --> "main"

"Rails" --> pgbouncerci
pgbouncerci --> "ci"

"Rails" --> pgbouncerembedding
pgbouncerembedding --> "embedding"


"Container Registry" --> pgbouncerregistry
pgbouncerregistry --> "registry"


:User: --> "Rails"
:User: --> "Container Registry"
@enduml

We currently have 4 different database clusters, each hosting a single production database

Database Description Size Scaling headroom
Main Default location for all data Over 20TB Very little
CI All CI/CD related data Over 20TB Very little
Registry Container Registry data Less than 2TB Lots of headroom
Embedding Running pgvector to store AI training data Less than 2TB Lots of headroom

The data will be sharded by the organization_id, and an organization_id can live in 1 Cell.

During the Cells rollout most of the data will remain in the current database clusters, so it is important to accommodate for the requirements of this larger monolithic systems as well.

Future Architecture

For the upcoming Cells project the requirements are different. The current estimate is that each Cell will host a relatively small number of users compared to our current monolith, posing the new challenge to host many small database clusters and to create them on short notice

The exact requirements will develop over time, but the following is requested:

Our current automation involves many manual steps, including the creation of dedicated Chef roles for each cluster. The change process to create new cluster is the largest overhead, requiring multiple Merge Requests and Peer Review as well execution by Change Request. This is incompatible with the newly given requirements.

However, Cells will start iterative with one single Cell, needing one database cluster. The current platform can therefore be used at least for Cells 1.0, but could be kept until all data is stored in the new Cell local database clusters.

It is also possible to migrate the current data to a new database platform, before the full sharing across multiple smaller Cells. If this is desired, the requirements for the current monolith need to be meet by the target platform as well.

Requirements

Requirements will be split into what we currently need for the monolithic data stores for GitLab.com and what we estimate for the individual cells.

Requirements can be put into the following categories. They can be high, soft, low.

  • high, is needed to operate GitLab.com.
  • medium, is currently needed for operation, but could be replaced
  • low, nice to have

Universal Requirements

Requirement Description Priority Cloud SQL k8s operator Amazon RDS
Support GitLab Ability to support GitLab application with minimal, or no changes high
PostgreSQL Major Release Support for stable releases within 6 months, so development and infrastructure teams can work on integration. high ✅/?
PostgreSQL Major Release Support for stable releases within 3 months, so development and infrastructure teams can work on integration. medium ✅/? ✅/partial
PostgreSQL Patch Release Support for minor releases (bugfixes) within 7 days. high ✅/?
PostgreSQL Security Fixes Any security fix to follow out Security SLAs, critical within 24 hours. high
PostgreSQL Beta Release Support for current PostgreSQL beta release, so development and infrastructure teams can test early. low ✅/only preview env
Near Zero Downtime Upgrade Upgrades need to be possible with only seconds, not minutes or hours, of APDEX degradation. high ✅/? ✅/partial
HA solution High availability and failover automation, on-par or better than current Patroni solution. Switchover / Failover in seconds without human intervention and not allowing split-brain scenarios. high
Logging Integration Structured logging on Postgres available. high
Metrics Integration in our Prometheus / Grafana monitoring setup. high
PostgreSQL Extensions - Operation (high) The following extensions are currently necessary:
- pg_stat_statements
- pg_wait_sampling
- amcheck
- pgvector
- pg_trgm
- btree_gin
- btree_gist
- plpgsql
- pg_repack
high
PostgreSQL Extensions - Debug (medium) The following extensions are used for debugging: - pg_stat_kcache
- pgstattuple
- pageinspect
- pg_buffercache
medium ✅/not kcache
PostgreSQL Extensions - Migration / Sharding (low) The following extensions are not used yet, but might become important in the future:
- postgres_fdw
- file_fdw
low ✅/not file_fdw
Debug Tooling We currently use tools like strace to hook into PostgreSQL processes to find performance root causes. In a SaaS we would need to make sure the service provider is willing and able to do such analysis instead. medium
Backups with third-party tools Automated and manual base backups with third party tools such as wal-g, pgBackRest, with custom backup/archive repositories and retention policies. high
Disk based backup / restore Automated and manual fast backup like disk / volume snapshots, configurable retention policy, atomic snapshots or integration with pg_start_backup / pg_stop_backup to guarantee consistent data high
Incremental backups Customize incremental backup execution, which allow us to perform more frequent backups, which reduces RTO. medium
Backup Export Ability to export backups to generic storage e.g. GCS buckets. high ❌/only S3 .parquet
Local Streaming Replication Streaming Physical Replication to the same region is required to offload read-only workload and for horizontal scaling. high
Multi-Region Streaming Replication SR to remote locations, like multi-region, is needed for DR and future migrations. high
External Streaming Replication SR to an external PostgreSQL database outside of the production environment, a different account, or even a different Cloud provider, can be required for DR, testing and future migrations. medium
WAL Archiving Replication WAL Archiving Replication eliminates the impact of hot standby feedback and streaming replication, which is useful for analytics (eg. execution of long/slow queries and reports), query testing and performance debug high
WAL Delayed Archiving Replication WAL Delayed Archiving Replication allow to keep a Replica continuously recoverying in a previous point-in-time (eg. 8 hours ago), this provides a very quick Disaster Recovery method for issues such as human errors. medium
Logical Replication Logical Replication is needed for zero downtime upgrades, future migrations or any kind of infrastucture change that might not support PostgreSQL physical replication. high ✅ / ?
Read Load Distribution Standbys to distribute read load need to be deployable on short notice, to mitigate performance bottlenecks, a suitable automation is acceptable as well. high
Regional deployment We need to be able to define the region of individual standbys for DR requirements. low
Database Lab Integration Database Lab is used by our backend developers and needs to be integrated. low ✅ / ? ✅ / ?

Current Platform Requirements

Performance

Performance is a key factor. The first task is to define performance outlines and a reliable method to test if they are met.

{- TODO: Define performance requirements and test procedure -}

We observed the following load peaks and can use them as an indication:

Cells Requirements

Requirement Description Priority
API Provision a ready to receive traffic PostgreSQL cluster via an API. high
Manage 100+ We need to be able to efficiently manage 100+ clusters. high
Homogenous databases All the databases need to behave the same way, and we don’t have special snowfalkes for each deployment. medium

TODO: Define performance requirements and check with different steak holders. Discussed with @rnienaber, we will start with the largest reference architecture.

Decomposition

The application data for GitLab.com is currently decomposed into two separate database clusters, Main and CI. We are evaluating if we can further decompose the Main database with decomposing Secure and Govern related tables to a separate Postgres DB to gain more headroom and scalability for the current platform.

For Cells it is a design choice to scale horizontally by adding more Cells and to rebalance by moving organizations to less saturated cells. Cells should not be scaled vertically to a point where decomposition is reasonable. Therefore decomposition within Cells will not be needed and is also not supported by the current Dedicated / GET / Cells tooling. Adding support for Decomposition in GET and Dedicated in the future would be possible as a medium-complexity task.

In line with this, only a single database cluster is provisioned per Cell, capable of hosting all needed databases.

Overview - Possible Solutions

Non-PostgreSQL Offerings

We briefly looked into non-PostgreSQL, claiming certain levels of PostgreSQL compatibility, like AlloyDB. We came to the conclusion that cost and risk are not in any meaningful proportion to the expected benefit. Currently, there is no inherent shortcoming of PostgreSQL itself that justifies further investigation.

Further details below:

Cloud SQL

Cloud SQL is Google’s standard PostgreSQL offering. It is a custom fork, claimed to be 100% compatible to the upstream Releases and is simply referenced as PostgreSQL in the Cloud SQL documentation. GitLab currently recognizes Cloud SQL as a supported PostgreSQL implementation.

Pro Description Priority / Importance
DBaaS Minimal operation overhead, in theory most efficient. We only need to define what DB we need and get it provided as a Service. high
Outsourced We don’t need to maintain infrastructure beyond our hooks, interfaces, monitoring, alerting and forecast. high
Ops. Extensions The essential extensions necessary to run GitLab.com are available. high
API An API for all offered featured is already provided. high
Cons / Risks Description Priority / Importance
Product Lock-in Cloud SQL is a One-Way door decision. Currently, we can replicate our data to any given destination providing PostgreSQL’s Streaming Replication interface. GCP activity prohibits users from migrating away by multiple measures, e.g.: no base backup export, no WAL archiving, no Streaming Replication. A migration away will require significant downtime, unacceptable by our current availability goals for GitLab.com. high / blocker
Release Delay It is vital to get PostgreSQL releases in a reasonable time and having a reliable roadmap to as source of truth for planing. Google gives no guarantee, only an estimate to offer GA in 3 months time. This estimate is not correct for PG16 was released 2023-09-14 and is today 2024-02-16 not available yet. This is no isolated case, PG15 became available 2023-05-24, 7 months after the release. high / blocker
Bugfix Delay Fixes to critical bugs regarding data consistency or open vulnerabilities need to be available quickly, ideally within hours, at least in a few days time. Google gives no guarantee, only an estimate of 30 days. This estimate is not correct. 15.3 was released 2023-05-11 and is today 2024-02-16 not available yet, 281 days after the release. In the meantime the versions 15.4, 15.5 and 15.6 have been released and are also not available yet, leaving Cloud SQL 4 patch levels behind. high / blocker
No Zero Downtime Upgrade Upgrades need to be possible with only seconds, not minutes or hours, of APDEX degradation. We found inconsistent data for maintenance time, ranging from 10s to 30+ minutes and need to validate this claim with our data set. high / blocker
Base Backups We regularly export base backups for testing, data analysis, export to Database Labs for the Database Team, or disaster recovery preparation. Google currently denies export of backups to generic storage, e.g. GCS buckets. high / blocker
External Streaming Replication CloudSQL supports multi-region physical Replication, but it doesn’t support physical replication to external database. medium
WAL Archiving Delayed Replication CloudSQL doesn’t support recovery_min_apply_delay to configure delayed replication. But in case hot_standby_feedback is disabled and a lag is created due to a large query CloudSQL automatically move a Replica from Streaming to Archive replication. medium
Observability: DB We loose observability (eg; strace/ perf) by not having full access to the database machine. We need GCP to execute low level debugging, but currently have indication that this would happen in a timely manner. medium
Observability: Queries We get most of the observability we currently have, but we won’t have the ability to dig into the database internals like lock contention with Cloud SQL because we don’t have access to the database process. When we only have small instances e.g. 50k users this might be medium, but for larger instances could be high. medium
Escalation to database engineers We could not find a specific support path where we can reliably reach Cloud SQL experts in a timely manner. A Cloud SQL expert needs to be available within minutes to collaborate with us on S1 incidents. high

Most of the information above can be found in the official Cloud SQL documentation. Some estimates were given to us verbally by the GCP team, as seen in the meeting notes Discuss Cloud SQL and AlloyDB with GitLab (internal).

Things to validate

  • Could we use the offered logical replication feature (pglogical) for our migration needs? - estimate 2-4 weeks
  • Can we access WAL and base_backups, as it appears in the pitr documentation, in contrast to our meeting, where GCP denied it. - estimate < 1 week
  • How long does a major upgrade take for our 50k reference architecture? - estimate 4-5 weeks
  • Is Query Insights a sufficient replacement for the current observability tooling.
  • How long does it take to create a read-replica? How long does it take to create a new cluster from backup? 10GB, 100GB, 1TB - estimate 1 week

k8s Operator

There are currently multiple available operators that seam feasible, but detailed evaluation and benchmarking to ensure our requirements are met is critical. One feature that might be lacking is zero downtime upgrades. Currently, we maintain our own automation for this as well and could adapt it until an operator provides this feature.

Pro Description Priority / Importance
API An API for all offered featured is already provided. high
Reduced operation tooling By only supporting k8s, we remove the need to manage VM via Chef. We need to maintain less code and reduce complexity. High
Off-the-shelf automation Utilizing an off-the-shelf operator reduces the amount of automation code to maintain. We do not need to develop an API from scratch. high
Control over our data Observability and control are not given away. We can decide if and when we want to migrate to any other solution. We can decide on where our data is stored, level of redundancy air gab backups, ect. medium
Future proof When there is a better or more desirable solution available in the future we are not restricted in any way to migrate. high
No product lock-in We are not locked in to one product we can not leave in the future. medium
Debugging capability Compared to any SaaS offering we do not rely on a vendor to be willing and able to debug our problems in a timely manner. high
Good integration with Cells Compared to other self-hosted solutions, the database will run in the same k8s cluster as the rest of the workloads. This removes the need to integrate external components as well as multiple failure vectors. medium
Near Zero Downtime Upgrade We can adapt Gitlab’s (db-migration/pg-upgrade-logical Automation to achieve near-zero downtime for PostgreSQL MVU over k8s high / blocker
Cons / Risks Description Priority
Know how k8s is widely used among our infrastructure, but not within Database Reliability. We need to build up knowledge in the team. medium
Missing Features Missing features need to be implemented by us, e.g. not supported extension. medium

Things to validate

Amazon RDS PostgreSQL

Amazon Relational Database Services PostgreSQL is AWS’s managed database service offering fully compatible with PostgreSQL community version. In fact, Amazon only packs and deploys the PostgreSQL community binaries into the RDS instance underlying infrastructure. GitLab currently recognizes Amazon RDS PostgreSQL as a supported PostgreSQL implementation.

Pro Description Priority / Importance
DBaaS Minimal operation overhead, in theory. We only need to define what DB we need and get it provided as a Service. high
Outsourced We don’t need to maintain infrastructure beyond our hooks, interfaces, monitoring, alerting and forecast. high
Ops. Extensions The essential extensions necessary to run GitLab.com are available. high
API An API for all offered features is already provided. high
Gitlab Dedicated Compatibility Amazon RDS PostgreSQL is already used by our Gitlab Dedicated architecture, therefore is already integrated into Gitlab Dedicated-tooling ?
Less Risk of Vendor lock-in Currently, we are not fully locked into the RDS infrastructure. Migration is possible via PostgreSQL native logical replication, Export to S3, or even Amazon DMS, which supports migrating RDS databases to “on-premises” medium
Commitment to PostgreSQL development Amazon currently employs 9 committers and major contributors to PostgreSQL source code. In contrast, Microsoft employs 8 committers/major contributors and Google just 1. low
Bugfix/Patches within 7 days Proven history for minor releases (bugfixes) within 7 days. A few patches were released even in the same day. high
Release within 6 months Track record that PostgreSQL major versions always was released for RDS in less than 6 months, sometimes even in less than 3 months. high
Near-Zero Downtime Upgrade The default RDS option for MVU is based in pg_upgrade without logical replication, therefore it requires downtime of dozens of minutes. However, the downtime can be reduced (usually under 60 seconds) with the RDS Blue/Green deployment for database updates. high / blocker
Escalation to RDS expert engineers There’s no specific support path to reach an Internal RDS engineer in a timely manner. However, RDS support has the SME (subject matter experts) program, which is part of Amazon Support escalation and should be available within minutes to collaborate with us on S1 incidents. high
Support custom Backup Plans Amazon RDS supports custom backup plans and not only “daily” backups.
Cons / Risks Description Priority / Importance
Export of Backups with third-party tools We regularly export base backups for testing, data analysis, export to Database Labs for the Database Team, or disaster recovery preparation, but third-party backup tools are not supported by RDS. The only option to export data for external usage is Snapshot Expott into Apache Parquet format, which might not be supported by all consumers, or exporting pg_dump to S3, which is slow to process high / blocker
External Streaming Replication RDS supports cross-region physical Replication, but it doesn’t support physical replication to an external database. medium
WAL Archiving Delayed Replication RDS PostgreSQL doesn’t support recovery_min_apply_delay to configure delayed replication. medium
Debug capability We lose observability (eg; strace/ perf) by not having full access to the database machine, however, if absolutely necessary RDS internal engineers might be escalated to perform any low-level debugging. Currently, we have evidences that this would happen in a timely manner. medium
No direct access to WAL files We can’t export WAL files nor execute pg_waldump into an RDS instance. This sometimes was required to debug frequently modified objects and their respective content low

Things to validate

  • We need to validate how to export RDS Performance Insights metrics into our Monitoring tools
  • How to export PostgreSQL logs into Elastic
  • Validate a connection pooling solution for both Write and Read-Only workloads
  • Compare Migration options
    • Native logical replication
    • Amazon Database Migration Services
  • Evaluate time and impact of Major Version Upgrades 50k reference architecture?
    • Also evaluate Blue/Green deployments
  • How long does it take to create a read-replica, or a new cluster from a backup? 10GB, 100GB, 1TB
  • How long is the database service downtime with Blue/Green major version upgrade method?
    • We should also test if pgbouncer or RDS Proxy can hold requests to alleviate the impact during a Blue/Green deployment.
Evaluate changes over current Dedicated-RDS deployment
  • Cost of Enable Enhanced Monitoring with small granularity ( < 10 seconds)
  • Evaluate offload of read operations into Standby Replicas
  • Evaluate “Enable auto minor version upgrade”
  • Evaluate performance gain with “Dedicated Log Volume”
  • Increase logging level: slow queries, temp usage, autovacuum, lock waits, connections/disconnections, DDL statements
  • Configure pg_stat_statements settings
  • Load and Configure auto_explain
  • Implement “logical backup” solution
  • Review Alarms: RDS Event Subscriptions, AWS Cloud Watch

Refactor Current Automation

The current platform to host all PostgreSQL clusters for GitLab.com is based mainly on Terrafrom and Chef, managing GCP virtual machines. Some complex operations are orchestrated via Ansible.

The main problems currently with this architecture is the historical growth and technical debt. It was designed to host a single, or small fixed number of database clusters. Hosting any arbitrary number of clusters is theoretically possible, but requires manual steps and overhead.

These inefficiencies are incompatible with the estimated requirements of hosting many database clusters, e.g. >100.

Some examples:

  • For each database cluster we have to create multiple Chef roles
  • For each database cluster we have to manually define a network subnet and write it in a configuration file
  • By policy for each database cluster in production, we have an equivalent in staging

These shortcomings and bottlenecks could be removed by improving the automation with Chef or replacing it by a different tool set.

Pro Description Priority / Importance
Iteration, Low Risk We have an inefficient but fully working platform, therefore we can deploy enhancements iteratively, one step at a time. There is no “big bang” migration, no risk of not having certain features or to discover shortcomings of a new platform late in the process. high
Cons / Risks Description Priority / Importance
Chef Know-How Chef know-how and usage inside GitLab (and outside) is already decreasing. The current team does not have the needed know-how and capacity to implement the needed changes in a timely manner. Moving other SRE with Chef and domain knowledge into the DBRE team would be necessary, as well as GitLab’s commitment to Chef. high / blocker
Efficiency The ideal outcome will most likely still be not as efficient as a DBaaS solution. medium
Development / Maintenance We still have to develop and maintain the full automation code in-house. high / blocker
Integration Each Cell will run in a single k8s cluster, this solution will not be Cell local. This brings higher complexity and possible failure vectors. medium

Cells 1.0

Cells 1.0 is the first iteration for Cells, which should become GA during 2024. This first iteration will only host a limited number of test users and projects and therefore does not have the full requirements. It poses an opportunity to learn and iterate on the database platform.

During the development the number of Cells can grow from 1 to up to 10. The estimate is that it will take multiple month until a second Cell will be created, and a second cluster is needed.

Cloud SQL currently does not fulfill the given high priority requirements, and there is no roadmap which fully includes them as well. We will use Cloud SQL temporary for Cells 1.0 despite that to unblock the Cells 1.0 time-line. Therefor we acknowledge and agree to the associated risks and bottlenecks.

  • There will be downtime to migrate to the final platform!
  • The quality of service will not be equal to GitLab.com.
    • There is no clear SLO for deep level support.
    • No major version upgrades with reasonable downtime.
  • We can not grow reliable grow this solution to the target of 50k users, we see problems for 3k users regarding connection pooling already.

We prioritize finding a solution meeting the requirements, to make it available after Cells 1.0. The most promising candidates are the existing k8s PostgreSQL operators, offering PostgreSQL as a Service like experience.

Cells 1.x

As highlighted in Cells 1.0 we start with Cloud SQL as a temporary solution to free resource to build a sufficient platform for the long-term requirements.

The first phase will evaluate which k8s operators are available and which are the most fitting candidates for a full evaluation.

Current Database Platform

The current database platform is limited in future growth, but currently not at the edge of saturation. It is not clear when the database will become a bottleneck, but based on the prior experience from before decomposition a database with 200% size poses operational challenges.

There are currently no predictions how long it takes for our database to double in size, but it is fair to assume this will not happen in the next 24 month, until 2026.

This gives enough time to migrate significant data into cells, or to improve / change the database platform here as well.