Database Platform
Status | Authors | Coach | DRIs | Owning Stage | Created |
---|---|---|---|---|---|
wip | alexander-sosna | andrewn | | devops data_stores | 2024-02-06 |
Current GitLab.com Architecture
GitLab.com uses 4 PostgreSQL clusters running on virtual machines in GCP, managed by Terraform, Chef, and Ansible for complex tasks. With Cells, we want to create significantly more than 4 PostgreSQL clusters: if we want 50 Cells, we will need 50+ PostgreSQL clusters.
@startuml
component "Rails"
component "Container Registry"
database "main"
database "ci"
database "registry"
database "embedding"
component "pgbouncer" as pgbouncermain
component "pgbouncer" as pgbouncerci
component "pgbouncer" as pgbouncerregistry
component "pgbouncer" as pgbouncerembedding
"Rails" --> pgbouncermain
pgbouncermain --> "main"
"Rails" --> pgbouncerci
pgbouncerci --> "ci"
"Rails" --> pgbouncerembedding
pgbouncerembedding --> "embedding"
"Container Registry" --> pgbouncerregistry
pgbouncerregistry --> "registry"
:User: --> "Rails"
:User: --> "Container Registry"
@enduml
We currently have 4 different database clusters, each hosting a single production database:
Database | Description | Size | Scaling headroom |
---|---|---|---|
Main | Default location for all data | Over 20TB | Very little |
CI | All CI/CD related data | Over 20TB | Very little |
Registry | Container Registry data | Less than 2TB | Lots of headroom |
Embedding | Running pgvector to store AI training data | Less than 2TB | Lots of headroom |
The data will be sharded by the `organization_id`, and an `organization_id` can live in 1 Cell.
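To make this concrete, the following is a minimal, purely illustrative sketch of that model; the lookup tables, hostnames, and DSNs are invented for the example and do not describe any existing GitLab component:

```python
# Hypothetical sketch: resolving the database cluster of the Cell that owns a
# given organization_id. All names and DSNs below are illustrative placeholders.
from typing import Dict

# Illustrative routing table: organization_id -> Cell identifier.
ORGANIZATION_TO_CELL: Dict[int, str] = {
    1001: "cell-01",
    1002: "cell-02",
}

# In this model each Cell owns exactly one PostgreSQL cluster.
CELL_DB_DSN: Dict[str, str] = {
    "cell-01": "postgresql://gitlab@db.cell-01.internal:5432/gitlabhq_production",
    "cell-02": "postgresql://gitlab@db.cell-02.internal:5432/gitlabhq_production",
}

def dsn_for_organization(organization_id: int) -> str:
    """Return the DSN of the single Cell hosting this organization."""
    cell = ORGANIZATION_TO_CELL[organization_id]  # an organization lives in exactly 1 Cell
    return CELL_DB_DSN[cell]
```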
During the Cells rollout most of the data will remain in the current database clusters, so it is important to accommodate the requirements of this larger monolithic system as well.
Future Architecture
For the upcoming Cells project the requirements are different. The current estimate is that each Cell will host a relatively small number of users compared to our current monolith, posing the new challenge of hosting many small database clusters and creating them on short notice.
The exact requirements will develop over time, but the following is requested:
- Creation and removal of database clusters via API, which will be part of the provisioning of a Cell (see the illustrative sketch after this list).
- Database clusters up to the size of our 50,000-user reference architecture: API 1,000 RPS, Web 100 RPS, Git (Pull) 100 RPS, Git (Push) 20 RPS.
- Up to 2,000 Cells and database clusters.
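As an illustration of the first requirement, a provisioning call could look roughly like the sketch below. The endpoint, payload fields, and service name are hypothetical; the point is only that a cluster becomes an API resource created and deleted as part of Cell provisioning.

```python
# Hypothetical provisioning client; the URL, payload, and response fields are
# invented for illustration and do not describe an existing service.
import requests

PROVISIONING_API = "https://db-platform.internal/api/v1/clusters"  # placeholder

def create_cell_cluster(cell_id: str, size: str = "50k-reference") -> str:
    """Request a new PostgreSQL cluster for a Cell and return its identifier."""
    response = requests.post(
        PROVISIONING_API,
        json={
            "cell_id": cell_id,
            "size": size,                  # e.g. sized for the 50,000-user reference architecture
            "replicas": 2,                 # read standbys for load distribution / DR
            "backups": {"retention_days": 14},
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["cluster_id"]

def delete_cell_cluster(cluster_id: str) -> None:
    """Tear the cluster down when the Cell is decommissioned."""
    requests.delete(f"{PROVISIONING_API}/{cluster_id}", timeout=30).raise_for_status()
```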
Our current automation involves many manual steps, including the creation of dedicated Chef roles for each cluster. The change process to create a new cluster is the largest overhead, requiring multiple Merge Requests and peer review as well as execution via a Change Request. This is incompatible with the new requirements.
However, Cells will start iteratively with one single Cell, needing one database cluster. The current platform can therefore be used at least for Cells 1.0, and could be kept until all data is stored in the new Cell-local database clusters.
It is also possible to migrate the current data to a new database platform before fully sharding across multiple smaller Cells. If this is desired, the requirements of the current monolith need to be met by the target platform as well.
Requirements
Requirements will be split into what we currently need for the monolithic data stores for GitLab.com and what we estimate for the individual cells.
Requirements fall into the following priority categories: high, medium, or low.
- high: needed to operate GitLab.com.
- medium: currently needed for operation, but could be replaced.
- low: nice to have.
Universal Requirements
Requirement | Description | Priority | Cloud SQL | Crunchy K8s Operator | Amazon RDS |
---|---|---|---|---|---|
Support GitLab | Ability to support GitLab application with minimal, or no changes | high | ✅ | ✅ | ✅ |
PostgreSQL Major Release | Support for stable releases within 6 months, so development and infrastructure teams can work on integration. | high | ❌ | ✅ | ✅ |
PostgreSQL Major Release | Support for stable releases within 3 months, so development and infrastructure teams can work on integration. | medium | ❌ | ✅ | ✅/partial |
PostgreSQL Patch Release | Support for minor releases (bugfixes) within 7 days. | high | ❌ | ✅ | ✅ |
PostgreSQL Security Fixes | Any security fix must follow our Security SLAs, critical fixes within 24 hours. | high | ❌ | ✅/partial | ✅ |
PostgreSQL Beta Release | Support for current PostgreSQL beta release, so development and infrastructure teams can test early. | low | ❌ | ✅ | ✅/only preview env |
Near Zero Downtime Upgrade | Upgrades need to be possible with only seconds, not minutes or hours, of APDEX degradation. | high | ❌ | ✅/partial, requires engineering efforts | ✅/partial |
HA solution | High availability and failover automation, on-par or better than current Patroni solution. Switchover / Failover in seconds without human intervention and not allowing split-brain scenarios. | high | ✅ | ✅ | ✅ |
Logging Integration | Structured logging on Postgres available. | high | ✅ | ✅/Configurable & Sidecar option | ✅ |
Metrics | Integration in our Prometheus / Grafana monitoring setup. | high | ✅ | ✅ | ✅ |
PostgreSQL Extensions - Operation (high) | The following extensions are currently necessary: pg_stat_statements, pg_wait_sampling, amcheck, pgvector, pg_trgm, btree_gin, btree_gist, plpgsql, pg_repack | high | ✅ | ✅/not wait_sampling, repack; but option to rebuild image | ✅ |
PostgreSQL Extensions - Debug (medium) | The following extensions are used for debugging: pg_stat_kcache, pgstattuple, pageinspect, pg_buffercache | medium | ❌ | ✅/not kcache; but option to rebuild image | ✅/not kcache |
PostgreSQL Extensions - Migration / Sharding (low) | The following extensions are not used yet, but might become important in the future: postgres_fdw, file_fdw | low | ❌ | ✅ | ✅/not file_fdw |
Debug Tooling | We currently use tools like strace to hook into PostgreSQL processes to find performance root causes. In a SaaS we would need to make sure the service provider is willing and able to do such analysis instead. | medium | ❌ | ✅/On-demand package install or container image rebuild | ❌ |
Backups with third-party tools | Automated and manual base backups with third party tools such as wal-g, pgBackRest, with custom backup/archive repositories and retention policies. | high | ❌ | ✅ | ❌ |
Disk based backup / restore | Automated and manual fast backup like disk / volume snapshots, configurable retention policy, atomic snapshots or integration with pg_start_backup / pg_stop_backup to guarantee consistent data | high | ✅ | ✅/Clone-only, no multi-AZ support yet; on roadmap. | ✅ |
Incremental backups | Customizable incremental backup execution, which allows us to perform more frequent backups and reduces RTO. | medium | ✅ | ✅ | ✅ |
Backup Export | Ability to export backups to generic storage e.g. GCS buckets. | high | ❌ | ✅ | ❌/only S3 .parquet |
Local Streaming Replication | Streaming Physical Replication to the same region is required to offload read-only workload and for horizontal scaling. | high | ✅ | ✅ | ✅ |
Multi-Region Streaming Replication | SR to remote locations, like multi-region, is needed for DR and future migrations. | high | ✅ | ✅ | ✅ |
External Streaming Replication | SR to an external PostgreSQL database outside of the production environment, a different account, or even a different Cloud provider, can be required for DR, testing and future migrations. | medium | ❌ | ✅ | ❌ |
WAL Archiving Replication | WAL archiving replication eliminates the impact of hot standby feedback and streaming replication, which is useful for analytics (e.g. execution of long/slow queries and reports), query testing and performance debugging. | high | ✅ | ✅ | ✅ |
WAL Delayed Archiving Replication | Delayed WAL archiving replication allows keeping a replica continuously recovering at a previous point in time (e.g. 8 hours ago), which provides a very quick disaster recovery method for issues such as human error. | medium | ❌ | ✅ | ❌ |
Logical Replication | Logical replication is needed for zero downtime upgrades, future migrations or any kind of infrastructure change that might not support PostgreSQL physical replication. | high | ✅ / ? | ✅ | ✅ |
Read Load Distribution | Standbys to distribute read load need to be deployable on short notice to mitigate performance bottlenecks; a suitable automation is acceptable as well. | high | ✅ | ✅/use pgBackRest instead of volume snapshot | ✅ |
Regional deployment | We need to be able to define the region of individual standbys for DR requirements. | low | ✅ | ✅ | ✅ |
Database Lab Integration | Database Lab is used by our backend developers and needs to be integrated. | low | ✅ / ? | ✅ | ✅ / ? |
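One way to validate the extension-related rows above against any candidate platform is a small check like the following sketch. It assumes `psycopg2` and a DSN supplied by the caller; the extension list mirrors the "Operation (high)" row, and pgvector registers itself under the name `vector`.

```python
# Minimal sketch: verify that the required extensions are installable on a
# candidate PostgreSQL platform. Illustrative only, not production tooling.
import psycopg2

REQUIRED = {  # "Operation (high)" extensions from the table above
    "pg_stat_statements", "pg_wait_sampling", "amcheck",
    "vector",  # pgvector's extension name is "vector"
    "pg_trgm", "btree_gin", "btree_gist", "plpgsql", "pg_repack",
}

def missing_extensions(dsn: str) -> set:
    """Return the required extensions the server cannot provide."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT name FROM pg_available_extensions")
        available = {row[0] for row in cur.fetchall()}
    return REQUIRED - available

if __name__ == "__main__":
    # DSN is a placeholder for a candidate platform's endpoint.
    print(missing_extensions("postgresql://gitlab@candidate-host:5432/postgres"))
```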
Current Platform Requirements
Performance
Performance is a key factor. The first task is to define performance outlines and a reliable method to test if they are met.
{- TODO: Define performance requirements and test procedure -}
We observed the following load peaks and can use them as an indication:
- ~100k read-write TPS on the primaries
  - ~80k TPS on the Main primary
  - ~32k TPS on the CI primary
- ~1M read-only TPS distributed among the standbys
  - ~714k TPS on the Main standbys
  - ~150k TPS on the CI standbys
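Until a formal test procedure exists, a rough way to reproduce such TPS figures on a candidate platform is to sample the commit/rollback counters in `pg_stat_database` over an interval, as in this sketch (the DSN is a placeholder):

```python
# Rough TPS sampling via pg_stat_database deltas; a sketch for comparing
# candidate platforms against the peaks listed above, not a benchmark harness.
import time
import psycopg2

def sample_tps(dsn: str, interval_s: float = 10.0) -> float:
    """Approximate transactions per second over one sampling interval."""
    query = "SELECT sum(xact_commit + xact_rollback) FROM pg_stat_database"
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # separate transactions, so the stats view refreshes between reads
    try:
        with conn.cursor() as cur:
            cur.execute(query)
            start = cur.fetchone()[0]
            time.sleep(interval_s)
            cur.execute(query)
            end = cur.fetchone()[0]
    finally:
        conn.close()
    return (end - start) / interval_s

if __name__ == "__main__":
    print(f"~{sample_tps('postgresql://gitlab@db-main-primary:5432/postgres'):,.0f} TPS")
```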
Cells Requirements
Requirement | Description | Priority |
---|---|---|
API | Provision a PostgreSQL cluster, ready to receive traffic, via an API. | high |
Manage 100+ | We need to be able to efficiently manage 100+ clusters. | high |
Homogenous databases | All the databases need to behave the same way, so we don't have special snowflakes for each deployment. | medium |
TODO: Define performance requirements and check with different stakeholders. Discussed with @rnienaber; we will start with the largest reference architecture.
- Up to around 50,000 users (our largest reference architecture): API 1,000 RPS, Web 100 RPS, Git (Pull) 100 RPS, Git (Push) 20 RPS
- This might increase in the future, or multiple hundred Cells will be required to host all current users
Decomposition
The application data for GitLab.com is currently decomposed into two separate database clusters, Main and CI.
We are evaluating whether we can further decompose the Main database by moving Secure- and Software Supply Chain Security-related tables to a separate PostgreSQL database, to gain more headroom and scalability for the current platform.
For Cells it is a design choice to scale horizontally by adding more Cells and to rebalance by moving organizations to less saturated cells. Cells should not be scaled vertically to a point where decomposition is reasonable. Therefore decomposition within Cells will not be needed and is also not supported by the current Dedicated / GET / Cells tooling. Adding support for Decomposition in GET and Dedicated in the future would be possible as a medium-complexity task.
In line with this, only a single database cluster is provisioned per Cell, capable of hosting all needed databases.
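Illustratively, this means every logical database of a Cell points at the same physical cluster, in contrast to the per-database clusters of GitLab.com today. The sketch below is only a schematic; hostnames and database names are placeholders.

```python
# Illustrative only: one physical cluster per Cell hosting all logical databases.
# Host and database names are invented placeholders.
CELL_CLUSTER_HOST = "db.cell-01.internal"

DATABASES = {
    # logical database -> (host, dbname); all share the single Cell cluster
    "main":      (CELL_CLUSTER_HOST, "gitlabhq_production"),
    "ci":        (CELL_CLUSTER_HOST, "gitlabhq_production_ci"),
    "registry":  (CELL_CLUSTER_HOST, "registry"),
    "embedding": (CELL_CLUSTER_HOST, "embedding"),
}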
Overview - Possible Solutions
Non-PostgreSQL Offerings
We briefly looked into non-PostgreSQL offerings claiming certain levels of PostgreSQL compatibility, like AlloyDB. We came to the conclusion that cost and risk are not in any meaningful proportion to the expected benefit. Currently, there is no inherent shortcoming of PostgreSQL itself that justifies further investigation.
Further details below:
- Alternatives To Postgres DB (Google Spanner vs AlloyDB)
- Switching to a proprietary database, instead of PostgreSQL
Cloud SQL
Cloud SQL is Google’s standard PostgreSQL offering.
It is a custom fork, claimed to be 100% compatible with the upstream releases, and is simply referred to as PostgreSQL in the Cloud SQL documentation.
GitLab currently recognizes Cloud SQL as a supported PostgreSQL implementation.
Pro | Description | Priority / Importance |
---|---|---|
DBaaS | Minimal operation overhead, in theory most efficient. We only need to define what DB we need and get it provided as a Service. | high |
Outsourced | We don’t need to maintain infrastructure beyond our hooks, interfaces, monitoring, alerting and forecast. | high |
Ops. Extensions | The essential extensions necessary to run GitLab.com are available. | high |
API | An API for all offered features is already provided. | high |
Cons / Risks | Description | Priority / Importance |
---|---|---|
Product Lock-in | Cloud SQL is a one-way-door decision. Currently, we can replicate our data to any destination providing PostgreSQL's streaming replication interface. GCP actively prohibits users from migrating away through multiple measures, e.g. no base backup export, no WAL archiving, no streaming replication. A migration away would require significant downtime, which is unacceptable under our current availability goals for GitLab.com. | high / blocker |
Release Delay | It is vital to get PostgreSQL releases in a reasonable time and to have a reliable roadmap as a source of truth for planning. Google gives no guarantee, only an estimate of offering GA within 3 months. This estimate has not held: PG16 was released 2023-09-14 and as of 2024-02-16 is still not available. This is no isolated case; PG15 became available 2023-05-24, 7 months after its release. | high / blocker |
Bugfix Delay | Fixes to critical bugs regarding data consistency or open vulnerabilities need to be available quickly, ideally within hours, at most within a few days. Google gives no guarantee, only an estimate of 30 days. This estimate has not held: 15.3 was released 2023-05-11 and as of 2024-02-16 is still not available, 281 days after the release. In the meantime versions 15.4, 15.5 and 15.6 have been released and are also not available yet, leaving Cloud SQL 4 patch levels behind. | high / blocker |
No Zero Downtime Upgrade | Upgrades need to be possible with only seconds, not minutes or hours, of APDEX degradation. We found inconsistent data for maintenance time, ranging from 10s to 30+ minutes and need to validate this claim with our data set. | high / blocker |
Base Backups | We regularly export base backups for testing, data analysis, export to Database Labs for the Database Team, or disaster recovery preparation. Google currently denies export of backups to generic storage, e.g. GCS buckets. | high / blocker |
External Streaming Replication | Cloud SQL supports multi-region physical replication, but it doesn't support physical replication to an external database. | medium |
WAL Archiving Delayed Replication | Cloud SQL doesn't support recovery_min_apply_delay to configure delayed replication. However, if hot_standby_feedback is disabled and lag builds up due to a large query, Cloud SQL automatically moves a replica from streaming to archive replication. | medium |
Observability: DB | We lose observability (e.g. strace / perf) by not having full access to the database machine. We would need GCP to execute low-level debugging, but currently have no indication that this would happen in a timely manner. | medium |
Observability: Queries | We get most of the observability we currently have, but we won't have the ability to dig into database internals like lock contention with Cloud SQL because we don't have access to the database process. For small instances (e.g. 50k users) this might be medium, but for larger instances it could be high. | medium |
Escalation to database engineers | We could not find a specific support path where we can reliably reach Cloud SQL experts in a timely manner. A Cloud SQL expert needs to be available within minutes to collaborate with us on S1 incidents. | high |
Most of the information above can be found in the official Cloud SQL documentation. Some estimates were given to us verbally by the GCP team, as seen in the meeting notes Discuss Cloud SQL and AlloyDB with GitLab (internal).
Things to validate
- Could we use the offered logical replication feature (pglogical) for our migration needs? - estimate 2-4 weeks
- Can we access WAL and base backups, as the PITR documentation suggests, in contrast to our meeting where GCP denied it? - estimate < 1 week
- How long does a major upgrade take for our 50k reference architecture? - estimate 4-5 weeks
- Is Query Insights a sufficient replacement for the current observability tooling?
- How long does it take to create a read-replica? How long does it take to create a new cluster from backup? (10GB, 100GB, 1TB) - estimate 1 week
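For the replica/restore timing questions, a thin timing wrapper around whatever provisioning command the platform exposes is sufficient. In this sketch the actual command is a placeholder script, not a verified gcloud invocation:

```python
# Generic timing harness for the questions above; the provisioning command is a
# placeholder and must be replaced with the platform's real CLI/API call.
import subprocess
import time

def time_provisioning(command: list[str]) -> float:
    """Run a provisioning command and return wall-clock seconds until it exits."""
    start = time.monotonic()
    subprocess.run(command, check=True)
    return time.monotonic() - start

if __name__ == "__main__":
    for size in ("10GB", "100GB", "1TB"):
        seconds = time_provisioning(["./provision-replica.sh", size])  # placeholder script
        print(f"{size}: {seconds / 60:.1f} minutes")
```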
k8s Operator
There are currently multiple available operators that seem feasible, but detailed evaluation and benchmarking to ensure our requirements are met is critical. One feature that might be lacking is zero-downtime upgrades. Currently, we maintain our own automation for this as well and could adapt it until an operator provides this feature.
Pro | Description | Priority / Importance |
---|---|---|
API | An API for all offered features is already provided. | high |
Reduced operation tooling | By only supporting k8s, we remove the need to manage VMs via Chef. We need to maintain less code and reduce complexity. | high |
Off-the-shelf automation | Utilizing an off-the-shelf operator reduces the amount of automation code to maintain. We do not need to develop an API from scratch. | high |
Control over our data | Observability and control are not given away. We can decide if and when we want to migrate to any other solution. We can decide where our data is stored, the level of redundancy, air-gapped backups, etc. | medium |
Future proof | If a better or more desirable solution becomes available in the future, we are not restricted in any way from migrating. | high |
No product lock-in | We are not locked into one product that we cannot leave in the future. | medium |
Debugging capability | Compared to any SaaS offering we do not rely on a vendor to be willing and able to debug our problems in a timely manner. | high |
Good integration with Cells | Compared to other self-hosted solutions, the database will run in the same k8s cluster as the rest of the workloads. This removes the need to integrate external components as well as multiple failure vectors. | medium |
Near Zero Downtime Upgrade | We can adapt GitLab's db-migration/pg-upgrade-logical automation to achieve near-zero downtime for PostgreSQL MVU on k8s. | high / blocker |
Cons / Risks | Description | Priority |
---|---|---|
Know-how | k8s is widely used across our infrastructure, but not within Database Reliability. We need to build up knowledge in the team. | medium |
Missing Features | Missing features need to be implemented by us, e.g. unsupported extensions. | medium |
Things to validate
- Compare Candidates (4-5 weeks)
- CYBERTEC PostgreSQL Operator (Zalando-based, with commercial support)
- CloudNativePG
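To illustrate what "provisioning via API" looks like with an operator, the following sketch creates a CloudNativePG-style `Cluster` custom resource through the Kubernetes API. The spec follows CloudNativePG's documented basics but should be treated as an approximation; namespace, name, and storage values are placeholders.

```python
# Sketch: declarative cluster provisioning through a k8s operator (CloudNativePG
# used as the example). Namespace, name, sizes, and bucket are placeholders.
from kubernetes import client, config

cluster_manifest = {
    "apiVersion": "postgresql.cnpg.io/v1",
    "kind": "Cluster",
    "metadata": {"name": "cell-01-db", "namespace": "cell-01"},
    "spec": {
        "instances": 3,                       # 1 primary + 2 standbys for HA / read load
        "storage": {"size": "500Gi"},
        "backup": {"barmanObjectStore": {"destinationPath": "gs://example-bucket/cell-01"}},
    },
}

config.load_kube_config()  # or load_incluster_config() when running inside the Cell's cluster
client.CustomObjectsApi().create_namespaced_custom_object(
    group="postgresql.cnpg.io",
    version="v1",
    namespace="cell-01",
    plural="clusters",
    body=cluster_manifest,
)
```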
Amazon RDS PostgreSQL
Amazon Relational Database Service (RDS) for PostgreSQL is AWS's managed database service offering, fully compatible with the PostgreSQL community version. In fact, Amazon only packages and deploys the PostgreSQL community binaries onto the RDS instance's underlying infrastructure. GitLab currently recognizes Amazon RDS PostgreSQL as a supported PostgreSQL implementation.
Pro | Description | Priority / Importance |
---|---|---|
DBaaS | Minimal operation overhead, in theory. We only need to define what DB we need and get it provided as a Service. | high |
Outsourced | We don’t need to maintain infrastructure beyond our hooks, interfaces, monitoring, alerting and forecast. | high |
Ops. Extensions | The essential extensions necessary to run GitLab.com are available. | high |
API | An API for all offered features is already provided. | high |
GitLab Dedicated Compatibility | Amazon RDS PostgreSQL is already used by our GitLab Dedicated architecture and is therefore already integrated into the GitLab Dedicated tooling. | ? |
Less Risk of Vendor Lock-in | Currently, we are not fully locked into the RDS infrastructure. Migration is possible via PostgreSQL native logical replication, export to S3, or even Amazon DMS, which supports migrating RDS databases to "on-premises". | medium |
Commitment to PostgreSQL development | Amazon currently employs 9 committers and major contributors to PostgreSQL source code. In contrast, Microsoft employs 8 committers/major contributors and Google just 1. | low |
Bugfix/Patches within 7 days | Proven history of minor releases (bugfixes) within 7 days. A few patches were released even on the same day. | high |
Release within 6 months | Track record that PostgreSQL major versions have always been released for RDS in less than 6 months, sometimes even in less than 3 months. | high |
Near-Zero Downtime Upgrade | The default RDS option for MVU is based on pg_upgrade without logical replication and therefore requires tens of minutes of downtime. However, the downtime can be reduced (usually to under 60 seconds) with RDS Blue/Green deployments for database updates. | high / blocker |
Escalation to RDS expert engineers | There's no specific support path to reach an internal RDS engineer in a timely manner. However, RDS support has the SME (subject matter expert) program, which is part of the Amazon Support escalation path and should be available within minutes to collaborate with us on S1 incidents. | high |
Support for custom Backup Plans | Amazon RDS supports custom backup plans, not only "daily" backups. | |
Cons / Risks | Description | Priority / Importance |
---|---|---|
Export of Backups with third-party tools | We regularly export base backups for testing, data analysis, export to Database Labs for the Database Team, or disaster recovery preparation, but third-party backup tools are not supported by RDS. The only options to export data for external usage are Snapshot Export into the Apache Parquet format, which might not be supported by all consumers, or exporting a pg_dump to S3, which is slow to process. | high / blocker |
External Streaming Replication | RDS supports cross-region physical Replication, but it doesn’t support physical replication to an external database. | medium |
WAL Archiving Delayed Replication | RDS PostgreSQL doesn't support recovery_min_apply_delay to configure delayed replication. | medium |
Debug capability | We lose observability (e.g. strace / perf) by not having full access to the database machine; however, if absolutely necessary, RDS internal engineers might be escalated to perform any low-level debugging. Currently, we have evidence that this would happen in a timely manner. | medium |
No direct access to WAL files | We can't export WAL files nor execute pg_waldump on an RDS instance. This has sometimes been required to debug frequently modified objects and their respective content. | low |
Things to validate
- We need to validate how to export RDS Performance Insights metrics into our Monitoring tools
- How to export PostgreSQL logs into Elastic
- Validate a connection pooling solution for both Write and Read-Only workloads
- pgbouncer over VMs
- RDS Proxy
- Compare Migration options
- Native logical replication
- Amazon Database Migration Services
- Evaluate time and impact of Major Version Upgrades on the 50k reference architecture
- Also evaluate Blue/Green deployments
- How long does it take to create a read-replica, or a new cluster from a backup? (10GB, 100GB, 1TB)
- How long is the database service downtime with Blue/Green major version upgrade method?
- We should also test if pgbouncer or RDS Proxy can hold requests to alleviate the impact during a Blue/Green deployment.
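As a starting point for the pgbouncer test, the PAUSE/RESUME commands on the pgbouncer admin console can hold client requests across a switchover. This sketch assumes a reachable admin console and psycopg2; host, credentials, and the switchover call are placeholders.

```python
# Sketch: hold traffic in pgbouncer during a Blue/Green switchover using the
# admin console's PAUSE/RESUME commands. Connection details are placeholders.
import contextlib
import psycopg2

@contextlib.contextmanager
def paused_pgbouncer(admin_dsn: str, database: str):
    """Pause one pgbouncer database for the duration of the block, then resume."""
    conn = psycopg2.connect(admin_dsn)   # e.g. dbname=pgbouncer on the admin port
    conn.autocommit = True               # admin commands cannot run inside a transaction
    cur = conn.cursor()
    cur.execute(f"PAUSE {database}")     # waits for in-flight work to finish, then queues new clients
    try:
        yield
    finally:
        cur.execute(f"RESUME {database}")
        conn.close()

# Usage: wrap the switchover so clients queue instead of erroring.
# with paused_pgbouncer("dbname=pgbouncer user=pgbouncer host=pooler-01", "gitlabhq_production"):
#     trigger_blue_green_switchover()    # placeholder for the actual RDS operation
```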
Evaluate changes over current Dedicated-RDS deployment
- Cost of enabling Enhanced Monitoring with small granularity (< 10 seconds)
- Evaluate offload of read operations into Standby Replicas
- Evaluate “Enable auto minor version upgrade”
- Evaluate performance gain with “Dedicated Log Volume”
- Increase logging level: slow queries, temp usage, autovacuum, lock waits, connections/disconnections, DDL statements
- Configure `pg_stat_statements` settings
- Load and configure `auto_explain`
- Implement “logical backup” solution
- Review Alarms: RDS Event Subscriptions, AWS Cloud Watch
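Several of the items above (logging level, `pg_stat_statements`, `auto_explain`) boil down to parameter-group changes. The following boto3 sketch shows that shape; the parameter group name and values are placeholders, and static parameters such as `shared_preload_libraries` only take effect after a reboot.

```python
# Sketch: apply the logging / pg_stat_statements / auto_explain settings from the
# list above to an RDS parameter group. Group name and values are placeholders.
import boto3

rds = boto3.client("rds")

PARAMETERS = [
    {"ParameterName": "log_min_duration_statement", "ParameterValue": "1000", "ApplyMethod": "immediate"},
    {"ParameterName": "log_autovacuum_min_duration", "ParameterValue": "0", "ApplyMethod": "immediate"},
    {"ParameterName": "log_lock_waits", "ParameterValue": "1", "ApplyMethod": "immediate"},
    {"ParameterName": "log_temp_files", "ParameterValue": "0", "ApplyMethod": "immediate"},
    {"ParameterName": "pg_stat_statements.track", "ParameterValue": "all", "ApplyMethod": "immediate"},
    {"ParameterName": "auto_explain.log_min_duration", "ParameterValue": "5000", "ApplyMethod": "immediate"},
    # Static parameter: requires a reboot to take effect.
    {"ParameterName": "shared_preload_libraries", "ParameterValue": "pg_stat_statements,auto_explain", "ApplyMethod": "pending-reboot"},
]

rds.modify_db_parameter_group(
    DBParameterGroupName="cell-postgres-params",  # placeholder parameter group name
    Parameters=PARAMETERS,
)
```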
Refactor Current Automation
The current platform hosting all PostgreSQL clusters for GitLab.com is based mainly on Terraform and Chef, managing GCP virtual machines. Some complex operations are orchestrated via Ansible.
The main problems with this architecture are historical growth and technical debt. It was designed to host a single, or a small fixed number of, database clusters. Hosting an arbitrary number of clusters is theoretically possible, but requires manual steps and overhead.
These inefficiencies are incompatible with the estimated requirements of hosting many database clusters, e.g. >100.
Some examples:
- For each database cluster we have to create multiple Chef roles
- For each database cluster we have to manually define a network subnet and write it in a configuration file
- By policy for each database cluster in production, we have an equivalent in staging
These shortcomings and bottlenecks could be removed by improving the automation with Chef or replacing it with a different tool set.
Pro | Description | Priority / Importance |
---|---|---|
Iteration, Low Risk | We have an inefficient but fully working platform, therefore we can deploy enhancements iteratively, one step at a time. There is no “big bang” migration, no risk of not having certain features or to discover shortcomings of a new platform late in the process. | high |
Cons / Risks | Description | Priority / Importance |
---|---|---|
Chef Know-How | Chef know-how and usage inside GitLab (and outside) is already decreasing. The current team does not have the needed know-how and capacity to implement the required changes in a timely manner. Moving other SREs with Chef and domain knowledge into the DBRE team would be necessary, as well as a commitment from GitLab to Chef. | high / blocker |
Efficiency | The ideal outcome will most likely still not be as efficient as a DBaaS solution. | medium |
Development / Maintenance | We still have to develop and maintain the full automation code in-house. | high / blocker |
Integration | Each Cell will run in a single k8s cluster, and this solution will not be Cell-local. This brings higher complexity and possible failure vectors. | medium |
Cells 1.0
Cells 1.0 is the first iteration for Cells, which should become GA during 2024. This first iteration will only host a limited number of test users and projects and therefore does not have the full requirements. It poses an opportunity to learn and iterate on the database platform.
During development, the number of Cells can grow from 1 up to 10.
The estimate is that it will take multiple months until a second Cell is created and a second cluster is needed.
Cloud SQL currently does not fulfill the given high-priority requirements, and there is no roadmap that fully includes them. We will nevertheless use Cloud SQL temporarily for Cells 1.0 to unblock the Cells 1.0 timeline. In doing so we acknowledge and accept the associated risks and bottlenecks:
- There will be downtime to migrate to the final platform!
- The quality of service will not be equal to GitLab.com.
- There is no clear SLO for deep level support.
- No major version upgrades with reasonable downtime.
- We cannot reliably grow this solution to the target of 50k users; we already see problems with connection pooling at 3k users.
We prioritize finding a solution that meets the requirements and making it available after Cells 1.0. The most promising candidates are the existing k8s PostgreSQL operators, which offer a PostgreSQL-as-a-Service-like experience.
Cells 1.x
As highlighted in Cells 1.0, we start with Cloud SQL as a temporary solution to free up resources to build a sufficient platform for the long-term requirements.
The first phase will evaluate which k8s operators are available and which are the most fitting candidates for a full evaluation.
Current Database Platform
The current database platform is limited in future growth, but is currently not at the edge of saturation. It is not clear when the database will become a bottleneck, but based on prior experience from before decomposition, a database at 200% of the current size poses operational challenges.
There are currently no predictions of how long it will take for our database to double in size, but it is fair to assume this will not happen in the next 24 months, i.e. before 2026.
This gives enough time to migrate significant data into cells, or to improve / change the database platform here as well.