Secure Govern Decomposition Working Group
Attributes
Property | Value |
---|---|
Date Created | 1 May 2024 |
Start Date | 13 May 2024 |
End Date | |
Slack | #wg_secure-govern-database-decomposition (only accessible from within the company) |
Google Doc | Working Group Agenda (only accessible from within the company) |
Issue Board | Epic Dashboard list |
Meeting Cadence | Weekly on Mondays. Recorded. EMEA and APAC options. |
Exit Criteria
The charter of this working group is to:
- Successfully decompose the Secure/Govern datasets to a separate `gitlab_sec` database in order to reduce pressure on the primary GitLab.com DB and assist with future scalability and stability concerns.
- Consider the timing, scope, and impact of the decomposition related to prioritization and implementation of additional efforts to support GitLab.com DB performance and optimization for related tables - OKR (GitLab internal)
- Evaluate the impact of the decomposition on Self-Managed instances regarding feature parity, performance/hardware requirements, improvements for different sizes of DBs, and admins' effort to support it.
- Provide an effective migration guide and/or tooling to assist Self-Managed instances in the decomposition of their local CI and Secure/Govern databases in alignment with GitLab.com
Objectives
Key results we’d like to achieve within the scope of the working group to ensure the most desirable outcome.
Objective | Notes | Achieved On |
---|---|---|
To the best of our ability, ensure implementation does not result in increased costs or burden for self-managed users, similar to the CI decomposition outcome. | ||
Modifications have been signed off on by Reference Architecture and appropriately documented. | Raise an issue in the Reference Architecture tracker to gather needed advice on an ad hoc basis. | |
Minimise disruption to GitLab.com in the process of decomposition. | This may be unavoidable if we opt to use Physical Replication, which will require all traffic to the database to be ceased prior to cutover. | |
Glossary
Preferred Term | What Do We Mean | Terms Not To Use | Examples |
---|---|---|---|
Cluster | A database cluster is a collection of interconnected database instances that replicate data. | | The PostgreSQL cluster of GitLab.com (managed by Patroni) that hosts the main logical database and consists of the primary database instance along with its read-only replicas. |
Decomposition | Feature-owned database tables are on many logical databases on multiple database instances. In terms of GitLab.com, our desired decomposition outcome includes the separation of these instances to different database servers as well. The application manages various operations (ID generation, rebalancing, etc.). | Y-Axis, Vertical Sharding | All Secure/Govern tables in separate logical database from Core tables. Design illustration |
Instance | A database instance is comprised of related processes running in the database server. Each instance runs its own set of database processes. | Physical Database | |
Logical database | A logical database groups database objects logically, like schemas and tables. It is available within a database instance and independent of other logical databases. | Database | GitLab’s rails database. |
Node | Equivalent to a Database Server in the context of this working group. | Physical Database | |
Replication | Replication of data with no bias. | X-Axis, Cloning | What we do with our database clusters to enable splitting read traffic apart from write traffic. |
WAL (Write Ahead Log) | Write ahead logs are the mechanism by which Postgres records inserted data. WAL records are then processed to modify the stored dataset in a separate process. These logs can be replicated. | ||
Logical Replication | Replication of data using the built-in Postgres replication processes to transfer WAL via a PUB-SUB model | ||
Physical Replication | Replication of data by copying the actual files on the written disk to a new Physical Database. | |
Application Replication | Replication of data to a separate database by the configuration of replication routines in GitLab itself. | ||
DB Schema | A SQL database schema is a namespace that contains named database objects such as tables, views, indexes, data types, functions, stored procedures and operators, see docs | ||
GitLab DB Schema | An application-level table classification schema that abstracts away the underlying database connection, see docs | ||
Server | A database server is a physical or virtual system running an operating system that is running one or more database instances. | Physical Database | |
Table | A database table is a collection of tuples having a common data structure (the same number of attributes, in the same order, having the same name and type per position) (source) | ||
Table Partitioning | A table that contains a part of the data of a partitioned table (horizontal slice). (source) | Partition | |
Dataset | A set of tables and their contained data within a logical database. | | The Secure/Govern Dataset includes all tables related to GitLab’s security features, including but not limited to vulnerability and dependency tracking. |
Featureset | A set of features associated with some kind of concept within GitLab for ease of reference. | | Core, Secure/Govern |
Core | Referred to in terms of Dataset or Featureset, this is information or functionality related to standard GitLab operations, such as Projects, Namespaces, Users and others. | ||
Secure/Govern | Referred to in terms of Dataset or Featureset, this is information or functionality related to GitLab’s security and compliance features, such as Vulnerabilities, Dependencies (SBOM), Security Findings, Policies and more. |
Overview
There is high impetus within GitLab to reduce pressure on the primary GitLab database server. The Database and Scalability teams have been taking a variety of steps to mitigate the ongoing pressure on the database server to maintain the growth and stability of GitLab in the long term. One such endeavour is Cells; however, there is a desire to provide further mitigation in the short to medium term. Decomposition of the Secure/Govern dataset from the primary database was identified as a strong possible solution, similar to how the CI decomposition aided in this regard in the past.
Decomposition of the Secure/Govern dataset is a significant engineering effort due to the magnitude of the data interactions related to these features. The domain accounts for 25% of all database write traffic, and is only set to grow as we expand our feature set and grow our customer base. Further statistics and technical details can be found on the associated epic.
As this has become a scalability and stability concern for all of GitLab.com, and significantly constrains the ability of the Secure/Govern stages to implement new features due to continuously growing performance concerns, an organised effort is necessary to deliver this project effectively.
We have the benefit of being able to lean heavily on the prior art and experience of the database-scalability working group, which decomposed the CI database to achieve this goal. However, some key challenges we may face are the scale of the existing Secure/Govern codebase and the need to maintain ongoing operations with no (or minimal) disruption to our customer base. A full GitLab.com downtime is heavily disfavoured due to our uptime SLA agreements with customers, but the scale of our operations may mean that some processes for this kind of decomposition are not feasible.
Benefits
- Reduce write pressure on the GitLab.com primary Write database in advance of Cells 1.5
- Improve stability of GitLab operations, by isolating the primary database from Secure/Govern feature pressure
- General performance improvement for both the Core and Secure/Govern feature sets due to separation of concerns.
- Improve iteration speed of Secure/Govern feature development without significant concern for compromising stability of the platform.
Risks
- Significant developer commitment for a currently unknown duration.
- Increased database maintenance requirements for the new decomposed database and its associated replicas.
- Possibly unable to be delivered prior to the delivery of Cells 2.0
- May require a full downtime of GitLab.com, which may be difficult to arrange with our customers.
Interdependencies
Secure/Govern data has a high degree of integration with CI and standard GitLab data, such as Users, Projects and Namespaces. The past CI decomposition successfully delinked query interdependency of the associated CI dataset; however, significant effort will be necessary to do the same between the core GitLab dataset and Govern/Secure functionality.
Timeline
The group has determined that a gradual rollout is not advisable since it doesn’t relieve pressure on the main database cluster and increases the amount of work to be done. As such, all slices will be rolled out simultaneously to GitLab.com.
Progress
```mermaid
gantt
    dateFormat YYYY-MM-DD
    title 50% confidence timeline
    section Work
    Decompose tables :active, decompose, 2024-07-01, 2024-12-09
    Slice 1 :active, slice1, 2024-07-23, 2024-12-09
    Slice 2 :active, slice2, 2024-08-06, 2024-10-04
    Slice 3 :active, slice3, 2024-07-15, 2024-12-09
    Table decomposition complete :milestone, allslices, after slice1 slice2 slice3, 0d
    Phase 1 & 2 :phase12, 2024-09-11, 7w
    Phase 3 :phase3, after phase12, 3w
    Phase 4 :phase4, after allslices phase3 decompose, 3w
    Phase 5 :phase5, after phase4, 3w
    Phase 6 :phase6, after phase4, 3w
    Phase 7 :phase7, after phase6, 4w
    Rollout complete :milestone, rollout, after phase7, 0d
    axisFormat %Y-%m
```
Decomposition
Slice | % Done | Estimated completion |
---|---|---|
Slice 1 | 87% | 2024-12 |
Slice 2 | 100% | 2024-10 |
Slice 3 | 65% | 2024-12 |
Last update: 2024-09-18.
Plan
- Introduce separate `gitlab_sec` schema
- Introduce `gitlab_sec` database connection (defaulting to fall back to using the `gitlab_main` database)
- In parallel, begin decomposition of foreign keys and cross-database transactions following the loose order of SBOM, Security, and Vulnerability code boundaries. For each slice perform the following breakdown:
  - Migrate tables with low referentiality (few foreign keys)
  - Migrate tables with higher referentiality (many foreign keys)
  - Identify and allowlist cross-joins to be addressed
  - Identify and allowlist cross-database transactions to be addressed
  - Remove previously identified cross-join and cross-database transaction allowances
- Await results of the Logical Replication production test to determine the viability of this as a migration path.
- Depending on the results of the production test, formulate a path for the safe migration of the Secure/Govern dataset to a new physical database. This may take the form of one of the proposals below.
- Open a Change Request to migrate tables using either (A) a phased approach mirroring the code boundary slices above or (B) a single replication event for all tables in scope of decomposition
- Update documentation around migrating self-managed instances to multiple databases
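The fallback behaviour described in the plan — use the dedicated `gitlab_sec` connection when it is configured, otherwise share `gitlab_main` — can be sketched in plain Ruby. The class and connection names here are illustrative, not GitLab's actual implementation:

```ruby
# Illustrative sketch of a connection resolver that prefers a dedicated
# gitlab_sec connection but falls back to gitlab_main when no separate
# database has been configured (e.g. on a single-database Self-Managed instance).
class SecConnectionResolver
  # configs: hash of connection name => connection settings (or absent)
  def initialize(configs)
    @configs = configs
  end

  # Returns the name of the connection that gitlab_sec tables should use.
  def connection_for_sec
    if @configs[:gitlab_sec]
      :gitlab_sec
    else
      # No dedicated database configured: share the main connection.
      :gitlab_main
    end
  end
end

# Decomposed setup: gitlab_sec resolves to its own connection.
decomposed = SecConnectionResolver.new(
  gitlab_main: { host: "db-main" },
  gitlab_sec: { host: "db-sec" }
)
puts decomposed.connection_for_sec # => gitlab_sec

# Single-database setup: gitlab_sec falls back to gitlab_main.
combined = SecConnectionResolver.new(gitlab_main: { host: "db-main" })
puts combined.connection_for_sec # => gitlab_main
```

In the real application this decision is made at the ActiveRecord connection layer, but the shape of the fallback is the same: the schema split is introduced first, and the physical split only when a second database exists.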
Migration Proposal A: Logical Replication
- Research and test the possibility of a staged logical replication in which we migrate small subsets of the Secure/Govern featureset at a time, such as SBOM.
- If a staged rollout is possible
- Identify the highest value feature subset to decompose
- Plan a decomposition strategy to separate only that feature to achieve a production benefit sooner.
- Establish the decomposed database instance
- Begin replicating the Secure/Govern data to the new database instance
- Write the necessary code to enable GitLab.com to begin utilising the new instance generically, and for the chosen feature subset.
- As this is a potentially risky operation, ensure production snapshots are ready and that customers are sufficiently informed of potential problems or data loss in the event of failure.
- Begin testing transition of the feature to using the new database instance as its new write primary.
- If successful, globally rollout usage of the decomposed database for the feature subset.
- Repeat for each sufficiently sectionable feature subset until decomposition is completed.
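The staged branch above amounts to routing writes per feature subset: a subset writes to the decomposed database only once its rollout has been enabled. A minimal sketch of that routing decision (the class and subset names are hypothetical; GitLab would drive this with feature flags):

```ruby
# Illustrative per-subset write routing for a staged logical-replication
# rollout: subsets that have completed cutover write to gitlab_sec, all
# others continue writing to gitlab_main.
class StagedRouter
  # enabled_subsets: feature subsets already cut over to the new database
  def initialize(enabled_subsets)
    @enabled_subsets = enabled_subsets
  end

  def write_target(subset)
    @enabled_subsets.include?(subset) ? :gitlab_sec : :gitlab_main
  end
end

# SBOM has been cut over first; vulnerabilities still write to the main DB.
router = StagedRouter.new([:sbom])
router.write_target(:sbom)            # => :gitlab_sec
router.write_target(:vulnerabilities) # => :gitlab_main
```

Repeating the cutover then reduces to growing the enabled set one subset at a time until it covers the whole Secure/Govern featureset.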
- If a staged rollout is not possible
- Establish the decomposed database instance
- Begin replicating the full Secure/Govern data to the new database instance
- Write the necessary code to enable GitLab.com to begin utilising the new instance generically and for all Secure/Govern features.
- As this is a potentially risky operation, ensure production snapshots are ready and that customers are sufficiently informed of potential problems or data loss in the event of failure.
- Begin testing transition of the Secure/Govern featureset to using the new database instance as its new write primary.
- If successful, globally rollout usage of the decomposed database for the full featureset.
- Cleanup legacy data from the GitLab core database.
Migration Proposal B: Physical Replication
- Determine acceptability of a full downtime for GitLab, or a temporary suspension of use for the entire Secure/Govern featureset to prevent data loss. (Alternatively, notify users that there will be data loss related to this featureset after a certain date and time.)
- Begin communicating with customers ahead of time to minimise dissatisfaction as a result of this disruption.
- Establish the decomposed database instance
- Write the necessary code to enable GitLab.com to begin utilising the new instance generically and for all Secure/Govern features.
- Begin testing transition of the Secure/Govern featureset to using the new database instance as its new write primary.
- Take GitLab down so that write traffic stops.
- Wait for replication to catch up on the node before promoting it to be the new leader of a new Secure DB cluster. Configure GitLab to write to this new Secure DB cluster.
- Globally rollout usage of the decomposed database for the full featureset.
- Cleanup legacy Secure/Govern data from the GitLab Core database.
- Cleanup legacy Core data from the new Secure/Govern database.
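The promotion gate in the steps above — stop write traffic, then promote the replica only once replication has fully caught up — can be sketched as follows. The lag readings and helper names are illustrative; real tooling would poll `pg_stat_replication` on the primary:

```ruby
# Illustrative cutover gate for physical replication: after write traffic
# stops, the replica is only promoted once its replication lag reaches zero.
MAX_LAG_BYTES = 0 # once writes have ceased, the replica should fully catch up

def safe_to_promote?(lag_bytes)
  lag_bytes <= MAX_LAG_BYTES
end

# lag_readings simulates successive lag measurements (in bytes) taken after
# write traffic has ceased; returns the index of the first safe reading,
# or nil if the replica never caught up within the observed readings.
def await_catchup(lag_readings)
  lag_readings.each_with_index do |lag, attempt|
    return attempt if safe_to_promote?(lag)
  end
  nil
end

await_catchup([4096, 512, 0]) # => 2 (safe to promote on the third reading)
await_catchup([4096, 2048])   # => nil (keep waiting)
```

The downtime requirement follows directly from this gate: lag can only reach zero and stay there if no new WAL is being generated, which is why write traffic must be ceased before cutover.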
Migration Proposal C: Application Replication
- As a staged rollout is possible, identify the highest value feature subset to decompose.
- Plan a decomposition strategy to separate only that feature to achieve a production benefit sooner.
- Establish the decomposed database instance
- Write the necessary code to sync all possible data changes relating to the chosen feature subset to the new database instance from wherever they may occur in the application.
- Begin replicating the Secure/Govern data to the new database instance for the chosen feature subset.
- Write the necessary code to enable GitLab.com to begin utilising the new instance generically, and for the chosen feature subset.
- As this is a potentially risky operation, ensure production snapshots are ready and that customers are sufficiently informed of potential problems or data loss in the event of transition failure, as some data may not be able to be synced back to the Core database.
- Begin testing transition of the feature to using the new database instance as its new write primary.
- If successful, globally rollout usage of the decomposed database for the feature.
- Repeat for each sufficiently sectionable feature subset until decomposition is completed.
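The syncing step in Proposal C is effectively an application-level dual write: every change to an in-scope table lands in the legacy database and is mirrored to the new one by GitLab itself. A minimal sketch (class and table names are hypothetical, and in-memory hashes stand in for the two databases):

```ruby
# Illustrative application-level replication: writes always land in the
# legacy store, and rows for tables in scope of the decomposition are
# mirrored to the new store by the application itself.
class DualWriter
  attr_reader :legacy, :decomposed

  # synced_tables: tables in scope of the Secure/Govern decomposition
  def initialize(synced_tables)
    @synced_tables = synced_tables
    @legacy = Hash.new { |h, k| h[k] = [] }
    @decomposed = Hash.new { |h, k| h[k] = [] }
  end

  def write(table, row)
    @legacy[table] << row
    # Mirror only in-scope tables to the decomposed database.
    @decomposed[table] << row if @synced_tables.include?(table)
  end
end

writer = DualWriter.new([:sbom_components])
writer.write(:sbom_components, { name: "openssl" })
writer.write(:projects, { name: "gitlab" })
writer.decomposed[:sbom_components].size # => 1
writer.decomposed[:projects].size        # => 0
```

The cost of this approach, as the proposal notes, is that every write path in the application must be instrumented, and writes made after cutover may not be transferable back to the Core database if the transition fails.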
Roles and Responsibilities
Working Group Role | Name | Title |
---|---|---|
Executive Stakeholder | Jerome Ng | Engineering Director, Expansion |
Functional Lead | Gregory Havenga | Senior Backend Engineer, Govern: Threat Insights |
Functional Lead | Lucas Charles | Principal Software Engineer, Secure & Govern |
Facilitator AMER | Neil McCorrison | Manager, Software Engineering |
Facilitator APAC | Thiago Figueiró | Manager, Software Engineering |
Member | Fabien Catteau | Staff Engineer, Secure: Composition Analysis |
Member | Arpit Gogia | Backend Engineer, Secure: Dynamic Analysis |
Member | Schmil Monderer | Staff Backend Engineer, Secure: Static Analysis |
Member | Ethan Urie | Staff Backend Engineer, Secure: Secret Detection |
Member | Jon Jenkins | Senior Backend Engineer, Database |
Member | Ved Prakash | Staff Data Engineer, Data Science |
Member | Dylan Griffith | Principal Engineer, Create |
Member | Thong Kuah | Principal Engineer, Data Stores |
Member | Rick Mar | Manager, Core Infrastructure |
Related Performance Projects
- Tuple Reduction
- Brian Williams (DRI)
- Fabien Catteau
- Michael Becker
- Vulnerability Management Application Limits and Vulnerability Management Retention Policy
- Mehmet Emin Inaç (DRI)
- Joey Khabie
- Cells 1.0
- Subashis Chakraborty (DRI)
Useful References
Reference | Description |
---|---|
Link | Proposal for support levels for multiple databases in GitLab deployment architecture. |
Link | Epic dashboard for tracking outstanding work towards completion of decomposition |
Thanks
Much information, inspiration, and experience has been drawn from the database-scalability working group, which accomplished the successful decomposition of the CI database.