Sharding Working Group

The initial focus of this Sharding working group was to increase the scalability of our database with a long-term goal of 100x scalability.

Attributes

Property | Value
Date Created | February 11, 2020
End Date | June 22, 2020
Slack | #wg_database-sharding (only accessible from within the company)
Google Doc | Sharding Working Group Agenda (only accessible from within the company)
Recordings | Sharding Working Group Playlist

Outcome - Closed

We have decided to close this Sharding-focused working group and will open a Scaling Working Group with a different focus. The initial focus of this Sharding working group was to increase the scalability of our database with a long-term goal of 100x scalability. At the onset of this group, it was theorized that we would hit a database scalability wall within 6-12 months. Subsequent analysis and incremental scalability efforts have indicated that we have significantly more scaling headroom. Based on the Database Capacity and Saturation Analysis (Iteration 1), we have a high degree of confidence that the current architecture can handle our needs for the next 12 months. This analysis will continue on a monthly basis. We have also identified areas of incremental database scalability that the database team has prioritized: Reduce total size and growth of GitLab.com’s PostgreSQL database. Between the ongoing analysis and the incremental database improvements, we have greatly reduced the urgency of database scalability.

Additionally, we have come to the consensus that sharding is not the desired approach for our long-term scalability needs. This decision was informed by investigation, proofs of concept, research, interviews, and various implementation proposals. Here’s a brief list of items that helped to inform our decision to close this working group:

The core members of this working group will continue with the Scaling Working Group to determine our long-term scaling strategy and implementation. The rest of this working group page will remain for reference purposes.


Business Goals

A scalability approach that will give us 100x headroom over what we have now on GitLab.com. Additionally, the ability to isolate customer data is an influencing factor on the design and implementation.

Background

At the onset of this working group, anecdotal information indicated that we were going to “hit a wall” on scaling our database to support our projected customer growth. Early estimates had us hitting a scaling wall anywhere from 6 to 12 months down the road. This estimate has since been revised to a rolling 12-month window based on the Database Capacity and Saturation Analysis (Iteration 1). Database sharding was proposed as a solution to improve our scalability while simultaneously improving performance. We have since expanded our discussions beyond database sharding alone. Any solution, even one using database sharding technology, will require significant application changes as well.

The goal of customer isolation serves multiple purposes. Isolation of customer data would likely include distributing data across multiple servers. This level of distribution would improve availability by removing the single point of failure of our single-database architecture. Additionally, we are hearing more requests from customers to provide a solution that better separates customer data.

Areas of Investigation

In support of our business goals of scalability and customer isolation we’ve identified the following areas of investigation.

Namespace Sharding

Details can be found in the Postgres Sharding (&1854) epic. This area of investigation focuses on sharding at the top-level namespace. The initial investigations were database-centric, focusing on sharding the tables. Our investigations have indicated the following:

Tenant Sharding

A proposal titled Tenant Sharding was recently introduced. Instead of sharding by the namespace, we introduce a higher-level entity, the tenant. By introducing the tenant entity, we turn GitLab.com into a multi-tenant SaaS platform, following the model of other multi-tenant SaaS applications. Well-known examples include Slack, PagerDuty, and Datadog. Each of these examples offers its users a scoped, isolated tenancy.
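The difference between the two sharding keys can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the shard count, the use of a stable hash to pick a shard, the `TENANT_OF_NAMESPACE` mapping, and the example paths are hypothetical and are not taken from either proposal.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count, chosen only for illustration

# Hypothetical mapping from top-level namespace to its owning tenant.
TENANT_OF_NAMESPACE = {
    "acme-web": "acme",
    "acme-mobile": "acme",
    "example-labs": "example",
}


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a sharding key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def top_level_namespace(project_path: str) -> str:
    """Return the first path segment, e.g. 'acme-web' for 'acme-web/site'."""
    return project_path.split("/", 1)[0]


def namespace_shard(project_path: str) -> int:
    """Namespace sharding: the key is the top-level namespace."""
    return shard_for(top_level_namespace(project_path))


def tenant_shard(project_path: str) -> int:
    """Tenant sharding: the key is the tenant that owns the namespace."""
    tenant = TENANT_OF_NAMESPACE[top_level_namespace(project_path)]
    return shard_for(tenant)
```

In this sketch, `acme-web/site` and `acme-mobile/app` always land on the same shard under `tenant_shard` because they share the tenant `acme`. Under `namespace_shard` they are keyed independently and may land on different shards.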

Incremental Scalability Improvements

In parallel with the sharding investigation, the database team continues to look for areas of incremental database scalability improvements. Those efforts are being tracked under these issues/epics:

Database Partitioning Implementations

Partitioning is an important subject to cover separately from sharding. If we ultimately decide that database sharding is the chosen solution to achieve our business objectives, then database partitioning is the foundation upon which database sharding is built in PostgreSQL. Even if we don’t use it for sharding, partitioning directly improves query performance and is therefore a great tool to use on its own. Our first iteration of database partitioning will be implemented on audit events. We expect that the implementation of partitioning will result in performance improvements and in tooling (e.g. migrations) that subsequent partitioning and sharding implementations can reuse.
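As a rough illustration of why partitioning can help query performance, the following Python sketch models a table split into monthly partitions. It is a conceptual model only, not PostgreSQL code, and the choice of a monthly range on a creation date, the class name, and the sample events are assumptions for illustration rather than details of the audit events implementation.

```python
from collections import defaultdict
from datetime import date


def partition_key(created_at: date) -> str:
    """Name the monthly partition a row belongs to, e.g. '2020_03'."""
    return f"{created_at.year:04d}_{created_at.month:02d}"


class PartitionedAuditEvents:
    """Toy model of a table range-partitioned by month (hypothetical)."""

    def __init__(self) -> None:
        self.partitions = defaultdict(list)

    def insert(self, created_at: date, payload: str) -> None:
        # Each row is routed to exactly one partition by its date.
        self.partitions[partition_key(created_at)].append((created_at, payload))

    def events_in_month(self, year: int, month: int) -> list:
        # Only the matching partition is read; rows in other months
        # are never scanned. This mimics partition pruning.
        return list(self.partitions.get(f"{year:04d}_{month:02d}", []))


events = PartitionedAuditEvents()
events.insert(date(2020, 3, 2), "user added to group")
events.insert(date(2020, 3, 18), "project visibility changed")
events.insert(date(2020, 4, 1), "user removed from group")
march_events = events.events_in_month(2020, 3)
```

A query restricted to one month touches one partition regardless of how many other months exist, which is the property that makes partitioning useful independently of sharding.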

Investigation Summary

The two sharding approaches, Namespace and Tenant, are being evaluated. They are competing approaches, but both aim to achieve the same business goals. We are still working through the potential first iteration and the implementation details of each approach. In both cases we will need to identify and quantify the changes required at the database and application levels.

While we continue to investigate Namespace vs. Tenant sharding, we can continue with the Incremental Scalability Improvements and Database Partitioning Implementation and realize immediate performance and scalability improvements.

Exit Criteria

Roles and Responsibilities

Working Group Role | Person | Title
Executive Stakeholder | Christopher Lefelhocz | VP of Development
Facilitator | Craig Gomes | Engineering Manager, Database
DRI for Sharding Working Group | Craig Gomes | Engineering Manager, Database
Functional Lead | Nailia Iskhakova | Software Engineer in Test, Database
Functional Lead | Josh Lambert | Group Manager, Product Management, Enablement
Functional Lead | Gerardo “Gerir” Lopez-Fernandez | Engineering Fellow, Infrastructure
Functional Lead | Stan Hu | Engineering Fellow, Development
Functional Lead | Andreas Brandl | Staff Backend Engineer, Database
Member | Chun Du | Director of Engineering, Enablement
Member | Pat Bair | Senior Backend Engineer, Database
Member | Tanya Pazitny | Quality Engineering Manager, Enablement
Member | Mek Stittri | Director of Quality Engineering

Meeting Recap

The agenda doc can be found in our Google Drive by searching for “Sharding Working Group Agenda”.