Cells

This is the handbook page for the Cells project. Cells is one of the top priorities for FY2025, with the goal of providing additional scalability for GitLab.com. This handbook page contains the project information such as the project plan, roadmap, workstreams, DRIs, stakeholders, and communication channels. It also has links to important documentation such as the Cells design blueprints.

Intro

Cells is a new architecture for our software as a service platform. This architecture is horizontally scalable, resilient, and provides a more consistent user experience. It may also provide additional features in the future, such as data residency control (regions) and federated features.

For more information about the goals of Cells, see goals.

Requirements and Architecture

Cells overall architecture blueprint.

Roadmap, Stages, Phases, and DRIs

Roadmap

Cells 1.0

  • For internal customers only
  • Organizations are private
  • Users cannot interact with other Organizations (including GitLab Org)
  • Groups and projects are private in the Organization
  • For more details, see Organizations on Cells 1.0

Cells 1.5

  • For existing/new customers of GitLab.com
  • Organizations are private
  • Existing users can interact with private Organizations on Secondary Cells
  • Groups and projects are private in the Organization
  • For more details, see Organizations on Cells 1.5

Cells 2.0

  • Organizations are public or private
  • Users can interact with other Organizations
  • Groups and projects are private or public in the Organization
  • For more details, see Organizations on Cells 2.0

DRIs and Stakeholders

Role Responsibility

Sabrina Farmer

Executive Sponsor

Marin Jankovski

Senior Director of Engineering

Chun Du

Director of Engineering
  1. Liaison between project team and cross-functional engineering leaders
  2. Coordinating temporary staffing arrangements within the Data Stores stage

Nick Nguyen

Senior Engineering Manager
  1. Coordinating staffing and unblocking groups in Data Stores
  2. Drive cross-functional efforts in engineering
  3. Report on Data Stores progress and mitigate risks

Joshua Lambert

Director of Product Management
  1. Investment and staffing of Core Platform teams
  2. Liaison between project team and cross-functional product managers and product leaders
  3. Escalation of product priorities competing with Cells
  4. Decision maker for supported and unsupported features for each iteration of Cells

Christina Lohr

Tenant Scale Product Manager
  1. Product definition, requirements, roadmap for Organization workstream within Tenant Scale
  2. Product definition, requirements, roadmap for Cells workstreams within Tenant Scale
  3. Point of contact to collaborate with product managers from other teams
  4. Investment and staffing of Tenant Scale

Darby Frey

Staff Fullstack Engineer, Expansion

DRI of Expansion Software Development

Kerri Miller

Staff Backend Engineer, Core Development

DRI of Core Development

Cells 1.0

All Cells 1.0 work is tracked under the Cells 1.0 Epic. The Epic is split into multiple phases, each representing an iteration towards Cells 1.0. Some of these phases depend on one another, and some can run in parallel.

Phase 1: PreQA Cell

Exit Criteria:

  • New GCP organizations created.
  • Break glass procedure.
  • Ring definition exists.
  • Cell provisioned using dedicated stack.
  • Able to make configuration changes to the Cell.
  • Cell available at xxx.cells.gitlab.com.
  • Cell doesn’t handle data uniqueness.

Diagram: phase-1 (source)

Unblocks:

  • Phase 3: To provision runway deployment for Topology Service
  • Delivery team: Start testing deploys on rings

Dependencies:

  • None

Details:

Phase 2: GitLab.com HTTPS Passthrough Proxy

Exit Criteria:

  • 100% of API traffic goes through router using passthrough proxy rule.
  • 100% of Web traffic goes through router using passthrough proxy rule.
  • 100% of Git HTTPS traffic goes through router using passthrough proxy rule.
  • Requests meet the latency target.
  • registry.gitlab.com not proxied.
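The passthrough behavior in the exit criteria above can be sketched as follows. This is a minimal illustration, not the router's actual implementation: the `Request` type, the upstream label `legacy-cell`, and the path conventions used to tell API, Git HTTPS, and Web traffic apart are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical names: the handbook does not define the router's rule format.
LEGACY_UPSTREAM = "legacy-cell"              # assumed label for the existing GitLab.com backend
NOT_PROXIED_HOSTS = {"registry.gitlab.com"}  # from the exit criteria: the registry is not proxied


@dataclass(frozen=True)
class Request:
    host: str
    path: str


def classify_traffic(request: Request) -> str:
    """Label a request as api, git_https, or web (assumed path conventions)."""
    if request.path.startswith("/api/"):
        return "api"
    if ".git/" in request.path or request.path.endswith(".git"):
        return "git_https"
    return "web"


def route(request: Request) -> Optional[str]:
    """Return the upstream for a request, or None if the router is not in its path."""
    if request.host in NOT_PROXIED_HOSTS:
        return None
    # Passthrough rule: every traffic class is forwarded unchanged to one upstream.
    return LEGACY_UPSTREAM
```

In this sketch the traffic class does not affect the routing decision; it would only matter for measuring that 100% of each class goes through the router.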

Diagram: phase-2 (source)

Unblocks:

  • Phase 3: Router to be configured with additional rules in phase 3.

Dependencies:

  • None

Details:

Phase 3: GitLab.com HTTPS Session Routing

Exit Criteria:

  • PreQA Cell configured to generate _gitlab_session with prefix using rails config.
  • Route _gitlab_session with matching prefix to PreQA Cell using TopologyService::Classify (REST only) with static config file.
  • Continuous Delivery on Ring 0, with no rollback capabilities, that doesn’t block production deployments.
  • Topology Service Readiness Review for Experiment.
  • Topology Service gRPC endpoint not implemented.
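Prefix-based session routing, as described in the exit criteria above, might look like the sketch below. The prefix value `cell-preqa-`, the cell names, and the shape of the static config are assumptions; the handbook only states that `_gitlab_session` carries a prefix and that classification uses a static config file.

```python
# Assumed static config: session prefix -> cell name. The real prefix
# format and config file layout are not specified in this page.
STATIC_CLASSIFY_CONFIG = {"cell-preqa-": "preqa-cell"}
DEFAULT_CELL = "legacy-cell"
SESSION_COOKIE = "_gitlab_session"


def parse_cookies(header: str) -> dict:
    """Parse a Cookie header into a name -> value mapping."""
    cookies = {}
    for part in header.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep:
            cookies[name] = value
    return cookies


def classify_session(cookie_header: str) -> str:
    """Pick a cell from the session cookie prefix, defaulting to the legacy cell."""
    session = parse_cookies(cookie_header).get(SESSION_COOKIE, "")
    for prefix, cell in STATIC_CLASSIFY_CONFIG.items():
        if session.startswith(prefix):
            return cell
    return DEFAULT_CELL
```

Requests without a session cookie, or with an unrecognized prefix, fall back to the default cell, so existing traffic is unaffected.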

Unblocks:

Before/After:

Diagram: phase-3 (source)

Dependencies:

  • Phase 2: Passthrough proxy needs to be deployed.
  • Phase 1: GCP organizations, Ring definition exists.

Details:

Phase 4: GitLab.com HTTPS Token Routing

Exit Criteria:

  • Framework to generate routable tokens in Rails.
  • Framework to classify routable tokens in HTTP Router.
  • Topology Service being able to classify based on more criteria.
  • Route Personal Access Tokens to different Cells using TopologyService::Classify.
  • Support the PRIVATE-TOKEN: and Authorization: HTTP headers for Personal Access Tokens; create issues for other headers, to be solved in following phases.
  • Each routing rule added should be covered with relevant e2e tests.
  • Route Job Tokens and Runner Registration to different Cells using TopologyService::Classify.
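A rough sketch of header-based token routing follows. The token layout `rt.<cell_id>.<secret>` and the cell table are purely hypothetical and chosen for illustration; the real routable token format is not described on this page. Only the two header names come from the exit criteria above.

```python
from typing import Mapping, Optional

# Hypothetical cell table and token layout ("rt.<cell_id>.<secret>").
CELLS = {"1": "legacy-cell", "2": "preqa-cell"}
DEFAULT_CELL = "legacy-cell"


def extract_token(headers: Mapping[str, str]) -> Optional[str]:
    """Read a token from PRIVATE-TOKEN or from an Authorization: Bearer header."""
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    if "private-token" in normalized:
        return normalized["private-token"]
    scheme, _, credentials = normalized.get("authorization", "").partition(" ")
    if scheme.lower() == "bearer" and credentials:
        return credentials.strip()
    return None


def classify_token(token: Optional[str]) -> str:
    """Map a routable token to a cell; anything unrecognized goes to the default cell."""
    if not token:
        return DEFAULT_CELL
    parts = token.split(".")
    if len(parts) == 3 and parts[0] == "rt":
        return CELLS.get(parts[1], DEFAULT_CELL)
    return DEFAULT_CELL
```

Falling back to the default cell for tokens that do not match the routable layout keeps non-routable tokens working while routing is rolled out.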

Dependencies:

  • Phase 3: Topology Service and Router need to be running in production.

Before/After:

Diagram: phase-4 (source)

Details:

Phase 5: Cluster Awareness

Exit Criteria:

  • Topology Service Production Readiness Review for Beta.
  • Framework to claim resources globally using TopologyService::Claims, storing them in Google Spanner.
  • The following resources are claimable: Username, E-Mail, Top-level Group Name, Routes.
  • All resources that need to be claimed are identified.
  • Lease a sequence to a Cell using TopologyService::Sequence.
  • Rails application able to send requests to TopologyService using internal network.
  • mTLS communication between TopologyService and HTTP Router.
  • mTLS communication between TopologyService and Rails.
  • mTLS communication between HTTP Router and Cell.
  • PreQA Cell can start claiming resources, still detached from Legacy Cell.
  • Claims done by PreQA Cell will be deleted.
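The idea behind global claims can be illustrated with an in-memory stand-in. This is a sketch only: `ClaimStore` and its methods are hypothetical names, a dictionary replaces Google Spanner, and treating values as case-insensitive is an assumption. `release_cell` mirrors the criterion that claims made by the PreQA Cell are deleted.

```python
from typing import Dict, Optional, Tuple


class ClaimConflict(Exception):
    """Raised when another cell already owns the claimed value."""


class ClaimStore:
    """In-memory stand-in for a globally consistent claims table."""

    # Resource kinds listed as claimable in the exit criteria.
    CLAIMABLE = {"username", "email", "top_level_group_name", "route"}

    def __init__(self) -> None:
        self._owners: Dict[Tuple[str, str], str] = {}

    def claim(self, kind: str, value: str, cell: str) -> None:
        """Claim a value for a cell; re-claiming by the same cell is a no-op."""
        if kind not in self.CLAIMABLE:
            raise ValueError(f"{kind!r} is not a claimable resource")
        key = (kind, value.lower())  # assumption: uniqueness is case-insensitive
        owner = self._owners.setdefault(key, cell)
        if owner != cell:
            raise ClaimConflict(f"{kind} {value!r} is already claimed by {owner}")

    def owner(self, kind: str, value: str) -> Optional[str]:
        return self._owners.get((kind, value.lower()))

    def release_cell(self, cell: str) -> int:
        """Delete every claim held by a cell and return how many were removed."""
        keys = [key for key, owner in self._owners.items() if owner == cell]
        for key in keys:
            del self._owners[key]
        return len(keys)
```

A real store would need the check-and-insert in `claim` to be a single transaction; the dictionary `setdefault` call plays that role here.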

Dependencies:

  • Phase 3: Topology Service Deployed.

Before/After:

Diagram: phase-5 (source)

Details:

Phase 6: Monolith Cell

Exit Criteria:

  • Topology Service Production Readiness GA.
  • Legacy Cell configured as a Cell in TopologyService.
  • All new resources in Legacy Cell are claimed using TopologyService::Claims.
  • Legacy Cell claimed all existing resources.
  • Sequence leased to Legacy Cell.
  • Capacity Planning for sequences leased.
  • Latency increase for creating globally unique resources is at most 20 ms.
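Sequence leasing and its capacity planning can be pictured as handing out non-overlapping ID ranges. The sketch below is an assumption about how such an allocator could work; `SequenceAllocator`, the fixed block size, and `remaining_fraction` are hypothetical and not taken from TopologyService::Sequence.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Lease:
    cell: str
    start: int  # first ID in the range, inclusive
    end: int    # last ID in the range, inclusive


@dataclass
class SequenceAllocator:
    """Hands out consecutive, non-overlapping ID ranges of a fixed size."""

    block_size: int
    next_start: int = 1
    leases: List[Lease] = field(default_factory=list)

    def lease(self, cell: str) -> Lease:
        new_lease = Lease(cell, self.next_start, self.next_start + self.block_size - 1)
        self.next_start = new_lease.end + 1
        self.leases.append(new_lease)
        return new_lease


def remaining_fraction(lease: Lease, last_used_id: int) -> float:
    """Capacity-planning helper: share of the leased range that is still unused."""
    total = lease.end - lease.start + 1
    used = max(0, min(last_used_id, lease.end) - lease.start + 1)
    return (total - used) / total
```

Because each cell draws IDs only from its own range, cells can create records without coordinating on every insert; the remaining fraction indicates when a new lease should be requested.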

Dependencies:

Before/After:

Diagram: phase-6 (source)

Details:

Phase 7: Cell Initialization

Exit Criteria:

  • TBD

Before/After:

Details:

Phase 8: Organization Onboarding

Exit Criteria:

  • TBD

Before/After:

Details:

Phase 10: Production Readiness

Exit Criteria:

  • Cell-Level Observability (Logs, Metrics, Alerts, Dashboard).
  • Integration with existing Incident Management tooling.
  • Compliance with GitLab.com security standards.
  • Regional and Zonal Disaster Recovery capabilities.
  • Operational tooling independence from GitLab.com/dev.gitlab.org availability.
  • Centralized WAF management for GitLab.com domain.
  • Cell-level Application Rate Limits with synchronization.
  • Least-privileged access implementation with SRE escalation path.
  • Progressive rollout of infrastructure changes across Cells with rollback support.
  • Progressive deployment capabilities across Legacy Cell and Cells with rollback support.
  • Support for toggling Feature Flags across Legacy Cell and Cells.
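Progressive rollout with rollback support, as listed in the exit criteria above, can be sketched as a loop over rings. The function and its callbacks are hypothetical; the page does not describe the actual deployment tooling, and rolling back every completed ring on failure is one possible policy among several.

```python
from typing import Callable, List, Sequence


def progressive_rollout(
    rings: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Deploy ring by ring; on an unhealthy ring, roll back everything deployed so far.

    Returns the rings that completed, or an empty list if the rollout was rolled back.
    """
    completed: List[str] = []
    for ring in rings:
        deploy(ring)
        if not healthy(ring):
            # Undo in reverse order, starting with the ring that just failed.
            for done in reversed(completed + [ring]):
                rollback(done)
            return []
        completed.append(ring)
    return completed
```

Stopping at the first unhealthy ring limits the blast radius of a bad change to the rings that were already updated.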

Dependencies:

  • Phase 1: GCP organizations, Ring definition exists.

Before/After:

Details:

Communication

Slack Channels

Meetings

Status updates

Additional Information

Cells Fast Boot 2024

We held a Cells Fast Boot in Dublin, Ireland, between 2024-04-23 and 2024-04-24. Below are the artifacts from the event.

Agenda, Slides, and Videos

Please use the Unfiltered Google account to watch video recordings.

  1. Main agenda (internal only)
  2. Introductions, overview, and logistics: Agenda (internal only)
  3. Cells Services - Global Service: Agenda (internal only), Slides (internal only), Video (internal only)
  4. Cells Services - Routing: Agenda (internal only), Slides (internal only), Video (internal only)
  5. Application Readiness - Organizations and Users: Agenda (internal only)
  6. Application Readiness - Dependencies and OKR alignments: Agenda (internal only)
  7. Deployment: Agenda (internal only), Slides (internal only), Video (internal only)
  8. Provisioning: Agenda (internal only)
  9. Observability and Runners: Agenda (internal only)
  10. Security: Agenda (internal only), Slides (internal only), Video (internal only)
  11. Disaster Recovery: Agenda (internal only), Slides (internal only), Video (internal only)
  12. Cells Mover and Isolation: Agenda (internal only)
  13. Scalability Headroom and Timeline: Agenda (internal only)

Decisions

  1. No external customers on Cells 1.0, internal dogfooding only. Cells 1.x is the target to onboard new or existing external customers.

Artifacts

  1. Day 1 recording: Part 1 (internal only), Part 2 (internal only)
  2. Day 2 recording (internal only)
  3. Database breakout recording (internal only)
  4. Organizations breakout recording (internal only)

Test Platform in Cells

Cells is a project that spans the entirety of GitLab. More information on what Cells is and how it is being developed is on the Cells handbook page. Instead of recreating the feature testing done by other teams, we will reuse what already exists and supplement it to fill gaps.

This approach has the following requirements:

  • It must feed back useful information to the engineering teams in an efficient, non-burdensome way
  • It must provide good coverage so we have confidence to release
  • It must be easy to add, enhance, and change tests
  • It must work with our current process

Strategy

The testing strategy for Cells follows our practice of testing at the correct level. The testing will be focused on a couple of efforts: