Cells ADR 005: Flexible Reference Architectures

Glossary

  1. Reference Architecture: a predefined architecture for deploying a GitLab instance at a given scale, as defined by the Test Platforms team and documented in the Reference Architectures Documentation.
  2. Cell Architecture: an iteratively versioned architecture definition, shared across all cells.
  3. Cell Sub-Archetype: a limited set of architectural deltas, deployed across the Cell fleet. Implemented as Overlays in the Tenant Model and Instrumentor, the provisioner.
  4. Overlays: a way of adjusting a reference architecture in a consistent and deterministic manner. Several Overlays exist, for example, to increase disk performance or Postgres capacity. A sketch illustrating the concept follows this glossary.
  5. Tamland: a capacity forecast tool deployed across GitLab SaaS instances, including GitLab.com, GitLab Dedicated and Cells.
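
To make the Overlay concept concrete, the following is a minimal sketch of an overlay modelled as a pure, deterministic transformation of a base architecture. It is illustrative only: the type and field names are hypothetical, and Instrumentor's actual implementation differs.

    from dataclasses import dataclass, replace

    @dataclass(frozen=True)
    class ArchitectureSpec:
        # Simplified stand-in for an architecture definition (hypothetical fields).
        gitaly_disk_gb: int
        postgres_instance_type: str

    def disk_performance_overlay(base: ArchitectureSpec) -> ArchitectureSpec:
        # Hypothetical overlay: deterministically double Gitaly disk capacity,
        # so applying it to the same base always yields the same result.
        return replace(base, gitaly_disk_gb=base.gitaly_disk_gb * 2)

    base = ArchitectureSpec(gitaly_disk_gb=500, postgres_instance_type="db-standard-4")
    print(disk_performance_overlay(base))
    # ArchitectureSpec(gitaly_disk_gb=1000, postgres_instance_type='db-standard-4')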

Context

At the Cells Fastboot offsite, the use of Reference Architectures with respect to Cells was discussed:

  1. Whether we should use existing Reference Architectures
  2. Whether we should define new Reference Architectures

Key points from this discussion included:

  1. The GitLab Reference Architecture documentation specifically states that the Reference Architectures are the starting point for defining an environment, rather than an immutable definition of an environment.
  2. Being “a single application with all the functionality of a DevSecOps Platform”, the GitLab application covers a wide variety of use cases, and load depends on the specific workloads users generate on an instance rather than simply on the number of active users.
  3. Efforts will be made to balance Cell load by mixing a variety of Organizations on each Cell. This could include mixing Premium and Free Organizations, Organizations based in complementary timezones (to spread load across the day), and different workload profiles (for example, balancing heavy database and Gitaly users across Cells). However, even with optimal Organization balancing, Cells are likely to develop hotspots that will need to be addressed by horizontally and vertically scaling individual Cells to meet their workload requirements.
  4. Dedicated Tenant instances already deploy Tamland for Capacity Planning purposes, although the capacity planning process is not yet fully established. This process could be leveraged for Cells too.
  5. As an illustrative case at the far end of the SaaS spectrum, GitLab.com:
    1. Does not use a Reference Architecture.
    2. Is scaled, semi-dynamically, over time and according to need, as workloads and user activity change.

What changes can be made to a Reference Architecture?

There are several types of changes that can be used to modify a Reference Architecture; a sketch of the in-scope scaling changes follows the list.

  1. Changes in Shape: this could include the addition of components only required for GitLab.com, for example, SaaS-specific logging and analytics components. Differences between the GitLab.com product offering and the self-managed offering may necessitate changes to the architecture. These changes are out of scope for this document.
  2. Storage Capacity Scaling: over the long term, the storage requirements of GitLab instances grow more steadily than CPU, memory, and other resource requirements. This means that tenants/cells will need to be able to grow to support more storage capacity.
  3. Vertical Scaling: the Reference Architectures specify machine/instance types. These can be adjusted to larger, more performant types, or cheaper, smaller types to better suit a specific workload.
  4. Horizontal Scaling: the Reference Architectures specify the number of nodes/pods/instances (or minimum and maximum counts) that run each service. These can be adjusted to suit workloads.
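
As a rough sketch of how the three in-scope change types map onto distinct knobs in a service definition (the field names and machine types are illustrative assumptions, not the actual Instrumentor schema):

    from dataclasses import dataclass, replace

    @dataclass(frozen=True)
    class ServiceSpec:
        instance_type: str  # vertical scaling knob
        min_replicas: int   # horizontal scaling knobs
        max_replicas: int
        disk_gb: int        # storage capacity knob

    web = ServiceSpec(instance_type="n2-standard-8", min_replicas=3, max_replicas=6, disk_gb=100)

    # Vertical scaling: a larger machine type for a CPU- or memory-bound workload.
    web_vertical = replace(web, instance_type="n2-standard-16")

    # Horizontal scaling: wider replica bounds to absorb more concurrent load.
    web_horizontal = replace(web, min_replicas=6, max_replicas=12)

    # Storage capacity scaling: grow disk independently of CPU and memory.
    web_storage = replace(web, disk_gb=250)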

Decision

  1. Cells will use a Reference Architecture as their initial state, but will be adjusted according to load.
  2. Storage Capacity changes, Vertical Scale changes and Horizontal Scale changes will be made iteratively to the Cell Architecture, and lead to drift from the original Reference Architectures over time. These changes will be applied to all cells.
  3. Tamland Capacity Planning will be used to ensure that scaling actions are carried out ahead of potential saturation events.
  4. While the Org Mover will be used as the primary mechanism to rebalance traffic, with the goal of avoiding hotspots, it’s likely that over time a limited number of Cell Sub-Archetypes will need to be defined to handle specific workloads. These will be defined as needed, similarly to how Dedicated already provides a limited set of tenant customizations using overlays.

Cells Architecture evolution over time

Different Types of Scaling Changes

  1. Storage capacity in the Cell Architecture should be expressed as defaults, with values for disk space for Postgres, Gitaly, and so on configurable in the Tenant Model. This would allow Cells to be resized for additional storage without needing to rebalance Organizations or iterate on the architecture, and would ensure that storage capacity is not overprovisioned on individual cells before it is needed. A sketch of this override pattern follows this list.
  2. Vertical and Horizontal Scaling changes should be implemented either in the Cell Architecture or, when appropriate, in a Sub-Archetype/Overlay.
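
The following minimal sketch shows how per-cell storage overrides could be layered over shared defaults. The keys and values are hypothetical, not the actual Tenant Model schema:

    # Defaults shared by every cell, defined once in the Cell Architecture.
    CELL_ARCHITECTURE_DEFAULTS = {
        "postgres_disk_gb": 500,
        "gitaly_disk_gb": 1000,
    }

    def effective_storage(tenant_overrides: dict) -> dict:
        # Per-cell Tenant Model overrides take precedence over shared defaults.
        return {**CELL_ARCHITECTURE_DEFAULTS, **tenant_overrides}

    # One hot cell grows its Gitaly storage without changing the shared architecture.
    print(effective_storage({"gitaly_disk_gb": 2000}))
    # {'postgres_disk_gb': 500, 'gitaly_disk_gb': 2000}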

Consequences

Cells will use a homogeneous Cell Architecture, which will, over time, drift from the Reference Architectures. As these changes are made, the SaaS Platforms team may provide informal feedback to the Test Platforms team to guide future iterations of the Reference Architectures.

The Cell Architecture is defined in Instrumentor and will be shared by all cells, but a limited number of sub-archetypes may be developed to cope with specific workloads.

This will need to be managed with a scalable process, using Instrumentor overlays. See the Tenant Model Documentation for details of the restrictions around adding overlays.

Tamland Capacity Planning processes will need to be deployed to monitor the saturation of Cells.

When Tamland predicts future saturation, several routes to resolving the issue could be taken, listed here in descending order of preference (a sketch of this selection logic follows the list):

  1. Rebalance a noisy-neighbor Organization to another cell, or a new cell, using the Org Mover. Depending on the workload that’s generating the saturation, it may be worthwhile moving the Organization to a cell with a different archetype.
  2. For storage utilization issues, possibly increase storage capacity on that cell.
  3. Iterate on the Cell Architecture, releasing the next version, to improve saturation across all cells.
  4. Introduce a new sub-archetype to deal with a specific type of workload unsuited to existing archetypes, provision a cell based on this sub-archetype and use the Org Mover to move the noisy-neighbor to the new cell.
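
As an illustrative sketch of this preference order (the inputs and decision criteria are simplified assumptions, not an actual runbook):

    from enum import Enum, auto

    class Remediation(Enum):
        REBALANCE_ORG = auto()         # 1. move the noisy neighbor elsewhere
        GROW_STORAGE = auto()          # 2. storage-only saturation on one cell
        ITERATE_ARCHITECTURE = auto()  # 3. saturation common to all cells
        NEW_SUB_ARCHETYPE = auto()     # 4. workload unsuited to existing archetypes

    def choose_remediation(org_attributable: bool, resource: str, fleet_wide: bool) -> Remediation:
        # Walk the options in descending order of preference.
        if org_attributable:
            return Remediation.REBALANCE_ORG
        if resource == "storage":
            return Remediation.GROW_STORAGE
        if fleet_wide:
            return Remediation.ITERATE_ARCHITECTURE
        return Remediation.NEW_SUB_ARCHETYPE

    print(choose_remediation(org_attributable=False, resource="disk_iops", fleet_wide=False))
    # Remediation.NEW_SUB_ARCHETYPE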

Analysis will determine the correct course of action.

Initially, this process will need to be manual, but over time, it may be possible to automate the rebalancing of Organizations or the scaling of cells.

As a straightforward example of automated scaling: if, based on a Tamland prediction, a Cell is tending towards Gitaly disk space saturation, an automated process may, in future, increase the size of the Gitaly disks or the number of Gitaly nodes. A minimal sketch of such a response follows.
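
The sketch below assumes a Tamland-style forecast expressed as days until saturation; the 30-day horizon and 50% growth step are invented for illustration, not real policy.

    def plan_gitaly_disk_resize(days_to_saturation: float, current_disk_gb: int,
                                horizon_days: int = 30) -> int | None:
        # If saturation is forecast within the horizon, propose a 50% larger disk;
        # otherwise, no action is needed.
        if days_to_saturation <= horizon_days:
            return int(current_disk_gb * 1.5)
        return None

    print(plan_gitaly_disk_resize(days_to_saturation=14, current_disk_gb=1000))  # 1500
    print(plan_gitaly_disk_resize(days_to_saturation=90, current_disk_gb=1000))  # None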

Alternatives

Reference Architecture with no variation

Reference Architectures could be strictly adhered to. This would require new Reference Architectures to be defined. These architectures would likely not be directly useful to customers, only to internal GitLab SaaS consumers.

Additionally, this would put responsibility for Cell scaling indirectly on the team responsible for the Reference Architecture definitions, the Test Platforms team, which would slow down the scaling process and create inefficiencies in the management of Cell infrastructure.

Cell Architecture with no variation

A single Cell Architecture could be deployed for all cells, with no variation.

The only options for dealing with saturation would be to scale the single architecture, increasing the resources on all cells, or to rebalance an Organization, possibly to its own cell, to avoid noisy-neighbor effects.

This would be inefficient, as many Cells would be over-provisioned so that a few cells could be correctly provisioned.

Additionally, running an Organization on its own cell, simply because it is too noisy for colocation with other customers, would be expensive.
