Package Stage - Container Registry

The goal of this page is to document specific processes and tools for the GitLab Container Registry project.

Overview

Historical Context

In milestone 8.8, GitLab launched the MVC of the Container Registry. This feature integrated the Docker Distribution registry into GitLab so that any GitLab user can have a space to publish and share container images.

But there was an inherent problem with Docker Distribution. When you delete a container image tag, it’s not actually deleted from storage. Instead, it is marked for deletion and is only removed from storage when garbage collection runs. The catch is that the registry must be set to read-only mode or taken offline while garbage collection runs. Given the scale and SLAs of GitLab.com, it has not been possible to schedule that downtime.
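The mark-then-sweep behaviour described above can be sketched in a few lines of Python. This is a toy in-memory model, not the Docker Distribution implementation: the `Registry` class, its fields, and its method names are all hypothetical, and it makes the simplifying assumption that a manifest becomes unreferenced as soon as no tag points to it.

```python
class Registry:
    """Toy in-memory model of a registry; every name here is hypothetical."""

    def __init__(self):
        self.tags = {}        # tag name -> manifest digest
        self.manifests = {}   # manifest digest -> set of layer digests
        self.blobs = set()    # every digest currently held in storage
        self.read_only = False

    def push(self, tag, manifest_digest, layers):
        if self.read_only:
            raise RuntimeError("registry is read-only during garbage collection")
        self.manifests[manifest_digest] = set(layers)
        self.blobs.add(manifest_digest)
        self.blobs.update(layers)
        self.tags[tag] = manifest_digest

    def delete_tag(self, tag):
        # Only the reference goes away; the blobs stay in storage.
        del self.tags[tag]

    def garbage_collect(self):
        # Writes are blocked for the whole run, which is why this is "offline" GC.
        self.read_only = True
        try:
            # Mark: collect everything still reachable from a tag.
            referenced = set()
            for manifest_digest in self.tags.values():
                referenced.add(manifest_digest)
                referenced |= self.manifests[manifest_digest]
            # Sweep: remove whatever was not marked.
            removed = self.blobs - referenced
            self.blobs -= removed
            for digest in removed:
                self.manifests.pop(digest, None)
            return removed
        finally:
            self.read_only = False
```

In this model, deleting a tag frees no storage until `garbage_collect()` is called, and layers shared with a still-tagged image survive the sweep.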

Fast forward many milestones, and the problem has continued to grow linearly. The GitLab.com registry consumes petabytes of storage, which costs tens of thousands of dollars every month.

What’s been done so far?

Two years ago, the Package group and GitLab Staff engineers had a lengthy discussion about how to resolve this problem and to deliver a lovable product. We evaluated several options, including:

  • Integrating with another service like Harbor or Docker Trusted Registry
  • Building our in-house registry
  • Forking Docker Distribution and iterating on that code

In the end, the decision was made to fork Docker Distribution and make the requisite updates to add support for online garbage collection.

Along the way, we made several other changes and modifications, which were all targeted to help GitLab and its customers tackle this storage problem.

First, we optimized the existing garbage collection algorithms for GCS and S3, so that large enterprises could be unblocked from running garbage collection, even if it required downtime. This helped improve the performance of the algorithm by 90+ percent.

We also added programmatic cleanup policies to help customers automatically remove old, unused images (even though the underlying data was not deleted from storage).

Fast forward a bit, and, as a team, we’ve evaluated and iterated on designs for the implementation of online garbage collection and several plans for the migration of one registry to the next. The epic details the work required to deploy the new metadata database to production and migrate all new and existing repositories to use the feature.

Why we are excited

The metadata database is not just about online garbage collection. It unblocks a whole new set of potential features and capabilities that will help our customers to manage and deploy their software reliably. For example, it unblocks some much-needed updates for the API so that we can support a more robust user interface, add enterprise features like image signing and protection, and provide more stability and reliability.

By forking the project, we continued improving the application and implementing several bug fixes, performance improvements, and additional features that were not available or accepted upstream. Consequently, due to the rate of changes and how the codebases diverged, we decided to detach from upstream in June 2020. Since then, we have been evolving the project in isolation. We continue to source changes from upstream whenever necessary (mostly security fixes), but these must be merged manually.

In February 2021, Docker decided to donate Docker Distribution to the Cloud Native Computing Foundation (CNCF) to revive the project. The project was then renamed to Distribution. Some of the GitLab Container Registry maintainers are now maintainers of the CNCF project as well.

Although some of our engineers contribute to the upstream Distribution project, the codebases diverged beyond reconciliation. Therefore, there is no intention to reunite with upstream at any point in time. Some of the features we have been working on, such as the metadata database and online garbage collection (more on that later), required a significant platform re-architecture. Additionally, we will continue to tailor the GitLab Container Registry to suit the needs of GitLab the product.

Documentation

The documentation is currently scattered across multiple places, namely this handbook page, docs.gitlab.com, the project repository, and the upstream Docker documentation. This is a known issue and something we intend to address.

Standards

Being a fork of the original Docker Distribution registry, the GitLab Container Registry is based on the V1 (deprecated in 13.4 and no longer supported since 13.8) and the V2 Docker Image Specification. These specifications define the format and content of Manifests and Manifest Lists, used to describe Docker container images.

The GitLab Container Registry is also based on the original Docker Registry HTTP API V2 specification, which defines the contract of the single entrypoint for the Container Registry - its HTTP API.

Due to the need to standardize the container distribution mechanism across vendors/providers, in 2018, Docker donated the HTTP API V2 specification to the Open Container Initiative (OCI), which is a governance structure under the Linux Foundation maintaining open industry standards around containerization technologies. This led to the creation of the OCI Distribution Specification, which is the current leading standard and therefore the one that the GitLab Container Registry HTTP API adheres to.
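To give a feel for the contract this specification defines, here is a small sketch that builds the URLs for three read endpoints from the OCI Distribution Specification: the API version check, the tag listing, and the manifest fetch. The `oci_endpoints` helper, the host name, and the repository path are hypothetical and only illustrate the URL layout; authentication and GitLab-specific extensions are out of scope.

```python
def oci_endpoints(base_url, repository, reference):
    """Return URLs for a few read endpoints of the OCI Distribution Specification.

    base_url, repository and reference are illustrative values supplied by the caller.
    """
    base = base_url.rstrip("/")
    return {
        # GET /v2/ - checks that the registry implements the V2 API
        "version_check": f"{base}/v2/",
        # GET /v2/<name>/tags/list - lists the tags of a repository
        "list_tags": f"{base}/v2/{repository}/tags/list",
        # GET /v2/<name>/manifests/<reference> - fetches a manifest by tag or digest
        "get_manifest": f"{base}/v2/{repository}/manifests/{reference}",
    }


if __name__ == "__main__":
    # registry.example.com and group/project/app are placeholder values.
    for name, url in oci_endpoints(
        "https://registry.example.com", "group/project/app", "latest"
    ).items():
        print(f"{name}: {url}")
```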

Similarly, Docker also contributed its image specification to OCI, leading to the creation of a vendor agnostic OCI Image Specification. This is the standard that the GitLab Container Registry adheres to when it comes to the Image Manifest and Image Index (the equivalent to Docker’s Image Manifest Lists) formats.
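The Docker and OCI formats mentioned above are distinguished by their media types. The sketch below maps the four well-known media types to the kind of document they describe. The `manifest_kind` helper and its return labels are hypothetical, and it assumes the manifest carries a `mediaType` field, which is optional in some OCI documents.

```python
import json

# Media types defined by the Docker V2 and OCI image specifications.
MANIFEST_KINDS = {
    "application/vnd.docker.distribution.manifest.v2+json": "single image",
    "application/vnd.oci.image.manifest.v1+json": "single image",
    "application/vnd.docker.distribution.manifest.list.v2+json": "multi-platform index",
    "application/vnd.oci.image.index.v1+json": "multi-platform index",
}


def manifest_kind(raw_manifest):
    """Classify a manifest JSON document by its mediaType field."""
    document = json.loads(raw_manifest)
    # Documents without a recognised mediaType are reported as "unknown".
    return MANIFEST_KINDS.get(document.get("mediaType"), "unknown")
```

For example, a Docker Manifest List and an OCI Image Index both classify as a multi-platform index, which reflects the equivalence noted above.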

The OCI Image and Distribution specifications are backward compatible with the original Docker specifications and are actively being worked on and therefore subject to changes. The GitLab Container Registry should follow the progress of these specifications to maintain OCI compliance. We are free to extend the HTTP API with additional functionality if needed, as long as we can keep backward compatibility with the OCI Distribution specification.

Development

The following documentation is especially relevant for engineers working with the Container Registry:

  • Differences from Upstream - This is where we keep track of differences between the GitLab Container Registry and the upstream CNCF Distribution. All user-facing changes must be documented.
  • Releases - How we manage and cut new releases.
  • Contributing - We stay as close as possible to the general GitLab development guidelines but enforce stricter rules whenever appropriate. Here is where we document those specific contributing processes.
  • Development Guidelines - Links for specific development documentation, ranging from setup instructions to general GitLab development guidelines extensions.
  • Technical Documentation - Documentation about specific components or features of the application. This includes components and features inherited from upstream (with no documentation available elsewhere) and new ones.
  • Configuration - Documentation about the available application settings. The upstream configuration documentation from Docker was the base for this, but since then, we have added, deprecated, and changed multiple configurations.
  • Storage Drivers - The documentation for the storage drivers. This is only available upstream. Whenever we add or change a storage driver setting, we document it here.
  • Notifications - Documentation about the asynchronous webhook notifications feature.
  • Authentication - All about the authentication specification implemented by the Container Registry and supported by GitLab Rails (the authentication provider).
  • HTTP API Specification - This is based on the OCI Distribution Specification (see Standards for more details) but we have extended it with additional functionality. A log of changes is kept in the project documentation.
  • Online Garbage Collection - Documentation about the implementation of online garbage collection.
  • Request Flow - Sequence diagrams explaining the request flow for authentication, pull and push requests.

Administration

The following links are related to administrative tasks, mainly for self-managed instances:

Architecture

Documentation about the current architecture or any significant changes to it. The latter usually come in the form of an Architecture Blueprint:

Observability

Metrics and Dashboards

The registry exports metrics about all its components (HTTP API, storage, database and online GC) to Prometheus. These are operational metrics only. Due to cardinality limitations in Prometheus, metrics that require variable labels such as repository and image identifiers are exposed through the application logs.
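The cardinality concern can be illustrated with simple arithmetic: in the worst case, a metric produces one time series per combination of label values, so the series count is the product of the number of distinct values of each label. The label names and counts below are made up for illustration and are not taken from the registry.

```python
from math import prod


def max_series(label_values):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


# Hypothetical labels with small, fixed value sets.
bounded = {
    "method": ["GET", "HEAD", "PUT", "PATCH", "DELETE"],
    "code": ["200", "201", "202", "401", "404", "500"],
}

# The same metric with a hypothetical per-repository label added.
unbounded = dict(bounded, repository=[f"repo-{i}" for i in range(100_000)])

if __name__ == "__main__":
    print(max_series(bounded))    # 5 * 6 combinations
    print(max_series(unbounded))  # 5 * 6 * 100000 combinations
```

This is an upper bound, since not every combination occurs in practice, but it shows why identifiers that grow with usage are better suited to logs than to metric labels.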

For GitLab.com, the container registry has multiple dashboards in Grafana and Kibana, listed below. As with any other GitLab application, observability is critical. If you are working on this project, please take the time to go through all dashboards and become familiar with the metrics displayed in them.

All the underlying metrics for the Grafana and Kibana dashboards are also available on self-managed installs.

Grafana

  • Overview - Main service dashboard. Provides an overview of the Service Level Indicators (SLI). The information is available for all the service components.
  • Application Detail - Detailed information about the application metrics. Provides insight into the HTTP API and host resource usage.
  • Storage Detail - Consolidated information about the registry storage backend - Google Cloud Storage (GCS).
  • Database Detail - Fine-grained metrics about the metadata database.
  • Garbage Collection Detail - Extensive metrics related to the online GC feature.
  • Redis Detail - Detailed metrics about Redis usage.
  • Migration Detail - Temporary dashboard to support the ongoing GitLab.com deployment and migration (gitlab-org&5523).
  • PgBouncer - Metrics for the PgBouncer nodes in front of the registry PostgreSQL database cluster.

The source PromQL query for any graph in Grafana can be identified by looking at the corresponding dashboard source code in the runbooks project. You can also find it in the Grafana UI by clicking the dropdown alongside a graph’s name and then clicking Explore. That will take you to a WYSIWYG editor for that particular query, showing both the PromQL source and the rendered graph. For a more advanced overview, you can watch a recording that walks through the process of creating and querying Prometheus metrics and Grafana dashboards for the GitLab.com upgrade/migration project.

Thanos

Behind the scenes, Grafana dashboards source Prometheus metrics from Thanos. If you want to, you can reproduce any of the Grafana graphs in Thanos by inspecting its source query in Grafana and executing it in Thanos. This might be useful when querying long timespans, as Thanos is usually faster. You can also write custom queries for one-off analyses and share the results with others by sharing screenshots and the Thanos page link.

Thanos has an autocompletion feature which is useful to find available Prometheus metrics. For example, going to Thanos and typing registry_ in the search box should show a dropdown of all available application metrics for the Container Registry. You can then use PromQL to query and visualize these metrics.
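Queries can also be issued programmatically. Thanos Query exposes the Prometheus-compatible HTTP API, whose instant query endpoint is `/api/v1/query` with a `query` parameter. The sketch below only builds such a URL; the host name and the metric `registry_example_total` are placeholders, not real registry metrics, so substitute a name found through the `registry_` autocompletion.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build a Prometheus-compatible instant query URL for the given PromQL expression."""
    # urlencode percent-encodes the expression so brackets and parentheses are safe.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # thanos.example.com and registry_example_total are hypothetical placeholders.
    print(
        instant_query_url(
            "https://thanos.example.com",
            "sum(rate(registry_example_total[5m]))",
        )
    )
```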

Kibana

  • Main - Overview of several registry metrics.
  • Blob Downloads - Provides additional insight about downloads at the top-level namespace and repository levels.

Logs

The container registry exposes structured access and application logs. For GitLab.com, these logs can be found in Kibana:

  • Non-production: Logs for the development, staging, and pre-production environments.
  • Production: Logs for the production environment.

Releases

The Container Registry adheres to the same development and release timeline as all other GitLab projects. However, ensuring all changes are merged in the project repository by the code cutoff (the Friday before the Thursday release day) is not enough. The release process is manual and can take a couple of days to complete, especially for self-managed installs.

Considering this, we strive to have deliverables merged in the project repository 10 days before the targeted release date. At this point, a maintainer should cut a new release following the documented process. This gives us a 5-day window to deal with the release process and merge all version bumps by the code cutoff. It is the responsibility of the maintainer creating the last release to ensure that all deliverables have been merged or that the assigned engineer(s) acknowledge that changes will not be ready on time. Engineers are encouraged to set a due date on deliverable issues to reduce the chances of missing the merge and release window.
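The timeline above can be worked through with a little date arithmetic. This sketch takes a release day, subtracts 10 days to get the merge target, and finds the preceding Friday as the code cutoff. The `release_schedule` helper and the example date are hypothetical, and counting the window inclusively (both end days included) is an assumption about how the 5-day figure is meant.

```python
from datetime import date, timedelta

FRIDAY = 4  # date.weekday() numbers Monday as 0


def release_schedule(release_day):
    """Return (merge_target, code_cutoff) for a given release day."""
    # Deliverables should be merged 10 days before the release.
    merge_target = release_day - timedelta(days=10)
    # Code cutoff is the Friday strictly before the release day.
    days_back = (release_day.weekday() - FRIDAY) % 7 or 7
    code_cutoff = release_day - timedelta(days=days_back)
    return merge_target, code_cutoff


if __name__ == "__main__":
    # Example only: a release falling on Thursday 2024-05-16.
    merge_target, code_cutoff = release_schedule(date(2024, 5, 16))
    window = (code_cutoff - merge_target).days + 1  # inclusive day count
    print(merge_target, code_cutoff, window)
```

For a Thursday release, the merge target lands on the Monday of the previous week and the cutoff on that week's Friday, which gives the Monday-to-Friday window for cutting the release and merging version bumps.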

GitLab.com

Unlike the main Rails application, the Container Registry does not currently benefit from continuous delivery. Releases and deployments are, therefore, a manual process (for now). You can learn more about the process in the project documentation.

You can identify the current version of the registry running in each environment by looking at the k8s-workloads project, and more precisely here, looking for the registry_version key. This is also the right project to look at to identify if any change occurred around a given time, not only for regular upgrades but also for any application configuration changes. This can be done by searching for merged MRs with the label Service::Container Registry.

Self-Managed

Releases for self-managed are also manual and adhere to the normal GitLab monthly release cadence. Because the release process requires several manual steps, special attention must be paid to ensure that changes are merged on time before the release date. Please refer to the project documentation for more details.