Utilize bundle-uri to reduce Gitaly CPU load

This page contains information related to upcoming products, features, and functionality. It is important to note that the information presented is for informational purposes only. Please do not rely on this information for purchasing or planning purposes. The development, release, and timing of any products, features, or functionality may be subject to change or delay and remain at the sole discretion of GitLab Inc.
Status Authors Coach DRIs Owning Stage Created
proposed toon mjwood jcaigitlab devops systems 2023-08-04

Summary

bundle-URI is a fairly new concept in Git that allows the client to download one or more bundles in order to bootstrap the object database in advance of fetching the remaining objects from a remote. By having the client download static files from a simple HTTP(S) server in advance, the work that needs to be done on the remote side is reduced.

Git bundles are files that store a packfile along with some extra metadata, including a set of refs and a (possibly empty) set of necessary commits. When a user clones a repository, the server can advertise one or more URIs that serve these bundles. The client can download these to populate the Git object database. After it has done this, the negotiation process between server and client starts to determine which objects still need to be fetched. When the client has pre-populated the database with data from the bundles, less negotiation and object transfer is needed from the server, which puts less load on the server’s CPU.
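
To make the bundle metadata more concrete, here is a minimal sketch that reads the textual header of a version 2 bundle, which lists prerequisite commits (prefixed with `-`) and refs before the packfile starts. The `BundleHeader` class, the `parse_bundle_header` function, and the 40-character object IDs are made up for illustration; real tooling should rely on `git bundle verify` or `git bundle list-heads` instead.

```python
from dataclasses import dataclass, field


@dataclass
class BundleHeader:
    version: int
    # Commits the client must already have before it can apply the bundle.
    prerequisites: list = field(default_factory=list)
    # Mapping of ref name to the object ID it points at.
    refs: dict = field(default_factory=dict)


def parse_bundle_header(data: bytes) -> BundleHeader:
    """Parse the header of a version 2 bundle; the packfile after it is ignored."""
    # The header ends at the first empty line; everything after it is the packfile.
    header, _, _packfile = data.partition(b"\n\n")
    lines = header.decode("utf-8").split("\n")
    if lines[0] != "# v2 git bundle":
        raise ValueError("only version 2 bundles are handled in this sketch")
    parsed = BundleHeader(version=2)
    for line in lines[1:]:
        if line.startswith("-"):
            # Prerequisite line: "-<oid> <optional comment>"
            parsed.prerequisites.append(line[1:].split(" ", 1)[0])
        else:
            # Reference line: "<oid> <refname>"
            oid, refname = line.split(" ", 1)
            parsed.refs[refname] = oid
    return parsed


# Placeholder object IDs, not real commits.
OLD_TIP = "a" * 40
NEW_TIP = "b" * 40
sample = (
    f"# v2 git bundle\n-{OLD_TIP} previous tip\n{NEW_TIP} refs/heads/main\n\n".encode()
    + b"PACK"
)
header = parse_bundle_header(sample)
print(header.prerequisites, header.refs)
```

A bundle with an empty prerequisite list is self-contained and can bootstrap a fresh clone, which is the case this proposal cares about.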

Motivation

When a user pushes changes, it usually kicks off a CI pipeline with a bunch of jobs. When the CI runners all clone the repository from scratch with git clone, each of them negotiates with the server which objects it needs. This is very CPU-intensive for the server.

Some time ago we introduced the pack-objects cache, but it has some pitfalls. When the tip of a branch changes, a new packfile needs to be calculated, and the cache needs to be refreshed.

Git bundles are more flexible. It’s not a big issue if the bundle doesn’t have all the most recent objects. When it contains a fairly recent state, but is missing the latest refs, the client (that is, the CI runner) will do a “catch up” and fetch additional objects after applying the bundle. The set of objects it has to fetch from Gitaly will be a lot smaller.

Goals

Reduce the work that needs to be done on the Gitaly servers when a client clones a repository. This is particularly useful for CI build farms, which generate a lot of traffic on each commit that’s pushed to the server.

With the use of bundles, the server has to craft smaller delta packfiles compared to the packfiles that contain all the objects when no bundles are used. This reduces the load on the CPU of the server. This benefits the packfile cache as well, because the packfiles are now smaller and faster to generate, reducing the chance of cache misses.

Non-Goals

Using bundle-URIs will not reduce the size of repositories stored on disk. This feature will not be used to offload repositories, neither fully nor partially, from the Gitaly node to some cloud storage. On the contrary, because bundles are stored elsewhere, some data is duplicated, which will increase storage costs.

In this phase, the goal is not to boost performance for incremental fetches. When the client has already cloned the repository, bundles won’t be used to optimize fetches of new data.

Currently bundle-URI is not fully compatible with shallow clones, therefore we’ll leave that out of scope. More info about that in Git issue #170.

Proposal

When a client clones a repository, Gitaly advertises a bundle URI. This URI points to a bundle that’s refreshed at a regular interval, for example during housekeeping. For each repository only one bundle will exist, so when a new one is created, the old one is invalidated.

The bundles will be stored in cloud object storage. To use bundles, the administrator should configure this in Gitaly.

Design and implementation details

When a client initiates a git clone, on the server-side Gitaly spawns a git upload-pack process. Gitaly can pass along additional Git configuration. To make git upload-pack advertise bundle URIs, it should pass the following configuration:

  • uploadpack.advertiseBundleURIs :: This should be set to true to enable the use of advertised bundles.
  • bundle.version :: At the moment only 1 is accepted.
  • bundle.mode :: This can be either any or all. Since we only want to use bundles for the initial clone, any is advised.
  • bundle.<id>.uri :: This is the actual URI of the bundle identified with <id>. Initially we will only have one bundle per repository.
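
The settings above can be sketched as a list of `-c key=value` arguments for Git. This is only an illustration of which keys and values are involved: the helper name `bundle_uri_config_args`, the bundle ID `default`, the example URI, and the repository path are assumptions, and Gitaly may inject configuration through a different mechanism.

```python
def bundle_uri_config_args(bundle_id: str, uri: str) -> list:
    """Return `-c key=value` arguments that make upload-pack advertise one bundle."""
    config = {
        "uploadpack.advertiseBundleURIs": "true",
        "bundle.version": "1",
        "bundle.mode": "any",
        f"bundle.{bundle_id}.uri": uri,
    }
    args = []
    for key, value in config.items():
        args += ["-c", f"{key}={value}"]
    return args


# Hypothetical bundle ID and URI, used only to show the resulting command line.
args = bundle_uri_config_args("default", "https://bundles.example.com/repo.bundle")
print(["git", *args, "upload-pack", "/path/to/repo.git"])
```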

Enable the use of advertised bundles on the client-side

The current version of Git does not use the advertised bundles by default when cloning or fetching from a remote. Luckily, we control most of the CI runners ourselves. So to use bundle URI, we can modify the Git configuration used by the runners and set transfer.bundleURI=true.
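
As a minimal sketch of the runner-side change, the setting can be passed for a single clone with Git's `-c` option instead of being written to a configuration file. The `clone_command` helper, the remote URL, and the destination directory are hypothetical.

```python
import shlex


def clone_command(remote_url: str, destination: str) -> list:
    """Build a clone command that opts in to advertised bundle URIs."""
    return ["git", "-c", "transfer.bundleURI=true", "clone", remote_url, destination]


# Hypothetical remote and destination, for illustration only.
command = clone_command("https://gitlab.example.com/group/project.git", "project")
print(shlex.join(command))
```

A runner could pass the resulting list to `subprocess.run` to perform the clone.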

Access control

We don’t want to leak data from private repositories through public HTTP(S) hosts. There are a few options for how we can overcome this:

  • Only activate the use of bundle-URI on public repositories.
  • Use a solution like signed-URLs.

Public repositories only

Gitaly itself does not know if a project, and its repository, is public, so to determine whether bundles can be used, GitLab Rails has to tell Gitaly. It’s complex to pass this information to Gitaly, and using this approach will make the feature only available for public projects, so we will not proceed with this solution.

Signed URLs

The use of signed-URLs is another option to control access to the bundles. This feature, provided by Google Cloud, allows Gitaly to create a URI that has a short lifetime.

The downside of this approach is that it depends on a cloud-specific feature, so each cloud provider might offer it in a slightly different form, or not at all. But we want to roll this feature out on GitLab.com first, which is hosted on Google Cloud, so for a first iteration we will use this.
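
To illustrate the general idea of a short-lived signed URL, the sketch below appends an expiry timestamp and an HMAC to a URL and verifies both on access. This is a simplified, generic scheme and not Google Cloud's actual signing algorithm. The function names, the `expires` and `signature` query parameters, and the shared secret are assumptions made for this example.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit


def _signature(secret: bytes, path: str, expires: int) -> str:
    # Sign the path together with the expiry so neither can be changed unnoticed.
    message = f"{path}\n{expires}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_url(url: str, secret: bytes, lifetime_seconds: int, now: Optional[int] = None) -> str:
    """Append an expiry and a signature; assumes `url` has no query string yet."""
    issued_at = int(time.time()) if now is None else now
    expires = issued_at + lifetime_seconds
    path = urlsplit(url).path
    query = urlencode({"expires": expires, "signature": _signature(secret, path, expires)})
    return f"{url}?{query}"


def is_valid(signed_url: str, secret: bytes, now: Optional[int] = None) -> bool:
    """Check that the signature matches and the URL has not expired."""
    current = int(time.time()) if now is None else now
    parts = urlsplit(signed_url)
    params = parse_qs(parts.query)
    try:
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    expected = _signature(secret, parts.path, expires)
    return hmac.compare_digest(signature, expected) and current <= expires


# Hypothetical bundle URL and secret; fixed clock values keep the example deterministic.
url = sign_url("https://bundles.example.com/repo.bundle", b"secret", 600, now=1_000)
print(is_valid(url, b"secret", now=1_500))  # True: within the 600 second lifetime
print(is_valid(url, b"secret", now=2_000))  # False: the URL has expired
```

Because the URL expires quickly, a leaked link to a private repository's bundle is only useful for a short time.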

Bundle creation

Use server-side backups

At the moment Gitaly knows how to back up repositories into bundles onto cloud storage. The documentation describes how to use it.

For the initial implementation of bundle-URI we can piggy-back onto this feature. An admin should create backups for the repositories for which they want to use bundle-URI. With the existing configuration for backups, Gitaly can access cloud storage.

As part of housekeeping

Gitaly has a housekeeping worker that looks daily for repositories to optimize. Ideally we create a bundle right after housekeeping (that is, garbage collection and repacking) is done. This ensures an optimal bundle file.

There are a few things to keep in mind when automatically creating bundles:

  • Does the bundle need to be recreated? When there wasn’t much activity on the repository, it’s probably not necessary to create a new bundle file, as the client can fetch missing objects directly from Gitaly anyway. The housekeeping tasks use various heuristics to determine which strategy is taken for the housekeeping job. We can reuse parts of this logic in the creation of bundles.
  • Is it even needed to create a bundle? Some repositories might be very small, or see very little activity. Creating a bundle for these, and duplicating its data to object storage, doesn’t provide much value and only generates cost and maintenance.
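
The two questions above could be combined into a single decision, sketched below. The `RepoStats` fields and both thresholds are hypothetical values chosen only to show the shape of such a check; they are not taken from Gitaly's housekeeping heuristics.

```python
from dataclasses import dataclass


@dataclass
class RepoStats:
    size_bytes: int            # size of the repository on disk
    objects_since_bundle: int  # objects added since the last bundle was created
    has_bundle: bool           # whether a bundle already exists


# Hypothetical thresholds, for illustration only.
MIN_REPO_SIZE_BYTES = 100 * 1024 * 1024
MIN_NEW_OBJECTS = 10_000


def should_create_bundle(stats: RepoStats) -> bool:
    """Decide whether creating (or recreating) a bundle is worth the storage cost."""
    if stats.size_bytes < MIN_REPO_SIZE_BYTES:
        # Small repository: a bundle adds cost but little benefit.
        return False
    if not stats.has_bundle:
        # Large repository without a bundle yet.
        return True
    # Recreate only when enough new objects have accumulated.
    return stats.objects_since_bundle >= MIN_NEW_OBJECTS


print(should_create_bundle(RepoStats(size_bytes=2 * 1024**3, objects_since_bundle=50, has_bundle=True)))
```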

Controlled by GitLab Rails

Because bundles increase the cost of storage, we eventually want to give the GitLab administrator full control over the creation of bundles. To achieve this, bundle-URI settings will be available in the GitLab admin interface. There the admin can configure, per project, whether bundle-URI is enabled.

Configuration

To use this feature, Gitaly needs to be configured. For this we’ll add the following settings to Gitaly’s configuration file:

  • bundle_uri.strategy :: This indicates which strategy should be used to create and serve bundle-URIs. At the moment the only supported value is “backups”. When this setting has that value, Gitaly checks whether a server-side backup is available and uses it.
  • bundle_uri.sign_urls :: When set to true, the cloud storage URLs are not passed to the client as-is, but are transformed into signed URLs. This setting is optional and only supports Google Cloud Storage (for now).

The credentials to access cloud storage are reused as described in the Gitaly Backups documentation.
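
The two settings listed above could be validated along the following lines. The dictionary layout mirrors a hypothetical `[bundle_uri]` table in Gitaly's configuration file, and the default of false for `sign_urls` is an assumption based on the setting being optional; neither is confirmed by an existing implementation.

```python
SUPPORTED_STRATEGIES = {"backups"}


def validate_bundle_uri_config(config: dict) -> dict:
    """Validate the bundle_uri settings and fill in the optional default."""
    strategy = config.get("strategy")
    if strategy not in SUPPORTED_STRATEGIES:
        raise ValueError(f"unsupported bundle_uri.strategy: {strategy!r}")
    # Assumption: URL signing is off unless it is explicitly enabled.
    sign_urls = config.get("sign_urls", False)
    if not isinstance(sign_urls, bool):
        raise ValueError("bundle_uri.sign_urls must be true or false")
    return {"strategy": strategy, "sign_urls": sign_urls}


# Equivalent to a hypothetical configuration fragment:
#   [bundle_uri]
#   strategy = "backups"
#   sign_urls = true
settings = validate_bundle_uri_config({"strategy": "backups", "sign_urls": True})
print(settings)
```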

Storing metadata

For now all metadata needed to store bundles on the cloud is managed by Gitaly server-side backups.

Bundle cleanup

At some point the admin might decide to clean up bundles for one or more repositories, and an admin command should be added for this. Because for now we only use bundles created by gitaly-backup, we leave this out of scope.

Gitaly Cluster compatibility

Creating server-side backups doesn’t happen through Praefect at the moment. It’s up to the admin to choose the nodes from which they want to create backups. If they make sure the node is up-to-date, all nodes will have access to up-to-date bundles and can pass proper bundle-URI parameters to the client. So no extra work is needed to reuse server-side backup bundles with bundle-URI.

Alternative Solutions

No alternative solutions are suggested at the moment.

Last modified August 23, 2024: Ensure frontmatter is consistent (e47101dc)