Context
The pipeline security group is working towards providing users with SLSA Level 3 Provenance Attestations.
As a high-level summary, as stated in the design document, a provenance statement is a JSON document that correlates the SHA-256 of an artifact with its build information. A worker then digitally signs this statement, producing a provenance attestation. The SHA-256 is the sole mechanism through which artifacts, also called subjects, are identified; see the subject documentation for more information. This document identifies the best location in which to calculate the SHA-256 hash of job artifacts.
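To illustrate the shape of such a statement, the sketch below builds a minimal in-toto Statement in Ruby. The artifact name, the empty predicate, and the use of the SLSA v1 predicate type are illustrative assumptions; the exact document GitLab emits may differ.

```ruby
require "json"
require "digest"

# Stand-in bytes for a real job artifact (hypothetical example only).
artifact_bytes = File.binread(__FILE__)

# The subject ties the artifact's SHA-256 digest to the build metadata
# carried in the predicate.
statement = {
  "_type" => "https://in-toto.io/Statement/v1",
  "subject" => [
    {
      "name" => "my-app.tar.gz", # hypothetical artifact name
      "digest" => { "sha256" => Digest::SHA256.hexdigest(artifact_bytes) }
    }
  ],
  "predicateType" => "https://slsa.dev/provenance/v1",
  "predicate" => {} # build information would go here
}

puts JSON.pretty_generate(statement)
```

A worker would then sign this JSON document to produce the provenance attestation.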
Definitions
- Job Artifacts: One or more files generated by a CI job and saved as job artifacts, as declared by the artifacts directive in the CI configuration.
- Artifacts Bundle: The internal storage mechanism GitLab uses to persist job artifacts in the backend. The bundle is a zip of all of the individual job artifacts for a particular job.
Why is this change required?
We are required to generate provenance attestations that correlate build information with the hash of a specific artifact. While our current implementation attests the artifacts bundle, this has a number of disadvantages:
- The “artifacts bundle” is a mechanism that GitLab uses internally that has no meaning to our users. For example, they generally do not distribute this bundle to their users, but rather distribute the artifacts themselves.
- We do not currently store a correlation between the SHA-256 of the artifacts and the artifacts bundle, which makes it impossible to achieve the desired architecture, particularly requirements such as “The API is queried with the SHA-256 of the artifact and returns the Sigstore bundle if found”.
Technical background
At GitLab, job artifacts are stored within a zip file in object storage, termed above an “artifacts bundle”. Performing the SHA-256 calculation of a job artifact therefore requires retrieving the artifacts bundle and reading the job artifact within it, as demonstrated by the code example below:
> artifact = Ci::Build.last.job_artifacts.find { |a| a.file_type == "archive" }
> entry = Zip::File.open(artifact.file.file).entries.first
> Digest::SHA256.hexdigest(entry.get_input_stream.read)
3c5bba498d6f7a2cb4c195cf0873c8b68c9407f04dfa9acaad7fe4875e5e93f1
This ADR documents a decision made during the refinement of the “Calculate sha256 digest of artifact on PublishProvenanceService” issue. Initially, the pipeline security team decided to perform the calculation of the SHA-256 of the job artifacts within the Workhorse endpoint that generates the artifact metadata file. The reasoning for this was that by calculating the hash at this stage, we could avoid downloading the file.
While discussing options for the appropriate location within Workhorse to make the required changes, team members pointed out that the mechanism we planned to use to retrieve the file for hashing would still fetch it from object storage. Additionally, this endpoint already faces substantial latency, and metadata generation is planned to be decoupled from it.
Options Considered
1. Performing the SHA-256 calculation within Workhorse
Pros:
- Workhorse is specifically designed for computationally expensive tasks such as this one.
- Go is faster than Ruby for tasks that are computationally expensive.
Cons:
- As highlighted above, there are significant concerns with causing additional latency on this endpoint.
- Substantial changes to this endpoint are planned: customers have asked for runners to be able to upload artifacts via multipart uploads, and the change we are proposing would not be compatible with this.
- Any changes within this critical code path will need to be carefully reviewed to avoid introducing any performance bottlenecks or issues with artifact generation.
- The endpoint which handles the upload of artifacts does not deal with files, but rather streams the upload directly into object storage. This makes it particularly difficult to perform the SHA-256 calculation of files within the zip due to the intricacies of the zip file format.
- Workhorse nodes have insufficient temporary storage available for this procedure.
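To make the zip-format difficulty above concrete: the index needed to locate individual entries (the central directory) sits at the end of a zip file and is found by scanning backwards, so per-entry hashing needs random access that a forward-only upload stream cannot provide. A minimal sketch, using the fact that an empty zip archive is nothing but the 22-byte End Of Central Directory record:

```ruby
# An empty zip archive is exactly the End Of Central Directory (EOCD) record:
# the 4-byte signature "PK\x05\x06" followed by 18 bytes of zeroed fields.
empty_zip = "PK\x05\x06".b + ("\x00" * 18).b

# Readers locate the central directory by scanning BACKWARDS from the end of
# the file for this signature -- a seek a streaming upload never allows.
eocd_offset = empty_zip.rindex("PK\x05\x06".b)

puts "archive size: #{empty_zip.bytesize} bytes, EOCD at offset #{eocd_offset}"
```

In a real archive the EOCD record sits after all entry data, which is why the whole bundle must be available on disk (or in memory) before any entry can be read.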
2. Perform the hashing within the runner
Pros:
- Trivially easy to perform hashing of the files, as we have access to them prior to compression.
Cons:
- From a security perspective, runners execute untrusted code, which makes them unsuitable for security-sensitive operations such as file hashing.
- Runners are not part of the trusted control plane. We would need to confirm that performing the SHA-256 calculation outside of the control plane would be acceptable for SLSA level 3.
- There are data integrity concerns during transmission to object storage: data corruption could alter the file after it was hashed, which we would need to detect and prevent.
- We would need to transmit these hashes alongside the zip file, which is not trivial as there are several intermediary steps between upload and metadata generation.
3. Performing the signing within PublishProvenanceService
This service is called by the PublishProvenanceWorker sidekiq worker. There is a proof of concept available for this implementation.
Pros:
- The cosign attest-blob command has a mandatory <BLOB> parameter, which in our case is the job artifact. Since we retrieve the job artifact to perform the hashing, we can reuse the file; if we performed the hashing in Workhorse, we would still need to download the blob for attestation.
- Any latency introduced would not impact Workhorse artifact generation.
- The code required to implement this can be written in Ruby, where most pipeline security developers have expertise.
- The solution is relatively simple.
- Mechanisms for retrieving and reading artifacts already exist within the GitLab Rails codebase, and can be used as an example.
- Secure coding guidelines for dealing with zip files are easily implemented within GitLab Rails.
Cons:
- Requires a download of the artifact file.
- Requires temporary storage for the downloaded artifacts bundle as well as the extracted artifact itself.
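A minimal sketch of what the hashing step inside PublishProvenanceService could look like, assuming the job artifact has already been extracted from the bundle to a temporary file; the artifact_sha256 helper name is hypothetical. Digesting in fixed-size chunks keeps memory usage bounded regardless of artifact size, unlike the console example earlier, which reads the whole entry into memory:

```ruby
require "digest"
require "tempfile"

# Hypothetical sketch: digest a file on disk in chunks so the whole
# artifact never has to fit in memory at once.
def artifact_sha256(path, chunk_size: 1 << 16)
  digest = Digest::SHA256.new
  File.open(path, "rb") do |io|
    while (chunk = io.read(chunk_size))
      digest.update(chunk)
    end
  end
  digest.hexdigest
end

# Usage with a stand-in file; the real service would operate on the job
# artifact extracted from the downloaded artifacts bundle.
Tempfile.create("artifact") do |f|
  f.binmode
  f.write("example artifact contents")
  f.flush
  puts artifact_sha256(f.path)
end
```

The resulting hex digest is what would be recorded as the subject digest of the provenance statement.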
Decision
The pipeline security team decided to choose PublishProvenanceService as the location in which to calculate the SHA-256 of artifacts. This was mainly because of the limitations, described above, that prevented us from implementing this within Workhorse.
Consequences
Positive
- We are not required to create several relatively complicated merge requests to change gitlab-runner and Workhorse.
- Implementation time is likely to be significantly reduced, provided we are able to easily mitigate any concerns around bandwidth and temporary storage space.
Negative
- We will need to retrieve the artifact file and save it in temporary storage in order to read it. The impacts of this are mitigated by limiting the maximum file size, as well as pre-allocating the file to ensure sufficient storage is available.
- If the SHA-256 is required multiple times, it will need to be persisted or cached. At the moment we only require the hash one time in order to create the provenance statement so this is not a concern.
- Does not address related issues such as “Hash artifacts before uploading”.
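The maximum-file-size mitigation mentioned under the negative consequences could be sketched as a guard checked before any temporary storage is allocated. The constant and helper name here are hypothetical; a real limit would come from instance configuration:

```ruby
# Hypothetical guard: reject artifacts whose declared size exceeds a
# configured maximum before downloading them to temporary storage.
MAX_ARTIFACT_BYTES = 1024 * 1024 * 1024 # 1 GiB; illustrative value only

def within_size_limit?(declared_size)
  declared_size.positive? && declared_size <= MAX_ARTIFACT_BYTES
end

puts within_size_limit?(500 * 1024 * 1024)      # a 500 MiB artifact passes
puts within_size_limit?(MAX_ARTIFACT_BYTES + 1) # an oversized artifact is rejected
```

Pre-allocating the temporary file up to the declared size, as described above, would then guarantee sufficient storage before the download starts.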