EPSS Support ADR 002: Use a new bucket for EPSS data

Context

PMDB exports data to GCP buckets, and GitLab instances later pull that data. Advisory data and license data are stored in different buckets. This is sensible because the two are not directly related; each provides separate additional information about packages. Data is updated based on deltas, that is, changes from the previous state of the data. Only those changes are saved with each addition to the database.

EPSS data is directly associated with advisories, so a simple solution would be to add it as another directory of data to the existing advisories bucket to keep them grouped together.

  • v2/<purl_type>/<timestamp>/<sequence>.ndjson contains advisories.
  • v2/<purl_type>/epss/<timestamp>/<sequence>.ndjson or v2/epss/<timestamp>/<sequence>.ndjson would contain EPSS data.

However, the current advisories bucket is structured by purl_type. Adding an epss data type would couple EPSS with purl_type, which is a faulty pairing: EPSS data corresponds to a CVE, and a single CVE might affect multiple packages that have different PURL types. Because of the tight coupling between purl_type and the existing advisories bucket, adding epss to it would be difficult and convoluted.
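As a minimal sketch of the coupling problem, the snippet below builds object paths that follow the two candidate layouts listed above. The function names, the timestamp, and the set of PURL types are hypothetical and chosen only for illustration; the parser is an assumption about how a consumer of the v2/<purl_type>/... layout might behave, not the actual GitLab sync code.

```python
def epss_paths_under_purl_type(purl_types, timestamp, sequence):
    """EPSS file paths if EPSS lived under each purl_type directory."""
    return [
        f"v2/{purl_type}/epss/{timestamp}/{sequence}.ndjson"
        for purl_type in sorted(purl_types)
    ]


def parse_purl_type(path):
    """Hypothetical consumer that assumes the v2/<purl_type>/... layout."""
    _version, purl_type, *_rest = path.split("/")
    return purl_type


# Hypothetical: a single CVE that affects packages of three PURL types.
affected_purl_types = {"npm", "maven", "pypi"}

# Layout 1: the same CVE score would have to be written once per PURL type.
duplicated_paths = epss_paths_under_purl_type(affected_purl_types, 1730419200, 0)

# Layout 2: a top-level epss directory is read as if "epss" were a PURL type.
misread_purl_type = parse_purl_type("v2/epss/1730419200/0.ndjson")

print(duplicated_paths)
print(misread_purl_type)
```

Under these assumptions, the first layout duplicates one CVE's score across several directories, and the second requires every consumer of the bucket to special-case "epss".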

Alternatively, we could add the EPSS data directly to the relevant advisory JSON objects. However, since EPSS data is updated daily, this would risk frequently re-exporting large numbers of entire advisories whenever their EPSS scores change. This solution was rejected to avoid the resulting performance issues.
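The effect on delta size can be sketched as follows. This is an illustration under stated assumptions, not the PMDB exporter's logic: the delta function, the placeholder advisory body, the CVE identifiers, and the scores are all hypothetical.

```python
import json


def delta(previous, current):
    """Return only the entries that are new or changed since the previous state."""
    return {
        key: value
        for key, value in current.items()
        if key not in previous or previous[key] != value
    }


# Placeholder advisory content; real advisories carry far more fields.
advisory_body = {"title": "Example advisory", "description": "x" * 2000}

# Hypothetical EPSS scores on two consecutive days; only one score changes.
day1_scores = {"CVE-0000-0001": 0.10, "CVE-0000-0002": 0.20}
day2_scores = {"CVE-0000-0001": 0.11, "CVE-0000-0002": 0.20}


def embed(scores):
    """Advisory objects with the EPSS score embedded as a field."""
    return {cve: {**advisory_body, "epss": score} for cve, score in scores.items()}


# Embedded: the whole advisory object is re-exported for a one-field change.
embedded_delta = delta(embed(day1_scores), embed(day2_scores))

# Separate: only the small EPSS record is exported.
separate_delta = delta(day1_scores, day2_scores)

embedded_bytes = len(json.dumps(embedded_delta))
separate_bytes = len(json.dumps(separate_delta))
print(embedded_bytes, separate_bytes)
```

In this sketch a single score change exports one full advisory in the embedded case but only one small record in the separate case; the gap grows with the number of scores that change each day.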

Following extensive discussions on the EPSS epic and during the refinement of PMDB issues, it was initially decided to use the existing bucket, as this felt most intuitive and, at the time, seemed the healthier approach. Further discussion during the refinement of the GitLab backend effort led to the decision to use a new bucket, due to the complexity of the coupling between purl_type and other, unrelated areas in the monolith. Adding epss to purl_type would impact other components, and we want to avoid having to work around that. We may want to simplify these areas and reconsider the bucket structure at a later stage.

Decision

Export EPSS data to a new bucket, rather than exporting it into the existing PMDB advisories bucket.

Consequences

The implementation is simpler than adding a directory to the existing advisories bucket, but may feel less intuitive. This change requires Terraform changes to provision the new bucket. The new bucket must also be addressed in the exporter and in the GitLab package_metadata sync configuration.

Alternatives

One option is to add EPSS data to the advisories bucket as another directory, since the two are directly related. This was the initial decision. It would allow us to reuse existing mechanisms and keep related data close. However, EPSS data doesn’t fit into the current structure of the advisories bucket. An ideal solution would restructure the buckets to better fit this approach, but this would be a big effort and is not critical enough to justify it.

Another option is to add EPSS data directly into advisory JSON objects. This was rejected because of how often EPSS is updated. EPSS scores change daily, which may require exporting many entire advisories for a change to a single EPSS field, greatly increasing the daily deltas and the performance impact on GitLab instances.

Last modified November 1, 2024: Remove trailing spaces (6f6d0996)