EPSS Support ADR 002: Use a new bucket for EPSS data
Context
PMDB exports data to GCP buckets. The data is later pulled by GitLab instances. Advisory data and license data are stored in different buckets. This is sensible, because advisory and license data are not directly related, and rather provide additional information about packages. Data is updated based on deltas—changes from the previous state of the data. Only those changes are saved with each addition to the database.
EPSS data is directly associated with advisories, so a simple solution would be to add it as another directory of data to the existing advisories bucket to keep them grouped together.
v2/<purl_type>/<timestamp>/<sequence>.ndjson
contains advisories.v2/<purl_type>/epss/<timestamp>/<sequence>.ndjson
orv2/epss/<timestamp>/<sequence>.ndjson
would contain EPSS data.
However, the current advisories bucket is structured based on purl_type
. Adding an epss
data type would couple epss
with purl_type
which is a faulty pairing. (EPSS data corresponds to a CVE, and a single CVE might affect multiple packages that have different PURL types.) Due to the tight coupling between purl_type
and the existing advisories bucket, it would be difficult and convoluted to add epss
to it.
Furthermore, we could add the EPSS data directly to the relevant advisory JSON objects. However, since EPSS data is updated daily, this would risk frequent exports of large amounts of entire advisories due to many changing EPSS scores. This solution was not chosen to avoid performance issues.
Following extensive discussions on the EPSS epic and discussion during the refinement of PMDB issues, it was initially decided to use the existing bucket as this feels most intuitive and at the time felt a healthier approach. Further discussion during the refinement of the GitLab backend effort led to the decision to use a new bucket, due to the complexity of the coupling of purl_type
and other, unrelated areas in the monolith. Adding epss
to purl_type
would impact other components and we want to avoid having to work around that. We may want to later simplify these areas and reconsider the bucket structure at a later stage.
Decision
Export EPSS data to a new bucket, rather than exporting it into the existing PMDB advisories bucket.
Consequences
The implementation is simpler than adding a directory to the existing advisories bucket, but may feel less intuitive.
This change require the relevant Terraform changes regarding the provisioning of a new bucket.
This should also be addressed in the exporter and the GitLab package_metadata
sync configuration.
Alternatives
One option is to add EPSS data to the advisories bucket as another directory, since they are directly related. This was the initial decision. This would allow us to utilize existing mechanisms and keep related data close. However, EPSS data doesn’t fit into the current structure of the advisories bucket. An ideal solution would reconstruct the buckets in a manner more fitting for this approach, but this would be a big effort and is not critical enough.
Another option is to add EPSS data directly into advisory JSON objects. This was rejected due to the update nature of EPSS. EPSS is updated daily, which may lead to having to export many entire advisories due to changes to the EPSS field, greatly increasing daily deltas and performance impact on GitLab instances.
6f6d0996
)