Complex SPDX Expression Evaluation
Status | Authors | Coach | DRIs | Owning Stage | Created |
---|---|---|---|---|---|
proposed |
hacks4oats
|
johncrowley
tkopel
|
devops application security testing | 2025-01-06 |
Summary
SPDX license expressions provide a clear, standardized, and machine-readable way to document the software licenses associated with code, ensuring compliance with licensing obligations. Standardized expressions help users manage legal risks, meet licensing requirements, and easily share software components for reuse. Expressions can be simple, for example an expression composed of singular SPDX license identifier, or complex, like when more than one expressions are combined using boolean operators.
Motivation
Projects are increasingly adopting licensing models that make use of complex license expressions. For example, a project might require both Apache-2.0 and MIT licenses, or it might let users choose between Apache-2.0 or MIT licenses. Unfortunately, there’s no way to express this with singular license expressions, so project’s that depend on packages that utilize complex license expressions are left with a dependency list that only mentions an unknown license. This results in a user experience where it’s impossible to quickly view the licensing options of a dependency, and also presents a problem when a user tries to craft a license approval policy to fit such an expression. Supporting complex expressions will improve the development workflow for projects that work with dependencies that use complex licensing models.
Goals
- Support for simple license identifiers, and the
AND
andOR
boolean operators. - Populate dependency list with license expressions and not just license
identifiers. For example, a dependency with a
Apache-2.0 AND MIT
license will show up as such in the dependency list. - Enforce license approval policies on projects that use expressions. For
example, if a license approval policy only approves of
Apache-2.0
, and a proposed dependency has aApache-2.0 AND MIT
license, the dependency should not pass the approval policy. On the other hand, if the approval has acceptance for both licenses, it should pass. - Performance should be acceptable for groups with many projects, and projects with many dependencies.
Non-Goals
- The first iteration will not include for the following:
WITH
operator+
operator- References using
AdditionRef-
,DocumentRef-
orLicenseRef-
Proposal
At a high level, license expressions will propagate from the Package Metadata Database (PMDB). These expressions will surface in the dependency list at both the project and group level, and will also be used when evaluating the approval status of a dependency’s license.
Design and implementation details
SPDX license expressions are supported across ecosystems at varying levels. The following table depicts the level at which they’re supported by each ecosystem.
Ecosystem | Support level |
---|---|
cargo |
fully supported |
cocoapods |
one or more identifiers - equivalent to conjunctive AND |
composer |
fully supported |
conan |
one or more identifiers - equivalent to conjunctive AND |
gem |
one or more identifiers - equivalent to conjunctive AND |
golang |
one or more identifiers - equivalent to conjunctive AND |
maven |
one or more identifiers - equivalent to conjunctive AND |
npm |
fully supported |
pypi |
one or more identifiers - equivalent to conjunctive AND |
swift |
one or more identifiers - equivalent to conjunctive AND |
Licensing options
At a high level, a SPDX expression can be compiled down into a set of license combinations. Take the following expression, for example.
Apache-2.0 AND (MIT OR GLP-3.0)
The expression is human readable, deterministic, but it’s far from optimized for our use cases. We can improve on this by expanding the expression into the different licensing options it represents. In this case, the options could be represented like so:
[
["Apache-2.0", "MIT"],
["Apache-2.0", "GLP-3.0"]
]
From here, it’s easy to see how a license approval policy can be evaluated against the different licensing options. If an approval policy denies any of the licenses, the next license option can be tested for denial. The change in logic is minimal, with the only change being an additional outer loop in the violation checking logic.
Evaluation
The solution above is functional, but it must be further improved to meet our performance requirements. As previously mentioned, license scanning must remain performant even for the largest of projects. This solution does not provide that, and we can demonstrate this with the following example.
Say that we scan a project with the following components and licenses, and have an approval policy that we want to evaluate against for policy violations.
{
"packageA": [ ["LicenseA", "LicenseB"], ["LicenseA", "LicenseC"]],
"packageB": [ ["LicenseD", "LicenseE"], ["LicenseD", "LicenseF"]],
"packageC": [ ["LicenseG", "LicenseH"], ["LicenseG", "LicenseI"]],
}
Before evaluating the license approval policies, we’d need to clean up the data, so that we have a singular array of licensing options to evaluate against. The process requires us to combine the options in order to compact the arrays. On first pass, we’d have the following options.
[
["LicenseA", "LicenseB"],
["LicenseA", "LicenseC"],
]
On second pass, we’d need to combine the existing options with the next set of licensing options, so we’ll end up with something like so.
[
["LicenseA", "LicenseB", "LicenseD", "LicenseE"],
["LicenseA", "LicenseC", "LicenseD", "LicenseE"],
["LicenseA", "LicenseB", "LicenseD", "LicenseF"],
["LicenseA", "LicenseC", "LicenseD", "LicenseF"],
]
The amount of options has grown considerably, but we’re still not done.
We must do a third and final pass to incorporate the remaining licensing
options for packageC
. Doing so results in the following outcome.
[
["LicenseA", "LicenseB", "LicenseD", "LicenseE", "LicenseG", "LicenseH"],
["LicenseA", "LicenseC", "LicenseD", "LicenseE", "LicenseG", "LicenseH"],
["LicenseA", "LicenseB", "LicenseD", "LicenseF", "LicenseG", "LicenseH"],
["LicenseA", "LicenseC", "LicenseD", "LicenseF", "LicenseG", "LicenseH"],
["LicenseA", "LicenseB", "LicenseD", "LicenseE", "LicenseG", "LicenseI"],
["LicenseA", "LicenseC", "LicenseD", "LicenseE", "LicenseG", "LicenseI"],
["LicenseA", "LicenseB", "LicenseD", "LicenseF", "LicenseG", "LicenseI"],
["LicenseA", "LicenseC", "LicenseD", "LicenseF", "LicenseG", "LicenseI"],
]
From this example, we can see how quickly the search space can become because we’re essentially getting the cartesian product of different sets. This combinatorial explosion requires us to further refine the implementation, so that it remains performant should we be presented with a large amount of dual licensed packages.
To do this, we can make use of bit arrays to act as a vector of licenses. At a high level the representation would work like this.
- Every bit in the bit array will represent the detection of a license.
- There are 729 known licenses in the SPDX library. If we account for this, future growth, and word alignment, we can use a bit array with 1024 bits.
- We can XOR the bit arrays to detect for approvals or violations. Bitwise operations are fast, and can also be parallelized by the processor.
To demonstrate the implementation, we can do an example using a smaller set of available options. Say that projects have the option of using one or more out of eight licenses. We could represent the licenses like so.
Name | Leftmost Index (MSB) |
---|---|
LicenseA | 0 |
LicenseB | 1 |
LicenseC | 2 |
LicenseD | 3 |
LicenseE | 4 |
LicenseF | 5 |
LicenseG | 6 |
LicenseH | 7 |
Let’s say that there was a policy that disallowed LicenseF
, and we wanted to
evaluate it against the licensing options above. A list approach would have to
do a union operation between two lists. Ruby makes this readable and
easy to do:
all_denied_licenses = (licenses_from_report - licenses_from_policy).uniq
comparison_licenses = join_ids_and_names(license_ids, license_names)
policy_denied_license_names = (comparison_licenses - licenses_from_policy).uniq
violates_license_policy = policy_denied_license_names.present?
For a singular set of licensing options, it may be acceptable, but using this for larger sets of data would begin to deteriorate performance. In comparison, the bit array approach would look like so:
licenses_from_policy = 0b00000100
licenses_from_report = 0b01010010
def is_compliant?(licenses_from_policy, licenses_from_report)
(licenses_from_report & ~licenses_from_policy) == 0
end
is_compliant?(licenses_from_policy, licenses_from_report)
=> false
Storage
PMDB
The exporter service for our Package Metadata Database will be in charge of parsing the SPDX expression, and converting it into an array of bitmasks.
GitLab Rails
The GitLab Rails database will no longer rely on an enum to identify the license used by a project. Instead it will add an additional column that stores the original license expression, and its equivalent bitmask representations.
Alternative Solutions
9f4f0bc1
)