How to monitor and respond to issues with SAST Automatic Vulnerability Resolution?

When to use this runbook?

This runbook is intended to be used when there is a service degradation related to the SAST Automatic Vulnerability Resolution feature. Such degradation can be identified using the sources described in the Monitoring section below.

SAST Automatic Vulnerability Resolution

The SAST Automatic Vulnerability Resolution feature is built to, as the name implies, automatically resolve vulnerabilities tied to SAST rules that have been disabled or removed.

The feature depends on a number of building blocks:

Schema definition in security-report-schemas

Reports generated by security analyzer scans have their JSON schemas defined in the security-report-schemas repository. Automatic vulnerability resolution depends on one schema field (primary_identifiers), which is part of security-report-format, the parent schema for all other security report schemas, including sast-report-format.

SARIF module in analyzers/report package

The primary_identifiers field contains an exhaustive list of all identifiers the analyzer scans for (as opposed to only the identifiers detected), so a report may have zero vulnerabilities while scan.primary_identifiers still contains the full list. The list is generated while transforming a SARIF file into a SAST security report in the sarif.go module under the analyzers/report package.
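
As a rough illustration of the point above, the sketch below builds a report with no vulnerabilities whose scan section still lists every identifier the analyzer scanned for. The field layout, the identifier type and the rule names are simplified assumptions made for illustration; they are not copied from the actual schema.

```python
import json

# Hypothetical, heavily simplified report: only the fields discussed above.
report = {
    "vulnerabilities": [],  # nothing was detected in this scan
    "scan": {
        "type": "sast",
        # Every rule the analyzer scanned for, whether or not it fired.
        "primary_identifiers": [
            {"type": "semgrep_id", "name": "example-rule-1", "value": "example-rule-1"},
            {"type": "semgrep_id", "name": "example-rule-2", "value": "example-rule-2"},
        ],
    },
}


def scanned_identifier_keys(report):
    """Return the (type, value) pairs listed in scan.primary_identifiers."""
    identifiers = report.get("scan", {}).get("primary_identifiers", [])
    return {(item["type"], item["value"]) for item in identifiers}


print(json.dumps(report, indent=2))
print(sorted(scanned_identifier_keys(report)))
```

A report that lacks the field altogether yields an empty set here, which is the case the "How to turn automatic vulnerability resolution off?" section relies on.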

Dropped identifier processing within Rails application

While ingesting a security report within the gitlab-org/gitlab application, the IngestReportsService iterates through the scan primary identifiers and executes ScheduleMarkDroppedAsResolvedService for each scan type, which in turn schedules MarkDroppedAsResolvedWorker. The worker loops through all vulnerabilities whose identifiers match the disabled or dropped identifiers (i.e. those no longer present in the latest scan) to mark them as resolved.
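
The idea of a "dropped" identifier can be sketched as a set difference. This is a minimal Python illustration under stated assumptions, not the Rails implementation: the function names, the (type, value) tuples and the vulnerability dictionaries are all hypothetical.

```python
def dropped_identifiers(existing, primary):
    """Identifiers tied to existing vulnerabilities that the latest scan no longer lists."""
    if not primary:
        # Without primary identifiers there is nothing to compare against,
        # so nothing is treated as dropped.
        return set()
    return set(existing) - set(primary)


def vulnerabilities_to_resolve(vulnerabilities, dropped):
    """Pick the ids of vulnerabilities whose identifier was dropped."""
    return [v["id"] for v in vulnerabilities if v["identifier"] in dropped]


# Identifiers already stored for the project (hypothetical data).
existing = {("semgrep_id", "noisy-rule-123"), ("semgrep_id", "rule-a")}
# Identifiers listed by the latest scan (hypothetical data).
primary = {("semgrep_id", "rule-a"), ("semgrep_id", "rule-b")}

vulnerabilities = [
    {"id": 1, "identifier": ("semgrep_id", "noisy-rule-123")},
    {"id": 2, "identifier": ("semgrep_id", "rule-a")},
]

dropped = dropped_identifiers(existing, primary)
print(dropped)
print(vulnerabilities_to_resolve(vulnerabilities, dropped))
```

In this example only the vulnerability tied to noisy-rule-123 is selected, because that rule is absent from the latest scan's list.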

Below is a diagram showing the complete flow of the automatic vulnerability resolution feature.

flowchart TB
code --> analyzer_pipeline

subgraph analyzer_pipeline["analyzer pipeline"]
  direction LR
  analyzer["semgrep analyzer"] --> report_a["noisy-rule-123 dropped"]
  report_a --> report_b["scan.primary_identifiers populated"]
  report_b --> report_c("gl-sast-report.json")
end

analyzer_pipeline --> rails_application

subgraph rails_application["rails application"]
  ingest["IngestReportsService"] --> schedule["ScheduleMarkDroppedAsResolved"]
  schedule --> worker["MarkDroppedAsResolvedWorker"]
end

Monitoring

To monitor automatic vulnerability resolution, there are two primary sources of information: Sentry (sentry.io), which lists any errors occurring in the MarkDroppedAsResolvedWorker class over the last 24 hours, and the SAST Engineering dashboard on Kibana, which includes a number of panels monitoring certain workers and showing the volume of uploaded reports. See below for a list of panels of interest and a brief description of each.

SAST Report Uploads

Displays the 90th percentile of the file size of uploaded security reports, per 30 minutes. This is useful to see how big (or small) the security reports uploaded over a certain period of time are.

SAST Failing Workers Distribution

Shows the distribution of failing SAST-related Sidekiq workers over a period of time.

Vulnerabilities::MarkDroppedAsResolvedWorker Execution Time

Displays the 75th and 95th percentiles of the worker’s execution time.
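
For readers less familiar with percentile panels, the sketch below shows what the 75th and 95th percentiles of a set of execution times mean. The durations are made-up values, and the interpolation method is an arbitrary choice for illustration; Kibana may compute its percentiles differently (for example, with approximation algorithms).

```python
import statistics


def percentile(values, p):
    """Return the p-th percentile (1-99) using inclusive linear interpolation."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[p - 1]


# Hypothetical worker execution times in seconds.
durations = [0.4, 0.5, 0.5, 0.7, 0.9, 1.2, 1.5, 2.0, 8.0, 30.0]

p75 = percentile(durations, 75)
p95 = percentile(durations, 95)
print(f"p75={p75:.2f}s p95={p95:.2f}s")
```

A large gap between the two values, as in this made-up data, suggests that a small share of executions is much slower than the rest.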

Vulnerabilities::MarkDroppedAsResolvedWorker Job Status

Shows the count of job executions, split by job status, per hour. This is useful to gauge the number of failing, deduplicated, or successful executions over a certain period of time.

Top Projects for MarkDroppedAsResolvedWorker Executions

Shows the top projects listed by the count of their worker executions. This can be useful to see if a certain customer is experiencing an issue.
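
The ranking behind such a panel can be approximated offline from exported log records. The sketch below assumes each record carries a "project" field; that field name and the sample data are hypothetical.

```python
from collections import Counter


def top_projects(executions, limit=3):
    """Return (project, count) pairs ordered by number of worker executions."""
    counts = Counter(record["project"] for record in executions)
    return counts.most_common(limit)


# Hypothetical exported records, one per worker execution.
executions = [
    {"project": "group/app"},
    {"project": "group/app"},
    {"project": "group/app"},
    {"project": "group/lib"},
    {"project": "group/lib"},
    {"project": "other/tool"},
]

for project, count in top_projects(executions):
    print(project, count)
```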

Logs

Additionally, you may want to check the following two saved searches in production logs:

What to do if something goes wrong?

  1. Start by looking at the monitoring section above. Check if MarkDroppedAsResolvedWorker has any failures.
  2. Look at the logs, and see if the issue is possibly due to a query timing out while executing a database write operation (e.g. trying to resolve a huge number of findings).
  3. Consider turning automatic vulnerability resolution off.

Possible Checks

  • If there’s an increase in error rates in relation to automatic vulnerability resolution, there’s a possibility that it is related to the timeout issue that can occur when a very high number of vulnerability findings are being resolved.
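
One way to run this check on exported log records is sketched below. The record fields (job_status, exception_class), the status values and the use of ActiveRecord::QueryCanceled as the marker for a database timeout are assumptions made for illustration; adjust them to whatever the production logs actually contain.

```python
def summarize_failures(records, timeout_class="ActiveRecord::QueryCanceled"):
    """Compute the failure rate and the share of failures that look like timeouts."""
    total = len(records)
    failures = [r for r in records if r["job_status"] == "fail"]
    timeouts = [r for r in failures if r["exception_class"] == timeout_class]
    return {
        "failure_rate": len(failures) / total if total else 0.0,
        "timeout_share": len(timeouts) / len(failures) if failures else 0.0,
    }


# Hypothetical log records for MarkDroppedAsResolvedWorker executions.
records = [
    {"job_status": "done", "exception_class": None},
    {"job_status": "fail", "exception_class": "ActiveRecord::QueryCanceled"},
    {"job_status": "fail", "exception_class": "ActiveRecord::QueryCanceled"},
    {"job_status": "deduplicated", "exception_class": None},
]

summary = summarize_failures(records)
print(summary)
```

A high timeout share alongside a rising failure rate would point towards the timeout scenario rather than an unrelated bug.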

How to turn automatic vulnerability resolution off?

The presence of primary_identifiers is required for report ingestion and automatic vulnerability resolution. If automatic vulnerability resolution is not working as expected, consider stopping automatic resolution by ensuring scans do not have primary_identifiers included in the generated reports. To do so, consider one of the following options:

  1. Update the sarif.go module to revert the change introduced in this merge request.
  2. Update the ScheduleMarkDroppedAsResolvedService#dropped_identifiers method to return early regardless of the existence of primary_identifiers.
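
The second option amounts to a guard at the top of the method. The Python sketch below only illustrates that shape; it is not the Ruby method, and the RESOLUTION_DISABLED flag, the function signature and the data are hypothetical.

```python
# Hypothetical switch; in practice the early return would be a code change
# in the Ruby method named above.
RESOLUTION_DISABLED = True


def dropped_identifiers_with_guard(existing, primary, disabled=RESOLUTION_DISABLED):
    """Return dropped identifiers, or nothing at all when resolution is disabled."""
    if disabled:
        # Return early regardless of whether primary identifiers exist,
        # so no vulnerabilities get scheduled for resolution.
        return set()
    return set(existing) - set(primary)


existing = {"noisy-rule-123", "rule-a"}
primary = {"rule-a"}

print(dropped_identifiers_with_guard(existing, primary))
print(dropped_identifiers_with_guard(existing, primary, disabled=False))
```
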
Last modified June 27, 2024: Fix various vale errors (46417d02)