Secret Push Protection Monitoring

When to use this runbook?

This runbook is intended for monitoring the secret push protection feature, to identify and mitigate any reliability issues or performance regressions that may occur when it is enabled on GitLab.com. It can also be used to learn more about the relevant dashboards below and how to improve them:

What to monitor?

While the feature, in its current form, has no external components and is entirely encapsulated within the application server as a dependency, it does interact with a number of components, as can be seen in this push event sequence diagram. Those components are:

  • GitLab Shell (Git over SSH):
    • git-receive-pack
  • Workhorse (Git over HTTP/S):
    • git-receive-pack
  • Gitaly:
    • SSHReceivePack
    • PostReceivePack
    • PreReceiveHook
    • ListAllBlobs() RPC
    • ListBlobs() RPC
    • GetTreeEntries() RPC
  • Rails:
    • /internal/allowed Endpoint

Below is a sequence diagram showing the entire workflow whether a git push takes place over HTTP or SSH:

sequenceDiagram
    actor User
    User->>+Workhorse/GitLab Shell: git push
    Workhorse/GitLab Shell->>+Gitaly: tcp/ssh
    Gitaly->>+Gitaly: PostReceivePack/SSHReceivePack
    Gitaly->>+Gitaly: git-receive-pack
    Gitaly->>+Gitaly: PreReceiveHook
    Gitaly->>+Rails: grpc
    Note over Gitaly, Rails: invokes /internal/allowed endpoint
    Rails->>+Rails: GitLab::GitAccess
    Rails->>+Rails: EE::GitLab::Checks::ChangesAccess
    Note over Rails: runs Gitlab::Checks::SecretsCheck
    break when special commit message flag is found
        Rails->>+Gitaly: push check skipped
        Gitaly->>+Workhorse/GitLab Shell: outcome of push
        Workhorse/GitLab Shell->>+User: outcome of push
    end
    break when push option is passed
        Rails->>+Gitaly: push check skipped
        Gitaly->>+Workhorse/GitLab Shell: outcome of push
        Workhorse/GitLab Shell->>+User: outcome of push
    end
    Rails->>+Gitaly: ListBlobs or ListAllBlobs
    Note over Gitaly, Rails: depends on quarantine directory existence
    Gitaly->>+Rails: grpc
    Rails->>+gitlab-secret_detection: gitlab-secret_detection::Scan
    alt no secret detected
        gitlab-secret_detection->>+gitlab-secret_detection: scan blob
        gitlab-secret_detection->>+Rails: success
        Rails->>+Gitaly: accept - no secret detected
    else scan timeout
        gitlab-secret_detection->>+gitlab-secret_detection: scan blob
        gitlab-secret_detection->>+Rails: fail - timeout
        Rails->>+Gitaly: accept - scan timeout
    else secret detected
        gitlab-secret_detection->>+gitlab-secret_detection: scan blob
        gitlab-secret_detection->>+Rails: fail - secret found
        Rails->>+Gitaly: GetTreeEntries
        Note over Gitaly, Rails: retrieves blobs' file path and commit sha
        Gitaly->>+Rails: grpc
        Rails->>+Rails: Format Response
        Rails->>+Gitaly: reject - secret detected
    end
    Gitaly->>+Workhorse/GitLab Shell: outcome of push
    Workhorse/GitLab Shell->>+User: outcome of push

Note: PreReceiveHook is not to be confused with git’s pre-receive hook. In fact, the former is a binary wrapper around the actual git hook. Please read more about the hook setup in Gitaly’s documentation.

These components are therefore the main elements we are trying to focus on when monitoring the feature.
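The branches of the sequence diagram above can be condensed into a small decision function. This is only a sketch for orientation: the `Push` class, its fields, and `push_outcome` are hypothetical names and do not correspond to the Rails implementation; the returned strings are the outcomes shown in the diagram.

```python
from dataclasses import dataclass
from enum import Enum


class ScanResult(Enum):
    CLEAN = "clean"
    TIMEOUT = "timeout"
    SECRET_FOUND = "secret_found"


@dataclass
class Push:
    # Hypothetical stand-ins for state that the real check derives itself.
    has_skip_commit_flag: bool = False
    has_skip_push_option: bool = False
    scan_result: ScanResult = ScanResult.CLEAN


def push_outcome(push: Push) -> str:
    """Return the outcome sent back to Gitaly, following the diagram's branches."""
    # Both "break" branches short-circuit before any blobs are listed or scanned.
    if push.has_skip_commit_flag or push.has_skip_push_option:
        return "push check skipped"
    if push.scan_result is ScanResult.CLEAN:
        return "accept - no secret detected"
    # A timeout does not block the push; it is accepted.
    if push.scan_result is ScanResult.TIMEOUT:
        return "accept - scan timeout"
    return "reject - secret detected"


print(push_outcome(Push(scan_result=ScanResult.TIMEOUT)))
```

Note that only the "secret detected" branch rejects the push, which is why latency in the scan path matters more for user experience than for availability.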

How do we monitor the feature?

As discussed above, the functionality spans a number of components. Therefore, there are three main tools we could use for monitoring the feature:

This runbook focuses primarily on the Prometheus metrics available in Grafana, but also shares brief information about other tools and how they could be used. In later iterations, this may change as the feature grows and develops.

How to check the logs emitted from the feature?

To check the logs emitted from the feature, please look at the following Kibana views:

Note: Kibana retains logs for only 7 days.

How to identify and mitigate a reliability or performance issue with the feature?

The overview dashboard is the main dashboard we have built to monitor the feature. That’s where anyone should start to look when trying to identify reliability or performance issues.

The dashboard itself is split into 4 rows (or sections), with each containing a number of panels as below.

GitLab Shell (Git over SSH)

This section monitors the stability of certain operations related to the feature within GitLab Shell, which is a set of executables created to handle Git SSH sessions. The tool itself does not handle SSH directly; instead, the SSH server/daemon gitlab-sshd maintains all connections with clients and calls Rails via GitLab Shell to perform authorization or access checks. Please check this diagram and this description of a request cycle for more information on how that works.

The section can be used to ensure there are no performance degradations related to git-receive-pack operations when a git push operation is carried out over SSH. It is divided into two rows/sections as follows.

Note: Most of the available metrics for both gitlab-shell and gitlab-sshd aren’t aggregated by the command used, so for a better overview of the performance of the git-receive-pack operation, take a look at the Kibana logs linked in those sections instead.

gitlab-shell

RPS (Requests Per Second)

This panel displays the average number of requests per second (RPS) made to gitlab-shell over time. The panel can be used to monitor request rates and understand if there’s a performance or scalability issue. Use the link for a more detailed overview of this metric. Note: this isn’t specific to the git-receive-pack command.

Panel Information

  • Metric: gitlab_component_ops:rate_5m
  • Label Filters:
    • component = gitlab_shell
    • env = gprd
    • monitor = global
    • stage = main
    • type = git
  • Operations:
    • Avg over time: range | $__interval
  • Legend:
    • RPS
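The panel information above maps fairly directly onto a PromQL expression. The sketch below shows one way to assemble it; the helper name `selector` is hypothetical, and the resulting string is an assumption about the query's shape rather than a copy of the dashboard's actual query.

```python
def selector(metric, labels):
    """Render a metric name plus equality matchers as a PromQL vector selector."""
    matchers = ",".join(f'{name}="{value}"' for name, value in labels.items())
    return f"{metric}{{{matchers}}}"


# Label filters copied from the panel information above.
rps_labels = {
    "component": "gitlab_shell",
    "env": "gprd",
    "monitor": "global",
    "stage": "main",
    "type": "git",
}

# "Avg over time: range | $__interval" wraps the selector in avg_over_time().
rps_query = f"avg_over_time({selector('gitlab_component_ops:rate_5m', rps_labels)}[$__interval])"
print(rps_query)
```

`$__interval` is a Grafana variable, so the printed expression is meant for a Grafana panel and would need a concrete range such as `5m` to run directly against Prometheus.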

Total Established Gitaly Connections

This panel displays the total number of Gitaly connections that have been established by gitlab-shell at a given time. This panel can be used to determine if there’s a sudden drop in connections between both components, which may indicate a performance or an availability issue. Note: this isn’t specific to git-receive-pack command.

Panel Information

  • Metric: gitlab_shell_gitaly_connections_total
  • Label Filters:
    • env = gprd
    • stage = main
  • Operations:
    • Count:
      • Label: status
  • Legend:
    • Auto

Established SSH Sessions

This panel displays the minimum number of established SSH sessions at a given time. Together with the adjacent panel, which shows the maximum number of SSH sessions that failed to establish at a given time, it can be used to understand if there’s an availability issue. Note: this isn’t specific to the git-receive-pack command.

Panel Information

  • Metric: gitlab_sli:shell_sshd_sessions:total
  • Label Filters:
    • env = gprd
    • stage = main
    • type = git
  • Operations:
    • Min
  • Legend:
    • Auto

Failed SSH Sessions

This panel displays the maximum number of failed SSH sessions at a given time. Together with the adjacent panel, which shows the minimum number of SSH sessions established at a given time, it can be used to understand if there’s an availability issue. Note: this isn’t specific to the git-receive-pack command.

Panel Information

  • Metric: gitlab_sli:shell_sshd_sessions:errors_total
  • Label Filters:
    • env = gprd
    • stage = main
    • type = git
  • Operations:
    • Max by:
      • Label: app
  • Legend:
    • Auto

Established Session Average Duration

This panel displays the average duration of established SSH sessions summed up over a range of 24 hours. The panel can be used to determine if there’s an increase in the duration of a git pull/push over SSH, which may indicate a performance or availability issue. Note: this isn’t specific to the git-receive-pack command.

Panel Information

  • Metrics:
    • gitlab_shell_sshd_session_established_duration_seconds_sum
    • gitlab_shell_sshd_session_established_duration_seconds_count
  • Label Filters:
    • env = gprd
    • stage = main
    • type = git
  • Operations:
    • Division: /
    • Rate: range | 24h
    • Sum
  • Legend:
    • {{label_name}}
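Dividing the rate of a `_sum` counter by the rate of the matching `_count` counter yields the average duration per session, because the window length cancels out. The arithmetic can be sketched as follows; the function name and the sample values are illustrative only, and counter resets are deliberately ignored.

```python
def average_duration(sum_start, sum_end, count_start, count_end):
    """Average seconds per session between two readings of the _sum and _count counters."""
    sessions = count_end - count_start
    if sessions <= 0:
        # No sessions finished in the window, so there is no average to report.
        return None
    return (sum_end - sum_start) / sessions


# Illustrative readings taken 24 hours apart:
# 300 seconds of total session time were added across 100 new sessions.
print(average_duration(sum_start=100.0, sum_end=400.0, count_start=10, count_end=110))
```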

All Sessions Average Duration

This panel displays the average duration of all SSH sessions (whether established or failed) summed up over a range of 24 hours. The panel can be used to determine if there’s an increase in the duration of a git pull/push over SSH which may indicate a performance or availability issue. Note: this isn’t specific to git-receive-pack command.

Panel Information

  • Metrics:
    • gitlab_shell_sshd_session_duration_seconds_sum
    • gitlab_shell_sshd_session_duration_seconds_count
  • Label Filters:
    • env = gprd
    • stage = main
    • type = git
  • Operations:
    • Division: /
    • Rate: range | 24h
    • Sum
  • Legend:
    • {{label_name}}

gitlab-sshd

SLI Apdex

This panel displays the application performance index (Apdex) for the gitlab-sshd SSH daemon/server. This Service Level Indicator (SLI) averages close to 99.9% most of the time, but a drop in the indicator could point to an outage or degradation. Use the link for a more detailed overview of this metric. Note: this isn’t specific to the git-receive-pack command.

Panel Information

  • Metric: gitlab_component_apdex:ratio_5m
  • Label Filters:
    • component = gitlab_sshd
    • env = gprd
    • monitor = global
    • stage = main
    • type = git
  • Operations:
    • Min over time: range | $__interval
  • Legend:
    • Apdex

SLI Error Ratio

This panel displays the maximum error ratio, clamped to a maximum value of 1, for the gitlab-sshd SSH daemon/server. This Service Level Indicator (SLI) averages close to 0.01% most of the time, but an increase in the indicator could point to an outage or degradation. Use the link for a more detailed overview of this metric. Note: this isn’t specific to the git-receive-pack command.

Panel Information

  • Metric: gitlab_component_errors:ratio_5m
  • Label Filters:
    • component = gitlab_sshd
    • env = gprd
    • monitor = global
    • stage = main
    • type = git
  • Operations:
    • Clamp max:
      • Maximum Scalar: 1
      • Max over time: range | $__interval
  • Legend:
    • Error %
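The two operations on this panel take the peak value in each interval and then cap it at 1, so a malformed ratio can never be displayed as more than 100%. A minimal sketch of that behaviour, using a hypothetical function name and made-up sample values:

```python
def clamped_peak(samples, ceiling=1.0):
    """Mimic clamp_max(max_over_time(...), 1): take the peak, then cap it at the ceiling."""
    return min(max(samples), ceiling)


# A healthy interval: the peak error ratio is far below the ceiling and passes through.
print(clamped_peak([0.0001, 0.0003, 0.0002]))

# A bogus ratio above 1 is capped so the panel never shows more than 100%.
print(clamped_peak([0.2, 1.7]))
```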

SLI RPS (Requests Per Second)

This panel displays the average number of requests per second (RPS) made to the gitlab-sshd SSH daemon/server over time. The panel can be used to monitor request rates and understand if there’s a performance or scalability issue. Use the link for a more detailed overview of this metric. Note: this isn’t specific to the git-receive-pack command.

Panel Information

  • Metric: gitlab_component_ops:rate_5m
  • Label Filters:
    • component = gitlab_sshd
    • env = gprd
    • monitor = global
    • stage = main
    • type = git
  • Operations:
    • Avg over time: range | $__interval
  • Legend:
    • RPS

Workhorse (Git over HTTP/S)

This section monitors the stability of certain operations related to the feature within Workhorse, which is a smart reverse proxy intended to handle resource-intensive and long-running requests. It intercepts all HTTP requests and either propagates them without changing or handles them itself by performing additional logic. Please check this diagram and this description of a request cycle for more information on how that works.

The section can be used to ensure there are no performance degradations related to git-receive-pack operations when a git push operation is carried out over HTTP/S.

Processed /.git/git-receive-pack Requests

This panel displays the number of HTTP requests that have been processed by Workhorse over time, as an increase over a range of 24 hours. The panel partitions these requests by HTTP verb/method and response code. It can be used to determine if the number of git-receive-pack requests with a response code other than 200 has increased recently, indicating an issue with processing such requests.

Panel Information

  • Metric: gitlab_workhorse_git_http_requests
  • Label Filters:
    • exported_service = git-receive-pack
    • env = gprd
    • stage = main
    • code != 0
  • Operations:
    • Increase: range | 24h
    • Sum:
      • Label: code
      • Label: method
  • Legend:
    • {{code}} | {{method}}
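When reading this panel, it can help to reduce the per-code series to a single share of non-200 responses. The sketch below does that for a dictionary keyed by `(code, method)`; the function name and the numbers are made up for illustration and are not taken from the dashboard.

```python
def non_200_share(increase_by_code_method):
    """Return the fraction of requests whose response code is not 200."""
    total = sum(increase_by_code_method.values())
    if total == 0:
        return 0.0
    failed = sum(
        count
        for (code, _method), count in increase_by_code_method.items()
        if code != "200"
    )
    return failed / total


# Illustrative 24-hour increases, keyed the way the panel's legend is: code | method.
sample = {("200", "POST"): 950, ("403", "POST"): 30, ("500", "POST"): 20}
print(f"{non_200_share(sample):.1%}")
```

A rising share is more informative than a rising absolute count, since push volume itself varies over the day.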

Total Established Gitaly Connections

This panel displays the total number of Gitaly connections that have been established by workhorse at a given time. This panel can be used to determine if there’s a sudden drop in connections between both components, which may indicate a performance or an availability issue.

Panel Information

  • Metric: gitlab_workhorse_gitaly_connections_total
  • Label Filters:
    • env = gprd
    • stage = main
  • Operations:
    • Count:
      • Label: status
  • Legend:
    • {{status}}

Average Latency for /.git/git-receive-pack Request [All Nodes]

This panel displays the average latency (duration) in seconds for the /.git/git-receive-pack request for all nodes running Workhorse. This panel can be used to determine if there is an increase in response times for that specific request, which could indicate a performance degradation if it surpasses a certain threshold.

Panel Information

  • Metrics:
    • gitlab_workhorse_http_request_duration_seconds_sum
    • gitlab_workhorse_http_request_duration_seconds_count
  • Label Filters:
    • env = gprd
    • stage = main
    • route = ^/.+\\.git/git-receive-pack\\z (each backslash is double-escaped)
  • Operations:
    • Division: /
    • Rate: range | 24h
    • Sum:
      • Label: node
  • Legend:
    • Auto
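The double escaping in the route filter follows from how PromQL string literals work: inside double quotes a backslash starts an escape sequence, so a literal backslash in a label value has to be written twice. A small sketch of that escaping, with a hypothetical helper name:

```python
def promql_string(value):
    """Quote a label value as a double-quoted PromQL string literal."""
    # Escape backslashes first so the backslashes added for quotes are not doubled again.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


# The route label value as it is stored, with single backslashes.
route = r"^/.+\.git/git-receive-pack\z"

# Prints route="^/.+\\.git/git-receive-pack\\z", matching the label filter above.
print(f"route={promql_string(route)}")
```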

SLI Apdex

This panel displays the application performance index (Apdex) for the Workhorse component. This Service Level Indicator (SLI) averages close to 99.9% most of the time, but a drop in the indicator could point to an outage or degradation. Use the link for a more detailed overview of this metric. Note: this isn’t specific to the /.git/git-receive-pack route.

Panel Information

  • Metric: gitlab_component_apdex:ratio_5m
  • Label Filters:
    • component = workhorse
    • env = gprd
    • monitor = global
    • stage = main
    • type = git
  • Operations:
    • Min over time: range | $__interval
  • Legend:
    • Apdex

SLI Error Ratio

This panel displays the maximum error ratio, clamped to a maximum value of 1, for Workhorse. This Service Level Indicator (SLI) averages close to 0.001% most of the time, but an increase in the indicator could point to an outage or degradation. Use the link for a more detailed overview of this metric. Note: this isn’t specific to the /.git/git-receive-pack route.

Panel Information

  • Metric: gitlab_component_errors:ratio_5m
  • Label Filters:
    • component = workhorse
    • env = gprd
    • monitor = global
    • stage = main
    • type = git
  • Operations:
    • Clamp max:
      • Maximum Scalar: 1
      • Max over time: range | $__interval
  • Legend:
    • Error %

SLI RPS (Requests Per Second)

This panel displays the average number of requests per second (RPS) made to Workhorse over time. The panel can be used to monitor request rates and understand if there’s a performance or scalability issue. Use the link for a more detailed overview of this metric. Note: this isn’t specific to the /.git/git-receive-pack route.

Panel Information

  • Metric: gitlab_component_ops:rate_5m
  • Label Filters:
    • component = workhorse
    • env = gprd
    • monitor = global
    • stage = main
    • type = git
  • Operations:
    • Avg over time: range | $__interval
  • Legend:
    • RPS

Gitaly

This section monitors the stability of Gitaly, a tool providing high-level RPC access to Git repositories. We focus in this section on the hooks and RPCs used by the feature. It can be used to ensure there are no performance degradations related to any of them.

The section is divided into three sub-sections as follows (the last of which is split in two), with most of the focus being on latency.

  1. GitLab Shell <=> Gitaly:
    • SSHReceivePack
  2. Workhorse <=> Gitaly:
    • PostReceivePack
  3. Gitaly <=> Rails API:
    • Gitaly / Before /internal/allowed:
      • PreReceiveHook
    • Gitaly / During /internal/allowed:
      • ListAllBlobs() RPC
      • ListBlobs() RPC
      • GetTreeEntries() RPC

Workhorse <=> Gitaly

PostReceivePack – Average Latency [All Hosts]

This panel displays the average latency on all hosts in milliseconds for calls to the PostReceivePack RPC, which is the RPC responsible for calling the git-receive-pack command (for git push over HTTP/S), which in turn executes the PreReceiveHook. The latter goes on to call the /internal/allowed endpoint, which runs access checks on the Rails side.

Panel Information

  • Metric: gitaly:grpc_server_handling_seconds:avg5m
  • Label Filters:
    • job = gitaly
    • grpc_method = PostReceivePack
  • Operations:
    • Avg: 1000 * avg
  • Legend:
    • {{method}}

GitLab Shell <=> Gitaly

SSHReceivePack – Average Latency [All Hosts]

This panel displays the average latency on all hosts in milliseconds for calls to the SSHReceivePack RPC, which is the RPC responsible for calling the git-receive-pack command (for git push over SSH), which in turn executes the PreReceiveHook. The latter goes on to call the /internal/allowed endpoint, which runs access checks on the Rails side.

Panel Information

  • Metric: gitaly:grpc_server_handling_seconds:avg5m
  • Label Filters:
    • job = gitaly
    • grpc_method = SSHReceivePack
  • Operations:
    • Avg: 1000 * avg
  • Legend:
    • {{method}}

Gitaly <=> Rails API

Gitaly / Before /internal/allowed

PreReceiveHook – Average Latency [All Hosts]

This panel displays the average latency on all hosts in milliseconds for calls to the PreReceiveHook hook, which in turn calls the /internal/allowed endpoint to run access checks on the Rails side.

Panel Information

  • Metric: gitaly:grpc_server_handling_seconds:avg5m
  • Label Filters:
    • job = gitaly
    • grpc_method = PreReceiveHook
  • Operations:
    • Avg: 1000 * avg
  • Legend:
    • {{method}}

Gitaly / During /internal/allowed

ListAllBlobs – Average Latency [All Hosts]

This panel displays the average latency in milliseconds for all calls to the ListAllBlobs RPC, which is responsible (within the context of the feature) for enumerating all blobs of a repository under a certain size limit (1 MiB). This procedure is usually fast because it is mostly used with the size limit set to 0, for checking the file sizes of blobs in a given git push.

Panel Information

  • Metric: gitaly:grpc_server_handling_seconds:avg5m
  • Label Filters:
    • job = gitaly
    • grpc_method = ListAllBlobs
  • Operations:
    • Avg: 1000 * avg
  • Legend:
    • {{method}}

ListBlobs – Average Latency [All Hosts]

This panel displays the average latency in milliseconds for all calls to the ListBlobs RPC, which is responsible (within the context of the feature) for enumerating all blobs of a repository under a certain size limit (1 MiB), similar to ListAllBlobs, but it also loads file paths for those blobs. The procedure is often slower than ListAllBlobs because it loads blob contents when enumerating them.

Panel Information

  • Metric: gitaly:grpc_server_handling_seconds:avg5m
  • Label Filters:
    • job = gitaly
    • grpc_method = ListBlobs
  • Operations:
    • Avg: 1000 * avg
  • Legend:
    • {{method}}

GetTreeEntries – Average Latency [All Hosts]

This panel displays the average latency in milliseconds for all calls to the GetTreeEntries RPC, which is responsible (within the context of the feature) for retrieving blob metadata (i.e. file path and commit sha) for all blobs that were scanned and found to include a leaked secret.

Panel Information

  • Metric: gitaly:grpc_server_handling_seconds:avg5m
  • Label Filters:
    • job = gitaly
    • grpc_method = GetTreeEntries
  • Operations:
    • Avg: 1000 * avg
  • Legend:
    • {{method}}
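All Gitaly panels above apply the same operation, `1000 * avg`: the recorded values are in seconds, so they are averaged across hosts and then multiplied by 1000 to display milliseconds. A minimal sketch of that conversion, with a hypothetical function name and made-up per-host values:

```python
from statistics import fmean


def average_latency_ms(per_host_seconds):
    """Average per-host latencies given in seconds and convert the result to milliseconds."""
    return 1000 * fmean(per_host_seconds)


# Illustrative per-host averages for a single RPC, in seconds.
print(average_latency_ms([0.120, 0.080, 0.100]))
```

Because this is an average across all hosts, a single slow Gitaly node can be hidden by healthy ones; a per-host breakdown is worth checking when the overall value looks only slightly elevated.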

Rails

This section monitors the stability of the /internal/allowed endpoint which is a focal point in the feature’s journey to protect against leaked secrets in a git push. The endpoint is part of GitLab’s Internal API, and is responsible for assessing if a user has permission to perform certain operations on the repository.

The section can be used to ensure there are no performance degradations related to the /internal/allowed endpoint when changes in a certain git push are scanned for secrets.

Internal API / Request Latency

This panel displays the average, p95, p99, and mean latencies for requests made to the /internal/allowed endpoint over time. The panel can be used to monitor request latency and understand if there’s a performance or scalability issue with that specific endpoint. Use the link for a more detailed overview of this metric.

Panel Information

  • Metrics:
    • controller_action:gitlab_transaction_duration_seconds_sum:rate5m
    • controller_action:gitlab_transaction_duration_seconds:p95
    • controller_action:gitlab_transaction_duration_seconds:p99
    • controller_action:gitlab_transaction_duration_seconds_sum:rate1m
    • controller_action:gitlab_transaction_duration_seconds_count:rate1m
  • Label Filters:
    • action = POST /api/internal/allowed
    • controller = Grape
    • environment = gprd
    • stage = main
    • type = internal-api
  • Operations:
    • Avg over time: range | $__interval
    • Avg:
      • Label: controller
      • Label: action
  • Legends:
    • {{action}} – avg
    • {{action}} – p95
    • {{action}} – p99
    • {{action}} – mean

Internal API / RPS (Requests Per Second)

This panel displays the average number of requests per second (RPS) made to the /internal/allowed endpoint over time. The panel can be used to monitor request rates and understand if there’s a performance or scalability issue with that specific endpoint. Use the link for a more detailed overview of this metric.

Panel Information

  • Metrics:
    • controller_action:gitlab_transaction_duration_seconds_count:rate1m
  • Label Filters:
    • action = POST /api/internal/allowed
    • controller = Grape
    • environment = gprd
    • stage = main
    • type = internal-api
  • Operations:
    • Avg over time: range | $__interval
    • Sum:
      • Label: controller
      • Label: action
  • Legends:
    • {{action}}

Internal API / Memory Saturation Rate

This panel displays the memory saturation rate for two components of the internal API: the Ruby VM and Puma workers. It helps to understand whether memory consumption in Rails has increased to the point of saturation, which indicates a performance and scalability issue that requires attention. Note: this panel isn’t specific to the /internal/allowed endpoint.

Panel Information

  • Metrics:
    • gitlab_component_saturation:ratio
  • Label Filters:
    • env = gprd
    • environment = gprd
    • stage = main
    • type = internal-api
    • component = ruby_thread_contention | puma_workers
  • Operations:
    • Max over time: range | $__interval
    • Max:
      • Label: component
  • Legends:
    • Auto
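The `|` in the component filter suggests a regular-expression matcher over two components, followed by a maximum grouped by the `component` label. The grouping step can be sketched as below; the function name and the sample ratios are illustrative, and the regex-matcher reading of the filter is an assumption.

```python
def max_by_component(samples):
    """Keep the highest saturation ratio seen for each component."""
    peaks = {}
    for component, ratio in samples:
        peaks[component] = max(ratio, peaks.get(component, ratio))
    return peaks


# Illustrative (component, saturation ratio) samples from several series.
samples = [
    ("puma_workers", 0.42),
    ("ruby_thread_contention", 0.10),
    ("puma_workers", 0.87),
]
print(max_by_component(samples))
```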

Where else to look for help?

If you’re unsure, you can always ask for help in the #g_secure-secret-detection channel.

How to improve this runbook?

The runbook needs to be updated as the feature evolves and progresses. Please follow guidelines below to keep it updated.

When a panel is updated in a dashboard

If a panel is updated in a dashboard, please update the panel information and description as needed.

When a new panel is added

If a new panel is created in a dashboard, please add the name, description, and information using the same format outlined below.

**PANEL NAME IN BOLD**

A few sentences describing what the panel does and what it could be used for to identify a performance regression or reliability issue.

_Panel Information_

* Metric: `NAME_OF_METRIC_USED`
* Label Filters:
  * `LIST_OF_LABELS_USED_TO_FILTER_BY_IN_KEY_AND_VALUE`
* Operations:
  * `LIST_OF_OPERATIONS_APPLIED_ON_DATA`
* Legend:
  * `LEGEND_USED_IF_NOT_AUTOMATIC`
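If panel entries are added often, the template above can be filled in programmatically. The helper below is a hypothetical convenience, not part of any existing tooling; it takes label filters and operations as lists of strings and emits the same Markdown layout.

```python
def render_panel(name, description, metric, label_filters, operations, legend):
    """Render one panel entry in the runbook's Markdown template."""
    lines = [
        f"**{name}**",
        "",
        description,
        "",
        "_Panel Information_",
        "",
        f"* Metric: `{metric}`",
        "* Label Filters:",
    ]
    lines += [f"  * `{label_filter}`" for label_filter in label_filters]
    lines.append("* Operations:")
    lines += [f"  * `{operation}`" for operation in operations]
    lines += ["* Legend:", f"  * `{legend}`"]
    return "\n".join(lines)


# Example using values from the Failed SSH Sessions panel documented earlier.
rendered = render_panel(
    name="Failed SSH Sessions",
    description="Displays the maximum number of failed SSH sessions at a given time.",
    metric="gitlab_sli:shell_sshd_sessions:errors_total",
    label_filters=["env = gprd", "stage = main", "type = git"],
    operations=["Max by: app"],
    legend="Auto",
)
print(rendered)
```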

When a panel is removed

In case a panel is removed from the dashboard, please consider removing the corresponding section from this runbook.

How to contribute to relevant dashboards?

Dashboards discussed in this runbook can be improved as follows.

When a new component is utilised by the feature

If a new component is utilised by the feature, please follow the steps below.

  • Identify endpoints or services the feature interacts with in the component.
  • Explore metrics available for the endpoint or service.
  • If no metrics are available, consider creating them to monitor the performance of the endpoint/service.
  • Create a new row for the component in the dashboard you are editing.
  • Add as many panels as needed for the available metrics in the new row. Use your best judgement on what should be added.
  • Create a merge request updating this runbook with information about the panel. Use panels above for guidance.

When a component is no longer relevant

If a component is no longer relevant, please remove its corresponding row from the dashboard.

Last modified June 18, 2024: Add relative links rule (cd96f133)