Secret Push Protection Monitoring
When to use this runbook?
This runbook is intended for monitoring the secret push protection feature, so that any reliability issues or performance regressions that occur when it is enabled on GitLab.com can be identified and mitigated. It can also be used to understand more about the relevant dashboards below and how to improve them.
What to monitor?
While the feature, in its current form, doesn’t have any external components and is entirely encapsulated within the application server as a dependency, it does interact with a number of components as can be seen in this push event sequence diagram. Those components are:
- GitLab Shell (Git over SSH): git-receive-pack
- Workhorse (Git over HTTP/S): git-receive-pack
- Gitaly:
  - SSHReceivePack
  - PostReceivePack
  - PreReceiveHook
  - ListAllBlobs() RPC
  - ListBlobs() RPC
  - GetTreeEntries() RPC
- Rails: /internal/allowed endpoint
Below is a sequence diagram showing the entire workflow, whether a git push takes place over HTTP or SSH:
sequenceDiagram
    actor User
    User->>+Workhorse/GitLab Shell: git push
    Workhorse/GitLab Shell->>+Gitaly: tcp/ssh
    Gitaly->>+Gitaly: PostReceivePack/SSHReceivePack
    Gitaly->>+Gitaly: git-receive-pack
    Gitaly->>+Gitaly: PreReceiveHook
    Gitaly->>+Rails: grpc
    Note over Gitaly, Rails: invokes /internal/allowed endpoint
    Rails->>+Rails: GitLab::GitAccess
    Rails->>+Rails: EE::GitLab::Checks::ChangesAccess
    Note over Rails: runs Gitlab::Checks::SecretsCheck
    break when special commit message flag is found
        Rails->>+Gitaly: push check skipped
        Gitaly->>+Workhorse/GitLab Shell: outcome of push
        Workhorse/GitLab Shell->>+User: outcome of push
    end
    break when push option is passed
        Rails->>+Gitaly: push check skipped
        Gitaly->>+Workhorse/GitLab Shell: outcome of push
        Workhorse/GitLab Shell->>+User: outcome of push
    end
    Rails->>+Gitaly: ListBlobs or ListAllBlobs
    Note over Gitaly, Rails: depends on quarantine directory existence
    Gitaly->>+Rails: grpc
    Rails->>+gitlab-secret_detection: gitlab-secret_detection::Scan
    alt no secret detected
        gitlab-secret_detection->>+gitlab-secret_detection: scan blob
        gitlab-secret_detection->>+Rails: success
        Rails->>+Gitaly: accept - no secret detected
    else scan timeout
        gitlab-secret_detection->>+gitlab-secret_detection: scan blob
        gitlab-secret_detection->>+Rails: fail - timeout
        Rails->>+Gitaly: accept - scan timeout
    else secret detected
        gitlab-secret_detection->>+gitlab-secret_detection: scan blob
        gitlab-secret_detection->>+Rails: fail - secret found
        Rails->>+Gitaly: GetTreeEntries
        Note over Gitaly, Rails: retrieves blobs' file path and commit sha
        Gitaly->>+Rails: grpc
        Rails->>+Rails: Format Response
        Rails->>+Gitaly: reject - secret detected
    end
    Gitaly->>+Workhorse/GitLab Shell: outcome of push
    Workhorse/GitLab Shell->>+User: outcome of push
Note: PreReceiveHook is not to be confused with git’s pre-receive hook. In fact, the former is a binary wrapper around the actual git hook. Please read more about the hook setup in Gitaly’s documentation.
These components are therefore the main elements we are trying to focus on when monitoring the feature.
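The branching in the sequence diagram above can be summarised as a small decision function. This is only an illustrative sketch of the outcomes the diagram lists, not the Rails implementation: the names `ScanResult` and `push_outcome` are hypothetical, and the returned strings simply reuse the labels from the diagram.

```python
from enum import Enum
from typing import Optional


class ScanResult(Enum):
    """The three scan outcomes shown in the diagram's alt block."""
    NO_SECRET = "no secret detected"
    TIMEOUT = "scan timeout"
    SECRET_FOUND = "secret detected"


def push_outcome(
    skip_via_commit_message: bool,
    skip_via_push_option: bool,
    scan_result: Optional[ScanResult] = None,
) -> str:
    """Return the message Rails sends back to Gitaly for a push (hypothetical helper)."""
    # Both break blocks in the diagram short-circuit before any blob is listed.
    if skip_via_commit_message or skip_via_push_option:
        return "push check skipped"
    if scan_result is None:
        raise ValueError("a scan result is required when the check is not skipped")
    if scan_result is ScanResult.SECRET_FOUND:
        # Only this branch calls GetTreeEntries to build the rejection message.
        return "reject - secret detected"
    if scan_result is ScanResult.TIMEOUT:
        # A timeout does not block the push.
        return "accept - scan timeout"
    return "accept - no secret detected"


print(push_outcome(False, False, ScanResult.TIMEOUT))
```

Note that, per the diagram, only the "secret detected" branch rejects the push; a scan timeout is accepted.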
How do we monitor the feature?
As discussed above, the functionality spans a number of components. Therefore, there are three main tools we could use for monitoring the feature:
- Kibana (Logs)
  - Staging
    - pubsub-rails-inf-gstg
    - pubsub-gitaly-inf-gstg
    - pubsub-workhorse-inf-gstg
    - pubsub-shell-inf-gstg
  - Production
    - pubsub-rails-inf-gprd
    - pubsub-gitaly-inf-gprd
    - pubsub-workhorse-inf-gprd
    - pubsub-shell-inf-gprd
- Prometheus/Grafana (Metrics)
- Sentry (Error Tracking)
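The Kibana index names listed above follow a visible pattern of component plus environment suffix. As a convenience sketch, the helper below derives them; the function name `kibana_index` and the lookup tables are hypothetical, and the pattern is inferred only from the eight names in the list.

```python
# Components and environment suffixes as they appear in the index names above.
COMPONENTS = ("rails", "gitaly", "workhorse", "shell")
ENVIRONMENT_SUFFIX = {"staging": "gstg", "production": "gprd"}


def kibana_index(component: str, environment: str) -> str:
    """Build an index name such as pubsub-gitaly-inf-gprd (hypothetical helper)."""
    if component not in COMPONENTS:
        raise ValueError(f"unknown component: {component}")
    return f"pubsub-{component}-inf-{ENVIRONMENT_SUFFIX[environment]}"


for name in COMPONENTS:
    print(kibana_index(name, "production"))
```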
This runbook focuses primarily on the Prometheus metrics available in Grafana, but also shares brief information about other tools and how they could be used. In later iterations, this may change as the feature grows and develops.
How to check the logs emitted from the feature?
To check the logs emitted from the feature, please look at the following Kibana views:
Note: Kibana retains logs for only 7 days.
How to identify and mitigate a reliability or performance issue with the feature?
The overview dashboard is the main dashboard we have built to monitor the feature. That’s where anyone should start to look when trying to identify reliability or performance issues.
The dashboard itself is split into 4 rows (or sections), with each containing a number of panels as below.
GitLab Shell (Git over SSH)
This section monitors the stability of certain operations related to the feature within GitLab Shell, which is a set of executables created to handle Git SSH sessions. The tool itself does not handle SSH directly; instead, the SSH server/daemon gitlab-sshd maintains all connections with clients and calls Rails via GitLab Shell to perform authorization or access checks. Please check this diagram and this description of a request cycle for more information on how that works.
The section can be used to ensure there are no performance degradations related to git-receive-pack operations when a git push operation is carried out over SSH. It is divided into two rows/sections as follows.
Note: Most of the available metrics for both gitlab-shell and gitlab-sshd aren’t aggregated by the command used, so for a better overview of the performance of the git-receive-pack operation, take a look at the Kibana logs linked in those sections instead.
gitlab-shell
This panel displays the average number of requests per second (RPS) made to gitlab-shell over time. The panel can be used to monitor request rates and understand if there’s a performance or scalability issue. Use the link for a more detailed overview of this metric. Note: this isn’t specific to the git-receive-pack command.
Panel Information
- Metric: gitlab_component_ops:rate_5m
- Label Filters:
  - component=gitlab_shell
  - env=gprd
  - monitor=global
  - stage=main
  - type=git
- Operations:
  - Avg over time: range | $__interval
- Legend:
  - RPS
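To make the relationship between a Panel Information entry and a query more concrete, the sketch below assembles a PromQL expression from the metric, label filters, and range operation listed above. The helper names `selector` and `over_time` are hypothetical, and the resulting expression is an assumption about how the panel could be written, not a copy of the dashboard's actual query.

```python
def selector(metric: str, labels: dict) -> str:
    """Render a PromQL selector with exact-match label filters."""
    matchers = ",".join(f'{name}="{value}"' for name, value in labels.items())
    return f"{metric}{{{matchers}}}"


def over_time(function: str, metric: str, labels: dict, window: str = "$__interval") -> str:
    """Wrap a selector in a *_over_time function using a Grafana range variable."""
    return f"{function}({selector(metric, labels)}[{window}])"


# Values taken from the Panel Information above.
query = over_time(
    "avg_over_time",
    "gitlab_component_ops:rate_5m",
    {
        "component": "gitlab_shell",
        "env": "gprd",
        "monitor": "global",
        "stage": "main",
        "type": "git",
    },
)
print(query)
```

Swapping `env` to `gstg` would give the same view for staging, assuming the recording rule is also evaluated there.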
Total Established Gitaly Connections
This panel displays the total number of Gitaly connections that have been established by gitlab-shell at a given time. This panel can be used to determine if there’s a sudden drop in connections between both components, which may indicate a performance or an availability issue. Note: this isn’t specific to the git-receive-pack command.
Panel Information
- Metric: gitlab_shell_gitaly_connections_total
- Label Filters:
  - env=gprd
  - stage=main
- Operations:
  - Count:
    - Label: status
- Legend:
  - Auto
This panel displays the minimum number of established SSH sessions at a given time. Together with the adjacent panel, which shows the maximum number of SSH sessions that failed to establish at a given time, it can be used to understand if there’s an availability issue. Note: this isn’t specific to the git-receive-pack command.
Panel Information
- Metric: gitlab_sli:shell_sshd_sessions:total
- Label Filters:
  - env=gprd
  - stage=main
  - type=git
- Operations:
  - Min
- Legend:
  - Auto
This panel displays the maximum number of failed SSH sessions at a given time. Together with the adjacent panel, which shows the minimum number of SSH sessions established at a given time, it can be used to understand if there’s an availability issue. Note: this isn’t specific to the git-receive-pack command.
Panel Information
- Metric: gitlab_sli:shell_sshd_sessions:errors_total
- Label Filters:
  - env=gprd
  - stage=main
  - type=git
- Operations:
  - Max by:
    - Label: app
- Legend:
  - Auto
Established Session Average Duration
This panel displays the average duration of established SSH sessions, computed over a range of 24 hours. The panel can be used to determine if there’s an increase in the duration of a git pull/push over SSH, which may indicate a performance or availability issue. Note: this isn’t specific to the git-receive-pack command.
Panel Information
- Metrics:
  - gitlab_shell_sshd_session_established_duration_seconds_sum
  - gitlab_shell_sshd_session_established_duration_seconds_count
- Label Filters:
  - env=gprd
  - stage=main
  - type=git
- Operations:
  - Division: /
  - Rate: range | 24h
  - Sum
- Legend: {{label_name}}
This panel displays the average duration of all SSH sessions (whether established or failed), computed over a range of 24 hours. The panel can be used to determine if there’s an increase in the duration of a git pull/push over SSH, which may indicate a performance or availability issue. Note: this isn’t specific to the git-receive-pack command.
Panel Information
- Metrics:
  - gitlab_shell_sshd_session_duration_seconds_sum
  - gitlab_shell_sshd_session_duration_seconds_count
- Label Filters:
  - env=gprd
  - stage=main
  - type=git
- Operations:
  - Division: /
  - Rate: range | 24h
  - Sum
- Legend: {{label_name}}
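Both duration panels above divide the rate of a `_sum` counter by the rate of the matching `_count` counter. Because both rates cover the same window, the window length cancels out and the result is simply the added seconds divided by the added sessions. The sketch below shows that arithmetic with made-up counter samples; `average_duration` is a hypothetical helper, and it ignores the counter-reset handling and extrapolation that Prometheus applies.

```python
from typing import Optional


def average_duration(
    sum_start: float, sum_end: float, count_start: float, count_end: float
) -> Optional[float]:
    """Average seconds per session between two samples of the _sum and _count counters."""
    sessions = count_end - count_start
    if sessions <= 0:
        # No sessions finished in the window (or a counter reset); nothing to average.
        return None
    return (sum_end - sum_start) / sessions


# Illustrative values only: 900 sessions took 3600 seconds in total over the window.
print(average_duration(sum_start=1200.0, sum_end=4800.0, count_start=400, count_end=1300))
```

A rising value here means individual sessions are taking longer, independent of how many sessions there are.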
gitlab-sshd
This panel displays the application performance index (Apdex) for the gitlab-sshd SSH daemon/server. This Service Level Indicator (SLI) averages close to 99.9% most of the time, but a drop in the indicator could point to an outage or degradation. Use the link for a more detailed overview of this metric. Note: this isn’t specific to the git-receive-pack command.
Panel Information
- Metric: gitlab_component_apdex:ratio_5m
- Label Filters:
  - component=gitlab_sshd
  - env=gprd
  - monitor=global
  - stage=main
  - type=git
- Operations:
  - Min over time: range | $__interval
- Legend:
  - Apdex
This panel displays the maximum error ratio observed for the gitlab-sshd SSH daemon/server, clamped to an upper limit of 1. This Service Level Indicator (SLI) averages close to 0.01% most of the time, but an increase in the indicator could point to an outage or degradation. Use the link for a more detailed overview of this metric. Note: this isn’t specific to the git-receive-pack command.
Panel Information
- Metric: gitlab_component_errors:ratio_5m
- Label Filters:
  - component=gitlab_sshd
  - env=gprd
  - monitor=global
  - stage=main
  - type=git
- Operations:
  - Clamp max:
    - Maximum Scalar: 1
    - Max over time: range | $__interval
- Legend:
  - Error %
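The combination of "Max over time" and "Clamp max" in the panel above can be illustrated on plain numbers: take the largest error ratio seen in the interval, then cap it at 1. The sample values below are invented for illustration, and the function names are hypothetical stand-ins for the PromQL functions `max_over_time` and `clamp_max`.

```python
def clamp_max(value: float, ceiling: float) -> float:
    """Cap a value at an upper limit, like PromQL's clamp_max on a single sample."""
    return min(value, ceiling)


def error_ratio_panel_value(samples: list, ceiling: float = 1.0) -> float:
    """Largest sample in the interval, capped at the ceiling."""
    return clamp_max(max(samples), ceiling)


# Illustrative samples: a quiet interval and one with an out-of-range spike.
typical = error_ratio_panel_value([0.0001, 0.0003, 0.0002])
spike = error_ratio_panel_value([0.2, 1.7, 0.9])
print(typical, spike)
```

Capping and taking the maximum can be applied in either order with the same result, so the order in which the two operations are listed does not change what the panel shows.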
This panel displays the average number of requests per second (RPS) made to the gitlab-sshd SSH daemon/server over time. The panel can be used to monitor request rates and understand if there’s a performance or scalability issue. Use the link for a more detailed overview of this metric. Note: this isn’t specific to the git-receive-pack command.
Panel Information
- Metric: gitlab_component_ops:rate_5m
- Label Filters:
  - component=gitlab_sshd
  - env=gprd
  - monitor=global
  - stage=main
  - type=git
- Operations:
  - Avg over time: range | $__interval
- Legend:
  - RPS
Workhorse (Git over HTTP/S)
This section monitors the stability of certain operations related to the feature within Workhorse, which is a smart reverse proxy intended to handle resource-intensive and long-running requests. It intercepts all HTTP requests and either propagates them unchanged or handles them itself by performing additional logic. Please check this diagram and this description of a request cycle for more information on how that works.
The section can be used to ensure there are no performance degradations related to git-receive-pack operations when a git push operation is carried out over HTTP/S.
Processed /.git/git-receive-pack Requests
This panel displays the number of HTTP requests that have been processed by workhorse over time, measured as the increase over a range of 24 hours. The panel partitions these requests by HTTP verb/method and response code. It can be used to determine if the number of git-receive-pack requests with a response code other than 200 has increased recently, indicating an issue with processing such requests.
Panel Information
- Metric: gitlab_workhorse_git_http_requests
- Label Filters:
  - exported_service=git-receive-pack
  - env=gprd
  - stage=main
  - code!=0
- Operations:
  - Increase: range | 24h
  - Sum:
    - Label: code
    - Label: method
- Legend: {{code}} | {{method}}
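The aggregation in the panel above (increase over the window, then sum by `code` and `method`) can be sketched on sample data as follows. The samples, the `node` label, and both helper names are hypothetical, and the increase is simplified to last minus first, without the counter-reset handling and extrapolation that Prometheus applies.

```python
from collections import defaultdict


def increase_by_code_and_method(samples):
    """Sum the per-series increase, grouped by (code, method).

    samples: iterable of (labels, first_value, last_value) for one window.
    """
    totals = defaultdict(float)
    for labels, first, last in samples:
        totals[(labels["code"], labels["method"])] += last - first
    return dict(totals)


def non_200_share(totals) -> float:
    """Fraction of requests in the window whose response code was not 200."""
    total = sum(totals.values())
    failed = sum(value for (code, _), value in totals.items() if code != "200")
    return failed / total if total else 0.0


# Illustrative counter samples from two nodes over one 24h window.
window = [
    ({"code": "200", "method": "post", "node": "a"}, 1000.0, 1480.0),
    ({"code": "200", "method": "post", "node": "b"}, 2000.0, 2500.0),
    ({"code": "500", "method": "post", "node": "a"}, 10.0, 30.0),
]
totals = increase_by_code_and_method(window)
print(totals, non_200_share(totals))
```

A growing non-200 share is the signal the panel description refers to.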
Total Established Gitaly Connections
This panel displays the total number of Gitaly connections that have been established by workhorse at a given time. This panel can be used to determine if there’s a sudden drop in connections between both components, which may indicate a performance or an availability issue.
Panel Information
- Metric: gitlab_workhorse_gitaly_connections_total
- Label Filters:
  - env=gprd
  - stage=main
- Operations:
  - Count:
    - Label: status
- Legend: {{status}}
Average Latency for /.git/git-receive-pack Request [All Nodes]
This panel displays the average latency (duration) in seconds for the /.git/git-receive-pack request for all nodes running workhorse. This panel can be used to determine if there is an increase in response times for that specific request, which could indicate a performance degradation if it surpasses a certain threshold.
Panel Information
- Metrics:
  - gitlab_workhorse_http_request_duration_seconds_sum
  - gitlab_workhorse_http_request_duration_seconds_count
- Label Filters:
  - env=gprd
  - stage=main
  - route=^/.+\\.git/git-receive-pack\\z (double escaping is used for the backslash)
- Operations:
  - Division: /
  - Rate: range | 24h
  - Sum:
    - Label: node
- Legend:
  - Auto
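The double escaping mentioned for the `route` filter above is easy to get wrong, so here is a small sketch of it. It assumes the `route` label holds Workhorse's route regular expression as literal text, and that the filter is written as a double-quoted PromQL string, where a backslash must itself be escaped. The helper `promql_quote` and the sample repository paths are hypothetical.

```python
import re

# The label value with single backslashes, as implied by the text above.
ROUTE_LABEL = r"^/.+\.git/git-receive-pack\z"


def promql_quote(value: str) -> str:
    """Quote a label value for a double-quoted PromQL string literal."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


matcher = "route=" + promql_quote(ROUTE_LABEL)
print(matcher)

# The expression uses \z (end of text, Go regexp syntax); Python spells it \Z.
python_pattern = re.compile(ROUTE_LABEL.replace(r"\z", r"\Z"))
print(bool(python_pattern.match("/gitlab-org/gitlab.git/git-receive-pack")))
```

The second part only demonstrates which request paths the route expression covers; the Prometheus filter itself compares the label text, not request paths.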
This panel displays the application performance index (Apdex) for the workhorse component. This Service Level Indicator (SLI) averages close to 99.9% most of the time, but a drop in the indicator could point to an outage or degradation. Use the link for a more detailed overview of this metric. Note: this isn’t specific to the /.git/git-receive-pack route.
Panel Information
- Metric: gitlab_component_apdex:ratio_5m
- Label Filters:
  - component=workhorse
  - env=gprd
  - monitor=global
  - stage=main
  - type=git
- Operations:
  - Min over time: range | $__interval
- Legend:
  - Apdex
This panel displays the maximum error ratio observed for workhorse, clamped to an upper limit of 1. This Service Level Indicator (SLI) averages close to 0.001% most of the time, but an increase in the indicator could point to an outage or degradation. Use the link for a more detailed overview of this metric. Note: this isn’t specific to the /.git/git-receive-pack route.
Panel Information
- Metric: gitlab_component_errors:ratio_5m
- Label Filters:
  - component=workhorse
  - env=gprd
  - monitor=global
  - stage=main
  - type=git
- Operations:
  - Clamp max:
    - Maximum Scalar: 1
    - Max over time: range | $__interval
- Legend:
  - Error %
This panel displays the average number of requests per second (RPS) made to workhorse over time. The panel can be used to monitor request rates and understand if there’s a performance or scalability issue. Use the link for a more detailed overview of this metric. Note: this isn’t specific to the /.git/git-receive-pack route.
Panel Information
- Metric: gitlab_component_ops:rate_5m
- Label Filters:
  - component=workhorse
  - env=gprd
  - monitor=global
  - stage=main
  - type=git
- Operations:
  - Avg over time: range | $__interval
- Legend:
  - RPS
Gitaly
This section monitors the stability of gitaly, a tool providing high-level RPC access to git repositories. In this section we focus on the hooks and RPCs used by the feature, to ensure there are no performance degradations related to any of them.
The section is divided into four sub-sections as follows, with most focus being on latency.
- GitLab Shell <=> Gitaly: SSHReceivePack
- Workhorse <=> Gitaly: PostReceivePack
- Gitaly <=> Rails API:
  - Gitaly / Before /internal/allowed: PreReceiveHook
  - Gitaly / During /internal/allowed: ListAllBlobs() RPC, ListBlobs() RPC, GetTreeEntries() RPC
Workhorse <=> Gitaly
PostReceivePack – Average Latency [All Hosts]
This panel displays the average latency on all hosts in milliseconds for calls to the PostReceivePack RPC, which is the RPC responsible for calling the git-receive-pack command (for git push over HTTP/S), which in turn executes the PreReceiveHook. The latter goes on to call the /internal/allowed endpoint, which runs access checks on the Rails side.
Panel Information
- Metric: gitaly:grpc_server_handling_seconds:avg5m
- Label Filters:
  - job=gitaly
  - grpc_method=PostReceivePack
- Operations:
  - Avg: 1000 * avg
- Legend: {{method}}
GitLab Shell <=> Gitaly
SSHReceivePack – Average Latency [All Hosts]
This panel displays the average latency on all hosts in milliseconds for calls to the SSHReceivePack RPC, which is the RPC responsible for calling the git-receive-pack command (for git push over SSH), which in turn executes the PreReceiveHook. The latter goes on to call the /internal/allowed endpoint, which runs access checks on the Rails side.
Panel Information
- Metric: gitaly:grpc_server_handling_seconds:avg5m
- Label Filters:
  - job=gitaly
  - grpc_method=SSHReceivePack
- Operations:
  - Avg: 1000 * avg
- Legend: {{method}}
Gitaly <=> Rails API
Gitaly / Before /internal/allowed
PreReceiveHook – Average Latency [All Hosts]
This panel displays the average latency on all hosts in milliseconds for calls to the PreReceiveHook hook, which in turn calls the /internal/allowed endpoint to run access checks on the Rails side.
Panel Information
- Metric: gitaly:grpc_server_handling_seconds:avg5m
- Label Filters:
  - job=gitaly
  - grpc_method=PreReceiveHook
- Operations:
  - Avg: 1000 * avg
- Legend: {{method}}
Gitaly / During /internal/allowed
ListAllBlobs – Average Latency [All Hosts]
This panel displays the average latency in milliseconds for all calls to the ListAllBlobs RPC, which is responsible (within the context of the feature) for enumerating all blobs of a repository under a certain size limit (i.e. exactly 1 MiB). This procedure is usually fast because it is mostly used with the size limit set to 0 for checking file sizes of blobs in a certain git push.
Panel Information
- Metric: gitaly:grpc_server_handling_seconds:avg5m
- Label Filters:
  - job=gitaly
  - grpc_method=ListAllBlobs
- Operations:
  - Avg: 1000 * avg
- Legend: {{method}}
ListBlobs – Average Latency [All Hosts]
This panel displays the average latency in milliseconds for all calls to the ListBlobs RPC, which is responsible (within the context of the feature) for enumerating all blobs of a repository under a certain size limit (i.e. exactly 1 MiB), similar to ListAllBlobs, but it also loads up file paths for those blobs. The procedure is often slower than ListAllBlobs because it loads up blob contents when enumerating them.
Panel Information
- Metric: gitaly:grpc_server_handling_seconds:avg5m
- Label Filters:
  - job=gitaly
  - grpc_method=ListBlobs
- Operations:
  - Avg: 1000 * avg
- Legend: {{method}}
GetTreeEntries – Average Latency [All Hosts]
This panel displays the average latency in milliseconds for all calls to the GetTreeEntries RPC, which is responsible (within the context of the feature) for retrieving blob metadata (i.e. file path and commit sha) for all blobs that were scanned and found to include a leaked secret.
Panel Information
- Metric: gitaly:grpc_server_handling_seconds:avg5m
- Label Filters:
  - job=gitaly
  - grpc_method=GetTreeEntries
- Operations:
  - Avg: 1000 * avg
- Legend: {{method}}
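All Gitaly latency panels in this section share one shape: the same recording rule filtered by `grpc_method`, averaged across hosts, and multiplied by 1000 to convert seconds to milliseconds. The sketch below generates that expression for each method named in this section; `latency_ms_query` is a hypothetical helper, and the output is an assumption about how the panels could be expressed, not the dashboard's stored queries.

```python
# The hook and RPCs covered by the panels in this section.
GRPC_METHODS = [
    "PostReceivePack",
    "SSHReceivePack",
    "PreReceiveHook",
    "ListAllBlobs",
    "ListBlobs",
    "GetTreeEntries",
]


def latency_ms_query(grpc_method: str) -> str:
    """Average handling time across hosts, converted from seconds to milliseconds."""
    metric = "gitaly:grpc_server_handling_seconds:avg5m"
    filters = f'job="gitaly",grpc_method="{grpc_method}"'
    return f"1000 * avg({metric}{{{filters}}})"


for method in GRPC_METHODS:
    print(latency_ms_query(method))
```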
Rails
This section monitors the stability of the /internal/allowed endpoint, which is a focal point in the feature’s journey to protect against leaked secrets in a git push. The endpoint is part of GitLab’s Internal API, and is responsible for assessing if a user has permission to perform certain operations on the repository.
The section can be used to ensure there are no performance degradations related to the /internal/allowed endpoint when changes in a certain git push are scanned for secrets.
Internal API / Request Latency
This panel displays the average, p95, p99, and mean latencies for requests made to the /internal/allowed endpoint over time. The panel can be used to monitor request latency and understand if there’s a performance or scalability issue with that specific endpoint. Use the link for a more detailed overview of this metric.
Panel Information
- Metrics:
  - controller_action:gitlab_transaction_duration_seconds_sum:rate5m
  - controller_action:gitlab_transaction_duration_seconds:p95
  - controller_action:gitlab_transaction_duration_seconds:p99
  - controller_action:gitlab_transaction_duration_seconds_sum:rate1m
  - controller_action:gitlab_transaction_duration_seconds_count:rate1m
- Label Filters:
  - action=POST /api/internal/allowed
  - controller=Grape
  - environment=gprd
  - stage=main
  - type=internal-api
- Operations:
  - Avg over time: range | $__interval
  - Avg:
    - Label: controller
    - Label: action
- Legends:
  - {{action}} – avg
  - {{action}} – p95
  - {{action}} – p99
  - {{action}} – mean
Internal API / RPS (Requests Per Second)
This panel displays the average number of requests per second (RPS) made to the /internal/allowed endpoint over time. The panel can be used to monitor request rates and understand if there’s a performance or scalability issue with that specific endpoint. Use the link for a more detailed overview of this metric.
Panel Information
- Metrics:
  - controller_action:gitlab_transaction_duration_seconds_count:rate1m
- Label Filters:
  - action=POST /api/internal/allowed
  - controller=Grape
  - environment=gprd
  - stage=main
  - type=internal-api
- Operations:
  - Avg over time: range | $__interval
  - Sum:
    - Label: controller
    - Label: action
- Legends:
  - {{action}}
Internal API / Memory Saturation Rate
This panel displays the memory saturation rate for two components of the internal API: Ruby VM and Puma workers. This helps to understand whether memory consumption in Rails has increased to the point of saturation, which indicates a performance and scalability issue that requires attention. Note: this panel isn’t specific to the /internal/allowed endpoint.
Panel Information
- Metrics:
  - gitlab_component_saturation:ratio
- Label Filters:
  - env=gprd
  - environment=gprd
  - stage=main
  - type=internal-api
  - component=~ruby_thread_contention|puma_workers
- Operations:
  - Max over time: range | $__interval
  - Max:
    - Label: component
- Legends:
  - Auto
Where else to look for help?
If you’re unsure, you can always ask for help in the #g_secure-secret-detection channel.
How to improve this runbook?
The runbook needs to be updated as the feature evolves and progresses. Please follow guidelines below to keep it updated.
When a panel is updated in a dashboard
If a panel is updated in a dashboard, please update the panel information and description as needed.
When a new panel is added
If a new panel is created in a dashboard, please add the name, description, and information using the same format outlined below.
**PANEL NAME IN BOLD**
A few sentences describing what the panel does and what it could be used for to identify a performance regression or reliability issue.
_Panel Information_
* Metric: `NAME_OF_METRIC_USED`
* Label Filters:
  * `LIST_OF_LABELS_USED_TO_FILTER_BY_IN_KEY_AND_VALUE`
* Operations:
  * `LIST_OF_OPERATIONS_APPLIED_ON_DATA`
* Legend:
  * `LEGEND_USED_IF_NOT_AUTOMATIC`
When a panel is removed
In case a panel is removed from the dashboard, please consider removing the corresponding section from this runbook.
How to contribute to relevant dashboards?
Dashboards discussed in this runbook can be improved as follows.
When a new component is utilised by the feature
If a new component is utilised by the feature, please follow the steps below.
- Identify endpoints or services the feature interacts with in the component.
- Explore metrics available for the endpoint or service.
- If no metrics are available, consider creating them to monitor the performance of the endpoint/service.
- Create a new row for the component in the dashboard you are editing.
- Add as many panels as there are available metrics in the new row. Use your best judgement on what should be added.
- Create a merge request updating this runbook with information about the panel. Use panels above for guidance.
When a component is no longer relevant
If a component is no longer relevant, please remove its corresponding row from the dashboard.