Debugging the Gitaly service
About this document
This document is intended to help Gitaly engineers become familiar with GitLab’s production layout and debug production problems effectively. While the focus is on SaaS, many of the skills also transfer to debugging self-managed instances.
Generic GitLab background
Skim or read the following, focusing first on an overview and then on Gitaly:
Other useful links:
Gitaly specific background
- Familiarize yourself with Gitaly’s README
- Take a look at SRE’s runbooks
Gitaly in Production
Both gitlab.com and Dedicated use Gitaly in “sharded” mode, that is, without Praefect (Gitaly Cluster).
Monitoring dashboards
We have some useful pre-built monitoring dashboards on GitLab’s internal Grafana instance. All dashboards are listed in this folder. Please note that some of them are fairly outdated.
The following dashboards are most common:
- Gitaly: Overview. This dashboard contains cluster-wide aggregated metrics. It is used to determine the overall health of the cluster and make it easy to spot any outlier node.
- Gitaly: Host details. This dashboard contains more detailed metrics of a particular node.
- Gitaly Housekeeping statistics. This dashboard shows detailed operational information about the Gitaly housekeeping feature.
- Gitaly: Rebalance dashboard. This dashboard shows the relative balance between Gitaly nodes. It is used to determine when we need to relocate repositories from one node to others.
A Gitaly dashboard can be either auto-generated or manually drafted. We use Jsonnet (a superset of JSON) to achieve dashboards-as-code; the definitions of such dashboards are located in this folder. This is now the recommended way to manage an observability dashboard, because it lets us use GitLab’s built-in libraries and results in highly standardized dashboards.
A standardized dashboard should have a top-level section containing environment filters, node filters, and useful annotations such as feature flag activities, deployments, etc. Some dashboards have an interlinked system that connects Grafana and Kibana with a single click.
Such dashboards usually have two parts. The first half contains GitLab-wide indicators that tell whether Gitaly is “healthy”, along with node-level resource metrics; the aggregation and calculation behind them are sophisticated. The second half contains panels of custom metrics collected from Gitaly. In summary, these dashboards tell us whether Gitaly performs well according to predefined thresholds. Contact the Scalability:Observability team with any questions.
Some examples of using built-in dashboards to investigate production issues, from an Engineer’s point of view:
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18156#note_1965772736
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/15980#note_1457815084
- https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/23532#note_1374642198
Gitaly’s Prometheus metrics
A panel in a dashboard is a visualization of an aggregation of underlying metrics. We use Prometheus to collect metrics: to simplify, the Gitaly server exposes an HTTP endpoint (code) from which Prometheus instances fetch metrics periodically.
In a dashboard, you can click on the top-right hamburger button and choose “Explore” to get access to the underlying metrics. Or you could use the Explore page to play with metrics.
Unfortunately, we don’t have a curated list of all Gitaly metrics and their definitions, so you might need to look up definitions in multiple places. Here is the list of all Gitaly-related metrics. The metrics come from a few sources:
- Node-level or environmental metrics. Those metrics are powered by other systems that host the Gitaly process. They are not exposed by Gitaly but are very useful, for example: CPU metrics, memory metrics, or cgroup metrics.
- Gitaly-specific metrics. Those metrics are accounted for directly in the code and typically have the `gitaly_` prefix.
- Aggregated metrics, such as combinations of other metrics, or metrics downsampled because of high-cardinality issues. Gitaly’s aggregated metrics are listed in this file.
In the code, you’ll see something like the following. Any registered metric is available when Prometheus scrapes the endpoint. By tracing these registrations, you can find where Gitaly-specific metrics are used.
repoCounter := counter.NewRepositoryCounter(cfg.Storages)
prometheus.MustRegister(repoCounter)
packObjectsServedBytes = promauto.NewCounter(prometheus.CounterOpts{
    Name: "gitaly_pack_objects_served_bytes_total",
    Help: "Number of bytes of git-pack-objects data served to clients",
})
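Once a metric is registered like this, you can confirm that it is exported by fetching the metrics endpoint yourself. A minimal sketch, assuming the node’s Gitaly config sets `prometheus_listen_addr` to the common default of `localhost:9236` (check the actual value on the node):
# Run on the Gitaly node; adjust the port to match prometheus_listen_addr.
curl -s http://localhost:9236/metrics | grep '^gitaly_pack_objects_served_bytes_total'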
A metric has a set of labels. GitLab adds the following set of labels to all metrics:
- `env` or `environment`: the environment, including `gprd`, `gstg`, and `ops`, to name a few.
- `fqdn`: the fully qualified domain name. As Gitaly runs on VMs today, this label identifies the hosting node.
- `region` and `zone`: the region and zone of the node.
- `stage`: the current stage of the process, either `main` or `cny`.
- `service`/`type`: for Gitaly, it’s always `gitaly`.
In the future, when Gitaly runs on Kubernetes, we will probably have more Kubernetes-specific labels.
Queries are written in PromQL. Some examples:
- Calculate the rate (ops/s) of the pack-refs housekeeping task by node.
- Calculate the pack-objects/RPC requests dropped because of limiting in the last 2 days.
- Calculate the in-flight commands of the gitaly-cny nodes. As you can see, there was a peak on 2024-06-17, which is when this incident occurred.
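If you prefer the command line over Grafana’s Explore page, you can run PromQL against the Prometheus HTTP API directly. The sketch below is illustrative only: the Prometheus URL is a placeholder, and `gitaly_commands_running` is assumed to be the in-flight command gauge, so verify both in Explore before relying on the numbers.
# Placeholder URL; replace with a Prometheus instance you can reach.
PROM_URL="https://prometheus.example.com"

# In-flight Gitaly commands per cny node in gprd (labels as described above).
curl -sG "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=sum by (fqdn) (gitaly_commands_running{env="gprd", stage="cny"})' | jq .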
Debugging and performance testing tools
- grpcurl: a `curl`-like tool, but for gRPC
- grpcui: a lightweight, Postman-like tool for gRPC
- hyperfine: a benchmarking tool that runs a command repeatedly and reports timing statistics
- hyperfine can be used together with grpcurl to check the response time of a gRPC call, as sketched below
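For instance, the two tools can be combined roughly as follows. This is only a sketch: the address, port, and token are placeholders, Gitaly RPCs require the node’s configured auth token, and grpcurl needs either server reflection or the Gitaly `.proto` files.
# Placeholders: set these to match the node you are testing against.
GITALY_ADDR="gitaly.internal:8075"
TOKEN="gitaly-auth-token"

# One-off call to a simple RPC (add -import-path/-proto if reflection is unavailable).
grpcurl -plaintext -H "authorization: Bearer $TOKEN" "$GITALY_ADDR" gitaly.ServerService/ServerInfo

# Benchmark the same call with hyperfine to get mean/min/max response times.
hyperfine --warmup 3 \
  "grpcurl -plaintext -H 'authorization: Bearer $TOKEN' $GITALY_ADDR gitaly.ServerService/ServerInfo"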
Enable Git trace
The `gitaly_log_git_traces` feature flag allows you to enable logging of Git trace2 spans for a specific user, group, or project. When enabled, Git outputs all trace2 spans to the gRPC logs issued by that particular actor.
/chatops run feature set --project=gitlab-org/gitlab gitaly_log_git_traces true
/chatops run feature set --user=myusername gitaly_log_git_traces true
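When you are done debugging, turn the flag off again with the same syntax:
/chatops run feature set --project=gitlab-org/gitlab gitaly_log_git_traces false
/chatops run feature set --user=myusername gitaly_log_git_traces false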
Because these logs are frequent and noisy, customers should filter server-side logs on the key and value of either:
"msg":"Git Trace2 API"
"component":"trace2hooks.log_exporter"
For more information, see:
On the client side, for Git trace v1, the customer can enable `GIT_TRACE*` variables, including:
GIT_TRACE=true
GIT_TRACE_PACK_ACCESS=true
GIT_TRACE_PACKET=true
GIT_TRACE_PERFORMANCE=true
For Git trace v2:
# Debugging git operations
# If GIT_TRACE2_PERF_BRIEF or trace2.perfBrief is true, the time, file, and line fields are omitted.
GIT_TRACE2_PERF_BRIEF=1 GIT_TRACE2_PERF=true git clone https://gitlab.com/gitlab-org/gitaly
GIT_TRACE2_PERF_BRIEF=1 GIT_TRACE2_PERF=$(pwd)/git-perf.log git clone https://gitlab.com/gitlab-org/gitaly
# Output git events in JSON format
GIT_TRACE2_BRIEF=true GIT_TRACE2_EVENT=$(pwd)/trace2.json git clone https://gitlab.com/gitlab-org/gitaly
Outputs can be configured in different formats:
# Normal format
export GIT_TRACE2=~/log.normal
# OR Performance format
export GIT_TRACE2_PERF=~/log.perf
# OR Event format
export GIT_TRACE2_EVENT=~/log.event
# OR JSON format
export GIT_TRACE2_EVENT=~/log.json
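The event format writes one JSON object per line, which makes it easy to post-process. For example, a quick way to count event types in the trace2.json capture produced above (assuming jq is available):
# Count trace2 event types in the capture.
jq -r '.event' trace2.json | sort | uniq -c | sort -rn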
For more information, see:
strace
`strace(1)` a Gitaly process:
strace -fttTyyy -s 1024 -o /path/filename -p $(pgrep -fd, gitaly)
Or wrap a process to make it easy to strace, especially if it then spawns more processes:
#!/bin/sh
# Wrapper for gitlab-shell: log each invocation, then exec the original binary.
echo "$(date) $PPID $@" >> /tmp/gitlab-shell.txt
exec /opt/gitlab/embedded/service/gitlab-shell/bin/gitlab-shell-orig "$@"
# To strace instead:
# strace -fttTyyy -s 1024 -o /tmp/sshd_trace-$PPID /opt/gitlab/embedded/service/gitlab-shell/bin/gitlab-shell-orig "$@"
An strace parser is useful for making the results more readable.
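When you only need aggregate syscall counts and latencies rather than a full trace, strace’s built-in summary mode is often enough:
# Follow forks and print a per-syscall summary table when interrupted with Ctrl-C.
strace -c -f -p $(pgrep -fd, gitaly)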
fast-stats
fast-stats is a useful tool developed by Support to quickly pull statistics from GitLab logs.
Examples
To find the top methods called in the Gitaly logs during a single 60-minute interval:
fast-stats --interval 60m --limit 1 var/log/gitlab/gitaly/current
To find the top 10 User, Project, Client by Duration calling that method:
grep PostUploadPackWithSidechannel var/log/gitlab/gitaly/current | ~/bin/fast-stats --interval 60m top
git
Snoop on Gitaly Git commands.
- Stop Gitaly
- Rename the gitaly-git binaries
find /opt/gitlab/embedded/bin/gitaly-git-v* -exec mv {} {}_orig \;
- Create a wrapper script for each Git version. Make sure you replace `/opt/gitlab/embedded/bin/gitaly-git-vX.XX_orig` in the script with the right version.
#!/bin/bash
# Wrapper around the original Git binary: log the invocation and capture the output.
GIT="/opt/gitlab/embedded/bin/gitaly-git-vX.XX_orig"
FILE="/tmp/gitaly-$(date +%Y-%m-%d@%H:%M)"
echo -e "\n$(date) $PPID $@\n" >> "$FILE"
"$GIT" "$@" | tee -a "$FILE"
STATUS=${PIPESTATUS[0]}
echo -e "\n--------------\n" >> "$FILE"
exit "$STATUS"
- Make scripts executable
find /opt/gitlab/embedded/bin/gitaly-git-v* -exec chmod 777 {} \;
- Start Gitaly
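After restarting Gitaly, you can confirm the wrappers are capturing traffic by watching the newest capture file while running a clone or fetch against the node:
# The wrapper scripts above write to /tmp/gitaly-<timestamp>.
tail -f "$(ls -t /tmp/gitaly-* | head -n 1)"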
Log analysis
Kibana (Elastic) Dashboards
CPU and memory profiling
pprof profiles are exported to the GCP Cloud Profiler.
Note that Gitaly nodes are distributed across a number of GCP projects. You can use the project dropdown on the top nav bar to switch between the various `gitlab-gitaly-gstg-*` and `gitlab-gitaly-gprd-*` projects.
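Besides Cloud Profiler, Go’s standard pprof tooling can be pointed at a node directly, assuming the pprof handlers are served on Gitaly’s monitoring listener; both that assumption and the port below should be verified against the node’s configuration before use.
# Assumption: /debug/pprof is exposed on the Prometheus listen address (often localhost:9236).
go tool pprof "http://localhost:9236/debug/pprof/profile?seconds=30"   # 30-second CPU profile
go tool pprof "http://localhost:9236/debug/pprof/heap"                 # heap snapshot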
Capacity management
The Gitaly team is responsible for maintaining reasonable serving capacity for gitlab.com.
We get alerts from Tamland if capacity runs low; see this issue comment.
Capacity planning documentation explains how this works in general.