Debugging the Gitaly service

About this document

This document is intended to help Gitaly engineers become familiar with GitLab’s production layout and effectively debug production problems. While the focus is on SaaS, many of the skills also transfer to debugging self-managed instances.

Generic GitLab background

Skim or read the following, focusing first on getting an overview and then on Gitaly:

Other useful links:

Gitaly specific background

Gitaly in Production

Both gitlab.com and Dedicated use Gitaly in “sharded” mode, that is, without Praefect (Gitaly Cluster).

Monitoring dashboards

We have some useful pre-built monitoring dashboards on GitLab’s internal Grafana instance. All dashboards are listed in this folder. Please note that some of them are fairly outdated.

The following dashboards are most common:

A Gitaly dashboard can be either auto-generated or manually drafted. We use Jsonnet (a superset of JSON) to achieve dashboards-as-code; the definitions of such dashboards are located in this folder. This is now the recommended way to manage observability dashboards, because it lets us use GitLab’s built-in libraries and produces highly standardized dashboards.

A standardized dashboard has a top-level section containing environment filters, node filters, and useful annotations such as feature flag activity, deployments, and so on. Some dashboards also interlink Grafana and Kibana, so you can jump between them with a single click.

Such dashboards usually consist of two parts. The second half contains panels of custom metrics collected from Gitaly. The first half is more complicated: it contains GitLab-wide indicators of whether Gitaly is “healthy”, along with node-level resource metrics. The aggregation and calculation behind them are sophisticated. In summary, those dashboards tell us whether Gitaly is performing well according to predefined thresholds. Contact the Scalability:Observability team with any questions.

Gitaly Debug Indicators

Some examples of using built-in dashboards to investigate production issues, from an Engineer’s point of view:

Gitaly’s Prometheus metrics

A panel in a dashboard is a visualization of an aggregation of underlying metrics. We use Prometheus to collect metrics: put simply, the Gitaly process runs an HTTP server (code) from which Prometheus instances periodically scrape metrics.
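
For example, you can fetch the raw metrics directly on a Gitaly node. The following is a minimal sketch; the port assumes the default prometheus_listen_addr of localhost:9236 and may differ per environment.

# Dump Gitaly's raw Prometheus metrics and pick out one metric family.
# localhost:9236 is an assumption (the default monitoring port); adjust if configured differently.
curl -s http://localhost:9236/metrics | grep '^gitaly_pack_objects_served'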

In a dashboard, you can click the top-right hamburger button and choose “Explore” to access the underlying metrics. Alternatively, you can use the Explore page to experiment with metrics directly.

Gitaly Debug Explore

Unfortunately, we don’t have a curated list of all Gitaly metrics and their definitions, so you might need to look up a definition in multiple places. Here is the list of all Gitaly-related metrics. There are several sources:

  • Node-level or environmental metrics. These are emitted by the systems that host the Gitaly process, not by Gitaly itself, but they are very useful: for example, CPU, memory, or cgroup metrics.
  • Gitaly-specific metrics. These are recorded directly in the code and typically have a gitaly_ prefix.
  • Aggregated metrics, which combine other metrics or downsample high-cardinality ones. Gitaly’s aggregated metrics are listed in this file.

Gitaly Debug Metric Lists

In the code, you’ll see registrations like the following. Any registered metric is exposed when Prometheus scrapes the endpoint, so tracing these registrations is the quickest way to find where a Gitaly-specific metric is recorded.

// Register a custom collector with the default Prometheus registry.
repoCounter := counter.NewRepositoryCounter(cfg.Storages)
prometheus.MustRegister(repoCounter)

// Declare a plain counter; promauto registers it automatically.
packObjectsServedBytes = promauto.NewCounter(prometheus.CounterOpts{
  Name: "gitaly_pack_objects_served_bytes_total",
  Help: "Number of bytes of git-pack-objects data served to clients",
})
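
A quick way to trace a metric back to its definition is to grep a local Gitaly checkout for its name. A minimal sketch (the checkout path is a placeholder):

# Find where a metric name is declared in the Gitaly source tree.
# ~/src/gitaly is a placeholder for wherever your checkout lives.
git -C ~/src/gitaly grep -n 'gitaly_pack_objects_served_bytes_total'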

A metric has a set of labels. GitLab adds the following set of labels to all metrics:

  • env or environment: the environment, for example gprd, gstg, or ops.
  • fqdn: the fully qualified domain name. Because Gitaly currently runs on VMs, this label identifies the hosting node.
  • region and zone: the region and zone of the node.
  • stage: the deployment stage of the process, either main or cny.
  • service/type: for Gitaly, it’s always gitaly.

In the future, when Gitaly runs on Kubernetes, we will probably have more K8s-specific labels.

Queries are written in PromQL. Some examples:
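
For instance, here is a minimal sketch that queries a metric over the Prometheus HTTP API; the base URL is a placeholder, and in practice you would usually run the same PromQL from Grafana’s Explore page:

# Per-node rate of bytes served by git-pack-objects in production over the last 5 minutes.
# The Prometheus base URL below is a placeholder, not a real internal endpoint.
curl -sG 'https://prometheus.example.internal/api/v1/query' \
  --data-urlencode 'query=sum by (fqdn) (rate(gitaly_pack_objects_served_bytes_total{environment="gprd"}[5m]))'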

Debugging and performance testing tools

  • grpcurl: a curl-like tool, but for gRPC
  • grpcui: a lightweight, Postman-like UI for gRPC
  • hyperfine: a command-line benchmarking tool that times a command over many runs
    • hyperfine can be used together with grpcurl to check the response time of a gRPC call; see the sketch after this list
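
A minimal sketch of combining the two. The Gitaly address, repository coordinates, and RPC are illustrative only; production Gitaly also requires an authorization token, and you may need to pass the Gitaly .proto files to grpcurl if server reflection is not available:

# Benchmark a single Gitaly RPC by wrapping grpcurl in hyperfine.
# localhost:8075 and the repository path are placeholders; add the required
# authorization metadata for a real server.
hyperfine --warmup 1 --runs 10 \
  "grpcurl -plaintext -d '{\"repository\":{\"storage_name\":\"default\",\"relative_path\":\"@hashed/aa/bb/example.git\"}}' localhost:8075 gitaly.RepositoryService/RepositoryExists"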

strace

strace(1) a gitaly process:

strace -fttTyyy -s 1024 -o /path/to/output -p $(pgrep -fd, gitaly)

Or wrap a binary in a script to make it easier to strace, especially if it spawns more processes:

#!/bin/sh
echo "$(date) $PPID $@" >> /tmp/gitlab-shell.txt
exec /opt/gitlab/embedded/service/gitlab-shell/bin/gitlab-shell-orig "$@"
# Or, to trace instead of just logging:
# exec strace -fttTyyy -s 1024 -o /tmp/sshd_trace-$PPID /opt/gitlab/embedded/service/gitlab-shell/bin/gitlab-shell-orig "$@"

The strace parser tool is useful for making the results more readable.

fast-stats

fast-stats is a useful tool developed by Support to quickly pull statistics from GitLab logs.

Examples

To find, within a single 60-minute interval, the top methods called according to the Gitaly logs:

fast-stats --interval 60m --limit 1 var/log/gitlab/gitaly/current

To find the top 10 users, projects, and clients by duration calling that method:

grep PostUploadPackWithSidechannel var/log/gitlab/gitaly/current | ~/bin/fast-stats --interval 60m top

git

Snoop on Gitaly Git commands.

  1. Stop Gitaly.
  2. Rename the gitaly-git binaries: find /opt/gitlab/embedded/bin/gitaly-git-v* -exec mv {} {}_orig \;
  3. Create a wrapper script for each Git version. Make sure you replace /opt/gitlab/embedded/bin/gitaly-git-vX.XX_orig in the script with the right version.
#!/bin/bash
GIT="/opt/gitlab/embedded/bin/gitaly-git-vX.XX_orig"
FILE="/tmp/gitaly-$(date +%Y-%m-%d@%H:%M)"
echo -e "\n$(date) $PPID $@\n" >> "$FILE"
"$GIT" "$@" | tee -a "$FILE"
status=${PIPESTATUS[0]} # preserve git's exit code rather than tee's
echo -e "\n--------------\n" >> "$FILE"
exit "$status"
  4. Make the scripts executable: find /opt/gitlab/embedded/bin/gitaly-git-v* -exec chmod 777 {} \;
  5. Start Gitaly.

Log analysis

Kibana (Elastic) Dashboards

CPU and memory profiling

pprof profiling data is exported to the GCP Cloud Profiler.

Note that Gitaly nodes are distributed across a number of GCP projects. You can use the project dropdown on the top nav bar to switch between the various gitlab-gitaly-gstg-* and gitlab-gitaly-gprd-* projects.

Capacity management

The Gitaly team is responsible for maintaining reasonable serving capacity for gitlab.com.

We get alerts from Tamland if capacity runs low; see this issue comment.

Capacity planning documentation explains how this works in general.