Debugging the Gitaly service

About this document

This document is intended to help Gitaly engineers become familiar with GitLab’s production layout and effectively debug production problems. While the focus is on SaaS, many of the skills also transfer to debugging self-managed instances.

Generic GitLab background

Skim or read the following, focusing first on getting an overview and then on Gitaly:

Other useful links:

Gitaly specific background

Gitaly in Production

Both gitlab.com and Dedicated use Gitaly in “sharded” mode, that is, without Praefect (Gitaly Cluster).

Monitoring dashboards

We have some useful pre-built monitoring dashboards on GitLab’s internal Grafana instance. All dashboards are listed in this folder. Please note that some of them are fairly outdated.

The following dashboards are most common:

A Gitaly dashboard can be either auto-generated or manually drafted. We use Jsonnet (a superset of JSON) to achieve dashboards-as-code; the definitions of such dashboards are located in this folder. This is now the recommended way to manage observability dashboards, because it lets us use GitLab’s built-in libraries and produces highly standardized dashboards.

A standardized dashboard has a top-level section containing environment filters, node filters, and useful annotations such as feature flag activity, deployments, and so on. Some dashboards also interlink Grafana and Kibana, so you can jump between them with a single click.

Such dashboards usually consist of two parts. The second half contains panels of custom metrics collected from Gitaly. The first half is more complicated: it contains GitLab-wide indicators of whether Gitaly is “healthy”, along with node-level resource metrics. The aggregation and calculation behind them are sophisticated. In summary, those dashboards tell us whether Gitaly is performing well according to predefined thresholds. Contact the Scalability:Observability team with any questions.

Gitaly Debug Indicators

Some examples of using built-in dashboards to investigate production issues, from an Engineer’s point of view:

Gitaly’s Prometheus metrics

A panel in a dashboard is a visualization of an aggregation of underlying metrics. We use Prometheus to collect metrics: put simply, the Gitaly process runs an HTTP server (code) from which Prometheus instances periodically scrape metrics.
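
For example, you can fetch the raw metrics directly on a Gitaly node. The following is a minimal sketch; the port assumes the default prometheus_listen_addr of localhost:9236 and may differ per environment.

# Dump Gitaly's raw Prometheus metrics and pick out one metric family.
# localhost:9236 is an assumption (the default monitoring port); adjust if configured differently.
curl -s http://localhost:9236/metrics | grep '^gitaly_pack_objects_served'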

In a dashboard, you can click the top-right hamburger button and choose “Explore” to access the underlying metrics. Alternatively, you can use the Explore page to experiment with metrics directly.

Gitaly Debug Explore

Unfortunately, we don’t have a curated list of all Gitaly metrics and their definitions, so you might need to look up a definition in multiple places. Here is the list of all Gitaly-related metrics. There are several sources:

  • Node-level or environmental metrics. These are emitted by the systems that host the Gitaly process, not by Gitaly itself, but they are very useful: for example, CPU, memory, or cgroup metrics.
  • Gitaly-specific metrics. These are recorded directly in the code and typically have a gitaly_ prefix.
  • Aggregated metrics, which combine other metrics or downsample high-cardinality ones. Gitaly’s aggregated metrics are listed in this file.

Gitaly Debug Metric Lists

In the code, you’ll see registrations like the following. Any registered metric is exposed when Prometheus scrapes the endpoint, so tracing these registrations is the quickest way to find where a Gitaly-specific metric is recorded.

// Register a custom collector with the default Prometheus registry.
repoCounter := counter.NewRepositoryCounter(cfg.Storages)
prometheus.MustRegister(repoCounter)

// Declare a plain counter; promauto registers it automatically.
packObjectsServedBytes = promauto.NewCounter(prometheus.CounterOpts{
  Name: "gitaly_pack_objects_served_bytes_total",
  Help: "Number of bytes of git-pack-objects data served to clients",
})
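
A quick way to trace a metric back to its definition is to grep a local Gitaly checkout for its name. A minimal sketch (the checkout path is a placeholder):

# Find where a metric name is declared in the Gitaly source tree.
# ~/src/gitaly is a placeholder for wherever your checkout lives.
git -C ~/src/gitaly grep -n 'gitaly_pack_objects_served_bytes_total'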

A metric has a set of labels. GitLab adds the following set of labels to all metrics:

  • env or environment: the environment, for example gprd, gstg, or ops.
  • fqdn: the fully qualified domain name. Because Gitaly currently runs on VMs, this label identifies the hosting node.
  • region and zone: the region and zone of the node.
  • stage: the deployment stage of the process, either main or cny.
  • service/type: for Gitaly, it’s always gitaly.

In the future, when Gitaly runs on Kubernetes, we will probably have more K8s-specific labels.

Queries are written in PromQL. Some examples:
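
For instance, here is a minimal sketch that queries a metric over the Prometheus HTTP API; the base URL is a placeholder, and in practice you would usually run the same PromQL from Grafana’s Explore page:

# Per-node rate of bytes served by git-pack-objects in production over the last 5 minutes.
# The Prometheus base URL below is a placeholder, not a real internal endpoint.
curl -sG 'https://prometheus.example.internal/api/v1/query' \
  --data-urlencode 'query=sum by (fqdn) (rate(gitaly_pack_objects_served_bytes_total{environment="gprd"}[5m]))'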

Debugging and performance testing tools

  • grpcurl: a curl-like tool, but for gRPC
  • grpcui: a lightweight, Postman-like UI for gRPC
  • hyperfine: a command-line benchmarking tool that times a command over many runs
    • hyperfine can be used together with grpcurl to check the response time of a gRPC call; see the sketch after this list
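
A minimal sketch of combining the two. The Gitaly address, repository coordinates, and RPC are illustrative only; production Gitaly also requires an authorization token, and you may need to pass the Gitaly .proto files to grpcurl if server reflection is not available:

# Benchmark a single Gitaly RPC by wrapping grpcurl in hyperfine.
# localhost:8075 and the repository path are placeholders; add the required
# authorization metadata for a real server.
hyperfine --warmup 1 --runs 10 \
  "grpcurl -plaintext -d '{\"repository\":{\"storage_name\":\"default\",\"relative_path\":\"@hashed/aa/bb/example.git\"}}' localhost:8075 gitaly.RepositoryService/RepositoryExists"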

strace

strace(1) a gitaly process:

strace -fttTyyy -s 1024 -o /path/to/output -p $(pgrep -fd, gitaly)

Or wrap a binary in a script to make it easier to strace, especially if it spawns more processes:

#!/bin/sh
echo "$(date) $PPID $@" >> /tmp/gitlab-shell.txt
exec /opt/gitlab/embedded/service/gitlab-shell/bin/gitlab-shell-orig "$@"
# Or, to trace instead of just logging:
# exec strace -fttTyyy -s 1024 -o /tmp/sshd_trace-$PPID /opt/gitlab/embedded/service/gitlab-shell/bin/gitlab-shell-orig "$@"

The strace parser tool is useful for making the results more readable.

fast-stats

fast-stats is a useful tool developed by Support to quickly pull statistics from GitLab logs.

Examples

To find, within a single 60-minute interval, the top methods called according to the Gitaly logs:

fast-stats --interval 60m --limit 1 var/log/gitlab/gitaly/current

To find the top 10 users, projects, and clients by duration calling that method:

grep PostUploadPackWithSidechannel var/log/gitlab/gitaly/current | ~/bin/fast-stats --interval 60m top

git

Snoop on Gitaly Git commands.

  1. Stop Gitaly.
  2. Rename the gitaly-git binaries: find /opt/gitlab/embedded/bin/gitaly-git-v* -exec mv {} {}_orig \;
  3. Create a wrapper script for each Git version. Make sure you replace /opt/gitlab/embedded/bin/gitaly-git-vX.XX_orig in the script with the right version.
#!/bin/bash
GIT="/opt/gitlab/embedded/bin/gitaly-git-vX.XX_orig"
FILE="/tmp/gitaly-$(date +%Y-%m-%d@%H:%M)"
echo -e "\n$(date) $PPID $@\n" >> "$FILE"
"$GIT" "$@" | tee -a "$FILE"
status=${PIPESTATUS[0]} # preserve git's exit code rather than tee's
echo -e "\n--------------\n" >> "$FILE"
exit "$status"
  4. Make the scripts executable: find /opt/gitlab/embedded/bin/gitaly-git-v* -exec chmod 777 {} \;
  5. Start Gitaly.

Log analysis

Kibana (Elastic) Dashboards

CPU and memory profiling

pprof profiling data is exported to the GCP Cloud Profiler.

Note that Gitaly nodes are distributed across a number of GCP projects. You can use the project dropdown on the top nav bar to switch between the various gitlab-gitaly-gstg-* and gitlab-gitaly-gprd-* projects.

Capacity management

The Gitaly team is responsible for maintaining reasonable serving capacity for gitlab.com.

We get alerts from Tamland if capacity runs low; see this issue comment.

Capacity planning documentation explains how this works in general.