Monitoring of GitLab.com

GitLab.com Service Availability

The methodology for calculating GitLab.com Service Availability is defined in the monitoring policy.

More details on the definitions of outage and degradation are on the incident-management page.

Historical Service Availability

Year Month	Availability	Comments
2025 July	99.91%
2025 June	99.84%
2025 May	99.73%
2025 April	99.97%
2025 March	100.00%
2025 February	99.99%
2025 January	99.98%
2024 December	99.95%
2024 November	100.00%
2024 October	99.66%
2024 September	99.85%
2024 August	100.00%
2024 July	99.99%
2024 June	99.99%
2024 May	100.00%
2024 April	99.96%
2024 March	100%
2024 February	99.86%
2024 January	100%
2023 December	99.99%
2023 November	99.99%
2023 October	99.89%	Severity 1 incident on October 30
2023 September	99.98%
2023 August	100%
2023 July	99.78%	Two severity 1 incidents (2023-07-07 and 2023-07-14) accounted for ~94% of the service disruption.
2023 June	100%
2023 May	99.92%
2023 April	99.98%
2023 March	99.99%
2023 February	99.98%
2023 January	99.80%
2022 December	100%
2022 November	99.86%
2022 October	100%
2022 September	99.98%
2022 August	99.92%
2022 July	99.95%
2022 June	99.96%
2022 May	99.99%
2022 April	99.98%
2022 March	99.91%
2022 February	99.87%
2022 January	99.95%
2021 December	99.96%
2021 November	99.71%
2021 October	99.98%
2021 September	99.85%
2021 August	99.86%
2021 July	99.78%
2021 June	99.84%
2021 May	99.85%	Does not include the manual adjustment for the PostgreSQL 12 upgrade
2021 April	99.98%
2021 March	99.34%
2021 February	99.87%
2021 January	99.88%
2020 December	99.96%
2020 November	99.90%
2020 October	99.74%
2020 September	99.95%
2020 August	99.87%
2020 July	99.81%
2020 June	99.56%
2020 May	99.58%
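As a rough way to read the table above, a monthly availability percentage can be translated into an approximate amount of disruption. The sketch below assumes availability is treated as a plain fraction of wall-clock time in the month, which is a simplification: the official figure comes from the methodology in the monitoring policy, not from a simple uptime clock. The function name `implied_disruption_minutes` is illustrative and not part of any GitLab tooling.

```python
import calendar


def implied_disruption_minutes(year: int, month: int, availability_pct: float) -> float:
    """Approximate minutes of disruption implied by a monthly availability figure.

    Assumption: availability is a plain fraction of wall-clock time in the month.
    """
    days_in_month = calendar.monthrange(year, month)[1]
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Figures taken from the table above.
for year, month, pct in [(2025, 7, 99.91), (2025, 5, 99.73), (2021, 3, 99.34)]:
    minutes = implied_disruption_minutes(year, month, pct)
    print(f"{year}-{month:02d}: {pct}% is roughly {minutes:.0f} minutes")
```

Under this reading, 99.91% in July 2025 corresponds to roughly 40 minutes of disruption across the month.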

These videos provide examples of how to quickly identify failures, defects, and problems related to servers, networks, databases, security, and performance.

Monitoring Tools playlist (requires GitLab Unfiltered YouTube account access)
Visualization Tools playlist (requires GitLab Unfiltered YouTube account access)

Monitoring

Pingdom Statistics

We use our apdex-based measurements to report official availability (see above). However, we also run public Pingdom checks that give a representative view of the overall performance of GitLab.com. These are available at https://stats.pingdom.com. Specifically, they report the availability and latency of reaching:

a GitLab.com issue. For reference, it is the first gitlab-ce issue.
GitLab.com “plain and simple” called the GitLab public check.
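To illustrate what a blackbox check of this kind measures, here is a minimal sketch of a probe that records whether a URL responded successfully and how long it took. It is not how Pingdom itself works; the names `probe`, `http_status`, and `availability`, and the choice to count only 2xx responses as success, are assumptions made for the example.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, NamedTuple


class ProbeResult(NamedTuple):
    ok: bool
    latency_seconds: float


def http_status(url: str, timeout: float = 10.0) -> int:
    """Fetch the URL and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses are raised as HTTPError


def probe(url: str, fetch: Callable[[str], int] = http_status) -> ProbeResult:
    """Time one request and report whether it succeeded (assumed: status 2xx)."""
    start = time.monotonic()
    try:
        status = fetch(url)
    except OSError:  # URLError and socket timeouts are OSError subclasses
        status = 0
    latency = time.monotonic() - start
    return ProbeResult(ok=200 <= status < 300, latency_seconds=latency)


def availability(results: list[ProbeResult]) -> float:
    """Percentage of probes that succeeded."""
    if not results:
        raise ValueError("no probe results")
    return 100.0 * sum(r.ok for r in results) / len(results)
```

Calling `probe("https://gitlab.com/")` at a fixed interval and passing the collected results to `availability` would give a crude external view of the site. The `fetch` parameter exists so the timing logic can be exercised without network access.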

Monitoring Infrastructure

We use Grafana Mimir to ingest and query metrics. Mimir is an open-source, distributed time series database that extends Prometheus. You can read more about its implementation in our Runbook Docs.

Monitoring Dashboards

Metrics can be viewed in Grafana. The Grafana Explore dashboard allows querying of all data in Mimir using PromQL.

Access requires a @gitlab.com email address through Google SSO
Highly Available setup
Alerting feeds from this setup
Separated from the public for compliance, security and availability reasons
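PromQL can also be sent over the Prometheus-compatible HTTP API that Mimir provides, which is useful for scripted checks. The sketch below only builds the request URL and sends nothing. The host name, the metric name, and the `/prometheus` path prefix are assumptions (the prefix is configurable per deployment), and a real request would also need to authenticate.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str, prefix: str = "/prometheus") -> str:
    """Build a URL for the Prometheus-style instant query endpoint (/api/v1/query)."""
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}{prefix}/api/v1/query?{query_string}"


# Hypothetical host and metric name, for illustration only.
url = instant_query_url(
    "https://mimir.example.com",
    "sum(rate(http_requests_total[5m]))",
)
print(url)
```

The same PromQL expression can be pasted unchanged into the Grafana Explore view, which is usually the quicker route for interactive work.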

Adding Dashboards

To learn how to set up a new graph or dashboard using Grafana, take a look at the following resources:

Guide to setting up Grafana dashboards by Grafana
YouTube video showing how to set up a dashboard
The Grafana repo where we keep an archive of InfluxDB dashboards created in Grafana. Use these to see details in the file structure, but note that the repo is truly an archive (nothing populates from it) and can be out of date.

Need access to add a dashboard? Ask any team lead within the infrastructure team.

Dashboards for stage groups

We have a set of monitoring dashboards designed for each stage group. They give everyone working in a feature category insight into how their code operates at GitLab.com scale. They are grouped per stage group to show the impact of feature and code changes, deployments, and feature-flag toggles.

The dashboards for stage groups are at a very early stage. All contributions are welcome. If you have any questions or suggestions, please submit an issue in the Scalability Team issue tracker.

Selection of Useful Monitoring Dashboards

Blackbox Monitoring

GitLab Web Status: front-end perspective of GitLab. Useful to understand how GitLab.com looks from the user perspective. Use this graph to quickly troubleshoot which part of GitLab is slow.
GitLab Git Status: front-end perspective of GitLab SSH access.

Private Whitebox Monitor

Host Stats: useful to dive deep into a specific host to understand what is going on with it. Select a host from the dropdown at the top.
Business Stats: shows how many pushes, new repos, and CI builds there are.
Daily overview: shows endpoints with their number of calls and performance metrics. Useful to understand what is slow in general.

Logs

Network, system, and application logs are processed, stored, and searched using the ELK stack. We use a managed Elasticsearch cluster on GCP. Because it is managed by Elastic, they run the VMs and we do not have access to them; our only interfaces are the APIs, Kibana, and the elastic.co web UI. For monitoring system performance and metrics, Elastic’s X-Pack monitoring metrics are used, and they are sent to a dedicated monitoring cluster. Long-term, we intend to switch to Prometheus and Grafana as the preferred interface. For investigating errors and incidents, raw logs are available via Kibana. Staging logs are available via a separate Kibana instance.

Kibana dashboards are used to monitor application activity, spam events, transient errors, system and network authentication events, security events, etc. Commonly used dashboards are the Abuse, SSH, and Rack Attack dashboards.
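Since the managed cluster is reached through APIs and Kibana, ad hoc log searches can be written in the Elasticsearch query DSL. The sketch below builds a search body for recent entries containing a phrase. The field names (`@timestamp`, `message`), the default 15-minute window, and the function name `recent_logs_query` are assumptions for illustration; check the actual index mapping before relying on them.

```python
import json


def recent_logs_query(phrase: str, minutes: int = 15, size: int = 50) -> dict:
    """Build an Elasticsearch search body for recent entries containing a phrase."""
    return {
        "size": size,
        # Newest entries first.
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                # Scored match on the (assumed) message field.
                "must": [{"match_phrase": {"message": phrase}}],
                # Unscored time filter relative to now.
                "filter": [{"range": {"@timestamp": {"gte": f"now-{minutes}m"}}}],
            }
        },
    }


print(json.dumps(recent_logs_query("connection refused"), indent=2))
```

The printed JSON is a standard `_search` request body, so it can be adapted for the Kibana Dev Tools console or sent through the search API against whichever index pattern holds the logs of interest.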

How we log our infrastructure is outlined in our runbook.

The policy related to log management can be found in the monitoring policy.

Adding dashboards

To learn how to create Kibana dashboards, use the following resources:

GitLab Profiling

Go services

Stackdriver Continuous Go Profiling can be used to better understand how our Go services perform and consume resources (requires membership of the Google Workspace stackdriver-profiler-sg group).

It provides a simple UI on GCP with CPU and Memory usage data for:

For more information, there’s a quick video tutorial available.

We also did a series of deep dives, pairing with the development teams for each project in this issue. This resulted in the following videos:

Instrumenting Ruby to Monitor Performance

Blocks of Ruby code can be “instrumented” to measure performance.

Documentation of instrumentation with more detail on how to implement this
An example of how this is used for GitLab itself can be found in this initializer.

Other Tools

Sentry

Error tracking service.

Setting Sentry alerts for your group

Creating alert rules allows groups to monitor their features and catch issues proactively. This helps get issues fixed before they breach the error budget SLO, which in turn helps keep GitLab.com Service Availability high.

Steps for creating the alerts:

Visit Sentry’s alert rules dashboard.
Click the “Create Alert” button at the top right.
Set the required conditions according to your group’s feature categories.
Create a new public Slack channel with the naming convention “g_group_name_alerts”. For example: #g_govern_compliance_alerts
Select this channel for sending the alert notifications.
Monitor the channel for any new alerts and work towards resolving them.
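The channel naming convention in the steps above can be sketched as a small helper. The function name `alert_channel_name` and the normalization rule (lowercase, with any run of non-alphanumeric characters collapsed to a single underscore) are assumptions for illustration; only the “g_group_name_alerts” pattern itself comes from the steps.

```python
import re


def alert_channel_name(group_name: str) -> str:
    """Turn a group name into a Slack channel name of the form g_<group>_alerts."""
    # Assumed rule: lowercase, non-alphanumeric runs become one underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", group_name.lower()).strip("_")
    if not slug:
        raise ValueError("group name must contain at least one letter or digit")
    return f"g_{slug}_alerts"


print(alert_channel_name("Govern Compliance"))  # g_govern_compliance_alerts
```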

Sitespeed.io

Tool that helps you monitor, analyze and optimize your website speed and performance.

Staging Monitoring

How Staging is monitored and how traffic is generated

Last modified August 2, 2025: Availability Update for GitLab.com (5f695d78)
