Monitoring of GitLab.com

GitLab.com Service Availability

The methodology for calculating GitLab.com Service Availability is defined in the monitoring policy.

More details on the definitions of outage and degradation are on the incident-management page.

Historical Service Availability

Year Month Availability Comments
2024 September 99.85%
2024 August 100.00%
2024 July 99.99%
2024 June 99.99%
2024 May 100.00%
2024 April 99.96%
2024 March 100%
2024 February 99.86%
2024 January 100%
2023 December 99.99%
2023 November 99.99%
2023 October 99.89% Severity 1 incident on October 30
2023 September 99.98%
2023 August 100%
2023 July 99.78% Two severity 1 incidents (2023-07-07 and 2023-07-14) contributed to ~94% of service disruption.
2023 June 100%
2023 May 99.92%
2023 April 99.98%
2023 March 99.99%
2023 February 99.98%
2023 January 99.80%
2022 December 100%
2022 November 99.86%
2022 October 100%
2022 September 99.98%
2022 August 99.92%
2022 July 99.95%
2022 June 99.96%
2022 May 99.99%
2022 April 99.98%
2022 March 99.91%
2022 February 99.87%
2022 January 99.95%
2021 December 99.96%
2021 November 99.71%
2021 October 99.98%
2021 September 99.85%
2021 August 99.86%
2021 July 99.78%
2021 June 99.84%
2021 May 99.85% Does not include the manual adjustment for the PostgreSQL 12 upgrade
2021 April 99.98%
2021 March 99.34%
2021 February 99.87%
2021 January 99.88%
2020 December 99.96%
2020 November 99.90%
2020 October 99.74%
2020 September 99.95%
2020 August 99.87%
2020 July 99.81%
2020 June 99.56%
2020 May 99.58%
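
To put the percentages above in perspective, a monthly availability figure can be converted into a rough amount of service disruption. The sketch below assumes availability is a plain fraction of wall-clock time, which is a simplification: the official figures come from Apdex-based measurements, so the result is only an approximate equivalent, not actual minutes of outage. The function name `disruption_minutes` is illustrative.

```python
import calendar


def disruption_minutes(year: int, month: int, availability_pct: float) -> float:
    """Approximate minutes of disruption implied by a monthly availability figure.

    Assumes availability is a plain fraction of wall-clock time, which
    simplifies the Apdex-based methodology behind the official numbers.
    """
    days_in_month = calendar.monthrange(year, month)[1]
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# September 2024 has 30 days; 99.85% leaves roughly an hour of disruption.
print(round(disruption_minutes(2024, 9, 99.85), 1))  # 64.8
```

Under the same assumption, a month reported at 100% corresponds to zero minutes of disruption.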

These videos provide examples of how to quickly identify failures, defects, and problems related to servers, networks, databases, security, and performance.

Monitoring

Pingdom Statistics

We use our Apdex-based measurements to report official availability (see above). However, we also have some public Pingdom tests that give a representative view of the overall performance of GitLab.com. These are available at https://stats.pingdom.com. Specifically, this has the availability and latency of reaching

Monitoring Infrastructure

We use Grafana Mimir to ingest and query metrics. Mimir is an open-source, distributed time series database that extends Prometheus. You can read more about its implementation in our Runbook Docs.

Monitoring Dashboards

Metrics can be viewed in Grafana. The Grafana Explore dashboard allows querying of all data in Mimir using PromQL.

  • Access requires a @gitlab.com email address through Google SSO
  • Highly Available setup
  • Alerting feeds from this setup
  • Separated from the public for compliance, security and availability reasons
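
As an illustration of what a PromQL query against a Prometheus-compatible backend looks like outside Grafana, the sketch below builds a request URL for the instant-query endpoint (`/api/v1/query`). The host name, the `/prometheus` path prefix, and the metric name are placeholders rather than GitLab's actual endpoint or metrics, and authentication is omitted. The same PromQL expression could instead be entered in Grafana Explore.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str, prefix: str = "/prometheus") -> str:
    """Build a URL for a Prometheus-compatible instant query.

    `base_url` and `prefix` are assumptions; Mimir serves the Prometheus
    HTTP API under a configurable path prefix.
    """
    params = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}{prefix}/api/v1/query?{params}"


# Hypothetical host and metric, for illustration only.
query = 'sum(rate(http_requests_total{env="gprd"}[5m]))'
print(instant_query_url("https://mimir.example.com", query))
```

`urlencode` takes care of escaping the braces, quotes, and brackets that PromQL expressions typically contain.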

Adding Dashboards

To learn how to set up a new graph or dashboard using Grafana, take a look at the following resources:

Need access to add a dashboard? Ask any team lead within the infrastructure team.

Dashboards for stage groups

We have a set of monitoring dashboards for each stage group. They give everyone working in a feature category insight into how their code operates at GitLab.com scale. They are grouped per stage group to show the impact of feature and code changes, deployments, and feature-flag toggles.

  1. List of dashboards for each stage group (GitLab team members only).
  2. Guide to getting started with dashboards for stage groups
  3. YouTube video introducing the stage group dashboards

The dashboards for stage groups are at a very early stage. All contributions are welcome. If you have any questions or suggestions, please submit an issue in the Scalability Team issue tracker.

Selection of Useful Dashboards from the Monitoring

Blackbox Monitoring

  • GitLab Web Status: front-end perspective of GitLab. Useful for understanding how GitLab.com looks from the user's perspective. Use this graph to quickly identify which part of GitLab is slow.
  • GitLab Git Status: front-end perspective of GitLab SSH access.

Private Whitebox Monitor

  • Host Stats: useful for diving deep into a specific host to understand what is going on with it. Select a host from the dropdown at the top.
  • Business Stats: shows how many pushes, new repos, and CI builds occur.
  • Daily overview: shows endpoints with their number of calls and performance metrics. Useful for understanding what is slow in general.

Logs

Network, system, and application logs are processed, stored, and searched using the ELK stack. We use a managed Elasticsearch cluster on GCP. Because Elastic manages the cluster and runs the VMs, we do not have access to them; our only interfaces are the APIs, Kibana, and the elastic.co web UI. System performance and metrics are monitored with Elastic’s X-Pack monitoring metrics, which are sent to a dedicated monitoring cluster. Long-term, we intend to switch to Prometheus and Grafana as the preferred interface. For investigating errors and incidents, raw logs are available via Kibana. Staging logs are available via a separate Kibana instance.
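
Since the cluster is reachable through APIs, a log search can also be expressed as an Elasticsearch `_search` request body. The sketch below builds such a body for recent log lines with a given status code. The field name `json.status` is hypothetical and the real index mapping may differ; `@timestamp` is the conventional timestamp field in ELK setups but is likewise an assumption here.

```python
import json


def recent_errors_query(status: int, window: str = "15m", size: int = 50) -> dict:
    """Build an Elasticsearch `_search` body for recent log lines with a given status.

    The `json.status` field is a hypothetical name; check the index mapping
    for the real one.
    """
    return {
        "size": size,
        # Newest entries first.
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                # Filter context: exact matching, no relevance scoring.
                "filter": [
                    {"term": {"json.status": status}},
                    {"range": {"@timestamp": {"gte": f"now-{window}"}}},
                ]
            }
        },
    }


print(json.dumps(recent_errors_query(500), indent=2))
```

The filter context is used because log searches of this kind usually need exact matches rather than relevance ranking.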

Kibana dashboards are used to monitor application activity, spam events, transient errors, system and network authentication events, security events, etc. Commonly used dashboards are the Abuse, SSH, and Rack Attack dashboards.

Our runbook outlines how we log our infrastructure.

The policy related to log management can be found in the monitoring policy.

Adding dashboards

To learn how to create Kibana dashboards, use the following resources:

GitLab Profiling

Go services

Stackdriver Continuous Go Profiling can be used to better understand how our Go services perform and consume resources. Access requires membership of the Google Workspace stackdriver-profiler-sg group.

It provides a simple UI on GCP with CPU and Memory usage data for:

For more information, there’s a quick video tutorial available.

We also did a series of deep dives by pairing with the development teams for each project in this issue. This resulted in the following videos:

Instrumenting Ruby to Monitor Performance

Blocks of Ruby code can be “instrumented” to measure performance.

Other Tools

Sentry

Error tracking service.

Setting Sentry alerts for your group

Creating alert rules allows groups to monitor their features and catch issues proactively. Issues can then be fixed before they breach the error budget SLO, which in turn helps keep GitLab.com Service Availability high.

Steps for creating the alerts:

  1. Visit Sentry’s alert rules dashboard.
  2. Click the “Create Alert” button at the top right.
  3. Set the required conditions as per your group’s feature categories.
  4. Create a new public Slack channel using the naming convention “g_group_name_alerts”, for example #g_govern_compliance_alerts.
  5. Select this channel for sending the alert notifications.
  6. Monitor the channel for any new alerts and work towards resolving them.
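
The channel naming convention from the steps above can be sketched as a small helper. The function name `alert_channel_name` is illustrative, and the normalisation rules (lowercasing, and collapsing runs of other characters into underscores) are assumptions inferred from the single example given.

```python
import re


def alert_channel_name(group_name: str) -> str:
    """Return a Slack channel name following the "g_group_name_alerts" convention.

    Lowercasing and replacing runs of non-alphanumeric characters with "_"
    are assumptions based on the example #g_govern_compliance_alerts.
    """
    slug = re.sub(r"[^a-z0-9]+", "_", group_name.lower()).strip("_")
    if not slug:
        raise ValueError("group name must contain letters or digits")
    return f"g_{slug}_alerts"


print(alert_channel_name("Govern Compliance"))  # g_govern_compliance_alerts
```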

Sitespeed.io

Tool that helps you monitor, analyze and optimize your website speed and performance.


Staging Monitoring
How Staging is monitored and how traffic is generated
Last modified November 14, 2024: Fix broken external links (ac0e3d5e)