Diagnose Errors on GitLab.com

This guide provides resources for diagnosing HTTP 5XX errors on GitLab.com.

Overview

Use this guide when a user contacts support stating they’re receiving either a 500 or 503 error on GitLab.com.

Reports of Slowness

If reports of slowness are received on GitLab.com, first take a look at the GitLab Grafana Monitor, especially:

  • Worker CPU -> Git CPU Percent

  • Worker Load -> Git Worker Load

Degraded Runner Performance

If a customer reports a shared runner running slower than it normally does, it is likely that performance was degraded during the period the customer experienced slowness on the pipeline.

Check the CI Runners Overview graphs, where you will find a drop in queue apdex and an increase in latency.

Check on the #feed_alerts, #production, and #incident-management Slack channels to ensure this isn’t an outage or infrastructure issue.

If you notice slowness yourself on GitLab.com

Before you post to #production or make an issue, here are some helpful ways to capture data that can narrow down the issue:

  1. Add the performance_bar=flamegraph query parameter to the page URL to generate a CPU flamegraph.
  2. Use the performance bar by typing pb in your browser window. Reload the page and grab the information from the server side.
  3. If using Chrome, open the Chrome developer tools (View > Developer > Developer Tools), reload the page, and look at the Network tab. This will show all of the requests and times.
  4. If using Firefox, there is a similar network view under Tools > Web Developer > Network which will show requests and timing.

Screenshots from any of these tools will greatly help any engineers looking into the problems.
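If the user is more comfortable on the command line, similar timing data can be captured with curl's standard `--write-out` variables. This is a minimal sketch; `time_request` is a hypothetical helper name, not an existing tool:

```shell
# Hypothetical helper: print per-phase timings for a single request.
# The %{time_*} variables are standard curl --write-out features.
time_request() {
  curl -sS -o /dev/null \
    -w 'dns=%{time_namelookup}s connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
    "$1"
}

# Example (requires network access):
# time_request https://gitlab.com
```

A high `ttfb` (time to first byte) relative to `connect` usually points at server-side processing rather than the network.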

Connection Troubleshooting

If our customer is reporting problems connecting to GitLab.com, we should ask them to run the following commands and provide the output:

traceroute gitlab.com
curl https://gitlab.com/cdn-cgi/trace
curl -svo /dev/null https://gitlab.com
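The `/cdn-cgi/trace` endpoint returns one `key=value` pair per line; the `ip` and `colo` fields (the client IP and the Cloudflare data center that served the request) are usually the most useful for the ticket. A minimal sketch for pulling those out, with `trace_summary` as a hypothetical helper name:

```shell
# Hypothetical helper: reads /cdn-cgi/trace output on stdin and keeps only
# the fields that are typically useful when triaging a connection issue.
trace_summary() {
  grep -E '^(ip|colo|http|visit_scheme)='
}

# Example (requires network access):
# curl -s https://gitlab.com/cdn-cgi/trace | trace_summary
```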

Reports of File Corruption

A 503 error on a merge request page may also happen if the repository is corrupted. To confirm, a push to a corrupted repository may show the following:

data/repositories/@hashed/ee/98/ee98b34f343b4e48106fff666d12b61f23f.git/objects/f7/e7f4782) is corrupt

If the customer reports an error similar to the one above, take the following steps to verify whether their file server was affected:

  1. Obtain the project URL of the affected repository.
  2. Open the project admin page using the URL https://gitlab.com/admin/projects/user-namespace.
  3. Locate the server of the repository by looking at gitaly-storage-name.
  4. Search the GitLab Infrastructure issue tracker for any related issue.
  5. If an issue is found related to the file server, post the ticket number in the issue so an infrastructure engineer can look into it.
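As an alternative to the admin page, the storage name can also be read from the projects API, assuming you have an admin token: the admin-only `repository_storage` field in the project JSON holds the Gitaly storage name. A sketch, where `storage_name` is a hypothetical helper and `PROJECT_ID` a placeholder:

```shell
# Hypothetical helper: extract the admin-only "repository_storage" field
# (the Gitaly storage name) from a projects API JSON response on stdin.
# Crude pattern-based extraction; fine for a one-off support check.
storage_name() {
  sed -n 's/.*"repository_storage":"\([^"]*\)".*/\1/p'
}

# Example (requires an admin token; PROJECT_ID is a placeholder):
# curl -s --header "PRIVATE-TOKEN: $ADMIN_TOKEN" \
#   "https://gitlab.com/api/v4/projects/PROJECT_ID" | storage_name
```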

Workflows

The following workflows will guide you on how to search Kibana and/or Sentry for the event in our logs that caused a particular 5XX error.

Searching Kibana

See the 500-specific section in the Kibana workflow.

Searching Sentry

See the Sentry workflow.

A video walkthrough of investigating 500 errors using Kibana and Sentry can be seen here (GitLab Unfiltered).

Get the results into an issue

Once results have been found in either Kibana or Sentry, do the following.

  1. Gather as much information as possible. Make an internal note on the ticket including links to the logs found in either Kibana or Sentry.
  2. Search the GitLab issue tracker for any duplicate or related issue.
  3. Confirm if the issue is known or unknown and proceed accordingly: Issue is known or Issue is unknown.

In a Priority 1/Severity 1 situation, consider a dev escalation.

Responding to the user

Issue is known

If the issue is known it should have a corresponding issue in the GitLab issue tracker. If you found an entry in Sentry that has been converted into an issue, you should see the issue number in the header within Sentry:

Sentry linked issue

Click the issue number to be taken directly to the issue where you can leave a comment to provide a link to the Zendesk ticket.

Then, respond to the user with information about the cause of the issue, provide a link to it, and invite them to subscribe to it for updates.

Issue is unknown
Issues found in Sentry
  1. Convert the issue to a GitLab issue by using the “Create GitLab Issue” button on the issue page. Note: There is a known issue with the GitLab integration in Sentry that prevents this from working, due to the number of projects under the gitlab-org group. As a result you will most likely need to create the GitLab issue manually.
  2. Comment on the issue providing a link to the Zendesk ticket.
  3. Add any additional labels if needed such as customer, priority and severity, and the appropriate DevOps stage.
  4. Respond to the user with information about the cause of the issue, provide a link to it, and invite them to subscribe to it for updates.
Issues found in Kibana
  1. Get a “short url” to the Kibana logs.
  2. Create a new GitLab issue and be sure to include a link to the Zendesk ticket along with the Kibana logs.
  3. Add the bug label and any others if needed such as customer, priority and severity, and the appropriate DevOps stage.
  4. Respond to the user with information about the cause of the issue, provide a link to it, and invite them to subscribe to it for updates.

Note: If a 5xx error is found in Kibana, there is a high chance that there is also a Sentry issue for it. In those cases, add the json.correlation_id filter in Kibana, then search for that value in Sentry using the correlation_id field:
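If you have an exported Kibana log line, the correlation ID can be pulled out and turned into a ready-to-paste Sentry search term. A minimal sketch; `correlation_id` is a hypothetical helper name:

```shell
# Hypothetical helper: read an exported JSON log line on stdin and print
# a Sentry search term of the form correlation_id:<value>.
correlation_id() {
  sed -n 's/.*"correlation_id":"\([^"]*\)".*/correlation_id:\1/p'
}

# Example with a saved log line:
# correlation_id < kibana-log.json
```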

Last modified November 14, 2024: Fix broken external links (ac0e3d5e)