Observability alert triage
When the GitLab Observability alert system fires, an issue is automatically created in the contributors-gitlab-com tracker and the team is pinged through the service desk. Use the steps below to triage those alerts.
1. Open the alert issue
The issue title looks like:
[FIRING:1] <rule-id> (Error log entry ... error error)
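If you want to pull the rule ID out of a title programmatically, a minimal sketch follows. It assumes the title always has the shape `[STATUS:count] rule-id (summary)` and that the rule ID contains no spaces; both are inferred from the single example above, not from a documented format. The names `parse_alert_title` and `AlertTitle` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed title shape, inferred from the example above:
#   [FIRING:<count>] <rule-id> (<summary>)
TITLE_PATTERN = re.compile(
    r"^\[(?P<status>[A-Z]+):(?P<count>\d+)\]\s+"
    r"(?P<rule_id>\S+)"
    r"(?:\s+\((?P<summary>.*)\))?\s*$"
)


class AlertTitle(NamedTuple):
    status: str
    count: int
    rule_id: str
    summary: Optional[str]


def parse_alert_title(title: str) -> Optional[AlertTitle]:
    """Split an alert issue title into its parts, or return None if it does not match."""
    match = TITLE_PATTERN.match(title.strip())
    if match is None:
        return None
    return AlertTitle(
        status=match["status"],
        count=int(match["count"]),
        rule_id=match["rule_id"],
        summary=match["summary"],  # None when the title has no parenthesized part
    )
```

A title that does not match the assumed shape returns `None`, so a caller can fall back to reading the issue manually.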
Open the issue. The description contains:
- ruleSource: a direct link to the alert rule in the observability UI.
- related_logs: a pre-filtered log explorer link scoped to the time the alert fired.
- description: the threshold that was crossed (for example, “observed value: 1, threshold: 0”).
The format and the available information are not ideal for quickly seeing why an alert fired. This is a known issue tracked in the gitlab_o11y project.
2. Open the logs
Use the related_logs link from the issue description. It opens the group logs explorer with the correct filters already applied.
3. Read the error
Expand the log entries. Identify:
- The error message and stack trace.
- Whether the error is isolated (one or two occurrences) or sustained.
- Whether it points to a known cause (for example, a transient DB connection drop, an expired token, a downstream API failure).
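The isolated-versus-sustained check can be sketched as a small helper. It assumes you have already collected the timestamps of the matching log entries by hand or from an export. The function name `classify_occurrences` is hypothetical, and the default cutoff of two occurrences mirrors the "one or two occurrences" wording above rather than any configured threshold.

```python
from datetime import datetime, timedelta
from typing import Sequence, Tuple


def classify_occurrences(
    timestamps: Sequence[datetime],
    isolated_max: int = 2,  # assumption: "one or two occurrences" counts as isolated
) -> Tuple[str, int, timedelta]:
    """Return (kind, count, span) for the matching log entries."""
    count = len(timestamps)
    if count == 0:
        return ("none", 0, timedelta(0))
    # Span between the first and last occurrence, useful context for the comment.
    span = max(timestamps) - min(timestamps)
    kind = "isolated" if count <= isolated_max else "sustained"
    return (kind, count, span)
```

The count and span are returned alongside the label so that they can be quoted in the finding. This is a rough aid only: a handful of errors spread over several hours may still deserve a closer look.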
4. Document your finding in the issue
Add a comment to the alert issue. Keep it short:
- What the error was.
- Whether it appears transient or recurring.
- Any relevant log excerpt.
Example from contributors-gitlab-com#552:
cause was what looks like a temporary db connection issue
PG::ConnectionBad: connection to server at "127.0.0.1", port 5432 failed: FATAL: Cloud SQL IAM service account authentication failed
didn’t recur, so closing
Alert issues are confidential (created through the service desk)
Keep all sensitive details, including raw log output, stack traces, and internal infrastructure data, inside this confidential issue. Do not copy them verbatim into public issues or MRs. See Act on the finding for sanitization rules.
5. Act on the finding
Choose one of the following paths based on what you found.
Transient, no action needed
The error did not recur and has no impact. Close the alert issue with a short comment explaining the cause.
Needs a fix, low urgency
Create a public issue to track the fix. Include only a sanitized description:
- Describe the class of error (for example, “IAM authentication failure”) without raw log output, user identifiers, or stack traces that could leak internal infrastructure details.
- Link the public issue back to the confidential alert issue for traceability.
- Apply the standard labels: ~"Contributor Success" and the appropriate ~type:: and ~workflow:: labels.
- Link the public issue to the observability umbrella work item #308 if it is related to a recurring pattern.
- Close the alert issue, referencing the new public issue.
Needs a fix, high urgency
Open an MR directly. Apply the same sanitization rules to the MR description: no raw log output or sensitive data. Record the MR link in a comment on the confidential alert issue rather than referencing the alert issue from the MR description.
Unclear or needs a second opinion
Leave the alert issue open, add your findings as a comment, and ping someone from the team.
Data sanitization rules
Alert issues are confidential. Any downstream artifact (public issue, MR, work item comment) must not contain:
- Raw log output with stack traces or internal hostnames.
- User identifiers, email addresses, or account IDs from log entries.
- Internal service account names or IAM role names.
- Connection strings or environment-specific configuration values.
Describe the problem in terms of behavior and impact, not raw infrastructure detail.
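As a first pass before writing a public description, a rough redaction helper along these lines may catch the most obvious leaks. It is a sketch only: the patterns (connection URIs, email addresses, IPv4 addresses) are assumptions chosen for illustration, they do not cover hostnames, service account names, or IAM role names, and the function name `redact` is hypothetical. It does not replace reading the text yourself and rewriting it in terms of behavior and impact.

```python
import re

# Order matters: connection URIs are replaced first because they can
# embed credentials and hosts that the later patterns would only partly match.
REDACTION_PATTERNS = [
    (re.compile(r"\b[a-z][a-z0-9+.-]*://\S+"), "[REDACTED_URI]"),
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[REDACTED_EMAIL]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[REDACTED_IP]"),
]


def redact(text: str) -> str:
    """Replace obvious URIs, email addresses, and IPv4 addresses with placeholders."""
    for pattern, placeholder in REDACTION_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


if __name__ == "__main__":
    sample = 'connection to server at "127.0.0.1", port 5432 failed'
    print(redact(sample))
```

Even after redaction, the preferred output for a public issue is a short summary such as "IAM authentication failure", not a cleaned-up log line.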
