Incident Follow Up Issues
This page documents the process for managing follow-up items from incidents.
Terminology
- Follow Up Item: Action items, bugs, or improvements identified during an incident that need to be addressed after the incident is resolved.
- Incident Lead: The person responsible for coordinating the incident response and ensuring proper follow-up after resolution.
- Corrective Action: A follow-up item aimed at preventing similar incidents in the future (synonymous with InfraDev for labeling purposes).
- InfraDev: Infrastructure development issues that affect the production platform.
Default Issue Locations
We use different default projects for incident follow-up issues based on privacy requirements and team structure:
GitLab.com Team
- Default Location:
gitlab-com/gl-infra/incident-follow-ups
- This project serves as the initial triage point for all follow-up items from GitLab.com incidents
Dedicated Team
- Default Location:
gitlab-com/gl-infra/dedicated-follow-ups
- This private project preserves confidentiality for customer-specific incidents
Incident Lead Responsibilities
Incident Leads have the following responsibilities for managing follow-up items:
-
During the Incident:
- Ensure all follow-up items are properly documented
- Assign initial owners for each follow-up item where possible
-
After the Incident:
- Review all automatically created follow-up issues within 1 business day
- Verify that each issue has appropriate context from the incident
- Ensure each issue is moved to the correct project for the team that will address it
- Apply appropriate labels including team and group labels
- Assign the issue to the correct team member or manager for triage
Issue Triage Process
For GitLab.com Incidents
-
Follow-up items are automatically created in
gitlab-com/gl-infra/incident-follow-ups
-
To attach an existing GitLab issue to an incident as a follow up instead of creating a new one, paste the issue’s URL into the slack channel
-
The Incident Lead reviews each issue and:
- For infrastructure issues: Apply appropriate infrastructure team labels
- For application issues: Move to
gitlab-org/gitlab
using the/move
command - For service-specific issues: Move to the appropriate service project (e.g.,
gitlab-org/gitaly
for Gitaly issues)
-
Apply appropriate group/team labels:
- Engineering team issues should include the group label (e.g.
group::database
) - Infrastructure team issues should include the team label (e.g.
team::database
)
- Engineering team issues should include the group label (e.g.
-
When moving an issue, ensure the
infradev
label is maintained on the issue
For Dedicated Incidents
-
Follow-up items are automatically created in
gitlab-com/gl-infra/dedicated-follow-ups
-
To attach an existing GitLab issue to an incident as a follow up instead of creating a new one, paste the issue’s URL into the slack channel
-
The Incident Lead reviews each issue and:
- Applies appropriate team labels
- Ensures the issue has a clear owner
- Maintains confidentiality by keeping customer-specific information in the private repository
-
Only move issues if the Dedicated team requires a different location (for example, if it’s an issue that needs to be addressed by a team outside of the Dedicated organization)
Tracking and Visibility
To ensure no follow-up items fall through the cracks:
- Weekly Triage: The Production Engineering team performs a weekly triage of all open issues in the incident follow-up repositories to identify any untriaged items
- SaaS Availability Reviews: The status of critical incident follow-up items is reviewed in SaaS Availability meetings
- Engineering Teams: Engineering directors and managers are expected to regularly triage the
infradev
issues assigned to their teams
Labels
The following labels are automatically applied to all incident follow-up issues in both repositories:
infradev
: Applied to all follow-up issuescorrective action
: Applied to all follow-up issues
Additional labels that should be manually applied:
- Team labels (e.g.,
team::database
,team::observability
) - Group labels (e.g.,
group::database
,group::analytics
)
Template and Issue Format
Follow-up issues contain:
- A link to the original incident
- A link to the Post Incident Review (PIR) if one has been created
- Context from the incident
- A description of the required action
- Due date (if specified)
Incident Leads should ensure each issue contains:
- Clear acceptance criteria
- Sufficient context for someone unfamiliar with the incident to understand the issue
- Links to relevant logs, metrics, or other evidence from the incident
References
- Engineering Workflow - InfraDev
- Product Processes - InfraDev
- Incident Management - Corrective Actions
df1d366c
)