Incident Response Lifecycle

The Incident Response Lifecycle working group is intended to document a shared incident response protocol and knowledge base.

Attributes

Property	Value
Date Created	February 01, 2022
Target End Date	April 30, 2022
Slack	wg-incident-respose-management-framework
Google Doc	Incident Response Management Working Group (internal)
Issue Label	WG-IRM (gitlab-com/-org)

Business Goal

Increase efficiency through common incident response, analysis, documentation, ongoing management and reporting methods.
Increase transparency by improving the visibility and communication of incidents to the business and e-group
Support results by building our clients’ confidence in GitLab’s ability to quickly resolve and communicate incidents when they occur
Align Incident Management activities and priorities with those of the business
Prepare materials for the creation of training modules for the Engineering Department on the Incident Management Process at GitLab
Highlight dogfooding opportunities

Exit Criteria

Single source of truth documenting incident response management that will be applicable to all areas of Engineering and teams who provide Incident Response
- Each functional area of Engineering will develop its own incident management requirements for identifying and reacting to service outages or security threats.
Create a comprehensive knowledge base for GitLab team members to help them understand how incident response teams implement the IR process

Outcome

Help teams across GitLab lower MTTR

Other Investigations

Improvements in feedback and learnings from incidents to build resilience
Service Catalog

What do other companies do?

PagerDuty Response docs

How is IR done today?

SIRT
On-call
Reliability
- Incident Management
Support
- How to Perform CMOC Duties
- Contacting Customers
- Sending Notices (small number of users)

Noted issues

Roles and Responsibilities

Working Group Role	Person	Title
Facilitator	Anna Liisa Moter	Manager Reliability
Exec Sponsor	Steve Loyd	VP Infrastructure
Member	Anthony Fappiano	Manager Reliability
Development Functional Lead	Dan Croft	Senior Engineering Manager, Ops
Member	Sam Goldstein	Director of Engineering, Ops
Member (CMOC)	Kenneth Chu	Support team
Member	Kevin Chu	Group Manager of Product, Monitor

Requirements and Considerations

Actors

Reliability Engineers
SIRT Engineers
Development Team
Quality Team
Support Team

General

As a GitLab team member who can raise an incident, I know how an incident can be initiated
As a GitLab team member who can raise an incident, I have a general understanding about incident severity levels
As a GitLab team member who can raise an incident, I understand the high level process of Incident Management and its importance to the business
As a GitLab team member who can raise an incident, I can contact the right team via a dedicated Slack channel
As a GitLab team member who can raise an incident, I can easily find a page in the handbook that documents the Incident Response Procedures

SIRT Engineer

As a SIRT Engineer, I know how to pull relevant resources from other teams when I need assistance
As a SIRT Engineer, I can easily categorize the incident
As a SIRT Engineer, I can identify triggers and indicators
As a SIRT Engineer, I know where to document the incident details
As a SIRT Engineer, I know when to transition from incident identification to mitigation, to remediation, and to post-incident activities
As a SIRT Engineer, I can follow a reporting process to hand off incidents or provide updates to Management

Reliability Engineers

As a Reliability Engineer, I know how to level an incident in a manner that is consistent across the org
As a Reliability Engineer, I know how to engage the other roles during an incident
As a Reliability Engineer, I know when to transition from incident identification to mitigation, to resolution, and to post-incident activities

Development Team

As a leader in Development who is part of the Incident Manager rotation, I am clear on the role’s responsibilities and how the role supports the Incident Management process.

Quality Teams

Support Team

As a Support Engineer, I know how to create a status page
As a Support Engineer, I know the differences between the Incident Status states on the status page
As a Support Engineer, I know how frequently to update the status page
As a Support Engineer, I know how to engage the Incident Manager or EOC when asking for feedback on an update I am about to post on the status page
As a Support Engineer, I know how to notify the stakeholders
As a Support Engineer, I know how to find related tickets in Zendesk and the GitLab issue tracker to help assess the impact of an incident
As a Support Engineer, I know how to contact users if their usage of GitLab SaaS was restricted due to an incident

Last modified June 3, 2025: Fix broken links (d7547623)
