Incident Response Lifecycle
The Incident Response Lifecycle working group is intended to document a shared incident response protocol and knowledge base.
|Date Created||February 01, 2022|
|Target End Date||April 30, 2022|
|Google Doc||Incident Response Management Working Group (internal)|
|Issue Label||WG-IRM (gitlab-com/-org)|
- Increase efficiency through common incident response, analysis, documentation, ongoing management and reporting methods.
- Increase transparency by improving the visibility and communication of incidents to the business and e-group
- Support results by building our clients’ confidence in GitLab’s ability to quickly resolve and communicate incidents when they occur
- Align Incident Management activities and priorities with those of the business
- Prepare materials for the creation of training modules for the Engineering Department on the Incident Management process at GitLab
- Highlight dogfooding opportunities
- Single source of truth documenting incident response management, applicable to all areas of Engineering and to teams who provide Incident Response
- Each functional area of Engineering will develop its own Incident Management requirements for identifying and reacting to service outages or security threats.
- Create a comprehensive knowledge base for GitLab team members to help them understand how incident response teams implement the IR process
- Help teams across GitLab lower MTTR
- Improve feedback and learnings from incidents to build resilience
- Service Catalog
What do other companies do?
How is IR done today?
Roles and Responsibilities
|Working Group Role||Person||Title|
|Facilitator||Anna Liisa Moter||Manager Reliability|
|Exec Sponsor||Steve Loyd||VP Infrastructure|
|Member||Anthony Fappiano||Manager Reliability|
|Development Functional Lead||Dan Croft||Senior Engineering Manager, Ops|
|Member||Sam Goldstein||Director of Engineering, Ops|
|Member (CMOC)||Kenneth Chu||Support team|
|Member||Kevin Chu||Group Manager of Product, Monitor|
Requirements and Considerations
- Reliability Engineers
- SIRT Engineers
- Development Team
- Quality Team
- Support Team
- As a GitLab team member who can raise an incident, I know how an incident can be initiated
- As a GitLab team member who can raise an incident, I have a general understanding about incident severity levels
- As a GitLab team member who can raise an incident, I understand the high level process of Incident Management and its importance to the business
- As a GitLab team member who can raise an incident, I can contact the right team via a dedicated Slack channel
- As a GitLab team member who can raise an incident, I can easily find a page in the handbook that documents the Incident Response Procedures
- As a SIRT Engineer I know how to pull relevant resources from other teams when I need assistance
- As a SIRT Engineer I can easily categorize the incident
- As a SIRT Engineer I can identify triggers and indicators
- As a SIRT Engineer I know where to document the incident details
- As a SIRT Engineer I know when to transition from incident identification to mitigation, to remediation, and to post-incident activities
- As a SIRT Engineer I can follow a reporting process to hand off incidents or provide updates to management
- As a Reliability Engineer, I know how to level an incident in a manner that is consistent across the org
- As a Reliability Engineer, I know how to engage the other roles during an incident
- As a Reliability Engineer, I know when to transition from incident identification to mitigation, to resolution, and to post-incident activities
- As a leader in Development who is part of the Incident Manager rotation, I am clear on the role’s responsibilities and how the role supports the Incident Management process.
- As a Support Engineer, I know how to create a status page
- As a Support Engineer, I know the differences between the Incident Status states on the status page
- As a Support Engineer, I know how frequently to update the status page
- As a Support Engineer, I know how to engage the Incident Manager or EOC when asking for feedback on an update I am about to post on the status page
- As a Support Engineer, I know how to notify the stakeholders
- As a Support Engineer, I know how to find related tickets in Zendesk and the GitLab issue tracker to help assess the impact of an incident
- As a Support Engineer, I know how to contact users if their usage of GitLab SaaS was restricted due to an incident
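
The phase transitions named in the SIRT and Reliability Engineer stories above (identification, mitigation, remediation or resolution, post-incident activities) can be pictured as a simple linear state machine. The following is a minimal illustrative sketch in Python, not an existing GitLab tool: the `Phase` enum, the `NEXT_PHASE` table, and the `advance` helper are hypothetical names, and the strictly linear ordering is an assumption made for the example.

```python
from enum import Enum


class Phase(Enum):
    """Incident phases as named in the user stories (hypothetical model)."""
    IDENTIFICATION = "identification"
    MITIGATION = "mitigation"
    REMEDIATION = "remediation"  # the Reliability stories call this "resolution"
    POST_INCIDENT = "post-incident"


# Assumption: phases are visited strictly in order, with no skipping or rollback.
NEXT_PHASE = {
    Phase.IDENTIFICATION: Phase.MITIGATION,
    Phase.MITIGATION: Phase.REMEDIATION,
    Phase.REMEDIATION: Phase.POST_INCIDENT,
}


def advance(current: Phase) -> Phase:
    """Return the phase that follows `current`; raise if it is the last phase."""
    if current not in NEXT_PHASE:
        raise ValueError(f"{current.value} is the final phase")
    return NEXT_PHASE[current]
```

A real process may well allow moving back a phase (for example, when a mitigation fails), which this sketch deliberately leaves out.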