Incident Roles - Incident Responder
- As an Incident Responder, your highest priority for the duration of your shift is the stability of GitLab.com.
- When the cause of a degradation or outage is uncertain, the first action of the Incident Responder is to evaluate whether any recent changes can be reverted. It is always appropriate to toggle any recently changed application feature flags back to their previous state, without asking for permission and without hesitation (a hedged sketch of such a toggle appears after this list). The next step is to review Change Requests and validate the eligibility criteria for application rollbacks.
- The SSOT for who is the current EOC is the GitLab Production service definition in PagerDuty (a sketch of looking this up via the PagerDuty API appears after this list).
- SREs are responsible for arranging coverage if they will be unavailable for a scheduled shift. To make a request, send a message indicating the days and times for which coverage is requested to the #eoc-general Slack channel. If you are unable to find coverage, reach out to the EOC coordinator for assistance.
- Alerts that are routed to PagerDuty require acknowledgment within 15 minutes; otherwise, they will be escalated to the on-call Incident Manager (a sketch of acknowledging an incident via the PagerDuty API appears after this list).
- Alerts that page PagerDuty will automatically create a triage incident in #incidents-dotcom-triage.
  - If it is determined to be a true incident, accept the triage incident by joining the channel and choosing “Accept it”.
  - If there are multiple pages/triage incidents created for the same incident, merge them into the primary incident. However, resolving the incident takes precedence; it is fine if a related triage incident auto-closes instead of being merged while you are working an incident.
  - The triage incident will be automatically declined if no action is taken and the generating alert clears.
- Alertmanager alerts in #alerts and #feed_alerts-general are an important source of information about the health of the environment and should be monitored during working hours (a sketch of listing firing alerts appears after this list).
- If the PagerDuty alert noise is too high, your task as EOC is to clear out that noise by either fixing the system or changing the alert.
  - If you change an alert, it is your responsibility to explain the reasons behind the change and inform the next EOC that it occurred.
- Each event (which may include multiple related pages) should result in an issue in the production tracker. See production queue usage for more details.
- If sources outside of our alerting are reporting a problem, and you have not received any alerts, it is still your responsibility to investigate. Declare a low severity incident and investigate from there.
- Low severity (S3/S4) incidents (and issues) are cheap, and they give others a means to communicate their experience if they are also affected by the issue.
- “No alerts” is not the same as “no problem”
- GitLab.com is a complex system. It is ok to not fully understand the underlying issue or its causes. However, if this is the case, as Incident Responder you should page the IMOC to find a team member with the appropriate expertise. Requesting assistance does not mean relinquishing your responsibility.
- As soon as an S1/S2 incident is declared, join the Zoom room for the incident. The Zoom link is in the bookmarks of the relevant incident channel.
- GitLab works in an asynchronous manner, but incidents require a synchronous response. Our collective goal is high availability of 99.95% and beyond, which means that the timescale over which communication needs to occur during an incident is measured in seconds and minutes, not hours.
- Keep in mind that a GitLab.com incident is not an “infrastructure problem”. It is a company-wide issue, and as Incident Responder, you are leading the response on behalf of the company.
- If you need information or assistance, engage with Engineering teams. If you do not get the response you require within a reasonable period, escalate through the IMOC.
- As Incident Responder, require those who may be able to assist to join the Zoom call, and ensure that they post their findings in Slack and pin (📌) the message to the incident timeline.
- By acknowledging an incident in PagerDuty, you are implying that you are working on it. To further reinforce this acknowledgement, post a note in Slack that you are joining the incident Zoom as soon as possible.
- Be inquisitive. Be vigilant. If you notice that something doesn’t seem right, investigate further.
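
As a rough illustration of the feature flag toggle described earlier, the sketch below sets an instance-level feature flag back to a previous value through GitLab's Features API. The base URL, token variable, and flag name are placeholder assumptions, and in practice the EOC would typically use the established ChatOps tooling rather than call the API directly.

```python
# Minimal sketch: revert a recently changed instance feature flag via the
# GitLab Features API (requires an admin token). The URL, token variable, and
# flag name below are placeholders, not real production values.
import os

import requests

GITLAB_URL = os.environ.get("GITLAB_URL", "https://gitlab.example.com")
TOKEN = os.environ["GITLAB_ADMIN_TOKEN"]  # assumed to be set in the environment


def set_feature_flag(name: str, value: str) -> dict:
    """Set an instance feature flag to "true" or "false"."""
    response = requests.post(
        f"{GITLAB_URL}/api/v4/features/{name}",
        headers={"PRIVATE-TOKEN": TOKEN},
        data={"value": value},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    # Example: toggle a hypothetical flag back to its previous (disabled) state.
    print(set_feature_flag("some_recently_changed_flag", "false"))
```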
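
To illustrate the PagerDuty SSOT point, the sketch below asks the PagerDuty REST API who is currently on call at escalation level 1 for a given escalation policy. The escalation policy ID and token are placeholders; in practice you would use the policy attached to the GitLab Production service.

```python
# Minimal sketch: look up the current EOC via the PagerDuty /oncalls endpoint.
# The escalation policy ID and API token are placeholders.
import os

import requests

PAGERDUTY_API = "https://api.pagerduty.com"
TOKEN = os.environ["PAGERDUTY_API_TOKEN"]  # assumed read-only REST API token
ESCALATION_POLICY_ID = "PXXXXXX"  # placeholder for the Production service's policy

response = requests.get(
    f"{PAGERDUTY_API}/oncalls",
    headers={
        "Authorization": f"Token token={TOKEN}",
        "Accept": "application/vnd.pagerduty+json;version=2",
    },
    params={"escalation_policy_ids[]": ESCALATION_POLICY_ID},
    timeout=10,
)
response.raise_for_status()

for oncall in response.json()["oncalls"]:
    # Escalation level 1 is the person paged first, i.e. the EOC.
    if oncall["escalation_level"] == 1:
        print(oncall["user"]["summary"], oncall["start"], oncall["end"])
```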
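
The acknowledgment requirement is normally satisfied from the PagerDuty app or web UI; purely as an illustration, the sketch below acknowledges an incident through the REST API. The incident ID, token, and responder email are placeholders, and the request body shape is based on PagerDuty's documented incident update endpoint.

```python
# Minimal sketch: acknowledge a PagerDuty incident via the REST API.
# The incident ID, token, and From email are placeholders.
import os

import requests

TOKEN = os.environ["PAGERDUTY_API_TOKEN"]
INCIDENT_ID = "QXXXXXX"  # placeholder incident ID from the page
RESPONDER_EMAIL = "eoc@example.com"  # placeholder; must be a valid PagerDuty user

response = requests.put(
    f"https://api.pagerduty.com/incidents/{INCIDENT_ID}",
    headers={
        "Authorization": f"Token token={TOKEN}",
        "Accept": "application/vnd.pagerduty+json;version=2",
        "Content-Type": "application/json",
        "From": RESPONDER_EMAIL,  # PagerDuty requires the acting user's email
    },
    json={"incident": {"type": "incident_reference", "status": "acknowledged"}},
    timeout=10,
)
response.raise_for_status()
print(response.json()["incident"]["status"])  # expected: "acknowledged"
```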
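
For the Alertmanager monitoring and alert-noise points, the sketch below lists currently firing alerts from an Alertmanager instance and counts them by alert name, which gives a quick view of where the noise is coming from. The Alertmanager URL is a placeholder, not GitLab's actual endpoint.

```python
# Minimal sketch: summarize currently firing alerts from Alertmanager's v2 API.
# The Alertmanager URL is a placeholder.
from collections import Counter

import requests

ALERTMANAGER_URL = "https://alertmanager.example.com"

response = requests.get(
    f"{ALERTMANAGER_URL}/api/v2/alerts",
    params={"active": "true", "silenced": "false", "inhibited": "false"},
    timeout=10,
)
response.raise_for_status()

# Count firing alerts by alertname to see which alerts are the noisiest.
counts = Counter(
    alert["labels"].get("alertname", "unknown") for alert in response.json()
)
for name, count in counts.most_common(10):
    print(f"{count:4d}  {name}")
```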
