Tier 2 On-Call
The Tier-2 SME On-Call program enhances incident response by establishing a second tier of specialized support. Subject Matter Experts (SMEs) provide domain-specific knowledge to help resolve complex incidents faster, improve MTTR (Mean Time To Recover), and increase ownership and accountability for service reliability.
Pilot Program
Tier 2 was introduced at GitLab in 2025. The target for the program is to provide 24/7/365 coverage, but in practice many teams are not set up for this coverage.
The Pilot Program aims to onboard subject matter experts to cover ordinary working hours, also referred to as 24x5 coverage. Because 90% of S1 and S2 incidents take place during ordinary working hours, the Pilot was viewed as an acceptable first iteration towards full coverage.
The Pilot Program was cleared with Legal, HR, and the German, Dutch and French Works Councils.
Before promoting a rotation on the Pilot Program to 24x7 coverage, sign-off is required. If you are considering promoting your rotation, please ask in the Slack channel named #tier2-sme-rollout for more details.
Onboarding
To initiate onboarding of a new Tier 2 team, follow the guidelines in Tier 2 Oncall Onboarding.
Expectations for rotation owners
- Maintain a 24/7/365 rotation with reliable coverage across time zones (24x5 for pilot rotations)
- Pay special attention to low coverage periods like end-of-year holidays
- Maintain an escalation path to handle unacknowledged pages
- Help define and maintain:
  - Escalation rules for your domain
  - Onboarding material to help team members join this rotation
  - Documentation and runbooks
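Rotation owners are expected to maintain an escalation path for unacknowledged pages. As a minimal sketch, assuming each level of the path gets the same 15-minute acknowledgement window that rotation members are held to, the Python below shows how an unacknowledged page moves down an ordered chain of responders. `EscalationLevel`, `level_to_notify`, and the level names are hypothetical; the escalation policy configured in your paging tool is what actually controls this behavior.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Assumption: every level uses the 15-minute acknowledgement window that
# rotation members are expected to meet. Adjust per level if your paging
# tool's escalation policy uses different timeouts.
DEFAULT_ACK_TIMEOUT_MINUTES = 15


@dataclass
class EscalationLevel:
    """One hypothetical step in an escalation path."""
    name: str
    ack_timeout_minutes: int = DEFAULT_ACK_TIMEOUT_MINUTES


def level_to_notify(
    path: Sequence[EscalationLevel], minutes_unacknowledged: float
) -> Optional[EscalationLevel]:
    """Return the level that should hold a page that has gone unacknowledged
    for the given number of minutes, or None once every level's window has
    elapsed (the page has exhausted the path)."""
    window_end = 0.0
    for level in path:
        # Each level's window starts when the previous level's window ends.
        window_end += level.ack_timeout_minutes
        if minutes_unacknowledged < window_end:
            return level
    return None


if __name__ == "__main__":
    path = [
        EscalationLevel("primary SME"),
        EscalationLevel("secondary SME"),
        EscalationLevel("rotation owner"),
    ]
    for minutes in (5, 20, 44, 50):
        level = level_to_notify(path, minutes)
        print(minutes, level.name if level else "path exhausted")
```

The final level returning `None` is the case a rotation owner most needs to plan for: if a page can run off the end of the path, the path is too short for periods of low coverage.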
Expectations for rotation members
- Complete all mandatory onboarding training materials
- Assist with complex, domain-specific incidents that Tier-1 cannot resolve independently
- Acknowledge pages within 15 minutes
- If you feel that you are allocated too many shifts, it is your responsibility to raise this with the rotation owner
- If you will be out of office for any reason (leave, holidays, etc.), you must arrange coverage for your on-call shifts.
- If you will be on extended leave, you need to contact your rotation owner to adjust the on-call schedule
Practical aspects of being on-call
- You don’t need to install anything specific on your phone. The paging system can be set to notify you by email, phone call, SMS, or in-app notification.
- You can take breaks during your shift. You need to ensure that you are able to acknowledge a page within 15 minutes and be back at your desk shortly thereafter.
- If you are assigned a shift and you can no longer take it, it is your responsibility to find cover. Post in your team’s Slack channel and notify your rotation owner that you are looking for cover.
- If a personal emergency happens during your shift, tell your manager that you are on call for this rotation when you let them know you will be away. The manager will take responsibility for finding you cover.
When does Tier 2 get paged?
Tier 1 EOC or IM requests
Escalation Criteria
The Tier-1 Engineer On-Call (EOC) will perform initial triage and use available documentation before escalating to Tier-2 SMEs. Pages may also be initiated by the Incident Manager (IM) supporting the incident.
Before Escalating to Tier-2
Tier-1 must:
- Follow all recommendations in runbooks and playbooks for the affected area
- Document attempted solutions and outcomes in the incident issue
Resource Locations
By Severity Level
- S1/S2 Incidents: Escalate when the Tier-1 team cannot resolve them independently using runbooks, documentation, or other sources. Due to their critical nature, Tier-2 SMEs should expect to be paged for these incidents when domain-specific expertise is needed.
- S3/S4 Incidents: These typically do not require escalation to Tier-2 SMEs during weekends. However, Tier-1 may escalate S3/S4 incidents in specific circumstances:
  - When the customer impact is unclear and requires domain expertise to assess
  - When there’s uncertainty about whether the issue might develop into a higher severity incident
  - When multiple lower-severity incidents together create a potentially broader impact
- If Tier-1 needs help determining whether errors or unusual behavior in a service will affect customers, they may consult with Tier-2 SMEs.
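The severity rules above can be read as a simple decision procedure. The Python below is a minimal, hypothetical sketch of that reading, not an official tool: `Severity`, `should_page_tier2`, and its boolean inputs are illustrative names, each boolean stands for a judgment the Tier-1 EOC makes, and the sketch deliberately leaves out pages initiated directly by the Incident Manager and the weekend nuance for S3/S4 incidents.

```python
from enum import Enum


class Severity(Enum):
    S1 = 1
    S2 = 2
    S3 = 3
    S4 = 4


def should_page_tier2(
    severity: Severity,
    runbooks_followed: bool,
    tier1_can_resolve: bool,
    customer_impact_unclear: bool = False,
    may_become_higher_severity: bool = False,
    combined_broader_impact: bool = False,
) -> bool:
    """Hypothetical encoding of the Tier-2 escalation criteria.

    Does not model the weekend/weekday distinction or a direct request
    from the Incident Manager.
    """
    # Tier-1 must work through runbooks and playbooks first.
    if not runbooks_followed:
        return False
    if severity in (Severity.S1, Severity.S2):
        # Page Tier-2 when Tier-1 cannot resolve the incident on its own.
        return not tier1_can_resolve
    # S3/S4: escalate only in one of the specific circumstances listed.
    return (
        customer_impact_unclear
        or may_become_higher_severity
        or combined_broader_impact
    )
```

The judgment behind each flag matters far more than the boolean logic; the sketch only makes the order of the checks explicit, with the runbook requirement applied before any severity-based rule.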