On-Call
Expectations for On-Call
- If you are on call, then you are expected to be available and ready to respond to PagerDuty pages or incident.io Escalations as soon as possible, and within any response times set by our Service Level Agreements in the case of Customer Emergencies. If you have plans outside of your workspace during your on-call shift, this may require that you bring a laptop and reliable internet connection with you.
- We take on-call seriously. There are escalation policies in place so that if a first responder does not respond in time, another team member is alerted. Such policies are not expected to be triggered under normal operations, and are intended to cover extreme and unforeseeable circumstances.
- Because GitLab is an asynchronous workflow company, @mentions of On-Call individuals in Slack will be treated like normal messages, and no SLA for response will be associated with them.
- Provide support to the release managers in the release process.
- As noted in the main handbook, after being on-call, make sure that you take time off if you need to. Being available for issues and outages can be taxing, even if you had no pages. Resting after your on-call shift is critical for preventing burnout. Be sure to tell your team when you plan to take that time off.
- The expectation is that you take 1-2 days off as “time off in lieu” if you need to recover from your shift.
- Team members in Australia should review the Australia time in lieu policy.
- During on-call duties, it is the team member’s responsibility to act in compliance with local rules and regulations. If ever in doubt, please reach out to your manager and/or aligned People Business Partner.
Customer Emergency On-Call Rotation
- We do 7 days of 8-hour shifts in a follow-the-sun style, based on your location.
- After 10 minutes, if the alert has not been acknowledged, the support manager on call will be alerted. After a further 5 minutes, senior support leadership from all 3 regions will be alerted.
- All tickets that are raised as emergencies will receive the emergency SLA. The on-call engineer’s first action will be to triage the emergency request and work with the customer to find the best path forward.
- After 30 minutes, if the customer has not responded to our initial contact with them, let them know that the emergency ticket will be closed and that you are opening a normal priority ticket on their behalf. Also let them know that they are welcome to open a new emergency ticket if necessary.
- You can view the schedule and the escalation policy on PagerDuty. You can also opt to subscribe to your on-call schedule, which is updated daily.
- After each shift, if there was an alert or incident, the on-call person will send a handoff email to the next on-call person explaining what happened and what is still ongoing, pointing to the relevant issues and their progress.
- If you need to reach the current on-call engineer and they’re not accessible on Slack (e.g. it’s a weekend, or the end of a shift), you can manually trigger a PagerDuty incident to get their attention, selecting Customer Support as the Impacted Service and assigning it to the relevant Support Engineer.
- See the GitLab Support On-Call Guide for a more comprehensive guide to handling customer emergencies.
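The escalation timing described in this rotation can be sketched as a small calculation. This is only an illustration of the stated delays (10 minutes to the support manager on call, a further 5 minutes to senior support leadership); the function name `alerted_parties`, its parameters, and the choice to treat each threshold as inclusive are assumptions, and the real routing is configured in the paging tool rather than in code like this.

```python
from datetime import datetime, timedelta
from typing import List, Optional

# Delays taken from the rotation description above.
MANAGER_DELAY = timedelta(minutes=10)
LEADERSHIP_DELAY = MANAGER_DELAY + timedelta(minutes=5)  # "a further 5 minutes"


def alerted_parties(
    triggered_at: datetime,
    now: datetime,
    acknowledged_at: Optional[datetime] = None,
) -> List[str]:
    """Return everyone alerted so far for one emergency page (hypothetical helper)."""
    # Escalation stops at acknowledgement; otherwise it runs until `now`.
    cutoff = min(now, acknowledged_at) if acknowledged_at is not None else now
    unacknowledged_for = cutoff - triggered_at

    parties = ["on-call engineer"]
    if unacknowledged_for >= MANAGER_DELAY:  # inclusive boundary is an assumption
        parties.append("support manager on call")
    if unacknowledged_for >= LEADERSHIP_DELAY:
        parties.append("senior support leadership (all 3 regions)")
    return parties


page_time = datetime(2024, 1, 1, 9, 0)  # illustrative timestamp
print(alerted_parties(page_time, page_time + timedelta(minutes=12)))
```

An acknowledgement within the first 10 minutes keeps the list at the on-call engineer alone, which is the normal case the escalation policy is designed around.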
GitLab.com On-Call Rotations
Tier 1 Rotations
Tier 1 rotations include SRE EOC for GitLab.com and the Incident Manager rotations.
More details can be found in the Tier 1 section of the on-call handbook pages.
Tier 2 Subject-Matter-Expert (SME) On-Call
This on-call layer contains many different rotations for specialist areas of the product.
More details can be found in the Tier 2 section of the on-call handbook pages.
Security Team On-Call Rotation
Security Operations (SecOps)
- SecOps on-call rotation is 7 days of 24-hour shifts.
- After 15 minutes, if the alert has not been acknowledged, the Security Manager on-call is alerted.
- You can view the Security Operations schedule on PagerDuty.
- When on-call, prioritize work that will make the on-call better (that includes building projects, systems, adding metrics, removing noisy alerts). Much like the Production team, we strive to have nothing to do when being on-call, and to have meaningful alerts and pages. The only way of achieving this is by investing time in trying to automate ourselves out of a job.
- The main expectation when on-call is triaging the urgency of a page - if the security of GitLab is at risk, do your best to understand the issue and coordinate an adequate response. If you don’t know what to do, engage the Security manager on-call to help you out.
- More information is available in the Security Operations On-Call Guide and the Security Incident Response Guide.
Security Managers
- Security Manager on-call rotation is 7 days of 12-hour shifts.
- Alerts are sent to the Security Manager on-call if the SecOps on-call page isn’t answered within 15 minutes.
- You can view the Security Manager schedule on PagerDuty.
- The Security Manager on-call is responsible for engaging alternative/backup SecOps Engineers in the event the primary is unavailable.
- In the event of a high-impact security incident to GitLab, the Security Manager on-call will be engaged to assist with cross-team/department coordination.
Developer Experience Stage On-Call Rotation
- Developer Experience’s on-call does not include work outside GitLab’s normal business hours. The process is defined on our pipeline on-call rotation page.
- The rotation is on a weekly basis across 3 timezones (APAC, EMEA, AMER) and triage activities happen during each team member’s working hours.
- This on-call rotation ensures accurate and stable test pipeline results, which directly affect our continuous release process.
- The list of monitored pipelines is defined on our pipeline page.
- The schedule and roster are defined on our schedule page.
incident.io
We use incident.io to set the on-call schedules, and to route notifications to the appropriate individual(s).
Swapping On-Call Duty
Team members covering a shift for someone else are responsible for adding the override in incident.io. This can be arranged in the #eoc-general Slack channel or via the Request Coverage feature of incident.io. They can delegate this task back to the requestor, but only after explicitly confirming they will cover the requested shift(s). To set an override, click the “Create Override” button in the upper right of the page, or click the relevant block of time on the schedule view. This action defaults the person in the override to you — incident.io assumes that you’re the person volunteering an override. If you’re processing this for another team member, you’ll need to select their name from the drop-down list. Also see this article for reference.
Adding and removing people from the roster
When adding a new team member to the on-call roster, it’s inevitable that the rotation schedule will shift. The manager adding a new team member will add the individual towards the end of the current rotation to avoid changing the current schedule, if possible. When adding a new team member to the rotation, the manager will raise the topic to their team(s) to make sure everyone has ample time to review the changes.
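To see why adding someone at the end of the rotation leaves the current schedule intact, consider a minimal sketch. It assumes a simple round-robin with one-week shifts; the names, dates, and the helpers `weekly_schedule` and `add_member` are hypothetical and are not part of incident.io.

```python
from datetime import date, timedelta
from typing import List, Tuple


def weekly_schedule(order: List[str], start: date, weeks: int) -> List[Tuple[date, str]]:
    """Assign one-week shifts round-robin, starting with order[0] on `start`."""
    return [(start + timedelta(weeks=i), order[i % len(order)]) for i in range(weeks)]


def add_member(order: List[str], new_member: str) -> List[str]:
    """Return a new rotation order with `new_member` placed at the end."""
    if new_member in order:
        raise ValueError(f"{new_member} is already on the roster")
    return order + [new_member]


# Hypothetical roster and start date, for illustration only.
roster = ["ana", "bo", "cy"]
start = date(2024, 1, 1)

before = weekly_schedule(roster, start, weeks=3)
after = weekly_schedule(add_member(roster, "dee"), start, weeks=4)

# The shifts already planned for the current cycle keep their owners;
# the new member takes the first slot after that cycle.
print(after[:3] == before, after[3])
```

Inserting the new person anywhere earlier in the order would move every later shift in the current cycle by one week, which is the disruption the policy tries to avoid. From the second cycle onward the schedule shifts either way, hence the advance notice to the team.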
Slack
In order to facilitate informal conversations around the on-call process and quality of life, as well as coordination of shifts and communication of broader announcements, we have the #eoc-general channel.
