DevOps Tier 2 On-Call Rotation Leader
Welcome to the Rotation Leader role for the DevOps Tier 2 On-Call program. As a Rotation Leader, you are responsible for the health, fairness, and effectiveness of your team’s on-call rotation. This guide outlines your core responsibilities and how to execute them.
Your Role and Responsibilities
As a Rotation Leader, you serve as the primary point of contact for your rotation. Your responsibilities include managing the schedule, onboarding new team members, tracking workload distribution, maintaining team health, and ensuring continuous improvement of the program.
Managing the Schedule
Building and Maintaining the Rotation
You are responsible for the overall structure and composition of your rotation:
- Ensure you have a minimum of 6 people in the rotation to provide sustainable coverage without excessive burnout
- Target 8 people for balanced workload and flexibility
- Maximum 12 people before reassessing team structure
- Distribute coverage across regions (APAC, EMEA, AMER) based on team distribution
- Publish schedules at least one month in advance so team members can plan
Setting Coverage Hours
Define clear coverage expectations for your rotation:
-
Determine if coverage is 24x5 (24 hours, Monday-Friday) or work hours only
-
For work hours, recommend aligning with standard business hours while accommodating timezone diversity
-
Suggested regional splits (UTC):
- APAC: 23:00 - 07:00
- EMEA: 07:00 - 15:00
- AMER: 15:00 - 23:00
-
Allow flexibility where teams naturally don’t align to these times, prioritizing coverage while enabling meaningful contributions
Publishing and Managing the Schedule
- Use Incident.io as your single source of truth for all scheduling
- Make schedules visible to your team and update them regularly
- Create overrides when someone needs coverage swapped
- Communicate schedule changes promptly to affected team members
- Track future scheduling 3-6 months in advance when possible
Staffing and Fairness
Rotation Frequency by Region
Your rotation targets specific frequency based on team size and regional coverage needs:
APAC (Asia-Pacific)
- One week every 6 weeks
- Approximately 43 days per year (16.7% of time)
- Higher frequency due to smallest team size
AMER (Americas)
- One week every 8 weeks
- Approximately 33 days per year (12.5% of time)
- Medium frequency proportional to team size
EMEA (Europe/Middle East/Africa)
- One week every 12 weeks
- Approximately 22 days per year (8.3% of time)
- Lower frequency due to largest team size
Ensuring Fair Distribution
Monitor workload distribution to maintain equity:
- Ensure every engineer gets roughly the same number of shifts
- Track how many times each person has been on-call
- Review who handled the most alerts during their shifts
- Identify anyone carrying more than their fair share
- Watch for patterns where someone consistently gets busier weeks
- Cap on-call duty at no more than once every 4 weeks maximum
Quarterly Fairness Reviews
Every quarter, conduct a review of:
- How many times was each person on-call?
- Were shifts fairly distributed or did some get lucky?
- Did anyone burn out or report unsustainable load?
- Did you meet your coverage goals for each region?
- Rebalance if the distribution became unfair
Onboarding New Team Members
Adding Someone to the Rotation
When adding a new person to your rotation:
- Work with the team member and their manager to confirm readiness
- Add them to Incident.io with their contact information (phone number, email)
- Ensure they complete the Tier 2 first shift preparation checklist
- Provide them with access to all necessary tools and documentation
- Schedule their first shift and communicate it clearly
- Be available to support them during their initial shifts
Required Onboarding Resources
Ensure new team members have access to and understand:
- On-Call Process & Policies
- Team-specific runbooks and playbooks
- Incident.io training and setup
- Your team’s escalation criteria
- Critical dashboards and monitoring platforms
- Communication protocols and Slack channels
Managing Public Holidays
Your Responsibilities
It is very difficult for rotation leaders to know every team member’s public holidays across all regions. Clarify expectations:
- It is the team member’s responsibility to find coverage if they are scheduled on a public holiday
- Team members may voluntarily switch their public holiday to a different day according to company policy
- Exception: For The Netherlands, team members must notify you with at least 2 working days notice, and you (not the team member) are responsible for finding cover, as agreed with the Works Council
Handling Holiday Conflicts
When a team member identifies a public holiday conflict:
- Help them swap shifts with a willing colleague, or
- Create an override to reassign that shift to someone else
Communicate holiday schedules clearly to your team in advance.
Tracking Workload and Health Metrics
Monitoring Alert Volume and Pages
Track metrics that indicate rotation health:
- How many times does each engineer get paged per shift?
- Are pages distributed fairly across the team?
- Are there services generating excessive pages (alert fatigue)?
- Are there services generating too few pages (potential coverage gaps)?
Actions based on volume:
- If too high: Work on alert tuning, reduce false alarms, improve runbooks
- If too low: Verify real issues aren’t being missed; check alert coverage
Tracking Incident Response Quality
Review metrics to understand incident response effectiveness:
- Time to Declare: How quickly are incidents recognized and formally declared?
- Time to Fix: How long from declaration to resolution?
- Total Incident Duration: Full impact window including detection, declaration, fix, and verification
- Target: Aim for declaration within 5-10 minutes and fix times of 30 minutes or less
Monitoring Escalation Patterns
- Track which teams are escalating to your specialists and how often
- Ensure escalations are going to the right teams on first try (target 90%+)
- Identify patterns in what gets escalated and why
- Use this data to improve runbooks or training
Burnout Prevention
Conduct quarterly surveys asking:
- “How sustainable is your on-call workload?”
- “Do you feel supported when on-call?”
- “Have you experienced on-call-related stress or fatigue?”
- “Would you recommend this program to new team members?”
Target 80%+ of engineers rating on-call as sustainable. If scores are low, investigate immediately and take corrective action.
Supporting Team Members
Handling Conflicts and Schedule Changes
When team members need to step back from on-call:
- Accommodate legitimate reasons (extended leave, major projects, etc.)
- Offer breaks between rotations, shift swaps, or reduced frequency as needed
- Document temporary changes clearly in Incident.io
- Reintegrate smoothly when the person is ready
Managing Removals
When someone leaves the rotation permanently:
- Start the offboarding process with them and their manager
- Determine their final shifts
- Remove them from the schedule in Incident.io
- Ensure they hand off active responsibilities to their replacement
Addressing Unsustainable Load
If someone is on-call more frequently than intended or handling excessive alerts:
- Investigate why this is happening (staffing, alert volume, etc.)
- Work with your team and leadership to resolve it
- Document the issue and action taken
- Communicate changes to the affected person
Program Improvement and Learning
Documenting and Improving Runbooks
As escalations come in, identify gaps in documentation:
- When Tier 2 engineers escalate, ask what information would have helped them resolve it faster
- Use post-incident retrospectives to identify runbook gaps
- Prioritize creating or updating runbooks for frequent escalation patterns
- Aim for 15-20 core runbooks covering 80% of common incidents
Conducting Effective Retrospectives
For S1/S2 incidents (or significant S3/S4 incidents):
- Ensure 100% of escalated S1/S2 incidents have a formal retrospective or write-up
- Lead retrospectives in a blameless manner, focusing on system improvements
- Document what was learned and what can be improved
- Track action items and follow up on completion
Tracking Runbook Usage
- Monitor whether incidents reference runbooks
- Ask your team which runbooks are most helpful
- Identify runbooks that aren’t being used and update or remove them
- Create new runbooks based on escalation patterns
Communication and Escalation
Supporting Your Team During Incidents
When your team is on-call:
- Be available for questions and escalation decisions
- Help them determine when to escalate beyond Tier 2
- Provide context on customer impact and business priorities
- Debrief after significant incidents
Communicating with Leadership
Keep leadership informed about:
- Rotation health and any sustainability concerns
- Trends in incident volume and resolution times
- Staffing needs (do you need to hire or redistribute?)
- Alert tuning and runbook improvements
- Cultural health of the rotation
Coordinating with Other Teams
Maintain relationships with:
- Infrastructure and platform teams who may receive escalations
- Incident Managers who page your team
- Other rotation leaders to share best practices
- Your team’s managers for support with team members
When to Escalate to Leadership
Bring issues to your manager or leadership when:
- Someone is unsustainably on-call more than once every 4 weeks
- Burnout or turnover is increasing
- Alert volume is unmanageable
- Staffing levels are insufficient
- The rotation structure isn’t working
- Cultural issues emerge (blaming, lack of escalation, etc.)
Success Indicators
You’ll know your rotation is healthy when:
- Engineers don’t burn out from on-call
- Shifts are fairly distributed across regions and team members
- Incident response times are improving (TTDec and TTFix trending down)
- Retrospectives are blameless and focus on system improvements
- Runbooks are being used and updated regularly
- Escalations are going to the right teams
- New engineers feel supported during their first shifts
- Team members view on-call as manageable and educational
- Cultural health surveys score 80%+ on sustainability
Quick Reference: Key Responsibilities
Schedule Management:
- Maintain 6-12 people in rotation
- Publish schedules 1+ month in advance
- Track fairness quarterly
- Cap individual rotation frequency at once per 4 weeks maximum
Onboarding:
- Add new members to Incident.io
- Provide tool access and documentation
- Support first shifts
- Ensure training completion
Workload Tracking:
- Monitor pages per shift
- Track incident response metrics
- Watch for burnout indicators
- Conduct quarterly fairness reviews
Team Support:
- Help with schedule conflicts and swaps
- Create overrides for absences
- Address unsustainable load
- Support escalations during incidents
Improvement:
- Identify runbook gaps
- Lead blameless retrospectives
- Track metrics and trends
- Share learnings with the team
Related Pages
- Joining and Leaving the Rotation — Manage team member add/removal
- Coverage and Scheduling — Manage and publish schedules
- Time Off and Holidays — Manage holiday coverage for your team
- Measuring Success of the Tier 2 On-Call Program — Track rotation health metrics
- Communication and Culture — Foster blameless culture in your rotation
6f2cba79
)