Measuring Success of the Tier 2 On-Call Program
How do we know if the Tier 2 on-call program is working? We measure it through specific metrics that reflect both operational excellence and engineer well-being. This page explains what we track and why it matters.
Core Success Metrics
Our program focuses on three interconnected metrics that together tell the story of incident response quality:
Time to Declare (TTDec)
What it measures: How quickly do we recognize a problem and formally declare it as an incident?
Why it matters: The faster we declare an incident, the faster we activate our full response machinery. A long time between when something breaks and when we declare it means we’re already losing ground.
What it shows: Good TTDec indicates strong observability, clear ownership, and alert systems that work. Poor TTDec suggests blind spots in our monitoring.
Target: Declare incidents within 5-10 minutes of detection for critical services
How Tier 2 helps: You validate early alert signals, confirm severity, and lead the decision to formally declare incidents promptly.
Time to Fix (TTFix)
What it measures: How long from when we declare an incident to when it’s actually resolved?
Why it matters: This is what customers care about most. Shorter fix times mean less business impact and better reliability.
What it shows: Good TTFix indicates effective troubleshooting, strong runbooks, and skilled on-call engineers. Poor TTFix suggests we need better tools, documentation, or training.
Target: 30 minutes or less for most incidents; under 5 minutes for critical issues
How Tier 2 helps: You own the technical resolution, execute runbooks, coordinate with other teams, and validate the fix through testing.
Total Incident Duration
What it measures: Total elapsed time from when an incident starts to when it’s completely resolved (including verification and monitoring).
Why it matters: This captures the full window of customer impact. It includes detection time, declaration time, fix time, and verification.
What it shows: Trends in incident duration over time reveal whether we’re getting better at preventing and responding to issues.
Target: Reduce overall incident duration by 20-30% within the first year of the program
How Tier 2 helps: You coordinate the response, update status milestones in real time, and ensure proper verification before declaring resolution.
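To make these definitions concrete, here is a minimal sketch of how the three metrics could be computed from an incident’s timestamps. The record layout and field names are hypothetical; substitute whatever your incident tooling actually exports.

```python
from datetime import datetime, timedelta

# Hypothetical incident record; real field names depend on your incident tooling.
incident = {
    "detected_at": datetime(2024, 3, 1, 10, 0),   # alert fired / problem first observed
    "declared_at": datetime(2024, 3, 1, 10, 7),   # incident formally declared
    "resolved_at": datetime(2024, 3, 1, 10, 34),  # fix applied and confirmed
    "verified_at": datetime(2024, 3, 1, 10, 50),  # post-fix verification complete
}

def minutes(delta: timedelta) -> float:
    return delta.total_seconds() / 60

ttdec = minutes(incident["declared_at"] - incident["detected_at"])     # Time to Declare
ttfix = minutes(incident["resolved_at"] - incident["declared_at"])     # Time to Fix
duration = minutes(incident["verified_at"] - incident["detected_at"])  # Total Incident Duration

print(f"TTDec: {ttdec:.0f} min, TTFix: {ttfix:.0f} min, total duration: {duration:.0f} min")
```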
Composite Metrics: Reading the Patterns
When you look at all three metrics together, they tell you what’s really happening in the system. Here’s how to interpret patterns:
All Three Metrics Improving (TTDec ↓, TTFix ↓, Duration ↓)
What it means: Excellent incident management. We’re detecting problems quickly, declaring them promptly, fixing them fast, and minimizing total impact.
What’s working: Strong observability, clear ownership, effective runbooks, skilled team.
Action: Maintain and continue improving. This is the goal state.
Fast Declaration but Slow Fix
What it means: TTDec ↓ but TTFix ↑ — We’re detecting problems quickly, but taking too long to resolve them.
What’s wrong: Our runbooks might be incomplete, the team may lack expertise, or we may need better tools and access.
Action: Invest in runbook quality, provide training, and audit escalation paths.
Slow Declaration but Quick Fix
What it means: TTDec ↑ but TTFix ↓ — We’re slow to detect problems, but once we do, we fix them fast.
What’s wrong: Gaps in our monitoring and alerting. We’re not seeing problems until they’re severe.
Action: Audit observability, add missing metrics and dashboards, improve alert thresholds.
Both Slow (TTDec ↑ and TTFix ↑)
What it means: We’re detecting problems late and taking too long to fix them.
What’s wrong: This requires comprehensive improvement—both monitoring and response capabilities need work.
Action: Phase 1: Improve observability. Phase 2: Improve troubleshooting. Both are critical.
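The interpretation above can be summarized as a simple decision table. This is an illustrative sketch only, with hypothetical inputs (trend-direction flags), not part of any dashboard tooling:

```python
def interpret_trends(ttdec_improving: bool, ttfix_improving: bool) -> str:
    """Map TTDec/TTFix trend directions to the guidance above.

    True means the metric is trending down (improving); False means flat or worse.
    """
    if ttdec_improving and ttfix_improving:
        return "Goal state: maintain and keep improving."
    if ttdec_improving and not ttfix_improving:
        return "Fast declaration, slow fix: invest in runbooks, training, escalation paths."
    if not ttdec_improving and ttfix_improving:
        return "Slow declaration, quick fix: audit observability and alert thresholds."
    return "Both slow: improve observability first, then troubleshooting capability."

print(interpret_trends(ttdec_improving=True, ttfix_improving=False))
```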
On-Call Health Metrics
Beyond incident response metrics, we also measure the health and sustainability of the on-call program itself.
Rotation Frequency
What we measure: How often is each engineer on-call?
Target: Maximum 1 week per month per engineer
Why it matters: On-call must be sustainable. If someone is on-call too frequently, they’ll burn out.
What to do if high: Add team members to the rotation, improve alerting to reduce pages, or distribute coverage differently.
Alert Volume (Pages Per Shift)
What we measure: How many times does a Tier 2 engineer get paged during their shift?
Target: 2-5 pages per shift is typical; varies by service
Why it matters: Too many pages lead to alert fatigue. Too few may mean we’re not detecting real issues.
What to do if too high: Tune alert thresholds, fix noisy monitoring, remove false alarms.
What to do if too low: Verify we’re not missing real issues; check alert coverage.
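As an illustration, pages per shift can be counted from a paging-tool export of alert timestamps plus the shift window. The data below is hypothetical:

```python
from datetime import datetime

# Hypothetical page log and shift window; real data would come from your paging tool's export.
pages = [
    datetime(2024, 3, 4, 2, 15),
    datetime(2024, 3, 4, 14, 40),
    datetime(2024, 3, 5, 9, 5),
]
shift_start = datetime(2024, 3, 4, 0, 0)
shift_end = datetime(2024, 3, 5, 0, 0)

pages_this_shift = sum(shift_start <= p < shift_end for p in pages)
print(f"Pages during shift: {pages_this_shift}")  # compare against the 2-5 per shift target
```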
Escalation Accuracy
What we measure: When Tier 2 escalates an incident, are they escalating to the right team?
Why it matters: Escalating to the wrong person wastes time. Escalating to the right person fixes the issue faster.
Target: 90%+ of escalations go to the correct team on first try
What to do if low: Improve escalation decision tree, clarify when to escalate, provide more context in runbooks.
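One way to approximate escalation accuracy is to compare the first team paged with the team that ultimately resolved the incident. A minimal sketch, assuming hypothetical escalation records with those two fields:

```python
# Hypothetical escalation records: did the first escalation reach the team that resolved it?
escalations = [
    {"first_team": "database", "resolving_team": "database"},
    {"first_team": "network", "resolving_team": "network"},
    {"first_team": "frontend", "resolving_team": "api"},
]

correct = sum(e["first_team"] == e["resolving_team"] for e in escalations)
accuracy = correct / len(escalations)
print(f"First-try escalation accuracy: {accuracy:.0%}")  # target: 90%+
```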
Burnout Prevention Metrics
Quarterly Surveys
What we ask:
- “How sustainable is your on-call workload?”
- “Do you feel supported when on-call?”
- “Have you experienced on-call-related stress or fatigue?”
- “Would you recommend this program to new team members?”
Target: 80%+ of engineers rate on-call as sustainable
What to do if scores are low: Investigate immediately. On-call burnout is serious and requires action.
Learning and Improvement Metrics
Incident Knowledge Capture
What we measure: For critical incidents (S1/S2), do we create documented learning?
Target: 100% of S1/S2 incidents have a retrospective or formal write-up
Why it matters: Incidents are learning opportunities. If we don’t capture what we learned, we’ll repeat the same mistakes.
Runbook Usage
What we measure: Are engineers actually using runbooks during incidents? Are new runbooks created after incidents?
Target: 80%+ of incident reports reference a runbook
Why it matters: Runbooks save time and reduce errors. If they’re not being used, we’re missing an opportunity.
Runbook Coverage
What we measure: How many documented runbooks/playbooks exist for common scenarios?
Target: Minimum 15-20 core runbooks covering 80% of common incidents
Why it matters: Having documented procedures speeds up response and reduces toil.
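Runbook usage and coverage can both be read from incident reports if reports are tagged with the runbook they used. A small sketch with hypothetical report data:

```python
# Hypothetical incident reports tagged with the runbook they referenced (None if none was used).
incident_reports = [
    {"id": "INC-101", "runbook": "db-failover"},
    {"id": "INC-102", "runbook": None},
    {"id": "INC-103", "runbook": "cache-flush"},
    {"id": "INC-104", "runbook": "db-failover"},
]

referencing = [r for r in incident_reports if r["runbook"]]
usage_rate = len(referencing) / len(incident_reports)
distinct_runbooks_used = {r["runbook"] for r in referencing}

print(f"Runbook usage rate: {usage_rate:.0%}")  # target: 80%+ of reports reference a runbook
print(f"Distinct runbooks exercised: {len(distinct_runbooks_used)}")
```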
Fair Distribution Metrics
Escalation Patterns
What we measure: Do certain teams get escalated to more often? Is load spread fairly?
Why it matters: If one team is always escalating to the same group, it indicates either a real problem (that group owns critical services) or a routing issue.
Action: Monitor patterns, rebalance quarterly if needed.
Baseline vs. Target Metrics
Your rotation leader has established metrics for your program. Understanding the difference between baseline and target matters:
Baseline Metrics (Current State)
These are measured before or at the start of Tier 2:
- “Currently, alerts page EMs an average of 20 times per week”
- “Average time to fix is 45 minutes”
- “Incidents last an average of 2 hours from start to resolution”
Target Metrics (Goal State)
These are what we’re aiming for:
- “With Tier 2, on-call specialists will be paged an average of 5 times per week”
- “Average time to fix will be 20 minutes”
- “Incidents will last an average of 30 minutes”
Progress Dashboard
Your team should have a dashboard showing:
- Current metrics vs. targets
- Trend over time (are we improving?)
- Metrics by team or service
- Areas where we’re ahead of target vs. behind
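One simple way for the dashboard to express “ahead of target vs. behind” is the fraction of the baseline-to-target gap that has been closed. A minimal sketch, assuming lower-is-better metrics and using the illustrative TTFix figures above (45-minute baseline, 20-minute target, and a hypothetical 30-minute current value):

```python
def progress_to_target(baseline: float, current: float, target: float) -> float:
    """Fraction of the baseline-to-target gap closed so far (for metrics where lower is better)."""
    gap = baseline - target
    if gap <= 0:
        return 1.0  # target is at or above baseline; treat as already met
    return max(0.0, min(1.0, (baseline - current) / gap))

# Example: TTFix baseline 45 min, target 20 min, currently 30 min.
print(f"TTFix progress toward target: {progress_to_target(baseline=45, current=30, target=20):.0%}")
```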
How to Read the Metrics
Trending the Right Direction
If metrics are improving (TTDec down, TTFix down, total duration down), you’re winning. Celebrate this. But also:
- Ask what’s working and keep doing it
- Identify what changed and document it
- Share best practices with the team
Metrics Plateau or Get Worse
If metrics stop improving or get worse:
- Don’t panic. Investigate why.
- Did something change (new service, team change, alert tuning)?
- Is it a temporary spike or a trend?
- What action is needed?
Using Metrics to Improve
Metrics aren’t about blame. They’re about identifying where to focus effort:
- “Alert volume is 3x what we expected; we should tune thresholds”
- “TTFix is high for database issues; we need better database runbooks”
- “Some services scale well in Tier 2, others struggle; let’s learn why”
Tier 2 Program Success Criteria
Beyond individual metrics, the program itself succeeds when:
Foundation & Standardization (Phase 1)
- All Tier 2 rotations are mapped with clear ownership
- Escalation paths are documented and accessible
- Incident taxonomy is standardized
- Team members are trained on Tier 2 processes
- Structured retrospectives are completed for escalated incidents
Enhancement & Integration (Phase 2)
- Process audit completed with identified Duo use cases
- 15-20 core runbooks/playbooks documented and in use
- At least 80% of team members aware of and using runbooks
- Baseline and target metrics defined
- Evidence that incidents reference runbooks and learnings
- At least 30% reduction in time spent on manual toil
Cultural Success Indicators
Beyond metrics, we measure success by culture:
Blameless Incident Response
- Retrospectives focus on systems, not people
- Engineers feel safe escalating when needed
- Learning is celebrated, not punished
Knowledge Sharing
- Runbooks are continuously updated
- Team members teach each other
- Patterns are recognized and prevented
Sustainability
- Engineers don’t burn out from on-call
- People feel on-call is manageable and educational
- On-call experience is valued in career growth
Your Role in Measuring Success
As an on-call engineer, you contribute by:
- Providing honest feedback on your experience
- Referencing runbooks and noting when they help or fail
- Participating in retrospectives authentically
- Suggesting improvements based on what you see
- Celebrating wins and learning from failures
Related Pages
- DevOps Rotation Leader — Rotation leaders track these metrics
- Communication and Culture — Blameless culture supports these goals
- Joining and Leaving the Rotation — Understand fairness metrics in your rotation