Measuring Success of the Tier 2 On-Call Program

How do we know if the Tier 2 on-call program is working? We measure it through specific metrics that reflect both operational excellence and engineer well-being. This page explains what we track and why it matters.

Core Success Metrics

Our program focuses on three interconnected metrics that together tell the story of incident response quality:

Time to Declare (TTDec)

What it measures: How quickly do we recognize a problem and formally declare it as an incident?

Why it matters: The faster we declare an incident, the faster we activate our full response machinery. A long gap between when something breaks and when we declare it means we’re already losing ground.

What it shows: Good TTDec indicates strong observability, clear ownership, and alert systems that work. Poor TTDec suggests blind spots in our monitoring.

Target: Declare incidents within 5-10 minutes of detection for critical services

How Tier 2 helps: You validate early alert signals, confirm severity, and lead the decision to formally declare incidents promptly.

Time to Fix (TTFix)

What it measures: How long from when we declare an incident to when it’s actually resolved?

Why it matters: This is what customers care about most. Shorter fix times mean less business impact and better reliability.

What it shows: Good TTFix indicates effective troubleshooting, strong runbooks, and skilled on-call engineers. Poor TTFix suggests we need better tools, documentation, or training.

Target: 30 minutes or less for most incidents; under 5 minutes for critical issues

How Tier 2 helps: You own the technical resolution, execute runbooks, coordinate with other teams, and validate the fix through testing.

Total Incident Duration

What it measures: Total elapsed time from when an incident starts to when it’s completely resolved (including verification and monitoring).

Why it matters: This captures the full window of customer impact. It includes detection time, declaration time, fix time, and verification.

What it shows: Trends in incident duration over time reveal whether we’re getting better at preventing and responding to issues.

Target: Reduce overall incident duration by 20-30% within the first year of the program

How Tier 2 helps: You coordinate the response, update status milestones in real time, and ensure proper verification before declaring resolution.
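All three timing metrics above can be derived from a handful of incident timestamps. A minimal sketch in Python (the field names and example times are illustrative, not taken from any real incident tool):

```python
from datetime import datetime

def incident_timings(detected_at, declared_at, resolved_at, verified_at):
    """Derive the three core timing metrics from incident timestamps."""
    return {
        "ttdec": declared_at - detected_at,   # Time to Declare
        "ttfix": resolved_at - declared_at,   # Time to Fix
        "total": verified_at - detected_at,   # Total Incident Duration
    }

timings = incident_timings(
    detected_at=datetime(2025, 10, 19, 9, 0),
    declared_at=datetime(2025, 10, 19, 9, 7),
    resolved_at=datetime(2025, 10, 19, 9, 32),
    verified_at=datetime(2025, 10, 19, 9, 45),
)
print(timings["ttdec"])  # 0:07:00 — within the 5-10 minute declaration target
```

Note that total duration is anchored on detection, not declaration, so it always includes verification time on top of TTDec and TTFix.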

Composite Metrics: Reading the Patterns

When you look at all three metrics together, they tell you what’s really happening in the system. Here’s how to interpret patterns:

All Three Metrics Improving (TTDec ↓, TTFix ↓, duration ↓)

What it means: Excellent incident management. We’re detecting problems quickly, declaring them promptly, fixing them fast, and minimizing total impact.

What’s working: Strong observability, clear ownership, effective runbooks, skilled team.

Action: Maintain and continue improving. This is the goal state.

Fast Declaration but Slow Fix

What it means: TTDec ↓ but TTFix ↑ — We’re detecting problems quickly, but taking too long to resolve them.

What’s wrong: Our runbooks might be incomplete, the team may lack expertise, or we need better tools and access.

Action: Invest in runbook quality, provide training, and audit escalation paths.

Slow Declaration but Quick Fix

What it means: TTDec ↑ but TTFix ↓ — We’re slow to detect problems, but once we do, we fix them fast.

What’s wrong: Gaps in our monitoring and alerting. We’re not seeing problems until they’re severe.

Action: Audit observability, add missing metrics and dashboards, improve alert thresholds.

Both Slow (TTDec ↑ and TTFix ↑)

What it means: We’re detecting problems late and taking too long to fix them.

What’s wrong: This requires comprehensive improvement—both monitoring and response capabilities need work.

Action: Phase 1: Improve observability. Phase 2: Improve troubleshooting. Both are critical.
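The four patterns above reduce to a simple decision table on the two trends. A hypothetical sketch of that mapping (not a real dashboard integration):

```python
def diagnose_trend(ttdec_improving: bool, ttfix_improving: bool) -> str:
    """Map the TTDec/TTFix trend combination to the patterns described above."""
    if ttdec_improving and ttfix_improving:
        return "goal state: maintain and keep improving"
    if ttdec_improving:  # fast declaration, slow fix
        return "invest in runbook quality, training, and escalation paths"
    if ttfix_improving:  # slow declaration, quick fix
        return "audit observability, add missing metrics, improve alert thresholds"
    return "both slow: improve observability first, then troubleshooting"

print(diagnose_trend(ttdec_improving=True, ttfix_improving=False))
```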

On-Call Health Metrics

Beyond incident response metrics, we also measure the health and sustainability of the on-call program itself.

Rotation Frequency

What we measure: How often is each engineer on-call?

Target: Maximum 1 week per month per engineer

Why it matters: On-call must be sustainable. If someone is on-call too frequently, they’ll burn out.

What to do if high: Add team members to the rotation, improve alerting to reduce pages, or distribute coverage differently.

Alert Volume (Pages Per Shift)

What we measure: How many times does a Tier 2 engineer get paged during their shift?

Target: 2-5 pages per shift is typical; varies by service

Why it matters: Too many pages = alert fatigue. Too few = maybe we’re not detecting real issues.

What to do if too high: Tune alert thresholds, fix noisy monitoring, remove false alarms.

What to do if too low: Verify we’re not missing real issues; check alert coverage.
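Checking a shift against the typical band is a one-liner worth automating. A minimal sketch, assuming the 2-5 band above (the band is a tuning parameter, not a fixed rule):

```python
def check_alert_volume(pages_per_shift: int, low: int = 2, high: int = 5) -> str:
    """Flag shifts outside the typical pages-per-shift band."""
    if pages_per_shift > high:
        return "too high: tune thresholds, fix noisy monitors, remove false alarms"
    if pages_per_shift < low:
        return "too low: verify alert coverage isn't missing real issues"
    return "within typical range"

print(check_alert_volume(8))
```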

Escalation Accuracy

What we measure: When Tier 2 escalates an incident, are they escalating to the right team?

Why it matters: Escalating to the wrong person wastes time. Escalating to the right person fixes the issue faster.

Target: 90%+ of escalations go to the correct team on first try

What to do if low: Improve escalation decision tree, clarify when to escalate, provide more context in runbooks.
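Escalation accuracy is simply the fraction of escalations that reached the correct team on the first try. A sketch with a made-up record shape (the `first_try_correct` field is illustrative):

```python
def escalation_accuracy(escalations: list[dict]) -> float:
    """Fraction of escalations routed to the correct team on the first try."""
    correct = sum(1 for e in escalations if e["first_try_correct"])
    return correct / len(escalations)

# 9 correct out of 10 sample escalations
sample = [{"first_try_correct": True}] * 9 + [{"first_try_correct": False}]
accuracy = escalation_accuracy(sample)
print(f"{accuracy:.0%}")  # 90% — right at the target
```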

Burnout Prevention Metrics

Quarterly Surveys

What we ask:

  • “How sustainable is your on-call workload?”
  • “Do you feel supported when on-call?”
  • “Have you experienced on-call-related stress or fatigue?”
  • “Would you recommend this program to new team members?”

Target: 80%+ of engineers rate on-call as sustainable

What to do if scores are low: Investigate immediately. On-call burnout is serious and requires action.

Learning and Improvement Metrics

Incident Knowledge Capture

What we measure: For critical incidents (S1/S2), do we create documented learning?

Target: 100% of S1/S2 incidents have a retrospective or formal write-up

Why it matters: Incidents are learning opportunities. If we don’t capture what we learned, we’ll repeat the same mistakes.

Runbook Usage

What we measure: Are engineers actually using runbooks during incidents? Are new runbooks created after incidents?

Target: 80%+ of incident reports reference a runbook

Why it matters: Runbooks save time and reduce errors. If they’re not being used, we’re missing an opportunity.

Runbook Coverage

What we measure: How many documented runbooks/playbooks exist for common scenarios?

Target: Minimum 15-20 core runbooks covering 80% of common incidents

Why it matters: Having documented procedures speeds up response and reduces toil.

Fair Distribution Metrics

Escalation Patterns

What we measure: Do certain teams get escalated to more often? Is load spread fairly?

Why it matters: If one team is always escalating to the same group, it indicates either a real problem (that group owns critical services) or a routing issue.

Action: Monitor patterns, rebalance quarterly if needed.

Baseline vs. Target Metrics

Your rotation leader has established both baseline and target metrics for your program. Understanding the difference between them matters:

Baseline Metrics (Current State)

These are measured before or at the start of Tier 2:

  • “Currently, alerts page EMs an average of 20 times per week”
  • “Average time to fix is 45 minutes”
  • “Incidents last an average of 2 hours from start to resolution”

Target Metrics (Goal State)

These are what we’re aiming for:

  • “With Tier 2, on-call specialists will be paged an average of 5 times per week”
  • “Average time to fix will be 20 minutes”
  • “Incidents will last an average of 30 minutes”
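Progress from baseline to target can be expressed as a single fraction, which is handy for a dashboard. A sketch for metrics we want to drive down, using the illustrative TTFix numbers above (45-minute baseline, 20-minute target):

```python
def progress_to_target(baseline: float, current: float, target: float) -> float:
    """How far the current value sits between baseline and target (lower is better)."""
    if baseline == target:
        return 1.0  # nothing to improve
    return (baseline - current) / (baseline - target)

# TTFix has dropped from 45 min to 30 min against a 20-min target.
print(progress_to_target(baseline=45, current=30, target=20))  # 0.6 → 60% of the way
```

Values above 1.0 mean the target has been beaten; negative values mean the metric has regressed past the baseline.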

Progress Dashboard

Your team should have a dashboard showing:

  • Current metrics vs. targets
  • Trend over time (are we improving?)
  • Metrics by team or service
  • Areas where we’re ahead of target vs. behind

How to Read the Metrics

Metrics Improving

If metrics are improving (TTDec down, TTFix down, total duration down), you’re winning. Celebrate this. But also:

  • Ask what’s working and keep doing it
  • Identify what changed and document it
  • Share best practices with the team

Metrics Plateau or Get Worse

If metrics stop improving or get worse:

  • Don’t panic. Investigate why.
  • Did something change (new service, team change, alert tuning)?
  • Is it a temporary spike or a trend?
  • What action is needed?

Using Metrics to Improve

Metrics aren’t about blame. They’re about identifying where to focus effort:

  • “Alert volume is 3x what we expected; we should tune thresholds”
  • “TTFix is high for database issues; we need better database runbooks”
  • “Some services scale well in Tier 2, others struggle; let’s learn why”

Tier 2 Program Success Criteria

Beyond individual metrics, the program itself succeeds when:

Foundation & Standardization (Phase 1)

  • All Tier 2 rotations are mapped with clear ownership
  • Escalation paths are documented and accessible
  • Incident taxonomy is standardized
  • Team members are trained on Tier 2 processes
  • Structured retrospectives are completed for escalated incidents

Enhancement & Integration (Phase 2)

  • Process audit completed with identified Duo use cases
  • 15-20 core runbooks/playbooks documented and in use
  • At least 80% of team members aware of and using runbooks
  • Baseline and target metrics defined
  • Evidence that incidents reference runbooks and learnings
  • At least 30% reduction in time spent on manual toil

Cultural Success Indicators

Beyond metrics, we measure success by culture:

Blameless Incident Response

  • Retrospectives focus on systems, not people
  • Engineers feel safe escalating when needed
  • Learning is celebrated, not punished

Knowledge Sharing

  • Runbooks are continuously updated
  • Team members teach each other
  • Patterns are recognized and prevented

Sustainability

  • Engineers don’t burn out from on-call
  • People feel on-call is manageable and educational
  • On-call experience is valued in career growth

Your Role in Measuring Success

As an on-call engineer, you contribute by:

  • Providing honest feedback on your experience
  • Referencing runbooks and noting when they help or fail
  • Participating in retrospectives authentically
  • Suggesting improvements based on what you see
  • Celebrating wins and learning from failures
Last modified October 19, 2025: drs-add-landing-page-tier2 (6f2cba79)