Flaky tests management and processes

Introduction

A flaky test is an unreliable test that occasionally fails but passes if retried enough times. In a test suite of any significant size, flaky tests are inevitable, so our goal should be to limit their negative impact as soon as possible.
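To make the definition concrete, here is a minimal, hypothetical RSpec example (not taken from our suite) of one common source of flakiness: order dependence between examples.

```ruby
# Hypothetical order-dependent spec: it passes when the examples run in
# file order, but fails under `rspec --order random` whenever the second
# example runs first and mutates the shared CACHE constant.
CACHE = {}

RSpec.describe 'a shared cache' do
  it 'starts empty' do
    expect(CACHE).to be_empty
  end

  it 'stores a value' do
    CACHE[:key] = 'value'
    expect(CACHE[:key]).to eq('value')
  end
end
```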

Out of all the factors that affect master pipeline stability, flaky tests contribute to at least 30% of master pipeline failures each month.

Current state and assumptions

| Current state | Assumptions |
| --- | --- |
| master success rate was at 89% for March 2024 | We don’t know exactly what the success rate would be without any flaky tests, but we assume we could reach 99%. |
| 5200+ ~"failure::flaky-test" issues out of a total of 260,040 tests as of 2024-03-01 | This means we have identified 1.99% of our tests as flaky. GitHub identified that 25% of their tests were flaky at some point, so our reality probably lies somewhere in between. |
| Coverage is currently at 98.42% | Even if we removed the 5200 flaky tests, we don’t expect coverage to drop meaningfully. |
| “Average Retry Count” per pipeline is currently at 0.015 | Given RSpec jobs’ current average duration of 23 minutes, this adds 0.015 * 23 = 0.345 minutes on average per pipeline, not counting the idle time between a job failing and being retried (explanation provided by Albert). With approximately 91k pipelines per month, flaky tests waste 31,395 CI minutes per month. Given our private runners cost us $0.0845 / minute, this means flaky tests waste at minimum $2,653 per month of CI minutes, not counting the engineers’ time wasted. |
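For reference, the waste estimate above can be reproduced with a few lines of Ruby; the figures come from the table and the variable names are ours.

```ruby
# Back-of-the-envelope CI waste estimate, using the figures from the table.
average_retry_count = 0.015   # retries per pipeline
rspec_job_duration  = 23.0    # minutes per retried RSpec job
pipelines_per_month = 91_000
runner_cost_per_min = 0.0845  # USD per CI minute on private runners

wasted_minutes_per_pipeline = average_retry_count * rspec_job_duration          # ~0.345
wasted_minutes_per_month    = wasted_minutes_per_pipeline * pipelines_per_month # ~31,395
wasted_dollars_per_month    = wasted_minutes_per_month * runner_cost_per_min    # ~2,653

puts format('~$%.0f of CI minutes wasted per month', wasted_dollars_per_month)
```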

Manual flow to detect flaky tests

When a flaky test fails in an MR, the author typically follows this flow:

graph LR
    A[Test fails in an MR] --> C{Does the failure look related to the MR?}
    C -->|Yes| D[Try to reproduce and fix the test locally]
    C -->|No| E{Does a flaky test issue exist?}
    E -->|Yes| F[Retry the job and hope that it will pass this time]
    E -->|No| G[Wonder if this is flaky and retry the job]

Why is flaky tests management important?

Flaky tests negatively impact several teams and areas:

| Impacted department/team | Impacted area | Impact description | Impact quantification |
| --- | --- | --- | --- |
| Development department | MR & deployment cycle time | Wasted time (by forcing people to look at the failures and retry them manually if needed) | A lot of wasted time for all our engineers |
| Infrastructure department | CI compute resources | Wasted money | At least $2,653 worth of wasted CI compute time per month |
| Delivery team & Quality department | Deployment cycle time | Distraction from actual CI failures & regressions, leading to slower detection of those | TBD |

Flaky tests management process

We started an experiment to automatically open merge requests that quarantine very flaky tests, in order to improve overall pipeline stability and duration. To ensure that product quality is not negatively affected by the reduced test coverage, the following process should be followed (a sketch of what quarantining can look like in RSpec appears after this list):

  1. Groups are responsible for reviewing their test-quarantining merge requests. These merge requests are meant to start a discussion on whether a test is still useful. If a test is heavily impacting master’s stability, the Engineering Productivity team can merge the quarantining merge request even without a review from the responsible group. The group should still review the merge request afterwards and discuss the quarantined test’s next steps.
  2. Once a test is quarantined, its associated issue will be reported in weekly group reports. Groups can also list all of their flaky tests and their quarantined tests (replace group::xxx in the issues list).
  3. The number of quarantined test cases per group is also available as a dashboard.
  4. Groups are responsible for ensuring the stability and coverage of their own tests, either by fixing quarantined tests so that they can run again or by removing them.
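Below is a minimal sketch of how tag-based quarantining can be wired up in RSpec, assuming a custom `:quarantine` metadata key; the exact mechanism and helper names used in our suite may differ.

```ruby
# spec/support/quarantine.rb (illustrative): skip any example tagged with
# :quarantine and point at its flaky-test issue in the skip message.
RSpec.configure do |config|
  config.before(:each, :quarantine) do |example|
    skip("Quarantined flaky test, see #{example.metadata[:quarantine]}")
  end
end

# A quarantined example keeps its assertions in place and references its
# flaky-test issue, so the owning group can find it in the weekly report
# and decide whether to fix or remove it.
RSpec.describe 'an intermittently failing feature' do
  it 'passes most of the time', quarantine: '<flaky-test issue URL>' do
    # original assertions stay here, ready to be un-quarantined
  end
end
```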

You can leave any feedback about this process in the dedicated issue.

Goals

  • Increase master stability to a solid 95% success rate without manual action
  • Improve productivity: shorter MR merge time and a lower “Average Retry Count”
  • Remove doubts on whether master is broken or not
  • Reduce the need to retry a failing job by default
  • Define acceptable thresholds for taking action, such as quarantining a test or focusing on refactoring
  • Step towards unlocking Merge train
