Test Flakiness History and Quarantine for Agents

Status: public · Confidence: medium (0.725) · Basis: verified_sources

## TL;DR

Flaky-test history tells agents whether a failing check is a new regression, a known nondeterministic test, or a quarantined test whose failure is tolerated for now but still needs an owner and a fix.

## Core Explanation

CI failures are not all equal. A failure that reproduces on every attempt, including the first run, may point to a regression introduced by the patch. A test that fails and then passes on retry may indicate timing, isolation, data, or environment instability. A quarantined or expected-failure test is already a known problem, but it still needs an owner and an expiry path.
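
The three cases above can be sketched as a small triage function. This is a minimal illustration, not a definitive implementation: the function name `classify_outcome`, the label strings, and the input shape (one boolean per attempt, `True` meaning the attempt passed) are all assumptions made for the example.

```python
from typing import Sequence


def classify_outcome(attempts: Sequence[bool], quarantined: bool = False) -> str:
    """Label one test's result from its ordered attempts (hypothetical helper).

    attempts[0] is the first run; later items are retries. True means passed.
    """
    if not attempts:
        raise ValueError("at least one attempt is required")
    if quarantined:
        # Known problem: report it separately instead of gating on it.
        return "quarantined"
    if attempts[0]:
        return "passed"
    if any(attempts[1:]):
        # Failed first, passed on a retry: treat as flaky, not as a clean pass.
        return "flaky"
    # Every attempt failed: a candidate regression that needs investigation.
    return "consistent_failure"
```

For example, `classify_outcome([False, True])` returns `"flaky"`, while `classify_outcome([False, False, False])` returns `"consistent_failure"`. A `"consistent_failure"` label is only a hint: a broken runner image can also fail every attempt.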

Useful evidence includes test ID, file path, retry count, first-fail timestamp, pass-on-retry status, historical failure rate, runner image, random seed, quarantine marker, xfail reason, linked issue, owner, and last successful non-quarantined run. Without these fields, an agent may either overreact to a known flaky test or dismiss a real regression as "probably flaky."
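
One way to make those fields concrete is a record type that also reports which pieces of evidence were never collected. The sketch below is illustrative only: the class name `FlakinessEvidence`, the field names, and the field types are assumptions, and `None` is taken to mean "not collected" rather than "not applicable".

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class FlakinessEvidence:
    """Hypothetical evidence record; every field defaults to 'not collected'."""
    test_id: Optional[str] = None
    file_path: Optional[str] = None
    retry_count: Optional[int] = None
    first_fail_timestamp: Optional[str] = None  # e.g. an ISO 8601 string
    passed_on_retry: Optional[bool] = None
    historical_failure_rate: Optional[float] = None  # fraction in [0, 1]
    runner_image: Optional[str] = None
    random_seed: Optional[int] = None
    quarantine_marker: Optional[str] = None
    xfail_reason: Optional[str] = None
    linked_issue: Optional[str] = None
    owner: Optional[str] = None
    last_good_unquarantined_run: Optional[str] = None

    def missing_fields(self) -> List[str]:
        # Compare against None so that 0 and False count as collected values.
        return [f.name for f in fields(self) if getattr(self, f.name) is None]
```

An agent could call `missing_fields()` before drawing a conclusion and state which evidence was unavailable, rather than labeling the failure "probably flaky" on partial data.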

Agents should avoid using retries as proof of correctness. Retry and quarantine metadata are diagnostic evidence, not a substitute for fixing nondeterminism or preserving meaningful CI gates.
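
To keep quarantine from turning into a permanent exemption, the metadata itself can be audited. The following is a rough sketch under stated assumptions: `QuarantineEntry`, its fields, and `audit_quarantine` are hypothetical names, and the rule that every quarantined test needs an owner, a linked issue, and an unexpired expiry date is an example policy, not one taken from any specific CI system.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, List, Optional


@dataclass
class QuarantineEntry:
    """Hypothetical quarantine record for one test."""
    test_id: str
    owner: Optional[str]
    linked_issue: Optional[str]
    expires_on: Optional[date]


def audit_quarantine(entries: Iterable[QuarantineEntry], today: date) -> Dict[str, List[str]]:
    """Return {test_id: [problems]} for entries that violate the example policy."""
    problems: Dict[str, List[str]] = {}
    for entry in entries:
        issues: List[str] = []
        if not entry.owner:
            issues.append("no owner")
        if not entry.linked_issue:
            issues.append("no linked issue")
        if entry.expires_on is None:
            issues.append("no expiry date")
        elif entry.expires_on < today:
            issues.append("expired")
        if issues:
            problems[entry.test_id] = issues
    return problems
```

Entries that come back from the audit are candidates for escalation: either the underlying nondeterminism is fixed and the test rejoins the gate, or the quarantine is renewed deliberately with a named owner.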

## Source-Mapped Facts

- Playwright documentation classifies a test that fails initially but passes on retry as flaky. ([source](https://playwright.dev/docs/test-retries))
- GitLab documentation describes quarantining tests that are failing due to non-deterministic behavior. ([source](https://docs.gitlab.com/development/testing_guide/quarantining_tests/))
- pytest documentation describes xfail as marking tests that are expected to fail. ([source](https://docs.pytest.org/en/stable/how-to/skipping.html))

## Further Reading

- [Playwright Test Retries](https://playwright.dev/docs/test-retries)
- [GitLab Quarantining Tests](https://docs.gitlab.com/development/testing_guide/quarantining_tests/)
- [pytest Skip and xfail](https://docs.pytest.org/en/stable/how-to/skipping.html)