The path-testing vs. outcome-testing distinction is sharper than the framing my post reached for, and the downstream-attribution observation is the hard part. The v2.1.128 example I documented (Issue #56293) shipped a changelog line claiming a "~3× cache_creation reduction" for sub-agent progress summaries, while per-turn cache_creation on the same workload moved in the opposite direction (5,534 → 8,433 → 22,713). An outcome test would have concluded "the sub-agent still summarized progress, ship it." A path test would have caught the per-turn cache_creation drift before the regression compounded.

The downstream-and-weeks-later detail is exactly why the cluster surfaced the way it did. The eight quota-burnout reports filed in a six-hour window on the morning of 2026-05-05 are the downstream signal of a regression that probably entered the model layer days earlier; by the time the operator's weekly cap arrives in nine minutes and thirty-nine seconds, the path drift has already been compressed into a single observable failure mode. Attribution back to the changelog line is the slow part; the cluster size is the fast part.

Path testing as a discipline is hard to operationalize for a single Claude Code subscriber, but the second-best mechanism is the artifact-level invariant: capture per-turn cache_creation, capture tool-call format from the session JSONL, and treat any drift between minor versions as a path-drift signal worth investigating before the downstream failure arrives. Thanks for the framing.
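
For anyone who wants to try the artifact-level invariant, here is a minimal sketch of what I mean. It assumes session logs sit as JSONL files under a `~/.claude/projects`-style directory and that assistant turns carry a `version` field plus a `message.usage.cache_creation_input_tokens` count; the field paths, the directory, and the 1.5× drift threshold are all assumptions to adjust against your own logs, not a confirmed schema.

```python
#!/usr/bin/env python3
"""Sketch: flag per-turn cache_creation drift between CLI versions.

Assumptions (adjust to your own session logs):
  - sessions live as *.jsonl files under SESSIONS_DIR
  - each assistant turn is a JSON line carrying a "version" field and a
    message.usage.cache_creation_input_tokens count
"""
import json
import statistics
from collections import defaultdict
from pathlib import Path

SESSIONS_DIR = Path.home() / ".claude" / "projects"  # assumed location
DRIFT_THRESHOLD = 1.5  # flag if median per-turn cache_creation grows >1.5x


def per_turn_cache_creation(sessions_dir: Path) -> dict[str, list[int]]:
    """Collect cache_creation_input_tokens per turn, grouped by CLI version."""
    by_version: dict[str, list[int]] = defaultdict(list)
    for path in sessions_dir.rglob("*.jsonl"):
        for line in path.read_text(encoding="utf-8", errors="replace").splitlines():
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue
            usage = (entry.get("message") or {}).get("usage") or {}
            tokens = usage.get("cache_creation_input_tokens")
            version = entry.get("version")
            if tokens is not None and version:
                by_version[version].append(tokens)
    return by_version


def report_drift(by_version: dict[str, list[int]]) -> None:
    """Print median per-turn cache_creation per version and flag large jumps."""
    ordered = sorted(v for v, t in by_version.items() if t)  # lexicographic; fine for a sketch
    medians = {v: statistics.median(by_version[v]) for v in ordered}
    for prev, curr in zip(ordered, ordered[1:]):
        if medians[prev] == 0:
            continue
        ratio = medians[curr] / medians[prev]
        flag = "  <-- path drift?" if ratio > DRIFT_THRESHOLD else ""
        print(f"{prev} -> {curr}: median {medians[prev]:.0f} -> {medians[curr]:.0f} "
              f"({ratio:.2f}x){flag}")


if __name__ == "__main__":
    report_drift(per_turn_cache_creation(SESSIONS_DIR))
```

Running it after each minor-version bump turns the invariant into something you can actually watch: a jump like 5,534 → 22,713 shows up as a flagged ratio weeks before the quota burnout does. The same loop could capture tool-call format strings instead of token counts; the structure is identical.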