The catalog is a registry of failure modes we've already seen, each one pinned by an anti-pattern rule and a JUnit lock test. That's a regression net, it catches a known failure recurring, not a new one. Tests run fixed inputs, so drift slips past them. This is a must in our workflow as well if our developers are using claude, codex or anything its a signal to the agentic coding platform as well.
We catch drift production-side instead. Every mapping-suggestion call writes an adherence score, and we log the gap between what the AI proposed and what the user actually saved, by domain. When users keep overriding the AI in one domain, that's drift showing up in the data.
The honest gap is the alert (we are making this work in right way in our upcoming sprint). Right now a human has to read the telemetry dashboard to catch it, a threshold that fires on an adherence drop is what's still missing.
ethanwalker
Treating prompts like code is the right framing. We added a CI hook with Promptfoo that blocks any merge where regression-test scores drop more than 5%. The hardest part wasn't writing the eval set, it was getting the team to maintain it as prompts evolved. Curious if your catalog covers the silent-degradation case where prompts pass eval but drift in real-world distribution.