Discussion on "Treating LLM prompts like code: a regression catalog for AI failures"

Amit Mishra · 2026-05-20T02:05:44.634Z

A field note on turning prompt-engineering folklore into structured, regression-tested artifacts. A few months ago, an LLM call in the system I work on confidently paired a refund with a beneficiary

The catalog is a registry of failure modes we've already seen, each one pinned by an anti-pattern rule and a JUnit lock test. That's a regression net, it catches a known failure recurring, not a new one. Tests run fixed inputs, so drift slips past them. This is a must in our workflow as well if our developers are using claude, codex or anything its a signal to the agentic coding platform as well.

We catch drift production-side instead. Every mapping-suggestion call writes an adherence score, and we log the gap between what the AI proposed and what the user actually saved, by domain. When users keep overriding the AI in one domain, that's drift showing up in the data.

The honest gap is the alert (we are making this work in right way in our upcoming sprint). Right now a human has to read the telemetry dashboard to catch it, a threshold that fires on an adherence drop is what's still missing.

Search Hashnode

Treating LLM prompts like code: a regression catalog for AI failures

Responses(2)