Treating prompts like code is the right framing. We added a CI hook with Promptfoo that blocks any merge where regression-test scores drop by more than 5%. The hardest part wasn't writing the eval set; it was getting the team to maintain it as prompts evolved. Curious whether your catalog covers the silent-degradation case, where prompts still pass eval but quality slips as the real-world input distribution drifts.
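
For anyone who wants the gist of the gate, here is a rough sketch of the threshold check in plain Python. It is not Promptfoo's API, just the comparison logic a CI step might run after the eval finishes. The function name `regression_gate`, the idea of a single aggregate score per run, and reading "5%" as a relative drop against the baseline are all assumptions made for illustration.

```python
def regression_gate(baseline: float, current: float, max_drop: float = 0.05) -> bool:
    """Return True if the merge may proceed, False if it should be blocked.

    Hypothetical helper: `baseline` is the aggregate eval score on the main
    branch, `current` is the score on the branch under review. The drop is
    measured relative to the baseline (an assumption; an absolute threshold
    would compare `baseline - current` directly).
    """
    if baseline <= 0:
        # No meaningful baseline to regress from, so do not block.
        return True
    relative_drop = (baseline - current) / baseline
    return relative_drop <= max_drop


if __name__ == "__main__":
    import sys

    # Example values only; a real CI step would load these from eval output.
    allowed = regression_gate(baseline=0.80, current=0.70)
    # A non-zero exit code is what makes the CI job fail and block the merge.
    sys.exit(0 if allowed else 1)
```

A gate like this only catches regressions the eval set can see, which is why the drift question matters: the score can hold steady while production inputs move away from what the tests cover.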
