I like this approach. AI agent output can look good in one example but fail in the full workflow. Treating prompt and agent changes like code changes makes sense to me. You need baseline tests, edge cases, and regression checks before trusting a change in production.
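
Here is a rough sketch of what such a regression check could look like, assuming the agent can be treated as a plain function from input text to output text. The names `CASES`, `pass_rate`, `is_regression`, and the two stand-in agents are made up for illustration and are not from any real framework.

```python
from typing import Callable, List, Tuple

Agent = Callable[[str], str]

# Hypothetical test cases: (input, substring the output must contain).
CASES: List[Tuple[str, str]] = [
    ("refund order 123", "refund"),  # happy path
    ("", "clarify"),                 # edge case: empty input
]


def pass_rate(agent: Agent, cases: List[Tuple[str, str]]) -> float:
    """Fraction of cases whose output contains the expected substring."""
    passed = sum(1 for text, expected in cases if expected in agent(text).lower())
    return passed / len(cases)


def is_regression(baseline: Agent, candidate: Agent, cases: List[Tuple[str, str]]) -> bool:
    """True if the candidate passes fewer cases than the baseline."""
    return pass_rate(candidate, cases) < pass_rate(baseline, cases)


# Stand-ins for real agent calls (assumed; swap in your own).
def baseline_agent(text: str) -> str:
    if not text:
        return "Could you clarify your request?"
    return "Starting refund for your order."


def candidate_agent(text: str) -> str:
    # Looks fine on the happy path, but no longer handles empty input.
    return "Starting refund for your order."


if __name__ == "__main__":
    print("baseline: ", pass_rate(baseline_agent, CASES))
    print("candidate:", pass_rate(candidate_agent, CASES))
    print("regression:", is_regression(baseline_agent, candidate_agent, CASES))
```

The candidate here would pass a single happy-path demo, and only the edge case in the suite shows that it got worse than the baseline. Substring matching is a deliberately crude scorer; a real setup would likely need something more robust.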