This is a really sharp reframe. The two-phase assertion pipeline is the key insight here — most teams I've seen burn money on LLM-as-judge for failures that a simple regex or JSON schema validation would catch in milliseconds.
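For anyone who hasn't seen the pattern: a minimal sketch of that two-phase gate, where the expensive judge only runs if cheap deterministic checks pass. (The schema keys and the `llm_judge` interface here are made up for illustration, not from the post.)

```python
import json
import re

# Hypothetical required schema for illustration
REQUIRED_KEYS = {"answer", "citations"}

def cheap_checks(output: str) -> bool:
    """Phase 1: deterministic assertions that cost microseconds, not tokens."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not REQUIRED_KEYS.issubset(data):
        return False
    # e.g. reject boilerplate the model sometimes leaks into answers
    return not re.search(r"as an AI language model", data["answer"], re.I)

def evaluate(output: str, llm_judge) -> float:
    """Phase 2: the LLM-as-judge call runs only if phase 1 passed."""
    if not cheap_checks(output):
        return 0.0  # hard fail, no judge call billed
    return llm_judge(output)  # assumed to return a score in [0, 1]
```

In practice most failures die in phase 1, so the judge spend scales with your pass rate, not your test count.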
One thing I'd add: the semantic drift problem gets worse when you're optimizing prompts for AI search engines (GEO), not just apps. You optimize for structure and format compliance, but the factual density and citation-readiness of the content degrades silently. Same pattern you described, different domain.
The calibration engine idea (Pearson r against human-graded outputs) is underappreciated. Most prompt testing stops at "did it pass/fail" without asking "does our evaluator even agree with humans?" That's the equivalent of shipping code with tests that don't actually assert the right behavior.
Curious about one thing: how do you handle prompt regression across model updates? A prompt that scores 0.95 on GPT-4o today might drop to 0.7 when the provider updates the model silently. Do you version-lock models in your pipeline, or re-run the full test suite on a schedule?