The Evaluation Bottleneck: Building a "Golden Dataset" Without Losing Your Mind
If I see one more "vibe check" evaluation in a pull request, I’m going to scream.
You know the drill. You tweak the prompt, you run a few queries in the playground, it "feels" better, and you merge. Two days later, a user asks a question about a spec...
ivandimov.dev · 4 min read