Your LLM Judge Has Opinions. They're Not About Quality.
When your eval score goes up, the natural conclusion is that your model got better. But there's another explanation: your LLM judge has systematic biases, and your latest change happened to produce ou
respan.hashnode.dev · 13 min read