Your LLM Judge Has Opinions. They're Not About Quality.
Apr 28 · 13 min read · When your eval score goes up, the natural conclusion is that your model got better. But there's another explanation: your LLM judge has systematic biases, and your latest change happened to produce ou