I checked six LLM-as-judge tools against human labels. The scoreboard was the wrong thing to read.
Most LLM-as-judge comparisons rank tools by which one gives you a number fastest. That is the wrong axis. A judge you have not validated against human labels is not a measurement; it is a vibe with a
llmasajudge.hashnode.dev · 5 min read
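
To make "validated against human labels" concrete, here is a minimal sketch of one way to check a judge: compare its verdicts with human labels on the same items and report chance-corrected agreement (Cohen's kappa) next to raw agreement. The label lists below are made-up illustration data, not results from the post, and `cohens_kappa` is a hypothetical helper written for this example rather than part of any of the tools discussed.

```python
from collections import Counter


def cohens_kappa(judge, human):
    """Chance-corrected agreement between two equal-length label lists."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    # Fraction of items where the judge and the human gave the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    expected = sum(judge_counts[c] * human_counts[c] for c in judge_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for eight items (assumption: binary pass/fail rubric).
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]

raw_agreement = sum(j == h for j, h in zip(judge, human)) / len(human)
kappa = cohens_kappa(judge, human)
print(f"raw agreement: {raw_agreement:.2f}, kappa: {kappa:.2f}")
# prints "raw agreement: 0.75, kappa: 0.50"
```

In this toy data the judge agrees with humans on 75% of items, but because it says "pass" most of the time, half of that agreement is what chance alone would produce. That gap is the reason to look past the raw score.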