Discussion on "I checked six LLM-as-judge tools against human labels. The scoreboard was the wrong thing to read."

mayaandersson · 2026-06-25T17:54:43.128Z

Most LLM-as-judge comparisons rank tools by which one gives you a number fastest. That is the wrong axis. A judge you have not validated against human labels is not a measurement, it is a vibe with a

Responses