Calibration set size for LLM-as-judge: when 50 traces is enough and when 200 is mandatory
TL;DR. The human-labeled calibration set you use to validate an LLM-as-judge does not need a fixed size. It needs a size that depends on how balanced your labels are. For roughly balanced binary crite
llmasajudge.hashnode.dev · 11 min read
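One way to see why label balance drives the required size is to look at how many minority-class examples a calibration set actually contains, and how wide the confidence interval on judge-human agreement is for that class. The sketch below is illustrative only. It assumes a Wilson score interval at 95% confidence, and the helper names `wilson_interval` and `minority_agreement_interval` are hypothetical. It is not necessarily the calculation behind the 50 and 200 figures in the title.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n == 0:
        # No examples means no information: the interval is the whole range.
        return (0.0, 1.0)
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def minority_agreement_interval(
    n_traces: int, minority_fraction: float, observed_agreement: float
) -> Tuple[int, Tuple[float, float]]:
    """Interval on judge-human agreement restricted to the minority label.

    Hypothetical helper: assumes the minority class makes up
    `minority_fraction` of the set and the judge agrees with the human
    label on `observed_agreement` of those minority examples.
    """
    n_minority = round(n_traces * minority_fraction)
    agreed = round(n_minority * observed_agreement)
    return n_minority, wilson_interval(agreed, n_minority)


if __name__ == "__main__":
    # Compare a balanced set with an imbalanced one at two sizes.
    for n_traces, frac in [(50, 0.5), (50, 0.1), (200, 0.1)]:
        n_min, (lo, hi) = minority_agreement_interval(n_traces, frac, 0.9)
        print(f"n={n_traces:>3} minority={frac:.0%} "
              f"minority examples={n_min:>2} interval width={hi - lo:.2f}")
```

Under these assumptions, 50 traces with a 10% minority class leave only 5 minority examples, so the interval on minority-class agreement is far wider than with 200 traces at the same imbalance.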