Stratified sampling for LLM eval sets: why your aggregate pass rate hides the regressions that matter
TL;DR: A headline eval pass rate is an average over every kind of input your system sees, and averages hide the thing you most need to catch: a sharp regression in a small but important slice. If refu
llmasajudge.hashnode.dev · 3 min read
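To make the TL;DR's claim concrete, here is a minimal sketch of the arithmetic. The slice names (`common`, `rare`), the counts, and the helpers `pass_rates` and `make_run` are all hypothetical, invented for illustration and not taken from the post. It compares two eval runs where the aggregate pass rate barely moves while a small slice drops sharply.

```python
from collections import defaultdict


def pass_rates(results):
    """Compute the aggregate pass rate and the pass rate per slice.

    results: iterable of (slice_name, passed) pairs.
    Returns (aggregate, per_slice_dict).
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for slice_name, passed in results:
        totals[slice_name] += 1
        passes[slice_name] += int(passed)
    aggregate = sum(passes.values()) / sum(totals.values())
    per_slice = {name: passes[name] / totals[name] for name in totals}
    return aggregate, per_slice


def make_run(counts):
    """Expand {slice_name: (n_passed, n_total)} into a flat list of results."""
    run = []
    for name, (n_passed, n_total) in counts.items():
        run += [(name, True)] * n_passed + [(name, False)] * (n_total - n_passed)
    return run


# Hypothetical numbers: the rare slice is 5% of the eval set.
baseline = make_run({"common": (900, 950), "rare": (45, 50)})
candidate = make_run({"common": (915, 950), "rare": (25, 50)})

base_agg, base_slices = pass_rates(baseline)
cand_agg, cand_slices = pass_rates(candidate)

print(f"aggregate: {base_agg:.3f} -> {cand_agg:.3f}")
for name in base_slices:
    print(f"{name:>9}: {base_slices[name]:.3f} -> {cand_slices[name]:.3f}")
```

With these made-up counts the aggregate moves from 0.945 to 0.940, a change that could pass for noise, while the rare slice falls from 0.90 to 0.50. The small gain on the common slice masks most of the loss, which is why reporting per-slice rates alongside the headline number is worth the extra lines.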