Your LLM-as-judge eval set is too small. Here is the math.
Method summary:
Cohen's kappa with bootstrap confidence intervals
Sample-size lookup for target CI width (Monte Carlo, not closed-form)
McNemar's test for paired judge comparison
Three production
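As a rough illustration of the first and third methods listed above, here is a minimal sketch using only the Python standard library. It is not the post's own code: the function names (`cohens_kappa`, `bootstrap_kappa_ci`, `mcnemar_exact`), the percentile bootstrap, the exact binomial form of McNemar's test, and the handling of the degenerate single-label case are all assumptions made for this example.

```python
import random
from math import comb


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of labels."""
    n = len(a)
    labels = set(a) | set(b)
    # Observed agreement: fraction of items where both raters agree.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if 1 - p_e == 0:
        # Assumption: kappa is undefined when both raters use one single
        # label; this sketch reports 1.0 so bootstrap resamples stay sortable.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def bootstrap_kappa_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for kappa, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohens_kappa([a[i] for i in idx], [b[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: items judge A got right and judge B got wrong; c: the reverse.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) tail probability of seeing k or fewer, doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical labels: 1 = "pass", 0 = "fail" (made up for illustration).
    human = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1] * 5
    judge = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1] * 5
    print("kappa:", cohens_kappa(human, judge))
    print("95% CI:", bootstrap_kappa_ci(human, judge))
    print("McNemar p:", mcnemar_exact(3, 12))
```

Running this with different numbers of items shows how the interval width changes with sample size, which is the quantity the sample-size lookup in the list targets.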
llmasajudge.hashnode.dev · 9 min read