Discussion on "Your LLM-as-judge eval set is too small. Here is the math."

mayaandersson · 2026-05-26T18:01:43.456Z

Method summary:
Cohen's kappa with bootstrap confidence intervals
Sample-size lookup for target CI width (Monte Carlo, not closed-form)
McNemar's test for paired judge comparison
Three production
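To make the summary concrete, here is a minimal Python sketch of the three named techniques, using only the standard library. It is not the post author's code. Every function name, the percentile bootstrap, the exact (binomial) form of McNemar's test, and the synthetic data generator with its `base_rate` and `agree_prob` parameters are assumptions chosen for illustration.

```python
import math
import random


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label lists (nan if undefined)."""
    n = len(a)
    labels = set(a) | set(b)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    p_exp = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if p_exp >= 1.0:
        return float("nan")  # both raters used one identical label
    return (p_obs - p_exp) / (1.0 - p_exp)


def bootstrap_kappa_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for kappa, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        k = cohens_kappa([a[i] for i in idx], [b[i] for i in idx])
        if not math.isnan(k):  # skip degenerate resamples
            stats.append(k)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi


def simulate_pair(n, base_rate, agree_prob, rng):
    """Hypothetical generator: the judge copies the human label with
    probability agree_prob and flips it otherwise (an assumption)."""
    human = [int(rng.random() < base_rate) for _ in range(n)]
    judge = [h if rng.random() < agree_prob else 1 - h for h in human]
    return human, judge


def mean_ci_width(n, base_rate=0.5, agree_prob=0.85,
                  reps=20, n_boot=300, seed=0):
    """Monte Carlo estimate of the average bootstrap CI width at size n."""
    rng = random.Random(seed)
    widths = []
    for r in range(reps):
        human, judge = simulate_pair(n, base_rate, agree_prob, rng)
        lo, hi = bootstrap_kappa_ci(human, judge, n_boot=n_boot, seed=r)
        widths.append(hi - lo)
    return sum(widths) / reps


def mcnemar_exact(judge1_correct, judge2_correct):
    """Two-sided exact McNemar p-value from the discordant pairs only."""
    pairs = list(zip(judge1_correct, judge2_correct))
    b = sum(1 for x, y in pairs if x and not y)  # only judge 1 correct
    c = sum(1 for x, y in pairs if y and not x)  # only judge 2 correct
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A sample-size lookup under these assumptions is then a loop over candidate sizes, for example `{n: mean_ci_width(n) for n in (50, 100, 200, 400)}`, picking the smallest `n` whose average width falls below the target. The resulting table depends on the assumed base rate and agreement level, so those should be set from pilot data rather than left at the defaults.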

