We put confidence intervals on our LLM-judge scores. The error bars ate three weeks of "trend"
1d ago · 2 min read

We track weekly agreement between an LLM judge and human labels (Cohen's kappa) on a sample of production traces. For three weeks the point estimates told a story: 0.55, then 0.49, then 0.44. The team
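
For readers who want to see what putting an interval on a weekly kappa can look like, here is a minimal sketch using a percentile bootstrap over the sampled traces. The excerpt does not say which interval method was used, so the bootstrap is an assumption, and the names `cohen_kappa` and `bootstrap_kappa_ci` are hypothetical helpers written for this illustration. The demo labels at the bottom are synthetic, not data from the post.

```python
import random
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(a)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return float("nan")  # both raters used one identical label; kappa is undefined
    return (p_o - p_e) / (1.0 - p_e)


def bootstrap_kappa_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate plus a percentile-bootstrap interval (assumed method)."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        # Resample whole items (judge label and human label stay paired).
        idx = [rng.randrange(n) for _ in range(n)]
        k = cohen_kappa([a[i] for i in idx], [b[i] for i in idx])
        if k == k:  # drop NaN from degenerate resamples
            stats.append(k)
    if not stats:
        raise ValueError("every bootstrap resample was degenerate")
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return cohen_kappa(a, b), lo, hi


if __name__ == "__main__":
    # Synthetic example: 150 binary labels where the judge flips about 25% of them.
    rng = random.Random(42)
    human = [rng.randint(0, 1) for _ in range(150)]
    judge = [h if rng.random() > 0.25 else 1 - h for h in human]
    point, lo, hi = bootstrap_kappa_ci(human, judge)
    print(f"kappa={point:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

Running the same function on each week's sample and comparing the intervals, instead of only the point estimates, is one way to check whether a week-over-week drop is larger than the sampling noise.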

