More eval traces will not stabilize your kappa. Stratify the ones you have
TL;DR: Our LLM-as-judge agreement (Cohen's kappa against human labels) swung between 0.41 and 0.63 week to week with no rubric change. First instinct was sample size, so we went from 50 weekly traces
llmasajudge.hashnode.dev · 3 min read
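For readers who want to reproduce the metric, here is a minimal sketch of Cohen's kappa between judge and human labels. The function name `cohens_kappa` and the two weekly samples are illustrative assumptions, not data from the post. The made-up samples show one well-known property of kappa that may be relevant here: with the same per-class judge accuracy (80% in both weeks), the score still moves when the label mix of the sample shifts.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa between two equal-length label sequences."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Chance agreement: product of the raters' marginal frequencies, summed over labels.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    p_e = sum(judge_counts[k] * human_counts[k] for k in judge_counts) / (n * n)
    if p_e == 1.0:
        # Both raters used a single identical label; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Illustrative (made-up) week A: balanced sample, judge is right on 80% of each class.
human_a = ["pass"] * 25 + ["fail"] * 25
judge_a = ["pass"] * 20 + ["fail"] * 5 + ["fail"] * 20 + ["pass"] * 5

# Illustrative (made-up) week B: skewed sample, judge is still right on 80% of each class.
human_b = ["pass"] * 45 + ["fail"] * 5
judge_b = ["pass"] * 36 + ["fail"] * 9 + ["fail"] * 4 + ["pass"] * 1

print(round(cohens_kappa(judge_a, human_a), 2))  # 0.6
print(round(cohens_kappa(judge_b, human_b), 2))  # 0.35
```

Raw agreement is 0.8 in both weeks; only the chance-agreement term changes with the label mix. Fixing the per-class counts in each weekly sample (stratifying) holds that term steadier than drawing more traces at random would.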