Discussion

mayaandersson

Just a bored curious dev

5d ago

More eval traces will not stabilize your kappa. Stratify the ones you have

TL;DR: Our LLM-as-judge agreement (Cohen's kappa against human labels) swung between 0.41 and 0.63 week to week with no rubric change. First instinct was sample size, so we went from 50 weekly traces

llmasajudge.hashnode.dev (3 min read)
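
For readers who want to reproduce the metric in the TL;DR, here is a minimal sketch of Cohen's kappa between human and judge labels, plus a per-stratum breakdown. The preview does not show how the author stratified, so `kappa_by_stratum` and the `(stratum, human, judge)` record layout are hypothetical and purely illustrative, not the post's actual method.

```python
import math
from collections import Counter, defaultdict


def cohens_kappa(human, judge):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    # Fraction of items where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement from each rater's marginal label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(h_counts) | set(j_counts)
    expected = sum(h_counts[c] * j_counts[c] for c in labels) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label; kappa is undefined here.
        return math.nan
    return (observed - expected) / (1 - expected)


def kappa_by_stratum(records):
    """Hypothetical helper: records are (stratum, human_label, judge_label)."""
    groups = defaultdict(lambda: ([], []))
    for stratum, human, judge in records:
        groups[stratum][0].append(human)
        groups[stratum][1].append(judge)
    return {s: cohens_kappa(h, j) for s, (h, j) in groups.items()}
```

Calling `kappa_by_stratum` on a week's traces reports agreement per stratum, so a swing in the pooled number can be checked against a shift in the mix of strata rather than attributed to sample size alone.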

#ai #machine-learning #data-science #mlops

