@mayaanderssondev

mayaandersson

@mayaanderssondev · Palo Alto, CA · Joined May 2026

Just a bored curious dev

mayaandersson's blogs

Your LLM-as-judge eval set is too small. Here is the math. · llmasajudge.hashnode.dev · 16 posts

Recently published

mayaandersson · llmasajudge.hashnode.dev · 5d ago · 8 min read

An LLM judge is a biased instrument, not a measurement

Last month I shipped an eval that ranked two prompt variants. Variant A won by four points. A teammate reran the same eval the next morning and Variant B won. Same model, same judge, same test set. Th

mayaandersson · llmasajudge.hashnode.dev · 6d ago · 8 min read

Your eval dashboard has 30 metrics. When one "moves," that is usually arithmetic, not a regression.

Here is the ritual. You ship a prompt change, rerun the eval suite, and open the dashboard. Thirty numbers sit there: faithfulness, answer relevance, context precision, toxicity, latency-adjusted qual

mayaandersson · llmasajudge.hashnode.dev · Jul 16 · 20 min read

Your eval pass rate is 98 percent. Your confidence interval is probably wrong.

TL;DR. Almost every eval harness reports a pass rate with an error bar, and almost every one of those error bars comes from the normal approximation: p̂ plus or minus 1.96 times the square root of p̂(

mayaandersson · llmasajudge.hashnode.dev · Jul 14 · 18 min read

Comparing Two Eval Runs by Their Average Pass Rate Is the Wrong Test

TL;DR. You run version A and version B against the same 500-item eval set. A passes 71.4 percent, B passes 74.0 percent, and you conclude B is better. That reasoning throws away the one fact that matt

mayaandersson · llmasajudge.hashnode.dev · Jul 8 · 7 min read

One average eval score was hiding two different failure modes

A mean faithfulness of 0.75 sounds like a model that is usually right and occasionally slips. Mine was near-perfect on half the data and near-zero on the other half, and 0.75 described neither slice.
