Tag feed

#statistics

814 posts · 215 followers

mayaandersson · llmasajudge.hashnode.dev · 4d ago · 20 min read

Your eval pass rate is 98 percent. Your confidence interval is probably wrong.

TL;DR. Almost every eval harness reports a pass rate with an error bar, and almost every one of those error bars comes from the normal approximation: p̂ plus or minus 1.96 times the square root of p̂(

mayaandersson · llmasajudge.hashnode.dev · 6d ago · 18 min read

Comparing Two Eval Runs by Their Average Pass Rate Is the Wrong Test

TL;DR. You run version A and version B against the same 500-item eval set. A passes 71.4 percent, B passes 74.0 percent, and you conclude B is better. That reasoning throws away the one fact that matt

Aman Behera · beingamanforever.hashnode.dev · Jul 8 · 6 min read

GSoC 2026 / Week 3: summary(), disp(), and the plot

Hi again!! Two-week silence on my end. I was on bed rest after a medical procedure and away from the keyboard for a while, so I am catching up now with Week 3 and Week 4 landing back to back ahead of

mayaandersson · llmasajudge.hashnode.dev · Jul 8 · 7 min read

One average eval score was hiding two different failure modes

A mean faithfulness of 0.75 sounds like a model that is usually right and occasionally slips. Mine was near-perfect on half the data and near-zero on the other half, and 0.75 described neither slice.

mayaandersson · llmasajudge.hashnode.dev · Jul 7 · 8 min read

Your LLM-as-judge has a position bias you are not measuring

If your pairwise judge sees answer A before answer B, it tends to prefer A. If you never swap the order, every win-rate you report is contaminated by which slot you happened to put each answer in. The

Berkan Sesen · sesenai.hashnode.dev · Jun 29 · 13 min read

AIC and BIC: Choosing the Right Model Without Overfitting

Imagine you're fitting a curve to noisy data. A straight line misses the shape entirely, so you try a quadratic, then a cubic, then keep going. By degree 10 the curve passes through nearly every point

Anton Saroka · anton-saroka.hashnode.dev · Jun 20 · 9 min read

CADE — An Interesting Approach to Finding Anomalies in Multidimensional Data

This is a translation of my original article on habr.com. Introduction: One way to search for anomalies in a dataset is to use the probability density function corresponding to the data as a measure of

Anton Saroka · anton-saroka.hashnode.dev · Jun 20 · 7 min read

What Is the Distribution of Sample Quantiles?

This is a translation of my original article on habr.com. Sample means, sample variances, sample quantiles, and other statistics are random variables by nature. Knowing their distributions helps us bu

mayaandersson · llmasajudge.hashnode.dev · Jun 16 · 3 min read

Stratified sampling for LLM eval sets: why your aggregate pass rate hides the regressions that matter

TL;DR: A headline eval pass rate is an average over every kind of input your system sees, and averages hide the thing you most need to catch: a sharp regression in a small but important slice. If refu

Adarshaasteriskz.hashnode.dev · Jun 11 · 11 min read

Building an Autonomous Monte Carlo Engine to Predict the 2026 World Cup

Today is the start of the 2026 FIFA World Cup, the largest sporting competition every four years. As a fun project, I decided to build a model to predict the tournament. With sports, you never really
