What it is: A short writeup on benchmarking 6 local Ollama models for a delegation pool, and what changed when I ran the same prompt 3 times instead of once. Single-shot ranking put qwen3.5:9b first; variance testing showed it was bimodal (byte-identical buggy output on 2 of 3 runs at temp 0.2), while gemma4:latest was the only model whose output was both byte-stable and correct across all runs.
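If you want to reproduce the variance test, here's a minimal sketch of the idea, assuming a local Ollama server on its default port (11434). The model name, prompt, and run count are placeholders, not the exact ones from the benchmark:

```python
# Variance-test sketch: run the same prompt N times and count distinct outputs.
# Assumes a local Ollama server at localhost:11434; MODEL/PROMPT are placeholders.
import hashlib
import json
import urllib.request

MODEL = "qwen3.5:9b"   # model under test (placeholder)
PROMPT = "..."         # your benchmark prompt here
RUNS = 3               # same prompt, repeated

def generate(model: str, prompt: str) -> str:
    """Call Ollama's /api/generate endpoint and return the full response text."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.2},
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# Hash each run's output; byte-identical runs collapse to a single digest,
# so a stable model reports 1 distinct output and a bimodal one reports 2+.
digests = [hashlib.sha256(generate(MODEL, PROMPT).encode()).hexdigest()
           for _ in range(RUNS)]
print(f"{MODEL}: {len(set(digests))} distinct output(s) across {RUNS} runs")
```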
Why: I was about to wire routing rules into my main agent based on a single-run benchmark. The post is a reminder to myself (and anyone else picking models for delegation) that single-shot LLM benchmarks lie in both directions: they flatter unstable winners and punish stable losers. Plus a gotcha that cost me an evening: Qwen3 thinking models return empty responses on constrained prompts unless you set "think": false.
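For reference, here's a sketch of the "think": false workaround, assuming an Ollama version recent enough to expose the top-level think parameter on /api/generate (the model name and prompt are again placeholders):

```python
# Sketch of the "think": false gotcha fix for thinking models.
# Assumes the Ollama version supports the top-level "think" parameter.
import json
import urllib.request

body = json.dumps({
    "model": "qwen3.5:9b",                        # placeholder thinking model
    "prompt": "Respond with exactly one word: OK",
    "stream": False,
    "think": False,  # without this, thinking models can return an empty "response"
}).encode()
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])
```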
Link: alwaysbuilding.hashnode.dev/the-variance-test-tha…
Would love to hear your thoughts!