What it is: A short writeup on benchmarking 6 local Ollama models for a delegation pool, and what changed when I ran the same prompt 3 times instead of once. Single-shot ranking put qwen3.5:9b first; variance testing showed it was bimodal (byte-identical buggy output on 2 of 3 runs at temp 0.2), while gemma4:latest was the only model whose output was both byte-stable and correct across all runs.
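If you want to reproduce the variance test, here's a minimal sketch of the idea, assuming a local Ollama server on its default port (11434). The model name, prompt, and run count are placeholders, not the exact ones from the benchmark:

```python
# Variance-test sketch: run the same prompt N times and count distinct outputs.
# Assumes a local Ollama server at localhost:11434; MODEL/PROMPT are placeholders.
import hashlib
import json
import urllib.request

MODEL = "qwen3.5:9b"   # model under test (placeholder)
PROMPT = "..."         # your benchmark prompt here
RUNS = 3               # same prompt, repeated

def generate(model: str, prompt: str) -> str:
    """Call Ollama's /api/generate endpoint and return the full response text."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.2},
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# Hash each run's output; byte-identical runs collapse to a single digest,
# so a stable model reports 1 distinct output and a bimodal one reports 2+.
digests = [hashlib.sha256(generate(MODEL, PROMPT).encode()).hexdigest()
           for _ in range(RUNS)]
print(f"{MODEL}: {len(set(digests))} distinct output(s) across {RUNS} runs")
```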
Why: I was about to wire routing rules into my main agent based on a single-run benchmark. The post is a reminder to myself (and anyone else picking models for delegation) that single-shot LLM benchmarks lie in both directions: they flatter unstable winners and punish stable losers. Plus a gotcha that cost me an evening: Qwen3 thinking models return empty responses on constrained prompts unless you set "think": false.
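For reference, here's a sketch of the "think": false workaround, assuming an Ollama version recent enough to expose the top-level think parameter on /api/generate (the model name and prompt are again placeholders):

```python
# Sketch of the "think": false gotcha fix for thinking models.
# Assumes the Ollama version supports the top-level "think" parameter.
import json
import urllib.request

body = json.dumps({
    "model": "qwen3.5:9b",                        # placeholder thinking model
    "prompt": "Respond with exactly one word: OK",
    "stream": False,
    "think": False,  # without this, thinking models can return an empty "response"
}).encode()
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])
```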
Link: alwaysbuilding.hashnode.dev/the-variance-test-tha…
Would love to hear your thoughts!