In our experience with enterprise teams, the real challenge isn't choosing the right model so much as integrating it effectively into existing workflows. Models like Claude can offer cost benefits, but their success often hinges on how well they're aligned with your team's processes. We've found that starting with a clear framework for prompt engineering can significantly enhance any model's performance, regardless of its initial capabilities. - Ali Muwwakkil (ali-muwwakkil on LinkedIn)
Aamer Mehaisi
Making AI accessible, ethical, and culturally aware
This is exactly the kind of empirical work the AI engineering community needs more of — real evaluation frameworks with statistical rigor.
The action-based pipeline approach is particularly valuable because it mirrors how agents actually fail in production: errors compound across sequential operations. A model that's "good enough" on isolated tasks can become catastrophic when chained.
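The compounding effect is easy to quantify: if step failures are roughly independent, per-step reliability multiplies across the chain. A minimal sketch (illustrative numbers, not figures from the original evaluation):

```python
# Illustration: per-step reliability compounds multiplicatively across a
# sequential agent pipeline. Assumes independent step failures, which is
# a simplification -- real agent errors often correlate.
def pipeline_success_rate(per_step: float, steps: int) -> float:
    """Probability that every step in a sequential chain succeeds."""
    return per_step ** steps

# A model that is "good enough" per task degrades quickly when chained:
for steps in (1, 5, 10, 20):
    rate = pipeline_success_rate(0.95, steps)
    print(f"{steps:>2} steps at 95% each -> {rate:.1%} end-to-end")
```

At 95% per step, a 10-step chain already succeeds only about 60% of the time, which is exactly the "good enough in isolation, catastrophic when chained" pattern.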
Your Reasoning Model Trap finding resonates strongly. I've observed similar behavior — models optimized for chain-of-thought reasoning often produce worse code outputs than simpler pattern-matching models. The architecture-task mismatch is real: reasoning isn't generation.
One insight that deserves more attention: "Task type matters more than model choice." The 10-point spread between easiest and hardest action dwarfed model differences. This suggests a practical strategy: route work by task type, pairing each action with the cheapest model that handles it reliably, rather than standardizing on a single "best" model.
This isn't just cost optimization — it's reliability optimization. The variance problem you identified with DeepSeek V3.2 (95-point swing) is exactly why consistency premiums exist in production systems.
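That consistency premium can be made concrete: two models with the same average score carry very different production risk once you ask how often a run falls below an acceptance bar. A toy simulation (the means, spreads, and uniform distribution here are my own illustrative assumptions, not data from the evaluation):

```python
import random

random.seed(0)  # reproducible illustration

def below_threshold_rate(mean: float, spread: float,
                         threshold: float, trials: int = 100_000) -> float:
    """Fraction of runs scoring under an acceptance threshold.

    Scores are drawn uniformly in [mean - spread/2, mean + spread/2];
    purely illustrative -- real score distributions differ.
    """
    lo, hi = mean - spread / 2, mean + spread / 2
    runs = (random.uniform(lo, hi) for _ in range(trials))
    return sum(r < threshold for r in runs) / trials

# Same 70-point average, very different risk of an unacceptable run:
consistent = below_threshold_rate(mean=70, spread=10, threshold=50)
volatile = below_threshold_rate(mean=70, spread=95, threshold=50)
print(f"tight model fails QA {consistent:.1%} of runs")
print(f"high-variance model fails QA {volatile:.1%} of runs")
```

The tight model never dips below the bar, while the model with a 95-point swing fails roughly a quarter of the time despite the identical mean, which is why production systems pay for consistency.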
Question: Did you observe any correlation between token usage and quality? Curious if Sonnet's edge comes from context utilization or actual reasoning superiority.