This is exactly the kind of empirical work the AI engineering community needs more of — real evaluation frameworks with statistical rigor.
The action-based pipeline approach is particularly valuable because it mirrors how agents actually fail in production: errors compound across sequential operations. A model that's "good enough" on isolated tasks can become catastrophic when chained.
Your Reasoning Model Trap finding resonates strongly. I've observed similar behavior — models optimized for chain-of-thought reasoning often produce worse code outputs than simpler pattern-matching models. The architecture-task mismatch is real: reasoning isn't generation.
One insight that deserves more attention: "Task type matters more than model choice." The 10-point spread between easiest and hardest action dwarfed model differences. This suggests a practical strategy: route work by action type rather than by leaderboard rank, reserving the strongest model for the hardest actions and letting cheaper models handle the ones every model passes.
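A minimal sketch of what that task-aware routing could look like. Everything here is hypothetical: the action names, pass rates, model names, and thresholds are illustrative, not numbers from the post.

```python
# Hypothetical router: send each action type to the cheapest model
# whose offline pass rate on that action clears a bar. All names and
# numbers below are made up for illustration.

# Per-action pass rates from (imagined) offline evals of the cheap model.
ACTION_PASS_RATE = {
    "extract_fields": 0.92,   # an "easy" action type
    "refactor_module": 0.82,  # a "hard" action type
}

CHEAP_MODEL_BAR = 0.90  # minimum pass rate to trust the cheap model

def route(action: str) -> str:
    """Return which model tier should handle this action type."""
    # Unknown actions default to 0.0, so they fall through to the strong model.
    rate = ACTION_PASS_RATE.get(action, 0.0)
    return "cheap-model" if rate >= CHEAP_MODEL_BAR else "strong-model"

print(route("extract_fields"))   # easy action stays on the cheap tier
print(route("refactor_module"))  # hard action escalates
```

The point of keying on action type instead of a single global ranking is exactly the 10-point spread: the same model can be safely cheap on one action and a liability on another.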
This isn't just cost optimization — it's reliability optimization. The variance problem you identified with DeepSeek V3.2 (95-point swing) is exactly why consistency premiums exist in production systems.
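The variance point is cheap to check in any harness: run each model several times on the same tasks and look at the spread, not just the mean. The scores below are synthetic, not DeepSeek's actual numbers.

```python
# Synthetic repeated-run scores (0-100) to illustrate a consistency check.
import statistics

runs = {
    "model-a": [88, 90, 87, 91, 89],   # tight spread: a consistency premium
    "model-b": [96, 14, 70, 88, 35],   # wide swing, like the 95-point case
}

for model, scores in runs.items():
    spread = max(scores) - min(scores)
    mean = statistics.mean(scores)
    print(f"{model}: mean={mean:.1f} spread={spread}")
```

Two models with similar means can be worlds apart in spread, and in a sequential pipeline the worst run is the one that bites you.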
Question: Did you observe any correlation between token usage and quality? Curious if Sonnet's edge comes from context utilization or actual reasoning superiority.