This is the conversation that actually needs to happen. Everyone optimizes the 20% (model + prompt) and ignores the 80% (harness) — and then wonders why their agent works in demos but fails in production.
The harness is where all the real decisions live: how context is managed across turns, when to stop and ask for human input, how errors surface, what gets retried vs. escalated. These aren't model problems; they're systems design problems. And most teams aren't treating them that way.
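To make "retried vs. escalated" concrete, here is a minimal sketch of how a harness might wrap a single agent step. Everything in it is an assumption for illustration: the `TransientError` / `FatalError` split, the `run_step` helper, and the default of three attempts are hypothetical names and values, not taken from any particular framework.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Outcome(Enum):
    DONE = "done"
    ESCALATED = "escalated"


class TransientError(Exception):
    """Hypothetical: a failure worth retrying (timeout, rate limit)."""


class FatalError(Exception):
    """Hypothetical: a failure a retry cannot fix (bad input, denied access)."""


@dataclass
class StepResult:
    outcome: Outcome
    value: object = None
    attempts: int = 0
    reason: str = ""


def run_step(step: Callable[[], object], max_attempts: int = 3) -> StepResult:
    """Run one step; retry transient failures, escalate everything else."""
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            return StepResult(Outcome.DONE, step(), attempt)
        except TransientError as exc:
            # Retryable: remember why it failed and try again.
            last_error = str(exc)
        except FatalError as exc:
            # Not retryable: hand off to a human right away.
            return StepResult(Outcome.ESCALATED, None, attempt, f"fatal: {exc}")
    # Retry budget spent: escalate instead of looping forever.
    return StepResult(
        Outcome.ESCALATED, None, max_attempts, f"retries exhausted: {last_error}"
    )
```

The point of the sketch is that the retry budget and the error classification live in the harness, in plain code you can review and test, rather than being left to the model to improvise.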
What I've seen break repeatedly in enterprise deployments:
The harness is also where you encode domain judgment — what the agent is allowed to do autonomously vs. what requires a human decision. Get that boundary wrong and you either build an agent that's too timid to be useful or one that takes consequential actions no one intended to delegate.
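One way to picture that boundary is as an explicit policy table the harness consults before any action runs. This is only a sketch under stated assumptions: the action names, the `REFUND_AUTO_LIMIT` threshold, and the `decide` function are all hypothetical, and a real deployment would define its own.

```python
from enum import Enum


class Decision(Enum):
    AUTO = "auto"                # agent may act on its own
    NEEDS_HUMAN = "needs_human"  # pause and ask a person
    DENY = "deny"                # never delegated to the agent


# Hypothetical action names and policy, for illustration only.
POLICY = {
    "read_ticket": Decision.AUTO,
    "draft_reply": Decision.AUTO,
    "issue_refund": Decision.NEEDS_HUMAN,
    "delete_account": Decision.DENY,
}

# Hypothetical threshold: refunds at or below this amount run unattended.
REFUND_AUTO_LIMIT = 50.0


def decide(action: str, amount: float = 0.0) -> Decision:
    """Return what the harness should do before executing an action."""
    # Small refunds are the one case where the table entry is relaxed.
    if action == "issue_refund" and amount <= REFUND_AUTO_LIMIT:
        return Decision.AUTO
    # Anything not listed defaults to a human decision, not autonomy.
    return POLICY.get(action, Decision.NEEDS_HUMAN)
```

Defaulting unknown actions to `NEEDS_HUMAN` is the design choice that matters here: the agent only gets autonomy someone explicitly granted, which guards against the "actions no one intended to delegate" failure, while the `AUTO` entries keep it from being too timid to be useful.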
Solid framing. This deserves more attention than another "which model wins the benchmark" post.