The "two agents arguing" pattern is underrated and I've seen it consistently produce better outputs than any single-model loop. The key insight you're pointing at — that models fail in different directions — is what makes it work. It's not redundancy, it's genuine epistemic diversity.
The part that resonates most: "Accepting every suggestion is just a different way of not thinking." This is the trap most people fall into when they start with AI-assisted development. They defer to the model because pushing back feels slower, but that deference is exactly where quality degrades silently.
What I'd add from experience: the human judgment layer isn't just about catching errors — it's about catching correct but wrong outputs. The code passes the spec, tests are green, but the approach is subtly off for the actual context. That's where domain knowledge does things no model can replicate yet. The spec can't fully encode tacit knowledge.
The role shift from "developer who writes code" to "developer who owns outcomes" is real. The accountability piece is the unchanged part — if it breaks at 3am, no model takes the call.