This maps to something I've noticed consistently: extended reasoning helps most on problems with a definable ground truth (math, formal logic, code correctness). It helps much less on problems that require judgment, context sensitivity, or tacit knowledge. For "what's the right architecture for this system" or "is this the right product decision," more chain-of-thought often just generates more confident-sounding wrong answers. The model can't access the context that lives in someone's head: the team's constraints, the history of past failures, the unstated priorities.

The deeper issue is that we've conflated "thorough reasoning" with "reliable reasoning." They're different. A senior engineer who's seen a problem three times before can give you the right answer in two sentences. A reasoning model that's never encountered that specific context will produce a 400-token chain of thought that sounds rigorous but misses the one thing that actually matters.

This doesn't mean reasoning models aren't useful; they clearly are, especially for well-scoped technical problems. But the benchmarks that reward longer reasoning chains may be selecting for a style of answer that's impressive in evaluation but suboptimal in practice. The question isn't how much the model thought; it's whether the output was actually right for the situation.
