Yeah, I mostly agree with your point — once agents start chaining tools, memory, and external APIs, it quickly becomes a systems problem, not a model problem. In my experience, the first things to break are usually state handling and hidden context drift rather than the model itself.
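To make the "hidden context drift" point concrete, here's a toy sketch (everything here is hypothetical, not a real framework): a naive trimming policy keeps only the last N messages and silently drops the system instruction, so the agent's behavior changes a few turns in without any error being raised.

```python
MAX_MESSAGES = 3  # hypothetical context budget

def trim(history):
    # Naive trimming: keep only the last N messages.
    # Bug: this silently drops the system instruction too.
    return history[-MAX_MESSAGES:]

def agent_step(history, user_msg):
    history = trim(history + [("user", user_msg)])
    # Stand-in for a model call: behavior depends on whether the
    # system instruction is still in context.
    constrained = any(role == "system" for role, _ in history)
    reply = "terse answer" if constrained else "unconstrained answer"
    return history + [("assistant", reply)], reply

replies = []
history = [("system", "answer tersely")]
for msg in ["q1", "q2", "q3"]:
    history, reply = agent_step(history, msg)
    replies.append(reply)
print(replies)  # the constraint quietly disappears after turn 1
```

The failure is invisible at the point where it happens, which is exactly why it feels like distributed systems debugging: you only notice downstream, several turns later.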
What I find interesting is that the more “powerful” the setup becomes, the harder it is to predict failure modes. At some point it feels less like AI engineering and more like distributed systems debugging.
Curious how others approach this — do you prefer simplifying agent stacks to reduce failure points, or building more layered systems and handling complexity through better observability?