I keep seeing people blame the model when something breaks.
In most cases, that’s not where the problem is.
From what I’ve seen, things usually fail somewhere else:
agents pulling in too much context, or the wrong context
unclear boundaries around what they can access
workflows growing without anyone really understanding how data is flowing
systems working fine in isolation but breaking when chained together
The model is just one part of it.
The moment you connect it to:
tools
APIs
files
memory
other agents
it becomes a system problem, not a model problem.
That’s also where things get harder to debug.
Curious how others are seeing this.
When your agent setups break, what usually fails first:
context
tool use
state handling
or something else?
Agree. This is very close to what I’ve seen while building Origin. Once you connect AI to tools, files, and workspace state, it becomes much more of a system design problem than a model problem.
Usually the first failures I notice are bad context and broken state handling, not the model itself. That’s also why traceability matters so much: if you can’t see what changed and what caused it, debugging turns into guesswork.
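A minimal sketch of what that traceability can look like in practice, assuming nothing about Origin or any specific framework (names like traced and StepTrace are invented for illustration): every step records exactly what it received and what it produced under one run ID, so "what changed and what caused it" becomes a query instead of a guess.

```typescript
// Illustrative only: record each step's exact input and output under one trace ID,
// so a failure can be traced back to the step that caused it.
import { randomUUID } from "node:crypto";

interface StepTrace {
  traceId: string;   // groups all steps of one run
  step: string;      // which step ran
  input: unknown;    // exactly what the step received
  output: unknown;   // exactly what the step returned
  startedAt: string;
  endedAt: string;
}

const traces: StepTrace[] = [];

async function traced<I, O>(
  traceId: string,
  step: string,
  input: I,
  fn: (input: I) => Promise<O>
): Promise<O> {
  const startedAt = new Date().toISOString();
  const output = await fn(input);
  traces.push({ traceId, step, input, output, startedAt, endedAt: new Date().toISOString() });
  return output;
}

// Usage (hypothetical steps): wrap every hop so the full chain is reconstructable later.
// const runId = randomUUID();
// const draft = await traced(runId, "summarize", doc, summarizeWithModel);
```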
100% agree — this matches what I see building automation systems for clients daily. The model is usually the most reliable part of the stack. What breaks first? State handling and context management, every time. Specifically: agents losing track of what they've already done in multi-step workflows, and context windows getting polluted with irrelevant tool outputs. The fix that's worked best for me is treating each agent step as a discrete function with explicit inputs/outputs rather than letting agents freestyle through a chain. Structured handoffs between steps, with validation at each boundary, catch most failures before they cascade. The "librarian problem" framing above is spot on.
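A rough sketch of that discrete-step idea, assuming a simple research-then-write chain (ResearchResult, validateResearch, and the step functions are hypothetical names, not from any particular framework): each handoff is validated at the boundary before the next step runs.

```typescript
// Illustrative two-step chain: the research step's output is untrusted until it
// passes validation at the boundary; only then does the write step run.

interface ResearchResult {
  query: string;
  sources: string[];
  summary: string;
}

// Boundary validation: fail loudly at the handoff instead of letting a malformed
// payload cascade into later steps.
function validateResearch(value: unknown): ResearchResult {
  const v = value as Partial<ResearchResult>;
  if (typeof v?.query !== "string" || !Array.isArray(v?.sources) || typeof v?.summary !== "string") {
    throw new Error("Handoff rejected: research step returned a malformed result");
  }
  return v as ResearchResult;
}

async function researchStep(topic: string): Promise<unknown> {
  // ...model and tool calls would go here; the result is treated as untyped output
  return { query: topic, sources: [], summary: "" };
}

async function writeStep(research: ResearchResult): Promise<string> {
  // The next step only ever sees a validated, explicitly typed input.
  return `Draft based on ${research.sources.length} sources`;
}

// Structured handoff: validate at the boundary, then pass the typed result on.
// const research = validateResearch(await researchStep("agent reliability"));
// const draft = await writeStep(research);
```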
+1 to this :) Feels like we’ve moved from “prompt engineering” to “system engineering”. Most issues, in my opinion, come from context drift or state mismatches, not the model. I’ve been playing with setups where the agent is more tightly connected to the workspace (instead of just chat), and the difference is pretty noticeable. Even small things like file awareness and history make a big impact. A lot of products are working on this problem, so there’s plenty of choice out there :D
Completely agree, most failures I’ve seen come from poor context management and unclear data flow, not the model itself. State handling also becomes a major issue when workflows scale, especially with multiple tools and agents interacting. In my experience, debugging improves a lot once you treat it as a system design problem rather than just an AI model issue.
The frontend angle doesn't get enough attention here. Most agent breakage I've seen isn't the model; it's an undefined response contract. Agent returns something unexpected, component doesn't know what to render, silent failure two screens later.
Teams are building agent UI like a form submission flow. Predict the response, render it. But agent outputs are probabilistic. Your component boundaries matter more than which model you picked.
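One way to pin that contract down is a discriminated union the component must handle exhaustively; the sketch below assumes a made-up set of response kinds, and the real set would depend on your agent.

```typescript
// Illustrative response contract: the set of kinds is made up, but the pattern is
// that every shape the agent can return, including failure, is an explicit case.
type AgentResponse =
  | { kind: "answer"; text: string }
  | { kind: "tool_result"; tool: string; data: unknown }
  | { kind: "clarification"; question: string }
  | { kind: "error"; message: string };

function render(response: AgentResponse): string {
  switch (response.kind) {
    case "answer":
      return response.text;
    case "tool_result":
      return `Result from ${response.tool}`;
    case "clarification":
      return response.question;
    case "error":
      return `Something went wrong: ${response.message}`;
    default: {
      // Exhaustiveness check: adding a new response kind fails at compile time
      // instead of failing silently two screens later.
      const unreachable: never = response;
      return unreachable;
    }
  }
}

// Anything coming off the wire gets parsed into this union first; if it doesn't fit,
// it becomes { kind: "error" } explicitly rather than an undefined render.
```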
You're right—many AI agent problems stem from improper data, lack of domain knowledge, or inadequate integration rather than the model itself. Issues like poor training data, insufficient fine-tuning, or misaligned objectives often lead to suboptimal results. Addressing these foundational elements usually resolves most challenges with AI agents.
Spot on, Suny. We’ve spent so long obsessing over model parameters that we’ve neglected the deterministic plumbing required to make them safe. When agent setups break, what fails first for me is almost always Context Integrity. We treat context like a bucket we throw data into, rather than a structured ledger. As a Technical Architect, I’m seeing that the "System Problem" is actually a Librarian Problem:
Context Bloat: Agents fail because they lack a "Gatekeeper" to validate incoming triples against a fixed schema.
Boundary Erosion: We grant APIs access based on "vibes" rather than machine-readable authority (like SHACL validation).
Strong framing, Suny Choudhary!
Every failure mode listed traces back to the same root: nobody's measuring input quality before it hits the model or the next step in the chain.
The fix everyone's describing (structured handoffs, context discipline, validation at boundaries) is right, but it's still being done by vibes. "Is this context good enough?" "Is this handoff clean?" Answered by feel.
We recently scored 500 production prompts on 8 dimensions (grounded in PEEM, RAGAS, MT-Bench, G-Eval, ROUGE). Zero passed. Average 13.3/80, which means models are running at roughly 17% of capability on the inputs people actually ship. Weakest dimension across the board: Examples (1.01/10).
The cascading failure point that Archit raised is the hardest case. Single-step prompts fail loud.
Chained pipelines fail silent, and each hop multiplies the cost before anyone notices. Measuring the input at every boundary is the only way I've found to catch drift before it compounds.
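A hedged sketch of what a per-hop gate can look like (this is not PQS; the checks and threshold below are invented for illustration): each boundary runs a cheap, explicit quality check on its input and fails loudly instead of passing drift along.

```typescript
// Illustrative quality gate: the checks and threshold are invented; the point is that
// each hop measures its input explicitly and fails loudly rather than silently.
interface QualityCheck {
  name: string;
  passed: boolean;
}

function scoreContext(context: string): QualityCheck[] {
  return [
    { name: "non-empty", passed: context.trim().length > 0 },
    { name: "within context budget", passed: context.length < 20_000 },
    { name: "contains task framing", passed: /task|goal|instructions/i.test(context) },
  ];
}

function gate(step: string, context: string, minPassing = 3): string {
  const checks = scoreContext(context);
  const passing = checks.filter((c) => c.passed).length;
  if (passing < minPassing) {
    const failed = checks.filter((c) => !c.passed).map((c) => c.name).join(", ");
    throw new Error(`Step "${step}" rejected its input (failed checks: ${failed})`);
  }
  return context;
}

// Usage (hypothetical steps): run the gate at every boundary in the chain.
// const summary = await summarize(gate("summarize", rawContext));
// const draft = await draftStep(gate("draft", summary));
```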
This stuff is measurable. We just built the measurement layer: PQS — Prompt Quality Score. Happy to share the data if anyone wants to dig in.
🔗 pqs.onchainintel.net