AI Agents Have a Reliability Problem, Not a Capability Problem
AI agent discourse is getting more interesting because the center of gravity is shifting.
A few months ago, most of the conversation was about demos: a browser agent that clicked through a workflow, a coding agent that opened a PR, a swarm that looke...
scale-agents.hashnode.dev · 7 min read
The reliability-over-capability framing is spot on. Teams keep optimizing for what models can do in demos, not for what they reliably do in production.
The coordination overhead parallel to microservices is underappreciated. Same pattern: more moving parts = more surface area for implicit assumptions to collide. The question isn't whether agents can call each other—it's whether they share a clear contract for input schemas, visible state, success/failure definitions, retry ownership, and partial result merging.
The declarative orchestration insight is the key unlock. If behavior is scattered through application code, iteration gets expensive because every change requires tracing through execution paths. Declarative orchestration makes contracts inspectable, testable, versionable, and composable.
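To see why declarative wins, compare against orchestration scattered through application code: here the workflow is plain data interpreted by one generic runner, so it can be diffed, versioned, and tested without tracing execution paths. A toy sketch (the step names and registry are invented for illustration):

```python
from typing import Callable

# Hypothetical step registry: agent behaviors keyed by name.
STEPS: dict[str, Callable[[dict], dict]] = {
    "fetch":     lambda s: {**s, "doc": "raw text"},
    "summarize": lambda s: {**s, "summary": s["doc"][:8]},
    "review":    lambda s: {**s, "approved": len(s["summary"]) > 0},
}

# The workflow is data, not control flow: inspectable and versionable.
WORKFLOW = {
    "name": "summarize_and_review",
    "version": 2,
    "steps": ["fetch", "summarize", "review"],
}

def run(workflow: dict, state: dict) -> dict:
    """Generic interpreter: one place to add logging, retries, checkpoints."""
    for step in workflow["steps"]:
        state = STEPS[step](state)
    return state
```

Changing the behavior now means editing `WORKFLOW`, not hunting through call sites — which is exactly what makes iteration cheap.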
Session restoration is where the demo-to-production gap shows up most clearly. If your architecture can't resume a long-running workflow after interruption, you've built a chatbot, not an agent system. The operability layer—lifecycle management, resumable execution, clean continuation—is where engineering time goes, not the model call.
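The resumable-execution idea reduces to one mechanism: persist progress after each completed step, and on restart skip what's already done. A minimal sketch, assuming JSON-serializable state and a local checkpoint file (real systems would use durable storage and idempotency keys):

```python
import json
from pathlib import Path
from typing import Callable

def run_resumable(steps: list[Callable[[dict], dict]],
                  state: dict, ckpt: Path) -> dict:
    """Interruption-safe runner: resumes a workflow from the last
    completed step instead of restarting from zero."""
    done = 0
    if ckpt.exists():                         # a prior run was interrupted
        saved = json.loads(ckpt.read_text())
        done, state = saved["done"], saved["state"]
    for i, step in enumerate(steps):
        if i < done:
            continue                          # already completed before the crash
        state = step(state)
        ckpt.write_text(json.dumps({"done": i + 1, "state": state}))
    return state
```

Everything a production agent system layers on top — lifecycle management, lease ownership, clean continuation after handoff — is elaboration of this same checkpoint-and-skip loop.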