Strong framing Suny Choudhary!
Every failure mode listed traces back to the same root: nobody's measuring input quality before it hits the model or the next step in the chain.
The fix everyone's describing (structured handoffs, context discipline, validation at boundaries) is right, but it's still being done by vibes. "Is this context good enough?" "Is this handoff clean?" Answered by feel.
We recently scored 500 production prompts on 8 dimensions (grounded in PEEM, RAGAS, MT-Bench, G-Eval, ROUGE). Zero passed. The average was 13.3/80, roughly 17% of the maximum score, on the inputs people actually ship. Weakest dimension across the board: Examples (1.01/10).
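To make the arithmetic concrete, here's how an 8-dimension, 0-10 rubric rolls up into a /80 total. The dimension names and scores below are made up for illustration; they are not the actual PQS rubric or our data.

```python
# Illustrative roll-up of an 8-dimension, 0-10 rubric into a /80 total.
# Dimension names and values are hypothetical, not the real PQS rubric.
scores = {
    "clarity": 3, "context": 2, "constraints": 1, "examples": 1,
    "format": 2, "role": 2, "grounding": 1, "success_criteria": 1,
}

total = sum(scores.values())   # max possible: 8 dimensions * 10 = 80
pct = 100 * total / 80         # fraction of the model-facing quality ceiling

print(f"{total}/80 -> {pct:.2f}% of max")
```

A total in the low teens out of 80 is what "running at a fraction of capability" means in practice: most of the rubric is simply left on the table.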
The cascading-failure point that Archit raised is the hardest case. Single-step prompts fail loudly; chained pipelines fail silently, and each hop multiplies the cost before anyone notices. Measuring the input at every boundary is the only way I've found to catch drift before it compounds.
This stuff is measurable. We just built the measurement layer: PQS (Prompt Quality Score). Happy to share the data if anyone wants to dig in.
🔗 pqs.onchainintel.net