I keep seeing people blame the model when something breaks.
In most cases, that’s not where the problem is.
From what I’ve seen, things usually fail somewhere else:
agents pulling in too much context, or the wrong context
unclear boundaries around what they can access
workflows growing without anyone really understanding how data is flowing
systems working fine in isolation but breaking when chained together
The model is just one part of it.
The moment you connect it to:
tools
APIs
files
memory
other agents
it becomes a system problem, not a model problem.
That’s also where things get harder to debug.
Curious how others are seeing this.
When your agent setups break, what usually fails first:
context
tool use
state handling
or something else?
Agree. This is very close to what I’ve seen while building Origin. Once you connect AI to tools, files, and workspace state, it becomes much more of a system design problem than a model problem.
Usually the first failures I notice are bad context and broken state handling, not the model itself. That’s also why traceability matters so much: if you can’t see what changed and what caused it, debugging turns into guesswork.
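A minimal sketch of what that traceability can look like in practice, assuming nothing about Origin or any specific framework (names like traced and StepTrace are invented for illustration): every step records exactly what it received and what it produced under one run ID, so "what changed and what caused it" becomes a query instead of a guess.

```typescript
// Illustrative only: record each step's exact input and output under one trace ID,
// so a failure can be traced back to the step that caused it.
import { randomUUID } from "node:crypto";

interface StepTrace {
  traceId: string;   // groups all steps of one run
  step: string;      // which step ran
  input: unknown;    // exactly what the step received
  output: unknown;   // exactly what the step returned
  startedAt: string;
  endedAt: string;
}

const traces: StepTrace[] = [];

async function traced<I, O>(
  traceId: string,
  step: string,
  input: I,
  fn: (input: I) => Promise<O>
): Promise<O> {
  const startedAt = new Date().toISOString();
  const output = await fn(input);
  traces.push({ traceId, step, input, output, startedAt, endedAt: new Date().toISOString() });
  return output;
}

// Usage (hypothetical steps): wrap every hop so the full chain is reconstructable later.
// const runId = randomUUID();
// const draft = await traced(runId, "summarize", doc, summarizeWithModel);
```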
100% agree — this matches what I see building automation systems for clients daily. The model is usually the most reliable part of the stack. What breaks first? State handling and context management, every time. Specifically: agents losing track of what they've already done in multi-step workflows, and context windows getting polluted with irrelevant tool outputs. The fix that's worked best for me is treating each agent step as a discrete function with explicit inputs/outputs rather than letting agents freestyle through a chain. Structured handoffs between steps, with validation at each boundary, catch most failures before they cascade. The "librarian problem" framing above is spot on.
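A rough sketch of that discrete-step idea, assuming a simple research-then-write chain (ResearchResult, validateResearch, and the step functions are hypothetical names, not from any particular framework): each handoff is validated at the boundary before the next step runs.

```typescript
// Illustrative two-step chain: the research step's output is untrusted until it
// passes validation at the boundary; only then does the write step run.

interface ResearchResult {
  query: string;
  sources: string[];
  summary: string;
}

// Boundary validation: fail loudly at the handoff instead of letting a malformed
// payload cascade into later steps.
function validateResearch(value: unknown): ResearchResult {
  const v = value as Partial<ResearchResult>;
  if (typeof v?.query !== "string" || !Array.isArray(v?.sources) || typeof v?.summary !== "string") {
    throw new Error("Handoff rejected: research step returned a malformed result");
  }
  return v as ResearchResult;
}

async function researchStep(topic: string): Promise<unknown> {
  // ...model and tool calls would go here; the result is treated as untyped output
  return { query: topic, sources: [], summary: "" };
}

async function writeStep(research: ResearchResult): Promise<string> {
  // The next step only ever sees a validated, explicitly typed input.
  return `Draft based on ${research.sources.length} sources`;
}

// Structured handoff: validate at the boundary, then pass the typed result on.
// const research = validateResearch(await researchStep("agent reliability"));
// const draft = await writeStep(research);
```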
+1 to this :) Feels like we’ve moved from “prompt engineering” to “system engineering”. Most issues, in my opinion, come from context drift or state mismatches, not the model. I’ve been playing with setups where the agent is more tightly connected to the workspace (instead of just chat), and the difference is pretty noticeable. Even small things like file awareness and history make a big impact. A lot of products are working on this problem, so there’s plenty of choice out there :D
Completely agree, most failures I’ve seen come from poor context management and unclear data flow, not the model itself. State handling also becomes a major issue when workflows scale, especially with multiple tools and agents interacting. In my experience, debugging improves a lot once you treat it as a system design problem rather than just an AI model issue.
The frontend angle doesn't get enough attention here. Most agent breakage I've seen isn't the model; it's an undefined response contract. Agent returns something unexpected, component doesn't know what to render, silent failure two screens later.
Teams are building agent UI like a form submission flow. Predict the response, render it. But agent outputs are probabilistic. Your component boundaries matter more than which model you picked.
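One way to pin that contract down is a discriminated union the component must handle exhaustively; the sketch below assumes a made-up set of response kinds, and the real set would depend on your agent.

```typescript
// Illustrative response contract: the set of kinds is made up, but the pattern is
// that every shape the agent can return, including failure, is an explicit case.
type AgentResponse =
  | { kind: "answer"; text: string }
  | { kind: "tool_result"; tool: string; data: unknown }
  | { kind: "clarification"; question: string }
  | { kind: "error"; message: string };

function render(response: AgentResponse): string {
  switch (response.kind) {
    case "answer":
      return response.text;
    case "tool_result":
      return `Result from ${response.tool}`;
    case "clarification":
      return response.question;
    case "error":
      return `Something went wrong: ${response.message}`;
    default: {
      // Exhaustiveness check: adding a new response kind fails at compile time
      // instead of failing silently two screens later.
      const unreachable: never = response;
      return unreachable;
    }
  }
}

// Anything coming off the wire gets parsed into this union first; if it doesn't fit,
// it becomes { kind: "error" } explicitly rather than an undefined render.
```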
You're right—many AI agent problems stem from improper data, lack of domain knowledge, or inadequate integration rather than the model itself. Issues like poor training data, insufficient fine-tuning, or misaligned objectives often lead to suboptimal results. Addressing these foundational elements usually resolves most challenges with AI agents.
Spot on, Suny. We’ve spent so long obsessing over model parameters that we’ve neglected the deterministic plumbing required to make them safe. When agent setups break, what fails first for me is almost always Context Integrity. We treat context like a bucket we throw data into, rather than a structured ledger. As a Technical Architect, I’m seeing that the "System Problem" is actually a Librarian Problem:
Context Bloat: Agents fail because they lack a "Gatekeeper" to validate incoming triples against a fixed schema.
Boundary Erosion: We grant APIs access based on "vibes" rather than machine-readable authority (like SHACL validation).
Strong framing, Suny Choudhary!
Every failure mode listed traces back to the same root: nobody's measuring input quality before it hits the model or the next step in the chain.
The fix everyone's describing (structured handoffs, context discipline, validation at boundaries) is right, but it's still being done by vibes. "Is this context good enough?" "Is this handoff clean?" Answered by feel.
We recently scored 500 production prompts on 8 dimensions (grounded in PEEM, RAGAS, MT-Bench, G-Eval, ROUGE). Zero passed. Average 13.3/80, which means models are running at roughly 17% of capability on the inputs people actually ship. Weakest dimension across the board: Examples (1.01/10).
The cascading failure point that Archit raised is the hardest case. Single-step prompts fail loud.
Chained pipelines fail silent, and each hop multiplies the cost before anyone notices. Measuring the input at every boundary is the only way I've found to catch drift before it compounds.
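A hedged sketch of what a per-hop gate can look like (this is not PQS; the checks and threshold below are invented for illustration): each boundary runs a cheap, explicit quality check on its input and fails loudly instead of passing drift along.

```typescript
// Illustrative quality gate: the checks and threshold are invented; the point is that
// each hop measures its input explicitly and fails loudly rather than silently.
interface QualityCheck {
  name: string;
  passed: boolean;
}

function scoreContext(context: string): QualityCheck[] {
  return [
    { name: "non-empty", passed: context.trim().length > 0 },
    { name: "within context budget", passed: context.length < 20_000 },
    { name: "contains task framing", passed: /task|goal|instructions/i.test(context) },
  ];
}

function gate(step: string, context: string, minPassing = 3): string {
  const checks = scoreContext(context);
  const passing = checks.filter((c) => c.passed).length;
  if (passing < minPassing) {
    const failed = checks.filter((c) => !c.passed).map((c) => c.name).join(", ");
    throw new Error(`Step "${step}" rejected its input (failed checks: ${failed})`);
  }
  return context;
}

// Usage (hypothetical steps): run the gate at every boundary in the chain.
// const summary = await summarize(gate("summarize", rawContext));
// const draft = await draftStep(gate("draft", summary));
```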
This stuff is measurable. We just built the measurement layer: PQS — Prompt Quality Score. Happy to share the data if anyone wants to dig in.
🔗 pqs.onchainintel.net