The average score across the 500 prompts was 13.3 out of 80. In other words, the prompts people are actually shipping land at roughly 17% of the rubric's ceiling.
The breakdown was worse than I expected:
Examples: 1.01 / 10 (the weakest dimension, by a lot)
Constraints: 1.09 / 10
Role definition: 1.18 / 10
Chain of thought: 1.19 / 10
Context: 1.51 / 10
Output format: 1.90 / 10
Specificity: 2.21 / 10
Clarity: 3.19 / 10 (the strongest, and still failing)
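For anyone who wants to reproduce this, here's roughly the shape of the scoring harness as a minimal sketch. The dimension names come from the breakdown above; everything else is an assumption: the cue lists and the toy grader are invented so the script runs end to end, and in practice you'd swap score_dimension for an LLM-as-judge call or a human rater. The 0-10-per-dimension scale is just my rubric, not anything standard.

```python
# Minimal sketch of an 8-dimension prompt rubric, each dimension 0-10 (max 80).
# The toy grader checks surface cues only so the script runs end to end;
# replace it with an LLM-as-judge call or a human rater for real use.
from statistics import mean

# Cue substrings per dimension -- invented for illustration only.
CUES = {
    "examples":         ["example:", "e.g.", "for instance"],
    "constraints":      ["must", "do not", "at most", "limit"],
    "role":             ["you are", "act as"],
    "chain_of_thought": ["step by step", "think through", "reasoning"],
    "context":          ["background", "context:", "given that"],
    "output_format":    ["format", "json", "bullet", "table"],
    "specificity":      ["exactly", "specifically"],
    "clarity":          ["?", "."],  # crude proxy: complete sentences at all
}

def score_dimension(prompt: str, dimension: str) -> float:
    """Toy grader: 10 if any cue appears, else 0. Swap in a real judge."""
    text = prompt.lower()
    return 10.0 if any(cue in text for cue in CUES[dimension]) else 0.0

def score_prompt(prompt: str) -> dict:
    scores = {d: score_dimension(prompt, d) for d in CUES}
    return {"scores": scores, "total": sum(scores.values())}  # out of 80

def corpus_report(prompts: list[str]) -> dict:
    """Average total and per-dimension averages across a prompt corpus."""
    results = [score_prompt(p) for p in prompts]
    return {
        "avg_total": mean(r["total"] for r in results),
        "avg_by_dimension": {d: mean(r["scores"][d] for r in results)
                             for d in CUES},
    }

if __name__ == "__main__":
    print(corpus_report(["Summarize this.",
                         "You are an editor. Format the output as JSON."]))
```

The per-dimension averages in corpus_report are what produced the breakdown above.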
Rewriting the same 500 prompts and scoring again: average 68.5 / 80.
A 415% improvement (13.3 to 68.5 is just over 5x). Same model, same task, just a better input.
Some things that surprised me:
The weakest dimension wasn't clarity or specificity. It was examples. Nobody shows the model what "good" looks like before asking for it (a concrete before/after follows this list).
Clarity was the highest-scoring dimension and still only averaged 3.19. The floor on "I can understand what you're asking" is lower than people think.
The gap between average and rewritten isn't marginal. It's over 5x. The model wasn't the problem. The input was.
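To make the examples point concrete: the ticket task, output format, and sample text below are all invented for illustration. The pattern is what matters; the second prompt carries one worked example of the output it wants, which is exactly what the lowest-scoring prompts were missing.

```python
# Before/after on the "examples" dimension. Task and sample data are invented.
BEFORE = "Summarize this customer ticket."

AFTER = """Summarize the customer ticket below in one line, formatted as:
<product area> - <problem> - <requested action>

Example
Ticket: I upgraded to Pro yesterday but my invoice still shows Basic pricing.
Summary: Billing - invoice shows old plan after upgrade - correct and resend.

Ticket: {ticket_text}
Summary:"""
```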
The adjacent thread this week about agent failures being context/state problems rather than model problems is right, but the problem begins one layer upstream: most of what gets blamed on context or state starts with a prompt that was already broken before anyone chained anything.
Curious whether others are measuring input quality systematically or mostly eyeballing it.
What's your team's process?