The average score across the 500 prompts was 13.3 out of 80. In other words, the prompts people are actually shipping land at roughly 17% of the rubric's ceiling.
The breakdown was worse than I expected:
Examples: 1.01 / 10 (the weakest dimension, by a lot)
Constraints: 1.09 / 10
Role definition: 1.18 / 10
Chain of thought: 1.19 / 10
Context: 1.51 / 10
Output format: 1.90 / 10
Specificity: 2.21 / 10
Clarity: 3.19 / 10 (the strongest, and still failing)
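For anyone who wants to reproduce this, here's roughly the shape of the scoring harness as a minimal sketch. The dimension names come from the breakdown above; everything else is an assumption: the cue lists and the toy grader are invented so the script runs end to end, and in practice you'd swap score_dimension for an LLM-as-judge call or a human rater. The 0-10-per-dimension scale is just my rubric, not anything standard.

```python
# Minimal sketch of an 8-dimension prompt rubric, each dimension 0-10 (max 80).
# The toy grader checks surface cues only so the script runs end to end;
# replace it with an LLM-as-judge call or a human rater for real use.
from statistics import mean

# Cue substrings per dimension -- invented for illustration only.
CUES = {
    "examples":         ["example:", "e.g.", "for instance"],
    "constraints":      ["must", "do not", "at most", "limit"],
    "role":             ["you are", "act as"],
    "chain_of_thought": ["step by step", "think through", "reasoning"],
    "context":          ["background", "context:", "given that"],
    "output_format":    ["format", "json", "bullet", "table"],
    "specificity":      ["exactly", "specifically"],
    "clarity":          ["?", "."],  # crude proxy: complete sentences at all
}

def score_dimension(prompt: str, dimension: str) -> float:
    """Toy grader: 10 if any cue appears, else 0. Swap in a real judge."""
    text = prompt.lower()
    return 10.0 if any(cue in text for cue in CUES[dimension]) else 0.0

def score_prompt(prompt: str) -> dict:
    scores = {d: score_dimension(prompt, d) for d in CUES}
    return {"scores": scores, "total": sum(scores.values())}  # out of 80

def corpus_report(prompts: list[str]) -> dict:
    """Average total and per-dimension averages across a prompt corpus."""
    results = [score_prompt(p) for p in prompts]
    return {
        "avg_total": mean(r["total"] for r in results),
        "avg_by_dimension": {d: mean(r["scores"][d] for r in results)
                             for d in CUES},
    }

if __name__ == "__main__":
    print(corpus_report(["Summarize this.",
                         "You are an editor. Format the output as JSON."]))
```

The per-dimension averages in corpus_report are what produced the breakdown above.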
Rewriting the same 500 prompts and scoring again: average 68.5 / 80.
A 415% improvement (13.3 to 68.5 is just over 5x). Same model, same task, just a better input.
Some things that surprised me:
The weakest dimension wasn't clarity or specificity. It was examples. Nobody shows the model what "good" looks like before asking for it (a concrete before/after follows this list).
Clarity was the highest-scoring dimension and still only averaged 3.19. The floor on "I can understand what you're asking" is lower than people think.
The gap between average and rewritten isn't marginal. It's over 5x. The model wasn't the problem. The input was.
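To make the examples point concrete: the ticket task, output format, and sample text below are all invented for illustration. The pattern is what matters; the second prompt carries one worked example of the output it wants, which is exactly what the lowest-scoring prompts were missing.

```python
# Before/after on the "examples" dimension. Task and sample data are invented.
BEFORE = "Summarize this customer ticket."

AFTER = """Summarize the customer ticket below in one line, formatted as:
<product area> - <problem> - <requested action>

Example
Ticket: I upgraded to Pro yesterday but my invoice still shows Basic pricing.
Summary: Billing - invoice shows old plan after upgrade - correct and resend.

Ticket: {ticket_text}
Summary:"""
```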
The adjacent thread this week about agent failures being context/state problems rather than model problems is right, but the problem begins one layer upstream: most of what gets blamed on context or state starts with a prompt that was already broken before anyone chained anything.
Curious whether others are measuring input quality systematically or mostly eyeballing it.
What's your team's process?