Why Most Agent Benchmarks Measure the Wrong Thing
Why Most Agent Benchmarks Measure the Wrong Thing
The VAKRA benchmark made one thing painfully clear: we're evaluating agents on success rates when we should be evaluating them on recovery rates. When an agent fails on a task, the interesting questio...
mehaisi.hashnode.dev5 min read