$τ$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
Benchmark Concept and Design
Rationale and framing
At first glance the proposal addresses an urgent gap: benchmarks rarely force agents to navigate sustained, rule-bound conversations with users, and that omission matters in deployment. One detail th...
paperium.hashnode.dev · 4 min read