28 Real Tasks Reveal What AI Leaderboards Miss
Feb 25 · 11 min read · Originally published on MakerPulse.
4.61 versus 4.55.
That's the gap between the top two models in our first AgentPulse benchmark run: GPT-5.2 and Gemini 3.1 Pro, separated by six hundredths of a poi