#benchmarks articles

gpt ai clips · gptaiclips.hashnode.dev · Jun 4 · 3 min read

Claude Mythos Reportedly Beats Opus 4.6 by 13 Points on SWE-bench — and GPT-5.6 Looks Imminent

Two leaks dropped in roughly the same 48-hour window and they form a coherent picture: Anthropic shipping Mythos with unusually large headroom over its own Opus 4.6, and OpenAI staging a near-term GPT

Jangwook Kim · effloow.hashnode.dev · May 8 · 10 min read

Agent Test-Time Scaling Has a Ceiling: CMU Research 2026

There is a popular assumption baked into many agentic AI systems: if an agent doesn't succeed on the first attempt, just let it try again. Give it more turns. Sample more trajectories. Add a reflection step. More compute at inference time should mean...

Varun Pratap Bhardwaj · qualixar.hashnode.dev · Apr 26 · 7 min read

GPT-5.5 vs Claude vs Gemini: The Avengers Problem Nobody Talks About

Every week someone asks me: "Which AI model should I use?" My answer has been the same since January: yes. Not all of them. Not randomly. But if you're using a single model for everything in April 2026, you're bringing a hammer to a world that needs ...

김이더 · radar92.hashnode.dev · Apr 24 · 5 min read

GPT-5.5 Is Out — What the Numbers Actually Say

More posts at radarlog.kr. Yesterday (April 23, 2026) OpenAI released GPT-5.5. Codename "Spud." The surprising part isn't the model itself. GPT-5.4 shipped six weeks ago. OpenAI's Chief Scientist Jakub Pachocki said during the briefing that the last...

Pedro Eugenio · theweeklyprompt.news · Apr 17 · 3 min read

The One-in-Three Problem

The demos look great. The videos are impressive. The agent navigates to a site, fills the form, clicks the right button, task complete. That is a real thing. It happens. Then a new benchmark drops, measures 153 everyday tasks across 144 live websites...

Aamer Mehaisi · mehaisi.hashnode.dev · Apr 14 · 4 min read

The ARC-AGI Benchmark: When Narrow AI Meets General Intelligence

OpenAI's o3 model hitting 67% on the ARC-AGI benchmark is being framed as progress toward general intelligence. The framing is wrong. What we're seeing isn't intelligence becoming more ...

Rahul Sehrawat · ai-zero-to-hero.hashnode.dev · Apr 12 · 11 min read

Evaluation — How Do You Know an LLM Is Any Good?

Here is a question that sounds simple and is not. You've built a thing that uses an LLM. It could be a chatbot, a summarizer, an email drafter, a code assistant, anything. Before you ship it, you want to know if it's good. A more specific version: yo...

Nicolò Boschi · nicoloboschi.hashnode.dev · Apr 2 · 5 min read

Why 10 million tokens is the only memory benchmark that matters

Originally published at https://nicoloboschi.com/posts/20260402 TL;DR: Memory benchmarks died when context windows hit 1M tokens — just dump everything in the prompt. BEAM tests at 10M where that trick fails. Hindsight scores 64.1% there, 58% ahead ...

Alan West · alan-west.hashnode.dev · Mar 29 · 6 min read

Qwen 3.5 Small: Four Models, Zero API Cost. A Quick Benchmark.

Alibaba just dropped four models and said "here, they're free." The Qwen 3.5 Small family — 0.8B, 2B, 4B, and 9B parameter models — is fully open source under Apache 2.0. No gated access, no usage restrictions, no phone-home telemetry. Download the w...

Nicolò Boschi · nicoloboschi.hashnode.dev · Mar 26 · 6 min read

That's not how you do business

Originally published at https://nicoloboschi.com/posts/20260324 TL;DR: Supermemory published a "~99% SOTA" memory benchmark result that was actually a stunt, and the backlash was immediate and deserved. Gaming benchmarks erodes trust in an industry ...
