May 9 · 10 min read · Claude Opus 4.6 vs GPT-5.5 vs Gemini 3.1 Pro: Reasoning Benchmarks (3 Real Tasks Tested) TL;DR — On three reasoning tasks (legal contradiction analysis, multi-step proof, nested-spec planning), Claude Opus 4.6 produced the most rigorous step-by-step ...
Apr 27 · 4 min read · Two papers dropped this week that fit together like diagnosis and experiment. One counts what's broken. The other tries to fix it in a way nobody expected. Start with the numbers. A new study analyzed token consumption across eight frontier models on...
Apr 26 · 8 min read · DeepSeek-R1 Reasoning API: Production Guide with Chain-of-Thought (2026) TL;DR: DeepSeek-R1 exposes its full chain-of-thought via API at $0.28/M tokens — roughly 9× cheaper than GPT-5.4 and 18× cheaper than Claude Opus 4.7. This guide shows you how t...
Apr 18 · 27 min read · TLDR: Chain of Thought (CoT) prompting tells a language model to reason out loud before answering. By generating intermediate steps, the model steers itself toward correct conclusions, turning guessw
Apr 17 · 4 min read · Two things happened in AI research this week, and they point in opposite directions. Inference got meaningfully faster. And several papers made it clearer than ever exactly where reasoning models break, no matter how fast you run them. Start with the...
Apr 10 · 5 min read · There is a version of the superintelligence story where a researcher has a conceptual breakthrough, some fundamental insight about cognition that nobody else has seen, and the world changes overnight. Good fiction. I've written some of it myself. I t...
Apr 10 · 6 min read · Does tree search help LLM reasoning? The literature can't decide. ReST-MCTS* says yes. AB-MCTS got a NeurIPS spotlight. "Limits of PRM-Guided Tree Search" says no: MCTS with a process reward model used 11x more tokens than best-of-N for zero accuracy...
Apr 8 · 3 min read · Z.ai just released GLM-5.1, a 754-billion-parameter model from the Chinese AI lab, and something unusual happened when Simon Willison ran the standard pelican test. The model didn't just generate an SVG of a pelican on a bicycle. It spontaneously crea...