5d ago · 5 min read · Long-horizon reasoning is where production LLM agents tend to quietly break. A model can produce a plausible-looking chain of thought, accept a wrong intermediate answer, and continue building on that error for every step that follows. By the time th...
5d ago · 5 min read · Speculative decoding became the standard inference speedup technique through 2024 and 2025. The idea: a small draft model generates a sequence of candidate tokens, and a larger target model verifies them in parallel — accepting the longest valid pref...
6d ago · 10 min read · The scaling-is-everything story has a new challenger. On May 6, 2026, Zyphra released ZAYA1-8B — an open-weight Mixture-of-Experts reasoning model with 8.4 billion total parameters and fewer than 800 million active per token. On AIME 2025, a benchmar...
May 5 · 6 min read · At Google Cloud Next 2026 (April 22), Google announced something it had never done before: two different eighth-generation TPU chips with distinct silicon designs for distinct jobs. TPU 8t handles training. TPU 8i handles inference. The split is a ha...
May 5 · 7 min read · For the past three years, the AI industry has operated under a simple assumption: more centralized compute solves everything. Bigger clusters. Bigger data centers. Bigger power contracts. The logic was intuitive and, for training workloads, largely c...
May 2 · 9 min read · On April 17, 2026, The Information reported that OpenAI will pay Cerebras more than $20 billion over the next three years for access to servers powered by Cerebras' wafer-scale chips. The deal could also grant OpenAI a minority equity stake of up to ...
Apr 30 · 6 min read · The default path for most AI teams today is single-vendor, single-cloud: pick NVIDIA, pick AWS (or GCP, or Azure), and build everything around that stack. It works until it doesn't: hyperscaler credi
Apr 27 · 4 min read · Two papers dropped this week that fit together like diagnosis and experiment. One counts what's broken. The other tries to fix it in a way nobody expected. Start with the numbers. A new study analyzed token consumption across eight frontier models on...
Apr 28 · 11 min read · Mixture-of-Experts models have dominated the open-weight frontier in 2026. Llama 4 Scout (17B-16E), Llama 4 Maverick (17B-128E), DeepSeek V4-Pro (1.6T-49B active), and Qwen3.6-Plus all use sparse expert routing to scale parameters without proportiona...