May 11 · 16 min read · Tic-tac-toe is a solved game. Any competent adult can force a draw every time. But can an agent figure that out with zero human knowledge? Give two agents a blank board, a few simple rules about wins
May 10 · 9 min read · Group Relative Policy Optimization (GRPO) became the dominant approach for training reasoning models after DeepSeek-R1 (arXiv:2501.12948) showed it could reach OpenAI o1-level math performance without a separate value model. But GRPO has a quiet flaw...
May 1 · 3 min read · Introduction Recent advances in GPU technology have set new standards for performance, especially in artificial intelligence (AI). Nvidia's H200 GPU, with 141GB of VRAM, is a game-changer for developers focusing...
Apr 26 · 3 min read · Understanding Contextual Bandits Contextual bandits are a reinforcement learning approach that balances exploration against exploitation while letting algorithms make decisions based on contextual information. They have broad applica...