© 2026 Hashnode
Group Relative Policy Optimization (GRPO) became the dominant approach for training reasoning models after DeepSeek-R1 (arXiv:2501.12948) showed it could reach OpenAI o1-level math performance without a separate value model. But GRPO has a quiet flaw...

Running LLMs on-device means fighting two constraints simultaneously: memory and latency. The KV-cache — the buffer that stores past token representations so the model does not recompute them — is often the bottleneck on both fronts. A paper publishe...
