DRA-GRPO: Fixing Diversity Collapse in Reasoning Models
Group Relative Policy Optimization (GRPO) became the dominant approach for training reasoning models after DeepSeek-R1 (arXiv:2501.12948) showed it could reach OpenAI o1-level math performance without a separate value model. But GRPO has a quiet flaw...
effloow.hashnode.dev · 9 min read