DRA-GRPO: Fixing Diversity Collapse in Reasoning Models
May 10 · 9 min read

Group Relative Policy Optimization (GRPO) became the dominant approach for training reasoning models after DeepSeek-R1 (arXiv:2501.12948) showed it could reach OpenAI o1-level math performance without a separate value model. But GRPO has a quiet flaw...
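The key idea that lets GRPO drop the value model is worth making concrete: instead of estimating a baseline with a learned critic, GRPO samples a group of completions per prompt and normalizes each completion's reward against the group's mean and standard deviation. A minimal sketch of that group-relative advantage, assuming the standard formulation (function name and example rewards are illustrative, not from the original):

```python
# Sketch of GRPO's group-relative advantage: sample G completions per
# prompt, score each with a reward, and normalize within the group
# instead of querying a separate value model.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each completion relative to its sampled group."""
    mu = mean(rewards)        # group baseline (replaces the critic)
    sigma = pstdev(rewards)   # group spread, for scale normalization
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 completions for one prompt, binary correctness rewards.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions get positive advantages and incorrect ones negative, with the group mean acting as the baseline a value model would otherwise provide.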