DeepSeek GRPO Explained (Why do we need it? How does it work? What are the findings?)
GRPO: Efficient RLHF via Group Relative Policy Optimization (first introduced in DeepSeekMath, reference)
Why GRPO?
Problem with PPO: It is slow and memory-intensive, because it trains a separate value (critic) model alongside the policy, and it is prone to reward overfitting in large-scale RLHF.
GRPO’s Advantage: A compute-efficient ...
huanganni.hashnode.dev (2 min read)