RL in the Pre-train Space: Why Training on P(y) Beats Training on P(y|x)
5d ago · 7 min read

RLVR (Reinforcement Learning with Verifiable Rewards) has been the go-to recipe for boosting LLM reasoning since DeepSeek-R1 made it mainstream. The formula is simple: give the model math problems, check the answers, reward correct reasoning chains. ...