RLHF in Practice: From Human Preferences to Better LLM Policies
6d ago · 10 min read
TLDR: Reinforcement Learning from Human Feedback (RLHF) helps align language models with human preferences after pretraining and SFT. The typical pipeline is: collect preference comparisons, train a reward model on them, then optimize a policy against that reward (often with a KL penalty that keeps the policy close to the reference model).
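As a minimal sketch of the two optimization steps the TLDR names (the notation here, σ for the sigmoid, r_φ for the reward model, π_ref for the reference policy, and β for the KL coefficient, is shorthand of mine rather than the article's):

```latex
% Reward model: Bradley-Terry pairwise loss on preferred (y_w) vs. rejected (y_l) completions
\mathcal{L}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\big]

% Policy optimization: maximize reward while staying close to the reference model
\max_{\pi}\;\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot\mid x)}\big[r_\phi(x, y)\big]
\;-\;\beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)
```

The KL term is what keeps the optimized policy from drifting too far from the SFT reference, which is the role the TLDR alludes to.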