Beyond RLHF: Aligning LLMs with Direct Preference Optimization (DPO)
Introduction
Training a Large Language Model (LLM) has traditionally required two steps: first, pretrain the model to predict the next word; then, fine-tune its behavior by ranking its answers.
This second step is known as Reinforcement Learning from Human Feedback (RLHF).