When Confidence Becomes Overconfidence
Calibration Collapse After RLHF, and How to Fix It Without Retraining
Reinforcement Learning from Human Feedback makes language models more helpful and less harmful. It also makes them systematically
okonu.hashnode.dev · 11 min read