Reinforcement Learning (RL), especially RLHF (Reinforcement Learning from Human Feedback), was once seen as a promising path to improve large language model reasoning. It helped align models with human preferences and refine outputs. However, its limitations became clear — RL tends to optimize for appearing right rather than actually thinking right. As a result, focus has shifted toward alternative methods like supervised fine-tuning, tool use, and architectural innovations that promote true reasoning over reward hacking.