Thanks for reading and for your reply. However, you are overlooking key findings in the research that clearly indicate a lack of actual ‘reasoning’ or ‘thinking’. It’s important to recognize that your framing reflects anthropomorphism, a cognitive bias that attributes human traits to AI.
While reinforcement learning introduces a different training signal than the next-token prediction used in pre-training, the underlying mechanism in post-training is still backpropagation. Adjusting weights through gradient descent does not amount to cognitive reasoning, and the transformer’s behavior at inference time is unchanged. Post-training is more accurately described as ‘stochastic funneling’ or ‘manifold sculpting’: reshaping the already-learned distribution to favor outputs resembling the examples seen during fine-tuning, as determined by the data and tasks involved.
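To make that point concrete, here is a minimal, purely illustrative sketch (the three-way “policy” and the reward values are invented for this toy, not taken from any real RLHF pipeline): a REINFORCE-style RL update is still ordinary gradient ascent on weights, the same mechanism as pre-training, just driven by a reward signal instead of a next-token loss.

```python
import math
import random

random.seed(0)

logits = [0.0, 0.0, 0.0]    # toy "weights" over three candidate outputs
rewards = [1.0, -1.0, 0.0]  # hypothetical preference scores (assumed values)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

lr = 0.5
for _ in range(200):
    probs = softmax(logits)
    a = random.choices(range(3), weights=probs)[0]  # sample an output
    r = rewards[a]
    # REINFORCE gradient of r * log pi(a): (1[i == a] - probs[i]) * r
    for i in range(3):
        grad = ((1.0 if i == a else 0.0) - probs[i]) * r
        logits[i] += lr * grad  # an ordinary weight update, nothing more

final = softmax(logits)
print([round(p, 3) for p in final])
```

The probability mass ends up concentrated on the rewarded output: the distribution has been sculpted toward preferred examples, but the update rule itself is the same gradient arithmetic used everywhere else in training.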
Reinforcement Learning (RL), especially RLHF (Reinforcement Learning from Human Feedback), was once seen as a promising path to improving large language model reasoning. It helped align models with human preferences and refine outputs, but its limitations became clear: RL tends to optimize for appearing right rather than actually reasoning correctly. As a result, focus has shifted toward alternatives such as supervised fine-tuning, tool use, and architectural innovations that promote genuine reasoning over reward hacking.