Mar 9 · 10 min read · TLDR: Reinforcement Learning from Human Feedback (RLHF) helps align language models with human preferences after pretraining and SFT. The typical pipeline is: collect preference comparisons, train a reward model, then optimize a policy (often with KL...
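The pipeline named in that TLDR (preference comparisons → reward model → KL-regularized policy optimization) can be sketched in a few lines. This is an illustrative snippet, not code from the post; the function names and the `beta` coefficient are assumptions for the sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss for training the reward model:
    pushes the score of the human-preferred response above the rejected one."""
    return -math.log(sigmoid(r_chosen - r_rejected))

def shaped_reward(r, logp_policy, logp_ref, beta=0.1):
    """Reward with a KL-style penalty that keeps the optimized policy close
    to the reference (SFT) model, as in the usual RLHF objective."""
    return r - beta * (logp_policy - logp_ref)
```

When the two candidate scores are equal the loss is log 2 (the model is indifferent); as the chosen score pulls ahead, the loss falls toward zero.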
Mar 9 · 13 min read · TLDR: A raw LLM is a super-smart parrot that read the entire internet — including its worst parts. RLHF (Reinforcement Learning from Human Feedback) is the training pipeline that transforms it from a pattern-matching engine into an assistant that is ...
Feb 16 · 2 min read · I first heard of the contextual bandit algorithm a couple of years back as an undergrad. I never gave much thought to it. Recently, I started working on reinforcement learning for the thrill of picking up my HRI research again and shaking off the dus...
Feb 13 · 4 min read · Let's ease into this with a simple scenario: what if machines didn't learn from instructions or labels, but learned the same way humans do — by making decisions, facing consequences, and slowly improving over time? That idea i...
Feb 6 · 8 min read · Summary (for those who need to get back to scrolling): This post continues an ongoing series documenting my attempt to train a chess engine from scratch. Here, I focus on why supervised pre-training of value-based RL agents (DDQN / Dueling DDQN) led to...
Feb 4 · 8 min read · Imagine a world where humanoid robots seamlessly integrate into our daily lives, performing complex tasks with intelligence and adaptability. This isn't science fiction anymore; it's the rapidly approaching reality of 2025. At the heart of this revol...
Jan 24 · 6 min read · Ever since I watched the Google DeepMind documentary, one thing has kept playing in my head: AlphaGo, reinforcement learning, and how that concept could be brought into football. That question stuck with me. So I did a bit of digging to see how AI is...