Jongseok Han in dont-like-ai.hashnode.dev
How Variance Breaks Deep Learning · Mar 18 · 7 min read
If you've ever trained a neural network from scratch, you know the dread of a sudden NaN loss. A smooth training loop suddenly explodes to infinity, and the entire learning process collapses. What tri…
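A minimal sketch of the variance problem the post is about: with naively scaled weights, activation variance compounds layer by layer until the forward pass overflows, while Xavier-style scaling (std 1/sqrt(fan_in), an assumption here, not necessarily the post's exact remedy) keeps it stable.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 10, 256
x = rng.normal(size=(1, width))

# Naive init: weight std 1.0, so each layer multiplies the activation
# variance by roughly `width`, and the forward pass explodes.
h = x
for _ in range(depth):
    h = h @ rng.normal(0.0, 1.0, size=(width, width))
naive_std = h.std()

# Xavier-style init: std 1/sqrt(width) keeps the variance roughly constant.
h = x
for _ in range(depth):
    h = h @ rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, width))
scaled_std = h.std()

print(f"naive init:  std after {depth} layers = {naive_std:.3e}")
print(f"scaled init: std after {depth} layers = {scaled_std:.3e}")
```

A ten-layer linear stack already shows the gap of many orders of magnitude; with gradients flowing back through the same weights, that is exactly where NaNs come from.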
Jongseok Han in dont-like-ai.hashnode.dev
V3: Fine-Grained Mixture of Experts (MoE) · Mar 13 · 8 min read
Today, we are shifting our focus to the engine room. How does DeepSeek scale up to hundreds of billions of parameters without requiring an unthinkable amount of compute to run? The answer is its highl…
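A toy sketch of the MoE routing idea the post covers, not DeepSeek's actual architecture: a learned router scores all experts per token, but only the top-k experts actually run, so total parameters grow while per-token compute stays small. The sizes and the softmax-over-top-k gating here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2          # hidden size, expert count, experts per token

# Each "expert" is just a small linear map in this sketch.
experts = [rng.normal(0, 0.1, size=(d, d)) for _ in range(n_experts)]
router = rng.normal(0, 0.1, size=(d, n_experts))

def moe_layer(x):
    logits = x @ router                            # (tokens, n_experts) scores
    topk = np.argsort(logits, axis=-1)[:, -k:]     # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, topk[t]]
        w = np.exp(sel - sel.max())                # softmax over selected experts only
        w /= w.sum()
        for wi, e in zip(w, topk[t]):
            out[t] += wi * (x[t] @ experts[e])     # only k of n_experts run per token
    return out

x = rng.normal(size=(4, d))
out = moe_layer(x)
print(out.shape)
```

With k = 2 of 8 experts active, each token touches only a quarter of the layer's parameters, which is the core of how MoE decouples parameter count from inference cost.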
Jongseok Han in dont-like-ai.hashnode.dev
Rotary Positional Embeddings (RoPE) · Mar 12 · 5 min read
Today, we are tackling a different problem: How does an LLM know where words are? Transformers, by design, are permutation invariant. Without explicit help, they view a beautifully structured sentence…
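A compact sketch of the RoPE mechanism the post explains: each pair of dimensions of a query or key is rotated by an angle proportional to its position, so the dot product between a rotated query and key depends only on their relative offset. The dimension and base frequency below are the commonly used defaults, assumed here.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply a rotary embedding to vector x (even dim) at position pos."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # 2D rotation applied independently to each (x1_i, x2_i) pair.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# The key property: <rope(q, m), rope(k, n)> depends only on m - n.
a = rope(q, 5) @ rope(k, 3)     # relative offset 2
b = rope(q, 12) @ rope(k, 10)   # same offset 2
print(np.isclose(a, b))  # True
```

Because rotations are norm-preserving and their composition depends only on the angle difference, attention scores become a function of relative position without any learned position table.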
Jongseok Han in dont-like-ai.hashnode.dev
V2: Multi-Head Latent Attention (MLA) · Mar 11 · 5 min read
While standard attention mechanisms have served us well, if we want to tackle the major bottlenecks in scaling large language models, we have to look closely at the KV cache. The conceptual explanatio…
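A back-of-the-envelope sketch of the KV-cache bottleneck the post addresses: standard multi-head attention caches full per-head keys and values for every token, while a latent-compressed cache in the spirit of MLA stores one narrow latent per token and up-projects it at use time. The numbers below are illustrative assumptions, not DeepSeek-V2's actual configuration.

```python
# KV-cache size per token, per layer (in elements), standard MHA vs. a
# compressed latent cache. All dimensions are assumed for illustration.
n_heads, head_dim = 32, 128
d_latent = 512                       # assumed compressed latent width

mha_cache = 2 * n_heads * head_dim   # K and V for every head
mla_cache = d_latent                 # one shared latent, up-projected when used

print(f"MHA cache/token: {mha_cache} elements")
print(f"MLA cache/token: {mla_cache} elements")
print(f"reduction: {mha_cache / mla_cache:.0f}x")
```

At long context lengths this per-token cache, not the weights, dominates GPU memory, which is why shrinking it directly raises the feasible batch size and context window.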