Lewis Won · incliolabs.hashnode.dev · May 17 · 19 min read
Debugging Multi-node GPU training
Table of Contents: 1. The Problem 2. Physical Hardware 3. Intra-Node Communication: NVLink 4. Inter-Node Communication: InfiniBand and RDMA 5. PCIe Topology: Why GPU-NIC Affinity Matters 6. The S…

Lewis Won · incliolabs.hashnode.dev · Apr 26 · 9 min read
From Loss=36 to Convergence: Integrating Whisper+Gemma2 into Megatron's TransformerEngine
When we started debugging our AudioLLM on the Megatron trainer, our loss started at 36. This did not make sens…

Lewis Won · incliolabs.hashnode.dev · Mar 27 · 12 min read
The MDS Shim: Zero-Conversion Data Loading for 800+ Datasets
We have about 800 datasets in Mosaic MDS format, with tens of millions of multimodal samples (each one an audio clip, an instruction, and a target response) spread across thousands of compressed sha…

Lewis Won · incliolabs.hashnode.dev · Mar 20 · 11 min read
Why We Moved an AudioLLM to Megatron
We trained our 10B-parameter AudioLLM (a Whisper speech encoder fused with a Gemma2 9B text decoder) using Megatron with Mosaic Streaming to handle training data.
The wall
The architecture is a Whis…