Denny Wang · denny.hashnode.dev · Aug 1, 2024
Techniques and Tools for Communication in Distributed Training
Distributed training in machine learning often involves multiple nodes working together to train a model. Effective communication between these nodes is crucial for synchronizing updates, sharing information, and ensuring consistency. Several techniq...
Tag: llmtraining
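As a minimal, hypothetical sketch of one such communication step (not code taken from the post), the snippet below initializes a process group and broadcasts rank 0's parameters so all workers start from identical weights; the backend choice, the toy model, and the torchrun launch are assumptions for the example.

```python
# Sketch: set up inter-node communication and synchronize initial weights.
# Assumes launch via `torchrun`, which sets RANK, WORLD_SIZE, MASTER_ADDR,
# MASTER_PORT, and LOCAL_RANK in the environment.
import os
import torch
import torch.distributed as dist

def init_and_sync(model: torch.nn.Module) -> None:
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)
    if backend == "nccl":
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
        model.cuda()
    # Ensure consistency: every rank receives rank 0's initial parameters.
    for param in model.parameters():
        dist.broadcast(param.data, src=0)

if __name__ == "__main__":
    toy_model = torch.nn.Linear(16, 4)  # toy model, purely for illustration
    init_and_sync(toy_model)
    dist.destroy_process_group()
```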
Denny Wang · denny.hashnode.dev · Jul 26, 2024
Understanding Memory and Throughput in LLMs Training: A Practical Example
Introduction Large Language Models (LLMs) like GPT-3 and BERT are at the forefront of AI advancements, powering applications from natural language understanding to generative text. These models, however, bring significant challenges in terms of memor...
Tag: llmtraining
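As a rough illustration of the memory side of this topic, here is a back-of-the-envelope sketch using the commonly cited ~16 bytes-per-parameter estimate for mixed-precision Adam training (fp16 weights and gradients plus fp32 master weights and Adam moments); the 7B-parameter figure and the omission of activations are illustrative assumptions, not numbers from the post.

```python
# Back-of-the-envelope estimate of training-state memory for a dense model
# under mixed-precision Adam. Activations, KV caches, buffers, and memory
# fragmentation are deliberately ignored; all numbers are illustrative.

def training_memory_gb(num_params: float) -> float:
    """Estimated GB for weights, gradients, and optimizer state."""
    bytes_per_param = (
        2    # fp16 weights
        + 2  # fp16 gradients
        + 4  # fp32 master weights
        + 4  # Adam first moment (fp32)
        + 4  # Adam second moment (fp32)
    )  # = 16 bytes per parameter
    return num_params * bytes_per_param / 1e9

if __name__ == "__main__":
    # Hypothetical 7B-parameter model: roughly 112 GB of state before
    # a single activation is stored -- hence sharding techniques like FSDP.
    print(f"{training_memory_gb(7e9):.0f} GB")
```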
Denny Wang · denny.hashnode.dev · Jun 29, 2024
Understanding Fully Sharded Data Parallel (FSDP) in Distributed Training
Fully Sharded Data Parallel (FSDP) is a technique used in distributed training to improve the efficiency and scalability of training large models across multiple GPUs. Here's a detailed look at what FSDP is, its role in distributed training, and how ...
Tag: llmtraining
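As a hedged illustration of the technique this post names, here is a minimal sketch that wraps a toy model in PyTorch's FullyShardedDataParallel; the model, the default sharding settings, and the torchrun launch are assumptions for the example, not details from the post.

```python
# Minimal FSDP sketch: shard parameters, gradients, and optimizer state
# across ranks, gathering full parameters only around each forward/backward.
# Assumes launch via `torchrun --nproc_per_node=<gpus> script.py`.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main() -> None:
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model standing in for a stack of transformer blocks.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024),
    ).cuda()

    fsdp_model = FSDP(model)  # default full sharding across all ranks
    optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)

    x = torch.randn(8, 1024, device="cuda")
    loss = fsdp_model(x).sum()
    loss.backward()
    optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Real configurations typically also tune an auto-wrap policy and mixed-precision settings rather than relying on the defaults used here.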
Denny Wang · denny.hashnode.dev · Jun 29, 2024
Understanding the Components of Distributed Training
In distributed training, several key components work together to enable efficient and scalable machine learning. These components include communication libraries, training frameworks, and hardware (GPUs). This blog post introduces these components, t...
Tag: distributed training
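Purely as an illustrative probe of the three component layers this post introduces (communication libraries, training framework, hardware), the snippet below checks what is available in a PyTorch environment; it is a sketch, not content from the post.

```python
# Inspect the stack from a single process: communication libraries
# (NCCL / Gloo / MPI), the training framework (PyTorch), and the GPUs.
import torch
import torch.distributed as dist

print("Framework:       PyTorch", torch.__version__)
print("NCCL available: ", dist.is_nccl_available())
print("Gloo available: ", dist.is_gloo_available())
print("MPI available:  ", dist.is_mpi_available())
print("GPUs visible:   ", torch.cuda.device_count())
if torch.cuda.is_available():
    print("GPU 0:          ", torch.cuda.get_device_name(0))
```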
Denny Wang · denny.hashnode.dev · Jun 29, 2024
Understanding Reduce-Scatter, All-Gather, and All-Reduce in Distributed Computing for LLM Training
In the world of parallel computing, particularly in distributed machine learning and high-performance computing, collective communication operations play a crucial role. Among these operations, reduce-scatter, all-gather, and all-reduce are commonly ...
Tag: llmtraining
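As an illustrative sketch of the three collectives this post names, the snippet below runs all-reduce, reduce-scatter, and all-gather with torch.distributed; the gloo backend, the tensor sizes, and the torchrun launch are assumptions made for the demo.

```python
# Demo of the three collectives on small CPU tensors.
# Assumes launch via `torchrun --nproc_per_node=<N> script.py`.
import torch
import torch.distributed as dist

def demo() -> None:
    dist.init_process_group(backend="gloo")  # gloo runs on CPU for a quick demo
    rank, world = dist.get_rank(), dist.get_world_size()

    # all-reduce: every rank ends up with the element-wise sum of all inputs.
    t = torch.full((4,), float(rank))
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    # reduce-scatter: reduce across ranks, then each rank keeps one shard.
    inputs = [torch.full((2,), float(rank)) for _ in range(world)]
    shard = torch.empty(2)
    dist.reduce_scatter(shard, inputs, op=dist.ReduceOp.SUM)

    # all-gather: every rank collects every rank's shard into a full list.
    gathered = [torch.empty(2) for _ in range(world)]
    dist.all_gather(gathered, shard)

    if rank == 0:
        print("all-reduce:", t, "| reduce-scatter shard:", shard)
    dist.destroy_process_group()

if __name__ == "__main__":
    demo()
```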