Understanding Fully Sharded Data Parallel (FSDP) in Distributed Training
Denny Wang · denny.hashnode.dev · Jun 29, 2024
Fully Sharded Data Parallel (FSDP) is a technique used in distributed training to improve the efficiency and scalability of training large models across multiple GPUs. Here's a detailed look at what FSDP is, its role in distributed training, and ...
Tags: llm, training
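As a rough sketch of the idea in this teaser, here is a minimal PyTorch FSDP example: the model's parameters, gradients, and optimizer state are sharded across ranks instead of replicated. It assumes a single-node launch via `torchrun --nproc_per_node=N` and uses a toy linear layer as a stand-in for a large model.

```python
# Minimal FSDP sketch (assumes launch via `torchrun --nproc_per_node=N script.py`).
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    # torchrun sets RANK / WORLD_SIZE / LOCAL_RANK in the environment.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()  # toy model stands in for a large one
    # FSDP shards parameters, gradients, and optimizer state across ranks;
    # full parameters are gathered only around each layer's forward/backward.
    model = FSDP(model)

    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).sum()
    loss.backward()  # gradients are reduce-scattered back to their owning shards
    optim.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```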
Understanding the Components of Distributed Training
Denny Wang · denny.hashnode.dev · Jun 29, 2024
In distributed training, several key components work together to enable efficient and scalable machine learning. These components include communication libraries, training frameworks, and hardware (GPUs). This blog post introduces these components, t...
Tags: distributed training
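To make the lowest of those components concrete, here is a hedged sketch that exercises the communication library directly: a plain all-reduce over torch.distributed's NCCL backend, assuming the same `torchrun` launch as above.

```python
# Sketch of the communication layer: a SUM all-reduce over the NCCL backend.
# Assumes launch via `torchrun --nproc_per_node=N script.py` on a multi-GPU machine.
import os

import torch
import torch.distributed as dist


def main():
    dist.init_process_group("nccl")  # NCCL is the GPU communication library
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Each rank contributes its own tensor; all-reduce sums them in place,
    # so every rank ends up holding the same result.
    t = torch.full((4,), float(rank), device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {t.tolist()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The same collective is what data-parallel frameworks call under the hood to average gradients across GPUs each step.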