Slide 1

Slide 1 text

Distributed Deep Learning: A Brief Introduction
Kuncahyo Setyo Nugroho
NVIDIA AI R&D Center - BINUS University

Slide 2

Slide 2 text

About Me
Ir. Kuncahyo Setyo Nugroho, S.Kom., M.Kom., IPP.
Role & Organization:
▪ AI Researcher, NVIDIA AI Research & Development Center, Bioinformatics & Data Science Research Center
▪ University Ambassador & Certified Instructor, NVIDIA Deep Learning Institute – Platinum Tier Instructor
Location:
▪ Bina Nusantara (BINUS) University @Anggrek Campus, Jakarta 11530, Indonesia
ksnugroho26{at}gmail.com | kuncahyo.nugroho{at}binus.edu
ksnugroho.id

Slide 3

Slide 3 text

Outline
▪ Background & Motivation
▪ Basic Parallelization Strategies Overview
▪ Challenges in Distributed Deep Learning

Slide 4

Slide 4 text

Background & Motivation

Slide 5

Slide 5 text


Slide 6

Slide 6 text

Exploding Datasets
Power-law relationship between dataset size and accuracy.
J. Hestness et al., “Deep Learning Scaling is Predictable, Empirically,” 2017. https://arxiv.org/abs/1712.00409

Slide 7

Slide 7 text

Exploding Model Complexity
Better models generally come with higher computational cost.
S. Zhu et al., “Intelligent Computing: The Latest Advances, Challenges, and Future,” Intelligent Computing, vol. 2, Jan. 2023. https://doi.org/10.34133/icomputing.0006

Slide 8

Slide 8 text

Larger Models & Datasets Take Longer to Train!
Illustration was generated using ChatGPT-4o.

Slide 9

Slide 9 text

Implications: Good & Bad News
Good News: Requirements are predictable
▪ We can estimate the amount of data needed to improve performance.
▪ We can predict the compute resources required as models scale.
▪ Scaling laws give us a roadmap: we know what to expect.
Bad News: The numbers can be massive
▪ Training large models often requires thousands of GPU-hours.
▪ The cost of compute and data curation can be prohibitive.
▪ Deep learning has turned impossible problems into merely expensive ones.

Slide 10

Slide 10 text

Distributed Training is Necessary!
▪ Developers' and researchers' time is more valuable than hardware.
▪ If a training run takes 10 GPU-days, parallelize it with distributed training: 1024 GPUs can finish in about 14 minutes (ideally)! See the worked estimate below.
▪ Deep learning is experimental; we need to train quickly to iterate, which greatly boosts the develop-and-research cycle.
▪ Short iteration time is fundamental for success! (Idea → Code → Experiment)
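
The 14-minute figure is just the ideal linear-scaling arithmetic (a sketch that assumes perfect scaling efficiency and zero communication overhead):

```latex
% 10 GPU-days expressed in GPU-minutes, then divided over 1024 GPUs:
\[
10~\text{GPU-days} \times 24~\tfrac{\text{h}}{\text{day}} \times 60~\tfrac{\text{min}}{\text{h}}
  = 14{,}400~\text{GPU-minutes}
\]
\[
\frac{14{,}400~\text{GPU-minutes}}{1024~\text{GPUs}} \approx 14~\text{minutes}
\]
```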

Slide 11

Slide 11 text

Large-scale Distributed Training Example
Summit supercomputer (per node):
▪ CPU: 2 x 16-core IBM POWER9 (connected via dual NVLink bricks, 25 GB/s each side)
▪ GPU: 6 x NVIDIA Tesla V100
▪ RAM: 512 GB DDR4 memory
▪ Storage: 250 PB (system-wide file system)
Source: https://www.olcf.ornl.gov/olcf-resources/compute-systems/summit

Slide 12

Slide 12 text

Large-scale Distributed Training Example (Contd.)
Train a video model (TSM) on 660 hours of videos, speeding up training by roughly 200x, from 2 days to 14 minutes.
▪ Dataset: Kinetics (240k training videos) × 100 epochs
▪ Model setup: 8-frame ResNet-50 TSM for video recognition

Nodes | Training Time | Accuracy
1 node (6 GPUs) | 49h 50min | 74.1%
128 nodes (768 GPUs) | 28min | 74.1%
256 nodes (1536 GPUs) | 14min (211x speedup) | 74.0%

J. Lin, C. Gan, and S. Han, “TSM: Temporal Shift Module for Efficient Video Understanding,” 2018. https://arxiv.org/abs/1811.08383

Slide 13

Slide 13 text

Basic Parallelization Strategies Overview

Slide 14

Slide 14 text

1 - Data Parallelism
Do: split the data!
The training dataset is split across workers (Worker 1 ... Worker n); each worker processes a different subset (Data Subset #1 ... Data Subset #n) of the full data.

Slide 15

Slide 15 text

1 - Data Parallelism (Contd.)
Same model across workers!
Each worker runs an identical copy of the model but sees a different data shard; gradients are synchronized across workers afterwards (see the sketch below).
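
A minimal PyTorch sketch of this idea follows. It assumes the script is launched with torchrun (which sets RANK, LOCAL_RANK, and WORLD_SIZE) and uses an explicit all-reduce to average gradients; the dataset, model, and sizes are placeholders rather than a real workload.

```python
# Minimal data-parallel sketch; run with e.g.:
#   torchrun --nproc_per_node=2 data_parallel_sketch.py
import os
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group(backend="nccl")        # use "gloo" on CPU-only machines
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Toy dataset standing in for the full training set.
dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))

# DistributedSampler gives each rank a disjoint shard ("Data Subset #rank").
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

model = torch.nn.Linear(16, 2).cuda()          # identical replica on every worker
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

for x, y in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(x.cuda()), y.cuda())
    loss.backward()
    # Gradient sync: average gradients across all replicas after backward.
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= dist.get_world_size()
    optimizer.step()

dist.destroy_process_group()
```

In practice this manual all-reduce is what DistributedDataParallel automates (and overlaps with computation), as shown in the hands-on section later.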

Slide 16

Slide 16 text

2 - Model Parallelism
Single copy of data!
The model is split across workers; each worker processes part of the network on the full data.

Slide 17

Slide 17 text

2 - Model Parallelism (Contd.)
Do: split the model!
Each worker holds a different part of the model (e.g., a layer or block of layers); activations are passed between workers, as in the sketch below.
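
As an illustration, here is a minimal two-GPU, layer-split sketch in PyTorch; the layer sizes and device IDs are assumptions for the example only.

```python
# Minimal model-parallel sketch: layers split across two GPUs,
# with activations moved from one device to the next.
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First block of layers lives on GPU 0, second block on GPU 1.
        self.stage1 = nn.Sequential(nn.Linear(784, 512), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Linear(512, 10).to("cuda:1")

    def forward(self, x):
        h = self.stage1(x.to("cuda:0"))
        # Activation transfer: pass the intermediate tensor to the next worker.
        return self.stage2(h.to("cuda:1"))

model = TwoStageModel()
out = model(torch.randn(32, 784))   # the full batch flows through both devices
out.sum().backward()                # gradients flow back along the same path
```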

Slide 18

Slide 18 text

3 - Tensor Parallelism
Single copy of data!
The model is too large for a single device, so each layer is split across GPUs. Each GPU processes part of a tensor operation, such as a slice of neurons or matrix rows (see the sketch below).
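
A toy forward-pass sketch of the idea is shown below: one linear layer's output neurons are split column-wise across two GPUs and the partial outputs are concatenated. Real implementations (e.g., Megatron-LM-style tensor parallelism) also shard the backward pass and use collective communication; this is only an illustration.

```python
# Tensor-parallel sketch: one Linear layer split across two GPUs,
# each holding half of the layer's output neurons.
import torch
import torch.nn as nn

in_features, out_features = 1024, 4096

# Each device holds a shard of the same layer.
shard0 = nn.Linear(in_features, out_features // 2).to("cuda:0")
shard1 = nn.Linear(in_features, out_features // 2).to("cuda:1")

x = torch.randn(8, in_features)               # the same full input goes to both shards
y0 = shard0(x.to("cuda:0"))                   # output neurons 0 .. 2047
y1 = shard1(x.to("cuda:1"))                   # output neurons 2048 .. 4095
y = torch.cat([y0, y1.to("cuda:0")], dim=-1)  # gather the output slices
```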

Slide 19

Slide 19 text

Strategy Comparison

Aspect | Data Parallelism | Model Parallelism | Tensor Parallelism
Basic idea | Dataset is split; the same model is copied on each device. | Model is split; each device runs a different layer or block. | Tensors (weights/activations) are split across devices.
What is split | Input data (batch). | Model layers. | Tensor computations inside a layer.
When to use | The model fits on one GPU but the data is large. | The model is too big for a single GPU. | Even one layer is too large for one device.
Scalability | Good for increasing batch size and throughput. | Good for scaling very deep or wide models. | Good for scaling extremely large models (e.g., LLMs).
Implementation | Relatively simple. | More complex. | More complex.

Slide 20

Slide 20 text

Distributed Training Frameworks
▪ Efficient for LLMs and memory optimization.
▪ Built-in native distributed training.
▪ Simplifies distributed training setup.

Slide 21

Slide 21 text

Hands-on: Data Parallelism using PyTorch DDP
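
A minimal DDP training skeleton to start from is sketched below; the dataset, model, and hyperparameters are placeholders, and it assumes launching with torchrun (which sets RANK, LOCAL_RANK, and WORLD_SIZE per process).

```python
# Run with e.g.: torchrun --nproc_per_node=<num_gpus> train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy dataset and model standing in for the real workload.
    dataset = TensorDataset(torch.randn(4096, 32), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(dataset)                 # one shard per rank
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    model = torch.nn.Linear(32, 10).to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])       # wraps gradient all-reduce
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)                          # reshuffle shards each epoch
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(x.to(local_rank)), y.to(local_rank))
            loss.backward()                               # DDP averages gradients here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```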

Slide 22

Slide 22 text

Challenges in Distributed Deep Learning

Slide 23

Slide 23 text

Bottlenecks of Distributed Training
Large datasets and large models mean more computation and more memory! The solution is distributed training, but that makes communication between workers essential, and communication itself becomes the key bottleneck.

Slide 24

Slide 24 text

Communication Overhead
Key impacts of the communication bottleneck (a rough cost estimate is sketched below):
▪ Synchronization latency limits speedup.
▪ Gradients are synced more often as nodes increase.
▪ Communication can dominate computation time.
▪ Bandwidth limits training efficiency at large scale.
S. Han and W. J. Dally, “Bandwidth-efficient deep learning,” in Proc. Design Automation Conference (DAC), Jun. 2018, pp. 1–6. https://doi.org/10.1145/3195970.3199847
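
To get a feel for why communication can dominate, here is a back-of-the-envelope sketch. It assumes a ring all-reduce, where each worker transfers roughly 2(N-1)/N times the gradient size per synchronization; the model size, precision, worker count, and bandwidth are illustrative numbers, not measurements.

```python
# Rough estimate of the time spent on one gradient synchronization.
def allreduce_time_s(num_params: int, bytes_per_param: int,
                     num_workers: int, bandwidth_gb_s: float) -> float:
    model_bytes = num_params * bytes_per_param
    # Ring all-reduce: each worker sends/receives ~2 * (N - 1) / N * model size.
    traffic = 2 * (num_workers - 1) / num_workers * model_bytes
    return traffic / (bandwidth_gb_s * 1e9)

# Example: 1.5B-parameter model, FP16 gradients, 64 workers, 25 GB/s links.
print(f"{allreduce_time_s(1_500_000_000, 2, 64, 25):.2f} s per sync")  # ~0.24 s
```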

Slide 25

Slide 25 text

High Network Latency Slows Training
L. Zhu, H. Lin, Y. Lu, Y. Lin, and S. Han, “Delayed Gradient Averaging: Tolerate the Communication Latency for Federated Learning,” Advances in Neural Information Processing Systems, vol. 34, pp. 29995–30007, Dec. 2021. (Link)

Slide 26

Slide 26 text

Large Batch Challenge
Global batch size: B_global = B_local × N, where N is the number of workers. Larger clusters therefore mean a larger global batch size (a worked example follows below).
Two key challenges:
▪ Lower generalization at very large batch sizes (e.g., > 8000): models tend to converge to sharp minima, leading to lower test performance.
▪ Scalability drops with more workers, especially under high communication overhead.
N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima,” 2016. https://arxiv.org/abs/1609.04836
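
A quick worked example of the formula (the per-worker batch size and worker count are illustrative): even a modest local batch on a large cluster pushes the global batch size past the ~8000 range cited above.

```latex
% Global batch size grows linearly with the number of workers N:
\[
B_{\text{global}} = B_{\text{local}} \times N
\]
% Illustrative numbers: 32 samples per worker on 256 workers.
\[
B_{\text{local}} = 32, \quad N = 256
  \;\Rightarrow\;
  B_{\text{global}} = 32 \times 256 = 8192
\]
```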

Slide 27

Slide 27 text

SGD Stops Scaling at Large Batch Sizes
Scaling limitations:
▪ Beyond a certain batch size, SGD stops scaling effectively.
▪ Increasing the batch size past that point no longer reduces the number of training steps.
▪ Diminishing returns are observed when the batch size grows too large.
C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl, “Measuring the Effects of Data Parallelism on Neural Network Training,” 2018. https://arxiv.org/abs/1811.03600

Slide 28

Slide 28 text

Summary of Key Challenges

Level | Main Issues | Effects
System (Infrastructure) Level | Communication overhead, bandwidth bottlenecks. | Slower training due to gradient sync; network latency limits scalability.
Algorithmic Level | Poor generalization, SGD scaling limitations. | Sharp minima with large batches; diminishing performance with bigger batch sizes.

*Balancing hardware efficiency and model performance requires addressing both communication constraints and optimization behavior.

Slide 29

Slide 29 text

Book References

Slide 30

Slide 30 text

Thank You
https://www.linkedin.com/in/ksnugroho
ksnugroho26{at}gmail.com | kuncahyo.nugroho{at}binus.edu