
Survey: Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Survey of Zheng, Lianmin, et al. "Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning." arXiv preprint arXiv:2201.12023 (2022).

Nariaki Tateiwa

April 22, 2023

Transcript

  1. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
    2022-12-05. Zheng, Lianmin, et al. "Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning." arXiv preprint arXiv:2201.12023 (2022).
  2. billion = 10^9
    [Slide shows Figure 1 of "Efficient large-scale language model training on GPU clusters using Megatron-LM."]
  3. [30] Woo-Yeon Lee, et al. Automating system configuration of distributed machine learning. In 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), pages 2057–2067. IEEE, 2019.
    [31] Dmitry Lepikhin, et al. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.
    [17] Shiqing Fan, et al. DAPPLE: A pipelined data parallel approach for training large models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 431–445, 2021.
    [38] Deepak Narayanan, et al. PipeDream: generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019.
    [55] Minjie Wang, et al. Supporting very large models using automatic dataflow graph partitioning. In Proceedings of the Fourteenth EuroSys Conference 2019, pages 1–17, 2019.
  4. How does Alpa search for a parallelization plan? It works at two levels: (1) intra-operator parallelism, chosen by solving an integer linear program (ILP), and (2) inter-operator parallelism, chosen by dynamic programming (DP). The DP partitions the model into stages and the cluster into device meshes, and assigns each stage to a mesh (a simplified sketch of this two-level search follows the transcript).
    1: Stage partitioning in prior work: "PipeDream: generalized pipeline parallelism for DNN training.", "Automatic graph partitioning for very large-scale deep learning."
    2: Example in the slide: 4 XPUs, stages 1..M, 2 device meshes.
  5. Implementation: roughly 16K LoC of Python and 6K LoC of C++, built on Jax and XLA (footnote 1). Each device mesh is driven by worker processes launched via Ray, compilation goes through XLA, and cross-mesh communication uses NCCL (a hedged usage sketch follows the transcript).
    1: XLA (Accelerated Linear Algebra): https://github.com/tensorflow/tensorflow/tree/master/tensorflow/compiler/xla, Jax: https://github.com/google/jax, Ray: https://github.com/ray-project/ray, NCCL: https://github.com/NVIDIA/nccl
  6. Evaluation: training throughput (FLOPS) on models with tens of billions of parameters. "Intra" (intra-operator parallelism only) and "Inter" (inter-operator parallelism only) denote ablated variants of Alpa. Baselines: Megatron-LM v2 for GPT-3-style models, DeepSpeed for GShard MoE, and PP-DP (pipeline parallelism + data parallelism) for Wide-ResNet, which has no strong hand-tuned baseline (a hedged throughput calculation follows the transcript).
  7. Related and follow-up work: "On Optimizing the Communication of Model Parallelism." arXiv e-prints (2022): arXiv-2211.
    [17] DAPPLE: A pipelined data parallel approach for training large models, 2021.
    [38] PipeDream: generalized pipeline parallelism for DNN training, 2019.
    [55] Supporting very large models using automatic dataflow graph partitioning, 2019.
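
To make the two-level search on slide 4 concrete, here is a minimal Python sketch, not Alpa's actual code: an outer dynamic program slices the chain of layers into pipeline stages and assigns each stage a number of devices, while `intra_op_cost` stands in for the per-stage ILP with a toy cost model. The names (`intra_op_cost`, `best_plan`), the layer costs, and the simplified recursion are all my own; the real algorithm in the paper additionally enumerates the slowest-stage time to keep the DP optimal under the pipeline latency objective sum(t_i) + (B - 1) * max(t_i).

```python
# Minimal sketch (not Alpa's code) of the two-level plan search from slide 4.
from functools import lru_cache

NUM_LAYERS = 6          # the model is a chain of layers 0..NUM_LAYERS-1
NUM_DEVICES = 4         # devices to split into submeshes
NUM_MICROBATCHES = 8    # B in the pipeline latency formula

# toy per-layer compute costs (arbitrary units)
LAYER_COST = [2.0, 3.0, 1.0, 4.0, 2.0, 2.0]

def intra_op_cost(first, last, num_devices):
    """Stand-in for the intra-op ILP: estimated time of running layers
    [first, last] on a mesh of num_devices devices, assuming near-linear
    speedup plus a small per-device sharding overhead."""
    work = sum(LAYER_COST[first:last + 1])
    return work / num_devices + 0.1 * (num_devices - 1)

def latency(total, t_max):
    """Pipeline latency: sum of stage times + (B - 1) * slowest stage."""
    return total + (NUM_MICROBATCHES - 1) * t_max

@lru_cache(maxsize=None)
def best_plan(first, devices_left):
    """Inter-op DP (simplified): split layers[first:] over devices_left
    devices into stages; returns (sum_of_stage_times, max_stage_time, plan)."""
    if first == NUM_LAYERS:
        return (0.0, 0.0, ())
    best = None
    for last in range(first, NUM_LAYERS):        # this stage = layers[first..last]
        rest_layers = NUM_LAYERS - (last + 1)
        for n in range(1, devices_left + 1):     # devices given to this stage
            if rest_layers > 0 and devices_left - n < 1:
                continue                         # must keep devices for later stages
            t_stage = intra_op_cost(first, last, n)
            tail_sum, tail_max, tail_plan = best_plan(last + 1, devices_left - n)
            cand = (t_stage + tail_sum, max(t_stage, tail_max),
                    ((first, last, n),) + tail_plan)
            if best is None or latency(cand[0], cand[1]) < latency(best[0], best[1]):
                best = cand
    return best

total, t_max, plan = best_plan(0, NUM_DEVICES)
print("stages as (first_layer, last_layer, num_devices):", plan)
print("estimated pipeline latency:", latency(total, t_max))
```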
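
Slide 5 lists the software stack (Jax, XLA, Ray, NCCL) but not how it looks to a user. Below is a hedged sketch of typical usage, assuming Alpa's documented entry points `alpa.init` and `alpa.parallelize` roughly as I recall them; treat those two names, the `cluster="ray"` argument, and the toy model as assumptions rather than a verified example. The JAX pieces (`jax.grad`, `jax.tree_util.tree_map`) are standard.

```python
# Hedged sketch of slide 5's stack from the user's point of view.
import jax
import jax.numpy as jnp
import alpa

alpa.init(cluster="ray")  # assumption: Ray-backed device-mesh workers

def loss_fn(params, batch):
    x, y = batch
    pred = jnp.dot(x, params["w"]) + params["b"]
    return jnp.mean((pred - y) ** 2)

# Instead of jax.jit, the alpa.parallelize decorator is where the compiler
# applies the inter-/intra-op parallelization found by the DP + ILP search.
@alpa.parallelize
def train_step(params, batch):
    grads = jax.grad(loss_fn)(params, batch)
    return jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)

params = {"w": jnp.zeros((8, 1)), "b": jnp.zeros((1,))}
batch = (jnp.ones((32, 8)), jnp.ones((32, 1)))
params = train_step(params, batch)
```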
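
For slide 6, "throughput (FLOPS)" is the headline metric. The snippet below shows one common way such a number is derived, using the rule-of-thumb 6 x parameters x tokens estimate of transformer training FLOPs; this is an approximation supplied for illustration, not necessarily the exact accounting used in the paper's plots, and the example numbers are made up.

```python
# Hedged sketch of how a "throughput (FLOPS)" figure like the one on slide 6
# can be derived. The 6 * parameters * tokens rule of thumb approximates the
# forward + backward FLOPs of one training iteration of a transformer.
def training_pflops(num_params: float, tokens_per_batch: float,
                    seconds_per_iteration: float) -> float:
    """Aggregate training throughput in PFLOPS."""
    flops_per_iteration = 6.0 * num_params * tokens_per_batch
    return flops_per_iteration / seconds_per_iteration / 1e15

# Illustrative numbers only (not measurements from the paper):
# a 39B-parameter GPT-style model, batch of 1024 sequences x 1024 tokens.
print(training_pflops(num_params=39e9,
                      tokens_per_batch=1024 * 1024,
                      seconds_per_iteration=100.0))
```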