
Elements of Learning Systems (Tianqi Chen, Carnegie Mellon University | OctoML)


Data, models, and computing are the three pillars that enable machine learning to solve real-world problems at scale. Making progress on these three domains requires not only disruptive algorithmic advances but also systems innovations that can continue to squeeze more efficiency out of modern hardware. Learning systems are at the center of every intelligent application nowadays. However, the ever-growing demand for applications and hardware specialization creates a huge engineering burden for these systems, most of which rely on heuristics or manual optimization. In this talk, I will discuss approaches to reduce these manual efforts. I will cover several aspects of such learning systems, including scalability, ease of use, and more automation. I will discuss these elements using the real-world learning systems that I built -- XGBoost and Apache TVM.


Anyscale

July 19, 2021


Transcript

  1. Elements of Learning Systems Tianqi Chen

  2. Software Landscape. Wide range of applications: Python, LLVM/Clang, … Software libraries: •Broad coverage •Compute performance is less critical •Engineers handle most optimizations
  3. AI Software Landscape: ML models, data, compute. •Diverse and fast-evolving models: Transformer, ResNet, LSTM, … •Big data •Specialized compute acceleration
  4. Learning Systems: data science for everyone; deploy AI everywhere. Acknowledgement: work from many open-source community members
  5. Elements of Learning Systems: accessible and scalable; intelligent, automated by machine learning; full-stack, systems and hardware co-design
  6. Elements of Learning Systems: accessible and scalable; intelligent, automated by machine learning; full-stack, systems and hardware co-design
  7. Learning Systems: data science for everyone; deploy AI everywhere. Acknowledgement: work from many open-source community members
  8. A Scalable Tree Boosting System: a system for gradient tree boosting. XGBoost: A Scalable Tree Boosting System. Chen and Guestrin. KDD 16
  9. Impact of XGBoost: de facto tool for data science; 498 contributors, 21k stars on GitHub. XGBoost: A Scalable Tree Boosting System. Chen and Guestrin. KDD 16
  10. Technical Highlights •Automatic missing-value modeling •Sparsity-aware tree learning •Weighted quantile sketch for approximate tree learning •Cache-aware parallelization •Out-of-core computation. XGBoost: A Scalable Tree Boosting System. Chen and Guestrin. KDD 16
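The quantile-sketch idea behind approximate tree learning can be illustrated with a simplified, unweighted version: instead of evaluating every distinct feature value as a split threshold, propose candidates at quantiles of the feature. The helper below is a hypothetical sketch, not XGBoost's implementation, which additionally weights each sample by its second-order gradient statistic.

```python
# Hypothetical sketch of approximate split finding: rather than scanning all
# distinct feature values, propose candidate thresholds at empirical quantiles.
# XGBoost's real weighted quantile sketch also weights samples by their
# second-order gradients; that part is omitted here.

def candidate_splits(values, num_buckets):
    """Return num_buckets - 1 candidate thresholds at empirical quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[n * q // num_buckets] for q in range(1, num_buckets)]

# 1000 distinct values collapse to just 3 candidate thresholds:
print(candidate_splits(range(1000), 4))   # → [250, 500, 750]
```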
  11. Automatic Missing-Value Modeling. Example data: X1 (V1 missing, V2=TRUE), X2 (V1=15, V2 missing), X3 (V1=25, V2=FALSE). The tree first splits on V1 < 20, then on V2; each node learns a default direction that samples with missing values follow.
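The default-direction idea on this slide can be sketched in a few lines. The helper below is a hypothetical illustration, not XGBoost's internals: None (or an absent key) marks a missing value, and the default direction is assumed to have been learned as "left" during training.

```python
# Hypothetical sketch of default directions for missing values: a sample with
# the split feature missing follows the direction learned during training.

def route(sample, feature, threshold, default_left):
    """Route a sample at one split node; None/absent marks a missing value."""
    value = sample.get(feature)
    if value is None:                       # missing: take the default branch
        return "left" if default_left else "right"
    return "left" if value < threshold else "right"

# Split on V1 < 20, default direction learned as left (as on the slide):
print(route({"V2": True}, "V1", 20, default_left=True))             # → left
print(route({"V1": 15}, "V1", 20, default_left=True))               # → left
print(route({"V1": 25, "V2": False}, "V1", 20, default_left=True))  # → right
```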
  12. Platform-Agnostic Learning System. Before: the implementation of each learning package (SVM, MF) is tied to each platform (Platform A, Platform B). Instead, the single-machine XGBoost implementation depends on a reusable abstraction (Rabit allreduce), not on platforms, and is therefore portable to different platforms.
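The allreduce primitive that Rabit abstracts can be simulated in a few lines of plain Python. This is a semantic sketch only; Rabit itself is a fault-tolerant C++ library, and the function name below is illustrative, not its API.

```python
# Semantic sketch of allreduce: every worker contributes a vector of partial
# statistics (e.g., per-bin gradient sums) and every worker receives the
# element-wise global sum. Names are illustrative, not Rabit's actual API.

def allreduce_sum(worker_vectors):
    total = [sum(column) for column in zip(*worker_vectors)]
    return [list(total) for _ in worker_vectors]   # broadcast back to all

# Three workers, each holding partial gradient statistics for two bins:
partials = [[1.0, 2.0], [0.5, 1.5], [2.5, 0.5]]
print(allreduce_sum(partials))   # every worker sees [4.0, 4.0]
```

Because XGBoost's distributed tree learning only needs this one collective, porting it to a new platform (YARN, MPI, Spark, ...) means implementing allreduce there, not rewriting the learner.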
  13. Use XGBoost as a Part of Your Platform: in any language, on any platform •YARN, MPI, Dask, Spark, Ray, ... •Easily extensible to distributed executors
  14. Learning Systems: data science for everyone; deploy AI everywhere. Acknowledgement: work from many open-source community members
  15. Elements of Learning Systems: accessible and scalable; intelligent, automated by machine learning; full-stack, systems and hardware co-design
  16. Problem: Machine Learning Deployment. Deploy a model onto diverse hardware backends.

  17. Existing Deep Learning Frameworks. Frameworks build a high-level data-flow graph of primitive tensor operators (such as Conv2D) and offload them to a heavily optimized DNN operator library for each hardware target.
  18. Limitations of the Existing Approach (frameworks + cuDNN): engineering-intensive, and new operators keep being introduced.
  19. TVM: Learning-Based Learning System. A machine-learning-based program optimizer sits between the frameworks' high-level data-flow graph (and its optimizations) and the hardware.
  20. Problem Setting: a search space of possible program optimizations, mapped to low-level program variants.
  21. Example Instance in a Search Space of Possible Program Optimizations: vanilla code.
  22. Example Instance in a Search Space of Possible Program Optimizations: loop tiling for locality.
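As an illustration of what one point in that space looks like, here is loop tiling on a plain-Python matrix multiply. This is a pedagogical sketch, not TVM output; TVM generates such variants automatically, and the tile size itself is one of the tunable optimization choices.

```python
# Loop tiling for locality: iterate over small blocks so the working set of
# each block stays cache-resident. The tile size (8 here) is a tunable knob.

def matmul_tiled(A, B, n, tile=8):
    """Multiply two n x n matrices (lists of lists) with tiled loops."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                # inner loops touch only one tile of each operand
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        acc = C[i][j]
                        for k in range(kk, min(kk + tile, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
```

The untiled and tiled variants compute the same result; they differ only in memory-access order, which is exactly the kind of behavior-preserving transformation the search space enumerates.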
  23. Example Instance in a Search Space of Possible Program Optimizations: map to accelerators.
  24. Optimization Choices in a Search Space: loop transformations, thread bindings, cache locality, thread cooperation, tensorization, latency hiding. Billions of possible optimization choices.
  25. Problem Formalization. AutoTVM selects an optimization configuration from the search space expression; the code generator emits a program; the cost is its execution time.
  26. Statistical Cost Model. A statistical cost model, learned from training data gathered during the search, guides AutoTVM and the code generator. Benefit: automatically adapts to the hardware type.
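The interaction between the cost model, the code generator, and the hardware can be caricatured in a few lines. Everything below is simulated for illustration only: the search space, the "hardware" timing function, and the per-knob-averaging "model" are stand-ins, not AutoTVM's actual implementation.

```python
# Toy sketch of the AutoTVM loop: a statistical cost model predicts which
# untried configuration looks fastest, that one is "compiled and timed", and
# the measurement extends the model's training data. All simulated.

search_space = [{"tile": t, "vectorize": v}
                for t in (4, 8, 16, 32) for v in (False, True)]

def run_on_hardware(cfg):
    # stand-in for: generate code for cfg, run it on the target, time it
    return abs(cfg["tile"] - 16) + (0.0 if cfg["vectorize"] else 0.5)

def predicted_runtime(cfg, history):
    # trivial "cost model": average measured runtime of configs sharing a
    # tile size; an optimistic prior of 0.0 encourages exploration
    similar = [t for c, t in history if c["tile"] == cfg["tile"]]
    return sum(similar) / len(similar) if similar else 0.0

history = []                              # (config, measured runtime) pairs
for _ in range(6):                        # measurement budget
    untried = [c for c in search_space if c not in [h for h, _ in history]]
    cfg = min(untried, key=lambda c: predicted_runtime(c, history))
    history.append((cfg, run_on_hardware(cfg)))

best_cfg, best_time = min(history, key=lambda h: h[1])
```

Even this crude model finds the fastest configuration while measuring only 6 of the 8 candidates; the real system faces billions of candidates, which is why learned ranking rather than exhaustive measurement matters.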
  27. Optimization Search Space Construction. What is the search space? The possible mappings from the expression to valid hardware programs, based on Halide's compute/schedule separation.
  28. Search Space for CPUs. Compute primitives: scalar, vector. Memory subsystem: L1I/L1D, L2, L3 caches. Optimizations: loop transformations, cache locality, vectorization. Reuse primitives from prior work: Halide, Loopy.
  29. Hardware-aware Search Space CPUs GPUs TPU-like specialized Accelerators

  30. Search Space for TPU-like Specialized Accelerators. Tensor compute primitives; explicitly managed memory subsystem (unified buffer, accumulator FIFO).
  31. Tensorization Challenge. Compute primitives: scalar, vector, tensor. Challenge: build systems to support emerging tensor instructions.
  32. Tensorization Challenge. Declaring an 8x8 matrix-multiply tensor intrinsic (the transcript dropped B's second dimension; it is restored here so the reduction is well-formed):

      A = tvm.placeholder((8, 8))
      B = tvm.placeholder((8, 8))
      k = tvm.reduce_axis((0, 8))
      C = tvm.compute((8, 8),
                      lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))
  33. TVM: End-to-End Machine Learning Optimizer

  34. TVM Case Study. Over 60 model x hardware benchmarking studies; each study compared TVM against the best* baseline on the target. Chart: model x hardware comparison points, sorted by ascending log2-fold improvement of TVM over the baseline.
  35. Why ML Compilation: TVM Case Study at OctoML (model x hardware comparison points; TVM log2-fold improvement over baseline) •34x: Yolo-V3 on a MIPS-based camera platform •5.3x: video analysis model on Nvidia T4 against TensorRT •4x: random forest on Nvidia 1070 against XGBoost •2.5x: MobilenetV3 on ARM A72 CPU. Source:
  36. New Hardware Support -- Apple M1 •22% faster than CoreML on CPU •49% faster on GPU •After only a few engineering-weeks of effort. Source:
  37. Open-Source Community. Apache top-level project with independent governance: open-source code, open development, open governance.
  38. Elements of Learning Systems: accessible and scalable; intelligent, automated by machine learning; full-stack, systems and hardware co-design
  39. Thank You •XGBoost https://xgboost.ai •Apache TVM https://tvm.apache.org/