
Elements of Learning Systems (Tianqi Chen, Carnegie Mellon University | OctoML)

Data, models, and computing are the three pillars that enable machine learning to solve real-world problems at scale. Making progress on these three domains requires not only disruptive algorithmic advances but also systems innovations that can continue to squeeze more efficiency out of modern hardware. Learning systems are at the center of every intelligent application nowadays. However, the ever-growing demand for applications and hardware specialization creates a huge engineering burden for these systems, most of which rely on heuristics or manual optimization. In this talk, I will discuss approaches to reduce these manual efforts. I will cover several aspects of such learning systems, including scalability, ease of use, and more automation. I will discuss these elements using the real-world learning systems that I built -- XGBoost and Apache TVM.

Anyscale

July 19, 2021

Transcript

  1. Software Landscape

    Wide Range of Applications. Python, LLVM/Clang, … Software Libraries. • Broad coverage. • Compute performance is less critical. • Engineers handle most optimizations.
  2. AI Software Landscape

    ML Models, Data, Compute. • Diverse and fast-evolving models • Big data • Specialized compute acceleration. Transformer, ResNet, LSTM … …
  3. Elements of Learning Systems

    Accessible and scalable. Intelligent, automated by machine learning. Full stack, systems and hardware co-design.
  4. Elements of Learning Systems

    Accessible and scalable. Intelligent, automated by machine learning. Full stack, systems and hardware co-design.
  5. A Scalable Tree Boosting System

    XGBoost: A Scalable Tree Boosting System. Chen and Guestrin. KDD 16. System for gradient tree boosting.
  6. Impact of XGBoost

    De facto tool for data science. 498 contributors, 21k stars on GitHub.
  7. Technical Highlights

    • Automatic missing-value modeling • Sparse-aware tree learning • Weighted quantile sketch for approximate tree learning • Cache-aware parallelization • Out-of-core computation
  8. Automatic Missing-Value Modeling

    Samples  V1  V2
    X1       ?   TRUE
    X2       15  ?
    X3       25  FALSE

    Missing values are shown as "?". Tree diagram: split nodes "V1 < 20" and "V2?", branches Y/N, leaves X2, X1, X3, with default directions for missing values.
  9. Platform Agnostic Learning System

    Platform A, Platform B. SVM A, MF A, SVM A, MF A. Rabit abstraction, XGBoost. Rabit abstraction, XGBoost. Single Machine XGBoost. Implementation of learning packages tied to each platform. Depend on reusable abstractions (allreduce), not platforms. Portable to different platforms.
  10. Use XGBoost as a Part of Your Platform

    In any language. On any platform. • YARN, MPI, Dask, Spark, Ray ... • Easily extendible to distributed executors
  11. Elements of Learning Systems

    Accessible and scalable. Intelligent, automated by machine learning. Full stack, systems and hardware co-design.
  12. Existing Deep Learning Frameworks

    High-level data flow graph. Hardware. Primitive tensor operators such as Conv2D. Offload to heavily optimized DNN operator library. Frameworks.
  13. Example Instance in a Search Space

    Loop Tiling for Locality. Search Space of Possible Program Optimizations.
  14. Example Instance in a Search Space

    Map to Accelerators. Search Space of Possible Program Optimizations.
  15. Optimization Choices in a Search Space

    Loop Transformations, Thread Bindings, Cache Locality, Thread Cooperation, Tensorization, Latency Hiding. Billions of possible optimization choices.
  16. Statistical Cost Model

    Search Space, Expression, AutoTVM, Code Generator, Training data, Statistical Cost Model, learning. Benefit: Automatically adapt to hardware type.
  17. Optimization Search Space Construction

    Hardware. Search space: possible mappings from the expression to valid hardware programs. Based on Halide’s compute/schedule separation. What is the search space?
  18. Search Space for CPUs

    L1D, L1I, L2, L3, L1D, L1I, L2. Compute Primitives: scalar, vector. Memory Subsystem. Loop Transformations, Cache Locality, Vectorization. CPUs. Reuse primitives from prior work: Halide, Loopy.
  19. Search Space for TPU-like Specialized Accelerators

    Tensor Compute Primitives. Unified Buffer, Acc, FIFO. Explicitly Managed Memory Subsystem. TPUs.
  20. Tensorization Challenge

    A = tvm.placeholder((8, 8))
    B = tvm.placeholder((8,))
    k = tvm.reduce_axis((0, 8))
    C = tvm.compute((8, 8), lambda y, x: tvm.sum(A[k, y] * B[k], axis=k))

    Tensorization
  21. TVM Case Study

    Over 60 model x hardware benchmarking studies. Each study compared TVM against best* baseline on the target. Sorted by ascending log2 gain over baseline. Figure axes: TVM log2 fold improvement over baseline; model x hardware comparison points.
  22. Why ML Compilation: TVM Case Study at OctoML

    Figure axes: model x hardware comparison points; TVM log2 fold improvement over baseline. 34x: Yolo-V3 on a MIPS-based camera platform. 5.3x: video analysis model on Nvidia T4 against TensorRT. 4x: random forest on Nvidia 1070 against XGBoost. 2.5x: MobilenetV3 on ARM A72 CPU. Source:
  23. New Hardware Support -- Apple M1

    • 22% faster than CoreML on CPU • 49% faster on GPU • After only a few engineering weeks of effort. Source:
  24. Elements of Learning Systems

    Accessible and scalable. Intelligent, automated by machine learning. Full stack, systems and hardware co-design.
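
Slide 8 shows samples with missing entries being routed through a tree by default directions. The following is a minimal Python sketch of that routing at prediction time, not XGBoost's implementation. The dictionary encoding of the tree, the leaf names, and the specific default direction at each node are assumptions made for illustration, since the transcript does not preserve which direction each node uses.

```python
# Hypothetical encoding of the two-split tree from slide 8.
# "default" is the branch taken when the feature is missing; the
# directions chosen here are assumptions, not values from the slide.
TREE = {
    "feature": "V1",
    "test": lambda v: v < 20,
    "default": "no",
    "yes": "leaf_a",
    "no": {
        "feature": "V2",
        "test": lambda v: bool(v),
        "default": "no",
        "yes": "leaf_b",
        "no": "leaf_c",
    },
}

# The three samples from the slide; None stands for a missing value ("?").
SAMPLES = {
    "X1": {"V1": None, "V2": True},
    "X2": {"V1": 15, "V2": None},
    "X3": {"V1": 25, "V2": False},
}


def route(node, sample):
    """Walk the tree, following the default branch on missing features."""
    while isinstance(node, dict):
        value = sample.get(node["feature"])
        if value is None:
            branch = node["default"]
        else:
            branch = "yes" if node["test"](value) else "no"
        node = node[branch]
    return node


leaves = {name: route(TREE, sample) for name, sample in SAMPLES.items()}
print(leaves)
```

With these assumed defaults, X1 (missing V1) falls through to the second split and is then routed by its V2 value, while X2 never needs its missing V2 because the first split already sends it to a leaf. In the XGBoost paper the default direction at each node is learned from the data rather than fixed by hand; that learning step is not shown here.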
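
Slide 13 names loop tiling for locality as one instance in the search space of program optimizations. As a rough illustration, the sketch below assumes a square matrix multiply as the workload (the slides do not specify one) and compares a naive loop nest with a tiled variant. The function names and the candidate tile sizes are hypothetical; this is plain Python and does not use TVM.

```python
def matmul_naive(a, b, n):
    """Reference triple loop: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile):
    """Same computation, with each loop split into blocks of size `tile`."""
    c = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Work on one block at a time so its data stays "close".
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for k in range(k0, min(k0 + tile, n)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c


n = 6
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i * j + 1 for j in range(n)] for i in range(n)]
reference = matmul_naive(a, b, n)

# Each tile size is one candidate point in a (tiny) search space.
candidate_tiles = [1, 2, 3, 4, 6]
results_match = all(matmul_tiled(a, b, n, t) == reference for t in candidate_tiles)
print(results_match)
```

Every tile size computes the same result and differs only in the order in which memory is visited, which is why the choice can be left to a search procedure. In pure Python the tiling brings no speedup; the sketch only shows the shape of the transformation.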