Elements of Learning Systems (Tianqi Chen, Carnegie Mellon University | OctoML)

Elements of Learning Systems Tianqi Chen

Software Landscape: Wide Range of Applications, Python, LLVM/Clang, …, Software Libraries
• Broad coverage
• Compute performance is less critical
• Engineers handle most optimizations

AI Software Landscape: ML Models, Data, Compute
• Diverse and fast-evolving models (Transformer, ResNet, LSTM, …)
• Big data
• Specialized compute acceleration

Learning Systems Data science for everyone Deploy AI everywhere Acknowledgement:
work from many open-source community members

Elements of Learning Systems
• Accessible and scalable
• Intelligent automated by machine learning
• Full stack, systems and hardware co-design

A Scalable Tree Boosting System XGBoost: A Scalable Tree Boosting
System. Chen and Guestrin. KDD 16 System for gradient tree boosting

Impact of XGBoost
• De-facto tool for data science
• 498 contributors, 21k stars on GitHub

Technical Highlights
• Automatic missing-value modeling
• Sparse-aware tree learning
• Weighted quantile sketch for approximate tree learning
• Cache-aware parallelization
• Out-of-core computation

Automatic Missing-Value Modeling
Samples   V1   V2
X1        ?    TRUE
X2        15   ?
X3        25   FALSE
Missing values are marked "?".
Tree diagram: splits on V1 < 20 and V2?, Y/N branches, leaves X2, X1, X3; missing values follow the default directions.
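The default-direction idea can be sketched as follows: for one candidate split, try sending all missing rows left, then right, and keep the direction with the higher gain. This is a minimal illustration, not XGBoost's implementation. The function names, the gradient values, and the regularization constant `lam` are assumptions made for the example; only the V1 column (?, 15, 25) and the threshold 20 come from the slide.

```python
from typing import List, Optional, Tuple


def score(g_sum: float, h_sum: float, lam: float = 1.0) -> float:
    """Structure score of a leaf holding gradient sum g_sum and hessian sum h_sum."""
    return g_sum * g_sum / (h_sum + lam)


def best_default_direction(
    values: List[Optional[float]],
    grads: List[float],
    hess: List[float],
    threshold: float,
    lam: float = 1.0,
) -> Tuple[str, float]:
    """Pick where missing values (None) should go for the split `value < threshold`."""
    g_left = h_left = g_right = h_right = g_miss = h_miss = 0.0
    for v, g, h in zip(values, grads, hess):
        if v is None:            # missing entry: its side is decided below
            g_miss += g
            h_miss += h
        elif v < threshold:
            g_left += g
            h_left += h
        else:
            g_right += g
            h_right += h

    parent = score(g_left + g_right + g_miss, h_left + h_right + h_miss, lam)
    # Gain if every missing row is sent to the left child.
    gain_left = (score(g_left + g_miss, h_left + h_miss, lam)
                 + score(g_right, h_right, lam) - parent)
    # Gain if every missing row is sent to the right child.
    gain_right = (score(g_left, h_left, lam)
                  + score(g_right + g_miss, h_right + h_miss, lam) - parent)
    if gain_left >= gain_right:
        return "left", gain_left
    return "right", gain_right


# V1 column from the slide (X1 is missing); the gradients are made up for illustration.
v1 = [None, 15.0, 25.0]
direction, gain = best_default_direction(v1, [-1.0, -1.0, 1.0], [1.0, 1.0, 1.0], 20.0)
print(direction, round(gain, 4))
```

With these made-up gradients, X1 looks more like X2 than X3, so the sketch sends missing V1 values to the same side as X2.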

Platform Agnostic Learning System
Diagram labels: Platform A, Platform B; SVM A, MF A, SVM A, MF A; Rabit abstraction, XGBoost; Single Machine XGBoost
• Implementation of learning packages tied to each platform
• Depend on reusable abstractions (allreduce), not platforms
• Portable to different platforms
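A small sketch of what "depend on allreduce, not platforms" can look like in code. The learner below only calls an `allreduce` function that sums a vector across workers, so the same learner code runs on a single machine or on any backend that supplies that function. This is not the Rabit API; every name here is hypothetical, and the "cluster" is simulated in-process.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
AllReduce = Callable[[Vector], Vector]


def single_machine_allreduce(local: Vector) -> Vector:
    """With one worker, the global sum is just the local vector."""
    return list(local)


def make_simulated_allreduce(other_workers: Sequence[Vector]) -> AllReduce:
    """Build an allreduce that adds fixed vectors held by other (simulated) workers."""
    def allreduce(local: Vector) -> Vector:
        total = list(local)
        for remote in other_workers:
            total = [a + b for a, b in zip(total, remote)]
        return total
    return allreduce


def global_gradient_stats(grads: Sequence[float], hess: Sequence[float],
                          allreduce: AllReduce) -> Tuple[float, float]:
    """Learner-side code: sums local statistics, then asks for the global sum.

    It never mentions a platform, only the allreduce abstraction.
    """
    g_total, h_total = allreduce([sum(grads), sum(hess)])
    return g_total, h_total


# Same learner code, two different "platforms".
print(global_gradient_stats([1.0, 2.0], [1.0, 1.0], single_machine_allreduce))
print(global_gradient_stats([1.0, 2.0], [1.0, 1.0],
                            make_simulated_allreduce([[4.0, 3.0]])))
```

Porting to a new platform then means supplying one more `allreduce` function, with no change to the learner.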

Use XGBoost as a Part of Your Platform
In any language, on any platform
• YARN, MPI, Dask, Spark, Ray ...
• Easily extensible to distributed executors

machine learning Full stack systems and hardware co-design

Problem: Machine Learning Deployment Model Hardware Backends Deploy

Existing Deep Learning Frameworks High-level data flow graph Hardware Primitive
Tensor operators such as Conv2D Offload to heavily optimized DNN operator library Frameworks

Limitations of Existing Approach cuDNN Frameworks Engineering intensive New operators
introduced

Machine Learning based Program Optimizer TVM: Learning-based Learning System High-level
data flow graph and optimizations Hardware Frameworks

Problem Setting Search Space of Possible Program Optimizations Low-level Program
Variants

Example Instance in a Search Space Vanilla Code Search Space
of Possible Program Optimizations

Example Instance in a Search Space Loop Tiling for Locality
Search Space of Possible Program Optimizations
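To make "loop tiling for locality" concrete, here is a plain-Python sketch of the same matrix multiply written as a vanilla loop nest and as a tiled loop nest. It is illustrative only: the function names and the `tile` parameter are assumptions, and a real compiler would emit low-level code rather than Python. The tile size is one of the optimization choices that span the search space.

```python
from typing import List

Matrix = List[List[float]]


def matmul_naive(a: Matrix, b: Matrix, n: int) -> Matrix:
    """Vanilla triple loop over an n x n problem."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a: Matrix, b: Matrix, n: int, tile: int) -> Matrix:
    """Same computation, with each loop split into tile-sized blocks.

    The three outer loops walk over blocks; the three inner loops stay inside
    one block, so the data touched by the inner loops is small enough to be
    reused from cache on real hardware.
    """
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, n)):
                        for k in range(k0, min(k0 + tile, n)):
                            c[i][j] += a[i][k] * b[k][j]
    return c


n = 5
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i * j) for j in range(n)] for i in range(n)]
# Different tile sizes are different points in the search space; all compute the same result.
assert matmul_tiled(a, b, n, tile=2) == matmul_naive(a, b, n)
```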

Example Instance in a Search Space Map to Accelerators Search
Space of Possible Program Optimizations

Optimization Choices in a Search Space Loop Transformations Thread Bindings
Cache Locality Thread Cooperation Tensorization Latency Hiding Billions of possible optimization choices

Problem Formalization Search Space Expression AutoTVM Program Cost: Execute
Time Code Generator Optimization Configuration
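One way to write this formalization down is sketched here. The symbols are assumptions, since the slide only names the pieces: e is the expression, S_e its search space of optimization configurations, g the code generator that turns an expression and a configuration into a program, and f the cost (execution time) of that program.

```latex
s^{*} = \operatorname*{arg\,min}_{s \in S_e} f\bigl(g(e, s)\bigr)
```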

Statistical Cost Model Search Space Expression AutoTVM Code Generator Training
data Benefit: Automatically adapt to hardware type Statistical Cost Model learning
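The loop suggested by this slide can be sketched as: measure a few configurations, fit a cost model on those measurements, use the model to rank unmeasured configurations, measure the most promising ones, and repeat. The sketch below is a toy under stated assumptions. `measure` is a synthetic stand-in for running generated code on hardware, the search space is a hypothetical pair of tile sizes, and the cost model is a 1-nearest-neighbour predictor swapped in for brevity, not the model AutoTVM uses.

```python
from itertools import product
from typing import Dict, List, Tuple

Config = Tuple[int, int]


def measure(config: Config) -> float:
    """Hypothetical stand-in for compiling and timing a program on hardware."""
    tile_x, tile_y = config
    return abs(tile_x - 16) + abs(tile_y - 8) + 1.0


def predict(config: Config, history: Dict[Config, float]) -> float:
    """Toy statistical cost model: cost of the closest already-measured config."""
    nearest = min(history,
                  key=lambda c: sum((a - b) ** 2 for a, b in zip(c, config)))
    return history[nearest]


def tune(space: List[Config], rounds: int, batch: int) -> Tuple[Config, float, int]:
    """Model-guided search: only the top-ranked candidates get measured."""
    history = {space[0]: measure(space[0])}          # seed with one measurement
    for _ in range(rounds):
        pending = [c for c in space if c not in history]
        if not pending:
            break
        pending.sort(key=lambda c: predict(c, history))  # rank by predicted cost
        for c in pending[:batch]:                        # measure the best guesses
            history[c] = measure(c)                      # new training data
    best = min(history, key=history.get)
    return best, history[best], len(history)


space = list(product([1, 2, 4, 8, 16, 32], repeat=2))
best, cost, n_measured = tune(space, rounds=3, batch=4)
print(best, cost, "measured", n_measured, "of", len(space))
```

Because the model is refit from measurements on the target machine, the same loop adapts to a different hardware type by collecting different training data.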

Optimization Search Space Construction Hardware Search space: Possible mappings from
the expression to valid hardware programs Based on Halide’s compute/schedule separation What is the search space

Search Space for CPUs L1D L1I L2 L3 L1D L1I
L2 Compute Primitives scalar vector Memory Subsystem Loop Transformations Cache Locality Vectorization CPUs Reuse primitives from prior work: Halide, Loopy

Hardware-aware Search Space CPUs GPUs TPU-like specialized Accelerators

Search Space for TPU-like Specialized Accelerators Tensor Compute Primitives Unified
Buffer Acc FIFO Explicitly Managed Memory Subsystem TPUs

Tensorization Challenge Compute primitives scalar vector tensor Challenge: Build systems
to support emerging tensor instructions

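Tensorization Challenge
# Legacy top-level TVM API as written on the slide; newer TVM releases expose these names under tvm.te.
import tvm
A = tvm.placeholder((8, 8))
B = tvm.placeholder((8,))
k = tvm.reduce_axis((0, 8))
C = tvm.compute((8, 8), lambda y, x: tvm.sum(A[k, y] * B[k], axis=k))
Tensorization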
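For readers unfamiliar with the notation, the expression on this slide denotes the scalar loop nest below, written in plain Python as a reference. The function name is an assumption. Tensorization means replacing the inner loops of such a nest with a single hardware tensor instruction instead of scalar multiply-adds. Note that, as written on the slide, the body does not use `x`, so every entry in a row of the result is the same.

```python
from typing import List


def reference_compute(A: List[List[float]], B: List[float]) -> List[List[float]]:
    """Scalar reference for C[y][x] = sum over k of A[k][y] * B[k], with 8x8 output."""
    n = 8
    C = [[0.0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            for k in range(n):          # reduction axis k
                C[y][x] += A[k][y] * B[k]
    return C


# Small check: with A as the identity, C[y][x] equals B[y] for every x.
A = [[1.0 if k == y else 0.0 for y in range(8)] for k in range(8)]
B = [float(k) for k in range(8)]
C = reference_compute(A, B)
assert all(C[y][x] == B[y] for y in range(8) for x in range(8))
```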

TVM: End to end Machine Learning Optimizer

TVM Case Study
• Over 60 model x hardware benchmarking studies
• Each study compared TVM against best* baseline on the target
• Sorted by ascending log2 gain over baseline
Chart axes: TVM log2 fold improvement over baseline; model x hardware comparison points

Why ML Compilation: TVM Case Study at OctoML
Chart axes: TVM log2 fold improvement over baseline; model x hardware comparison points
• 34x for Yolo-V3 on a MIPS based camera platform
• 5.3x: video analysis model on Nvidia T4 against TensorRT
• 4x: random forest on Nvidia 1070 against XGBoost
• 2.5x: MobilenetV3 on ARM A72 CPU
Source:

New Hardware Support -- Apple M1
• 22% faster than CoreML on CPU
• 49% faster on GPU
• After only a few engineering weeks of effort
Source:

Open Source Community Apache Top-level Project Independent governance Open Source
Code Open Development Open Governance

machine learning Full stack, systems and hardware co-design

Thank You
• XGBoost https://xgboost.ai
• Apache TVM https://tvm.apache.org/
