Slide 1

Streaming distributed execution across CPUs and GPUs
June 21, 2023
Eric Liang
Email: [email protected]

Slide 2

Talk Overview
● ML inference and training workloads
● How they relate to Ray Data streaming (a new feature in Ray 2.4!)
● Examples
● Backend overview

Slide 3

About me
● Technical lead for Ray / OSS at Anyscale
● Previously:
  ○ PhD in systems / ML at Berkeley
  ○ Staff eng @ Databricks, storage infra @ Google

Slide 4

ML workloads and Data

Slide 5

ML Workloads
● Where does data processing come in for ML workloads?
Diagram: ETL Pipeline → Preprocessing → Training / Inference

Slide 6

ML Workloads
Diagram: ETL Pipeline → Preprocessing → Training / Inference
● ETL Pipeline: ingesting latest data, joining data tables
● Preprocessing: resizing images, decoding videos, data augmentation
● Training / Inference: using PyTorch, TF, HuggingFace, LLMs, etc.

Slide 7

ML Workloads
Diagram: ETL Pipeline (CPU) → Preprocessing (CPU) → Training / Inference (GPU)
● ETL Pipeline: ingesting latest data, joining data tables
● Preprocessing: resizing images, decoding videos, data augmentation
● Training / Inference: using PyTorch, TF, HuggingFace, LLMs, etc.

Slide 8

ML Workloads
Diagram: ETL Pipeline (CPU) → Preprocessing (CPU) → Training / Inference (GPU); usual scope of ML teams: Preprocessing and Training / Inference
● ETL Pipeline: ingesting latest data, joining data tables
● Preprocessing: resizing images, decoding videos, data augmentation
● Training / Inference: using PyTorch, TF, HuggingFace, LLMs, etc.

Slide 9

Our vision for simplifying this with Ray
Diagram: ETL Pipeline → Preprocessing → Training / Inference

Slide 10

Our vision for simplifying this with Ray
Diagram: ETL Pipeline → Preprocessing → Training / Inference

Slide 11

Our vision for simplifying this with Ray
Diagram: ETL Pipeline → Preprocessing → Training / Inference

Slide 12

Ray Data: Overview

Slide 13

Ray Data overview

Slide 14

Ray Data overview

Slide 15

Ray Data overview
Diagram: a ray.data.Dataset is a set of blocks distributed across the nodes of a cluster (Node 1, Node 2, Node 3)
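
A minimal sketch of the Dataset-of-blocks idea (hedged: this assumes the Ray 2.4-era Python API; ray.data.range just builds a toy dataset for illustration):

    import ray

    # A Dataset is a distributed collection of blocks; operations run in
    # parallel over those blocks across the nodes of the cluster.
    ds = ray.data.range(1000)

    print(ds.num_blocks())   # number of underlying blocks
    print(ds.schema())       # schema shared by all blocks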

Slide 16

Ray Data overview
Powered by ray.data.Dataset

Slide 17

Ray Data overview

Read from storage: high-performance distributed IO
    ds = ray.data.read_parquet("s3://some/bucket")
    ds = ray.data.read_csv("/tmp/some_file.csv")
● Leverages Apache Arrow’s high-performance single-threaded IO
● Parallelized using Ray’s high-throughput task execution
● Scales to PiB-scale jobs in production (Amazon)

Transform data
    ds = ds.map_batches(batch_func)
    ds = ds.map(func)

Consume data
    ds.iter_batches() -> Iterator
    ds.write_parquet("s3://some/bucket")
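
Putting the three pieces above together, a rough end-to-end sketch could look like this (hedged: the paths and batch_func are placeholders from the slide, not real resources):

    import ray

    def batch_func(batch):
        # Placeholder per-batch transform; `batch` is a dict of NumPy arrays
        # (or a pandas DataFrame, depending on batch_format).
        return batch

    # Read from storage (distributed, Arrow-based IO)
    ds = ray.data.read_csv("/tmp/some_file.csv")   # placeholder path

    # Transform data
    ds = ds.map_batches(batch_func)

    # Consume data
    for batch in ds.iter_batches(batch_size=256):
        pass  # e.g., feed the batch to a model

    # ...or write the results back out
    ds.write_parquet("/tmp/output")                 # placeholder path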

Slide 18

Ray Data Streaming

Slide 19

Bulk execution
Previous versions of Ray Data (<2.4) used a bulk execution strategy. What is bulk execution?
● Load all data into memory
● Apply transformations on the in-memory data in bulk
● Out of memory? -> spill blocks to disk
● Similar to Spark's execution model (bulk synchronous parallel)

Slide 20

Streaming (pipelined) execution
● Default execution strategy for Ray Data in 2.4
● Same data transformation API
● Instead of executing operations in bulk, build a pipeline of operators
● Data blocks are streamed through the operators, reducing memory use and avoiding spilling to disk

Slide 21

Preprocessing can often be the bottleneck
● Example: video decoding prior to inference / training
● Too expensive to run on just GPU nodes: needs scaling out
● Large intermediate data: uses lots of memory

Slide 22

Ray Data streaming avoids the bottleneck
● E.g., intermediate video frames are streamed through memory
● Decoding can be offloaded from GPU nodes onto CPU nodes
● Intermediate frames are kept purely in (cluster) memory
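
A rough sketch of this pattern (hedged: decode_frames and FrameClassifier mirror the names in the example pipeline later in the talk, but the bodies, the decode_video/load_model helpers, and the S3 paths are placeholders):

    import ray
    from ray.data import ActorPoolStrategy

    def decode_frames(batch):
        # CPU-only work, run as Ray tasks that can be scheduled on CPU nodes.
        batch["frames"] = [decode_video(b) for b in batch["bytes"]]  # placeholder decoder
        return batch

    class FrameClassifier:
        def __init__(self):
            self.model = load_model()  # placeholder: load weights once per actor

        def __call__(self, batch):
            batch["labels"] = self.model(batch["frames"])
            return batch

    ds = (
        ray.data.read_binary_files("s3://some/videos")          # placeholder path
        .map_batches(decode_frames)                              # plain function -> Ray tasks (CPU)
        .map_batches(
            FrameClassifier,
            compute=ActorPoolStrategy(min_size=1, max_size=2),   # pool of Ray actors
            num_gpus=1,                                          # one GPU per actor
        )
    )
    ds.write_parquet("s3://some/output")                          # placeholder path

The intermediate frames only exist as blocks flowing through the object store between the two stages; nothing is written to external storage between decode and inference.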

Slide 23

Inference <> Training
● Same streaming pipeline can easily be used for training too!

Slide 24

Inference <> Training
● Same streaming pipeline can easily be used for training too!
Diagram: Split[3] feeding Worker [0], Worker [1], Worker [2]

Slide 25

Streaming performance benefits deep dive

Slide 26

Performance benefits overview: bulk execution vs. streaming execution, for CPU-only pipelines (single-stage) and heterogeneous CPU+GPU pipelines (multi-stage)

Slide 27

Performance benefits overview (bulk vs. streaming execution; CPU-only single-stage vs. heterogeneous CPU+GPU multi-stage pipelines)
● CPU-only (single-stage) + bulk execution: memory optimal, good for inference, bad for training

Slide 28

Performance benefits overview (bulk vs. streaming execution; CPU-only single-stage vs. heterogeneous CPU+GPU multi-stage pipelines)
● CPU-only (single-stage) + bulk execution: memory optimal, good for inference, bad for training
● CPU-only (single-stage) + streaming execution: memory optimal, good for inference, good for training

Slide 29

Performance benefits overview (bulk vs. streaming execution; CPU-only single-stage vs. heterogeneous CPU+GPU multi-stage pipelines)
● CPU-only (single-stage) + bulk execution: memory optimal, good for inference, bad for training
● CPU-only (single-stage) + streaming execution: memory optimal, good for inference, good for training
● Heterogeneous CPU+GPU (multi-stage) + bulk execution: memory inefficient, slower for inference, bad for training

Slide 30

Performance benefits overview (bulk vs. streaming execution; CPU-only single-stage vs. heterogeneous CPU+GPU multi-stage pipelines)
● CPU-only (single-stage) + bulk execution: memory optimal, good for inference, bad for training
● CPU-only (single-stage) + streaming execution: memory optimal, good for inference, good for training
● Heterogeneous CPU+GPU (multi-stage) + bulk execution: memory inefficient, slower for inference, bad for training
● Heterogeneous CPU+GPU (multi-stage) + streaming execution: memory optimal, good for inference, good for training

Slide 31

In more detail: a simple batch inference job
Logical data flow:

Slide 32

A simple batch inference job

Slide 33

A simple batch inference job

Slide 34

A simple batch inference job

Slide 35

A simple batch inference job

Slide 36

A simple batch inference job

Slide 37

A simple batch inference job

Slide 38

This is a single stage pipeline

Slide 39

Bulk physical execution
Diagram: output 1, output 2, output 3

Slide 40

Bulk physical execution -- single stage
● Memory usage is optimal (no intermediate data)
● Good for inference
● Not good for distributed training (cannot consume results incrementally)

Slide 41

Streaming physical execution
Diagram: Operator (Stage 1) processing data partitions 1-3

Slide 42

Streaming physical execution
Diagram: Operator (Stage 1) processing data partitions 1-3

Slide 43

Streaming physical execution
Diagram: Operator (Stage 1) processing data partitions 1-3; output 1

Slide 44

Streaming physical execution
Diagram: Operator (Stage 1) processing data partitions 1-3; output 1

Slide 45

Streaming physical execution
Diagram: Operator (Stage 1) processing data partitions 1-3; output 1, output 2

Slide 46

Streaming physical execution
Diagram: Operator (Stage 1) processing data partitions 1-3; output 1, output 2

Slide 47

Streaming physical execution
Diagram: Operator (Stage 1) processing data partitions 1-3; output 1, output 2, output 3

Slide 48

Streaming physical execution -- single stage
● Memory usage is optimal (no intermediate data)
● Good for inference
● Good for distributed training

Slide 49

In more detail: multi-stage (heterogeneous) pipeline
Diagram: pipeline with a GPU stage

Slide 50

Heterogeneous pipeline (CPU + GPU)

Slide 51

Heterogeneous pipeline (CPU + GPU)

Slide 52

Heterogeneous pipeline (CPU + GPU)
Diagram: GPU stage

Slide 53

Comparing bulk vs. streaming execution

Slide 54

Bulk physical execution

Slide 55

Bulk physical execution

Slide 56

Bulk physical execution

Slide 57

Bulk physical execution -- multi stage
● Memory usage is inefficient (disk spilling)
● Slower for inference
● Bad for distributed training

Slide 58

Streaming physical execution
Diagram: Operator (Stage 1) → Operator (Stage 2) → Operator (Stage 3)

Slide 59

Streaming physical execution
Diagram: Operator (Stage 1) → Operator (Stage 2) → Operator (Stage 3)

Slide 60

Streaming physical execution
Diagram: Operator (Stage 1) → Operator (Stage 2) → Operator (Stage 3)

Slide 61

Streaming physical execution
Diagram: Operator (Stage 1) → Operator (Stage 2) → Operator (Stage 3)

Slide 62

Streaming physical execution
Diagram: Operator (Stage 1) → Operator (Stage 2) → Operator (Stage 3)

Slide 63

Streaming physical execution
Diagram: Operator (Stage 1) → Operator (Stage 2) → Operator (Stage 3)

Slide 64

Streaming physical execution -- multi stage
● Memory usage is optimal (no intermediate data)
● Good for inference
● Good for distributed training

Slide 65

Comparison to other systems
● DataFrame systems (e.g., Spark)
  ○ Ray Data streaming is more memory efficient
  ○ Ray Data supports heterogeneous clusters
  ○ The execution model is a better fit for distributed training
● ML ingest libraries: TF Data / Torch Data / Petastorm
  ○ Ray Data supports scaling preprocessing out to a cluster

Slide 66

Video Inference Example

Slide 67

Four stage pipeline
Logical data flow:

Slide 68

Four stage pipeline

Slide 69

Four stage pipeline

Slide 70

Four stage pipeline

Slide 71

Putting it together

Slide 72

Putting it together

Slide 73

Putting it together

Slide 74

Putting it together

Slide 75

Putting it together
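
As a hedged reconstruction of the code these slides build up, the four-stage job could look roughly like this; the operator names match the DAG printed in the log on the following slides, but the function/class bodies, pool sizes, and paths are placeholders:

    import ray
    from ray.data import ActorPoolStrategy

    def decode_frames(batch):
        # Stage 1 (Ray tasks, CPU): decode video bytes into frames. Placeholder body.
        return batch

    class FrameAnnotator:
        # Stage 2 (actor pool, CPU): annotate/resize frames. Placeholder body.
        def __call__(self, batch):
            return batch

    class FrameClassifier:
        # Stage 3 (actor pool, GPU): run the classification model. Placeholder body.
        def __init__(self):
            self.model = ...  # load model once per actor

        def __call__(self, batch):
            return batch

    (
        ray.data.read_binary_files("s3://videos")                   # placeholder input
        .map_batches(decode_frames)                                   # TaskPoolMapOperator
        .map_batches(FrameAnnotator, compute=ActorPoolStrategy(min_size=1, max_size=8))
        .map_batches(FrameClassifier, compute=ActorPoolStrategy(min_size=1, max_size=2), num_gpus=1)
        .write_parquet("s3://results")                                # placeholder output (Write)
    )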

Slide 76

Streaming execution plan

Slide 77

Running this on a Ray cluster
$ python workload.py

Slide 78

Running this on a Ray cluster
$ python workload.py
2023-05-02 15:10:01,105 INFO streaming_executor.py:91 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(decode_frames)] -> ActorPoolMapOperator[MapBatches(FrameAnnotator)] -> ActorPoolMapOperator[MapBatches(FrameClassifier)] -> TaskPoolMapOperator[Write]

Slide 79

Running this on a Ray cluster
$ python workload.py
2023-05-02 15:10:01,105 INFO streaming_executor.py:91 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(decode_frames)] -> ActorPoolMapOperator[MapBatches(FrameAnnotator)] -> ActorPoolMapOperator[MapBatches(FrameClassifier)] -> TaskPoolMapOperator[Write]
Running: 25.0/112.0 CPU, 2.0/2.0 GPU, 33.19 GiB/32.27 GiB object_store_memory: 28%|███▍ | 285/1000 [01:40<03:12, 3.71it/s]

Slide 80

Running this on a Ray cluster
$ python workload.py
2023-05-02 15:10:01,105 INFO streaming_executor.py:91 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(decode_frames)] -> ActorPoolMapOperator[MapBatches(FrameAnnotator)] -> ActorPoolMapOperator[MapBatches(FrameClassifier)] -> TaskPoolMapOperator[Write]
Running: 0.0/112.0 CPU, 0.0/2.0 GPU, 0.0 GiB/32.27 GiB object_store_memory: 100%|████████████| 1000/1000 [05:13<00:00, 3.85it/s]

Slide 81

Ray dashboard observability: active tasks

Slide 82

Ray dashboard observability: actor state

Slide 83

Ray dashboard observability: network

Slide 84

Backend overview

Slide 85

How does Ray Data streaming work?
● Each transformation is implemented as an operator
● Use Ray tasks and actors to execute operators
  ○ default -> use Ray tasks
  ○ actor pool -> use Ray actors
● Intermediate data blocks live in the Ray object store
● Memory usage of operators is limited (backpressure) to enable efficient streaming without spilling to disk
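
To make the task/actor mapping concrete, here is a small sketch (hedged: preprocess and Model are placeholder names, and the resource-limit knob at the end follows the DataContext API of recent Ray releases, which may differ slightly between versions):

    import ray
    from ray.data import ActorPoolStrategy

    def preprocess(batch):       # stateless function -> run by Ray tasks (default)
        return batch

    class Model:                 # stateful callable class -> run by a pool of Ray actors
        def __init__(self):
            self.model = ...     # placeholder: load weights once per actor

        def __call__(self, batch):
            return batch

    ds = (
        ray.data.read_parquet("s3://some/bucket")                              # placeholder path
        .map_batches(preprocess)                                                # TaskPoolMapOperator
        .map_batches(Model, compute=ActorPoolStrategy(min_size=1, max_size=4))  # ActorPoolMapOperator
    )

    # Optionally cap the object store memory the streaming executor may use;
    # backpressure pauses upstream operators once the limit is reached.
    ctx = ray.data.DataContext.get_current()
    ctx.execution_options.resource_limits.object_store_memory = 2 * 1024**3  # 2 GiB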

Slide 86

Advantages of using Ray core primitives
● Heterogeneous cluster support
● Fault tolerance out of the box
  ○ Lineage-based reconstruction of task and actor operations → your ML job will survive failures during preprocessing
● Resilient object store layer
  ○ Spills to disk in case of unexpectedly high memory usage: a slowdown instead of a crash
  ○ Can also do large-scale shuffles in a pinch
● Easy to add data locality optimizations for both task and actor ops

Slide 87

Scalability
● Stress tests on a 20 TiB array dataset
● 500 machines

Slide 88

Training
● Can use pipelines for training as well
● Swap the map_batches(Model) call for streaming_split(K)

Inference:
    ray.data.read_datasource(...) \
        .map_batches(preprocess) \
        .map_batches(Model, compute=ActorPoolStrategy(...)) \
        .write_datasource(...)

Training:
    iters = ray.data.read_datasource(...) \
        .map_batches(preprocess) \
        .streaming_split(len(workers))

    for i, w in enumerate(workers):
        w.set_data_iterator.remote(iters[i])

    ## in worker
    for batch in it.iter_batches(batch_size=32):
        model.forward(batch)...

Slide 89

Advantages of streaming for training
● Example of accelerating an expensive Read/Preprocessing operation by adding CPU nodes to a cluster

Slide 90

Summary
● Ray Data streaming scales batch inference and training workloads
● More efficient computation model than bulk processing
● Simple API for composing streaming topologies

Next steps:
● Streaming inference is available in 2.4: docs.ray.io
● Ray Train integration coming in 2.6
● We're hardening streaming to work robustly on 100+ node clusters and 10M+ input files. Contact us!