Slide 1

Streaming distributed execution across CPUs and GPUs
June 21, 2023
Eric Liang
Email: [email protected]

Slide 2

Talk Overview
● ML inference and training workloads
● How they relate to Ray Data streaming (a new feature in Ray 2.4!)
● Examples
● Backend overview

Slide 3

About me
● Technical lead for Ray / OSS at Anyscale
● Previously:
  ○ PhD in systems / ML at Berkeley
  ○ Staff eng @ Databricks, storage infra @ Google

Slide 4

ML workloads and Data

Slide 5

ML Workloads
● Where does data processing come in for ML workloads?
Diagram: ETL Pipeline → Preprocessing → Training / Inference

Slide 6

ML Workloads
Diagram: ETL Pipeline → Preprocessing → Training / Inference
● ETL Pipeline: ingesting latest data, joining data tables
● Preprocessing: resizing images, decoding videos, data augmentation
● Training / Inference: using PyTorch, TF, HuggingFace, LLMs, etc.

Slide 7

ML Workloads
Diagram: ETL Pipeline (CPU) → Preprocessing (CPU) → Training / Inference (GPU)
● ETL Pipeline: ingesting latest data, joining data tables
● Preprocessing: resizing images, decoding videos, data augmentation
● Training / Inference: using PyTorch, TF, HuggingFace, LLMs, etc.

Slide 8

ML Workloads
Diagram: ETL Pipeline (CPU) → Preprocessing (CPU) → Training / Inference (GPU); usual scope of ML teams: Preprocessing and Training / Inference
● ETL Pipeline: ingesting latest data, joining data tables
● Preprocessing: resizing images, decoding videos, data augmentation
● Training / Inference: using PyTorch, TF, HuggingFace, LLMs, etc.

Slide 9

Our vision for simplifying this with Ray
Diagram: ETL Pipeline → Preprocessing → Training / Inference

Slide 10

Our vision for simplifying this with Ray
Diagram: ETL Pipeline → Preprocessing → Training / Inference

Slide 11

Our vision for simplifying this with Ray
Diagram: ETL Pipeline → Preprocessing → Training / Inference

Slide 12

Ray Data: Overview

Slide 13

Ray Data overview

Slide 14

Ray Data overview

Slide 15

Ray Data overview
Diagram: a ray.data.Dataset is a set of blocks distributed across the nodes of a cluster (Node 1, Node 2, Node 3)
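
A minimal sketch of the Dataset-of-blocks idea (hedged: this assumes the Ray 2.4-era Python API; ray.data.range just builds a toy dataset for illustration):

    import ray

    # A Dataset is a distributed collection of blocks; operations run in
    # parallel over those blocks across the nodes of the cluster.
    ds = ray.data.range(1000)

    print(ds.num_blocks())   # number of underlying blocks
    print(ds.schema())       # schema shared by all blocks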

Slide 16

Ray Data overview
Powered by ray.data.Dataset

Slide 17

Ray Data overview

Read from storage: high-performance distributed IO
    ds = ray.data.read_parquet("s3://some/bucket")
    ds = ray.data.read_csv("/tmp/some_file.csv")
● Leverages Apache Arrow’s high-performance single-threaded IO
● Parallelized using Ray’s high-throughput task execution
● Scales to PiB-scale jobs in production (Amazon)

Transform data
    ds = ds.map_batches(batch_func)
    ds = ds.map(func)

Consume data
    ds.iter_batches() -> Iterator
    ds.write_parquet("s3://some/bucket")
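
Putting the three pieces above together, a rough end-to-end sketch could look like this (hedged: the paths and batch_func are placeholders from the slide, not real resources):

    import ray

    def batch_func(batch):
        # Placeholder per-batch transform; `batch` is a dict of NumPy arrays
        # (or a pandas DataFrame, depending on batch_format).
        return batch

    # Read from storage (distributed, Arrow-based IO)
    ds = ray.data.read_csv("/tmp/some_file.csv")   # placeholder path

    # Transform data
    ds = ds.map_batches(batch_func)

    # Consume data
    for batch in ds.iter_batches(batch_size=256):
        pass  # e.g., feed the batch to a model

    # ...or write the results back out
    ds.write_parquet("/tmp/output")                 # placeholder path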

Slide 18

Ray Data Streaming

Slide 19

Bulk execution
Previous versions of Ray Data (<2.4) used a bulk execution strategy. What is bulk execution?
● Load all data into memory
● Apply transformations on the in-memory data in bulk
● Out of memory? -> spill blocks to disk
● Similar to Spark's execution model (bulk synchronous parallel)

Slide 20

Streaming (pipelined) execution
● Default execution strategy for Ray Data in 2.4
● Same data transformation API
● Instead of executing operations in bulk, build a pipeline of operators
● Data blocks are streamed through the operators, reducing memory use and avoiding spilling to disk

Slide 21

Preprocessing can often be the bottleneck
● Example: video decoding prior to inference / training
● Too expensive to run on just GPU nodes: needs scaling out
● Large intermediate data: uses lots of memory

Slide 22

Ray Data streaming avoids the bottleneck
● E.g., intermediate video frames are streamed through memory
● Decoding can be offloaded from GPU nodes onto CPU nodes
● Intermediate frames are kept purely in (cluster) memory
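
A rough sketch of this pattern (hedged: decode_frames and FrameClassifier mirror the names in the example pipeline later in the talk, but the bodies, the decode_video/load_model helpers, and the S3 paths are placeholders):

    import ray
    from ray.data import ActorPoolStrategy

    def decode_frames(batch):
        # CPU-only work, run as Ray tasks that can be scheduled on CPU nodes.
        batch["frames"] = [decode_video(b) for b in batch["bytes"]]  # placeholder decoder
        return batch

    class FrameClassifier:
        def __init__(self):
            self.model = load_model()  # placeholder: load weights once per actor

        def __call__(self, batch):
            batch["labels"] = self.model(batch["frames"])
            return batch

    ds = (
        ray.data.read_binary_files("s3://some/videos")          # placeholder path
        .map_batches(decode_frames)                              # plain function -> Ray tasks (CPU)
        .map_batches(
            FrameClassifier,
            compute=ActorPoolStrategy(min_size=1, max_size=2),   # pool of Ray actors
            num_gpus=1,                                          # one GPU per actor
        )
    )
    ds.write_parquet("s3://some/output")                          # placeholder path

The intermediate frames only exist as blocks flowing through the object store between the two stages; nothing is written to external storage between decode and inference.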

Slide 23

Inference <> Training
● Same streaming pipeline can easily be used for training too!

Slide 24

Inference <> Training
● Same streaming pipeline can easily be used for training too!
Diagram: Split[3] feeding Worker [0], Worker [1], Worker [2]

Slide 25

Streaming performance benefits deep dive

Slide 26

Performance benefits overview: bulk execution vs. streaming execution, for CPU-only pipelines (single-stage) and heterogeneous CPU+GPU pipelines (multi-stage)

Slide 27

Performance benefits overview (bulk vs. streaming execution; CPU-only single-stage vs. heterogeneous CPU+GPU multi-stage pipelines)
● CPU-only (single-stage) + bulk execution: memory optimal, good for inference, bad for training

Slide 28

Performance benefits overview (bulk vs. streaming execution; CPU-only single-stage vs. heterogeneous CPU+GPU multi-stage pipelines)
● CPU-only (single-stage) + bulk execution: memory optimal, good for inference, bad for training
● CPU-only (single-stage) + streaming execution: memory optimal, good for inference, good for training

Slide 29

Performance benefits overview (bulk vs. streaming execution; CPU-only single-stage vs. heterogeneous CPU+GPU multi-stage pipelines)
● CPU-only (single-stage) + bulk execution: memory optimal, good for inference, bad for training
● CPU-only (single-stage) + streaming execution: memory optimal, good for inference, good for training
● Heterogeneous CPU+GPU (multi-stage) + bulk execution: memory inefficient, slower for inference, bad for training

Slide 30

Performance benefits overview (bulk vs. streaming execution; CPU-only single-stage vs. heterogeneous CPU+GPU multi-stage pipelines)
● CPU-only (single-stage) + bulk execution: memory optimal, good for inference, bad for training
● CPU-only (single-stage) + streaming execution: memory optimal, good for inference, good for training
● Heterogeneous CPU+GPU (multi-stage) + bulk execution: memory inefficient, slower for inference, bad for training
● Heterogeneous CPU+GPU (multi-stage) + streaming execution: memory optimal, good for inference, good for training

Slide 31

In more detail: a simple batch inference job
Logical data flow:

Slide 32

A simple batch inference job

Slide 33

A simple batch inference job

Slide 34

A simple batch inference job

Slide 35

A simple batch inference job

Slide 36

A simple batch inference job

Slide 37

A simple batch inference job

Slide 38

This is a single stage pipeline

Slide 39

Bulk physical execution
Diagram: output 1, output 2, output 3

Slide 40

Bulk physical execution -- single stage
● Memory usage is optimal (no intermediate data)
● Good for inference
● Not good for distributed training (cannot consume results incrementally)

Slide 41

Streaming physical execution
Diagram: Operator (Stage 1) processing data partitions 1-3

Slide 42

Streaming physical execution
Diagram: Operator (Stage 1) processing data partitions 1-3

Slide 43

Streaming physical execution
Diagram: Operator (Stage 1) processing data partitions 1-3; output 1

Slide 44

Streaming physical execution
Diagram: Operator (Stage 1) processing data partitions 1-3; output 1

Slide 45

Streaming physical execution
Diagram: Operator (Stage 1) processing data partitions 1-3; output 1, output 2

Slide 46

Streaming physical execution
Diagram: Operator (Stage 1) processing data partitions 1-3; output 1, output 2

Slide 47

Streaming physical execution
Diagram: Operator (Stage 1) processing data partitions 1-3; output 1, output 2, output 3

Slide 48

Streaming physical execution -- single stage
● Memory usage is optimal (no intermediate data)
● Good for inference
● Good for distributed training

Slide 49

In more detail: multi-stage (heterogeneous) pipeline
Diagram: pipeline with a GPU stage

Slide 50

Heterogeneous pipeline (CPU + GPU)

Slide 51

Heterogeneous pipeline (CPU + GPU)

Slide 52

Heterogeneous pipeline (CPU + GPU)
Diagram: GPU stage

Slide 53

Comparing bulk vs. streaming execution

Slide 54

Bulk physical execution

Slide 55

Bulk physical execution

Slide 56

Bulk physical execution

Slide 57

Bulk physical execution -- multi stage
● Memory usage is inefficient (disk spilling)
● Slower for inference
● Bad for distributed training

Slide 58

Streaming physical execution
Diagram: Operator (Stage 1) → Operator (Stage 2) → Operator (Stage 3)

Slide 59

Streaming physical execution
Diagram: Operator (Stage 1) → Operator (Stage 2) → Operator (Stage 3)

Slide 60

Streaming physical execution
Diagram: Operator (Stage 1) → Operator (Stage 2) → Operator (Stage 3)

Slide 61

Streaming physical execution
Diagram: Operator (Stage 1) → Operator (Stage 2) → Operator (Stage 3)

Slide 62

Streaming physical execution
Diagram: Operator (Stage 1) → Operator (Stage 2) → Operator (Stage 3)

Slide 63

Streaming physical execution
Diagram: Operator (Stage 1) → Operator (Stage 2) → Operator (Stage 3)

Slide 64

Streaming physical execution -- multi stage
● Memory usage is optimal (no intermediate data)
● Good for inference
● Good for distributed training

Slide 65

Comparison to other systems
● DataFrame systems (e.g., Spark)
  ○ Ray Data streaming is more memory efficient
  ○ Ray Data supports heterogeneous clusters
  ○ The execution model is a better fit for distributed training
● ML ingest libraries: TF Data / Torch Data / Petastorm
  ○ Ray Data supports scaling preprocessing out to a cluster

Slide 66

Video Inference Example

Slide 67

Four stage pipeline
Logical data flow:

Slide 68

Four stage pipeline

Slide 69

Four stage pipeline

Slide 70

Four stage pipeline

Slide 71

Putting it together

Slide 72

Putting it together

Slide 73

Putting it together

Slide 74

Putting it together

Slide 75

Putting it together
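
As a hedged reconstruction of the code these slides build up, the four-stage job could look roughly like this; the operator names match the DAG printed in the log on the following slides, but the function/class bodies, pool sizes, and paths are placeholders:

    import ray
    from ray.data import ActorPoolStrategy

    def decode_frames(batch):
        # Stage 1 (Ray tasks, CPU): decode video bytes into frames. Placeholder body.
        return batch

    class FrameAnnotator:
        # Stage 2 (actor pool, CPU): annotate/resize frames. Placeholder body.
        def __call__(self, batch):
            return batch

    class FrameClassifier:
        # Stage 3 (actor pool, GPU): run the classification model. Placeholder body.
        def __init__(self):
            self.model = ...  # load model once per actor

        def __call__(self, batch):
            return batch

    (
        ray.data.read_binary_files("s3://videos")                   # placeholder input
        .map_batches(decode_frames)                                   # TaskPoolMapOperator
        .map_batches(FrameAnnotator, compute=ActorPoolStrategy(min_size=1, max_size=8))
        .map_batches(FrameClassifier, compute=ActorPoolStrategy(min_size=1, max_size=2), num_gpus=1)
        .write_parquet("s3://results")                                # placeholder output (Write)
    )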

Slide 76

Streaming execution plan

Slide 77

Running this on a Ray cluster
$ python workload.py

Slide 78

Running this on a Ray cluster
$ python workload.py
2023-05-02 15:10:01,105 INFO streaming_executor.py:91 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(decode_frames)] -> ActorPoolMapOperator[MapBatches(FrameAnnotator)] -> ActorPoolMapOperator[MapBatches(FrameClassifier)] -> TaskPoolMapOperator[Write]

Slide 79

Running this on a Ray cluster
$ python workload.py
2023-05-02 15:10:01,105 INFO streaming_executor.py:91 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(decode_frames)] -> ActorPoolMapOperator[MapBatches(FrameAnnotator)] -> ActorPoolMapOperator[MapBatches(FrameClassifier)] -> TaskPoolMapOperator[Write]
Running: 25.0/112.0 CPU, 2.0/2.0 GPU, 33.19 GiB/32.27 GiB object_store_memory: 28%|███▍ | 285/1000 [01:40<03:12, 3.71it/s]

Slide 80

Running this on a Ray cluster
$ python workload.py
2023-05-02 15:10:01,105 INFO streaming_executor.py:91 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(decode_frames)] -> ActorPoolMapOperator[MapBatches(FrameAnnotator)] -> ActorPoolMapOperator[MapBatches(FrameClassifier)] -> TaskPoolMapOperator[Write]
Running: 0.0/112.0 CPU, 0.0/2.0 GPU, 0.0 GiB/32.27 GiB object_store_memory: 100%|████████████| 1000/1000 [05:13<00:00, 3.85it/s]

Slide 81

Ray dashboard observability: active tasks

Slide 82

Ray dashboard observability: actor state

Slide 83

Ray dashboard observability: network

Slide 84

Backend overview

Slide 85

How does Ray Data streaming work?
● Each transformation is implemented as an operator
● Use Ray tasks and actors to execute operators
  ○ default -> use Ray tasks
  ○ actor pool -> use Ray actors
● Intermediate data blocks live in the Ray object store
● Memory usage of operators is limited (backpressure) to enable efficient streaming without spilling to disk
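
To make the task/actor mapping concrete, here is a small sketch (hedged: preprocess and Model are placeholder names, and the resource-limit knob at the end follows the DataContext API of recent Ray releases, which may differ slightly between versions):

    import ray
    from ray.data import ActorPoolStrategy

    def preprocess(batch):       # stateless function -> run by Ray tasks (default)
        return batch

    class Model:                 # stateful callable class -> run by a pool of Ray actors
        def __init__(self):
            self.model = ...     # placeholder: load weights once per actor

        def __call__(self, batch):
            return batch

    ds = (
        ray.data.read_parquet("s3://some/bucket")                              # placeholder path
        .map_batches(preprocess)                                                # TaskPoolMapOperator
        .map_batches(Model, compute=ActorPoolStrategy(min_size=1, max_size=4))  # ActorPoolMapOperator
    )

    # Optionally cap the object store memory the streaming executor may use;
    # backpressure pauses upstream operators once the limit is reached.
    ctx = ray.data.DataContext.get_current()
    ctx.execution_options.resource_limits.object_store_memory = 2 * 1024**3  # 2 GiB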

Slide 86

Advantages of using Ray core primitives
● Heterogeneous cluster support
● Fault tolerance out of the box
  ○ Lineage-based reconstruction of task and actor operations → your ML job will survive failures during preprocessing
● Resilient object store layer
  ○ Spills to disk in case of unexpectedly high memory usage: a slowdown instead of a crash
  ○ Can also do large-scale shuffles in a pinch
● Easy to add data locality optimizations for both task and actor ops

Slide 87

Scalability
● Stress tests on a 20 TiB array dataset
● 500 machines

Slide 88

Training
● Can use pipelines for training as well
● Swap the map_batches(Model) call for streaming_split(K)

Inference:
    ray.data.read_datasource(...) \
        .map_batches(preprocess) \
        .map_batches(Model, compute=ActorPoolStrategy(...)) \
        .write_datasource(...)

Training:
    iters = ray.data.read_datasource(...) \
        .map_batches(preprocess) \
        .streaming_split(len(workers))

    for i, w in enumerate(workers):
        w.set_data_iterator.remote(iters[i])

    ## in worker
    for batch in it.iter_batches(batch_size=32):
        model.forward(batch)...

Slide 89

Advantages of streaming for training
● Example of accelerating an expensive Read/Preprocessing operation by adding CPU nodes to a cluster

Slide 90

Summary
● Ray Data streaming scales batch inference and training workloads
● More efficient computation model than bulk processing
● Simple API for composing streaming topologies

Next steps:
● Streaming inference is available in 2.4: docs.ray.io
● Ray Train integration coming in 2.6
● We're hardening streaming to work robustly on 100+ node clusters and 10M+ input files. Contact us!