Streaming distributed execution across CPUs and GPUs

Some of the most demanding machine learning (ML) use cases we have encountered involve pipelines that span both CPU and GPU devices in distributed environments. Common workloads include:

* Batch inference, which involves a CPU-intensive preprocessing stage (e.g., video decoding or image resizing) before a GPU-intensive model makes predictions.
* Distributed training, where similar CPU-heavy transformations are required to prepare or augment the dataset prior to GPU training.

In this talk, we examine how Ray Data streaming works and how to use it in your own machine learning pipelines to address these common workloads, utilizing all of your compute resources (CPUs and GPUs) at scale.

Anyscale

June 22, 2023

Transcript

  1. Streaming distributed execution
    across CPUs and GPUs
    June 21, 2023
    Eric Liang
    Email: [email protected]

  2. Talk Overview
    ● ML Inference and Training workloads
    ● How it relates to Ray Data streaming (new feature in Ray 2.4!)
    ● Examples
    ● Backend overview

  3. About me
    ● Technical lead for Ray / OSS at Anyscale
    ● Previously:
    ○ PhD in systems / ML at Berkeley
    ○ Staff eng @ Databricks, storage infra @ Google

  4. ML workloads and Data

  5. ML Workloads
    ● Where does data processing come in for ML workloads?
    ETL Pipeline → Preprocessing → Training / Inference

  6. ML Workloads
    ETL Pipeline: ingesting latest data, joining data tables
    Preprocessing: resizing images, decoding videos, data augmentation
    Training / Inference: using PyTorch, TF, HuggingFace, LLMs, etc.

  7. ML Workloads
    ETL Pipeline (CPU): ingesting latest data, joining data tables
    Preprocessing (CPU): resizing images, decoding videos, data augmentation
    Training / Inference (GPU): using PyTorch, TF, HuggingFace, LLMs, etc.

  8. ML Workloads
    ETL Pipeline (CPU): ingesting latest data, joining data tables
    Preprocessing (CPU): resizing images, decoding videos, data augmentation
    Training / Inference (GPU): using PyTorch, TF, HuggingFace, LLMs, etc.
    (Usual scope of ML teams: preprocessing and training / inference.)

  9. Our vision for simplifying this with Ray
    ETL Pipeline → Preprocessing → Training / Inference

  10. Our vision for simplifying this with Ray
    ETL Pipeline → Preprocessing → Training / Inference

  11. Our vision for simplifying this with Ray
    ETL Pipeline → Preprocessing → Training / Inference

  12. Ray Data: Overview

  13. Ray Data overview

  14. Ray Data overview

  15. Ray Data overview
    ray.data.Dataset
    (Diagram: a Dataset is a distributed collection of data blocks; Node 1, Node 2, and Node 3 each hold one or more blocks.)
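    Not on the slide: a minimal sketch of inspecting a Dataset's blocks from the API side, assuming a small synthetic dataset created with ray.data.range (the numbers are illustrative):

      import ray

      # Create a small in-memory Dataset; Ray partitions it into blocks
      # distributed across the cluster's object store.
      ds = ray.data.range(1000)

      print(ds.num_blocks())   # number of blocks backing this Dataset
      print(ds.schema())       # column names and types
      print(ds.take(5))        # peek at a few rows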

  16. Ray Data overview
    Powered by
    ray.data.Dataset

  17. Ray Data overview
    High-performance distributed IO:
    ● Leverages Apache Arrow's high-performance single-threaded IO
    ● Parallelized using Ray's high-throughput task execution
    ● Scales to PiB-scale jobs in production (Amazon)
    Read from storage:
    ds = ray.data.read_parquet("s3://some/bucket")
    ds = ray.data.read_csv("/tmp/some_file.csv")
    Transform data:
    ds = ds.map_batches(batch_func)
    ds = ds.map(func)
    Consume data:
    ds.iter_batches() -> Iterator
    ds.write_parquet("s3://some/bucket")
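    Not from the slide: a minimal end-to-end sketch combining the three steps above; the bucket paths and the "value" column are illustrative placeholders:

      import ray

      def add_one(batch):
          # batch is a dict of NumPy arrays when batch_format="numpy"
          batch["value"] = batch["value"] + 1
          return batch

      ds = ray.data.read_parquet("s3://some/bucket")        # read from storage
      ds = ds.map_batches(add_one, batch_format="numpy")    # transform data

      for batch in ds.iter_batches(batch_size=1024):        # consume data in batches...
          pass

      ds.write_parquet("s3://some/other/bucket")            # ...or write it back out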

  18. Ray Data Streaming

  19. Bulk execution
    Previous versions of Ray Data (<2.4) used a bulk execution strategy
    What is bulk execution?
    ● Load all data into memory
    ● Apply transformations on in-memory data in bulk
    ● Out of memory? -> spill blocks to disk
    ● Similar to Spark's execution model (bulk synchronous parallel)

  20. Streaming (pipelined) execution
    ● Default execution strategy for Ray Data in 2.4
    ● Same data transformations API
    ● Instead of executing operations in bulk, build a pipeline of
    operators
    ● Data blocks are streamed through operators, reducing memory
    use and avoiding spilling to disk
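    Not on the slide: a small sketch showing that the same Dataset code runs under the streaming executor when you iterate over it (the preprocess function is a placeholder); ds.stats() afterwards reports per-operator execution statistics:

      import ray

      def preprocess(batch):
          # placeholder CPU-side transformation
          return batch

      ds = ray.data.read_parquet("s3://some/bucket").map_batches(preprocess)

      # Iteration drives the streaming executor: blocks flow through the
      # operators as they are produced instead of being materialized in bulk.
      for batch in ds.iter_batches(batch_size=256):
          pass

      print(ds.stats())  # per-operator timing and memory report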

  21. Preprocessing can often be the bottleneck
    ● Example: video decoding prior to inference / training
    ● Too expensive to run on just GPU nodes: needs scaling out
    ● Large intermediate data: uses lots of memory

  22. Ray Data streaming avoids the bottleneck
    ● E.g., intermediate video frames streamed through memory
    ● Decoding can be offloaded onto CPU nodes from GPU nodes
    ● Intermediate frames kept purely in (cluster) memory

  23. Inference <> Training
    ● Same streaming pipeline can easily be used for training too!

  24. Inference <> Training
    ● Same streaming pipeline can easily be used for training too!
    (Diagram: a Split[3] operator feeding Worker [0], Worker [1], and Worker [2].)

  25. Streaming performance
    benefits deep dive

  26. Performance benefits overview
    Comparing bulk execution vs. streaming execution for:
    ● CPU-only pipelines (single-stage)
    ● Heterogeneous CPU+GPU pipelines (multi-stage)

  27. Performance benefits overview
    CPU-only pipelines (single-stage):
    ● Bulk execution: memory optimal; good for inference; bad for training

  28. Performance benefits overview
    CPU-only pipelines (single-stage):
    ● Bulk execution: memory optimal; good for inference; bad for training
    ● Streaming execution: memory optimal; good for inference; good for training

  29. Performance benefits overview
    CPU-only pipelines (single-stage):
    ● Bulk execution: memory optimal; good for inference; bad for training
    ● Streaming execution: memory optimal; good for inference; good for training
    Heterogeneous CPU+GPU pipelines (multi-stage):
    ● Bulk execution: memory inefficient; slower for inference; bad for training

  30. Performance benefits overview
    CPU-only pipelines (single-stage):
    ● Bulk execution: memory optimal; good for inference; bad for training
    ● Streaming execution: memory optimal; good for inference; good for training
    Heterogeneous CPU+GPU pipelines (multi-stage):
    ● Bulk execution: memory inefficient; slower for inference; bad for training
    ● Streaming execution: memory optimal; good for inference; good for training

  31. In more detail: simple batch inference job
    Logical data flow:

  32. A simple batch inference job

  33. A simple batch inference job

  34. A simple batch inference job

  35. A simple batch inference job

  36. A simple batch inference job

  37. A simple batch inference job

  38. This is a single stage pipeline
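    The code on slides 32-37 isn't captured in the transcript; a minimal sketch of a single-stage batch inference job of this shape, with the model class, paths, and pool sizes as illustrative placeholders:

      import numpy as np
      import ray

      class Model:
          def __init__(self):
              # load model weights once per actor
              pass

          def __call__(self, batch):
              # replace with a real forward pass; here we just emit a dummy column
              batch["pred"] = np.zeros(len(batch["image"]))
              return batch

      ds = ray.data.read_images("s3://some/images")
      ds = ds.map_batches(Model,
                          compute=ray.data.ActorPoolStrategy(min_size=1, max_size=4),
                          batch_format="numpy")
      ds.write_parquet("s3://some/output")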

  39. Bulk physical execution
    (Diagram: output 1, output 2, output 3.)

  40. Bulk physical execution -- single stage
    ● Memory usage is optimal (no intermediate data)
    ● Good for inference
    ● Not good for distributed training (cannot consume results
    incrementally)

  41. Streaming physical execution
    (Diagram: Operator (Stage 1); data partition 1, data partition 2, data partition 3.)

  42. Streaming physical execution
    (Diagram: Operator (Stage 1); data partition 1, data partition 2, data partition 3.)

  43. Streaming physical execution
    (Diagram: Operator (Stage 1); data partitions 1-3; output 1.)

  44. Streaming physical execution
    (Diagram: Operator (Stage 1); data partitions 1-3; output 1.)

  45. Streaming physical execution
    (Diagram: Operator (Stage 1); data partitions 1-3; output 1, output 2.)

  46. Streaming physical execution
    (Diagram: Operator (Stage 1); data partitions 1-3; output 1, output 2.)

  47. Streaming physical execution
    (Diagram: Operator (Stage 1); data partitions 1-3; output 1, output 2, output 3.)

  48. Streaming physical execution -- single stage
    ● Memory usage is optimal (no intermediate data)
    ● Good for inference
    ● Good for distributed training

  49. In more detail: multi-stage (heterogeneous) pipeline
    (Diagram: logical data flow; the final stage runs on GPU.)

  50. Heterogeneous pipeline (CPU + GPU)

  51. Heterogeneous pipeline (CPU + GPU)

  52. Heterogeneous pipeline (CPU + GPU)
    GPU
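    Not captured from the slide: a minimal sketch of such a heterogeneous pipeline, where the CPU stage is a plain function executed as Ray tasks and the GPU stage is a class executed on an actor pool with num_gpus=1 (the function, class, and paths are illustrative):

      import ray

      def decode_and_resize(batch):
          # CPU-heavy preprocessing, runs as Ray tasks (can scale onto CPU-only nodes)
          return batch

      class GpuModel:
          def __init__(self):
              # load the model onto the GPU once per actor
              pass

          def __call__(self, batch):
              # GPU inference
              return batch

      ds = (
          ray.data.read_binary_files("s3://some/videos")
          .map_batches(decode_and_resize)                                   # CPU stage
          .map_batches(GpuModel,
                       compute=ray.data.ActorPoolStrategy(min_size=1, max_size=2),
                       num_gpus=1,                                          # GPU stage
                       batch_size=64)
      )
      ds.write_parquet("s3://some/output")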

  53. Comparing bulk vs. streaming execution

  54. Bulk physical execution

  55. Bulk physical execution

  56. Bulk physical execution

  57. Bulk physical execution -- multi stage
    ● Memory usage is inefficient (disk spilling)
    ● Slower for inference
    ● Bad for distributed training

  58. Streaming physical execution
    Operator (Stage 1) Operator (Stage 2) Operator (Stage 3)

  59. Streaming physical execution
    Operator (Stage 1) Operator (Stage 2) Operator (Stage 3)

  60. Streaming physical execution
    Operator (Stage 1) Operator (Stage 2) Operator (Stage 3)

  61. Streaming physical execution
    Operator (Stage 1) Operator (Stage 2) Operator (Stage 3)

  62. Streaming physical execution
    Operator (Stage 1) Operator (Stage 2) Operator (Stage 3)

  63. Streaming physical execution
    Operator (Stage 1) Operator (Stage 2) Operator (Stage 3)

  64. Streaming physical execution -- multi stage
    ● Memory usage is optimal (no intermediate data)
    ● Good for inference
    ● Good for distributed training

  65. Comparison to other systems
    ● DataFrame systems (e.g., Spark)
    ○ Ray Data streaming is more memory efficient
    ○ Ray Data supports heterogeneous clusters
    ○ Execution model a better fit for distributed training
    ● ML ingest libraries: TF Data / Torch Data / Petastorm
    ○ Ray Data supports scaling preprocessing out to a cluster

  66. Video Inference Example

  67. Four stage pipeline
    Logical data flow:

  68. Four stage pipeline

  69. Four stage pipeline

  70. Four stage pipeline

  71. Putting it together

  72. Putting it together

  73. Putting it together

  74. Putting it together

  75. Putting it together
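    The code on slides 68-75 isn't captured in the transcript; a rough sketch of how the four stages might be composed, reusing the operator names that appear in the execution plan on the following slides (the function and class bodies are placeholders):

      import ray

      def decode_frames(batch):
          # CPU stage: decode video bytes into frames (runs as Ray tasks)
          return batch

      class FrameAnnotator:
          def __call__(self, batch):
              # CPU stage: annotate frames (runs on an actor pool)
              return batch

      class FrameClassifier:
          def __init__(self):
              # load the classification model onto the GPU
              pass

          def __call__(self, batch):
              # GPU stage: classify frames
              return batch

      (
          ray.data.read_binary_files("s3://some/videos")
          .map_batches(decode_frames)
          .map_batches(FrameAnnotator,
                       compute=ray.data.ActorPoolStrategy(min_size=1, max_size=8))
          .map_batches(FrameClassifier,
                       compute=ray.data.ActorPoolStrategy(min_size=1, max_size=2),
                       num_gpus=1, batch_size=64)
          .write_parquet("s3://some/output")
      )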

  76. Streaming execution plan

  77. Running this on a Ray cluster
    $ python workload.py

  78. Running this on a Ray cluster
    $ python workload.py
    2023-05-02 15:10:01,105 INFO streaming_executor.py:91 -- Executing DAG
    InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(decode_frames)] ->
    ActorPoolMapOperator[MapBatches(FrameAnnotator)] ->
    ActorPoolMapOperator[MapBatches(FrameClassifier)] ->
    TaskPoolMapOperator[Write]

  79. Running this on a Ray cluster
    $ python workload.py
    2023-05-02 15:10:01,105 INFO streaming_executor.py:91 -- Executing DAG
    InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(decode_frames)] ->
    ActorPoolMapOperator[MapBatches(FrameAnnotator)] ->
    ActorPoolMapOperator[MapBatches(FrameClassifier)] ->
    TaskPoolMapOperator[Write]
    Running: 25.0/112.0 CPU, 2.0/2.0 GPU, 33.19 GiB/32.27 GiB object_store_memory:
    28%|███▍ | 285/1000 [01:40<03:12, 3.71it/s]

  80. Running this on a Ray cluster
    $ python workload.py
    2023-05-02 15:10:01,105 INFO streaming_executor.py:91 -- Executing DAG
    InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(decode_frames)] ->
    ActorPoolMapOperator[MapBatches(FrameAnnotator)] ->
    ActorPoolMapOperator[MapBatches(FrameClassifier)] ->
    TaskPoolMapOperator[Write]
    Running: 0.0/112.0 CPU, 0.0/2.0 GPU, 0.0 GiB/32.27 GiB object_store_memory:
    100%|████████████| 1000/1000 [05:13<00:00, 3.85it/s]

  81. Ray dashboard observability: active tasks

  82. Ray dashboard observability: actor state

  83. Ray dashboard observability: network

  84. Backend overview

  85. How does Ray Data streaming work?
    ● Each transformation is implemented as an operator
    ● Use Ray tasks and actors for execution of operators
    ○ default -> use Ray tasks
    ○ actor pool -> use Ray actors
    ● Intermediate data blocks in Ray object store
    ● Memory usage of operators is limited (backpressure) to enable
    efficient streaming without spilling to disk
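    Not on the slide: a minimal illustration of the task vs. actor mapping described above; a plain function defaults to Ray tasks, while a class passed with compute=ActorPoolStrategy runs on a pool of Ray actors (names and sizes are illustrative):

      import ray

      def stateless_fn(batch):
          # default: executed by Ray tasks
          return batch

      class StatefulFn:
          def __init__(self):
              self.state = "expensive setup, done once per actor"

          def __call__(self, batch):
              return batch

      ds = ray.data.read_parquet("s3://some/bucket")
      ds = ds.map_batches(stateless_fn)                 # task-based operator
      ds = ds.map_batches(StatefulFn,                   # actor-pool operator
                          compute=ray.data.ActorPoolStrategy(min_size=2, max_size=4))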

  86. Advantages of using Ray core primitives
    ● Heterogeneous cluster support
    ● Fault tolerance out of the box
    ○ Lineage-based reconstruction of task and actor operations
    → your ML job will survive failures during preprocessing
    ● Resilient object store layer
    ○ Spills to disk in case of unexpectedly high memory usage:
    slowdown instead of a crash
    ○ Can also do large-scale shuffles in a pinch
    ● Easy to add data locality optimizations for both task + actor ops

  87. Scalability
    ● Stress tests on a 20TiB array dataset
    ● 500 machines

  88. Training
    ● Can use pipelines for training as well
    ● Swap map_batches(Model) call for streaming_split(K)
    Inference:
    ray.data.read_datasource(...) \
        .map_batches(preprocess) \
        .map_batches(Model,
                     compute=ActorPoolStrategy(...)) \
        .write_datasource(...)

    Training:
    iters = ray.data.read_datasource(...) \
        .map_batches(preprocess) \
        .streaming_split(len(workers))

    for i, w in enumerate(workers):
        w.set_data_iterator.remote(iters[i])

    ## in worker
    for batch in it.iter_batches(batch_size=32):
        model.forward(batch)...

  89. Advantages of streaming for Training
    ● Example: accelerating an expensive read/preprocessing operation by adding CPU nodes to the cluster

  90. Summary
    ● Ray Data streaming scales batch inference and training workloads
    ● More efficient computation model than bulk processing
    ● Simple API for composing streaming topologies
    Next steps:
    ● Streaming Inference is available in 2.4: docs.ray.io
    ● Ray Train integration coming in 2.6
    ● We're hardening streaming to work robustly on clusters of 100+ nodes and
    10M+ input files. Contact us!
