Slide 1

Ray Datasets: Scalable Data Preprocessing for Distributed ML
Clark Zinzow, Software Engineer

Slide 2

Overview
01 State of ML Training/Scoring Pipelines
02 Introducing Ray Datasets
03 Use Cases

Slide 3

Existing ML Pipelines
Preprocess → Checkpoint → Train

Slide 4

Existing ML Pipelines

Slide 5

Challenges
● Performance overheads
  ○ Serialization/Deserialization
  ○ Data materialized to external storage
● Implementation/Operational Complexity
  ○ Cross-lang, cross-workload
  ○ CPUs vs GPUs
● Missing operations
  ○ Per-epoch shuffling
    ■ How to do a fast, in-memory, distributed shuffle?

Slide 6

Ray Datasets
Built on Ray

Slide 7

Ray Datasets
01 Universal Data Loading
02 Last Mile Preprocessing
03 Parallel GPU/CPU Compute
(Diagram: a ray.data.Dataset is a set of blocks distributed across nodes: Node 1, Node 2, and Node 3.)

Slide 8

Ray Datasets
01 Universal Data Loading
02 Last Mile Preprocessing
03 Parallel GPU/CPU Compute
Not a DataFrame library!

Slide 9

Use case: Data loading for ML Training
(Diagram: ETL from unstructured data via large-scale relational data processing (Spark, Flink, Snowflake, Delta Lake, etc.) → Load data / last-mile preprocessing (Ray Datasets + integrations) → ML Training (Ray Train: Horovod, PyTorch, etc.), with the latter stages running on a Ray cluster.)

Slide 10

Universal Data Loader
Powered by ray.data.Dataset

Slide 11

Universal Data Loader - Supported Data Sources

Slide 12

Universal Data Loader - I/O API and Performance

High performance distributed IO
● Leverages Apache Arrow’s high-performance single-threaded IO
● Parallelized using Ray’s high-throughput task execution
● Scales to PiB-scale jobs in production (Amazon)

Read from storage:
ds = ray.data.read_parquet("s3://some/bucket")
ds = ray.data.read_csv("/tmp/some_file.csv")

Convert in memory:
ds = ray.data.from_pandas(df)
df = ds.to_pandas()
ds = ray.data.from_dask(dask_df)
dask_df = ds.to_dask()

Saving to storage:
ds.write_parquet("s3://some/bucket")
ds.write_csv("/tmp/some_file.csv")

Slide 13

Last-Mile Preprocessing
(Diagram: Input → Mapper, Mapper → Reducer, Reducer → Output)
• Basic transformations and aggregations
  • Map, batch map, filter
  • Stats aggregations
• Global shuffle operations
  • Sort
  • Random shuffle
  • Groupby

ray.data.read_parquet("foo.parquet") \
    .map_batches(
        process_batch, batch_size=512) \
    .repartition(16) \
    .groupby("col_1").mean("col_2")

ray.data.read_parquet("foo.parquet") \
    .filter(lambda x: x < 0) \
    .map(lambda x: x**2) \
    .random_shuffle() \
    .write_parquet("bar.parquet")

Slide 14

Compute - Pipelining CPU/GPU compute
Pipeline the execution of stages in your workflow:
1. Parallelizes stage execution (lower latency)
2. Reduces idle time of stage-specific resources (lowers cost)
(Timeline diagrams, t=0 to t=300s. Without pipelining: the GPU sits idle while loading and preprocessing run before inference. With pipelining: loading, preprocessing, and inference overlap.)
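
To make the pipelining idea concrete, here is a minimal sketch of a pipelined load → preprocess → inference workflow built on Dataset.window(); the same window() call is what the windowing discussion on the next slide refers to. The bucket path and the preprocess/infer functions below are placeholders.

    import ray

    def preprocess(record):
        # Placeholder CPU-side preprocessing of one record.
        return record

    def infer(batch):
        # Placeholder GPU-side inference over a batch of records.
        return batch

    # window() turns the Dataset into a DatasetPipeline, so loading,
    # preprocessing, and inference of different windows overlap in time
    # instead of running strictly one stage after the other.
    ray.data.read_binary_files("s3://example-bucket/images") \
        .window(blocks_per_window=4) \
        .map(preprocess) \
        .map_batches(infer, batch_size=256, num_gpus=1) \
        .write_json("/tmp/example_results")

With a small blocks_per_window, only a few blocks are in flight at any time, which keeps the GPU busy without materializing the whole dataset in memory.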

Slide 15

Windowing can use memory more efficiently
Execute your pipeline over windows of the full dataset. This limits your cluster resource utilization (lowers cost).
Without pipelining: the whole dataset is processed at once and might spill to disk if the dataset is large!
With pipelining: each window (Window 1, Window 2, Window 3) is processed sequentially, which allows us to keep all data in memory.

Slide 16

Why Ray?
● Efficient data layer
  ○ Zero-copy reads, shared-memory object store
  ○ Locality-aware scheduling
  ○ Object transfer protocols
● General purpose
  ○ Resource-based scheduling
  ○ Highly scalable
  ○ Robust primitives
  ○ Easy to programmatically compose distributed programs
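
As a small illustration of the "zero-copy reads, shared-memory object store" bullet, here is a sketch using plain Ray core rather than Datasets; the array shape and the column_means task are made up for the example.

    import numpy as np
    import ray

    ray.init()

    # The array is stored once in the node's shared-memory object store.
    big_array_ref = ray.put(np.zeros((100_000, 100)))

    @ray.remote
    def column_means(arr):
        # On the same node, `arr` is a read-only view backed by shared
        # memory: the ~80 MB payload is not copied or deserialized.
        return arr.mean(axis=0)

    means = ray.get(column_means.remote(big_array_ref))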

Slide 17

Use Cases
● Scalable shuffled ML ingest
● Efficient batch inference

Slide 18

Scalable Shuffled ML Ingest - Background

Slide 19

ML Ingest - Overview
Loading data into one or more model trainers
(Diagram: Data source → Data loader → Trainer #1, Trainer #2, Trainer #3)
Primary functions:
• Loading the data from storage
• Partitioning the data into a shard per trainer
• Batching the data into GPU batches
• Shuffling the data before each epoch
  • Local shuffling (status quo)
  • Global shuffling (better but hard)
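
A minimal sketch of those primary functions expressed with Ray Datasets; the path, shard count, and Trainer actor are placeholders, and the deck's own pipelined version appears on a later slide.

    import ray

    num_shards = 3  # One shard per trainer.

    # Load from storage, globally shuffle, and partition into per-trainer shards.
    shards = ray.data.read_parquet("s3://example-bucket/train") \
        .random_shuffle() \
        .split(num_shards, equal=True)

    @ray.remote
    class Trainer:
        def fit(self, shard):
            # Batch the shard into GPU-sized batches.
            for batch in shard.iter_batches(batch_size=512):
                pass  # Train on the batch.

    trainers = [Trainer.remote() for _ in range(num_shards)]
    ray.get([t.fit.remote(s) for t, s in zip(trainers, shards)])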

Slide 20

Aside - Why shuffle before each epoch?
TL;DR: For certain models, per-epoch shuffling gives you much better model accuracy after a set number of epochs
• Shuffling decorrelates samples within and across batches
• Per-epoch shuffling reduces variance, improves generalization, and prevents overfitting
• Batches are more representative of the entire dataset, improving the estimate of the “true” full-dataset gradient
• Gradient updates on individual samples are independent of sample ordering
• Results in improved statistical gain from each step in the training process
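
In Dataset terms, the simplest way to get a global per-epoch shuffle is to re-shuffle the full dataset at the top of every epoch, as in the sketch below (the path, epoch count, and batch size are placeholders); the pipelined random_shuffle_each_window() shown later expresses the same idea more efficiently.

    import ray

    ds = ray.data.read_parquet("s3://example-bucket/train")

    for epoch in range(10):
        # A fresh global shuffle each epoch decorrelates samples within
        # and across batches, unlike a small local shuffle buffer.
        shuffled = ds.random_shuffle()
        for batch in shuffled.iter_batches(batch_size=512):
            pass  # Train on the batch.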

Slide 21

Why is scalable shuffled ML ingest hard?
• Need parallelized, scalable reading from storage
• Need to avoid wasting idle GPUs while reading/shuffling data
(Timeline diagram, t=0 to t=200s, Epochs 1-4: shuffled data loading alternates with training, leaving the GPU idle during each load.)

Slide 22

Why is scalable shuffled ML ingest hard? - continued
• The full pipeline may not fit in memory
• Local shuffles hurt model accuracy, but…
• Global per-epoch shuffling is difficult to do efficiently in a distributed setting
• The API should still be simple!

Slide 23

Datasets solution - scalable IO
Massively scalable parallel IO, supporting many storage backends and formats

ray.data.read_parquet("s3://some/training/data/bucket", parallelism=20)

(Diagram: an S3 dataset is read by parallel read tasks 1-3, which feed Trainer #1, Trainer #2, and Trainer #3.)

Slide 24

Datasets solution - pipelining stages
Easy and efficient pipelining of preprocessing and shuffling with training, keeping the GPU saturated
(Timeline diagrams, t=0 to t=200s, Epochs 1-4. Unpipelined: the GPU is idle during each epoch’s shuffled data loading. Pipelined: shuffled data loading overlaps with training.)

Slide 25

Datasets solution - per-epoch shuffling
Efficient distributed in-memory per-epoch shuffling

ray.data.read_parquet(training_data_dir) \
    .repeat(num_epochs) \
    .random_shuffle_each_window() \
    .split(
        num_shards,
        equal=True,
        locality_hints=training_actors)

Locality-aware placement of shuffle shards onto trainers

Slide 26

Datasets solution - dataset windowing
Limit data loading, preprocessing, and shuffling to a smaller window of the full dataset
(Diagram: one epoch is split into Windows 1-3; each window’s shuffled data loading feeds the trainers in turn.)
This trades off shuffle quality with cluster size

Slide 27

Datasets solution - simple API

@ray.remote
def trainer(data: DatasetPipeline):
    for epoch_ds in data.iter_epochs():
        for X, y in epoch_ds.to_torch(
                label_column="label", batch_size=512):
            pass  # Train on batch.

Creating a training pipeline for small data on a single node is simple:

ds = ray.data.read_parquet(training_data_dir) \
    .repeat(num_epochs) \
    .random_shuffle_each_window()
ray.get(trainer.remote(ds))

Extending this pipeline to large data on a multi-node cluster is easy:

shards = ray.data.read_parquet(training_data_dir) \
    .window(blocks_per_window=20) \
    .repeat(num_epochs) \
    .random_shuffle_each_window() \
    .split(num_shards, equal=True)
ray.get([trainer.remote(shard) for shard in shards])

Slide 28

Scalable Shuffled ML Ingest - Case studies

Slide 29

Case studies - background
• Common problem 1: Existing solution is a bottleneck in the training pipeline
• Common problem 2: Models are shuffle-sensitive, where shuffle quality affects model accuracy
• Case study 1: high-tech ML platform startup
  • Existing solution: Pandas → S3 → Petastorm → Horovod
    • Petastorm: ML ingest library from Uber
  • Ray’s solution: Dask-on-Ray → Datasets → Horovod
• Case study 2: large tech company in the transport space
  • Existing solution: S3 → Petastorm → Horovod
  • Ray’s solution: S3 → Datasets → Horovod
• Both case studies suffered from both problems!

Slide 30

Case studies - benchmark results
Case Study 1: high-tech ML platform startup (Dask-on-Ray → Datasets → Horovod)
• Dask-on-Ray + Datasets was 8x faster than Pandas + S3 + Petastorm, even on small data/cluster scales
• Benchmark: NYC Taxi dataset (5 GB subset), single g4dn.4xlarge instance
Case Study 2: large transport tech company (S3 → Datasets → Horovod)
• Datasets from S3 was 4x faster than Petastorm from S3
• Benchmark: 1.5 TB synthetic tabular dataset, 16 nodes (40 vCPUs, 180 GB RAM), 2 shuffle windows
• Aggregate throughput: Petastorm 2.16 GB/s, Datasets 8.18 GB/s

Slide 31

Case study - benchmark results
User 1: ML platform startup (Dask-on-Ray → Datasets → Horovod)
• Dask-on-Ray + Datasets was 10x faster than Pandas + S3 + Petastorm, even on small data/cluster scales
• Benchmark: NYC Taxi dataset (5 GB subset), single g4dn.4xlarge instance
User 2: large transport tech company (S3 → Datasets → Horovod)
• Datasets from S3 was 4x faster than Petastorm from S3
• Benchmark: 1.5 TB synthetic tabular dataset, 70 shuffle workers (c5.18xlarge), 16 trainers (c5.18xlarge), 3 shuffle windows
• Throughput: Petastorm 1.8 GB/s, Datasets 7.38 GB/s
Ray Datasets gives a higher quality shuffle AND better performance, even at small scales!

Slide 32

We’re actively working on more comprehensive, open benchmarks at large data scales.
Stay tuned!

Slide 33

Efficient Batch Inference - Background

Slide 34

What is batch inference?
Run model inference over a large number of images.
Primary functions:
● Loading data from storage
● Minor preprocessing before inference
● Performing model inference on batches of data (possibly on the GPU)
● Saving the results to storage
And do it all efficiently!

Slide 35

What are the challenges with distributed batch inference?
• Scheduling
  • Don’t needlessly transfer data around the cluster
  • Only reserve GPUs for a fraction of the time
• Need to keep the GPU saturated to keep costs down
(Timeline diagram, t=0 to t=300s: the GPU sits idle during loading and preprocessing and is busy only during inference.)

Slide 36

Datasets solution - scheduling
• Automatic data locality
• Utilize Ray’s resource scheduling

def preprocess(row):
    pass

def infer(batch):
    pass

ray.data \
    .read_binary_files("s3://my-bucket") \
    .map(preprocess) \
    .map_batches(infer, num_gpus=0.25) \
    .write_parquet("gcs://results_dir")

Slide 37

Datasets solution - pipelining
• Pipelining of data loading, preprocessing, and inference stages
(Timeline diagram, t=0 to t=300s: loading, preprocessing, and inference overlap, greatly reducing GPU idle time.)

pipe: DatasetPipeline = ray.data \
    .read_binary_files("s3://bucket/image-dir") \
    .window(blocks_per_window=2) \
    .map(preprocess) \
    .map_batches(
        BatchInferModel, compute="actors",
        batch_size=256, num_gpus=1)
pipe.write_json("/tmp/results")

Slide 38

Efficient Batch Inference - Case studies

Slide 39

Case studies - background
• Case study 1: startup processing aerial imagery
  • Problem: Looking for scalable/resource-efficient batch inference
• Case study 2: high-tech ML platform startup
  • Problem: Existing Pandas → Torch solution: not scalable, inefficient use of resources (idle GPU)
  • Integration: Dask-on-Ray → Ray Datasets → Torch

Slide 40

Case studies - benchmark results
• Dask-on-Ray + Datasets + Torch was 5x faster than Pandas + Torch, even on small data/cluster scales
• Benchmark: NYC Taxi dataset (5 GB subset), single r5d.4xlarge instance

Slide 41

Case studies - benchmark results
• Dask-on-Ray + Datasets + Torch was 5x faster than Pandas + Torch, even on small data/cluster scales
• Benchmark: NYC Taxi dataset (5 GB subset), single r5d.4xlarge instance
Ray Datasets provides better resource efficiency and better performance, even at small scales!

Slide 42

Current and Future Work
Recently added features:
• Lazy compute mode
• Automatic task fusion
• Move semantics to reduce memory consumption
• Detailed execution statistics
Upcoming features:
• More data processing operations
• Groupby and aggregation performance improvements
• A suite of high-level preprocessing operations
• Distributed shuffle performance improvements
• Better out-of-core performance

Slide 43

Summary
Lots of room for improvement in current ML pipelines:
● Reducing cost and performance overheads
● Supporting ML operations like shuffling at scale
● Ease of use
Ray Datasets hits the mark by providing:
● Efficient data movement between steps in a pipeline
● Hyper-scalable implementations of ML operations
● Simple expression of complex ML pipelines
● ML ingest: better shuffled data and better performance than the status quo
● Batch inference: better efficiency and performance than the status quo

Slide 44

Thank you! Questions?
Ray: https://github.com/ray-project/ray
Ray Datasets: https://docs.ray.io/en/master/data/dataset.html#
Join the Ray Discussion Forum: https://discuss.ray.io/
We’re hiring! https://jobs.lever.co/anyscale