Unifying Preprocessing and Training at Scale with Ray Datasets

Unifying Data Preprocessing and Training with Ray Datasets
Alex Wu, Software Engineer
Clark Zinzow, Software Engineer

Overview
01 State of ML Training/Scoring Pipelines
02 Introducing Ray Datasets
03 Use Cases

Existing ML Pipelines
Preprocess → Checkpoint → Train

Existing ML Pipelines

Challenges
• Performance overheads
  ◦ Serialization/deserialization
  ◦ Data materialized to external storage
• Implementation/operational complexity
  ◦ Cross-language, cross-workload
  ◦ CPUs vs. GPUs
• Missing operations
  ◦ Per-epoch shuffling
    ▪ How to do a fast, in-memory, distributed shuffle?

Ray Datasets
Built on Ray

Why Ray?
• Efficient data layer
  ◦ Zero-copy reads, shared-memory object store
  ◦ Locality-aware scheduling
  ◦ Object transfer protocols
• General purpose
  ◦ Resource-based scheduling
  ◦ Highly scalable
  ◦ Robust primitives
  ◦ Easy to programmatically compose distributed programs

Ray Datasets
01 Universal Data Loading
02 Last Mile Preprocessing
03 Parallel GPU/CPU Compute
[Figure: a ray.data.Dataset made up of blocks spread across a cluster: Node 1 (one block), Node 2 (two blocks), Node 3 (one block)]
Not a DataFrame library!

Use case: Data loading for ML Training
• ETL from unstructured data: large-scale relational data processing (Spark, Flink, Snowflake, Delta Lake, etc.)
• Load data / last-mile preprocessing: Ray Datasets + integrations (on the Ray cluster)
• ML Training: Ray Train (Horovod, PyTorch, etc.) (on the Ray cluster)

Universal Data Loader
Powered by ray.data.Dataset

Universal Data Loader - Supported Data Sources

Universal Data Loader - I/O API and Performance
High-performance distributed IO
• Leverages Apache Arrow’s high-performance single-threaded IO
• Parallelized using Ray’s high-throughput task execution
• Scales to PiB-scale jobs in production (Amazon)
Read from storage:
    ds = ray.data.read_parquet("s3://some/bucket")
    ds = ray.data.read_csv("/tmp/some_file.csv")
Convert in memory:
    ds = ray.data.from_pandas(df)
    df = ds.to_pandas()
    ds = ray.data.from_dask(dask_df)
    dask_df = ds.to_dask()
Save to storage:
    ds.write_parquet("s3://some/bucket")
    ds.write_csv("/tmp/some_file.csv")

Last-Mile Preprocessing
• Basic transformations and aggregations
  ◦ Map, batch map, filter
  ◦ Stats aggregations
• Global shuffle operations
  ◦ Sort
  ◦ Random shuffle
  ◦ Groupby
[Figure: input flowing through mappers and reducers to output]
    ray.data.read_parquet("foo.parquet") \
        .filter(lambda x: x < 0) \
        .map(lambda x: x**2) \
        .random_shuffle() \
        .write_parquet("bar.parquet")
    ray.data.read_parquet("foo.parquet") \
        .map_batches(process_batch, batch_size=512) \
        .repartition(16) \
        .groupby("col_1").mean("col_2")
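To make the semantics of a batch map followed by a grouped mean concrete, here is a minimal single-process sketch in plain Python. It does not use Ray; the helper names `map_batches` and `groupby_mean` and the sample rows are hypothetical stand-ins chosen to mirror the slide, not the library's implementation.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Row = Dict[str, float]


def map_batches(rows: List[Row], fn: Callable[[List[Row]], List[Row]],
                batch_size: int) -> List[Row]:
    """Apply `fn` to consecutive batches of rows and concatenate the results."""
    out: List[Row] = []
    for start in range(0, len(rows), batch_size):
        out.extend(fn(rows[start:start + batch_size]))
    return out


def groupby_mean(rows: List[Row], key: str, value: str) -> Dict[float, float]:
    """Mean of `value` for each distinct `key` (an in-memory stand-in for a shuffle)."""
    sums: Dict[float, float] = defaultdict(float)
    counts: Dict[float, int] = defaultdict(int)
    for row in rows:
        sums[row[key]] += row[value]
        counts[row[key]] += 1
    return {k: sums[k] / counts[k] for k in sums}


# Hypothetical sample data using the column names from the slide.
rows = [
    {"col_1": 1, "col_2": 2.0},
    {"col_1": 1, "col_2": 4.0},
    {"col_1": 2, "col_2": 10.0},
]
doubled = map_batches(
    rows, lambda batch: [{**r, "col_2": r["col_2"] * 2} for r in batch], batch_size=2)
means = groupby_mean(doubled, "col_1", "col_2")
```

In a distributed setting the grouped mean requires moving rows with the same key to the same worker, which is why the slide lists groupby among the global shuffle operations.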

Compute - Pipelining CPU/GPU compute
Pipeline the execution of stages in your workflow:
1. Parallelizes stage execution (lower latency)
2. Reduces idle time of stage-specific resources (lowers cost)
[Figure: two timelines from t=0 to t=300s showing the loading, preprocessing, and inference stages with GPU idle time marked, one without pipelining and one with pipelining]
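The latency benefit of pipelining can be sketched with a small calculation. This is a simplified model, not a measurement: it assumes the dataset is cut into equal windows, that each stage handles one window at a time, and that the per-window durations below (hypothetical values) are fixed.

```python
from typing import List


def sequential_makespan(stage_secs: List[float], num_windows: int) -> float:
    """Total time when no two stages ever run at the same time."""
    return sum(stage_secs) * num_windows


def pipelined_makespan(stage_secs: List[float], num_windows: int) -> float:
    """Total time when stages overlap and each stage handles one window at a time."""
    # finish[s] is the time at which stage s finished its most recent window.
    finish = [0.0] * len(stage_secs)
    for _ in range(num_windows):
        ready = 0.0  # time at which this window is available to the next stage
        for s, secs in enumerate(stage_secs):
            start = max(ready, finish[s])  # wait for both the input and the stage
            finish[s] = start + secs
            ready = finish[s]
    return finish[-1]


# Hypothetical per-window durations for loading, preprocessing, and inference.
stages = [10.0, 10.0, 10.0]
unpipelined = sequential_makespan(stages, num_windows=10)  # 300.0
pipelined = pipelined_makespan(stages, num_windows=10)     # 120.0
```

Under these assumptions the pipelined run is bounded by the slowest stage, so the GPU-bound stage stays busy once the first window reaches it.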

Windowing can use memory more efficiently
Execute your pipeline over windows of the full dataset
This limits your cluster resource utilization (lowers cost)
• Without pipelining: might spill to disk if dataset is large!
• With pipelining: each window is processed sequentially (Window 1, Window 2, Window 3), which allows us to keep all data in memory

Use Cases
• Scalable shuffled ML ingest
• Efficient batch inference

Scalable Shuffled ML Ingest - Background

ML Ingest - Overview
Loading data into one or more model trainers
[Figure: a data source feeding a data loader, which feeds Trainers #1, #2, and #3]
Primary functions:
• Loading the data from storage
• Partitioning the data into a shard per trainer
• Batching the data into GPU batches
• Shuffling the data before each epoch
  ◦ Local shuffling (status quo)
  ◦ Global shuffling (better but hard)
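The partitioning and batching functions listed above can be sketched in a few lines of plain Python. The helper names are hypothetical, and dropping leftover rows so that every shard has the same size is an assumption of this sketch rather than a statement about any particular loader.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def split_equal(rows: Sequence[T], num_shards: int) -> List[List[T]]:
    """Split rows into `num_shards` equal-sized shards, dropping any remainder."""
    per_shard = len(rows) // num_shards
    return [list(rows[i * per_shard:(i + 1) * per_shard]) for i in range(num_shards)]


def iter_batches(shard: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches from one shard; the last batch may be smaller."""
    for start in range(0, len(shard), batch_size):
        yield list(shard[start:start + batch_size])


rows = list(range(10))                      # ten sample rows
shards = split_equal(rows, num_shards=3)    # one shard per trainer
batches = list(iter_batches(shards[0], batch_size=2))
```

Equal shard sizes matter for synchronous data-parallel training, where every trainer is expected to run the same number of steps per epoch.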

Aside - Why shuffle before each epoch?
TL;DR: For certain models, per-epoch shuffling gives you much better model accuracy after a set number of epochs.
• Shuffling decorrelates samples within and across batches
• Per-epoch shuffling reduces variance, improves generalization, and prevents overfitting
• Batches are more representative of entire dataset, improving estimate of “true” full-dataset gradient
• Gradient updates on individual samples are independent of sample ordering
• Results in improved statistical gain of each step in the training process

Why is scalable shuffled ML ingest hard?
• Need parallelized, scalable reading from storage
• Need to avoid leaving GPUs idle while reading/shuffling data
[Figure: timeline from t=0 to t=200s across Epochs 1-4 in which shuffled data loading alternates with training, leaving the GPU idle during every load]

Why is scalable shuffled ML ingest hard? - continued
• The full pipeline may not fit in memory
• Local shuffles hurt model accuracy, but…
• Global per-epoch shuffling is difficult to do efficiently in a distributed setting
• API should still be simple!
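The difference between a local and a global shuffle can be illustrated with a single-process sketch. The function names and the two-shard example are hypothetical; a real distributed global shuffle has to move data between nodes, which is the hard part this sketch leaves out.

```python
import random
from typing import List, Sequence


def local_shuffle(shards: Sequence[Sequence[int]], seed: int) -> List[List[int]]:
    """Shuffle each shard independently; samples never leave their shard."""
    rng = random.Random(seed)
    out = []
    for shard in shards:
        shuffled = list(shard)
        rng.shuffle(shuffled)
        out.append(shuffled)
    return out


def global_shuffle(shards: Sequence[Sequence[int]], seed: int) -> List[List[int]]:
    """Shuffle all samples together, then cut them back into shards of the same sizes."""
    rng = random.Random(seed)
    pool = [x for shard in shards for x in shard]
    rng.shuffle(pool)
    out, start = [], 0
    for shard in shards:
        out.append(pool[start:start + len(shard)])
        start += len(shard)
    return out


# Two hypothetical shards whose contents are correlated with their position.
shards = [list(range(0, 100)), list(range(100, 200))]
local = local_shuffle(shards, seed=0)
shuffled_globally = global_shuffle(shards, seed=0)
```

After the local shuffle the first trainer still only ever sees samples 0-99, whereas after the global shuffle each shard draws from the whole dataset.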

Datasets solution - scalable IO
Massively scalable parallel IO, supporting many storage backends and formats
    ray.data.read_parquet("s3://some/training/data/bucket", parallelism=20)
[Figure: an S3 dataset read by Read tasks 1-3, which feed Trainers #1-#3]

Datasets solution - pipelining stages
Easy and efficient pipelining of preprocessing and shuffling with training, keeping the GPU saturated
[Figure: timelines from t=0 to t=200s across Epochs 1-4. Unpipelined: shuffled data loading alternates with training, leaving the GPU idle during every load. Pipelined: shuffled data loading overlaps with training, so the GPU is idle only at the start]

Datasets solution - per-epoch shuffling
Efficient distributed in-memory per-epoch shuffling
    ray.data.read_parquet(training_data_dir) \
        .repeat(num_epochs) \
        .random_shuffle_each_window() \
        .split(num_shards, equal=True, locality_hints=training_actors)
Locality-aware placement of shuffle shards onto trainers

Datasets solution - dataset windowing
Limit data loading, preprocessing, and shuffling to a smaller window of the full dataset
[Figure: one epoch split into Windows 1-3, each with its own shuffled data loading step feeding the trainers]
This trades off shuffle quality against cluster size
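The trade-off can be seen in a minimal sketch of a windowed per-epoch shuffle: rows are shuffled only within their window, so a smaller window needs less memory but mixes the data less. The function names and sizes are hypothetical, and this single-process code only illustrates the semantics, not how Ray Datasets implements them.

```python
import random
from typing import List, Sequence


def windowed_shuffle(rows: Sequence[int], window_size: int,
                     rng: random.Random) -> List[int]:
    """Shuffle rows within consecutive windows; rows never cross a window boundary."""
    out: List[int] = []
    for start in range(0, len(rows), window_size):
        window = list(rows[start:start + window_size])
        rng.shuffle(window)
        out.extend(window)
    return out


def shuffled_epochs(rows: Sequence[int], num_epochs: int, window_size: int,
                    seed: int) -> List[List[int]]:
    """Produce one windowed shuffle of the dataset per epoch."""
    rng = random.Random(seed)
    return [windowed_shuffle(rows, window_size, rng) for _ in range(num_epochs)]


# Twelve rows, three windows of four rows each, two epochs.
epochs = shuffled_epochs(list(range(12)), num_epochs=2, window_size=4, seed=0)
```

Setting `window_size` to the dataset length recovers a full global shuffle, at the cost of holding the whole dataset at once.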

Datasets solution - simple API
Creating a training pipeline for small data on a single node is simple:
    import ray
    from ray.data.dataset_pipeline import DatasetPipeline
    @ray.remote
    def trainer(data: DatasetPipeline):
        # Each epoch is its own dataset, reshuffled by the pipeline.
        for epoch_ds in data.iter_epochs():
            for X, y in epoch_ds.to_torch(
                    label_column="label", batch_size=512):
                pass  # Train on batch.
    ds = ray.data.read_parquet(training_data_dir) \
        .repeat(num_epochs) \
        .random_shuffle_each_window()
    ray.get(trainer.remote(ds))
Extending this pipeline to large data on a multi-node cluster is easy:
    shards = ray.data.read_parquet(training_data_dir) \
        .window(blocks_per_window=20) \
        .repeat(num_epochs) \
        .random_shuffle_each_window() \
        .split(num_shards, equal=True)
    ray.get([trainer.remote(shard) for shard in shards])

Scalable Shuffled ML Ingest - Case studies

Case studies - background
• Common problem 1: Existing solution is a bottleneck in training pipeline
• Common problem 2: Models are shuffle-sensitive, where shuffle quality affects model accuracy
• Case study 1: high-tech ML platform startup
  ◦ Existing solution: Pandas → S3 → Petastorm → Horovod
  ◦ Petastorm: ML ingest library from Uber
  ◦ Ray’s solution: Dask-on-Ray → Datasets → Horovod
• Case study 2: large tech company in transport space
  ◦ Existing solution: S3 → Petastorm → Horovod
  ◦ Ray’s solution: S3 → Datasets → Horovod
• Both case studies suffered from both problems!

Case studies - benchmark results
Case Study 1: high-tech ML platform startup (Dask-on-Ray → Datasets → Horovod)
• Dask-on-Ray and Datasets were 8x faster than Pandas + S3 + Petastorm, even on small data/cluster scales
• Benchmark: NYC Taxi dataset (5 GB subset), single g4dn.4xlarge instance
Case Study 2: large transport tech company (S3 → Datasets → Horovod)
• Datasets from S3 was 4x faster than Petastorm from S3
• Benchmark: 1.5 TB synthetic tabular dataset, 16 nodes (40 vCPUs, 180 GB RAM), 2 shuffle windows
• Aggregate throughput: Petastorm 2.16 GB/s, Datasets 8.18 GB/s

Case study - benchmark results
User 1: ML platform startup (Dask-on-Ray → Datasets → Horovod)
• Dask-on-Ray and Datasets were 10x faster than Pandas + S3 + Petastorm, even on small data/cluster scales
• Benchmark: NYC Taxi dataset (5 GB subset), single g4dn.4xlarge instance
User 2: large transport tech company (S3 → Datasets → Horovod)
• Datasets from S3 was 4x faster than Petastorm from S3
• Benchmark: 1.5 TB synthetic tabular dataset, 70 shuffle workers (c5.18xlarge), 16 trainers (c5.18xlarge), 3 shuffle windows
• Throughput: Petastorm 1.8 GB/s, Datasets 7.38 GB/s
Ray Datasets gives higher quality shuffle AND better performance, even at small scales!
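As a quick check on the "4x" claim, the speedup can be recomputed from the throughput figures quoted on the two results slides. The `speedup` helper is introduced here purely for illustration.

```python
def speedup(new_gb_per_s: float, old_gb_per_s: float) -> float:
    """Ratio of the new throughput to the old throughput."""
    return new_gb_per_s / old_gb_per_s


# Datasets vs. Petastorm, using the throughput numbers quoted on each slide.
first_slide = speedup(8.18, 2.16)    # about 3.79
second_slide = speedup(7.38, 1.8)    # about 4.1
```

Both ratios round to roughly 4x, consistent with the headline figure for the S3 → Datasets → Horovod setup.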

We’re actively working on more comprehensive, open benchmarks at large data scales.
Stay tuned!

Efficient Batch Inference - Background

What is batch inference?
Run model inference over a large number of images.
Primary functions:
• Loading data from storage
• Minor preprocessing before inference
• Perform model inference on batches of data (possibly on the GPU)
• Save the result to storage
And do it all efficiently!

What are the challenges with distributed batch inference?
• Scheduling
  ◦ Don’t needlessly transfer data around the cluster
  ◦ Only reserve GPUs for a fraction of the time
• Need to keep the GPU saturated to keep costs down
[Figure: timeline from t=0 to t=300s showing loading, preprocessing, and inference, with the GPU idle until inference starts]

Datasets solution - scheduling
• Automatic data locality
• Utilize Ray’s resource scheduling
    def preprocess(row):
        pass
    def infer(batch):
        pass
    dataset = ray.data \
        .read_binary_files("s3://my-bucket") \
        .map(preprocess) \
        .map_batches(infer, num_gpus=0.25) \
        .write_parquet("gcs://results_dir")
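A fractional request such as `num_gpus=0.25` lets several inference tasks share one GPU. The following back-of-the-envelope sketch estimates how many such tasks can run at once; it assumes each fractional request must fit on a single GPU, and the helper name is hypothetical. It models the bookkeeping only and does not call Ray.

```python
import math


def max_concurrent_tasks(num_gpus: int, gpus_per_task: float) -> int:
    """Tasks that fit at once if each fractional request must land on a single GPU."""
    if not 0 < gpus_per_task <= 1:
        raise ValueError("this sketch only covers requests of at most one GPU")
    # The small epsilon guards against floating-point error in the division.
    per_gpu = math.floor(1.0 / gpus_per_task + 1e-9)
    return num_gpus * per_gpu


slots_on_one_gpu = max_concurrent_tasks(num_gpus=1, gpus_per_task=0.25)   # 4
slots_on_four_gpus = max_concurrent_tasks(num_gpus=4, gpus_per_task=0.25)  # 16
```

Whether four concurrent tasks actually fit in GPU memory depends on the model, so the fraction is something to tune rather than a guarantee.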

Datasets solution - pipelining
• Pipelining of data loading, preprocessing, and inference stages
[Figure: timeline from t=0 to t=300s in which loading, preprocessing, and inference overlap, shortening GPU idle time]
    pipe: DatasetPipeline = ray.data \
        .read_binary_files("s3://bucket/image-dir") \
        .window(blocks_per_window=2) \
        .map(preprocess) \
        .map_batches(BatchInferModel, batch_size=256, num_gpus=1) \
        .write_json("/tmp/results")

Efficient Batch Inference - Case studies

Case studies - background
• Case study 1: startup processing aerial imagery
  ◦ Problem: Looking for scalable/resource-efficient batch inference
• Case study 2: high-tech ML platform startup
  ◦ Problem: Existing Pandas → Torch solution is not scalable and uses resources inefficiently (idle GPU)
  ◦ Integration: Dask-on-Ray → Ray Datasets → Torch

Case studies - benchmark results
• Dask-on-Ray + Datasets + Torch were 5x faster than Pandas + Torch, even on small data/cluster scales
• Benchmark: NYC Taxi dataset (5 GB subset), single r5d.4xlarge instance
Ray Datasets provides better resource efficiency and better performance, even at small scales!

Summary
Lots of room for improvement in current ML pipelines:
• Reducing cost and performance overheads
• Supporting ML operations like shuffling at scale
• Ease of use
Ray Datasets hits the mark by providing:
• Efficient data movement between steps in a pipeline
• Hyper-scalable implementations of ML operations
• Simple expression of complex ML pipelines
• ML ingest: better shuffled data and better performance than the status quo
• Batch inference: better efficiency and performance than the status quo

Thank you
Questions?
• Ray: https://github.com/ray-project/ray
• Ray Datasets: https://docs.ray.io/en/master/data/dataset.html#
• Join the Ray Discussion Forum: https://discuss.ray.io/
• We’re hiring! https://jobs.lever.co/anyscale
