Unifying Preprocessing and Training at Scale with Ray Datasets

ML tasks such as distributed training and batch inference stretch the abstractions of modern data processing systems. In this talk, we’ll discuss the wide-ranging problems that the Python community faces when building large-scale preprocessing and training pipelines. Some of these problems are caused by the complexity of stitching together distributed systems that weren’t designed to be compatible, for example, creating a pipeline with Dask and Horovod that can efficiently use the CPUs and GPUs in a cluster. Other problems, like per-epoch dataset shuffling, show a gap between the operations ML practitioners want and what data processing libraries can do efficiently. We’ll also introduce Ray Datasets, a simple, scalable, and Pythonic way of solving these problems.

Anyscale

January 21, 2022

Transcript

  1. Unifying Data Preprocessing and Training with Ray Datasets

    Alex Wu, Software Engineer
    Clark Zinzow, Software Engineer
  2. Overview

    01 State of ML Training/Scoring Pipelines
    02 Introducing Ray Datasets
    03 Use Cases
  3. Existing ML Pipelines

    [Diagram labels: Preprocess, Checkpoint, Train]

  4. Existing ML Pipelines

  5. Challenges

    • Performance overheads
      ◦ Serialization/Deserialization
      ◦ Data materialized to external storage
    • Implementation/Operational Complexity
      ◦ Cross-lang, cross-workload
      ◦ CPUs vs GPUs
    • Missing operations
      ◦ Per-epoch shuffling
        ▪ How to do a fast, in-memory, distributed shuffle?
  6. Ray Datasets - Built on Ray

  7. Why Ray?

    • Efficient data layer
      ◦ Zero-copy reads, shared-memory object store
      ◦ Locality-aware scheduling
      ◦ Object transfer protocols
    • General purpose
      ◦ Resource-based scheduling
      ◦ Highly scalable
      ◦ Robust primitives
      ◦ Easy to programmatically compose distributed programs
  8. Ray Datasets

    01 Universal Data Loading
    02 Last Mile Preprocessing
    03 Parallel GPU/CPU Compute
    [Diagram: a ray.data.Dataset made up of Blocks spread across Node 1, Node 2, and Node 3]
  9. Ray Datasets

    01 Universal Data Loading
    02 Last Mile Preprocessing
    03 Parallel GPU/CPU Compute
    [Diagram: a ray.data.Dataset made up of Blocks spread across Node 1, Node 2, and Node 3]
    Not a DataFrame library!
  10. Use case: Data loading for ML Training

    [Diagram, stages: ETL from unstructured data; Load data / last-mile preprocessing; ML Training]
    [Diagram, systems: Large-scale relational data processing (Spark, Flink, Snowflake, Delta Lake, etc.); Ray Datasets + integrations; Ray Train (Horovod, PyTorch, etc.); Ray cluster]
  11. Universal Data Loader - Powered by ray.data.Dataset

  12. Universal Data Loader - Supported Data Sources

  13. Universal Data Loader - I/O API and Performance

    High performance distributed IO
    Leverages Apache Arrow’s high-performance single-threaded IO
    Parallelized using Ray’s high-throughput task execution
    Scales to PiB-scale jobs in production (Amazon)

    Read from storage
        ds = ray.data.read_parquet("s3://some/bucket")
        ds = ray.data.read_csv("/tmp/some_file.csv")

    Convert in memory
        ds = ray.data.from_pandas(df)
        df = ds.to_pandas()
        ds = ray.data.from_dask(dask_df)
        dask_df = ds.to_dask()

    Saving to storage
        ds.write_parquet("s3://some/bucket")
        ds.write_csv("/tmp/some_file.csv")
  14. Last-Mile Preprocessing

    • Basic transformations and aggregations
      • Map, batch map, filter
      • Stats aggregations
    • Global shuffle operations
      • Sort
      • Random shuffle
      • Groupby

    [Diagram labels: Input, Mapper, Mapper, Reducer, Reducer, Output]

        ray.data.read_parquet("foo.parquet") \
            .map_batches(process_batch, batch_size=512) \
            .repartition(16) \
            .groupby("col_1").mean("col_2")

        ray.data.read_parquet("foo.parquet") \
            .filter(lambda x: x < 0) \
            .map(lambda x: x**2) \
            .random_shuffle() \
            .write_parquet("bar.parquet")
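The Mapper/Reducer diagram and the `groupby("col_1").mean("col_2")` example on this slide can be illustrated without Ray. The following is a minimal sketch in plain Python, assuming each block is a list of row dictionaries; the function names `map_partial_sums` and `reduce_means` are hypothetical, and this is not how Ray Datasets implements its aggregation.

```python
from collections import defaultdict

def map_partial_sums(block, key, value):
    """Mapper: per-block partial (sum, count) for every group key."""
    partial = defaultdict(lambda: [0.0, 0])
    for row in block:
        acc = partial[row[key]]
        acc[0] += row[value]
        acc[1] += 1
    return dict(partial)

def reduce_means(partials):
    """Reducer: merge the partial sums and counts, then divide to get means."""
    total = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for group, (group_sum, group_count) in partial.items():
            total[group][0] += group_sum
            total[group][1] += group_count
    return {group: s / c for group, (s, c) in total.items()}

# Two hypothetical blocks, as if they lived on different nodes.
blocks = [
    [{"col_1": "a", "col_2": 1.0}, {"col_1": "b", "col_2": 4.0}],
    [{"col_1": "a", "col_2": 3.0}],
]
means = reduce_means(map_partial_sums(b, "col_1", "col_2") for b in blocks)
print(means)  # {'a': 2.0, 'b': 4.0}
```

Only the small per-block partials cross between the map and reduce steps, which is the reason a mean is usually computed from sums and counts rather than from raw rows.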
  15. Compute - Pipelining CPU/GPU compute

    Pipeline the execution of stages in your workflow
    1. Parallelizes stage execution (lower latency)
    2. Reduces idle time of stage-specific resources (lowers cost)

    [Timeline figures, t=0 to t=300s, showing loading, preprocessing, inference, and GPU idle time: "Without pipelining" and "With pipelining"]
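The latency and idle-time claims on this slide can be checked with simple arithmetic. The sketch below assumes an idealized pipeline with no scheduling overhead and hypothetical stage durations; the function names are illustrative and nothing here is measured from Ray.

```python
def sequential_time(stage_secs, num_windows):
    """Total time when every window finishes all stages before the next starts."""
    return num_windows * sum(stage_secs)

def pipelined_time(stage_secs, num_windows):
    """Total time when stages overlap across windows (idealized).

    The first window pays for every stage; each later window adds only the
    duration of the slowest stage, which is the pipeline bottleneck.
    """
    return sum(stage_secs) + (num_windows - 1) * max(stage_secs)

def gpu_idle_seconds(total_secs, inference_secs, num_windows):
    """Seconds in which the GPU stage is not running inference."""
    return total_secs - num_windows * inference_secs

stages = [40, 20, 40]  # hypothetical seconds: loading, preprocessing, inference
windows = 3

unpipelined = sequential_time(stages, windows)
pipelined = pipelined_time(stages, windows)
print(unpipelined, gpu_idle_seconds(unpipelined, stages[2], windows))  # 300 180
print(pipelined, gpu_idle_seconds(pipelined, stages[2], windows))      # 180 60
```

With these made-up numbers the pipelined run finishes sooner and leaves the GPU idle for a third of the time it sits idle in the unpipelined run.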
  16. Windowing can use memory more efficiently

    Execute your pipeline over windows of the full dataset
    This limits your cluster resource utilization (lowers cost)

    Without pipelining: Might spill to disk if dataset is large!
    With pipelining: Each window is processed sequentially (Window 1, Window 2, Window 3). Allows us to keep all data in memory
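Windowing as described on this slide amounts to running the pipeline over fixed-size groups of blocks, one group at a time. A minimal plain-Python sketch follows, assuming blocks are just items in a list; `iter_windows` is a hypothetical helper, not a Ray API.

```python
from typing import Iterator, List, Sequence

def iter_windows(blocks: Sequence, blocks_per_window: int) -> Iterator[List]:
    """Yield consecutive windows of at most `blocks_per_window` blocks."""
    if blocks_per_window < 1:
        raise ValueError("blocks_per_window must be at least 1")
    for start in range(0, len(blocks), blocks_per_window):
        yield list(blocks[start:start + blocks_per_window])

blocks = list(range(7))  # stand-ins for 7 data blocks
windows = list(iter_windows(blocks, blocks_per_window=3))
print(windows)  # [[0, 1, 2], [3, 4, 5], [6]]

# Only one window is materialized at a time, so the peak number of blocks
# held in memory is the window size rather than the dataset size.
print(max(len(w) for w in windows), len(blocks))  # 3 7
```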
  17. Use Cases

    Scalable shuffled ML ingest
    Efficient batch inference

  18. Scalable Shuffled ML Ingest - Background

  19. ML Ingest - Overview

    Loading data into one or more model trainers

    [Diagram labels: Data source, Data loader, Trainer #1, Trainer #2, Trainer #3]

    Primary functions:
    • Loading the data from storage
    • Partitioning the data into a shard per trainer
    • Batching the data into GPU batches
    • Shuffling the data before each epoch
      • Local shuffling (status quo)
      • Global shuffling (better but hard)
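Two of the primary functions listed on this slide, partitioning into a shard per trainer and batching, can be sketched in a few lines. This is an illustration in plain Python with hypothetical helper names; it assumes the rows fit in a list and uses round-robin assignment, which is only one of several possible partitioning schemes.

```python
from typing import Iterator, List, Sequence

def split_into_shards(rows: Sequence, num_shards: int) -> List[list]:
    """Round-robin partition: row i goes to shard i % num_shards."""
    return [list(rows[i::num_shards]) for i in range(num_shards)]

def iter_batches(rows: Sequence, batch_size: int) -> Iterator[list]:
    """Yield consecutive batches; the last batch may be smaller."""
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])

rows = list(range(10))
shards = split_into_shards(rows, num_shards=3)
print(shards)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
print(list(iter_batches(shards[0], batch_size=2)))  # [[0, 3], [6, 9]]
```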
  20. Aside - Why shuffle before each epoch?

    TL;DR: For certain models, per-epoch shuffling gives you much better model accuracy after a set number of epochs
    • Shuffling decorrelates samples within and across batches
    • Per-epoch shuffling reduces variance, improves generalization, and prevents overfitting
    • Batches are more representative of entire dataset, improving estimate of “true” full-dataset gradient
    • Gradient updates on individual samples are independent of sample ordering
    • Results in improved statistical gain of each step in the training process
  21. Why is scalable shuffled ML ingest hard?

    • Need parallelized, scalable reading from storage
    • Need to avoid leaving GPUs idle while reading/shuffling data

    [Timeline figure, t=0 to t=200s: shuffled data loading and training for Epoch 1 through Epoch 4, with GPU idle periods marked]
  22. Why is scalable shuffled ML ingest hard? - continued

    • The full pipeline may not fit in memory
    • Local shuffles hurt model accuracy, but…
    • Global per-epoch shuffling difficult to do efficiently in a distributed setting
    • API should still be simple!
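The difference between the local and global shuffles mentioned on this slide can be made concrete. In the sketch below, which uses plain Python and hypothetical function names, a local shuffle only reorders rows inside each trainer's shard, while a global shuffle mixes rows across all shards, which in a real cluster means moving data between nodes.

```python
import random
from typing import List

def local_shuffle(shards: List[list], rng: random.Random) -> List[list]:
    """Shuffle each shard independently; rows never leave their shard."""
    shuffled = []
    for shard in shards:
        copy = list(shard)
        rng.shuffle(copy)
        shuffled.append(copy)
    return shuffled

def global_shuffle(shards: List[list], rng: random.Random) -> List[list]:
    """Shuffle all rows together, then cut them back into shards of the same sizes."""
    rows = [row for shard in shards for row in shard]
    rng.shuffle(rows)
    reshuffled, start = [], 0
    for shard in shards:
        reshuffled.append(rows[start:start + len(shard)])
        start += len(shard)
    return reshuffled

rng = random.Random(0)
shards = [[0, 1, 2, 3], [4, 5, 6, 7]]
print(local_shuffle(shards, rng))   # each shard keeps its own four rows
print(global_shuffle(shards, rng))  # rows may move between shards
```

The global version touches every row of the dataset in one step, which hints at why it is harder to do efficiently once the rows are spread over many machines.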
  23. Datasets solution - scalable IO

    Massively scalable parallel IO, supporting many storage backends and formats

        ray.data.read_parquet("s3://some/training/data/bucket", parallelism=20)

    [Diagram labels: S3 dataset; Read task 1, Read task 2, Read task 3; Trainer #1, Trainer #2, Trainer #3]
  24. Datasets solution - pipelining stages

    Easy and efficient pipelining of preprocessing and shuffling with training, keeping the GPU saturated

    [Timeline figures, t=0 to t=200s, Epoch 1 through Epoch 4, showing shuffled data loading, training, and GPU idle periods: "Unpipelined" and "Pipelined"]
  25. Datasets solution - per-epoch shuffling

    Efficient distributed in-memory per-epoch shuffling

        ray.data.read_parquet(training_data_dir) \
            .repeat(num_epochs) \
            .random_shuffle_each_window() \
            .split(num_shards, equal=True, locality_hints=training_actors)

    Locality-aware placement of shuffle shards onto trainers
  26. Datasets solution - dataset windowing

    Limit data loading, preprocessing, and shuffling to a smaller window of the full dataset

    [Diagram: one epoch split into Window 1, Window 2, and Window 3, each with its own shuffled data loading step and trainers]

    This trades off shuffle quality with cluster size
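The trade-off named on this slide follows from how a windowed shuffle works: rows are shuffled only with other rows in the same window, and a fresh shuffle is drawn for every epoch. The following plain-Python sketch uses a hypothetical `windowed_shuffle_epochs` generator and is not the Ray implementation.

```python
import random
from typing import Iterator, List, Sequence, Tuple

def windowed_shuffle_epochs(
    rows: Sequence, window_size: int, num_epochs: int, seed: int = 0
) -> Iterator[Tuple[int, int, List]]:
    """Yield (epoch, window_index, shuffled_window) for every epoch and window."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    rng = random.Random(seed)
    for epoch in range(num_epochs):
        for index, start in enumerate(range(0, len(rows), window_size)):
            window = list(rows[start:start + window_size])
            rng.shuffle(window)  # mixes rows inside this window only
            yield epoch, index, window

rows = list(range(8))
for epoch, index, window in windowed_shuffle_epochs(rows, window_size=4, num_epochs=2):
    print(epoch, index, window)
```

A smaller window needs less memory at once but mixes each row with fewer neighbours; a window as large as the dataset is a full global shuffle.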
  27. Datasets solution - simple API

    Creating a training pipeline for small data on a single node is simple

        @ray.remote
        def trainer(data: DatasetPipeline):
            for epoch_ds in data.iter_epochs():
                for X, y in epoch_ds.to_torch(
                        label_column="label", batch_size=512):
                    pass  # Train on batch.

        ds = ray.data.read_parquet(training_data_dir) \
            .repeat(num_epochs) \
            .random_shuffle_each_window()
        ray.get(trainer.remote(ds))

    Extending this pipeline to large data on a multi-node cluster is easy

        shards = ray.data.read_parquet(training_data_dir) \
            .window(blocks_per_window=20) \
            .repeat(num_epochs) \
            .random_shuffle_each_window() \
            .split(num_shards, equal=True)
        ray.get([trainer.remote(shard) for shard in shards])
  28. Scalable Shuffled ML Ingest - Case studies

  29. Case studies - background

    • Common problem 1: Existing solution is a bottleneck in training pipeline
    • Common problem 2: Models are shuffle-sensitive, where shuffle quality affects model accuracy
    • Case study 1: high-tech ML platform startup
      • Existing solution: Pandas → S3 → Petastorm → Horovod
        • Petastorm: ML ingest library from Uber
      • Ray’s solution: Dask-on-Ray → Datasets → Horovod
    • Case study 2: large tech company in transport space
      • Existing solution: S3 → Petastorm → Horovod
      • Ray’s solution: S3 → Datasets → Horovod
    • Both case studies suffered from both problems!
  30. Case studies - benchmark results

    Case Study 1: high-tech ML platform startup
    Dask-on-Ray → Datasets → Horovod
    • Dask-on-Ray and Datasets were 8x faster than Pandas + S3 + Petastorm, even on small data/cluster scales
    • Benchmark: NYC Taxi dataset (5 GB subset), single g4dn.4xlarge instance

    Case Study 2: large transport tech company
    S3 → Datasets → Horovod
    • Datasets from S3 was 4x faster than Petastorm from S3
    • Benchmark: 1.5 TB synthetic tabular dataset, 16 nodes (40 vCPUs, 180 GB RAM), 2 shuffle windows

    Aggregate Throughput
    Petastorm: 2.16 GB/s
    Datasets: 8.18 GB/s
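The "4x" figure for Case Study 2 on this slide can be related to the two aggregate throughput numbers with a short calculation. The sketch below takes the throughputs from the slide; treating 1.5 TB as 1500 GB and assuming throughput stays constant over a full pass are simplifications made here, not statements from the talk.

```python
def speedup(new_gb_per_s: float, old_gb_per_s: float) -> float:
    """Ratio of the new throughput to the old one."""
    return new_gb_per_s / old_gb_per_s

def seconds_per_pass(dataset_gb: float, gb_per_s: float) -> float:
    """Time to stream the dataset once at a constant throughput."""
    return dataset_gb / gb_per_s

petastorm, datasets = 2.16, 8.18  # GB/s, from the slide
dataset_gb = 1500                 # assumption: 1.5 TB taken as 1500 GB

print(round(speedup(datasets, petastorm), 2))           # 3.79, roughly 4x
print(round(seconds_per_pass(dataset_gb, petastorm)))   # 694
print(round(seconds_per_pass(dataset_gb, datasets)))    # 183
```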
  31. Case study - benchmark results

    User 1: ML platform startup
    Dask-on-Ray → Datasets → Horovod
    • Dask-on-Ray and Datasets were 10x faster than Pandas + S3 + Petastorm, even on small data/cluster scales
    • Benchmark: NYC Taxi dataset (5 GB subset), single g4dn.4xlarge instance

    User 2: large transport tech company
    S3 → Datasets → Horovod
    • Datasets from S3 was 4x faster than Petastorm from S3
    • Benchmark: 1.5 TB synthetic tabular dataset, 70 shuffle workers (c5.18xlarge), 16 trainers (c5.18xlarge), 3 shuffle windows

    Throughput
    Petastorm: 1.8 GB/s
    Datasets: 7.38 GB/s

    Ray Datasets gives higher quality shuffle AND better performance, even at small scales!
  32. We’re actively working on more comprehensive, open benchmarks at large

    data scales Stay tuned!
  33. Efficient Batch Inference - Background

  34. What is batch inference?

    Run model inference over a large number of images.

    Primary functions:
    • Loading data from storage
    • Minor preprocessing before inference
    • Perform model inference on batches of data (possibly on the GPU)
    • Save the result to storage
    And do it all efficiently!
  35. What are the challenges with distributed batch inference?

    • Scheduling
      • Don’t needlessly transfer data around the cluster
      • Only reserve GPUs for a fraction of the time
    • Need to keep the GPU saturated to keep costs down

    [Timeline figure, t=0 to t=300s: loading, preprocessing, and inference, with GPU idle periods marked]
  36. Datasets solution - scheduling

    • Automatic data locality
    • Utilize Ray’s resource scheduling

        def preprocess(row):
            pass

        def infer(batch):
            pass

        dataset = ray.data \
            .read_binary_files("s3://my-bucket") \
            .map(preprocess) \
            .map_batches(infer, num_gpus=0.25) \
            .write_parquet("gcs://results_dir")
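The `num_gpus=0.25` argument on this slide requests a quarter of a GPU per inference task, so several tasks can share one device. The sketch below works out how many tasks fit at once; it assumes that a fractional request must fit entirely on a single GPU, and `max_concurrent_tasks` is a hypothetical helper rather than part of Ray.

```python
from fractions import Fraction

def max_concurrent_tasks(total_gpus: int, gpus_per_task: str) -> int:
    """Tasks that can run at once when each reserves a fraction of one GPU.

    `gpus_per_task` is a decimal string such as "0.25" so the arithmetic is
    exact. Assumption: a fractional task cannot be split across two GPUs.
    """
    per_task = Fraction(gpus_per_task)
    if not 0 < per_task <= 1:
        raise ValueError("gpus_per_task must be in (0, 1]")
    tasks_per_gpu = Fraction(1) // per_task  # whole tasks that fit on one GPU
    return total_gpus * tasks_per_gpu

print(max_concurrent_tasks(1, "0.25"))  # 4
print(max_concurrent_tasks(2, "0.4"))   # 4 (two per GPU, 0.2 of each GPU unused)
```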
  37. Datasets solution - pipelining

    • Pipelining of data loading, preprocessing, and inference stages

    [Timeline figure, t=0 to t=300s: loading, preprocessing, and inference, with GPU idle periods marked]

        pipe: DatasetPipeline = ray.data \
            .read_binary_files("s3://bucket/image-dir") \
            .window(blocks_per_window=2) \
            .map(preprocess) \
            .map_batches(BatchInferModel, batch_size=256, num_gpus=1) \
            .write_json("/tmp/results")
  38. Efficient Batch Inference - Case studies

  39. Case studies - background

    • Case study 1: startup processing aerial imagery
      • Problem: Looking for scalable/resource-efficient batch inference
    • Case study 2: high-tech ML platform startup
      • Problem: Existing Pandas → Torch solution: not scalable, inefficient use of resources (idle GPU)
      • Integration: Dask-on-Ray → Ray Datasets → Torch
  40. Case studies - benchmark results

    • Dask-on-Ray + Datasets + Torch was 5x faster than Pandas + Torch, even on small data/cluster scales
    • Benchmark: NYC Taxi dataset (5 GB subset), single r5d.4xlarge instance
  41. Case studies - benchmark results

    • Dask-on-Ray + Datasets + Torch was 5x faster than Pandas + Torch, even on small data/cluster scales
    • Benchmark: NYC Taxi dataset (5 GB subset), single r5d.4xlarge instance

    Ray Datasets provides better resource efficiency and better performance, even at small scales!
  42. Summary

    Lots of room for improvement in current ML pipelines:
    • Reducing cost and performance overheads
    • Supporting ML operations like shuffling at scale
    • Ease of use

    Ray Datasets hits the mark by providing:
    • Efficient data movement between steps in a pipeline
    • Hyper-scalable implementations of ML operations
    • Simple expression of complex ML pipelines
    • ML ingest: better shuffled data and better performance than the status quo
    • Batch inference: better efficiency and performance than the status quo
  43. Thank you

    Ray: https://github.com/ray-project/ray
    Ray Datasets: https://docs.ray.io/en/master/data/dataset.html#
    Join the Ray Discussion Forum: https://discuss.ray.io/
    We’re hiring! https://jobs.lever.co/anyscale
    Questions?