Ray Datasets: Scalable data preprocessing for distributed ML

Anyscale
February 23, 2022

Ray Datasets is a Ray-native distributed dataset library that serves as the standard way to load, process, and exchange data in Ray libraries and applications. It features performant distributed data loading, flexible parallel compute operations, and comprehensive datasource compatibility and distributed framework integrations, all behind an incredibly simple API.

Get a first-hand look at how Ray Datasets:
- Provides hyper-scalable parallel I/O to the most popular storage backends and file formats
- Supports common last-mile preprocessing operations, including basic parallel data transformations such as map, batched map, and filter, and global operations such as sort, shuffle, groupby, and stats aggregations
- Efficiently integrates with data processing libraries (e.g., Spark, Pandas, NumPy, Dask, Mars) and machine learning frameworks (e.g., TensorFlow, Torch, Horovod) -- a minimal sketch follows below
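
The three points above can be tied together in a minimal sketch (not taken from the deck; the bucket path, column names, and the add_features function are hypothetical placeholders):

    import ray

    def add_features(batch):
        # Hypothetical batched transform applied to each data batch.
        return batch

    # 1. Parallel I/O from a storage backend (hypothetical S3 path).
    ds = ray.data.read_parquet("s3://example-bucket/train")

    # 2. Last-mile preprocessing: batched map plus a groupby aggregation.
    ds = ds.map_batches(add_features, batch_size=1024)
    stats = ds.groupby("label").mean("value")

    # 3. Exchange with other libraries and ML frameworks.
    df = ds.to_pandas()                             # Pandas
    torch_batches = ds.to_torch(label_column="label",
                                batch_size=256)     # PyTorch tensors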

Transcript

  1. Challenges
     • Performance overheads
       ◦ Serialization/deserialization
       ◦ Data materialized to external storage
     • Implementation/operational complexity
       ◦ Cross-language, cross-workload
       ◦ CPUs vs. GPUs
     • Missing operations
       ◦ Per-epoch shuffling
         ▪ How to do a fast, in-memory, distributed shuffle?
  2. Ray Datasets
     01 Universal data loading
     02 Last-mile preprocessing
     03 Parallel GPU/CPU compute
     [Diagram: a ray.data.Dataset is a collection of blocks distributed across the cluster nodes (Node 1, Node 2, Node 3)]
  3. Ray Datasets
     Same three pillars and block diagram as the previous slide, with one callout added:
     Not a DataFrame library!
  4. Use case: Data loading for ML Training
     [Pipeline diagram, running on a Ray cluster]
     • ETL from unstructured data: large-scale relational data processing (Spark, Flink, Snowflake, Delta Lake, etc.)
     • Load data / last-mile preprocessing: Ray Datasets + integrations
     • ML training: Ray Train (Horovod, PyTorch, etc.)
  5. Universal Data Loader - I/O API and Performance
     High-performance distributed I/O
     Read from storage:
        ds = ray.data.read_parquet("s3://some/bucket")
        ds = ray.data.read_csv("/tmp/some_file.csv")
     Convert in memory:
        ds = ray.data.from_pandas(df)
        df = ds.to_pandas()
        ds = ray.data.from_dask(dask_df)
        dask_df = ds.to_dask()
     Saving to storage:
        ds.write_parquet("s3://some/bucket")
        ds.write_csv("/tmp/some_file.csv")
     • Leverages Apache Arrow's high-performance single-threaded I/O
     • Parallelized using Ray's high-throughput task execution
     • Scales to PiB-scale jobs in production (Amazon)
  6. Last-Mile Preprocessing
     [Diagram: map-reduce style data flow from Input through Mappers and Reducers to Output]
     • Basic transformations and aggregations
       ◦ Map, batch map, filter
       ◦ Stats aggregations
     • Global shuffle operations
       ◦ Sort
       ◦ Random shuffle
       ◦ Groupby
     Examples:
        ray.data.read_parquet("foo.parquet") \
            .map_batches(process_batch, batch_size=512) \
            .repartition(16) \
            .groupby("col_1").mean("col_2")

        ray.data.read_parquet("foo.parquet") \
            .filter(lambda x: x < 0) \
            .map(lambda x: x**2) \
            .random_shuffle() \
            .write_parquet("bar.parquet")
  7. Compute - Pipelining CPU/GPU compute
     Pipeline the execution of stages in your workflow (as sketched below):
     1. Parallelizes stage execution (lower latency)
     2. Reduces idle time of stage-specific resources (lowers cost)
     [Timeline diagram, t=0 to t=300s: without pipelining, the GPU sits idle while loading and preprocessing run; with pipelining, loading, preprocessing, and inference overlap and GPU idle time shrinks]
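     A rough illustration of the idea above (not from the deck; the Parquet path, window size, and the preprocess_batch/infer_batch functions are hypothetical placeholders). Dataset.window() turns a Dataset into a DatasetPipeline whose per-window stage executions overlap:

        import ray

        def preprocess_batch(batch):   # hypothetical CPU-side transform
            return batch

        def infer_batch(batch):        # hypothetical GPU-side inference step
            return batch

        ds = ray.data.read_parquet("s3://some/bucket")

        # Without pipelining: each stage runs over the whole dataset before
        # the next one starts, so the GPU idles during loading/preprocessing.
        ds.map_batches(preprocess_batch) \
          .map_batches(infer_batch, num_gpus=1)

        # With pipelining: the GPU stage starts as soon as the first window
        # of blocks has been loaded and preprocessed.
        ds.window(blocks_per_window=10) \
          .map_batches(preprocess_batch) \
          .map_batches(infer_batch, num_gpus=1)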
  8. Windowing can use memory more efficiently
     • Execute your pipeline over windows of the full dataset
     • Each window is processed sequentially
     • This limits your cluster resource utilization (lowers cost)
     [Diagram: without pipelining, the full dataset might spill to disk if it is large; with pipelining, Windows 1-3 flow through the pipeline one at a time, allowing us to keep all data in memory]
  9. Why Ray?
     • Efficient data layer
       ◦ Zero-copy reads, shared-memory object store
       ◦ Locality-aware scheduling
       ◦ Object transfer protocols
     • General purpose
       ◦ Resource-based scheduling (see the sketch below)
       ◦ Highly scalable
       ◦ Robust primitives
       ◦ Easy to programmatically compose distributed programs
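     A tiny Ray-core sketch (not from the deck) of two of the points above: resource-based scheduling via per-task resource requests, and zero-copy reads from the shared-memory object store. The num_gpus=1 passed to ray.init() is only declared so the example schedules on a machine without a physical GPU:

        import numpy as np
        import ray

        ray.init(num_cpus=4, num_gpus=1)

        # Resource-based scheduling: this task is only placed on a node that
        # can grant it 2 CPUs and a quarter of a GPU.
        @ray.remote(num_cpus=2, num_gpus=0.25)
        def block_sum(block):
            return float(block.sum())

        # Shared-memory object store: the array is written once and workers
        # on the same node read it zero-copy rather than receiving copies.
        block_ref = ray.put(np.ones((1000, 1000)))
        print(ray.get(block_sum.remote(block_ref)))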
  10. ML Ingest - Overview
      Loading data into one or more model trainers
      [Diagram: data flows from the data source through a data loader into Trainer #1, #2, and #3]
      Primary functions:
      • Loading the data from storage
      • Partitioning the data into a shard per trainer
      • Batching the data into GPU batches
      • Shuffling the data before each epoch, as contrasted in the sketch below
        ◦ Local shuffling (status quo)
        ◦ Global shuffling (better, but hard)
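      To make the local-versus-global distinction above concrete, here is a generic (non-Ray) sketch with a hypothetical ten-element dataset: a local shuffle only permutes samples within a small streaming buffer, so early samples cannot move far, while a global shuffle permutes the entire dataset each epoch:

        import random

        dataset = list(range(10))   # hypothetical tiny dataset
        random.seed(0)

        def local_shuffle(samples, buffer_size=4):
            """Shuffle within a small sliding buffer only (local shuffling)."""
            buffer, out = [], []
            for sample in samples:
                buffer.append(sample)
                if len(buffer) == buffer_size:
                    out.append(buffer.pop(random.randrange(len(buffer))))
            while buffer:
                out.append(buffer.pop(random.randrange(len(buffer))))
            return out

        def global_shuffle(samples):
            """Permute the entire dataset (global shuffling)."""
            return random.sample(samples, len(samples))

        print(local_shuffle(dataset))   # early samples move only a few positions
        print(global_shuffle(dataset))  # any sample can land anywhere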
  11. Aside - Why shuffle before each epoch?
      TL;DR: For certain models, per-epoch shuffling gives you much better model accuracy after a set number of epochs
      • Shuffling decorrelates samples within and across batches
      • Per-epoch shuffling reduces variance, improves generalization, and prevents overfitting
      • Batches are more representative of the entire dataset, improving the estimate of the “true” full-dataset gradient
      • Gradient updates on individual samples become independent of sample ordering
      • Results in improved statistical gain from each step of the training process
  12. Why is scalable shuffled ML ingest hard?
      • Need parallelized, scalable reading from storage
      • Need to avoid wasting idle GPUs while reading/shuffling data
      [Timeline diagram: across Epochs 1-4, the GPU sits idle during shuffled data loading before each training phase]
  13. Why is scalable shuffled ML ingest hard? - continued
      • The full pipeline may not fit in memory
      • Local shuffles hurt model accuracy, but…
      • …global per-epoch shuffling is difficult to do efficiently in a distributed setting
      • The API should still be simple!
  14. Datasets solution - scalable IO
      Massively scalable parallel I/O, supporting many storage backends and formats
         ray.data.read_parquet("s3://some/training/data/bucket", parallelism=20)
      [Diagram: read tasks 1-3 pull from the S3 dataset in parallel and feed Trainers #1-#3]
  15. Datasets solution - pipelining stages
      Easy and efficient pipelining of preprocessing and shuffling with training, keeping the GPU saturated
      [Timeline diagram: unpipelined, the GPU idles during shuffled data loading in every epoch; pipelined, loading for the next epoch overlaps with training on the current one across Epochs 1-4]
  16. Datasets solution - per-epoch shuffling
      Efficient distributed in-memory per-epoch shuffling
         ray.data.read_parquet(training_data_dir) \
             .repeat(num_epochs) \
             .random_shuffle_each_window() \
             .split(
                 num_shards,
                 equal=True,
                 locality_hints=training_actors)
      Locality-aware placement of shuffle shards onto trainers
  17. Datasets solution - dataset windowing
      Limit data loading, preprocessing, and shuffling to a smaller window of the full dataset
      [Diagram: one epoch is split into Windows 1-3; each window's shuffled data loading feeds the trainers in turn]
      This trades off shuffle quality against the cluster size needed to hold the data
  18. Datasets solution - simple API
      Creating a training pipeline for small data on a single node is simple:
         @ray.remote
         def trainer(data: DatasetPipeline):
             for epoch_ds in data.iter_epochs():
                 for X, y in epoch_ds.to_torch(
                         label_column="label", batch_size=512):
                     pass  # Train on batch.

         ds = ray.data.read_parquet(training_data_dir) \
             .repeat(num_epochs) \
             .random_shuffle_each_window()
         ray.get(trainer.remote(ds))
      Extending this pipeline to large data on a multi-node cluster is easy:
         shards = ray.data.read_parquet(training_data_dir) \
             .window(blocks_per_window=20) \
             .repeat(num_epochs) \
             .random_shuffle_each_window() \
             .split(num_shards, equal=True)
         ray.get([trainer.remote(shard) for shard in shards])
  19. Case studies - background
      • Common problem 1: the existing solution is a bottleneck in the training pipeline
      • Common problem 2: models are shuffle-sensitive, where shuffle quality affects model accuracy
      • Case study 1: high-tech ML platform startup
        ◦ Existing solution: Pandas → S3 → Petastorm → Horovod
        ◦ Petastorm: ML ingest library from Uber
        ◦ Ray's solution: Dask-on-Ray → Datasets → Horovod
      • Case study 2: large tech company in the transport space
        ◦ Existing solution: S3 → Petastorm → Horovod
        ◦ Ray's solution: S3 → Datasets → Horovod
      • Both case studies suffered from both problems!
  20. Case studies - benchmark results
      Case Study 1: high-tech ML platform startup (Dask-on-Ray → Datasets → Horovod)
      • Dask-on-Ray + Datasets was 8x faster than Pandas + S3 + Petastorm, even at small data/cluster scales
      • Benchmark: NYC Taxi dataset (5 GB subset), single g4dn.4xlarge instance
      Case Study 2: large transport tech company (S3 → Datasets → Horovod)
      • Datasets from S3 was 4x faster than Petastorm from S3
      • Benchmark: 1.5 TB synthetic tabular dataset, 16 nodes (40 vCPUs, 180 GB RAM each), 2 shuffle windows
      • Aggregate throughput: Petastorm 2.16 GB/s, Datasets 8.18 GB/s
  21. Case study - benchmark results
      User 1: ML platform startup (Dask-on-Ray → Datasets → Horovod)
      • Dask-on-Ray + Datasets was 10x faster than Pandas + S3 + Petastorm, even at small data/cluster scales
      • Benchmark: NYC Taxi dataset (5 GB subset), single g4dn.4xlarge instance
      User 2: large transport tech company (S3 → Datasets → Horovod)
      • Datasets from S3 was 4x faster than Petastorm from S3
      • Benchmark: 1.5 TB synthetic tabular dataset, 70 shuffle workers (c5.18xlarge), 16 trainers (c5.18xlarge), 3 shuffle windows
      • Throughput: Petastorm 1.8 GB/s, Datasets 7.38 GB/s
      Ray Datasets gives a higher-quality shuffle AND better performance, even at small scales!
  22. What is batch inference?
      Run model inference over a large number of images. Primary functions:
      • Load data from storage
      • Apply minor preprocessing before inference
      • Perform model inference on batches of data (possibly on the GPU)
      • Save the results to storage
      And do it all efficiently!
  23. What are the challenges with distributed batch inference?
      • Scheduling
        ◦ Don't needlessly transfer data around the cluster
        ◦ Only reserve GPUs for a fraction of the time
      • Need to keep the GPU saturated to keep costs down
      [Timeline diagram, t=0 to t=300s: the GPU sits idle during loading and preprocessing before inference]
  24. Datasets solution - scheduling
      • Automatic data locality
      • Utilize Ray's resource scheduling
         def preprocess(row):
             pass

         def infer(batch):
             pass

         ray.data \
             .read_binary_files("s3://my-bucket") \
             .map(preprocess) \
             .map_batches(infer, num_gpus=0.25) \
             .write_parquet("gcs://results_dir")
  25. Datasets solution - pipelining
      • Pipelining of data loading, preprocessing, and inference stages
      [Timeline diagram, t=0 to t=300s: loading, preprocessing, and inference stages, with GPU idle time marked]
         pipe: DatasetPipeline = ray.data \
             .read_binary_files("s3://bucket/image-dir") \
             .window(blocks_per_window=2) \
             .map(preprocess) \
             .map_batches(
                 BatchInferModel, batch_size=256, num_gpus=1)
         pipe.write_json("/tmp/results")
  26. Case studies - background
      • Case study 1: startup processing aerial imagery
        ◦ Problem: looking for scalable, resource-efficient batch inference
      • Case study 2: high-tech ML platform startup
        ◦ Problem: existing Pandas → Torch solution is not scalable and makes inefficient use of resources (idle GPU)
        ◦ Integration: Dask-on-Ray → Ray Datasets → Torch
  27. Case studies - benchmark results
      • Dask-on-Ray + Datasets + Torch was 5x faster than Pandas + Torch, even at small data/cluster scales
      • Benchmark: NYC Taxi dataset (5 GB subset), single r5d.4xlarge instance
  28. Case studies - benchmark results
      Same results as the previous slide, with the takeaway:
      Ray Datasets provides better resource efficiency and better performance, even at small scales!
  29. Current and Future Work
      Recently added features:
      • Lazy compute mode
      • Automatic task fusion
      • Move semantics to reduce memory consumption
      • Detailed execution statistics
      Upcoming features:
      • More data processing operations
      • Groupby and aggregation performance improvements
      • A suite of high-level preprocessing operations
      • Distributed shuffle performance improvements
      • Better out-of-core performance
  30. Summary
      Lots of room for improvement in current ML pipelines:
      • Reducing cost and performance overheads
      • Supporting ML operations like shuffling at scale
      • Ease of use
      Ray Datasets hits the mark by providing:
      • Efficient data movement between steps in a pipeline
      • Hyper-scalable implementations of ML operations
      • Simple expression of complex ML pipelines
      • ML ingest: better-shuffled data and better performance than the status quo
      • Batch inference: better efficiency and performance than the status quo
  31. Questions? Thank you
      Ray: https://github.com/ray-project/ray
      Ray Datasets: https://docs.ray.io/en/master/data/dataset.html#
      Join the Ray Discussion Forum: https://discuss.ray.io/
      We're hiring! https://jobs.lever.co/anyscale