
Unifying Preprocessing and Training at Scale with Ray Datasets

ML tasks such as distributed training and batch inference stretch the abstractions of modern data processing systems. In this talk, we'll discuss the wide-ranging problems that the Python community faces when building large-scale preprocessing and training pipelines. Some of these problems are caused by the complexity of stitching together distributed systems that weren't designed to be compatible: for example, creating a pipeline with Dask and Horovod that can efficiently use both the CPUs and GPUs in a cluster. Other problems, like per-epoch dataset shuffling, show a gap between the operations ML practitioners want and what data processing libraries can do efficiently. We'll also introduce Ray Datasets, a simple, scalable, and Pythonic way of solving these problems.

Anyscale

January 21, 2022

Transcript

  1. Unifying data preprocessing and training with Ray Datasets [PUBLIC]
     Alex Wu, Software Engineer
     Clark Zinzow, Software Engineer
  2. Challenges
     • Performance overheads
       ◦ Serialization/deserialization
       ◦ Data materialized to external storage
     • Implementation/operational complexity
       ◦ Cross-language, cross-workload
       ◦ CPUs vs. GPUs
     • Missing operations
       ◦ Per-epoch shuffling
         ▪ How to do a fast, in-memory, distributed shuffle?
  3. Why Ray?
     • Efficient data layer
       ◦ Zero-copy reads, shared-memory object store
       ◦ Locality-aware scheduling
       ◦ Object transfer protocols
     • General purpose
       ◦ Resource-based scheduling (see the sketch after this slide)
       ◦ Highly scalable
       ◦ Robust primitives
       ◦ Easy to programmatically compose distributed programs
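To make "resource-based scheduling" concrete, here is a minimal sketch using Ray's core task API (illustrative only, not taken from the deck; the function bodies are placeholders and the GPU task needs a GPU node to run): each task declares the resources it needs, and Ray places it on a node that can satisfy them, which is what lets CPU preprocessing and GPU work be composed in one program.

    import ray

    ray.init()

    # Resource-based scheduling: each task declares what it needs and Ray
    # schedules it onto a node with those resources available.
    @ray.remote(num_cpus=2)
    def preprocess(rows):
        return [r * 2 for r in rows]   # CPU-side work (placeholder)

    @ray.remote(num_gpus=1)
    def infer(rows):
        return sum(rows)               # GPU-side work (placeholder; requires a GPU node)

    # Tasks compose by passing object references; Ray handles data movement.
    result = ray.get(infer.remote(preprocess.remote([1, 2, 3])))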
  4. Ray Datasets
     01 Universal data loading
     02 Last-mile preprocessing
     03 Parallel GPU/CPU compute
     [Diagram: a ray.data.Dataset is a collection of blocks distributed across nodes (Node 1, Node 2, Node 3)]
  5. Ray Datasets
     Same three pillars and block diagram as above, with one callout: not a DataFrame library! (A short sketch of the block layout follows.)
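As a rough illustration of the block structure in the diagram above (not code from the deck; API names are as of the Ray version current around this talk), a Dataset is a list of blocks held in Ray's shared-memory object store, and the parallelism argument controls roughly how many blocks are created:

    import ray

    # A Dataset is a collection of blocks spread across the cluster's
    # shared-memory object stores, not a single in-process DataFrame.
    ds = ray.data.range(100_000, parallelism=8)  # roughly 8 blocks
    print(ds.num_blocks())   # how many blocks back this Dataset
    print(ds.take(3))        # peek at a few rows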
  6. Use case: data loading for ML training
     [Pipeline diagram]
     • ETL from unstructured data: large-scale relational data processing (Spark, Flink, Snowflake, Delta Lake, etc.)
     • Load data / last-mile preprocessing: Ray Datasets + integrations (runs on the Ray cluster)
     • ML training: Ray Train (Horovod, PyTorch, etc.) (runs on the Ray cluster)
  7. Universal data loader: I/O API and performance
     High-performance distributed I/O:
     • Leverages Apache Arrow's high-performance single-threaded I/O
     • Parallelized using Ray's high-throughput task execution
     • Scales to PiB-scale jobs in production (Amazon)

     Read from storage:
         ds = ray.data.read_parquet("s3://some/bucket")
         ds = ray.data.read_csv("/tmp/some_file.csv")

     Convert in memory:
         ds = ray.data.from_pandas(df)
         df = ds.to_pandas()
         ds = ray.data.from_dask(dask_df)
         dask_df = ds.to_dask()

     Save to storage:
         ds.write_parquet("s3://some/bucket")
         ds.write_csv("/tmp/some_file.csv")
  8. Last-mile preprocessing
     [Diagram: Input -> Mappers -> Reducers -> Output]
     • Basic transformations and aggregations
       ◦ Map, batch map, filter
       ◦ Stats aggregations
     • Global shuffle operations
       ◦ Sort
       ◦ Random shuffle
       ◦ Groupby

         ray.data.read_parquet("foo.parquet") \
             .map_batches(process_batch, batch_size=512) \
             .repartition(16) \
             .groupby("col_1").mean("col_2")

         ray.data.read_parquet("foo.parquet") \
             .filter(lambda x: x < 0) \
             .map(lambda x: x**2) \
             .random_shuffle() \
             .write_parquet("bar.parquet")
  9. Compute: pipelining CPU/GPU compute
     [Timeline diagrams comparing "without pipelining" and "with pipelining": loading, preprocessing, and inference stages from t=0 to t=300s, with GPU idle time shrinking in the pipelined case]
     Pipeline the execution of stages in your workflow:
     1. Parallelizes stage execution (lower latency)
     2. Reduces idle time of stage-specific resources (lowers cost)
  10. Windowing can use memory more efficiently
      Execute your pipeline over windows of the full dataset; this limits your cluster resource utilization (lowers cost). (See the sketch below.)
      [Diagram: processing the full dataset at once might spill to disk if the dataset is large; processing it as a pipeline of windows (Window 1, Window 2, Window 3), each handled sequentially, keeps the working data in memory]
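A minimal sketch of the windowing API described on slides 9-10 (illustrative; the bucket path and the pass-through preprocessing function are placeholders): .window() turns a Dataset into a DatasetPipeline, so later windows are loaded and preprocessed while earlier ones are being consumed, and only about a window's worth of blocks needs to be in memory at a time.

    import ray

    # Windowed execution: only ~10 blocks are materialized at once, and the
    # loading of later windows overlaps with consumption of earlier ones.
    pipe = ray.data.read_parquet("s3://some/large/dataset") \
        .window(blocks_per_window=10) \
        .map_batches(lambda batch: batch)  # placeholder preprocessing

    for batch in pipe.iter_batches(batch_size=512):
        pass  # consume batches (e.g., run inference or training)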
  11. ML ingest: overview
      Loading data into one or more model trainers.
      [Diagram: data source -> data loader -> Trainer #1 / Trainer #2 / Trainer #3]
      Primary functions:
      • Loading the data from storage
      • Partitioning the data into a shard per trainer (see the sketch after this slide)
      • Batching the data into GPU batches
      • Shuffling the data before each epoch
        ◦ Local shuffling (status quo)
        ◦ Global shuffling (better but hard)
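A bare-bones sketch of the "shard per trainer" and "GPU batches" functions listed above (the path is a placeholder, and the shuffling is omitted here; the full shuffled, pipelined versions appear on slides 17 and 19):

    import ray

    ds = ray.data.read_parquet("s3://training/data")   # placeholder path
    shards = ds.split(3, equal=True)                    # one shard per trainer

    # Each trainer consumes its own shard as GPU-sized batches.
    for batch in shards[0].iter_batches(batch_size=512):
        pass  # feed the batch to the model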
  12. Aside: why shuffle before each epoch?
      TL;DR: for certain models, per-epoch shuffling gives much better model accuracy after a set number of epochs. (A tiny illustration follows this slide.)
      • Shuffling decorrelates samples within and across batches
      • Per-epoch shuffling reduces variance, improves generalization, and prevents overfitting
      • Batches are more representative of the entire dataset, improving the estimate of the "true" full-dataset gradient
      • Gradient updates on individual samples become independent of sample ordering
      • Results in improved statistical gain from each step in the training process
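The intuition behind slide 12, as a tiny NumPy illustration (not Ray code; train_step is a placeholder name): a global per-epoch shuffle still visits every sample once per epoch, but in a fresh random order, so batch composition is decorrelated from one epoch to the next.

    import numpy as np

    num_samples, batch_size = 10_000, 512
    X = np.random.randn(num_samples, 16)  # stand-in for real features

    for epoch in range(3):
        order = np.random.permutation(num_samples)  # new global order each epoch
        for start in range(0, num_samples, batch_size):
            batch = X[order[start:start + batch_size]]
            # train_step(batch)  # placeholder for the actual gradient update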
  13. Why is scalable shuffled ML ingest hard?
      • Need parallelized, scalable reading from storage
      • Need to avoid wasting idle GPUs while reading/shuffling data
      [Timeline diagram: across Epochs 1-4, the GPU sits idle during each epoch's shuffled data loading before training can run]
  14. Why is scalable shuffled ML ingest hard? (continued)
      • The full pipeline may not fit in memory
      • Local shuffles hurt model accuracy, but…
      • Global per-epoch shuffling is difficult to do efficiently in a distributed setting
      • The API should still be simple!
  15. Datasets solution: scalable I/O
      Massively scalable parallel I/O, supporting many storage backends and formats.

          ray.data.read_parquet("s3://some/training/data/bucket", parallelism=20)

      [Diagram: S3 dataset -> read tasks 1-3 -> Trainer #1 / Trainer #2 / Trainer #3]
  16. Datasets solution: pipelining stages
      Easy and efficient pipelining of preprocessing and shuffling with training, keeping the GPU saturated.
      [Timeline diagrams comparing unpipelined vs. pipelined execution across Epochs 1-4: pipelining overlaps shuffled data loading with training and eliminates most GPU idle time]
  17. Datasets solution: per-epoch shuffling
      Efficient distributed in-memory per-epoch shuffling, with locality-aware placement of shuffle shards onto trainers.

          ray.data.read_parquet(training_data_dir) \
              .repeat(num_epochs) \
              .random_shuffle_each_window() \
              .split(
                  num_shards,
                  equal=True,
                  locality_hints=training_actors)
  18. Datasets solution: dataset windowing
      Limit data loading, preprocessing, and shuffling to a smaller window of the full dataset.
      [Diagram: one epoch split into Window 1, Window 2, Window 3, each going through shuffled data loading before reaching the trainers]
      This trades off shuffle quality against cluster size.
  19. Datasets solution: simple API
      The trainer is an ordinary Ray task:

          @ray.remote
          def trainer(data: DatasetPipeline):
              for epoch_ds in data.iter_epochs():
                  for X, y in epoch_ds.to_torch(
                          label_column="label", batch_size=512):
                      pass  # Train on batch.

      Creating a training pipeline for small data on a single node is simple:

          ds = ray.data.read_parquet(training_data_dir) \
              .repeat(num_epochs) \
              .random_shuffle_each_window()
          ray.get(trainer.remote(ds))

      Extending this pipeline to large data on a multi-node cluster is easy:

          shards = ray.data.read_parquet(training_data_dir) \
              .window(blocks_per_window=20) \
              .repeat(num_epochs) \
              .random_shuffle_each_window() \
              .split(num_shards, equal=True)
          ray.get([trainer.remote(shard) for shard in shards])
  20. Case studies: background
      • Common problem 1: the existing solution is a bottleneck in the training pipeline
      • Common problem 2: models are shuffle-sensitive, i.e., shuffle quality affects model accuracy
      • Case study 1: high-tech ML platform startup
        ◦ Existing solution: Pandas → S3 → Petastorm → Horovod (Petastorm is an ML ingest library from Uber)
        ◦ Ray's solution: Dask-on-Ray → Datasets → Horovod
      • Case study 2: large tech company in the transport space
        ◦ Existing solution: S3 → Petastorm → Horovod
        ◦ Ray's solution: S3 → Datasets → Horovod
      • Both case studies suffered from both problems!
  21. Case studies: benchmark results
      Case study 1: high-tech ML platform startup (Dask-on-Ray → Datasets → Horovod)
      • Dask-on-Ray + Datasets was 8x faster than Pandas + S3 + Petastorm, even at small data/cluster scales
      • Benchmark: NYC Taxi dataset (5 GB subset), single g4dn.4xlarge instance
      Case study 2: large transport tech company (S3 → Datasets → Horovod)
      • Datasets from S3 was 4x faster than Petastorm from S3
      • Benchmark: 1.5 TB synthetic tabular dataset, 16 nodes (40 vCPUs, 180 GB RAM each), 2 shuffle windows
      • Aggregate throughput: Petastorm 2.16 GB/s vs. Datasets 8.18 GB/s
  22. Case study: benchmark results
      User 1: ML platform startup (Dask-on-Ray → Datasets → Horovod)
      • Dask-on-Ray + Datasets was 10x faster than Pandas + S3 + Petastorm, even at small data/cluster scales
      • Benchmark: NYC Taxi dataset (5 GB subset), single g4dn.4xlarge instance
      User 2: large transport tech company (S3 → Datasets → Horovod)
      • Datasets from S3 was 4x faster than Petastorm from S3
      • Benchmark: 1.5 TB synthetic tabular dataset, 70 shuffle workers (c5.18xlarge), 16 trainers (c5.18xlarge), 3 shuffle windows
      • Throughput: Petastorm 1.8 GB/s vs. Datasets 7.38 GB/s
      Ray Datasets gives a higher-quality shuffle AND better performance, even at small scales!
  23. What is batch inference?
      Run model inference over a large number of images. Primary functions:
      • Loading data from storage
      • Minor preprocessing before inference
      • Performing model inference on batches of data (possibly on the GPU)
      • Saving the results to storage
      And do it all efficiently!
  24. What are the challenges with distributed batch inference?
      • Scheduling
        ◦ Don't needlessly transfer data around the cluster
        ◦ Only reserve GPUs for a fraction of the time
      • Need to keep the GPU saturated to keep costs down
      [Timeline diagram: the GPU sits idle during loading and preprocessing before inference runs]
  25. Datasets solution: scheduling
      • Automatic data locality
      • Utilizes Ray's resource scheduling

          def preprocess(row):
              pass

          def infer(batch):
              pass

          ray.data \
              .read_binary_files("s3://my-bucket") \
              .map(preprocess) \
              .map_batches(infer, num_gpus=0.25) \
              .write_parquet("gcs://results_dir")
  26. Datasets solution: pipelining
      • Pipelining of the data loading, preprocessing, and inference stages
      [Timeline diagram: pipelined loading, preprocessing, and inference with minimal GPU idle time]

          pipe: DatasetPipeline = ray.data \
              .read_binary_files("s3://bucket/image-dir") \
              .window(blocks_per_window=2) \
              .map(preprocess) \
              .map_batches(BatchInferModel, batch_size=256,
                           compute="actors", num_gpus=1)  # class-based UDFs run in an actor pool
          pipe.write_json("/tmp/results")
  27. Case studies: background
      • Case study 1: startup processing aerial imagery
        ◦ Problem: looking for scalable, resource-efficient batch inference
      • Case study 2: high-tech ML platform startup
        ◦ Problem: the existing Pandas → Torch solution was not scalable and used resources inefficiently (idle GPU)
        ◦ Integration: Dask-on-Ray → Ray Datasets → Torch
  28. Case studies: benchmark results
      • Dask-on-Ray + Datasets + Torch was 5x faster than Pandas + Torch, even at small data/cluster scales
      • Benchmark: NYC Taxi dataset (5 GB subset), single r5d.4xlarge instance
  29. Case studies: benchmark results (continued)
      • Dask-on-Ray + Datasets + Torch was 5x faster than Pandas + Torch, even at small data/cluster scales
      • Benchmark: NYC Taxi dataset (5 GB subset), single r5d.4xlarge instance
      Ray Datasets provides better resource efficiency and better performance, even at small scales!
  30. Summary
      Lots of room for improvement in current ML pipelines:
      • Reducing cost and performance overheads
      • Supporting ML operations like shuffling at scale
      • Ease of use
      Ray Datasets hits the mark by providing:
      • Efficient data movement between steps in a pipeline
      • Hyper-scalable implementations of ML operations
      • Simple expression of complex ML pipelines
      • ML ingest: better-shuffled data and better performance than the status quo
      • Batch inference: better efficiency and performance than the status quo
  31. Questions? Thank you!
      • Ray: https://github.com/ray-project/ray
      • Ray Datasets: https://docs.ray.io/en/master/data/dataset.html#
      • Join the Ray discussion forum: https://discuss.ray.io/
      • We're hiring! https://jobs.lever.co/anyscale