
Ray Datasets: Scalable data preprocessing for distributed ML

Anyscale
February 23, 2022


Ray Datasets is a Ray-native distributed dataset library that serves as the standard way to load, process, and exchange data in Ray libraries and applications. It features performant distributed data loading, flexible parallel compute operations, and comprehensive datasource compatibility and distributed framework integrations, all behind an incredibly simple API.

Get a first-hand look at how Ray Datasets:
- Provides hyper-scalable parallel I/O to the most popular storage backends and file formats
- Supports common last-mile preprocessing operations, including basic parallel data transformations such as map, batched map, and filter, and global operations such as sort, shuffle, groupby, and stats aggregations
- Efficiently integrates with data processing libraries (e.g., Spark, Pandas, NumPy, Dask, Mars) and machine learning frameworks (e.g., TensorFlow, Torch, Horovod)


Transcript

  1. Ray Datasets: Scalable Data
    Preprocessing for Distributed ML
    Clark Zinzow, Software Engineer

  2. Overview
    01  State of ML Training/Scoring Pipelines
    02  Introducing Ray Datasets
    03  Use Cases

  3. Existing ML Pipelines
    (Diagram: Preprocess → Checkpoint → Train)

  4. Existing ML Pipelines

  5. Challenges
    ● Performance overheads
      ○ Serialization/deserialization
      ○ Data materialized to external storage
    ● Implementation/operational complexity
      ○ Cross-language, cross-workload
      ○ CPUs vs. GPUs
    ● Missing operations
      ○ Per-epoch shuffling
        ■ How to do a fast, in-memory, distributed shuffle?

  6. Built on Ray
    Ray Datasets

  7. Ray Datasets
    01  Universal Data Loading
    02  Last-Mile Preprocessing
    03  Parallel GPU/CPU Compute
    (Diagram: a ray.data.Dataset is a collection of blocks distributed across the cluster, e.g. Node 1 holding one block, Node 2 two blocks, and Node 3 one block.)

  8. Ray Datasets
    01  Universal Data Loading
    02  Last-Mile Preprocessing
    03  Parallel GPU/CPU Compute
    (Same block diagram as the previous slide; a code sketch of the block model follows below.)
    Not a DataFrame library!
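
    To make the block model concrete, here is a minimal sketch (not from the deck; it assumes a working Ray installation and uses the standard ray.data.range and Dataset.num_blocks APIs) of a Dataset being partitioned into blocks:

        import ray

        ray.init()

        # Create a Dataset of 10,000 integers split into 16 blocks, which Ray
        # distributes across the nodes of the cluster.
        ds = ray.data.range(10_000, parallelism=16)

        print(ds.num_blocks())  # 16
        print(ds.take(5))       # [0, 1, 2, 3, 4]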

  9. Use case: Data loading for ML Training
    (Diagram: ETL from unstructured data / large-scale relational data processing (Spark, Flink, Snowflake, Delta Lake, etc.) → data loading / last-mile preprocessing (Ray Datasets + integrations) → ML training (Ray Train: Horovod, PyTorch, etc.), running on a Ray cluster.)

  10. Universal Data Loader
    Powered by ray.data.Dataset

  11. Universal Data Loader - Supported Data Sources

  12. Universal Data Loader - I/O API and Performance
    High-performance distributed IO:
    • Leverages Apache Arrow’s high-performance single-threaded IO
    • Parallelized using Ray’s high-throughput task execution
    • Scales to PiB-scale jobs in production (Amazon)

    Read from storage:
      ds = ray.data.read_parquet("s3://some/bucket")
      ds = ray.data.read_csv("/tmp/some_file.csv")

    Convert in memory:
      ds = ray.data.from_pandas(df)
      df = ds.to_pandas()
      ds = ray.data.from_dask(dask_df)
      dask_df = ds.to_dask()

    Saving to storage:
      ds.write_parquet("s3://some/bucket")
      ds.write_csv("/tmp/some_file.csv")

  13. Last-Mile Preprocessing
    (Diagram: input → mappers → reducers → output)
    • Basic transformations and aggregations
      • Map, batch map, filter
      • Stats aggregations
    • Global shuffle operations
      • Sort
      • Random shuffle
      • Groupby

      ray.data.read_parquet("foo.parquet") \
          .map_batches(process_batch, batch_size=512) \
          .repartition(16) \
          .groupby("col_1").mean("col_2")

      ray.data.read_parquet("foo.parquet") \
          .filter(lambda x: x < 0) \
          .map(lambda x: x**2) \
          .random_shuffle() \
          .write_parquet("bar.parquet")

  14. Compute - Pipelining CPU/GPU compute
    Pipeline the execution of stages in your workflow:
    1. Parallelizes stage execution (lower latency)
    2. Reduces idle time of stage-specific resources (lowers cost)
    (Diagram: timelines from t=0 to t=300s of the loading, preprocessing, and inference stages. Without pipelining, the GPU sits idle while loading and preprocessing run; with pipelining, the stages overlap and GPU idle time shrinks. A rough code sketch of pipelining follows below.)
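
    As a rough sketch of what this pipelining looks like in code (not from the deck; the preprocess and infer functions are hypothetical placeholders and the window size is arbitrary), window() turns a Dataset into a DatasetPipeline whose stages execute concurrently over windows of blocks:

        import ray

        ray.init()

        def preprocess(row):
            # Hypothetical CPU-side transformation.
            return row

        def infer(batch):
            # Hypothetical model inference over a batch of rows (typically on GPU).
            return batch

        # Loading, preprocessing, and inference overlap across windows instead
        # of each stage running to completion before the next starts.
        pipe = ray.data.range(100_000) \
            .window(blocks_per_window=10) \
            .map(preprocess) \
            .map_batches(infer, batch_size=256)

        for batch in pipe.iter_batches():
            pass  # consume inference results as they stream out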

  15. Windowing can use memory more efficiently
    Execute your pipeline over windows of the full dataset.
    This limits your cluster resource utilization (lowers cost).
    (Diagram: without pipelining, a single pipeline over the full dataset might spill to disk if the dataset is large; with pipelining, Windows 1-3 are each processed sequentially by their own pipeline, which allows us to keep all data in memory.)

  16. Why Ray?
    ● Efficient data layer
      ○ Zero-copy reads, shared-memory object store (see the sketch below)
      ○ Locality-aware scheduling
      ○ Object transfer protocols
    ● General purpose
      ○ Resource-based scheduling
      ○ Highly scalable
      ○ Robust primitives
      ○ Easy to programmatically compose distributed programs
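
    As a tiny, hedged illustration of the zero-copy shared-memory object store mentioned above (not from the deck; it uses only core Ray APIs), a large NumPy array is stored once per node and read by tasks without copying:

        import numpy as np
        import ray

        ray.init()

        # The array is stored once in the node's shared-memory object store.
        arr_ref = ray.put(np.zeros((1000, 1000)))

        @ray.remote
        def column_means(a):
            # Tasks on the same node receive a read-only, zero-copy view of the array.
            return a.mean(axis=0)

        print(ray.get(column_means.remote(arr_ref)).shape)  # (1000,)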

  17. Use Cases
    ● Scalable shuffled ML ingest
    ● Efficient batch inference

  18. Background
    Scalable Shuffled ML Ingest

  19. ML Ingest - Overview
    Loading data into one or more model trainers.
    (Diagram: data source → data loader → Trainers #1, #2, #3)
    Primary functions:
    • Loading the data from storage
    • Partitioning the data into a shard per trainer
    • Batching the data into GPU batches
    • Shuffling the data before each epoch
      • Local shuffling (status quo)
      • Global shuffling (better but hard)
    (A minimal code sketch of these functions follows below.)
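
    A minimal, hedged sketch of those primary functions in Ray Datasets terms (not from the deck; the S3 path, the "label" column, and the shard count are hypothetical, and efficient per-epoch shuffling is covered on the following slides):

        import ray

        ray.init()

        # Load the data from storage (hypothetical bucket).
        ds = ray.data.read_parquet("s3://bucket/training-data")

        # Shuffle, then partition the data into one shard per trainer.
        shards = ds.random_shuffle().split(3, equal=True)

        @ray.remote
        def train(shard):
            # Batch the shard into GPU-sized Torch batches.
            for X, y in shard.to_torch(label_column="label", batch_size=512):
                pass  # train on the batch
            return "done"

        ray.get([train.remote(shard) for shard in shards])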

  20. Aside - Why shuffle before each epoch?
    TL;DR: For certain models, per-epoch shuffling gives you much better model accuracy after a set number of epochs.
    • Shuffling decorrelates samples within and across batches
    • Per-epoch shuffling reduces variance, improves generalization, and prevents overfitting
      • Batches are more representative of the entire dataset, improving the estimate of the “true” full-dataset gradient
      • Gradient updates on individual samples are independent of sample ordering
    • Results in improved statistical gain of each step in the training process

  21. Why is scalable shuffled ML ingest hard?
    • Need parallelized, scalable reading from storage
    • Need to avoid wasting idle GPUs while reading/shuffling data
    (Diagram: timeline from t=0 to t=200s across Epochs 1-4; the GPU sits idle during shuffled data loading before each epoch’s training.)

  22. Why is scalable shuffled ML ingest hard? - continued
    • The full pipeline may not fit in memory
    • Local shuffles hurt model accuracy, but…
    • Global per-epoch shuffling is difficult to do efficiently in a distributed setting
    • The API should still be simple!

  23. Datasets solution - scalable IO
    Massively scalable parallel IO, supporting many storage backends and formats.

      ray.data.read_parquet("s3://some/training/data/bucket", parallelism=20)

    (Diagram: read tasks 1-3 load the S3 dataset in parallel and feed Trainers #1-#3.)

  24. Datasets solution - pipelining stages
    Easy and efficient pipelining of preprocessing and shuffling with training, keeping the GPU saturated.
    (Diagram: timelines from t=0 to t=200s across Epochs 1-4. Unpipelined: the GPU sits idle during shuffled data loading before each epoch’s training. Pipelined: shuffled data loading overlaps with training, so GPU idle time shrinks.)

  25. Datasets solution - per-epoch shuffling
    Efficient distributed in-memory per-epoch shuffling, with locality-aware placement of shuffle shards onto trainers.

      ray.data.read_parquet(training_data_dir) \
          .repeat(num_epochs) \
          .random_shuffle_each_window() \
          .split(
              num_shards, equal=True,
              locality_hints=training_actors)

  26. Datasets solution - dataset windowing
    Limit data loading, preprocessing, and shuffling to a smaller window of the full dataset.
    This trades off shuffle quality against cluster size.
    (Diagram: within one epoch, Windows 1-3 each go through shuffled data loading and are fed to the trainers in turn.)

  27. Datasets solution - simple API
    Creating a training pipeline for small data on a single node is simple:

      @ray.remote
      def trainer(data: DatasetPipeline):
          for epoch_ds in data.iter_epochs():
              for X, y in epoch_ds.to_torch(
                      label_column="label",
                      batch_size=512):
                  ...  # Train on batch.

      ds = ray.data.read_parquet(training_data_dir) \
          .repeat(num_epochs) \
          .random_shuffle_each_window()
      ray.get(trainer.remote(ds))

    Extending this pipeline to large data on a multi-node cluster is easy:

      shards = ray.data.read_parquet(training_data_dir) \
          .window(blocks_per_window=20) \
          .repeat(num_epochs) \
          .random_shuffle_each_window() \
          .split(num_shards, equal=True)
      ray.get([trainer.remote(shard) for shard in shards])

  28. Case studies
    Scalable Shuffled ML Ingest

  29. Case studies - background
    • Common problem 1: the existing solution is a bottleneck in the training pipeline
    • Common problem 2: models are shuffle-sensitive, where shuffle quality affects model accuracy
    • Case study 1: high-tech ML platform startup
      • Existing solution: Pandas → S3 → Petastorm → Horovod
      • Petastorm: ML ingest library from Uber
      • Ray’s solution: Dask-on-Ray → Datasets → Horovod
    • Case study 2: large tech company in the transport space
      • Existing solution: S3 → Petastorm → Horovod
      • Ray’s solution: S3 → Datasets → Horovod
    • Both case studies suffered from both problems!

  30. Case studies - benchmark results
    Case Study 1: high-tech ML platform startup
    Dask-on-Ray → Datasets → Horovod
    • The Dask-on-Ray + Datasets pipeline was 8x faster than Pandas + S3 + Petastorm, even on small data/cluster scales
    • Benchmark: NYC Taxi dataset (5 GB subset), single g4dn.4xlarge instance

    Case Study 2: large transport tech company
    S3 → Datasets → Horovod
    • Datasets from S3 was 4x faster than Petastorm from S3
    • Benchmark: 1.5 TB synthetic tabular dataset, 16 nodes (40 vCPUs, 180 GB RAM), 2 shuffle windows
    • Aggregate throughput: Petastorm 2.16 GB/s vs. Datasets 8.18 GB/s

  31. Case study - benchmark results
    User 1: ML platform startup
    Dask-on-Ray → Datasets → Horovod
    • The Dask-on-Ray + Datasets pipeline was 10x faster than Pandas + S3 + Petastorm, even on small data/cluster scales
    • Benchmark: NYC Taxi dataset (5 GB subset), single g4dn.4xlarge instance

    User 2: large transport tech company
    S3 → Datasets → Horovod
    • Datasets from S3 was 4x faster than Petastorm from S3
    • Benchmark: 1.5 TB synthetic tabular dataset, 70 shuffle workers (c5.18xlarge), 16 trainers (c5.18xlarge), 3 shuffle windows
    • Throughput: Petastorm 1.8 GB/s vs. Datasets 7.38 GB/s

    Ray Datasets gives a higher-quality shuffle AND better performance, even at small scales!

  32. We’re actively working on more
    comprehensive, open benchmarks at
    large data scales
    Stay tuned!

  33. Background
    Efficient Batch Inference

  34. What is batch inference?
    Run model inference over a large number of images.
    Primary functions:
    ● Loading data from storage
    ● Minor preprocessing before inference
    ● Performing model inference on batches of data (possibly on the GPU)
    ● Saving the result to storage
    And do it all efficiently!

  35. What are the challenges with distributed batch inference?
    • Scheduling
      • Don’t needlessly transfer data around the cluster
      • Only reserve GPUs for a fraction of the time
    • Need to keep the GPU saturated to keep costs down
    (Diagram: timeline from t=0 to t=300s of loading, preprocessing, and inference; the GPU sits idle while loading and preprocessing run.)

  36. Datasets solution - scheduling
    • Automatic data locality
    • Utilize Ray’s resource scheduling

      def preprocess(row):
          pass

      def infer(batch):
          pass

      ray.data \
          .read_binary_files("s3://my-bucket") \
          .map(preprocess) \
          .map_batches(infer, num_gpus=0.25) \
          .write_parquet("gcs://results_dir")

  37. Datasets solution - pipelining
    • Pipelining of the data loading, preprocessing, and inference stages
    (Diagram: timeline from t=0 to t=300s; pipelining overlaps loading, preprocessing, and inference, shrinking GPU idle time.)

      pipe: DatasetPipeline = ray.data \
          .read_binary_files("s3://bucket/image-dir") \
          .window(blocks_per_window=2) \
          .map(preprocess) \
          .map_batches(BatchInferModel, batch_size=256, num_gpus=1)
      pipe.write_json("/tmp/results")

  38. Case studies
    Efficient Batch Inference

  39. Case studies - background
    • Case study 1: startup processing aerial imagery
      • Problem: looking for scalable, resource-efficient batch inference
    • Case study 2: high-tech ML platform startup
      • Problem: the existing Pandas → Torch solution was not scalable and used resources inefficiently (idle GPU)
      • Integration: Dask-on-Ray → Ray Datasets → Torch

  40. Case studies - benchmark results
    • Dask-on-Ray + Datasets + Torch was 5x faster than Pandas + Torch, even on small data/cluster scales
    • Benchmark: NYC Taxi dataset (5 GB subset), single r5d.4xlarge instance

  41. Case studies - benchmark results
    • Dask-on-Ray + Datasets + Torch was 5x faster than Pandas + Torch, even on small data/cluster scales
    • Benchmark: NYC Taxi dataset (5 GB subset), single r5d.4xlarge instance

    Ray Datasets provides better resource efficiency and better performance, even at small scales!

  42. Current and Future Work
    Recently added features:
    • Lazy compute mode
    • Automatic task fusion
    • Move semantics to reduce memory consumption
    • Detailed execution statistics (see the sketch below)
    Upcoming features:
    • More data processing operations
    • Groupby and aggregation performance improvements
    • A suite of high-level preprocessing operations
    • Distributed shuffle performance improvements
    • Better out-of-core performance
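
    As a small illustration of the execution-statistics feature (a hedged sketch, not from the deck; it assumes the Dataset.stats() API exposed by recent Ray releases at the time of this talk):

        import ray

        ray.init()

        # Run a small two-stage job, then inspect per-stage execution statistics.
        ds = ray.data.range(100_000).map_batches(lambda batch: batch)
        print(ds.stats())  # per-stage wall-time and output-size summary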

  43. Summary
    Lots of room for improvement in current ML pipelines:
    ● Reducing cost and performance overheads
    ● Supporting ML operations like shuffling at scale
    ● Ease of use
    Ray Datasets hits the mark by providing:
    ● Efficient data movement between steps in a pipeline
    ● Hyper-scalable implementations of ML operations
    ● Simple expression of complex ML pipelines
    ● ML ingest: better shuffled data and better performance than the status quo
    ● Batch inference: better efficiency and performance than the status quo

  44. Ray: https://github.com/ray-project/ray
    Ray Datasets: https://docs.ray.io/en/master/data/dataset.html#
    Join the Ray Discussion Forum: https://discuss.ray.io/
    We’re hiring! https://jobs.lever.co/anyscale
    Questions?
    Thank you
