Streaming distributed execution across CPUs and GPUs

Some of the most demanding machine learning (ML) use cases we have encountered involve pipelines that span both CPU and GPU devices in distributed environments. Common workloads include:

* Batch inference, which involves a CPU-intensive preprocessing stage (e.g., video decoding or image resizing) before a GPU-intensive model makes predictions.
* Distributed training, where similar CPU-heavy transformations are required to prepare or augment the dataset prior to GPU training.

In this talk, we examine how Ray Data streaming works and how to use it in your own machine learning pipelines to address these common workloads, utilizing all of your compute resources (CPUs and GPUs) at scale.

Anyscale

June 22, 2023

Transcript

  1. Streaming distributed execution
    across CPUs and GPUs
    June 21, 2023
    Eric Liang
    Email: [email protected]

  2. Talk Overview
    ● ML Inference and Training workloads
    ● How it relates to Ray Data streaming (new feature in Ray 2.4!)
    ● Examples
    ● Backend overview

  3. About me
    ● Technical lead for Ray / OSS at Anyscale
    ● Previously:
    ○ PhD in systems / ML at Berkeley
    ○ Staff eng @ Databricks, storage infra @ Google

  4. ML workloads and Data

  5. ML Workloads
    ● Where does data processing come in for ML workloads?
    ETL Pipeline → Preprocessing → Training / Inference

  6. ML Workloads
    ETL Pipeline: ingesting latest data, joining data tables
    Preprocessing: resizing images, decoding videos, data augmentation
    Training / Inference: using PyTorch, TF, HuggingFace, LLMs, etc.

  7. ML Workloads
    ETL Pipeline (CPU): ingesting latest data, joining data tables
    Preprocessing (CPU): resizing images, decoding videos, data augmentation
    Training / Inference (GPU): using PyTorch, TF, HuggingFace, LLMs, etc.

  8. ML Workloads
    ETL Pipeline (CPU): ingesting latest data, joining data tables
    Preprocessing (CPU): resizing images, decoding videos, data augmentation
    Training / Inference (GPU): using PyTorch, TF, HuggingFace, LLMs, etc.
    (Usual scope of ML teams: preprocessing and training / inference.)

  9. Our vision for simplifying this with Ray
    ETL Pipeline → Preprocessing → Training / Inference

  10. Our vision for simplifying this with Ray
    ETL Pipeline → Preprocessing → Training / Inference

  11. Our vision for simplifying this with Ray
    ETL Pipeline → Preprocessing → Training / Inference

  12. Ray Data: Overview

  13. Ray Data overview

  14. Ray Data overview

  15. Ray Data overview
    ray.data.Dataset
    (Diagram: a Dataset is a distributed collection of data blocks; Node 1, Node 2, and Node 3 each hold one or more blocks.)
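    Not on the slide: a minimal sketch of inspecting a Dataset's blocks from the API side, assuming a small synthetic dataset created with ray.data.range (the numbers are illustrative):

      import ray

      # Create a small in-memory Dataset; Ray partitions it into blocks
      # distributed across the cluster's object store.
      ds = ray.data.range(1000)

      print(ds.num_blocks())   # number of blocks backing this Dataset
      print(ds.schema())       # column names and types
      print(ds.take(5))        # peek at a few rows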

  16. Ray Data overview
    Powered by
    ray.data.Dataset

  17. Ray Data overview
    High-performance distributed IO:
    ● Leverages Apache Arrow's high-performance single-threaded IO
    ● Parallelized using Ray's high-throughput task execution
    ● Scales to PiB-scale jobs in production (Amazon)
    Read from storage:
    ds = ray.data.read_parquet("s3://some/bucket")
    ds = ray.data.read_csv("/tmp/some_file.csv")
    Transform data:
    ds = ds.map_batches(batch_func)
    ds = ds.map(func)
    Consume data:
    ds.iter_batches() -> Iterator
    ds.write_parquet("s3://some/bucket")
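    Not from the slide: a minimal end-to-end sketch combining the three steps above; the bucket paths and the "value" column are illustrative placeholders:

      import ray

      def add_one(batch):
          # batch is a dict of NumPy arrays when batch_format="numpy"
          batch["value"] = batch["value"] + 1
          return batch

      ds = ray.data.read_parquet("s3://some/bucket")        # read from storage
      ds = ds.map_batches(add_one, batch_format="numpy")    # transform data

      for batch in ds.iter_batches(batch_size=1024):        # consume data in batches...
          pass

      ds.write_parquet("s3://some/other/bucket")            # ...or write it back out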

  18. Ray Data Streaming

  19. Bulk execution
    Previous versions of Ray Data (<2.4) used a bulk execution strategy
    What is bulk execution?
    ● Load all data into memory
    ● Apply transformations on in-memory data in bulk
    ● Out of memory? -> spill blocks to disk
    ● Similar to Spark's execution model (bulk synchronous parallel)

  20. Streaming (pipelined) execution
    ● Default execution strategy for Ray Data in 2.4
    ● Same data transformations API
    ● Instead of executing operations in bulk, build a pipeline of
    operators
    ● Data blocks are streamed through operators, reducing memory
    use and avoiding spilling to disk
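    Not on the slide: a small sketch showing that the same Dataset code runs under the streaming executor when you iterate over it (the preprocess function is a placeholder); ds.stats() afterwards reports per-operator execution statistics:

      import ray

      def preprocess(batch):
          # placeholder CPU-side transformation
          return batch

      ds = ray.data.read_parquet("s3://some/bucket").map_batches(preprocess)

      # Iteration drives the streaming executor: blocks flow through the
      # operators as they are produced instead of being materialized in bulk.
      for batch in ds.iter_batches(batch_size=256):
          pass

      print(ds.stats())  # per-operator timing and memory report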

  21. Preprocessing can often be the bottleneck
    ● Example: video decoding prior to inference / training
    ● Too expensive to run on just GPU nodes: needs scaling out
    ● Large intermediate data: uses lots of memory

  22. Ray Data streaming avoids the bottleneck
    ● E.g., intermediate video frames streamed through memory
    ● Decoding can be offloaded onto CPU nodes from GPU nodes
    ● Intermediate frames kept purely in (cluster) memory

  23. Inference <> Training
    ● Same streaming pipeline can easily be used for training too!

  24. Inference <> Training
    ● Same streaming pipeline can easily be used for training too!
    (Diagram: a Split[3] operator feeding Worker [0], Worker [1], and Worker [2].)

  25. Streaming performance
    benefits deep dive

  26. Performance benefits overview
    Comparing bulk execution vs. streaming execution for:
    ● CPU-only pipelines (single-stage)
    ● Heterogeneous CPU+GPU pipelines (multi-stage)

  27. Performance benefits overview
    CPU-only pipelines (single-stage):
    ● Bulk execution: memory optimal; good for inference; bad for training

  28. Performance benefits overview
    CPU-only pipelines (single-stage):
    ● Bulk execution: memory optimal; good for inference; bad for training
    ● Streaming execution: memory optimal; good for inference; good for training

  29. Performance benefits overview
    CPU-only pipelines (single-stage):
    ● Bulk execution: memory optimal; good for inference; bad for training
    ● Streaming execution: memory optimal; good for inference; good for training
    Heterogeneous CPU+GPU pipelines (multi-stage):
    ● Bulk execution: memory inefficient; slower for inference; bad for training

  30. Performance benefits overview
    CPU-only pipelines (single-stage):
    ● Bulk execution: memory optimal; good for inference; bad for training
    ● Streaming execution: memory optimal; good for inference; good for training
    Heterogeneous CPU+GPU pipelines (multi-stage):
    ● Bulk execution: memory inefficient; slower for inference; bad for training
    ● Streaming execution: memory optimal; good for inference; good for training

  31. In more detail: simple batch inference job
    Logical data flow:

  32. A simple batch inference job

  33. A simple batch inference job

  34. A simple batch inference job

  35. A simple batch inference job

  36. A simple batch inference job

  37. A simple batch inference job

  38. This is a single stage pipeline
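    The code on slides 32-37 isn't captured in the transcript; a minimal sketch of a single-stage batch inference job of this shape, with the model class, paths, and pool sizes as illustrative placeholders:

      import numpy as np
      import ray

      class Model:
          def __init__(self):
              # load model weights once per actor
              pass

          def __call__(self, batch):
              # replace with a real forward pass; here we just emit a dummy column
              batch["pred"] = np.zeros(len(batch["image"]))
              return batch

      ds = ray.data.read_images("s3://some/images")
      ds = ds.map_batches(Model,
                          compute=ray.data.ActorPoolStrategy(min_size=1, max_size=4),
                          batch_format="numpy")
      ds.write_parquet("s3://some/output")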

  39. Bulk physical execution
    (Diagram: output 1, output 2, output 3.)

  40. Bulk physical execution -- single stage
    ● Memory usage is optimal (no intermediate data)
    ● Good for inference
    ● Not good for distributed training (cannot consume results
    incrementally)

  41. Streaming physical execution
    (Diagram: Operator (Stage 1); data partition 1, data partition 2, data partition 3.)

  42. Streaming physical execution
    (Diagram: Operator (Stage 1); data partition 1, data partition 2, data partition 3.)

  43. Streaming physical execution
    (Diagram: Operator (Stage 1); data partitions 1-3; output 1.)

  44. Streaming physical execution
    (Diagram: Operator (Stage 1); data partitions 1-3; output 1.)

  45. Streaming physical execution
    (Diagram: Operator (Stage 1); data partitions 1-3; output 1, output 2.)

  46. Streaming physical execution
    (Diagram: Operator (Stage 1); data partitions 1-3; output 1, output 2.)

  47. Streaming physical execution
    (Diagram: Operator (Stage 1); data partitions 1-3; output 1, output 2, output 3.)

  48. Streaming physical execution -- single stage
    ● Memory usage is optimal (no intermediate data)
    ● Good for inference
    ● Good for distributed training

  49. In more detail: multi-stage (heterogeneous) pipeline
    (Diagram: logical data flow; the final stage runs on GPU.)

  50. Heterogeneous pipeline (CPU + GPU)

  51. Heterogeneous pipeline (CPU + GPU)

  52. Heterogeneous pipeline (CPU + GPU)
    GPU
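    Not captured from the slide: a minimal sketch of such a heterogeneous pipeline, where the CPU stage is a plain function executed as Ray tasks and the GPU stage is a class executed on an actor pool with num_gpus=1 (the function, class, and paths are illustrative):

      import ray

      def decode_and_resize(batch):
          # CPU-heavy preprocessing, runs as Ray tasks (can scale onto CPU-only nodes)
          return batch

      class GpuModel:
          def __init__(self):
              # load the model onto the GPU once per actor
              pass

          def __call__(self, batch):
              # GPU inference
              return batch

      ds = (
          ray.data.read_binary_files("s3://some/videos")
          .map_batches(decode_and_resize)                                   # CPU stage
          .map_batches(GpuModel,
                       compute=ray.data.ActorPoolStrategy(min_size=1, max_size=2),
                       num_gpus=1,                                          # GPU stage
                       batch_size=64)
      )
      ds.write_parquet("s3://some/output")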

  53. Comparing bulk vs. streaming execution

  54. Bulk physical execution

  55. Bulk physical execution

  56. Bulk physical execution

  57. Bulk physical execution -- multi stage
    ● Memory usage is inefficient (disk spilling)
    ● Slower for inference
    ● Bad for distributed training

  58. Streaming physical execution
    Operator (Stage 1) Operator (Stage 2) Operator (Stage 3)

  59. Streaming physical execution
    Operator (Stage 1) Operator (Stage 2) Operator (Stage 3)

  60. Streaming physical execution
    Operator (Stage 1) Operator (Stage 2) Operator (Stage 3)

  61. Streaming physical execution
    Operator (Stage 1) Operator (Stage 2) Operator (Stage 3)

  62. Streaming physical execution
    Operator (Stage 1) Operator (Stage 2) Operator (Stage 3)

  63. Streaming physical execution
    Operator (Stage 1) Operator (Stage 2) Operator (Stage 3)

  64. Streaming physical execution -- multi stage
    ● Memory usage is optimal (no intermediate data)
    ● Good for inference
    ● Good for distributed training

  65. Comparison to other systems
    ● DataFrame systems (e.g., Spark)
    ○ Ray Data streaming is more memory efficient
    ○ Ray Data supports heterogeneous clusters
    ○ Execution model a better fit for distributed training
    ● ML ingest libraries: TF Data / Torch Data / Petastorm
    ○ Ray Data supports scaling preprocessing out to a cluster

  66. Video Inference Example

  67. Four stage pipeline
    Logical data flow:

  68. Four stage pipeline

  69. Four stage pipeline

  70. Four stage pipeline

  71. Putting it together

  72. Putting it together

  73. Putting it together

  74. Putting it together

  75. Putting it together
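    The code on slides 68-75 isn't captured in the transcript; a rough sketch of how the four stages might be composed, reusing the operator names that appear in the execution plan on the following slides (the function and class bodies are placeholders):

      import ray

      def decode_frames(batch):
          # CPU stage: decode video bytes into frames (runs as Ray tasks)
          return batch

      class FrameAnnotator:
          def __call__(self, batch):
              # CPU stage: annotate frames (runs on an actor pool)
              return batch

      class FrameClassifier:
          def __init__(self):
              # load the classification model onto the GPU
              pass

          def __call__(self, batch):
              # GPU stage: classify frames
              return batch

      (
          ray.data.read_binary_files("s3://some/videos")
          .map_batches(decode_frames)
          .map_batches(FrameAnnotator,
                       compute=ray.data.ActorPoolStrategy(min_size=1, max_size=8))
          .map_batches(FrameClassifier,
                       compute=ray.data.ActorPoolStrategy(min_size=1, max_size=2),
                       num_gpus=1, batch_size=64)
          .write_parquet("s3://some/output")
      )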

  76. Streaming execution plan

  77. Running this on a Ray cluster
    $ python workload.py

  78. Running this on a Ray cluster
    $ python workload.py
    2023-05-02 15:10:01,105 INFO streaming_executor.py:91 -- Executing DAG
    InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(decode_frames)] ->
    ActorPoolMapOperator[MapBatches(FrameAnnotator)] ->
    ActorPoolMapOperator[MapBatches(FrameClassifier)] ->
    TaskPoolMapOperator[Write]

  79. Running this on a Ray cluster
    $ python workload.py
    2023-05-02 15:10:01,105 INFO streaming_executor.py:91 -- Executing DAG
    InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(decode_frames)] ->
    ActorPoolMapOperator[MapBatches(FrameAnnotator)] ->
    ActorPoolMapOperator[MapBatches(FrameClassifier)] ->
    TaskPoolMapOperator[Write]
    Running: 25.0/112.0 CPU, 2.0/2.0 GPU, 33.19 GiB/32.27 GiB object_store_memory:
    28%|███▍ | 285/1000 [01:40<03:12, 3.71it/s]

  80. Running this on a Ray cluster
    $ python workload.py
    2023-05-02 15:10:01,105 INFO streaming_executor.py:91 -- Executing DAG
    InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(decode_frames)] ->
    ActorPoolMapOperator[MapBatches(FrameAnnotator)] ->
    ActorPoolMapOperator[MapBatches(FrameClassifier)] ->
    TaskPoolMapOperator[Write]
    Running: 0.0/112.0 CPU, 0.0/2.0 GPU, 0.0 GiB/32.27 GiB object_store_memory:
    100%|████████████| 1000/1000 [05:13<00:00, 3.85it/s]

  81. Ray dashboard observability: active tasks

  82. Ray dashboard observability: actor state

  83. Ray dashboard observability: network

  84. Backend overview

  85. How does Ray Data streaming work?
    ● Each transformation is implemented as an operator
    ● Use Ray tasks and actors for execution of operators
    ○ default -> use Ray tasks
    ○ actor pool -> use Ray actors
    ● Intermediate data blocks in Ray object store
    ● Memory usage of operators is limited (backpressure) to enable
    efficient streaming without spilling to disk
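    Not on the slide: a minimal illustration of the task vs. actor mapping described above; a plain function defaults to Ray tasks, while a class passed with compute=ActorPoolStrategy runs on a pool of Ray actors (names and sizes are illustrative):

      import ray

      def stateless_fn(batch):
          # default: executed by Ray tasks
          return batch

      class StatefulFn:
          def __init__(self):
              self.state = "expensive setup, done once per actor"

          def __call__(self, batch):
              return batch

      ds = ray.data.read_parquet("s3://some/bucket")
      ds = ds.map_batches(stateless_fn)                 # task-based operator
      ds = ds.map_batches(StatefulFn,                   # actor-pool operator
                          compute=ray.data.ActorPoolStrategy(min_size=2, max_size=4))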

  86. Advantages of using Ray core primitives
    ● Heterogeneous cluster support
    ● Fault tolerance out of the box
    ○ Lineage-based reconstruction of task and actor operations
    → your ML job will survive failures during preprocessing
    ● Resilient object store layer
    ○ Spills to disk in case of unexpectedly high memory usage:
    slowdown instead of a crash
    ○ Can also do large-scale shuffles in a pinch
    ● Easy to add data locality optimizations for both task + actor ops

  87. Scalability
    ● Stress tests on a 20TiB array dataset
    ● 500 machines

  88. Training
    ● Can use pipelines for training as well
    ● Swap map_batches(Model) call for streaming_split(K)
    Inference:
    ray.data.read_datasource(...) \
        .map_batches(preprocess) \
        .map_batches(Model,
                     compute=ActorPoolStrategy(...)) \
        .write_datasource(...)

    Training:
    iters = ray.data.read_datasource(...) \
        .map_batches(preprocess) \
        .streaming_split(len(workers))

    for i, w in enumerate(workers):
        w.set_data_iterator.remote(iters[i])

    ## in worker
    for batch in it.iter_batches(batch_size=32):
        model.forward(batch)...

  89. Advantages of streaming for Training
    ● Example: accelerating an expensive read/preprocessing operation by adding CPU nodes to the cluster

  90. Summary
    ● Ray Data streaming scales batch inference and training workloads
    ● More efficient computation model than bulk processing
    ● Simple API for composing streaming topologies
    Next steps:
    ● Streaming Inference is available in 2.4: docs.ray.io
    ● Ray Train integration coming in 2.6
    ● We're hardening streaming to work robustly on clusters of 100+ nodes and
    10M+ input files. Contact us!
