
Data Processing on Ray (SangBin Cho, Anyscale)


Machine learning and data processing applications continue to drive the need to develop scalable Python applications. Ray is a distributed execution engine that enables programmers to scale up their Python applications. This talk will cover some of the challenges we faced and key architectural changes we made to Ray over the past year to support a new set of large scale data processing workloads.

Anyscale

July 15, 2021

Transcript

  1. What is this talk about? 01 The importance of general-purpose systems for ML infrastructure 02 An introduction to Ray and its previous limitations in data processing 03 Ray's architectural evolution as a data processing backend
  2. Who am I? • Software Engineer @ Anyscale • Ray Committer • Working on Ray for 1+ year • Current focus: data processing support on Ray
  3. Why do we need a general-purpose system for data processing? Motivation for supporting large-scale data processing on top of Ray
  4. Complexities in ML jobs: ML jobs need complex compositions of systems • Feature processing in ETL clusters • Load data into training clusters and train • Model tuning in separate tuning clusters (Diagram: ETL cluster (Spark), training cluster (Horovod), tuning cluster (Ray Tune), load & shuffle)
  5. What are the problems? • Job composition across multiple systems • Many different clusters • Not always efficient • Intermediate layers are necessary (e.g., Parquet files)
  6. What if there were a general-purpose system? • Different "types" of workload in a single system • Remove complex job dependencies by logically grouping jobs • One system: less maintenance burden, easier to debug • Optimization is possible: if the cluster has enough memory, the system can utilize it
  7. Why general-purpose systems? (Diagram: separate ETL cluster, training cluster, and tuning cluster connected by load & shuffle)
  8. Why general-purpose systems? (Diagram: the separate ETL, training, and tuning clusters with load & shuffle are replaced by ETL, training, and tuning libraries running on one general-purpose system)
  9. What is Ray? (in a nutshell) • A simple library for distributed computing • General purpose • An ecosystem of libraries • High performance
  10. Function:

      def read_array(file):
          # read ndarray "a" from "file"
          return a

      def add(a, b):
          return np.add(a, b)

      a = read_array(file1)
      b = read_array(file2)
      sum = add(a, b)

      Class:

      class Counter(object):
          def __init__(self):
              self.value = 0
          def inc(self):
              self.value += 1
              return self.value

      c = Counter()
      c.inc()
      c.inc()
  11. Function → Task:

      @ray.remote
      def read_array(file):
          # read ndarray "a" from "file"
          return a

      @ray.remote
      def add(a, b):
          return np.add(a, b)

      a = read_array(file1)
      b = read_array(file2)
      sum = add(a, b)

      Class → Actor:

      @ray.remote
      class Counter(object):
          def __init__(self):
              self.value = 0
          def inc(self):
              self.value += 1
              return self.value

      c = Counter()
      c.inc()
      c.inc()
  12. Function → Task:

      @ray.remote
      def read_array(file):
          # read ndarray "a" from "file"
          return a

      @ray.remote
      def add(a, b):
          return np.add(a, b)

      id1 = read_array.remote(file1)
      id2 = read_array.remote(file2)
      id = add.remote(id1, id2)
      sum = ray.get(id)

      Class → Actor:

      @ray.remote(num_gpus=1)
      class Counter(object):
          def __init__(self):
              self.value = 0
          def inc(self):
              self.value += 1
              return self.value

      c = Counter.remote()
      id4 = c.inc.remote()
      id5 = c.inc.remote()
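     As with tasks, the object refs returned by the actor's method calls are fetched with ray.get. A minimal continuation of the slide's snippet (the print line is an addition for illustration and assumes a GPU is available for the num_gpus=1 actor; it is not on the original slide):

     print(ray.get([id4, id5]))  # -> [1, 2]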
  13. Ray as a general-purpose system (Diagram: ETL cluster (Spark on Ray), training cluster (Horovod on Ray), tuning cluster (Ray Tune), load & shuffle (Hub))
  14. But how was Ray a year ago? • Very ML-focused, with many ML integrations such as Hugging Face and Horovod on Ray • Strong at ML workloads, but some features needed for data processing at scale were not supported
  15. But how was Ray a year ago? • Very ML-focused, with many ML integrations such as Hugging Face and Horovod on Ray • Strong at ML workloads, but some features needed for data processing at scale were not supported • But it has improved a lot over the last half year
  16. What's enabled by this? Better third-party integration with other data libraries • Dask on Ray demonstrated high performance at large scale (up to 4x cost savings) • First-class integration with data libraries like Mars
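     A minimal sketch of the Dask-on-Ray integration mentioned above, assuming the dask and ray packages are installed (the dataframe contents are placeholders):

     import dask
     import dask.dataframe as dd
     import pandas as pd
     import ray
     from ray.util.dask import ray_dask_get

     ray.init()
     # Route Dask's task-graph execution through Ray's scheduler.
     dask.config.set(scheduler=ray_dask_get)

     df = dd.from_pandas(
         pd.DataFrame({"x": range(1000), "y": range(1000)}), npartitions=8
     )
     print(df.x.sum().compute())  # executed as Ray tasks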
  17. What's enabled by this? Smoother data processing <-> ML interoperability • Spark on Ray and PyTorch / TF integration • XGBoost on Ray (training) + Modin / Dask on Ray (feature processing)
  18. What's enabled by this? Building an end-to-end ML pipeline in a single system • (Training) XGBoost on Ray • (Training) Horovod on Ray • (HP tuning) Ray Tune • (Data processing) Modin / Dask on Ray • (Data loading) Hub • (Serving) Ray Serve
  19. Example code: run 3 workloads in a single script • Modin / Dask dataframe (feature processing) • XGBoost on Ray (training) • Ray Tune (hyperparameter tuning)
  20. (Code walkthrough) Read the dataframe using Modin, pass the dataframe to a distributed dataset, then run training + hyperparameter tuning (see the sketch below)
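     A hedged sketch of the pattern this slide walks through, assuming the modin, xgboost_ray, and ray[tune] packages; the file path, label column, and hyperparameter range are placeholders, not taken from the original deck:

     import modin.pandas as pd
     from xgboost_ray import RayDMatrix, RayParams, train
     from ray import tune

     def train_model(config):
         df = pd.read_csv("train.csv")             # feature processing with Modin
         dtrain = RayDMatrix(df, label="target")   # distributed dataset for XGBoost on Ray
         train(
             params={"objective": "binary:logistic", "eta": config["eta"]},
             dtrain=dtrain,
             num_boost_round=10,
             ray_params=RayParams(num_actors=4),   # data-parallel training actors
         )

     # Hyperparameter tuning over the learning rate with Ray Tune.
     tune.run(train_model, config={"eta": tune.loguniform(1e-3, 1e-1)})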
  21. (Diagram) XGBoost on Ray: data-parallel training across actors, arbitrarily fine-grained partitioning, hyperparameter tuning
  22. Ray as a data processing backend: how has Ray evolved to support large-scale data processing?
  23. Types of data processing • ETL (Extract, Transform, Load) • ETL -> ML (data ingest) • Analytics (analyze data) • Stream processing • And others...
  24. What is Ray's short-term focus? • ETL (Extract, Transform, Load) • ETL -> ML (data ingest) • Analytics (analyze data) • Stream processing • In the short term, we have focused on supporting ML pre-processing workloads
  25. Requirements for a data processing backend • Seamless distributed execution • Robust distributed memory management
  26. What was supported by Ray before? • (+) Seamless distributed execution • (-) Robust distributed memory management
  27. What was supported by Ray before? (+) Seamless distributed execution • Simple and straightforward execution model / APIs • Scalability / fault tolerance • Distributed object store utilizing shared memory • High-performance decentralized scheduler
  28. What was supported by Ray before? (-) Robust distributed memory management • Scheduling didn't consider memory usage or locality information • Workloads failed when the data size exceeded memory capacity • The Ray cluster crashed or deadlocked under memory pressure
  29. So, what have we done? Focused on supporting robust distributed memory management. How? Make sure distributed shuffle works really well on top of Ray
  30. Distributed shuffle • A distributed dataset is usually stored in partitions, with each partition holding a group of rows • A shuffle is any operation over a dataset that requires redistributing data across its partitions
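     To make the definition concrete, here is a minimal sketch of a shuffle written with plain Ray tasks; it is an illustration under assumed partition counts and a modulo partitioning key, not Ray's internal implementation:

     import numpy as np
     import ray

     ray.init()
     NUM_PARTITIONS = 4

     @ray.remote(num_returns=NUM_PARTITIONS)
     def map_partition(rows):
         # Split one input partition into one block per output partition.
         return tuple(rows[rows % NUM_PARTITIONS == i] for i in range(NUM_PARTITIONS))

     @ray.remote
     def reduce_partition(*blocks):
         # Gather every block destined for this output partition.
         return np.concatenate(blocks)

     inputs = [np.random.randint(0, 100, 1000) for _ in range(NUM_PARTITIONS)]
     map_out = [map_partition.remote(part) for part in inputs]      # refs to map-side blocks
     shuffled = [
         reduce_partition.remote(*[m[i] for m in map_out])          # i-th block from every mapper
         for i in range(NUM_PARTITIONS)
     ]
     print([len(p) for p in ray.get(shuffled)])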
  31. Why focus on distributed shuffle? Distributed shuffle stresses a data processing system's memory management layer
  32. Ray architecture at a glance: the Plasma store • Built-in shared-memory-based distributed object store • Originally developed by Ray and contributed to the Arrow project • Now brought back into Ray for further optimization
  33. (Diagram) One Plasma store (shared-memory store) per machine (Machine A, Machine B)
  34. (Diagram) The Plasma store on each machine stores Ray objects in shared memory with zero-copy read support: object A is not copied into the memory of the tasks that read it
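     A small sketch of the zero-copy read described on this slide (array size is arbitrary): a NumPy array placed in the object store is mapped read-only into the reading worker's address space rather than copied into its heap.

     import numpy as np
     import ray

     ray.init()
     arr_ref = ray.put(np.zeros(10**7))   # stored once in shared memory (~80 MB)

     @ray.remote
     def head(a):
         # "a" is a read-only view backed by the shared-memory object store.
         return a[:5]

     print(ray.get(head.remote(arr_ref)))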
  35. (Diagram) Objects (e.g., object A) are pulled/pushed between the Plasma stores on Machine A and Machine B on demand
  36. Improvement 1: Scheduling. Locality-aware scheduling • The Ray scheduler calculates which machine already caches the largest share of a task's inputs • Minimizes how many objects are copied across nodes
  37. (Diagram) Without locality-aware scheduling, object A on Machine A needs to be copied to Machine B
  38. Improvement 1: Scheduling. Locality-aware scheduling • The Ray scheduler calculates which machine already caches the largest share of a task's inputs • Minimizes how many objects are copied across nodes
  39. Improvement 1: Scheduling. Memory-aware scheduling • Ray always tries to schedule tasks onto nodes with low memory usage
  40. (Diagram) Memory-aware scheduling: Machine A memory usage 30 GB, Machine B memory usage 20 GB
  41. (Diagram) Memory-aware scheduling: Machine A memory usage 30 GB, Machine B memory usage 20 GB; prefer the machine using less memory (Machine B)
  42. Improvement 2: Object spilling • Supports out-of-core data processing: processing data that is too large to fit into a machine's main memory • Spills Ray objects from the object store to external storage such as disk or S3 • To support distributed shuffle workloads, the system must tolerate memory usage beyond the object store's maximum capacity (see the configuration sketch below)
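     A hedged sketch of directing spilled objects to a local directory; the exact configuration keys have varied across Ray releases, so treat this as an assumption to verify against the docs for your version:

     import json
     import ray

     ray.init(
         object_store_memory=2 * 10**9,  # cap the shared-memory store at ~2 GB
         _system_config={
             "object_spilling_config": json.dumps(
                 {"type": "filesystem", "params": {"directory_path": "/tmp/ray_spill"}}
             )
         },
     )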
  43.-68. (Animation) Each node has worker slots and a shared-memory object store, backed by an external object store (disk, S3, etc.). During the map phase, workers create objects in the shared-memory object store; when the store fills up, objects are spilled to the external object store. During the reduce phase, spilled objects are restored from the external store back into the shared-memory object store as they are needed, and objects may again be spilled when memory fills up.
  69. Improvement 3: Robust memory management. Ray respects a hard limit for the distributed object store by • Detecting when there is memory pressure • Guaranteeing progress when applications run out of memory, e.g., by evicting unnecessary objects from the store or spilling objects • Using admission control when scheduling tasks to limit the total memory used
  70. Improvement 3: Robust memory management. Ray respects a hard limit for the distributed object store by • Detecting when there is memory pressure • Guaranteeing progress when applications run out of memory by taking actions, e.g., evicting unnecessary objects from the store or spilling objects (we will see an example of this below) • Using admission control when scheduling tasks to limit the total memory used
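     A small sketch of the behaviour referenced above, assuming object spilling is enabled (the default in recent Ray releases): the loop creates several times more object data than the capped store can hold, yet the script completes because Ray spills objects instead of crashing, and restores them transparently on access.

     import numpy as np
     import ray

     ray.init(object_store_memory=10**9)  # ~1 GB hard limit for the object store

     # ~4 GB of 80 MB objects: far more than the store can hold at once.
     refs = [ray.put(np.zeros(10**7)) for _ in range(50)]
     print(ray.get(refs[0]).shape)  # spilled objects are restored transparently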
  71. How did we solve the problem? • Ray's decentralized scheduler is now aware of the memory usage of task inputs • Ray doesn't schedule a task if its inputs would require more memory than the node has capacity for
  72. What is the current state of the art? Ray became a more viable data processing backend • (+) Seamless distributed execution • (+) Robust distributed memory management • After all these improvements, we recently succeeded in running a 100 TB distributed shuffle workload on top of Ray
  73. Takeaway: • Ray's ability to support data processing will reduce the complexity of ML jobs • There have been several improvements to make Ray a suitable data processing backend. More about Ray: Ray GitHub, Ray Discourse. Careers: Anyscale is hiring (anyscale.com). Thank you