
Data Processing on Ray (SangBin Cho, Anyscale)


Machine learning and data processing applications continue to drive the need to develop scalable Python applications. Ray is a distributed execution engine that enables programmers to scale up their Python applications. This talk will cover some of the challenges we faced and key architectural changes we made to Ray over the past year to support a new set of large scale data processing workloads.

Anyscale

July 15, 2021

Transcript

  1. What is this talk about? 01 The importance of general-purpose systems for ML infrastructure 02 An introduction to Ray and its previous limitations in data processing 03 Ray's architectural evolution as a data processing backend
  2. Who am I? • Software Engineer @ Anyscale • Ray Committer • Working on Ray for 1+ year • Current focus: data processing support on Ray
  3. Why do we need a general-purpose system for data processing? Motivation for supporting large-scale data processing on top of Ray
  4. Complexities in ML jobs: ML jobs need complex compositions of systems • Feature processing in ETL clusters • Load data into training clusters and train • Model tuning in separate tuning clusters (Diagram: ETL cluster (Spark), training cluster (Horovod), tuning cluster (Ray Tune), load & shuffle)
  5. What are the problems? • Job composition across multiple systems • Many different clusters • Not always efficient • Intermediate layers are necessary (e.g., Parquet files)
  6. What if there were a general-purpose system? • Different "types" of workload in a single system • Remove complex job dependencies by logically grouping jobs • One system: less maintenance burden, easier to debug • Optimization is possible: if the cluster has enough memory, the system can utilize it
  7. Why general-purpose systems? (Diagram: separate ETL cluster, training cluster, and tuning cluster connected by load & shuffle)
  8. Why general-purpose systems? (Diagram: the separate ETL, training, and tuning clusters with load & shuffle are replaced by ETL, training, and tuning libraries running on one general-purpose system)
  9. What is Ray? (in a nutshell) • A simple library for distributed computing • General purpose • An ecosystem of libraries • High performance
  10. Function:

      def read_array(file):
          # read ndarray "a" from "file"
          return a

      def add(a, b):
          return np.add(a, b)

      a = read_array(file1)
      b = read_array(file2)
      sum = add(a, b)

      Class:

      class Counter(object):
          def __init__(self):
              self.value = 0
          def inc(self):
              self.value += 1
              return self.value

      c = Counter()
      c.inc()
      c.inc()
  11. Function → Task:

      @ray.remote
      def read_array(file):
          # read ndarray "a" from "file"
          return a

      @ray.remote
      def add(a, b):
          return np.add(a, b)

      a = read_array(file1)
      b = read_array(file2)
      sum = add(a, b)

      Class → Actor:

      @ray.remote
      class Counter(object):
          def __init__(self):
              self.value = 0
          def inc(self):
              self.value += 1
              return self.value

      c = Counter()
      c.inc()
      c.inc()
  12. Function → Task:

      @ray.remote
      def read_array(file):
          # read ndarray "a" from "file"
          return a

      @ray.remote
      def add(a, b):
          return np.add(a, b)

      id1 = read_array.remote(file1)
      id2 = read_array.remote(file2)
      id = add.remote(id1, id2)
      sum = ray.get(id)

      Class → Actor:

      @ray.remote(num_gpus=1)
      class Counter(object):
          def __init__(self):
              self.value = 0
          def inc(self):
              self.value += 1
              return self.value

      c = Counter.remote()
      id4 = c.inc.remote()
      id5 = c.inc.remote()
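     As with tasks, the object refs returned by the actor's method calls are fetched with ray.get. A minimal continuation of the slide's snippet (the print line is an addition for illustration and assumes a GPU is available for the num_gpus=1 actor; it is not on the original slide):

     print(ray.get([id4, id5]))  # -> [1, 2]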
  13. Ray as a general-purpose system (Diagram: ETL cluster (Spark on Ray), training cluster (Horovod on Ray), tuning cluster (Ray Tune), load & shuffle (Hub))
  14. But how was Ray a year ago? • Very ML-focused, with many ML integrations such as Hugging Face and Horovod on Ray • Strong at ML workloads, but some features needed for data processing at scale were not supported
  15. But how was Ray a year ago? • Very ML-focused, with many ML integrations such as Hugging Face and Horovod on Ray • Strong at ML workloads, but some features needed for data processing at scale were not supported • But it has improved a lot over the last half year
  16. What's enabled by this? Better third-party integration with other data libraries • Dask on Ray demonstrated high performance at large scale (up to 4x cost savings) • First-class integration with data libraries like Mars
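     A minimal sketch of the Dask-on-Ray integration mentioned above, assuming the dask and ray packages are installed (the dataframe contents are placeholders):

     import dask
     import dask.dataframe as dd
     import pandas as pd
     import ray
     from ray.util.dask import ray_dask_get

     ray.init()
     # Route Dask's task-graph execution through Ray's scheduler.
     dask.config.set(scheduler=ray_dask_get)

     df = dd.from_pandas(
         pd.DataFrame({"x": range(1000), "y": range(1000)}), npartitions=8
     )
     print(df.x.sum().compute())  # executed as Ray tasks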
  17. What's enabled by this? Smoother data processing <-> ML interoperability • Spark on Ray and PyTorch / TF integration • XGBoost on Ray (training) + Modin / Dask on Ray (feature processing)
  18. What's enabled by this? Building an end-to-end ML pipeline in a single system • (Training) XGBoost on Ray • (Training) Horovod on Ray • (HP tuning) Ray Tune • (Data processing) Modin / Dask on Ray • (Data loading) Hub • (Serving) Ray Serve
  19. Example code: run 3 workloads in a single script • Modin / Dask dataframe (feature processing) • XGBoost on Ray (training) • Ray Tune (hyperparameter tuning)
  20. (Code walkthrough) Read the dataframe using Modin, pass the dataframe to a distributed dataset, then run training + hyperparameter tuning (see the sketch below)
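     A hedged sketch of the pattern this slide walks through, assuming the modin, xgboost_ray, and ray[tune] packages; the file path, label column, and hyperparameter range are placeholders, not taken from the original deck:

     import modin.pandas as pd
     from xgboost_ray import RayDMatrix, RayParams, train
     from ray import tune

     def train_model(config):
         df = pd.read_csv("train.csv")             # feature processing with Modin
         dtrain = RayDMatrix(df, label="target")   # distributed dataset for XGBoost on Ray
         train(
             params={"objective": "binary:logistic", "eta": config["eta"]},
             dtrain=dtrain,
             num_boost_round=10,
             ray_params=RayParams(num_actors=4),   # data-parallel training actors
         )

     # Hyperparameter tuning over the learning rate with Ray Tune.
     tune.run(train_model, config={"eta": tune.loguniform(1e-3, 1e-1)})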
  21. (Diagram) XGBoost on Ray: data-parallel training across actors, arbitrarily fine-grained partitioning, hyperparameter tuning
  22. Ray as a data processing backend: how has Ray evolved to support large-scale data processing?
  23. Types of data processing • ETL (Extract, Transform, Load) • ETL -> ML (data ingest) • Analytics (analyze data) • Stream processing • And others...
  24. What is Ray's short-term focus? • ETL (Extract, Transform, Load) • ETL -> ML (data ingest) • Analytics (analyze data) • Stream processing • In the short term, we have focused on supporting ML pre-processing workloads
  25. Requirements for a data processing backend • Seamless distributed execution • Robust distributed memory management
  26. What was supported by Ray before? • (+) Seamless distributed execution • (-) Robust distributed memory management
  27. What was supported by Ray before? (+) Seamless distributed execution • Simple and straightforward execution model / APIs • Scalability / fault tolerance • Distributed object store utilizing shared memory • High-performance decentralized scheduler
  28. What was supported by Ray before? (-) Robust distributed memory management • Scheduling didn't consider memory usage or locality information • Workloads failed when the data size exceeded memory capacity • The Ray cluster crashed or deadlocked under memory pressure
  29. So, what have we done? Focused on supporting robust distributed memory management. How? Make sure distributed shuffle works really well on top of Ray
  30. Distributed shuffle • A distributed dataset is usually stored in partitions, with each partition holding a group of rows • A shuffle is any operation over a dataset that requires redistributing data across its partitions
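     To make the definition concrete, here is a minimal sketch of a shuffle written with plain Ray tasks; it is an illustration under assumed partition counts and a modulo partitioning key, not Ray's internal implementation:

     import numpy as np
     import ray

     ray.init()
     NUM_PARTITIONS = 4

     @ray.remote(num_returns=NUM_PARTITIONS)
     def map_partition(rows):
         # Split one input partition into one block per output partition.
         return tuple(rows[rows % NUM_PARTITIONS == i] for i in range(NUM_PARTITIONS))

     @ray.remote
     def reduce_partition(*blocks):
         # Gather every block destined for this output partition.
         return np.concatenate(blocks)

     inputs = [np.random.randint(0, 100, 1000) for _ in range(NUM_PARTITIONS)]
     map_out = [map_partition.remote(part) for part in inputs]      # refs to map-side blocks
     shuffled = [
         reduce_partition.remote(*[m[i] for m in map_out])          # i-th block from every mapper
         for i in range(NUM_PARTITIONS)
     ]
     print([len(p) for p in ray.get(shuffled)])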
  31. Why focus on distributed shuffle? Distributed shuffle stresses a data processing system's memory management layer
  32. Ray architecture at a glance: the Plasma store • Built-in shared-memory-based distributed object store • Originally developed by Ray and contributed to the Arrow project • Now brought back into Ray for further optimization
  33. (Diagram) One Plasma store (shared-memory store) per machine (Machine A, Machine B)
  34. (Diagram) The Plasma store on each machine stores Ray objects in shared memory with zero-copy read support: object A is not copied into the memory of the tasks that read it
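     A small sketch of the zero-copy read described on this slide (array size is arbitrary): a NumPy array placed in the object store is mapped read-only into the reading worker's address space rather than copied into its heap.

     import numpy as np
     import ray

     ray.init()
     arr_ref = ray.put(np.zeros(10**7))   # stored once in shared memory (~80 MB)

     @ray.remote
     def head(a):
         # "a" is a read-only view backed by the shared-memory object store.
         return a[:5]

     print(ray.get(head.remote(arr_ref)))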
  35. (Diagram) Objects (e.g., object A) are pulled/pushed between the Plasma stores on Machine A and Machine B on demand
  36. Improvement 1: Scheduling. Locality-aware scheduling • The Ray scheduler calculates which machine already caches the largest share of a task's inputs • Minimizes how many objects are copied across nodes
  37. (Diagram) Without locality-aware scheduling, object A on Machine A needs to be copied to Machine B
  38. Improvement 1: Scheduling. Locality-aware scheduling • The Ray scheduler calculates which machine already caches the largest share of a task's inputs • Minimizes how many objects are copied across nodes
  39. Improvement 1: Scheduling. Memory-aware scheduling • Ray always tries to schedule tasks onto nodes with low memory usage
  40. (Diagram) Memory-aware scheduling: Machine A memory usage 30 GB, Machine B memory usage 20 GB
  41. (Diagram) Memory-aware scheduling: Machine A memory usage 30 GB, Machine B memory usage 20 GB; prefer the machine using less memory (Machine B)
  42. Improvement 2: Object spilling • Supports out-of-core data processing: processing data that is too large to fit into a machine's main memory • Spills Ray objects from the object store to external storage such as disk or S3 • To support distributed shuffle workloads, the system must tolerate memory usage beyond the object store's maximum capacity (see the configuration sketch below)
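     A hedged sketch of directing spilled objects to a local directory; the exact configuration keys have varied across Ray releases, so treat this as an assumption to verify against the docs for your version:

     import json
     import ray

     ray.init(
         object_store_memory=2 * 10**9,  # cap the shared-memory store at ~2 GB
         _system_config={
             "object_spilling_config": json.dumps(
                 {"type": "filesystem", "params": {"directory_path": "/tmp/ray_spill"}}
             )
         },
     )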
  43.-68. (Animation) Each node has worker slots and a shared-memory object store, backed by an external object store (disk, S3, etc.). During the map phase, workers create objects in the shared-memory object store; when the store fills up, objects are spilled to the external object store. During the reduce phase, spilled objects are restored from the external store back into the shared-memory object store as they are needed, and objects may again be spilled when memory fills up.
  69. Improvement 3: Robust memory management. Ray respects a hard limit for the distributed object store by • Detecting when there is memory pressure • Guaranteeing progress when applications run out of memory, e.g., by evicting unnecessary objects from the store or spilling objects • Using admission control when scheduling tasks to limit the total memory used
  70. Improvement 3: Robust memory management. Ray respects a hard limit for the distributed object store by • Detecting when there is memory pressure • Guaranteeing progress when applications run out of memory by taking actions, e.g., evicting unnecessary objects from the store or spilling objects (we will see an example of this below) • Using admission control when scheduling tasks to limit the total memory used
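     A small sketch of the behaviour referenced above, assuming object spilling is enabled (the default in recent Ray releases): the loop creates several times more object data than the capped store can hold, yet the script completes because Ray spills objects instead of crashing, and restores them transparently on access.

     import numpy as np
     import ray

     ray.init(object_store_memory=10**9)  # ~1 GB hard limit for the object store

     # ~4 GB of 80 MB objects: far more than the store can hold at once.
     refs = [ray.put(np.zeros(10**7)) for _ in range(50)]
     print(ray.get(refs[0]).shape)  # spilled objects are restored transparently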
  71. How did we solve the problem? • Ray's decentralized scheduler is now aware of the memory usage of task inputs • Ray doesn't schedule a task if its inputs would require more memory than the node has capacity for
  72. What is the current state of the art? Ray became a more viable data processing backend • (+) Seamless distributed execution • (+) Robust distributed memory management • After all these improvements, we recently succeeded in running a 100 TB distributed shuffle workload on top of Ray
  73. Takeaway: • Ray's ability to support data processing will reduce the complexity of ML jobs • There have been several improvements to make Ray a suitable data processing backend. More about Ray: Ray GitHub, Ray Discourse. Careers: Anyscale is hiring (anyscale.com). Thank you