Ray Community Meetup Talks

Data transfer speed comparison in a distributed ML application: Ray Plasma store vs. S3

Anyscale
September 29, 2022
Transcript

  1. Data Transfer Speed on Ray - Ankur Mohan

  2. Outline
     Objective: Compare data transfer speed from a distributed data store (e.g., S3) vs. the Ray plasma store. We will also cover setting up a RayCluster on Kubernetes.
     Use case: Downloading and scattering data to compute nodes is a frequent operation in any data processing or ML workflow and must be efficient to benefit from parallelization.
     Agenda:
     - Test application demo
     - Test application system architecture
     - Ray concepts
     - Analysis of data transfer latency for several scenarios
     - Conclusion
  3. Test Application workflow Sample audio

  4. None
  5. None
  6. System Architecture

  7. Ray Concepts
     - Tasks: Arbitrary functions to be executed asynchronously on separate Python workers.
     - Actors: A stateful task. When a new actor is instantiated, a new process is created. Methods of the actor are scheduled on that process and can access and mutate the actor's state.
     - Tasks and actors create and compute on objects. These objects can be located anywhere on the cluster, hence they are called remote objects. Remote objects are referred to by object refs.
     - Remote objects are cached in Ray's distributed object store, also called the plasma store.
     - There is one object store per Ray node; there is no "global" object store. If an actor or task running on a node needs a piece of data that is not located in that node's object store, the data needs to be replicated from where it is located to where it is needed (see the sketch below).
     Task
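     A minimal sketch of these concepts using the standard Ray API (illustrative, not taken from the deck): a task, an actor, and remote objects referenced by ObjectRefs.

        import ray

        ray.init()

        # Task: an arbitrary function executed asynchronously on a worker process
        @ray.remote
        def preprocess(x):
            return x * 2

        # Actor: a stateful worker; its methods run in the actor's own process
        @ray.remote
        class Counter:
            def __init__(self):
                self.count = 0

            def incr(self):
                self.count += 1
                return self.count

        # Remote objects live in the node-local plasma store and are referred to
        # by ObjectRefs; ray.get() fetches (and, if needed, replicates) the value
        obj_ref = ray.put([1, 2, 3])        # place an object in the local object store
        result_ref = preprocess.remote(21)  # returns an ObjectRef immediately
        counter = Counter.remote()
        print(ray.get([result_ref, counter.incr.remote()]))  # -> [42, 1]
        print(ray.get(obj_ref))                              # -> [1, 2, 3]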
  8. Now let's apply these ideas to downloading the model in the speech2text application
     Object Replication from Plasma Store | Object Download from S3
  9. Object Store Replication from Plasma Store
     Scenario | Objective
     Case 1 (base case): Download Actor and Process Actor on different pods on different instances | Cross-instance data transfer latency
     Case 2: One Download Actor and multiple Process Actors, each on different instances | Cross-instance data transfer latency + object store read parallelism
     Case 3: Download Actor and Process Actors on different pods on the same instance | Within-instance object store read parallelism
     Case 4: Download Actor and Process Actors on the same pod | Inter-process data transfer; no object store replication involved; higher data + compute locality
  10. Case 1 (base): Download and Process actors on different nodes
      # Process Actors | Data Transfer Time (sec)
      1 | 4.0
      Data transfer rate = 563 MB / 4.0 s ≈ 140.75 MB/sec
  11. Case 2: Multiple process actors on separate instances
      • Here, there is a single download actor that downloads the model from S3 and transfers the model data to the object store of that pod. Then, the data is replicated to multiple process actors running on separate EC2 instances
      • This scenario measures data transfer across EC2 instances + parallel data replication
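      A rough sketch of this download-actor / process-actor pattern (actor and method names are hypothetical; the real application downloads a ~563 MB model from S3):

        import ray

        ray.init()

        @ray.remote
        class DownloadActor:
            def fetch_model(self):
                # Hypothetical stand-in for the S3 download. The large return value
                # is stored in this node's plasma store and an ObjectRef is returned.
                return b"\x00" * (16 * 1024 * 1024)  # placeholder; real model is ~563 MB

        @ray.remote
        class ProcessActor:
            def load(self, model_bytes):
                # If this actor runs on another node, Ray replicates the object from
                # the downloader's plasma store to this node's plasma store first.
                return len(model_bytes)

        downloader = DownloadActor.remote()
        model_ref = downloader.fetch_model.remote()   # single download from S3

        processors = [ProcessActor.remote() for _ in range(5)]
        # Passing the ObjectRef (not the bytes) lets each actor pull the object from
        # the object store; the driver never materializes the model itself.
        sizes = ray.get([p.load.remote(model_ref) for p in processors])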
  12. Case 2: Multiple process actors on separate instances (Ray 2.0.0)
      Data transfer rate = 563 MB × 5 / 10.5 s ≈ 268 MB/sec
  13. Case 2: Multiple process actors on separate instances (Ray < 2.0.0)
      # Process actors | Data transfer time (sec) | Data transfer BW (MB/sec)
      1 | 5.8 | 97
      2 | 10.8 | 104
      3 | 15.23 | 110
      4 | 19.79 | 113
      5 | 25.8 | 109
      This graph shows the increase in data transfer time as the number of process actors is increased from 1 to 5, with each process actor scheduled on a separate EC2 instance, for the previous version of Ray. The increase in data transfer time is nearly linear, i.e., replication to multiple actors was essentially serial. Compared with the ~10.5 sec that Ray 2.0.0 takes for 5 actors, this implies that parallel data replication has been significantly optimized in Ray 2.0.0!
  14. Case 3: Download Actor and Process Actors on different pods on the same instance
      # Process actors | Data transfer time (sec) | Data transfer BW (MB/sec)
      1 | 3.24 | 173
      2 | 3.60 | 312
      kubectl get pods -o=wide -n ray
  15. Case 4: Download and Process actors on the same pod
      # Process actors | Data transfer time (sec) | Data transfer BW (MB/sec)
      1 | 2.2 | 255
      2 | 2.4 | 469
      Object store not involved!
  16. Aside: Discussion about Fred Reiss's talk
      This talk: object store replication + object store to process memory
  17. Object Download from S3
      Scenario | Objective
      Case 1: Single Download + Process actor | Base case
      Case 2: Multiple actors, each on a separate EC2 instance | S3 read parallelism
      Case 3: Multiple actors on the same EC2 instance | Per-instance S3 read parallelism
      Case 4: Multiple actors on the same pod | Inter-process data transfer
  18. Case 1 + 2: Downloading from S3, multiple actors
      # Num Actors | Download time (sec)
      1 | 5.4
      2 | 5.4
      3 | 5.4
      4 | 5.6
      Object store not utilized!
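      A hedged sketch of the direct-from-S3 pattern measured here: every actor performs its own S3 read into its process memory, bypassing the plasma store (bucket and key names are hypothetical; assumes boto3 and AWS credentials are available):

        import ray
        import boto3

        ray.init()

        @ray.remote
        class S3DownloadActor:
            def __init__(self, bucket, key):
                self.bucket = bucket
                self.key = key

            def download(self, dest="/tmp/model.bin"):
                # Each actor opens its own S3 connection and downloads the object
                # straight into its own process/disk; the plasma store is not used.
                s3 = boto3.client("s3")
                s3.download_file(self.bucket, self.key, dest)
                return dest

        # Hypothetical bucket/key; each actor performs an independent S3 read
        actors = [S3DownloadActor.remote("my-model-bucket", "speech2text/model.bin")
                  for _ in range(4)]
        paths = ray.get([a.download.remote() for a in actors])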
  19. Case 3: Downloading from S3, multiple actors on the same instance
      Network I/O bound when multiple actors are scheduled on the same EC2 instance
      Data transfer rate (#actors = 5) = 5 × 563 MB / 12 s ≈ 234 MB/sec
      Data transfer rate (#actors = 2) = 2 × 563 MB / 6.5 s ≈ 173 MB/sec
      Ray driver output showing actors starting on separate pods
      kubectl -n ray get pods -o=wide
      All pods are scheduled on the same EC2 instance
  20. Case 4: Downloading from S3, multiple actors on the same pod
      Notice: Data is copied directly to process memory, not via the object store - similar to actors on different pods
  21. Scheduling tools and techniques
      • A maximum of two actors running f1 can be scheduled on this pod/node
      • Need to ensure the pod/node has sufficient CPU/memory allocation
      • Force pods onto a particular instance or group of instances: nodeName, nodeAffinity, taints/tolerations
        ◦ Need a combination of nodeName + taints/tolerations to ensure only certain pods are scheduled on certain instances, and nothing else
      • Force actors onto the same pod: actor resources
  22. Scheduling tools and techniques
      • Can also specify resources (CPU/memory/GPUs, custom resources) in the task/actor definition. This ensures the task/actor is only scheduled on a node/pod that provides those (or more) resources (see the sketch below)
      • Ray also offers placement groups, which allow users to atomically reserve groups of resources across multiple nodes. These can then be used to schedule Ray tasks and actors packed as close as possible for locality (PACK) or spread apart (SPREAD). Placement groups are generally used for gang-scheduling actors, but also support tasks.
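      A sketch of both mechanisms using the Ray 2.x API (resource figures are illustrative): a resource hint on an actor definition, and a PACK placement group whose bundles are reserved before actors are scheduled into them.

        import ray
        from ray.util.placement_group import placement_group
        from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

        ray.init()

        # Resource hint: only schedule this actor on a node/pod that can provide
        # at least 1 CPU and ~2 GiB of memory
        @ray.remote(num_cpus=1, memory=2 * 1024**3)
        class ProcessActor:
            def process(self, data):
                return len(data)

        # Atomically reserve two bundles (1 CPU + 2 GiB each), packed on as few
        # nodes as possible for locality (use strategy="SPREAD" to spread them out)
        pg = placement_group([{"CPU": 1, "memory": 2 * 1024**3}] * 2, strategy="PACK")
        ray.get(pg.ready())  # block until the reservation is granted

        actors = [
            ProcessActor.options(
                scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
            ).remote()
            for _ in range(2)
        ]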
  23. Conclusions
      • For mapped tasks with a medium (< 10) mapping index, object store replication is more efficient. This is because actors can be scheduled on a small group of EC2 instances, and the object store supports a moderate level of read parallelism
        ◦ Major improvement in parallel read BW in Ray 2.0!
        ◦ May eliminate the need for setting up NFS/minio/multi-mount EBS volumes
        ◦ Data stays local to the cluster
      • For tasks with a high mapping index, parallel reads from S3 may be more efficient, because S3 offers great read parallelism
        ◦ However, for small data sizes, object store replication is probably sufficient
        ◦ Need to consider the latency of obtaining S3 read credentials
      • It helps to locate the data source and data consumer close to each other: on the same EC2 instance, or on the same pod, if feasible
      • Must anticipate peak CPU/memory use and schedule actors/tasks on pods with sufficient resources, otherwise the actor/task will crash. Ray offers several methods to achieve this
  24. Thanks!

  25. Appendix

  26. RayCluster

  27. Passing Data
      • Lazy loading of data - avoids unnecessary complexity
      • Functions are allocated to separate pods (not necessarily separate instances!)
  28. Data transfer when actors are on the same pod, but on a different pod from the source
  29. Data transfer when pods are on the same EC2 instance

  30. When source and two actors are on the same pod

  31. Giving the scheduler free rein
      # Actors | Data load time (sec)
      5 | 19.01

  32. Ray Head and Worker
      Each is a collection of services
      k exec -n ray example-cluster-ray-head-49rl9 -it /bin/bash
  33. Service | Language | Head/Worker | Function
      log_monitor.py | Python | Worker, Head | Process for monitoring Ray log files
      dashboard/agent.py | Python | Worker, Head | A server that exposes endpoints for collecting metrics such as CPU/Mem utilization
      raylet | C++ (ray-project/ray/blob/master/src/ray/raylet/raylet.cc) | Worker, Head | Consists of the node manager (aka scheduler?) and object manager, which are services listening on certain (configurable) ports and responsible for scheduling remote task execution and transferring data across nodes
      gcs_server | C++ (ray-project/ray/gcs/gcs_server/gcs_server_main.cc) | Head | Server that exposes the global control store, a storage for metadata about actors, nodes, jobs, resources, placement groups, etc. See this1 and this2
      client.server | Python | Head | Server that enables remote clients to connect to a Ray cluster to run programs
      dashboard | Python | Head | Exposes a UI that shows cluster-wide state such as per-node logs, actors/tasks running on each node, CPU/Mem utilization, etc.
  34. Actors
      • Actors map 1:1 to a process. Processes are scheduled depending on resource (CPU cores) availability
      • Actors can have state
      • Actors can be running a long/short-running method (e.g., a spin loop) or be in an idle state
      • Actors support resource hints that ensure they are only scheduled on a worker pod with enough resources, e.g., CPU/memory
        ◦ If an actor consumes more memory than the host pod's resource limits, the actor will crash. It is the user's responsibility to estimate peak resource consumption and provide scheduling hints so actors are scheduled on appropriate host pods
        ◦ The accelerator_type field can be used to force an actor to be scheduled on a node with that particular accelerator type available - e.g., a certain type of GPU (see the sketch below)
        ◦ It seems to be easy to get into a deadlock by forcing the hand of Ray's scheduler too much!
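      For example, a hedged sketch of the accelerator_type hint (constant names come from ray.util.accelerators; the GPU type and memory figure are illustrative):

        import ray
        from ray.util.accelerators import NVIDIA_TESLA_V100

        # Only schedule this actor on a node that offers a GPU of the given
        # accelerator type; the memory hint keeps it off undersized pods
        @ray.remote(num_gpus=1, accelerator_type=NVIDIA_TESLA_V100, memory=8 * 1024**3)
        class GpuProcessActor:
            def ping(self):
                return "scheduled on a V100 node"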
  35. Node Manager and Object Manager services are running on the

    Raylet process Note - :: is the short-form of 0:0:0:0:0:0:0:0, which is the equivalent address of 0.0.0.0 in IPv6
  36.-44. Remote Task Execution (sequence of diagram slides)

  45. Ray Concepts
      - Tasks: Arbitrary functions to be executed asynchronously on separate Python workers
      - Actors: A stateful worker. When a new actor is instantiated, a new worker (process) is created, and methods of the actor are scheduled on that specific worker and can access and mutate its state
      - Actors can be running a long/short-running method (e.g., a spin loop) or be in an idle state
      - Actors and tasks support resource hints that ensure they are only scheduled on a worker pod with enough resources, e.g., CPU/memory
      - If an actor consumes more memory than the host pod's resource limits, the actor will crash. It is the user's responsibility to estimate peak resource consumption and provide scheduling hints so actors are scheduled on appropriate host pods
      - Tasks and actors create and compute on objects. These objects can be located anywhere on the cluster, hence they are called remote objects. Remote objects are referred to by object refs
      Task | Actor with spin loop | Actor without spin loop
  46. Passing Data • Base case Average execution time: ~4.5 sec

  47. Passing Data • Within-node (instance) data transfer Average execution time:

    ~10 sec
  48. Passing Data • Across-node (instance) data transfer Average execution time:

    ~11 sec
  49. Passing Data • Dereferencing remote objects too early Average execution

    time: ~35 sec
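      A sketch of why this pattern is slow (sizes are illustrative): calling ray.get() early pulls every object into the driver, which then re-ships the data when it is passed back out, whereas passing the ObjectRefs keeps the transfer worker-to-worker.

        import ray
        import numpy as np

        ray.init()

        @ray.remote
        def load_chunk(i):
            return np.zeros(50_000_000 // 8)   # ~50 MB per chunk

        @ray.remote
        def consume(chunk):
            return chunk.sum()

        refs = [load_chunk.remote(i) for i in range(4)]

        # Slow: ray.get() materializes every chunk in the driver, and each chunk is
        # then serialized again when passed to consume()
        chunks = ray.get(refs)
        slow = ray.get([consume.remote(c) for c in chunks])

        # Faster: pass the ObjectRefs through; data moves directly between object
        # stores (or not at all, if producer and consumer share a node)
        fast = ray.get([consume.remote(r) for r in refs])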
  50. We’ll be focussing on the data loading part Case 1

    Case 2
  51. Ray Kubernetes Cluster
      - Tasks
      - Actors
      - Ray nodes <-> k8s pods
      - RayCluster
        - Head node (pod)
        - Several worker nodes (pods)
        - Service (ClusterIP/NodePort/LoadBalancer)
      - Actor/Task -> (1:1) -> Process -> (n:1) -> node (pod)
      - The RayCluster CRD describes the desired cluster state. The Ray-Kubernetes operator adjusts the cluster's current state to track the desired state
  52. Ray System Architecture
      - Global Control Store: list of nodes, actors, tasks; client workflow zip (similar to Prefect pickle?)
      - No shared object store at the cluster level!
      - Shared memory across actors means large objects can be efficiently shared across actors on a single node. This implies that scheduling actors on a single node can be more efficient than scheduling actors on different nodes (see the sketch below)
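      An illustrative sketch of this single-node sharing (standard Ray behavior for numpy arrays): workers on the same node read the array zero-copy out of shared plasma memory.

        import ray
        import numpy as np

        ray.init()

        # A large numpy array placed in the plasma store can be read zero-copy by
        # every worker on the same node: they all map the same shared memory
        array_ref = ray.put(np.zeros(10_000_000))

        @ray.remote
        def mean(arr):
            return float(arr.mean())   # arr is a read-only view backed by shared memory

        print(ray.get([mean.remote(array_ref) for _ in range(4)]))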