Slide 1

Slide 1 text

Data Transfer Speed on Ray Ankur Mohan

Slide 2

Slide 2 text

Outline
Objective: Discuss data transfer speed from a distributed data store (e.g., S3) vs. the Ray plasma store. We will also learn how to set up a RayCluster on Kubernetes.
Use case: Downloading and scattering data to compute nodes is a frequent operation in any data processing or ML workflow and must be efficient to benefit from parallelization.
Agenda:
- Test application demo
- Test application system architecture
- Ray concepts
- Analysis of data transfer latency for several scenarios
- Conclusion

Slide 3

Slide 3 text

Test Application workflow Sample audio

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

System Architecture

Slide 7

Slide 7 text

Ray Concepts
- Tasks: Arbitrary functions to be executed asynchronously on separate Python workers
- Actors: A stateful task. When a new actor is instantiated, a new process is created. Methods of the actor are scheduled on that process and can access and mutate the actor's state.
- Tasks and actors create and compute on objects. These objects can be located anywhere on the cluster, hence they are called remote objects. Remote objects are referred to by object refs.
- Remote objects are cached in Ray's distributed object store, also called the plasma store.
- There is one object store per Ray node; there is no "global" object store. If an actor or task running on a node needs a piece of data that is not in that node's object store, the data must be replicated from wherever it is located to where it is needed.
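As a refresher, a minimal sketch of these concepts in Python (names are illustrative and not part of the test application):

import ray

ray.init()  # start or connect to a Ray cluster

@ray.remote
def double(x):
    # a task: an arbitrary function executed asynchronously on a worker
    return x * 2

@ray.remote
class Counter:
    # an actor: a stateful worker; its methods run on the actor's process
    def __init__(self):
        self.count = 0
    def increment(self):
        self.count += 1
        return self.count

# ray.put() writes an object into the local node's plasma store and
# returns an ObjectRef pointing at it
big_ref = ray.put(list(range(1_000_000)))

# Passing an ObjectRef to a task: Ray replicates the underlying object to
# the node where the task runs, if it is not already there
doubled_ref = double.remote(big_ref)

counter = Counter.remote()
count_ref = counter.increment.remote()
print(len(ray.get(doubled_ref)), ray.get(count_ref))  # 2000000 1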

Slide 8

Slide 8 text

Now let's apply these ideas to downloading the model in the speech2text application. Two approaches: Object Replication from the Plasma Store, and Object Download from S3.

Slide 9

Slide 9 text

Object Replication from Plasma Store: scenarios and objectives
- Case 1 (base case): Download Actor and Process Actor on different pods on different instances. Objective: cross-instance data transfer latency.
- Case 2: One Download Actor and multiple Process Actors, each on a different instance. Objective: cross-instance data transfer latency + object store read parallelism.
- Case 3: Download Actor and Process Actors on different pods on the same instance. Objective: object store read parallelism within an instance.
- Case 4: Download Actor and Process Actors on the same pod. Objective: inter-process data transfer; no object store replication is involved; higher data + compute locality.

Slide 10

Slide 10 text

Case 1 (base): Download and Process actors on different nodes
# Process Actors | Data Transfer Time (sec)
1 | 4.0
Data transfer rate = 563 MB / 4.0 sec ~ 140.75 MB/sec

Slide 11

Slide 11 text

Case 2: Multiple process actors on separate instances
● Here, there is a single download actor that downloads the model from S3 and transfers the model data to the object store of that pod. The data is then replicated to multiple process actors running on separate EC2 instances.
● This scenario measures data transfer across EC2 instances + parallel data replication.
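A rough sketch of this pattern, assuming boto3 for the S3 read; the actor names, bucket, and key below are illustrative, not the actual test code:

import ray
import boto3

ray.init()

@ray.remote
class DownloadActor:
    def download(self, bucket, key):
        # Fetch the model once from S3. The (large) return value is stored
        # in this node's plasma store; the caller only holds an ObjectRef.
        obj = boto3.client("s3").get_object(Bucket=bucket, Key=key)
        return obj["Body"].read()

@ray.remote
class ProcessActor:
    def load(self, model_bytes):
        # Ray resolves the ObjectRef argument before invoking this method,
        # replicating the object to this node's plasma store if it is not
        # already local.
        return len(model_bytes)

downloader = DownloadActor.remote()
model_ref = downloader.download.remote("my-bucket", "speech2text/model.bin")

process_actors = [ProcessActor.remote() for _ in range(5)]
sizes = ray.get([p.load.remote(model_ref) for p in process_actors])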

Slide 12

Slide 12 text

Case 2: Multiple process actors on separate instances (Ray 2.0.0)
Data transfer rate = 563 MB * 5 / 10.5 sec ~ 268 MB/sec

Slide 13

Slide 13 text

Case 2: Multiple process actors on separate instances (Ray < 2.0.0)
# Process actors | Data transfer time (sec) | Data transfer BW (MB/sec)
1 | 5.8 | 97
2 | 10.8 | 104
3 | 15.23 | 110
4 | 19.79 | 113
5 | 25.8 | 109
This graph shows the increase in data transfer time as the number of process actors is increased from 1 to 5, with each process actor scheduled on a separate EC2 instance, for the previous version of Ray. The increase in data transfer time is nearly linear, implying that parallel data replication has been significantly optimized in Ray 2.0.0!

Slide 14

Slide 14 text

Case 3: Download Actor and Process Actors on different pods on the same instance
# Process actors | Data transfer time (sec) | Data transfer BW (MB/sec)
1 | 3.24 | 173
2 | 3.60 | 312
kubectl get pods -o=wide -n ray

Slide 15

Slide 15 text

Case 4: Download and Process actors on the same pod
# Process actors | Data transfer time (sec) | Data transfer BW (MB/sec)
1 | 2.2 | 255
2 | 2.4 | 469
Object store not involved!

Slide 16

Slide 16 text

Aside: Discussion about Fred Reiss's talk
Fred Reiss's talk
This talk: object store replication + object store to process memory

Slide 17

Slide 17 text

Object Download from S3: scenarios and objectives
- Case 1: Single Download + Process actor. Objective: base case.
- Case 2: Multiple actors, each on a separate EC2 instance. Objective: S3 read parallelism.
- Case 3: Multiple actors on the same EC2 instance. Objective: per-instance S3 read parallelism.
- Case 4: Multiple actors on the same pod. Objective: inter-process data transfer.
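In these cases each actor reads the model from S3 on its own. A minimal sketch, again assuming boto3 and an illustrative bucket/key:

import ray
import boto3

ray.init()

@ray.remote
class S3DownloadActor:
    def fetch(self, bucket, key):
        # Each actor reads the model straight from S3 into its own process
        # memory; the plasma store is not involved at all.
        obj = boto3.client("s3").get_object(Bucket=bucket, Key=key)
        self.model_bytes = obj["Body"].read()
        return len(self.model_bytes)

actors = [S3DownloadActor.remote() for _ in range(4)]
sizes = ray.get([a.fetch.remote("my-bucket", "speech2text/model.bin") for a in actors])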

Slide 18

Slide 18 text

Case 1 + 2: Downloading from S3, multiple actors
# Actors | Download time (sec)
1 | 5.4
2 | 5.4
3 | 5.4
4 | 5.6
Object store not utilized!

Slide 19

Slide 19 text

Case 3: Downloading from S3, multiple actors on the same instance
Network I/O bound when multiple actors are scheduled on the same EC2 instance
Data transfer rate (#actors = 2) = 2 * 563 / 6.5 ~ 173 MB/sec
Data transfer rate (#actors = 5) = 5 * 563 / 12 ~ 234 MB/sec
Ray driver output showing actors starting on separate pods
kubectl -n ray get pods -o=wide
All pods are scheduled on the same EC2 instance

Slide 20

Slide 20 text

Case 4: Downloading from S3, multiple actors on the same pod
Notice: data is copied directly to process memory, not via the object store. Similar to actors on different pods.

Slide 21

Slide 21 text

Scheduling tools and techniques
● A maximum of two actors running f1 can be scheduled on this pod/node
● Need to ensure the pod/node has sufficient CPU/memory allocation
● Force pods onto a particular instance or group of instances: nodeName, nodeAffinity, taints/tolerations
○ A combination of nodeName + taints/tolerations is needed to ensure that only certain pods are scheduled on certain instances, and no others
● Force actors onto the same pod: actor resources (see the sketch below)
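On the Ray side, one way to pin actors to a particular pod is a custom resource advertised by that pod (e.g., via its ray start --resources parameter). The resource name download_pod and its capacity of 2 below are assumptions for illustration:

import ray

ray.init()  # assumes a pod was started with --resources='{"download_pod": 2}'

@ray.remote(num_cpus=1, resources={"download_pod": 1})
class F1Actor:
    def f1(self):
        return "running on the download pod"

a1 = F1Actor.remote()
a2 = F1Actor.remote()  # still fits: 2 units of "download_pod" are available
# A third F1Actor.remote() would remain pending until a1 or a2 exits.
print(ray.get([a1.f1.remote(), a2.f1.remote()]))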

Slide 22

Slide 22 text

Scheduling tools and techniques
● Can also specify resources (CPU/memory/GPUs, custom resources) in the task/actor definition. This ensures the task/actor is only scheduled on a node/pod that provides those (or more) resources.
● Ray also offers placement groups, which allow users to atomically reserve groups of resources across multiple nodes. These can then be used to schedule Ray tasks and actors packed as close together as possible for locality (PACK) or spread apart (SPREAD). Placement groups are generally used for gang-scheduling actors, but also support tasks (see the sketch below).
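A sketch of the placement group API (Ray 2.x style imports; the bundle sizes and actor are illustrative):

import socket
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# Reserve 4 bundles of 1 CPU each. PACK places the bundles on as few nodes
# as possible (locality); SPREAD would spread them across nodes.
pg = placement_group([{"CPU": 1}] * 4, strategy="PACK")
ray.get(pg.ready())  # block until the reservation is granted

@ray.remote(num_cpus=1)
class Worker:
    def where(self):
        return socket.gethostname()

workers = [
    Worker.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
    ).remote()
    for _ in range(4)
]
print(ray.get([w.where.remote() for w in workers]))  # hostnames of the hosting nodes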

Slide 23

Slide 23 text

Conclusions
● For mapped tasks with a medium (< 10) mapping index, object store replication is more efficient. This is because actors can be scheduled on a small group of EC2 instances and the object store supports a moderate level of read parallelism.
○ Major improvement in parallel read BW in Ray 2.0!
○ May eliminate the need for setting up NFS/minio/multi-mount EBS volumes
○ Data stays local to the cluster
● For tasks with a high mapping index, parallel reads from S3 may be more efficient, because S3 offers great read parallelism.
○ However, for small data sizes, object store replication is probably sufficient
○ Need to consider the latency of obtaining S3 read credentials
● It helps to locate the data source and data consumer close to each other: on the same EC2 instance, or on the same pod, if feasible.
● Must anticipate peak CPU/memory use and schedule actors/tasks on pods with sufficient resources, otherwise the actor/task will crash. Ray offers several methods to achieve this.

Slide 24

Slide 24 text

Thanks!

Slide 25

Slide 25 text

Appendix

Slide 26

Slide 26 text

RayCluster

Slide 27

Slide 27 text

Passing Data
● Lazy loading of data: avoid unnecessary complexity
● Functions are allocated to separate pods (not necessarily separate instances!)

Slide 28

Slide 28 text

Data transfer when actors are on the same pod, but a different pod from the source

Slide 29

Slide 29 text

Data transfer when pods are on the same EC2 instance

Slide 30

Slide 30 text

When source and two actors are on the same pod

Slide 31

Slide 31 text

Giving the scheduler free rein
# Process actors | Data load time (sec)
5 | 19.01

Slide 32

Slide 32 text

Ray Head and Worker: each is a collection of services
kubectl exec -n ray example-cluster-ray-head-49rl9 -it -- /bin/bash

Slide 33

Slide 33 text

Service | Language | Head/Worker | Function
log_monitor.py | Python | Worker, Head | Process for monitoring Ray log files
dashboard/agent.py | Python | Worker, Head | A server that exposes endpoints for collecting metrics such as CPU/memory utilization
raylet | C++ (ray-project/ray/blob/master/src/ray/raylet/raylet.cc) | Worker, Head | Consists of the node manager (aka scheduler?) and object manager, which are services listening on certain (configurable) ports and responsible for scheduling remote task execution and transferring data across nodes
gcs_server | C++ (ray-project/ray/gcs/gcs_server/gcs_server_main.cc) | Head | Server that exposes the global control store, a storage for metadata about actors, nodes, jobs, resources, placement groups, etc. See this1 and this2.
client.server | Python | Head | Server that enables remote clients to connect to a Ray cluster to run programs
dashboard | Python | Head | Exposes a UI that shows cluster-wide state such as per-node logs, actors/tasks running on each node, CPU/memory utilization, etc.

Slide 34

Slide 34 text

Actors
● Actors map 1:1 to a process. Processes are scheduled depending on resource (CPU cores) availability.
● Actors can have state.
● Actors can be running a long/short-running method (e.g., a spin loop) or be in an idle state.
● Actors support resource hints that ensure they are only scheduled on a worker pod with enough resources, e.g., CPU/memory (see the sketch below).
○ If an actor consumes more memory than the host pod's resource limits, the actor will crash. It is the user's responsibility to estimate peak resource consumption and provide scheduling hints so actors are scheduled on appropriate host pods.
○ The accelerator_type field can be used to force an actor to be scheduled on a node with that particular accelerator type available, e.g., a certain type of GPU.
○ It seems easy to get into a deadlock by forcing Ray's scheduler's hand too much!
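A sketch of how such resource hints look in code; the CPU/memory values and the V100 accelerator type are illustrative:

import ray
from ray.util.accelerators import NVIDIA_TESLA_V100

ray.init()

@ray.remote(num_cpus=2, memory=4 * 1024**3)  # 2 CPUs and ~4 GiB of memory
class ProcessActor:
    def run(self):
        return "scheduled on a pod with enough CPU and memory"

@ray.remote(num_gpus=1, accelerator_type=NVIDIA_TESLA_V100)
class GpuActor:
    def run(self):
        return "scheduled only on a node with a V100 GPU"

# Hints can also be overridden per instance via .options()
small = ProcessActor.options(num_cpus=1, memory=1 * 1024**3).remote()
print(ray.get(small.run.remote()))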

Slide 35

Slide 35 text

Node Manager and Object Manager services run in the raylet process
Note: :: is the short form of 0:0:0:0:0:0:0:0, the IPv6 equivalent of 0.0.0.0

Slide 36

Slide 36 text

Remote Task Execution

Slide 37

Slide 37 text

Remote Task Execution

Slide 38

Slide 38 text

Remote Task Execution

Slide 39

Slide 39 text

Remote Task Execution

Slide 40

Slide 40 text

Remote Task Execution

Slide 41

Slide 41 text

Remote Task Execution

Slide 42

Slide 42 text

Remote Task Execution

Slide 43

Slide 43 text

Remote Task Execution

Slide 44

Slide 44 text

Remote Task Execution

Slide 45

Slide 45 text

Ray Concepts
- Tasks: Arbitrary functions to be executed asynchronously on separate Python workers
- Actors: A stateful worker. When a new actor is instantiated, a new worker (process) is created, and methods of the actor are scheduled on that specific worker and can access and mutate the state of that worker.
- Actors can be running a long/short-running method (e.g., a spin loop) or be in an idle state
- Actors and tasks support resource hints that ensure they are only scheduled on a worker pod with enough resources, e.g., CPU/memory
- If an actor consumes more memory than the host pod's resource limits, the actor will crash. It is the user's responsibility to estimate peak resource consumption and provide scheduling hints so actors are scheduled on appropriate host pods.
- Tasks and actors create and compute on objects. These objects can be located anywhere on the cluster, hence they are called remote objects. Remote objects are referred to by object refs.
(Diagram: Task, Actor with spin loop, Actor without spin loop)

Slide 46

Slide 46 text

Passing Data ● Base case Average execution time: ~4.5 sec

Slide 47

Slide 47 text

Passing Data ● Within-node (instance) data transfer Average execution time: ~10 sec

Slide 48

Slide 48 text

Passing Data ● Across-node (instance) data transfer Average execution time: ~11 sec

Slide 49

Slide 49 text

Passing Data ● Dereferencing remote objects too early Average execution time: ~35 sec
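This likely corresponds to calling ray.get() on the driver and then passing the materialized data to downstream work, instead of passing ObjectRefs and letting Ray move data between workers directly. A sketch contrasting the two styles (array size and task names are illustrative):

import ray
import numpy as np

ray.init()

@ray.remote
def produce():
    return np.zeros(50_000_000, dtype=np.uint8)  # ~50 MB payload

@ray.remote
def consume(arr):
    return int(arr.sum())

data_ref = produce.remote()

# Dereferencing too early: the array is pulled into the driver process and
# then shipped again to every consumer task.
data = ray.get(data_ref)
early = ray.get([consume.remote(data) for _ in range(5)])

# Better: pass the ObjectRef. Ray resolves it on the worker that runs
# consume(), replicating the object node to node only where needed.
late = ray.get([consume.remote(data_ref) for _ in range(5)])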

Slide 50

Slide 50 text

We'll be focusing on the data loading part. Case 1, Case 2

Slide 51

Slide 51 text

Ray Kubernetes Cluster
- Tasks
- Actors
- Ray nodes <-> k8s pods
- RayCluster
  - Head node (pod)
  - Several worker nodes (pods)
  - Service (ClusterIP/NodePort/LoadBalancer)
- Actor/Task -> (1:1) -> Process -> (n:1) -> node (pod)
- The RayCluster CRD describes the desired cluster state. The Ray-Kubernetes operator adjusts the cluster's current state to track the desired state.

Slide 52

Slide 52 text

Ray System Architecture
Global Control Store:
- List of nodes, actors, tasks
- Client workflow zip (similar to a Prefect pickle?)
- There is no shared object store at the cluster level!
- Shared memory across actors means large objects can be efficiently shared across actors on a single node. This implies that scheduling actors on a single node can be more efficient than scheduling them on different nodes (see the sketch below).
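A small sketch of why single-node co-location helps: numpy arrays held in the plasma store are read zero-copy via shared memory by workers on the same node. (Whether the readers actually land on one node depends on scheduling hints, omitted here; sizes are illustrative.)

import ray
import numpy as np

ray.init()

@ray.remote
class Reader:
    def checksum(self, arr):
        # On the same node, a numpy array stored in the plasma store is
        # mapped read-only into this process via shared memory (zero-copy),
        # so N readers do not create N copies of the data.
        return float(arr.sum())

big = np.ones(25_000_000, dtype=np.float32)  # ~100 MB
big_ref = ray.put(big)

readers = [Reader.remote() for _ in range(4)]
print(ray.get([r.checksum.remote(big_ref) for r in readers]))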