store (e.g., S3) vs. the Ray plasma store. We will also learn how to set up a RayCluster on Kubernetes.

Use-case: Downloading and scattering data to compute nodes is a frequent operation in any data processing or ML workflow and must be efficient to benefit from parallelization.

Agenda:
- Test application demo
- Test application system architecture
- Ray concepts
- Analysis of data transfer latency for several scenarios
- Conclusion
on separate Python workers.
- Actors: A stateful worker. When a new actor is instantiated, a new process is created. Methods of the actor are scheduled on that process and can access and mutate the actor's state.
- Tasks and actors create and compute on objects. These objects can be located anywhere on the cluster and are therefore called remote objects. Remote objects are referred to by object refs.
- Remote objects are cached in Ray's distributed object store, also called the plasma store.
- There is one object store per Ray node; there is no "global" object store. If an actor or task running on a node needs a piece of data that is not located in that node's object store, the data needs to be replicated from where it is located to where it is needed.

(Diagram: Task)
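To make these concepts concrete, here is a minimal sketch (the function, actor class, and values are illustrative, not taken from the test application) of a task, an actor, and remote objects referenced by object refs:

```python
import ray

ray.init()  # connect to (or start) a Ray cluster

# A task: a stateless function scheduled on some worker process.
@ray.remote
def square(x):
    return x * x

# An actor: a stateful worker; each method call runs on the actor's process
# and can mutate its state.
@ray.remote
class Counter:
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

# .remote() returns object refs immediately; the resulting objects live in
# the object store of whichever node produced them.
ref = square.remote(4)
counter = Counter.remote()
count_ref = counter.increment.remote()

# ray.get() resolves object refs, replicating the data across nodes if needed.
print(ray.get(ref), ray.get(count_ref))  # -> 16 1
```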
(base case): Download Actor and Process Actor on different pods on different instances. Measures cross-instance data transfer latency.
- Case 2: One Download Actor and multiple Process Actors, each on different instances. Measures cross-instance data transfer latency + object store read parallelism.
- Case 3: Download Actor and Process Actors on different pods on the same instance. Measures object store read parallelism within an instance.
- Case 4: Download Actor and Process Actors on the same pod. Measures inter-process data transfer; no object store replication is involved. Higher data + compute locality.
there is a single download actor that downloads the model from S3 and puts the model data into the object store of its pod. The data is then replicated to multiple process actors running on separate EC2 instances.
• This scenario measures data transfer across EC2 instances + parallel data replication.
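A minimal sketch of this download/process actor pattern (the bucket name, key, and actor methods are illustrative, not the actual test application):

```python
import boto3
import ray

ray.init()

@ray.remote
class DownloadActor:
    def download(self, bucket: str, key: str) -> bytes:
        # Read the model from S3; returning the bytes places them in this
        # node's object store as a remote object.
        body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
        return body.read()

@ray.remote
class ProcessActor:
    def process(self, model_bytes: bytes) -> int:
        # Receiving the object ref triggers replication of the data to this
        # actor's node (if it lives on a different node than the downloader).
        return len(model_bytes)

downloader = DownloadActor.remote()
processors = [ProcessActor.remote() for _ in range(5)]

# The download actor produces one remote object; each process actor
# consumes it, which is the cross-node replication being measured.
model_ref = downloader.download.remote("my-bucket", "model.bin")
sizes = ray.get([p.process.remote(model_ref) for p in processors])
```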
(MB/sec). Measured transfer times and aggregate transfer rates per number of process actors:
- 1 actor: 5.8 sec, 97 MB/sec
- 2 actors: 10.8 sec, 104 MB/sec
- 3 actors: 15.23 sec, 110 MB/sec
- 4 actors: 19.79 sec, 113 MB/sec
- 5 actors: 25.8 sec, 109 MB/sec

Case 2: Multiple process actors on separate instances (Ray < 2.0.0)
This graph shows the increase in data transfer time as the number of process actors is increased from 1 to 5, with each process actor scheduled on a separate EC2 instance, for a pre-2.0.0 version of Ray. The increase in data transfer time is nearly linear in the number of actors (i.e., replication was effectively serialized), implying that parallel data replication has been significantly optimized in Ray 2.0.0!
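For reference, the transfer-rate column can be reproduced from the transfer times, assuming the ~563 MB payload used in the Case 3 numbers later in this document:

```python
# Aggregate transfer rate = (number of actors * payload size) / transfer time.
# The 563 MB payload size is taken from the Case 3 calculations; adjust if
# your payload differs.
PAYLOAD_MB = 563
times_sec = {1: 5.8, 2: 10.8, 3: 15.23, 4: 19.79, 5: 25.8}

for n_actors, t in times_sec.items():
    rate = n_actors * PAYLOAD_MB / t
    print(f"{n_actors} actors: {rate:.0f} MB/sec")

# The rates stay roughly flat (~97-113 MB/sec), i.e. replication did not
# parallelize in the pre-2.0.0 runs.
```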
+ Process actor (base case).
- Case 2: Multiple actors, each on a separate EC2 instance. Measures S3 read-parallelism.
- Case 3: Multiple actors on the same EC2 instance. Measures per-instance S3 read-parallelism.
- Case 4: Multiple actors on the same pod. Measures inter-process data transfer.
same EC2 instance

Case 3: Downloading from S3, multiple actors on the same instance
- Data transfer rate (#actors = 5) = 5 * 563 MB / 12 sec = 234 MB/sec
- Data transfer rate (#actors = 2) = 2 * 563 MB / 6.5 sec = 173 MB/sec

(Screenshot: Ray driver output showing actors starting on separate pods)
kubectl -n ray get pods -o=wide
(Screenshot: all pods are scheduled on the same EC2 instance)
f1 can be scheduled on this pod/node
• Need to ensure the pod/node has sufficient CPU/memory allocation
• Force pods onto a particular instance or group of instances: nodeName, nodeAffinity, taints/tolerations
  ◦ A combination of nodeName + taints/tolerations is needed to ensure only certain pods are scheduled on certain instances, and no others!
• Force actors onto the same pod: actor resources
custom resources) in the task/actor definition. This ensures the task/actor is only scheduled on a node/pod that provides those (or more) resources.
• Ray also offers placement groups, which allow users to atomically reserve groups of resources across multiple nodes. These reservations can then be used to schedule Ray tasks and actors packed as close as possible for locality (PACK) or spread apart (SPREAD). Placement groups are generally used for gang-scheduling actors, but also support tasks. A sketch of both mechanisms follows below.
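A minimal sketch of both mechanisms, assuming a cluster whose worker pods advertise a hypothetical custom resource named "download_node" (the resource name, bundle sizes, and actor classes are illustrative):

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# Custom resources: this actor is only scheduled on a node that advertises
# at least one unit of the (hypothetical) "download_node" resource, in
# addition to the requested CPU and memory.
@ray.remote(num_cpus=2, memory=4 * 1024**3, resources={"download_node": 1})
class DownloadActor:
    def ping(self):
        return "ok"

# Placement group: atomically reserve two 1-CPU bundles, packed onto as few
# nodes as possible (use strategy="SPREAD" to spread them apart instead).
pg = placement_group([{"CPU": 1}, {"CPU": 1}], strategy="PACK")
ray.get(pg.ready())  # block until the reservation is granted

@ray.remote(num_cpus=1)
class ProcessActor:
    def ping(self):
        return "ok"

# Gang-schedule the actors into the reserved bundles.
actors = [
    ProcessActor.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
    ).remote()
    for _ in range(2)
]
ray.get([a.ping.remote() for a in actors])
```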
mapping index, object store replication is more efficient. This is because actors can be scheduled on a small group of EC2 instances and the object store supports a moderate level of read-parallelism.
  ◦ Major improvement in parallel read bandwidth in Ray 2.0!
  ◦ May eliminate the need for setting up NFS/minio/multi-mount EBS volumes
  ◦ Data stays local to the cluster
• For tasks with a high mapping index, parallel reads from S3 may be more efficient, because S3 offers great read parallelism.
  ◦ However, for small data sizes, object store replication is probably sufficient
  ◦ Need to consider the latency of obtaining S3 read credentials
• It helps to locate the data source and data consumer close to each other: on the same EC2 instance, or on the same pod, if feasible.
• Anticipate peak CPU/memory use and schedule actors/tasks on pods with sufficient resources; otherwise the actor/task will crash. Ray offers several methods to achieve this.
monitoring ray log files
- dashboard/agent.py (Python; runs on Worker, Head): A server that exposes endpoints for collecting metrics such as CPU/memory utilization.
- raylet (C++, ray-project/ray/blob/master/src/ray/raylet/raylet.cc; runs on Worker, Head): Consists of the node manager (aka scheduler?) and the object manager, which are services listening on certain (configurable) ports and responsible for scheduling remote task execution and transferring data across nodes.
- gcs_server (C++, ray-project/ray/gcs/gcs_server/gcs_server_main.cc; runs on Head): Server that exposes the global control store, a store for metadata about actors, nodes, jobs, resources, placement groups etc. See this1 and this2.
- client.server (Python; runs on Head): Server that enables remote clients to connect to a Ray cluster to run programs.
- dashboard (Python; runs on Head): Exposes a UI showing cluster-wide state such as per-node logs, actors/tasks running on each node, and CPU/memory utilization.
scheduled depending on resource (CPU cores) availability
• Actors can have state
• Actors can be running a long/short-running method (e.g., a spin loop) or be in an idle state
• Actors support resource hints that ensure they are only scheduled on a worker pod with enough resources, e.g., CPU/memory (see the sketch after this list)
  ◦ If an actor consumes more memory than the host pod's resource limits, the actor will crash. It is the user's responsibility to estimate peak resource consumption and provide scheduling hints so actors are scheduled on appropriate host pods
  ◦ The accelerator_type field can be used to force an actor to be scheduled on a node with that particular accelerator type available, e.g., a certain type of GPU
  ◦ It seems easy to get into a deadlock by forcing the Ray scheduler's hand too much!
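A minimal sketch of these resource hints (the memory value and accelerator type are illustrative; verify the accelerator string against your Ray version's constants):

```python
import ray

ray.init()

# The actor is only scheduled on a node/pod that can provide 2 CPUs,
# 8 GiB of memory, and an NVIDIA V100 GPU (illustrative accelerator type).
@ray.remote(num_cpus=2, memory=8 * 1024**3, num_gpus=1, accelerator_type="V100")
class InferenceActor:
    def run(self, batch):
        return len(batch)

# If no node satisfies all of these constraints, the actor stays pending,
# which is one way to back the scheduler into a corner.
actor = InferenceActor.remote()
```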
on separate Python workers.
- Actors: A stateful worker. When a new actor is instantiated, a new worker (process) is created; methods of the actor are scheduled on that specific worker and can access and mutate its state.
- Actors can be running a long/short-running method (e.g., a spin loop) or be in an idle state
- Actors and tasks support resource hints that ensure they are only scheduled on a worker pod with enough resources, e.g., CPU/memory
- If an actor consumes more memory than the host pod's resource limits, the actor will crash. It is the user's responsibility to estimate peak resource consumption and provide scheduling hints so actors are scheduled on appropriate host pods
- Tasks and actors create and compute on objects. These objects can be located anywhere on the cluster and are therefore called remote objects. Remote objects are referred to by object refs.

(Diagram: Task, Actor with spin loop, Actor without spin loop)
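A purely illustrative sketch of an actor busy in a long-running spin-loop method versus one that sits idle between short calls:

```python
import time
import ray

ray.init()

@ray.remote
class SpinActor:
    def spin(self, seconds: float) -> str:
        # Long-running method: occupies the actor's worker process for
        # `seconds`, so further calls to this actor queue up behind it.
        deadline = time.time() + seconds
        while time.time() < deadline:
            pass  # busy-wait / spin loop
        return "done spinning"

@ray.remote
class IdleActor:
    def ping(self) -> str:
        # Short-running method: the actor is idle between calls.
        return "pong"

spinner, idler = SpinActor.remote(), IdleActor.remote()
spin_ref = spinner.spin.remote(10.0)   # keeps the spinner busy for ~10 s
print(ray.get(idler.ping.remote()))    # returns immediately
print(ray.get(spin_ref))               # -> "done spinning"
```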
<-> k8s pods
- RayCluster
  - Head node (pod)
  - Several worker nodes (pods)
  - Service (ClusterIP/NodePort/LoadBalancer)
- Actor/Task -> (1:1) -> Process -> (n:1) -> node (pod)
- The RayCluster CRD describes the desired cluster state. The Ray-Kubernetes operator adjusts the cluster's current state to track the desired state.
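For example, once the operator has created the head pod and its Service, a remote client can connect through the client server port (the service name and "ray" namespace below are assumptions, not from the deployment shown here):

```python
import ray

# The head pod's Service exposes the Ray client server (default port 10001).
# "raycluster-head-svc" and the "ray" namespace are illustrative names.
ray.init("ray://raycluster-head-svc.ray.svc.cluster.local:10001")

# e.g. total CPUs/memory aggregated across the head and worker pods
print(ray.cluster_resources())
```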
actors, tasks
- Client workflow zip (similar to Prefect pickle?)
- No shared object store at the cluster level!
- Shared memory across actors means large objects can be efficiently shared by actors on a single node. This implies that scheduling actors on a single node can be more efficient than scheduling them on different nodes.
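A minimal sketch of that shared-memory behaviour, assuming the actors land on the same node as the object (the array size and actor class are illustrative):

```python
import numpy as np
import ray

ray.init()

@ray.remote
class Reader:
    def checksum(self, arr: np.ndarray) -> float:
        # On the same node, `arr` is a read-only, zero-copy view backed by
        # the node's shared-memory object store; on another node it would
        # first be replicated into that node's store.
        return float(arr.sum())

# Put one large (~800 MB) array into the local object store once...
big_ref = ray.put(np.ones((100_000_000,), dtype=np.float64))

# ...and share it with several actors without a per-actor copy,
# as long as they are co-located with the object.
readers = [Reader.remote() for _ in range(4)]
print(ray.get([r.checksum.remote(big_ref) for r in readers]))
```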