Ray Community Meetup Talks

Data transfer speed comparison in a distributed ML application: Ray Plasma store vs. S3

Anyscale
September 29, 2022
Transcript

  1. Data Transfer Speed on Ray - Ankur Mohan

  2. Outline
     Objective: Compare data transfer speed from a distributed data store (e.g., S3) vs. the Ray plasma store. We will also cover setting up a RayCluster on Kubernetes.
     Use case: Downloading and scattering data to compute nodes is a frequent operation in any data processing or ML workflow and must be efficient to benefit from parallelization.
     Agenda:
     - Test application demo
     - Test application system architecture
     - Ray concepts
     - Analysis of data transfer latency for several scenarios
     - Conclusion
  3. Test Application workflow Sample audio

  4. None
  5. None
  6. System Architecture

  7. Ray Concepts
     - Tasks: Arbitrary functions to be executed asynchronously on separate Python workers.
     - Actors: A stateful task. When a new actor is instantiated, a new process is created. Methods of the actor are scheduled on that process and can access and mutate the actor's state.
     - Tasks and actors create and compute on objects. These objects can be located anywhere on the cluster, hence they are called remote objects. Remote objects are referred to by object refs.
     - Remote objects are cached in Ray's distributed object store, also called the plasma store.
     - There is one object store per Ray node; there is no "global" object store. If an actor or task running on a node needs a piece of data that is not located in that node's object store, the data needs to be replicated from where it is located to where it is needed (see the sketch below).
     Task
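     A minimal sketch of these concepts using the standard Ray API (illustrative, not taken from the deck): a task, an actor, and remote objects referenced by ObjectRefs.

        import ray

        ray.init()

        # Task: an arbitrary function executed asynchronously on a worker process
        @ray.remote
        def preprocess(x):
            return x * 2

        # Actor: a stateful worker; its methods run in the actor's own process
        @ray.remote
        class Counter:
            def __init__(self):
                self.count = 0

            def incr(self):
                self.count += 1
                return self.count

        # Remote objects live in the node-local plasma store and are referred to
        # by ObjectRefs; ray.get() fetches (and, if needed, replicates) the value
        obj_ref = ray.put([1, 2, 3])        # place an object in the local object store
        result_ref = preprocess.remote(21)  # returns an ObjectRef immediately
        counter = Counter.remote()
        print(ray.get([result_ref, counter.incr.remote()]))  # -> [42, 1]
        print(ray.get(obj_ref))                              # -> [1, 2, 3]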
  8. Now let's apply these ideas to downloading the model in the speech2text application
     Object Replication from Plasma Store | Object Download from S3
  9. Object Store Replication from Plasma Store
     Scenario | Objective
     Case 1 (base case): Download Actor and Process Actor on different pods on different instances | Cross-instance data transfer latency
     Case 2: One Download Actor and multiple Process Actors, each on different instances | Cross-instance data transfer latency + object store read parallelism
     Case 3: Download Actor and Process Actors on different pods on the same instance | Within-instance object store read parallelism
     Case 4: Download Actor and Process Actors on the same pod | Inter-process data transfer; no object store replication involved; higher data + compute locality
  10. Case 1 (base): Download and Process actors on different nodes
      # Process Actors | Data Transfer Time (sec)
      1 | 4.0
      Data transfer rate = 563 MB / 4.0 s ≈ 140.75 MB/sec
  11. Case 2: Multiple process actors on separate instances
      • Here, there is a single download actor that downloads the model from S3 and transfers the model data to the object store of that pod. Then, the data is replicated to multiple process actors running on separate EC2 instances
      • This scenario measures data transfer across EC2 instances + parallel data replication
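      A rough sketch of this download-actor / process-actor pattern (actor and method names are hypothetical; the real application downloads a ~563 MB model from S3):

        import ray

        ray.init()

        @ray.remote
        class DownloadActor:
            def fetch_model(self):
                # Hypothetical stand-in for the S3 download. The large return value
                # is stored in this node's plasma store and an ObjectRef is returned.
                return b"\x00" * (16 * 1024 * 1024)  # placeholder; real model is ~563 MB

        @ray.remote
        class ProcessActor:
            def load(self, model_bytes):
                # If this actor runs on another node, Ray replicates the object from
                # the downloader's plasma store to this node's plasma store first.
                return len(model_bytes)

        downloader = DownloadActor.remote()
        model_ref = downloader.fetch_model.remote()   # single download from S3

        processors = [ProcessActor.remote() for _ in range(5)]
        # Passing the ObjectRef (not the bytes) lets each actor pull the object from
        # the object store; the driver never materializes the model itself.
        sizes = ray.get([p.load.remote(model_ref) for p in processors])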
  12. Case 2: Multiple process actors on separate instances (Ray 2.0.0)
      Data transfer rate = 563 MB × 5 / 10.5 s ≈ 268 MB/sec
  13. Case 2: Multiple process actors on separate instances (Ray < 2.0.0)
      # Process actors | Data transfer time (sec) | Data transfer BW (MB/sec)
      1 | 5.8 | 97
      2 | 10.8 | 104
      3 | 15.23 | 110
      4 | 19.79 | 113
      5 | 25.8 | 109
      This graph shows the increase in data transfer time as the number of process actors is increased from 1 to 5, with each process actor scheduled on a separate EC2 instance, for the previous version of Ray. The increase in data transfer time is nearly linear, i.e., replication to multiple actors was essentially serial. Compared with the ~10.5 sec that Ray 2.0.0 takes for 5 actors, this implies that parallel data replication has been significantly optimized in Ray 2.0.0!
  14. Case 3: Download Actor and Process Actors on different pods on the same instance
      # Process actors | Data transfer time (sec) | Data transfer BW (MB/sec)
      1 | 3.24 | 173
      2 | 3.60 | 312
      kubectl get pods -o=wide -n ray
  15. Case 4: Download and Process actors on the same pod
      # Process actors | Data transfer time (sec) | Data transfer BW (MB/sec)
      1 | 2.2 | 255
      2 | 2.4 | 469
      Object store not involved!
  16. Aside: Discussion about Fred Reiss's talk
      This talk: object store replication + object store to process memory
  17. Object Download from S3
      Scenario | Objective
      Case 1: Single Download + Process actor | Base case
      Case 2: Multiple actors, each on a separate EC2 instance | S3 read parallelism
      Case 3: Multiple actors on the same EC2 instance | Per-instance S3 read parallelism
      Case 4: Multiple actors on the same pod | Inter-process data transfer
  18. Case 1 + 2: Downloading from S3, multiple actors
      # Num Actors | Download time (sec)
      1 | 5.4
      2 | 5.4
      3 | 5.4
      4 | 5.6
      Object store not utilized!
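      A hedged sketch of the direct-from-S3 pattern measured here: every actor performs its own S3 read into its process memory, bypassing the plasma store (bucket and key names are hypothetical; assumes boto3 and AWS credentials are available):

        import ray
        import boto3

        ray.init()

        @ray.remote
        class S3DownloadActor:
            def __init__(self, bucket, key):
                self.bucket = bucket
                self.key = key

            def download(self, dest="/tmp/model.bin"):
                # Each actor opens its own S3 connection and downloads the object
                # straight into its own process/disk; the plasma store is not used.
                s3 = boto3.client("s3")
                s3.download_file(self.bucket, self.key, dest)
                return dest

        # Hypothetical bucket/key; each actor performs an independent S3 read
        actors = [S3DownloadActor.remote("my-model-bucket", "speech2text/model.bin")
                  for _ in range(4)]
        paths = ray.get([a.download.remote() for a in actors])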
  19. Case 3: Downloading from S3, multiple actors on the same instance
      Network I/O bound when multiple actors are scheduled on the same EC2 instance
      Data transfer rate (#actors = 5) = 5 × 563 MB / 12 s ≈ 234 MB/sec
      Data transfer rate (#actors = 2) = 2 × 563 MB / 6.5 s ≈ 173 MB/sec
      Ray driver output showing actors starting on separate pods
      kubectl -n ray get pods -o=wide
      All pods are scheduled on the same EC2 instance
  20. Case 4: Downloading from S3, multiple actors on the same pod
      Notice: Data is copied directly to process memory, not via the object store - similar to actors on different pods
  21. Scheduling tools and techniques
      • A maximum of two actors running f1 can be scheduled on this pod/node
      • Need to ensure the pod/node has sufficient CPU/memory allocation
      • Force pods onto a particular instance or group of instances: nodeName, nodeAffinity, taints/tolerations
        ◦ Need a combination of nodeName + taints/tolerations to ensure only certain pods are scheduled on certain instances, and nothing else
      • Force actors onto the same pod: actor resources
  22. Scheduling tools and techniques
      • Can also specify resources (CPU/memory/GPUs, custom resources) in the task/actor definition. This ensures the task/actor is only scheduled on a node/pod that provides those (or more) resources (see the sketch below)
      • Ray also offers placement groups, which allow users to atomically reserve groups of resources across multiple nodes. These can then be used to schedule Ray tasks and actors packed as close as possible for locality (PACK) or spread apart (SPREAD). Placement groups are generally used for gang-scheduling actors, but also support tasks.
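      A sketch of both mechanisms using the Ray 2.x API (resource figures are illustrative): a resource hint on an actor definition, and a PACK placement group whose bundles are reserved before actors are scheduled into them.

        import ray
        from ray.util.placement_group import placement_group
        from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

        ray.init()

        # Resource hint: only schedule this actor on a node/pod that can provide
        # at least 1 CPU and ~2 GiB of memory
        @ray.remote(num_cpus=1, memory=2 * 1024**3)
        class ProcessActor:
            def process(self, data):
                return len(data)

        # Atomically reserve two bundles (1 CPU + 2 GiB each), packed on as few
        # nodes as possible for locality (use strategy="SPREAD" to spread them out)
        pg = placement_group([{"CPU": 1, "memory": 2 * 1024**3}] * 2, strategy="PACK")
        ray.get(pg.ready())  # block until the reservation is granted

        actors = [
            ProcessActor.options(
                scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
            ).remote()
            for _ in range(2)
        ]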
  23. Conclusions
      • For mapped tasks with a medium (< 10) mapping index, object store replication is more efficient. This is because actors can be scheduled on a small group of EC2 instances, and the object store supports a moderate level of read parallelism
        ◦ Major improvement in parallel read BW in Ray 2.0!
        ◦ May eliminate the need for setting up NFS/minio/multi-mount EBS volumes
        ◦ Data stays local to the cluster
      • For tasks with a high mapping index, parallel reads from S3 may be more efficient, because S3 offers great read parallelism
        ◦ However, for small data sizes, object store replication is probably sufficient
        ◦ Need to consider the latency of obtaining S3 read credentials
      • It helps to locate the data source and data consumer close to each other: on the same EC2 instance, or on the same pod, if feasible
      • Must anticipate peak CPU/memory use and schedule actors/tasks on pods with sufficient resources, otherwise the actor/task will crash. Ray offers several methods to achieve this
  24. Thanks!

  25. Appendix

  26. RayCluster

  27. Passing Data
      • Lazy loading of data - avoids unnecessary complexity
      • Functions are allocated to separate pods (not necessarily separate instances!)
  28. Data transfer when actors are on the same pod, but on a different pod from the source
  29. Data transfer when pods are on the same EC2 instance

  30. When source and two actors are on the same pod

  31. Giving the scheduler free rein
      # Actors | Data load time (sec)
      5 | 19.01

  32. Ray Head and Worker
      Each is a collection of services
      k exec -n ray example-cluster-ray-head-49rl9 -it /bin/bash
  33. Service | Language | Head/Worker | Function
      log_monitor.py | Python | Worker, Head | Process for monitoring Ray log files
      dashboard/agent.py | Python | Worker, Head | A server that exposes endpoints for collecting metrics such as CPU/Mem utilization
      raylet | C++ (ray-project/ray/blob/master/src/ray/raylet/raylet.cc) | Worker, Head | Consists of the node manager (aka scheduler?) and object manager, which are services listening on certain (configurable) ports and responsible for scheduling remote task execution and transferring data across nodes
      gcs_server | C++ (ray-project/ray/gcs/gcs_server/gcs_server_main.cc) | Head | Server that exposes the global control store, a storage for metadata about actors, nodes, jobs, resources, placement groups, etc. See this1 and this2
      client.server | Python | Head | Server that enables remote clients to connect to a Ray cluster to run programs
      dashboard | Python | Head | Exposes a UI that shows cluster-wide state such as per-node logs, actors/tasks running on each node, CPU/Mem utilization, etc.
  34. Actors
      • Actors map 1:1 to a process. Processes are scheduled depending on resource (CPU cores) availability
      • Actors can have state
      • Actors can be running a long/short-running method (e.g., a spin loop) or be in an idle state
      • Actors support resource hints that ensure they are only scheduled on a worker pod with enough resources, e.g., CPU/memory
        ◦ If an actor consumes more memory than the host pod's resource limits, the actor will crash. It is the user's responsibility to estimate peak resource consumption and provide scheduling hints so actors are scheduled on appropriate host pods
        ◦ The accelerator_type field can be used to force an actor to be scheduled on a node with that particular accelerator type available - e.g., a certain type of GPU (see the sketch below)
        ◦ It seems to be easy to get into a deadlock by forcing the hand of Ray's scheduler too much!
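      For example, a hedged sketch of the accelerator_type hint (constant names come from ray.util.accelerators; the GPU type and memory figure are illustrative):

        import ray
        from ray.util.accelerators import NVIDIA_TESLA_V100

        # Only schedule this actor on a node that offers a GPU of the given
        # accelerator type; the memory hint keeps it off undersized pods
        @ray.remote(num_gpus=1, accelerator_type=NVIDIA_TESLA_V100, memory=8 * 1024**3)
        class GpuProcessActor:
            def ping(self):
                return "scheduled on a V100 node"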
  35. Node Manager and Object Manager services are running on the

    Raylet process Note - :: is the short-form of 0:0:0:0:0:0:0:0, which is the equivalent address of 0.0.0.0 in IPv6
  36.-44. Remote Task Execution (sequence of diagram slides)

  45. Ray Concepts
      - Tasks: Arbitrary functions to be executed asynchronously on separate Python workers
      - Actors: A stateful worker. When a new actor is instantiated, a new worker (process) is created, and methods of the actor are scheduled on that specific worker and can access and mutate its state
      - Actors can be running a long/short-running method (e.g., a spin loop) or be in an idle state
      - Actors and tasks support resource hints that ensure they are only scheduled on a worker pod with enough resources, e.g., CPU/memory
      - If an actor consumes more memory than the host pod's resource limits, the actor will crash. It is the user's responsibility to estimate peak resource consumption and provide scheduling hints so actors are scheduled on appropriate host pods
      - Tasks and actors create and compute on objects. These objects can be located anywhere on the cluster, hence they are called remote objects. Remote objects are referred to by object refs
      Task | Actor with spin loop | Actor without spin loop
  46. Passing Data • Base case Average execution time: ~4.5 sec

  47. Passing Data • Within-node (instance) data transfer Average execution time:

    ~10 sec
  48. Passing Data • Across-node (instance) data transfer Average execution time:

    ~11 sec
  49. Passing Data • Dereferencing remote objects too early Average execution

    time: ~35 sec
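      A sketch of why this pattern is slow (sizes are illustrative): calling ray.get() early pulls every object into the driver, which then re-ships the data when it is passed back out, whereas passing the ObjectRefs keeps the transfer worker-to-worker.

        import ray
        import numpy as np

        ray.init()

        @ray.remote
        def load_chunk(i):
            return np.zeros(50_000_000 // 8)   # ~50 MB per chunk

        @ray.remote
        def consume(chunk):
            return chunk.sum()

        refs = [load_chunk.remote(i) for i in range(4)]

        # Slow: ray.get() materializes every chunk in the driver, and each chunk is
        # then serialized again when passed to consume()
        chunks = ray.get(refs)
        slow = ray.get([consume.remote(c) for c in chunks])

        # Faster: pass the ObjectRefs through; data moves directly between object
        # stores (or not at all, if producer and consumer share a node)
        fast = ray.get([consume.remote(r) for r in refs])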
  50. We’ll be focussing on the data loading part Case 1

    Case 2
  51. Ray Kubernetes Cluster
      - Tasks
      - Actors
      - Ray nodes <-> k8s pods
      - RayCluster
        - Head node (pod)
        - Several worker nodes (pods)
        - Service (ClusterIP/NodePort/LoadBalancer)
      - Actor/Task -> (1:1) -> Process -> (n:1) -> node (pod)
      - The RayCluster CRD describes the desired cluster state. The Ray-Kubernetes operator adjusts the cluster's current state to track the desired state
  52. Ray System Architecture
      - Global Control Store: list of nodes, actors, tasks; client workflow zip (similar to Prefect pickle?)
      - No shared object store at the cluster level!
      - Shared memory across actors means large objects can be efficiently shared across actors on a single node. This implies that scheduling actors on a single node can be more efficient than scheduling actors on different nodes (see the sketch below)
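      An illustrative sketch of this single-node sharing (standard Ray behavior for numpy arrays): workers on the same node read the array zero-copy out of shared plasma memory.

        import ray
        import numpy as np

        ray.init()

        # A large numpy array placed in the plasma store can be read zero-copy by
        # every worker on the same node: they all map the same shared memory
        array_ref = ray.put(np.zeros(10_000_000))

        @ray.remote
        def mean(arr):
            return float(arr.mean())   # arr is a read-only view backed by shared memory

        print(ray.get([mean.remote(array_ref) for _ in range(4)]))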