[Ray Summit 2023] Approaching Cluster Multi-tenancy with Ray Job

Jae Sim

June 12, 2024

Transcript

  1. Ikigai Pipeline Engine with Ray
    ❏ Each step runs its task in an individual pod
    ❏ Steps with heavier workloads run their critical tasks on a Ray cluster
    ❏ Intermediate datasets and ML models are stored in S3
  2. What is Multi-tenancy?
    “Multitenancy is a reference to the mode of operation of software where multiple independent instances of one or multiple applications operate in a shared environment. The instances (tenants) are logically isolated, but physically integrated.”
    ❏ Soft Multi-tenancy: easier to implement
    ❏ Hard Multi-tenancy: harder to implement, generally “better”, stricter isolation of tenants
  3. Why Multi-tenancy?
    1. Faster Response: no need to wait for Ray cluster creation for new tenants
    2. Cost-effective: we operate fewer Ray clusters overall and save the cost of cluster creation time
    3. Reduced Maintenance Overhead: the number of clusters under maintenance is drastically lower
  4. Approaching Ray Cluster Multi-tenancy
    1. Task Submission with Ray Job
    2. Tenant Isolation with Runtime Environment
    3. Resource Management strategies
  5. Ikigai Pipeline Engine with Ray Client
    ❏ A new connection is created when pipeline execution starts
    ❏ Connections are long-running and last until the end of pipeline execution
  6. Ray Client
    ❏ Great for interactive development and a faster PoC process
    ❏ Not designed for a high number of concurrent connections
    ❏ Run Python code on a Ray cluster as if you were running it on a local machine
    ❏ Very minimal changes required in the codebase
  7. Ray Job
    Ray Job submission is a mechanism to submit locally developed and tested applications to a remote Ray cluster. It simplifies the experience of packaging, deploying, and managing a Ray application.
    ❏ Entrypoint Script
    ❏ Job Submission
    ❏ Job Status Check
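The submit-then-check flow named on this slide can be sketched roughly as follows. This is a minimal illustration, not the speaker's code: `run_pipeline_step` is a hypothetical helper, and `client` is assumed to behave like `ray.job_submission.JobSubmissionClient` (its `submit_job` and `get_job_status` methods). Ray itself is not imported, so the helper can be exercised with a stand-in client.

```python
import time

# Terminal states reported by Ray's job API (ray.job_submission.JobStatus).
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def run_pipeline_step(client, entrypoint, runtime_env=None,
                      poll_seconds=1.0, timeout_seconds=600.0):
    """Submit an entrypoint command as a Ray Job and wait for a terminal state.

    `client` is assumed to look like ray.job_submission.JobSubmissionClient,
    for example JobSubmissionClient("http://127.0.0.1:8265"); the address
    depends on your cluster and is only an illustration.
    """
    job_id = client.submit_job(entrypoint=entrypoint, runtime_env=runtime_env)
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = client.get_job_status(job_id)
        # JobStatus is a string-valued enum; normalise it to a plain string.
        name = str(getattr(status, "value", status))
        if name in TERMINAL_STATES:
            return job_id, name
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {name} after {timeout_seconds}s")
        time.sleep(poll_seconds)
```

Because each call is a short request rather than a long-lived session, a pipeline executor written this way holds no open connection to the cluster while the job runs.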
  8. Ikigai Pipeline Engine with Ray Job
    Still making minimal changes to the codebase, but no more unpredictable connection failures!
  9. Runtime Environment
    Configuring dependencies and environment outside of Ray scripts
    ❏ working_dir: the working directory for the Ray workers (local or S3)
    ❏ py_modules: Python modules to be available for import in the Ray workers
    ❏ pip: a list of pip requirements
    ❏ env_vars: environment variables to set
    … and more!
    ❏ With RuntimeEnv Object
    ❏ With Dictionary
  10. Runtime Environment
    Configuring dependencies and environment outside of Ray scripts
    ❏ For Task/Actor
    ❏ For Ray Job Submission
    ❏ For Ray Client
  11. RuntimeEnv: working_dir
    ❏ Each pipeline’s data and ML models will be located in its own S3 path
    ❏ Each tenant’s data and models are logically isolated
    Each job will have access to its own S3 path, specified by `working_dir`
  12. RuntimeEnv: pip
    Defined by user 1
    Defined by user 2
    ❏ Each tenant’s Python environment is logically isolated
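One way to picture the per-tenant isolation from the `working_dir` and `pip` slides is a small builder that returns a separate runtime environment dictionary for each tenant. This is only a sketch under stated assumptions: the function name, the bucket name, the S3 key layout, and the `TENANT_ID` variable are all hypothetical. The `.zip` suffix reflects that Ray's documentation describes remote working directories as URIs pointing at a zip archive.

```python
def build_tenant_runtime_env(tenant_id, pipeline_id, pip_requirements,
                             env_vars=None, bucket="example-pipeline-bucket"):
    """Return a runtime_env dictionary scoped to one tenant's pipeline."""
    # Hypothetical layout: one packaged working directory per tenant and pipeline.
    working_dir = f"s3://{bucket}/{tenant_id}/{pipeline_id}/working_dir.zip"
    return {
        "working_dir": working_dir,
        # Each tenant brings its own requirement list.
        "pip": list(pip_requirements),
        # Environment variable values must be strings.
        "env_vars": {"TENANT_ID": tenant_id, **(env_vars or {})},
    }


env_a = build_tenant_runtime_env("tenant-a", "pipeline-1", ["pandas==2.1.4"])
env_b = build_tenant_runtime_env("tenant-b", "pipeline-7", ["scikit-learn==1.4.0"])
```

A dictionary like `env_a` could then be passed as the `runtime_env` argument when submitting that tenant's job, so two tenants on the same cluster never share a working directory or a Python dependency set.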
  13. Understanding Tasks
    - Categorize the tasks based on the required resources (CPU and memory)
    - Understand the size and capacity of the Ray cluster
    - Avoid running small (or tiny) jobs on the Ray cluster
  14. Required Resources for Task
    Lower Limit: num_cpus, memory
    ❏ This tells the Ray scheduler to reserve the required resources, ensuring that the tasks’ requirements do not exceed the nodes’ capacity
    ❏ Specifying resources does not enforce physical isolation of resources
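The categorization and resource-request ideas from the last two slides could be combined along these lines. The tier names, thresholds, and the `estimated_rows` heuristic are assumptions made for illustration; only the `num_cpus` and `memory` option names come from the slides. In Ray, `memory` is given in bytes.

```python
GIB = 1024 ** 3

# Hypothetical tiers; real values depend on the workload and the cluster size.
RESOURCE_TIERS = {
    "small": {"num_cpus": 1, "memory": 1 * GIB},
    "medium": {"num_cpus": 2, "memory": 4 * GIB},
    "large": {"num_cpus": 8, "memory": 16 * GIB},
}


def resource_options(estimated_rows):
    """Map a rough size estimate to scheduling options.

    Returns None for tiny tasks, which are meant to run outside the cluster.
    """
    if estimated_rows < 10_000:
        return None  # too small to be worth scheduling on the Ray cluster
    if estimated_rows < 1_000_000:
        tier = "small"
    elif estimated_rows < 50_000_000:
        tier = "medium"
    else:
        tier = "large"
    return dict(RESOURCE_TIERS[tier])  # copy so callers cannot mutate the table
```

The returned dictionary is shaped so it can be passed to a remote function, for example `some_task.options(**opts).remote(...)`. These values are scheduling reservations only; as the slide notes, they do not physically cap what a task consumes.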
  15. Queues for Tasks
    Upper Limit
    ❏ Pipeline executors push tasks onto the task queue
    ❏ The task monitor tracks the resource requirements of running jobs and the cluster capacity
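A toy version of the queue and monitor described on this slide might look like the following. It is an in-memory sketch, not the speaker's implementation: the class names, the FIFO admission policy, and the capacity figures are all assumptions.

```python
from collections import deque
from dataclasses import dataclass, field

GIB = 1024 ** 3


@dataclass
class PendingTask:
    name: str
    num_cpus: float
    memory: int  # bytes


@dataclass
class TaskMonitor:
    total_cpus: float
    total_memory: int
    queue: deque = field(default_factory=deque)
    running: dict = field(default_factory=dict)

    def enqueue(self, task):
        """Called by a pipeline executor to hand over a task."""
        self.queue.append(task)

    def _fits(self, task):
        used_cpus = sum(t.num_cpus for t in self.running.values())
        used_memory = sum(t.memory for t in self.running.values())
        return (used_cpus + task.num_cpus <= self.total_cpus
                and used_memory + task.memory <= self.total_memory)

    def admit(self):
        """Release queued tasks in FIFO order while the head of the queue fits."""
        admitted = []
        while self.queue and self._fits(self.queue[0]):
            task = self.queue.popleft()
            self.running[task.name] = task
            admitted.append(task.name)
        return admitted

    def finish(self, name):
        """Mark a running task as done, freeing its reserved resources."""
        self.running.pop(name)


monitor = TaskMonitor(total_cpus=8, total_memory=32 * GIB)
monitor.enqueue(PendingTask("a", num_cpus=4, memory=8 * GIB))
monitor.enqueue(PendingTask("b", num_cpus=4, memory=8 * GIB))
monitor.enqueue(PendingTask("c", num_cpus=2, memory=4 * GIB))
first_wave = monitor.admit()   # "c" has to wait: the CPUs are fully reserved
monitor.finish("a")
second_wave = monitor.admit()  # capacity freed by "a" lets "c" through
```

Strict FIFO keeps the sketch simple but lets a large task at the head of the queue block smaller ones behind it; a real monitor might choose a different policy.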
  16. Achieving Ray Cluster Multi-tenancy
    1. Stable and concurrent Task Submission with Ray Job submission
    2. Tenant Isolation with datasource separation, Python dependency management, and environment variables
    3. Resource Management by providing CPU and memory requirements, along with task queue integration