Introduction to KubeRay - Dmitri Gekhtman, Anyscale & Jiaxin Shan, ByteDance

In this introductory session, we will introduce KubeRay, a Ray cluster management tool built on top of Kubernetes. We will cover the motivation behind KubeRay, the differences between KubeRay and the ray-operator in Ray core, recent v0.2.0 features, and future updates.

Dmitri Gekhtman is a software engineer on the Infrastructure team at Anyscale. His areas of focus include autoscaling and the integration of Ray into Kubernetes environments.

Jiaxin Shan is a software engineer focusing on serverless infrastructure and cloud-native adoption at Bytedance.


June 09, 2022



  1. What’s KubeRay?
     • KubeRay is an open source toolkit to run Ray applications on Kubernetes
     • https://github.com/ray-project/kuberay/
     Who’s involved?
     • KubeRay is a community-driven project
       ◦ Major contributors and adopters include:
         ▪ Microsoft
         ▪ ByteDance
         ▪ Ant Financial
       ◦ Increasing involvement from primary Ray maintainers and Anyscale
     • Relatively new project, gaining traction:
       ◦ 100 GitHub stars
       ◦ 40 forks
       ◦ PRs merged from 30 contributors
       ◦ 2000 unique git cloners
  2. KubeRay is Endorsed by the Ray Maintainers 👍
     • KubeRay emerged as a community-driven project
     • Still an OSS community collaboration
       ◦ Primary Ray maintainers (esp. Anyscale) are getting more involved!
     • Future vision: the KubeRay operator is the solution for Ray deployments on Kubernetes
     • Current work in progress:
       ◦ Key feature integrations (esp. autoscaling)
       ◦ Better testing for rock-solid stability!
       ◦ Docs for a great user experience!
     • Goal: integration of the KubeRay operator into the core Ray ecosystem
       ◦ Beta in Ray 2.0.0
  3. KubeRay Features
     KubeRay provides several tools to improve the experience of running and managing Ray on Kubernetes.
     • Ray Operator (Kubernetes-native management of Ray clusters)
     • Backend services to create/delete cluster resources
     • Kubectl plugin/CLI to operate CRD objects
     • Data-scientist-centric workspace for fast prototyping (incubating)
     • Native Job and Serving integration with Ray clusters (incubating)
     • Ray node problem detector and termination handler (future work)
     • Kubernetes event dumper for Ray clusters and underlying resources (future work)
  4. Kubernetes operator pattern
     The Kubernetes operator pattern lets you extend the cluster's behaviour without modifying the code of Kubernetes itself, by linking controllers to one or more custom resources. Operators are clients of the Kubernetes API that act as controllers for a custom resource.
     https://iximiuz.com/en/posts/kubernetes-operator-pattern/
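The control loop at the heart of this pattern can be sketched in a few lines of plain Python. This is a schematic illustration only, not actual KubeRay or Kubernetes code; the pod names and action tuples are made up for the example:

```python
# Schematic reconcile loop for a hypothetical operator (not real KubeRay code).
# A controller repeatedly compares the desired state declared in a custom
# resource against the observed state in the cluster, and emits the actions
# needed to close the gap.

def reconcile(desired_replicas, observed_pods):
    """Return (action, pod) pairs that converge observed state to desired state."""
    actions = []
    if len(observed_pods) < desired_replicas:
        # Too few pods: create the missing ones.
        for i in range(desired_replicas - len(observed_pods)):
            actions.append(("create", f"worker-{len(observed_pods) + i}"))
    elif len(observed_pods) > desired_replicas:
        # Too many pods: delete the surplus.
        for pod in observed_pods[desired_replicas:]:
            actions.append(("delete", pod))
    return actions

# The custom resource declares 3 workers, but only 1 pod exists:
print(reconcile(3, ["worker-0"]))
# → [('create', 'worker-1'), ('create', 'worker-2')]
```

A real operator runs this loop continuously, watching the API server for changes to its custom resources and re-reconciling on every event.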
  5. What is the Ray Operator?
     The Ray Operator is a Kubernetes operator that automates provisioning, management, autoscaling, and operations of Ray clusters deployed to Kubernetes.
     Features:
     • Management of first-class RayClusters via a custom resource
     • Support for heterogeneous worker types in a single Ray cluster
     • Built-in monitoring via Prometheus
     • Automatic mounting of a volume at /dev/shm for shared memory
     • Use of ScaleStrategy to remove specific nodes from specific groups
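These features are driven by a RayCluster custom resource. A trimmed sketch of such a manifest is shown below; the field names follow the KubeRay v0.2-era API, but the image tags, group names, and resource values are illustrative assumptions and should be checked against the examples in the kuberay repository:

```yaml
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster-heterogeneous
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:latest   # illustrative tag
  workerGroupSpecs:
  - groupName: cpu-group       # heterogeneous worker types:
    replicas: 3                # one CPU group ...
    minReplicas: 1
    maxReplicas: 5
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:latest
  - groupName: gpu-group       # ... and one GPU group in the same cluster
    replicas: 1
    minReplicas: 0
    maxReplicas: 1
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:latest   # a GPU image would be used here
          resources:
            limits:
              nvidia.com/gpu: 1
```

Each entry in workerGroupSpecs defines one worker type with its own pod template and scaling bounds, which is how a single cluster mixes CPU and GPU workers.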
  6. RayCluster Example (continued)
     $ kubectl apply -f raycluster_example.yaml
     $ kubectl get pods
     NAME                                              READY   STATUS    RESTARTS   AGE
     raycluster-heterogeneous-head-5r6qr               1/1     Running   0          14m
     raycluster-heterogeneous-worker-gpu-group-ljzzt   1/1     Running   0          14m
     raycluster-heterogeneous-worker-cpu-group-76qxb   1/1     Running   0          14m
     raycluster-heterogeneous-worker-cpu-group-gs7ty   1/1     Running   0          14m
     raycluster-heterogeneous-worker-cpu-group-dcl4d   1/1     Running   0          14m
     $ kubectl get services
     NAME                              TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
     raycluster-heterogeneous-my-svc   ClusterIP   None         <none>        80/TCP    15m
  7. Ray Autoscaler - KubeRay Implementation (Alpha)
     Work in progress on the new KubeRay NodeProvider:
     • ray-project/ray#21086
     • ray-project/ray#22348
  8. Ray Autoscaler - KubeRay Implementation (Alpha)
     Automatic scaling based on Ray application semantics!
     • The autoscaler reads load metrics from Ray
     • The KuberayNodeProvider patches the RayCluster CR's replica counts to size your cluster
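The sizing decision itself can be illustrated with a toy calculation. This is a deliberate simplification, not real Ray autoscaler code: the actual implementation reads detailed resource-demand metrics from Ray and handles many resource types, but the core idea of translating pending demand into a bounded replica count looks roughly like this:

```python
import math

# Toy version of an autoscaler sizing decision (not real Ray code).
# Given outstanding CPU demand reported by the application and the CPUs
# each worker provides, pick a replica count clamped to the worker
# group's min/max bounds.

def desired_replicas(pending_cpus, cpus_per_worker, min_replicas, max_replicas):
    needed = math.ceil(pending_cpus / cpus_per_worker)
    return max(min_replicas, min(needed, max_replicas))

# 9 CPUs of pending work, 4-CPU workers, bounds [1, 5] -> 3 replicas.
print(desired_replicas(9, 4, 1, 5))
# → 3
```

In the KubeRay implementation, a result like this would be written back by patching the replica count in the RayCluster CR, after which the operator's reconcile loop creates or deletes worker pods to match.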
  9. Protocol Buffers and gRPC Services
     • SDK: programmatic integration
     • CLI: individual users
     • Portal: platform users
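As an illustration of what such a gRPC surface might look like, here is a hypothetical, trimmed sketch of a cluster-management service; the service name, methods, and message shapes are assumptions for illustration, and the actual definitions live in the kuberay repository's proto files:

```proto
syntax = "proto3";

// Hypothetical sketch of a cluster-management service (not the real
// KubeRay definitions). SDK, CLI, and Portal clients would all speak
// to the same backend through RPCs like these.
service ClusterService {
  rpc CreateCluster(CreateClusterRequest) returns (Cluster);
  rpc GetCluster(GetClusterRequest) returns (Cluster);
  rpc ListClusters(ListClustersRequest) returns (ListClustersResponse);
  rpc DeleteCluster(DeleteClusterRequest) returns (DeleteClusterResponse);
}

message Cluster {
  string name = 1;
  string namespace = 2;
}

message CreateClusterRequest { Cluster cluster = 1; }
message GetClusterRequest { string name = 1; string namespace = 2; }
message ListClustersRequest { string namespace = 1; }
message ListClustersResponse { repeated Cluster clusters = 1; }
message DeleteClusterRequest { string name = 1; string namespace = 2; }
message DeleteClusterResponse {}
```

Defining the API once in protobuf lets the SDK, CLI, and Portal share a single generated client surface instead of three hand-written ones.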
  10. KubeRay Future Work
      v0.3 release: https://github.com/ray-project/kuberay/milestone/3 (target 2022/06)
      • Autoscaler improvements and exposure of Ray autoscaler configuration options #246
      • Job management and better support for job-based cluster use cases #106
      • Serve CRD and operator support for better Serve deployment management #214
      • Data-scientist-centric notebook workspace support #103
      • Observability improvements for the KubeRay control plane and Ray clusters #267
      • Reliable integration with Ray 2.0.0 dev #224
  11. Questions?
      Ways to contribute to the KubeRay project:
      • Slack: #kuberay
      • Community sync: https://meet.google.com/uvu-cvui-paq, Tuesday 6 PM, bi-weekly
        ◦ Next meeting: Tuesday, May 31
      • Visit the KubeRay GitHub page
      • Check out KubeRay issues