Slide 1

Introduction to KubeRay
Dmitri Gekhtman & Jiaxin Shan

Slide 2

Speakers
● Jiaxin Shan, Software Engineer, ByteDance, Inc.
● Dmitri Gekhtman, Software Engineer, Anyscale

Slide 3

What’s KubeRay?
● KubeRay is an open source toolkit for running Ray applications on Kubernetes.
● https://github.com/ray-project/kuberay/

Who’s involved?
● KubeRay is a community-driven project.
  ○ Major contributors and adopters include:
    ■ Microsoft
    ■ ByteDance
    ■ Ant Financial
  ○ Increasing involvement from primary Ray maintainers and Anyscale.
● Relatively new project, gaining traction:
  ○ 100 GitHub stars
  ○ 40 forks
  ○ PRs merged from 30 contributors
  ○ 2000 unique git cloners

Slide 4

KubeRay is Endorsed by the Ray Maintainers 👍
● KubeRay emerged as a community-driven project.
● Still an OSS community collaboration.
  ○ Primary Ray maintainers (esp. Anyscale) are getting more involved!
● Future vision: the KubeRay operator is the solution for Ray deployments on Kubernetes.
● Current work in progress:
  ○ Key feature integrations (esp. autoscaling)
  ○ Better testing for rock-solid stability!
  ○ Docs for a great user experience!
● Goal: integration of the KubeRay operator into the core Ray ecosystem
  ○ Beta in Ray 2.0.0

Slide 5

KubeRay Features
KubeRay provides several tools to improve the experience of running and managing Ray on Kubernetes.
● Ray Operator (Kubernetes-native management of Ray clusters)
● Backend services to create/delete cluster resources
● kubectl plugin/CLI to operate CRD objects
● Data-scientist-centric workspace for fast prototyping (incubating)
● Native job and serving integration with Ray clusters (incubating)
● Ray node problem detector and termination handler (future work)
● Kubernetes event dumper for Ray clusters and underlying resources (future work)

Slide 6

Kubernetes operator pattern
The operator pattern lets you extend a cluster’s behaviour without modifying the code of Kubernetes itself, by linking controllers to one or more custom resources. Operators are clients of the Kubernetes API that act as controllers for a custom resource.
https://iximiuz.com/en/posts/kubernetes-operator-pattern/
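The reconcile loop at the heart of the pattern can be sketched in a few lines. This is a toy illustration of the idea (compare desired state from the custom resource against observed state, then act to converge), not KubeRay's actual controller code; the function name and messages are invented for illustration.

```go
package main

import "fmt"

// reconcile compares the desired worker count declared in a custom resource
// with the pods actually observed in the cluster, and returns the action a
// controller would take to converge the two. A real operator runs this loop
// in response to watch events from the Kubernetes API server.
func reconcile(desiredReplicas, observedReplicas int) string {
	switch {
	case observedReplicas < desiredReplicas:
		return fmt.Sprintf("create %d pod(s)", desiredReplicas-observedReplicas)
	case observedReplicas > desiredReplicas:
		return fmt.Sprintf("delete %d pod(s)", observedReplicas-desiredReplicas)
	default:
		return "in sync; nothing to do"
	}
}

func main() {
	fmt.Println(reconcile(3, 1)) // scale up: create 2 pod(s)
	fmt.Println(reconcile(2, 2)) // already converged
}
```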

Slide 7

What is the Ray Operator?
The Ray Operator is a Kubernetes operator that automates the provisioning, management, autoscaling, and operation of Ray clusters deployed to Kubernetes.
Features:
● Management of first-class RayClusters via a custom resource
● Support for heterogeneous worker types in a single Ray cluster
● Built-in monitoring via Prometheus
● Automatic mounting of a volume at /dev/shm for shared memory
● Use of ScaleStrategy to remove specific nodes from specific groups

Slide 8

RayCluster Example
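The manifest shown on this slide was not reproduced in the export. The following is an abridged, illustrative sketch of a RayCluster manifest of this shape, using the `ray.io/v1alpha1` CRD field names (headGroupSpec, workerGroupSpecs); replica counts and group names are chosen to match the pod listing on the next slide, and image tags and resource values are placeholders.

```yaml
# Illustrative sketch, not the original slide's YAML.
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster-heterogeneous
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:latest   # placeholder tag
  workerGroupSpecs:
    # Heterogeneous worker types: one CPU group, one GPU group.
    - groupName: cpu-group
      replicas: 3
      minReplicas: 3
      maxReplicas: 5
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:latest
    - groupName: gpu-group
      replicas: 1
      minReplicas: 1
      maxReplicas: 1
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:latest-gpu
              resources:
                limits:
                  nvidia.com/gpu: 1
```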

Slide 9

RayCluster Example (continued)

$ kubectl apply -f raycluster_example.yaml

$ kubectl get pods
NAME                                              READY   STATUS    RESTARTS   AGE
raycluster-heterogeneous-head-5r6qr               1/1     Running   0          14m
raycluster-heterogeneous-worker-gpu-group-ljzzt   1/1     Running   0          14m
raycluster-heterogeneous-worker-cpu-group-76qxb   1/1     Running   0          14m
raycluster-heterogeneous-worker-cpu-group-gs7ty   1/1     Running   0          14m
raycluster-heterogeneous-worker-cpu-group-dcl4d   1/1     Running   0          14m

$ kubectl get services
NAME                              TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
raycluster-heterogeneous-my-svc   ClusterIP   None         <none>        80/TCP    15m

Slide 10

Ray Operator under the hood

Slide 11

Ray Autoscaler - KubeRay implementation (Alpha)
WIP on the new KubeRay NodeProvider:
● ray-project/ray#21086
● ray-project/ray#22348

Slide 12

Ray Autoscaler - KubeRay implementation (Alpha)
Automatic scaling based on Ray application semantics!
● The autoscaler reads load metrics from Ray.
● The KuberayNodeProvider patches the RayCluster CR's scale to size your cluster.
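Concretely, the operator injects the autoscaler when it is enabled on the custom resource; worker groups then scale between their minReplicas and maxReplicas bounds as the autoscaler patches the CR. A minimal illustrative fragment, assuming the `enableInTreeAutoscaling` field of the KubeRay CRD (names and values here are a sketch, not a tested manifest):

```yaml
# Illustrative fragment: enabling the in-tree autoscaler on a RayCluster.
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster-autoscaling
spec:
  enableInTreeAutoscaling: true
  workerGroupSpecs:
    - groupName: cpu-group
      replicas: 1
      minReplicas: 1     # lower bound for the autoscaler
      maxReplicas: 10    # upper bound for the autoscaler
```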

Slide 13

Protocol Buffers and gRPC services
● SDK: programmatic integration
● CLI: individual users
● Portal: platform users

Slide 14

Ongoing Community Collaboration

Slide 15

KubeRay Future Work
v0.3 release: https://github.com/ray-project/kuberay/milestone/3 (target 2022/06)
● Autoscaler improvements; expose Ray autoscaler configuration options (#246)
● Job management and better support for job-based cluster use cases (#106)
● Serve CRD and operator support for better Serve deployment management (#214)
● Data-scientist-centric notebook workspace support (#103)
● Observability improvements for the KubeRay control plane and Ray clusters (#267)
● Reliable integration with Ray 2.0.0 dev (#224)

Slide 16

Questions?
Ways to contribute to the KubeRay project:
● Slack: #kuberay
● Community meeting: https://meet.google.com/uvu-cvui-paq, Tuesdays 6 PM, bi-weekly
  ○ Next meeting: Tuesday, May 31
● Visit the KubeRay GitHub page
● Check out KubeRay issues