Introduction to KubeRay - Dmitri Gekhtman, Anyscale & Jiaxin Shan, ByteDance

In this introductory session, we will introduce KubeRay, a Ray cluster management tool built on top of Kubernetes. We will talk about the motivation behind KubeRay, the difference between KubeRay and the ray-operator in Ray core, recent v0.2.0 features, and future updates.

Dmitri Gekhtman is a software engineer on the Infrastructure team at Anyscale. His areas of focus include autoscaling and the integration of Ray into Kubernetes environments.

Jiaxin Shan is a software engineer focusing on serverless infrastructure and cloud-native adoption at Bytedance.

Anyscale

June 09, 2022

Transcript

  1. Introduction to KubeRay
    Dmitri Gekhtman & Jiaxin Shan

  2. Speakers
    Jiaxin Shan
    Software Engineer,
    Bytedance, Inc
    Dmitri Gekhtman
    Software Engineer,
    Anyscale

  3. What’s KubeRay?
    ● KubeRay is an open source toolkit to run Ray applications on Kubernetes.
    ● https://github.com/ray-project/kuberay/
    Who’s involved?
    ● KubeRay is a community-driven project
    ○ Major contributors and adopters include
    ■ Microsoft
    ■ ByteDance
    ■ Ant Financial
    ○ Increasing involvement from primary Ray maintainers and Anyscale
    ● Relatively new project, gaining traction
    ○ 100 GitHub stars
    ○ 40 forks
    ○ PRs merged from 30 contributors
    ○ 2000 unique git cloners

  4. KubeRay is Endorsed by the Ray Maintainers
    👍
    ● KubeRay emerged as a community-driven project
    ● Still an OSS community collaboration.
    ○ Primary Ray maintainers (esp. Anyscale) are getting more involved!
    ● Future vision: The KubeRay operator is the solution for Ray deployments on Kubernetes.
    ● Current work-in-progress:
    ○ Key feature integrations (esp. autoscaling).
    ○ Better testing for rock-solid stability!
    ○ Docs for great user experience!
    ● Goal: Integration of the KubeRay operator into core Ray ecosystem
    ○ Beta in Ray 2.0.0

  5. KubeRay Features
KubeRay provides several tools to improve the experience of running and managing Ray on Kubernetes.
    ● Ray Operator (K8s-native management of Ray clusters)
    ● Backend services to create/delete cluster resources
    ● Kubectl plugin/CLI to operate CRD objects
    ● Data Scientist centric workspace for fast prototyping (incubating)
    ● Native Job and Serving integration with Ray Clusters (incubating)
    ● Ray node problem detector and termination handler (future work)
    ● Kubernetes event dumper for ray clusters and underlying resources (future work)

  6. Kubernetes operator pattern
The Kubernetes operator pattern lets you extend a cluster's behaviour without modifying the
    code of Kubernetes itself, by linking controllers to one or more custom resources. Operators are
    clients of the Kubernetes API that act as controllers for a custom resource.
    https://iximiuz.com/en/posts/kubernetes-operator-pattern/
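At its core, the pattern is a reconcile loop: observe the desired state declared in a custom resource, compare it with the actual state, and act to close the gap. A minimal, illustrative Python sketch of that loop (real operators, including KubeRay's, are typically written in Go with controller-runtime; all names here are hypothetical):

```python
# Illustrative sketch of the operator pattern's reconcile loop.
# A real operator watches the Kubernetes API; here, desired/actual state
# are plain dicts and "acting" just records the operations needed.

def reconcile(desired: dict, actual: dict) -> list:
    """Return the operations needed to move `actual` toward `desired`.

    `desired` maps worker-group name -> desired replica count (from the CR spec);
    `actual` maps worker-group name -> currently running pods.
    """
    ops = []
    for group, want in desired.items():
        have = actual.get(group, 0)
        if have < want:
            ops.append(f"create {want - have} pod(s) in {group}")
        elif have > want:
            ops.append(f"delete {have - want} pod(s) in {group}")
    for group in actual:
        if group not in desired:
            ops.append(f"delete all pods in {group}")  # group removed from the spec
    return ops

# One cpu-group worker missing, gpu-group not yet created:
print(reconcile({"cpu-group": 3, "gpu-group": 1}, {"cpu-group": 2}))
# -> ['create 1 pod(s) in cpu-group', 'create 1 pod(s) in gpu-group']
```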

  7. What is Ray Operator?
    The Ray Operator is a Kubernetes operator to automate provisioning, management, autoscaling and
    operations of Ray clusters deployed to Kubernetes.
    Features:
    ● Management of first-class RayClusters via a custom resource.
    ● Support for heterogeneous worker types in a single Ray cluster.
    ● Built-in monitoring via Prometheus.
● Automatic mounting of a volume at /dev/shm for shared memory.
    ● Use of ScaleStrategy to remove specific workers from specific groups.
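For example, a worker group in the RayCluster custom resource can name the exact pods to remove on scale-down via scaleStrategy. A sketch (the pod name here is illustrative):

```yaml
workerGroupSpecs:
  - groupName: cpu-group
    replicas: 2          # scale down from 3 to 2 ...
    scaleStrategy:
      workersToDelete:   # ... and remove precisely this worker pod
        - raycluster-heterogeneous-worker-cpu-group-dcl4d
```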

  8. RayCluster Example
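The manifest on this slide is an image; a minimal manifest consistent with the output on the next slide might look like the following sketch (field names follow the KubeRay v1alpha1 RayCluster CRD; the image tag and per-group settings are illustrative):

```yaml
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster-heterogeneous
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:1.12.1   # illustrative image tag
  workerGroupSpecs:
    - groupName: cpu-group
      replicas: 3
      minReplicas: 3
      maxReplicas: 3
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:1.12.1
    - groupName: gpu-group
      replicas: 1
      minReplicas: 1
      maxReplicas: 1
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:1.12.1  # a GPU image and GPU resource limits would go here
```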

  9. RayCluster Example (continued)
    $ kubectl apply -f raycluster_example.yaml
    $ kubectl get pods
NAME                                              READY   STATUS    RESTARTS   AGE
    raycluster-heterogeneous-head-5r6qr               1/1     Running   0          14m
    raycluster-heterogeneous-worker-gpu-group-ljzzt   1/1     Running   0          14m
    raycluster-heterogeneous-worker-cpu-group-76qxb   1/1     Running   0          14m
    raycluster-heterogeneous-worker-cpu-group-gs7ty   1/1     Running   0          14m
    raycluster-heterogeneous-worker-cpu-group-dcl4d   1/1     Running   0          14m
    $ kubectl get services
    NAME                              TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
    raycluster-heterogeneous-my-svc   ClusterIP   None         <none>        80/TCP    15m

  10. Ray Operator under the hood

  11. Ray Autoscaler - KubeRay implementation (Alpha)
Work in progress on the new KubeRay NodeProvider:
    ray-project/ray#21086
    ray-project/ray#22348

  12. Ray Autoscaler - KubeRay implementation (Alpha)
Automatic scaling based on Ray application semantics!
    ● The autoscaler reads load metrics from Ray.
    ● KuberayNodeProvider patches the RayCluster CR's replica counts to size your cluster.
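In simplified terms, the cluster is sized from Ray's own resource demand (pending tasks and actors) rather than from pod-level CPU metrics. A toy Python sketch of that sizing logic (illustrative only; the real autoscaler lives in Ray core and handles many resource types and timing concerns):

```python
import math

def workers_needed(pending_cpu_demand: float, cpus_per_worker: int,
                   min_workers: int, max_workers: int) -> int:
    """Toy version of demand-based sizing: enough workers to cover the
    CPUs requested by pending Ray tasks/actors, clamped to the group's bounds."""
    needed = math.ceil(pending_cpu_demand / cpus_per_worker)
    return max(min_workers, min(max_workers, needed))

# 10 CPUs of pending demand, 4 CPUs per worker -> 3 workers
print(workers_needed(10, 4, min_workers=1, max_workers=8))  # -> 3
```

The resulting replica count is what the KuberayNodeProvider would patch into the RayCluster custom resource.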

  13. Protocol Buffers and GRPC services
● SDK: programmatic integration
    ● CLI: individual users
    ● Portal: platform users

  14. Ongoing Community Collaboration

  15. KubeRay Future Work
    v0.3 release https://github.com/ray-project/kuberay/milestone/3 (Target 2022/06)
    ● Autoscaler improvements and expose Ray autoscaler configuration options #246
    ● Job management and better job based cluster use case support #106
    ● Serve CRD and operator support for better serve deployment management #214
    ● Data scientist centric Notebook workspace support #103
● Observability improvements for the KubeRay control plane and Ray clusters #267
    ● Reliable integration with Ray 2.0.0 dev #224

  16. Questions?
Ways to contribute to the KubeRay project:
    ● Slack: #kuberay
    ● Community: https://meet.google.com/uvu-cvui-paq Tuesday 6PM Bi-Weekly
    ○ Next meeting: Tuesday, May 31
    ● Visit the KubeRay GitHub page
    ● Check out KubeRay issues
