
When should we use Kubernetes for the Machine Learning platform?

Asei Sugiyama
December 19, 2022

This deck examines, based on publicly known case studies in Japan, the organizational capabilities required to adopt Kubernetes as a machine learning platform. It was prepared for an internal MLOps study session at Money Forward.

Note that although this deck refers to the initiatives of various organizations, the descriptions are the author's personal views. They are based on materials published as of this writing and do not represent the official positions of those organizations.



Transcript

  1. When should we use Kubernetes for
    the Machine Learning platform?
    Asei Sugiyama

  2. TOC
    Why Kubernetes for ML? <-
    Decisions to use Kubernetes and Kubeflow or not
    Requirements to use Kubernetes as an ML platform
    Summary

  3. Why Kubernetes for ML?
    Use cases of ML platform to consider
    Training/Serving Skew
    Container
    Kubernetes
    Kubeflow
    Let's use Kubernetes / Kubeflow for ML platform!
    Managed ML platform

  4. Use cases of ML platform to consider
    Data analytics & model development
    Model training
    Inference

  5. Data analytics & model development
    Workload
    Ad-hoc analysis
    Service-independent model development
    Building a data analytics platform for a major entertainment company with Google Cloud
    https://youtu.be/BTYO0-avsXI
    Beyond Interactive: Notebook Innovation at Netflix
    https://netflixtechblog.com/notebook-innovation-591ee3221233

  6. Requirements
    Easy and safe access to large datasets
    Visualization without code
    Not required
    Version control
    High availability
    Building a data analytics platform for a major entertainment company with Google Cloud
    https://youtu.be/BTYO0-avsXI
    Beyond Interactive: Notebook Innovation at Netflix
    https://netflixtechblog.com/notebook-innovation-591ee3221233

  7. Model training
    Workload
    Batch processing (training pipeline)
    MLOps: Continuous delivery and automation pipelines in machine learning
    https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

  8. Model training
    Requirements
    Massive amounts of compute resources (CPUs, memory, accelerators)
    Massive amounts of storage access (IOPS, network bandwidth)
    Visualization
    Version control (code, data, model, and lineage between them)
    Not required
    High availability

  9. Inference
    Workload
    Web API
    Batch processing
    Accelerated Computing on AWS for NLP
    https://speakerdeck.com/icoxfog417/accelerated-computing-on-aws-for-nlp

  10. Inference
    Requirements (Web API)
    Low latency
    High availability
    Scalability
    Version control (code, data, model, and lineage between them)
    Not required (Web API)
    Massive amounts of storage access per request (ideally avoided)

  11. Training/Serving Skew
    The same code runs in three different use cases.
    Moreover, we have to consider dev/staging/prod.
    Otherwise we get training/serving skew, caused by
    differences between environments.
    Why We Need DevOps for ML Data https://www.tecton.ai/blog/devops-ml-data/
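A common mitigation is to put the feature transformation in one function that both the training pipeline and the serving path import, so preprocessing cannot drift between environments. The sketch below is illustrative only; the function and feature names are hypothetical, not from the deck.

```python
# Hypothetical shared transform: the single source of truth for features.
def transform(record: dict) -> list[float]:
    """Map a raw record to a feature vector; used by BOTH training and serving."""
    return [
        float(record["price"]) / 100.0,           # normalize cents to dollars
        1.0 if record.get("is_member") else 0.0,  # categorical -> indicator
    ]

# Training side: applied row by row to the training set.
train_rows = [{"price": 1999, "is_member": True}]
X_train = [transform(r) for r in train_rows]

# Serving side: the SAME function is applied to the live request.
request = {"price": 1999, "is_member": True}
x_serve = transform(request)

# Identical inputs yield identical features -> no skew from divergent code.
assert X_train[0] == x_serve
```

Packaging this module into the same container image used in dev, staging, and prod extends the guarantee from the code to the whole runtime environment.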

  12. Container
    What we must keep consistent across these environments:
    Code
    Libraries
    Drivers (CUDA, etc.)
    OS
    Containers (and machine images, in the past) are the
    de facto standard format for this purpose.
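As a hedged sketch of this idea, a single image can pin all four layers at once. The base image, versions, and paths below are placeholders, not from the deck:

```dockerfile
# Illustrative only: image tag, versions, and paths are placeholders.
# OS + CUDA toolkit layer, pinned by the base image tag:
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04

# Library layer: versions pinned in requirements.txt
RUN apt-get update && apt-get install -y python3-pip
COPY requirements.txt /app/requirements.txt
RUN pip3 install -r /app/requirements.txt

# Code layer
COPY src/ /app/src/
WORKDIR /app
CMD ["python3", "src/train.py"]
```

Because the image digest identifies the whole stack, the same artifact can move unchanged from a notebook host to the training cluster to the serving fleet.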

  13. Kubernetes
    "Kubernetes is an open-source system for automating
    deployment, scaling, and management of containerized
    applications."
    We can deploy web services and batch jobs on Kubernetes.
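For the batch side, a minimal Job manifest might look like the following (the names, image, and GPU request are illustrative placeholders, not from the deck):

```yaml
# Illustrative Kubernetes Job: run one training container to completion.
apiVersion: batch/v1
kind: Job
metadata:
  name: model-training          # placeholder name
spec:
  backoffLimit: 2               # retry a failed pod up to twice
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: registry.example.com/ml/train:1.0   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1   # one GPU for training
```

The same image could instead back a Deployment plus Service for the web-API inference case, which is what makes Kubernetes attractive as a single substrate for both workloads.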

  14. Kubeflow
    "The Kubeflow project is dedicated to making deployments
    of machine learning (ML) workflows on Kubernetes simple,
    portable and scalable."
    Initially, it was an open-source implementation of
    Google's internal ML platform (TFX).
    Now, Kubeflow has no restrictions on libraries or cloud
    services.

  15. Let's use Kubernetes / Kubeflow for the ML platform!
    Be careful when using Kubernetes or Kubeflow as an ML platform.
    Both Kubernetes and Kubeflow require a huge amount of effort.
    Several companies tried Kubeflow and then decided to use a
    managed ML platform.

  16. Managed ML platform
    Vertex AI: Build, deploy, and scale machine learning (ML)
    models faster, with fully managed ML tools for any use case.
    SageMaker: Build, train, and deploy machine learning (ML)
    models for any use case with fully managed infrastructure,
    tools, and workflows.
    Both services originate from their vendors' internal ML
    platforms (Google & Amazon).

  17. TOC
    Why Kubernetes for ML?
    Decisions to use Kubernetes and Kubeflow or not <-
    Requirements to use Kubernetes as an ML platform
    Summary

  18. Decisions to use Kubernetes and Kubeflow or not
    Ride managed ML platform (Vertex AI)
    Tried Kubeflow but left
    Experts of ML on Kubernetes
    Container platform hopper (Challenger)
    Hybrid: Vertex & Kubernetes

  19. Ride managed ML platform (Vertex AI)
    CADDi
    Small team & fast delivery
    Managed MLOps at CADDi AI Lab
    Image search with OpenSearch and stable operation through added tests
    Managed MLOps at CADDi AI Lab
    https://speakerdeck.com/vaaaaanquish/caddi-ai-labniokerumanezidonamlops

  20. Ride managed ML platform (Vertex AI)
    CAM (CyberAgent Group)
    Small team
    Initiatives on an MLOps platform built with Vertex AI
    https://speakerdeck.com/cyberagentdevelopers/vertexaidegou-zhu-sitamlopsji-pan-falsequ-rizu-mi

  21. Tried Kubeflow but left
    Repro
    Kubeflow is too painful to use
    Cannot update Kubeflow in place (delete & recreate)
    Fine-grained logging costs too much (with Prometheus)
    Too costly to keep monitoring Kubeflow & Kubernetes
    Uses Vertex AI to avoid managing Kubernetes & Kubeflow

  22. Tried Kubeflow but left
    mercari
    Building an internal ML platform is too expensive
    Hard to maintain the code base after a key engineer left
    the company
    Decided to use Kubeflow, then moved to Vertex AI

  23. Tried Kubeflow but left
    ZOZO
    Hosting multi-tenant Kubeflow is too expensive
    Tons of YAML files and customizations
    Hard to scale within the team
    Uses Vertex AI to avoid hosting Kubeflow themselves
    Lessons and challenges from building an MLOps platform with Kubeflow
    https://techblog.zozo.com/entry/mlops-platform-kubeflow

  24. Experts of ML on Kubernetes
    LINE
    For historical and security reasons, they run large
    on-premises clusters
    Excellence in managing bare-metal servers and Kubernetes
    Lupus - A Monitoring System for Accelerating MLOps
    https://speakerdeck.com/line_devday2021/lupus-a-monitoring-system-for-accelerating-mlops

  25. Experts of ML on Kubernetes
    Yahoo! Japan
    For historical reasons, they run large on-premises clusters
    Excellence in managing bare-metal servers and Kubernetes
    Huge investment in Kubernetes
    A Kubernetes Operator for continuous model monitoring
    https://www.slideshare.net/techblogyahoo/kubernetes-operator-251612755

  26. Experts of ML on Kubernetes
    PFN
    Power users of machine learning (ML researchers)
    They need bare-metal servers to:
    1. use GPUs and CPUs as much as possible
    2. build their own chips (accelerators) and test them on
    their servers
    A Kubernetes Operator for continuous model monitoring
    https://www.slideshare.net/techblogyahoo/kubernetes-operator-251612755

  27. Rakuten
    For historical reasons, they run large on-premises clusters
    Excellence in managing bare-metal servers and Kubernetes
    Machine learning platform with Kubernetes: use cases at Rakuten, 覃子麟 (チンツーリン) / Rakuten, Inc.
    https://www.slideshare.net/rakutentech/kubernetes-144707493?from_action=save
    Rakuten's scale and the role of the Cloud Platform Division
    https://www.slideshare.net/rakutentech/ss-253221883

  28. Container platform hopper (challenger)
    ABEJA
    Docker Swarm -> Rancher -> Kubernetes (EKS)
    Excellence in Kubernetes
    Publishing ABEJA's tech stack (November 2019 edition)
    https://tech-blog.abeja.asia/entry/tech-stack-201911
    Publishing the tech stack of ABEJA Insight for Retail (October 2021 edition)
    https://tech-blog.abeja.asia/entry/retail-tech-stack-202110

  29. Hybrid: Vertex & Kubernetes
    DeNA
    Moving from serverless services to Vertex Pipelines
    (training) & Kubernetes (inference)
    What do DeNA's MLOps engineers do? (DeNA TechCon 2021 Winter)
    https://speakerdeck.com/dena_tech/techcon2021-winter-5

  30. TOC
    Why Kubernetes for ML?
    Decisions to use Kubernetes and Kubeflow or not
    Requirements to use Kubernetes as an ML platform <-
    Summary

  31. Requirements to use Kubernetes as an ML platform
    Using Kubernetes as a platform everywhere in the organization
    Capability to customize Kubernetes & Kubeflow
    A strong heart to bear the pain caused by breaking changes

  32. Discussion
    We (Asei & Yusuke Shibui) reached the same conclusion

  33. Summary
    Containers are a good practice for ML to avoid training/serving skew
    Be careful when using Kubernetes & Kubeflow as an ML platform
    The minimum requirement to use Kubernetes as an ML platform is the
    capability to customize Kubernetes to fit your use cases
    Consider a hybrid approach: a managed service for training & an
    inference service on Kubernetes
