
When should we use Kubernetes for the Machine Learning platform?

Asei Sugiyama
December 19, 2022

This deck examines, based on publicly known case studies in Japan, the organizational capabilities required to adopt Kubernetes as a machine learning platform. It was prepared for an internal MLOps study session at Money Forward.

Note that although this deck refers to the initiatives of various organizations, the descriptions are the author's personal views. They are based on materials published as of this writing and do not represent the official positions of those organizations.



Transcript

  1. When should we use Kubernetes for
    the Machine Learning platform?
    Asei Sugiyama

  2. TOC
    Why Kubernetes for ML? <-
    Decisions to use Kubernetes and Kubeflow or not
    Requirements to use Kubernetes as an ML platform
    Summary

  3. Why Kubernetes for ML?
    Use cases of ML platform to consider
    Training/Serving Skew
    Container
    Kubernetes
    Kubeflow
    Let's use Kubernetes / Kubeflow for ML platform!
    Managed ML platform

  4. Use cases of ML platform to consider
    Data analytics & model development
    Model training
    Inference

  5. Data analytics & model development
    Workload
    Ad-hoc analysis
    Service-independent model development
    Building a data analytics platform for a major entertainment company with Google Cloud
    https://youtu.be/BTYO0-avsXI
    Beyond Interactive: Notebook Innovation at Netflix
    https://netflixtechblog.com/notebook-innovation-591ee3221233

  6. Requirements
    Easy and safe access to large datasets
    Visualization without code
    Not required
    Version control
    High availability
    Building a data analytics platform for a major entertainment company with Google Cloud
    https://youtu.be/BTYO0-avsXI
    Beyond Interactive: Notebook Innovation at Netflix
    https://netflixtechblog.com/notebook-innovation-591ee3221233

  7. Model training
    Workload
    Batch processing (training pipeline)
    MLOps: Continuous delivery and automation pipelines in machine learning
    https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

  8. Model training
    Requirements
    Massive amounts of compute resources (CPUs, memory, accelerators)
    Massive amounts of storage access (IOPS, network bandwidth)
    Visualization
    Version control (code, data, model, and lineage between them)
    Not required
    High availability

  9. Inference
    Workload
    Web API
    Batch processing
    Accelerated Computing on AWS for NLP
    https://speakerdeck.com/icoxfog417/accelerated-computing-on-aws-for-nlp

  10. Inference
    Requirements (Web API)
    Low latency
    High availability
    Scalability
    Version control (code, data, model, and lineage between them)
    Not required (Web API)
    Massive amounts of storage access per request (ideally avoided)

  11. Training/Serving Skew
    The same code runs in three different use cases.
    Moreover, we have to consider dev/staging/prod.
    Otherwise we get training/serving skew, caused by
    differences between environments.
    Why We Need DevOps for ML Data https://www.tecton.ai/blog/devops-ml-data/
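A common mitigation is to put the feature transformation in one function that both the training pipeline and the serving path import, so preprocessing cannot drift between environments. The sketch below is illustrative only; the function and feature names are hypothetical, not from the deck.

```python
# Hypothetical shared transform: the single source of truth for features.
def transform(record: dict) -> list[float]:
    """Map a raw record to a feature vector; used by BOTH training and serving."""
    return [
        float(record["price"]) / 100.0,           # normalize cents to dollars
        1.0 if record.get("is_member") else 0.0,  # categorical -> indicator
    ]

# Training side: applied row by row to the training set.
train_rows = [{"price": 1999, "is_member": True}]
X_train = [transform(r) for r in train_rows]

# Serving side: the SAME function is applied to the live request.
request = {"price": 1999, "is_member": True}
x_serve = transform(request)

# Identical inputs yield identical features -> no skew from divergent code.
assert X_train[0] == x_serve
```

Packaging this module into the same container image used in dev, staging, and prod extends the guarantee from the code to the whole runtime environment.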

  12. Container
    What we must keep consistent across these environments:
    Code
    Libraries
    Drivers (CUDA, etc.)
    OS
    Containers (and machine images, in the past) are the
    de facto standard format for this purpose.
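As a hedged sketch of this idea, a single image can pin all four layers at once. The base image, versions, and paths below are placeholders, not from the deck:

```dockerfile
# Illustrative only: image tag, versions, and paths are placeholders.
# OS + CUDA toolkit layer, pinned by the base image tag:
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04

# Library layer: versions pinned in requirements.txt
RUN apt-get update && apt-get install -y python3-pip
COPY requirements.txt /app/requirements.txt
RUN pip3 install -r /app/requirements.txt

# Code layer
COPY src/ /app/src/
WORKDIR /app
CMD ["python3", "src/train.py"]
```

Because the image digest identifies the whole stack, the same artifact can move unchanged from a notebook host to the training cluster to the serving fleet.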

  13. Kubernetes
    "Kubernetes is an open-source system for automating
    deployment, scaling, and management of containerized
    applications."
    We can deploy web services and batch jobs on Kubernetes.
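For the batch side, a minimal Job manifest might look like the following (the names, image, and GPU request are illustrative placeholders, not from the deck):

```yaml
# Illustrative Kubernetes Job: run one training container to completion.
apiVersion: batch/v1
kind: Job
metadata:
  name: model-training          # placeholder name
spec:
  backoffLimit: 2               # retry a failed pod up to twice
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: registry.example.com/ml/train:1.0   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1   # one GPU for training
```

The same image could instead back a Deployment plus Service for the web-API inference case, which is what makes Kubernetes attractive as a single substrate for both workloads.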

  14. Kubeflow
    "The Kubeflow project is dedicated to making deployments
    of machine learning (ML) workflows on Kubernetes simple,
    portable and scalable."
    Initially, it was an open-source implementation of
    Google's internal ML platform (TFX).
    Now, Kubeflow has no restrictions on libraries or cloud
    services.

  15. Let's use Kubernetes / Kubeflow for the ML platform!
    Be careful when using Kubernetes or Kubeflow as an ML platform.
    Both Kubernetes and Kubeflow require a huge amount of effort.
    Several companies tried Kubeflow and then decided to use a
    managed ML platform.

  16. Managed ML platform
    Vertex AI: Build, deploy, and scale machine learning (ML)
    models faster, with fully managed ML tools for any use case.
    SageMaker: Build, train, and deploy machine learning (ML)
    models for any use case with fully managed infrastructure,
    tools, and workflows.
    Both services originate from their vendors' internal ML
    platforms (Google & Amazon).

  17. TOC
    Why Kubernetes for ML?
    Decisions to use Kubernetes and Kubeflow or not <-
    Requirements to use Kubernetes as an ML platform
    Summary

  18. Decisions to use Kubernetes and Kubeflow or not
    Ride managed ML platform (Vertex AI)
    Tried Kubeflow but left
    Experts of ML on Kubernetes
    Container platform hopper (Challenger)
    Hybrid: Vertex & Kubernetes

  19. Ride managed ML platform (Vertex AI)
    CADDi
    Small team & fast delivery
    Managed MLOps at CADDi AI Lab
    Image search with OpenSearch and stable operation through added tests
    Managed MLOps at CADDi AI Lab
    https://speakerdeck.com/vaaaaanquish/caddi-ai-labniokerumanezidonamlops

  20. Ride managed ML platform (Vertex AI)
    CAM (CyberAgent Group)
    Small team
    Initiatives on an MLOps platform built with Vertex AI
    https://speakerdeck.com/cyberagentdevelopers/vertexaidegou-zhu-sitamlopsji-pan-falsequ-rizu-mi

  21. Tried Kubeflow but left
    Repro
    Kubeflow is too painful to use
    Cannot update Kubeflow in place (delete & recreate)
    Fine-grained logging costs too much (with Prometheus)
    Too costly to keep monitoring Kubeflow & Kubernetes
    Uses Vertex AI to avoid managing Kubernetes & Kubeflow

  22. Tried Kubeflow but left
    mercari
    Building an internal ML platform is too expensive
    Hard to maintain the code base after a key engineer left
    the company
    Decided to use Kubeflow, then moved to Vertex AI

  23. Tried Kubeflow but left
    ZOZO
    Hosting multi-tenant Kubeflow is too expensive
    Tons of YAML files and customizations
    Hard to scale within the team
    Uses Vertex AI to avoid hosting Kubeflow themselves
    Lessons and challenges from building an MLOps platform with Kubeflow
    https://techblog.zozo.com/entry/mlops-platform-kubeflow

  24. Experts of ML on Kubernetes
    LINE
    For historical and security reasons, they run large
    on-premises clusters
    Excellence in managing bare-metal servers and Kubernetes
    Lupus - A Monitoring System for Accelerating MLOps
    https://speakerdeck.com/line_devday2021/lupus-a-monitoring-system-for-accelerating-mlops

  25. Experts of ML on Kubernetes
    Yahoo! Japan
    For historical reasons, they run large on-premises clusters
    Excellence in managing bare-metal servers and Kubernetes
    Huge investment in Kubernetes
    A Kubernetes Operator for continuous model monitoring
    https://www.slideshare.net/techblogyahoo/kubernetes-operator-251612755

  26. Experts of ML on Kubernetes
    PFN
    Power users of machine learning (ML researchers)
    They need bare-metal servers to:
    1. use GPUs and CPUs as much as possible
    2. build their own chips (accelerators) and test them on
    their servers
    A Kubernetes Operator for continuous model monitoring
    https://www.slideshare.net/techblogyahoo/kubernetes-operator-251612755

  27. Rakuten
    For historical reasons, they run large on-premises clusters
    Excellence in managing bare-metal servers and Kubernetes
    Machine learning platform with Kubernetes: use cases at Rakuten, 覃子麟 (チンツーリン) / Rakuten, Inc.
    https://www.slideshare.net/rakutentech/kubernetes-144707493?from_action=save
    Rakuten's scale and the role of the Cloud Platform Division
    https://www.slideshare.net/rakutentech/ss-253221883

  28. Container platform hopper (challenger)
    ABEJA
    Docker Swarm -> Rancher -> Kubernetes (EKS)
    Excellence in Kubernetes
    Publishing ABEJA's tech stack (November 2019 edition)
    https://tech-blog.abeja.asia/entry/tech-stack-201911
    Publishing the tech stack of ABEJA Insight for Retail (October 2021 edition)
    https://tech-blog.abeja.asia/entry/retail-tech-stack-202110

  29. Hybrid: Vertex & Kubernetes
    DeNA
    Moving from serverless services to Vertex Pipelines
    (training) & Kubernetes (inference)
    What do DeNA's MLOps engineers do? (DeNA TechCon 2021 Winter)
    https://speakerdeck.com/dena_tech/techcon2021-winter-5

  30. TOC
    Why Kubernetes for ML?
    Decisions to use Kubernetes and Kubeflow or not
    Requirements to use Kubernetes as an ML platform <-
    Summary

  31. Requirements to use Kubernetes as an ML platform
    Using Kubernetes as a platform everywhere in the organization
    Capability to customize Kubernetes & Kubeflow
    A strong heart to bear the pain caused by breaking changes

  32. Discussion
    We (Asei & Yusuke Shibui) reached the same conclusion

  33. Summary
    Containers are a good practice for ML to avoid training/serving skew
    Be careful when using Kubernetes & Kubeflow as an ML platform
    The minimum requirement to use Kubernetes as an ML platform is the
    capability to customize Kubernetes to fit your use cases
    Consider a hybrid approach: a managed service for training & an
    inference service on Kubernetes
