Slide 1

When should we use Kubernetes for the Machine Learning platform?
Asei Sugiyama

Slide 2

TOC
- Why Kubernetes for ML? <-
- Decisions to use Kubernetes and Kubeflow or not
- Requirements to use Kubernetes as an ML platform
- Summary

Slide 3

Why Kubernetes for ML?
- Use cases of ML platform to consider
- Training/Serving Skew
- Container
- Kubernetes
- Kubeflow
- Let's use Kubernetes / Kubeflow for ML platform!
- Managed ML platform

Slide 4

Use cases of ML platform to consider
- Data analytics & model development
- Model training
- Inference

Slide 5

Data analytics & model development
Workload
- Ad-hoc analysis
- Service-independent model development
References:
- Google Cloud を活用した大手エンタメ企業様のデータ分析基盤構築事例 https://youtu.be/BTYO0-avsXI
- Beyond Interactive: Notebook Innovation at Netflix https://netflixtechblog.com/notebook-innovation-591ee3221233

Slide 6

Requirements
- Easy and safe access to large datasets
- Visualization without code
Not required
- Version control
- High availability
References:
- Google Cloud を活用した大手エンタメ企業様のデータ分析基盤構築事例 https://youtu.be/BTYO0-avsXI
- Beyond Interactive: Notebook Innovation at Netflix https://netflixtechblog.com/notebook-innovation-591ee3221233

Slide 7

Model training
Workload
- Batch processing (training pipeline)
Reference:
- MLOps: Continuous delivery and automation pipelines in machine learning https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

Slide 8

Model training
Requirements
- Massive amount of compute resources (CPUs, memory, accelerators)
- Massive amount of storage access (IOPS, network bandwidth)
- Visualization
- Version control (code, data, model, and lineage between them)
Not required
- High availability
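
The version-control requirement above (code, data, model, and the lineage between them) can be sketched with nothing but standard-library hashing; every name here is illustrative, not any particular tool's API:

```python
import hashlib
import json


def digest(data: bytes) -> str:
    """Content-address an artifact: identical bytes always get the same ID."""
    return hashlib.sha256(data).hexdigest()[:12]


def lineage_record(code: bytes, dataset: bytes, model: bytes) -> dict:
    """Link the exact versions of code, data, and model into one record."""
    return {
        "code": digest(code),
        "data": digest(dataset),
        "model": digest(model),
    }


if __name__ == "__main__":
    # Storing this record alongside the trained model answers "which code
    # and which data produced these weights?" later on.
    rec = lineage_record(b"def train(): ...", b"row1,row2", b"weights-v1")
    print(json.dumps(rec))
```

Real platforms (ML Metadata, Vertex ML Metadata, etc.) do far more, but the core idea is the same: immutable identifiers for each artifact plus a record tying them together.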

Slide 9

Inference
Workload
- Web API
- Batch processing
Reference:
- Accelerated Computing on AWS for NLP https://speakerdeck.com/icoxfog417/accelerated-computing-on-aws-for-nlp

Slide 10

Inference
Requirements (Web API)
- Low latency
- High availability
- Scalability
- Version control (code, data, model, and lineage between them)
Not required (Web API)
- Massive amount of storage access for each request (hopefully)

Slide 11

Training/Serving Skew
The same code must behave identically across these three use cases. Moreover, we have to consider dev/staging/prod environments. Otherwise we get training/serving skew, caused by differences between environments.
Reference:
- Why We Need DevOps for ML Data https://www.tecton.ai/blog/devops-ml-data/
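
A common way to reduce skew at the code level is to share one preprocessing function between the training pipeline and the serving path instead of reimplementing it twice. A minimal sketch (all names and features are illustrative):

```python
def preprocess(record: dict) -> list:
    """Single source of truth for feature engineering."""
    # Clip and scale age the same way in every environment.
    age = min(record.get("age", 0), 100) / 100.0
    # Click-through rate, guarded against division by zero.
    ctr = record.get("clicks", 0) / max(record.get("impressions", 1), 1)
    return [age, ctr]


def training_batch(records: list) -> list:
    # Batch path: the training pipeline maps the shared function over a dataset.
    return [preprocess(r) for r in records]


def serving_request(record: dict) -> list:
    # Online path: the web API calls the exact same function per request.
    return preprocess(record)


if __name__ == "__main__":
    row = {"age": 30, "clicks": 5, "impressions": 50}
    # Both paths produce identical features for identical input.
    assert training_batch([row])[0] == serving_request(row)
```

Packaging this shared function into one container image (next slides) extends the same guarantee from the code to the libraries, drivers, and OS underneath it.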

Slide 12

Container
What we should manage consistently across these environments:
- Code
- Libraries
- Drivers (CUDA, etc.)
- OS
Containers (and machine images, in the past) are the de facto standard format for this purpose.
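
The four layers listed above map directly onto container image layers. A minimal, illustrative Dockerfile sketch; the base image tag and package choices are assumptions, not a recommendation:

```dockerfile
# OS and CUDA userland come from the base image (versions illustrative).
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y --no-install-recommends python3-pip

# Libraries: pinned in requirements.txt so every environment
# (dev/staging/prod, training and serving) resolves the same versions.
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

# Code goes in last, so routine code changes rebuild only the top layer.
COPY src/ /app/src/
WORKDIR /app
```

Because the image is built once and run everywhere, the OS, drivers, libraries, and code can no longer drift apart between environments.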

Slide 13

Kubernetes
"Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications."
We can deploy web services and batch jobs on Kubernetes.
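
The two ML workload shapes map onto built-in Kubernetes objects: a Deployment for the always-on inference API and a Job for a batch training run. A hedged sketch with illustrative names and images:

```yaml
# Web service (inference API) as a Deployment: kept running, scaled out.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-api            # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels: {app: inference-api}
  template:
    metadata:
      labels: {app: inference-api}
    spec:
      containers:
      - name: server
        image: registry.example.com/inference:v1   # illustrative image
---
# Batch execution (e.g. one training run) as a Job: runs to completion.
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model              # illustrative name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: registry.example.com/trainer:v1     # illustrative image
```

Both manifests run containers built from the same image pipeline, which is what makes Kubernetes attractive for avoiding training/serving skew.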

Slide 14

Kubeflow
"The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable."
Initially, it was an open-source implementation of Google's internal ML platform (TFX). Now Kubeflow has no restrictions on libraries or cloud services.

Slide 15

Let's use Kubernetes / Kubeflow for ML platform!
Be careful about using Kubernetes or Kubeflow as an ML platform. Both require a huge amount of effort. Several companies tried Kubeflow and then decided to use a managed ML platform instead.

Slide 16

Managed ML platform
- Vertex AI: Build, deploy, and scale machine learning (ML) models faster, with fully managed ML tools for any use case.
- SageMaker: Build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.
Both services originated from their vendors' internal ML platforms (Google & Amazon).

Slide 17

TOC
- Why Kubernetes for ML?
- Decisions to use Kubernetes and Kubeflow or not <-
- Requirements to use Kubernetes as an ML platform
- Summary

Slide 18

Decisions to use Kubernetes and Kubeflow or not
- Ride managed ML platform (Vertex AI)
- Tried Kubeflow but left
- Experts of ML on Kubernetes
- Container platform hopper (Challenger)
- Hybrid: Vertex & Kubernetes

Slide 19

Ride managed ML platform (Vertex AI)
CADDi
- Small team & fast delivery
- CADDi AI LabにおけるマネージドなMLOps
- OpenSearchで実現する画像検索とテスト追加で目指す安定運用
Reference:
- CADDi AI LabにおけるマネージドなMLOps https://speakerdeck.com/vaaaaanquish/caddi-ai-labniokerumanezidonamlops

Slide 20

Ride managed ML platform (Vertex AI)
CAM (CyberAgent Group)
- Small team
Reference:
- VertexAIで構築したMLOps基盤の取り組み https://speakerdeck.com/cyberagentdevelopers/vertexaidegou-zhu-sitamlopsji-pan-falsequ-rizu-mi

Slide 21

Tried Kubeflow but left
Repro
- Kubeflow is too painful to use
- Cannot update Kubeflow in place (delete & create)
- Fine-grained logging costs too much (with Prometheus)
- Too expensive to keep watching Kubeflow & Kubernetes
- Uses Vertex AI to avoid managing Kubernetes & Kubeflow

Slide 22

Tried Kubeflow but left
mercari
- Building an internal ML platform is too expensive
- Hard to maintain the code base after a key engineer left the company
- Decided to use Kubeflow at first, then moved to Vertex AI

Slide 23

Tried Kubeflow but left
ZOZO
- Hosting multi-tenant Kubeflow is too expensive
- Tons of YAMLs and customizations
- Hard to scale within the team
- Uses Vertex AI to avoid hosting Kubeflow themselves
Reference:
- KubeflowによるMLOps基盤構築から得られた知見と課題 https://techblog.zozo.com/entry/mlops-platform-kubeflow

Slide 24

Experts of ML on Kubernetes
LINE
- For historical and security reasons, they run extreme on-prem clusters
- Excellence in managing bare-metal servers and Kubernetes
Reference:
- Lupus - A Monitoring System for Accelerating MLOps https://speakerdeck.com/line_devday2021/lupus-a-monitoring-system-for-accelerating-mlops

Slide 25

Experts of ML on Kubernetes
Yahoo! Japan
- For historical reasons, they run extreme on-prem clusters
- Excellence in managing bare-metal servers and Kubernetes
- Huge amount of investment in Kubernetes
Reference:
- 継続的なモデルモニタリングを実現するKubernetes Operator https://www.slideshare.net/techblogyahoo/kubernetes-operator-251612755

Slide 26

Experts of ML on Kubernetes
PFN
- Power users of machine learning (ML researchers)
- They need bare-metal servers to:
  1. use GPUs and CPUs as much as possible
  2. create their own chip (accelerator) and test it on their servers

Slide 27

Experts of ML on Kubernetes
Rakuten
- For historical reasons, they run extreme on-prem clusters
- Excellence in managing bare-metal servers and Kubernetes
References:
- Kubernetesによる機械学習基盤、楽天での活用事例 覃子麟 (チンツーリン) / 楽天株式会社 https://www.slideshare.net/rakutentech/kubernetes-144707493
- 楽天の規模とクラウドプラットフォーム統括部の役割 https://www.slideshare.net/rakutentech/ss-253221883

Slide 28

Container platform hopper (Challenger)
ABEJA
- Docker Swarm -> Rancher -> Kubernetes (EKS)
- Excellence in Kubernetes
References:
- ABEJAの技術スタックを公開します (2019年11月版) https://tech-blog.abeja.asia/entry/tech-stack-201911
- ABEJA Insight for Retailの技術スタックを公開します (2021年10月版) https://tech-blog.abeja.asia/entry/retail-tech-stack-202110

Slide 29

Hybrid: Vertex & Kubernetes
DeNA
- Moved from serverless services to Vertex Pipelines (training) & Kubernetes (inference)
Reference:
- DeNA の MLops エンジニアは何をしてるのか【DeNA TechCon 2021 Winter】 https://speakerdeck.com/dena_tech/techcon2021-winter-5

Slide 30

TOC
- Why Kubernetes for ML?
- Decisions to use Kubernetes and Kubeflow or not
- Requirements to use Kubernetes as an ML platform <-
- Summary

Slide 31

Requirements to use Kubernetes as an ML platform
- Using Kubernetes as a platform everywhere in the organization
- Capability to customize Kubernetes & Kubeflow
- A strong heart to bear the pain caused by breaking changes

Slide 32

Discussion
We (Asei & Yusuke Shibui) reached the same conclusion.

Slide 33

Summary
- Containers are a good practice for ML to avoid training/serving skew
- Be careful about using Kubernetes & Kubeflow as an ML platform
- The minimum requirement for using Kubernetes as an ML platform is the capability to customize Kubernetes to fit your use cases
- Consider a hybrid approach: managed service for training & inference service on Kubernetes