Why Kubernetes for ML?
- Use cases of ML platform to consider
- Training/Serving Skew
- Container
- Kubernetes
- Kubeflow
- Let's use Kubernetes / Kubeflow for ML platform!
- Managed ML platform

Data analytics & model development
Workload
- Ad-hoc analysis
- Service-independent model development
Case study: building a data analytics platform for a major entertainment company with Google Cloud https://youtu.be/BTYO0-avsXI
Beyond Interactive: Notebook Innovation at Netflix https://netflixtechblog.com/notebook-innovation-591ee3221233

Data analytics & model development
Requirements
- Easy and safe access to large datasets
- Visualization without code
Not required
- Version control
- High availability

Model training
Workload
- Batch processing (training pipeline)
MLOps: Continuous delivery and automation pipelines in machine learning https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

Model training
Requirements
- Massive amount of compute resources (CPUs, memory, accelerators)
- Massive amount of storage access (IOPS, network bandwidth)
- Visualization
- Version control (code, data, model, and the lineage between them)
Not required
- High availability

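The version control requirement above (code, data, model, and the lineage between them) can be illustrated with a minimal sketch. It assumes lineage is kept as one record per trained model that stores a code version plus content digests of the data and the model artifact. `LineageRecord`, `make_record`, and the sample values are hypothetical names for illustration only, not part of any platform mentioned in these slides.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class LineageRecord:
    """Hypothetical record tying one trained model to the code and data that produced it."""
    code_version: str   # e.g. a Git commit hash (assumed to be supplied by the caller)
    data_digest: str    # digest of the training data snapshot
    model_digest: str   # digest of the serialized model artifact


def make_record(code_version: str, data: bytes, model: bytes) -> LineageRecord:
    """Build a lineage record from a code version and the raw bytes of data and model."""
    return LineageRecord(code_version, sha256_of(data), sha256_of(model))


# Placeholder inputs; a real pipeline would read the actual files.
record = make_record("abc1234", b"feature,label\n1.0,0\n", b"fake-model-bytes")
print(json.dumps(asdict(record), indent=2))
```

Because the digests depend only on content, retraining on changed data yields a different `data_digest`, which makes it possible to tell which data snapshot a given model came from.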
Inference
Workload
- Web API
- Batch processing
Accelerated Computing on AWS for NLP https://speakerdeck.com/icoxfog417/accelerated-computing-on-aws-for-nlp

Inference
Requirements
- (Web API) Low latency
- High availability
- Scalability
- Version control (code, data, model, and the lineage between them)
Not required
- (Web API) Massive amount of storage access for each request (hopefully)

Training/Serving Skew
- The same code has to run in all three use cases.
- On top of that, we have to consider dev/staging/prod.
- Otherwise we get training/serving skew, which is caused by differences between environments.
Why We Need DevOps for ML Data https://www.tecton.ai/blog/devops-ml-data/

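The "same code" point above can be sketched as follows. This is a minimal illustration, assuming skew is avoided by routing both the training pipeline and the serving path through one shared preprocessing function. The feature names (`price`, `country`) and all function names are hypothetical.

```python
import math
from typing import Dict, List


def preprocess(raw: Dict) -> List[float]:
    """Turn one raw record into a feature vector (hypothetical features)."""
    return [
        math.log1p(raw["price"]),                 # compress the price scale
        1.0 if raw["country"] == "JP" else 0.0,   # simple binary flag
    ]


def build_training_rows(records: List[Dict]) -> List[List[float]]:
    """Training path: applies the shared function to a batch of records."""
    return [preprocess(r) for r in records]


def build_serving_features(request: Dict) -> List[float]:
    """Serving path: applies the same shared function to a single request."""
    return preprocess(request)


example = {"price": 100.0, "country": "JP"}
training_features = build_training_rows([example])[0]
serving_features = build_serving_features(example)
print(training_features == serving_features)
```

Sharing the function only removes skew that comes from duplicated logic. Differences in libraries, drivers, or OS between environments still have to be controlled separately.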
Container
What we should manage across these environments:
- Code
- Libraries
- Drivers (CUDA, etc.)
- OS
Containers (and, in the past, machine images) are the de facto standard format for this purpose.

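To make the list above concrete, here is a rough sketch of how differences between two environments could be detected. It assumes a "fingerprint" made of the Python version, the OS string, and the installed versions of a few libraries. Driver versions such as CUDA are left out because the standard library has no portable way to query them. The function names and the package list are illustrative assumptions.

```python
import platform
from importlib import metadata


def environment_fingerprint(packages):
    """Collect interpreter, OS, and library versions for the given package names."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "os": platform.platform(),
        "packages": versions,
    }


def flatten(fingerprint):
    """Flatten a fingerprint into a single-level dict for easy comparison."""
    flat = {"python": fingerprint["python"], "os": fingerprint["os"]}
    for name, version in fingerprint["packages"].items():
        flat["package:" + name] = version
    return flat


def diff_fingerprints(a, b):
    """Return {key: (value_in_a, value_in_b)} for every entry that differs."""
    flat_a, flat_b = flatten(a), flatten(b)
    keys = sorted(set(flat_a) | set(flat_b))
    return {k: (flat_a.get(k), flat_b.get(k)) for k in keys if flat_a.get(k) != flat_b.get(k)}


training_env = environment_fingerprint(["numpy", "scikit-learn"])
# Pretend the serving host runs a different interpreter version.
serving_env = dict(training_env, python="0.0.0")
print(diff_fingerprints(training_env, serving_env))
```

When both workloads run from the same container image, such a diff should come back empty, since the image fixes the code, libraries, and OS layers together.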
Kubernetes
"Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications."
- We can deploy both web services and batch jobs on Kubernetes.

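The point that one container image can back both a web service and a batch job can be sketched with two manifests built as plain Python dictionaries. `Deployment` (`apps/v1`) and `Job` (`batch/v1`) are standard Kubernetes resource kinds. The image name, resource names, port, and `train.py` command are hypothetical placeholders.

```python
import json

# Hypothetical image shared by the serving and training workloads.
IMAGE = "registry.example.com/ml/model:1.0.0"


def web_api_deployment(image, replicas=2):
    """Manifest for a long-running inference web API."""
    labels = {"app": "inference-api"}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "inference-api"},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {"name": "api", "image": image, "ports": [{"containerPort": 8080}]}
                    ]
                },
            },
        },
    }


def training_job(image):
    """Manifest for a run-to-completion training batch job."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": "train-model"},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed training run
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": "train", "image": image, "command": ["python", "train.py"]}
                    ],
                }
            },
        },
    }


deployment = web_api_deployment(IMAGE)
job = training_job(IMAGE)
print(json.dumps(deployment, indent=2))
print(json.dumps(job, indent=2))
```

Since `kubectl apply -f` accepts JSON as well as YAML, each printed manifest could be saved to its own file and applied to a cluster, though a real setup would also need resource requests, probes, and a Service in front of the Deployment.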
Kubeflow
"The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable."
- It started as an open-source implementation of Google's internal ML platform (TFX).
- Today, Kubeflow is not tied to any particular library or cloud service.

Let's use Kubernetes / Kubeflow for ML platform!
- Be careful about adopting Kubernetes or Kubeflow as an ML platform.
- Both Kubernetes and Kubeflow require a huge amount of effort.
- Several companies tried Kubeflow and then decided to use a managed ML platform instead.

Managed ML platform
- Vertex AI: Build, deploy, and scale machine learning (ML) models faster, with fully managed ML tools for any use case.
- SageMaker: Build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.
- Both services originate from their providers' internal ML platforms (Google & Amazon).

Decisions to use Kubernetes and Kubeflow or not
- Ride managed ML platform (Vertex AI)
- Tried Kubeflow but left
- Experts of ML on Kubernetes
- Container platform hopper (Challenger)
- Hybrid: Vertex & Kubernetes

Ride managed ML platform (Vertex AI)
CADDi
- Small team & fast delivery
Image search with OpenSearch, and aiming for stable operation by adding tests
Managed MLOps at CADDi AI Lab https://speakerdeck.com/vaaaaanquish/caddi-ai-labniokerumanezidonamlops

Ride managed ML platform (Vertex AI)
CAM (CyberAgent Group)
- Small team
Our work on an MLOps platform built with Vertex AI https://speakerdeck.com/cyberagentdevelopers/vertexaidegou-zhu-sitamlopsji-pan-falsequ-rizu-mi

Tried Kubeflow but left
Repro
- Kubeflow is too painful to use
- Cannot upgrade Kubeflow in place (delete & create)
- Fine-grained logging costs too much (with Prometheus)
- Too expensive to keep watching Kubeflow & Kubernetes
- Use Vertex AI to avoid managing Kubernetes & Kubeflow

Tried Kubeflow but left
mercari
- Building an internal ML platform is too expensive
- Hard to maintain the code base after a key engineer left the company
- Decided to use Kubeflow, then moved to Vertex AI

Tried Kubeflow but left
ZOZO
- Hosting multi-tenant Kubeflow is too expensive
- Tons of YAML files and customizations
- Hard to scale within the team
- Use Vertex AI to avoid hosting Kubeflow themselves
Insights and challenges from building an MLOps platform with Kubeflow https://techblog.zozo.com/entry/mlops-platform-kubeflow

Experts of ML on Kubernetes
LINE
- For historical and security reasons, they run very large-scale on-prem clusters
- Excellence in managing bare-metal servers and Kubernetes
Lupus - A Monitoring System for Accelerating MLOps https://speakerdeck.com/line_devday2021/lupus-a-monitoring-system-for-accelerating-mlops

Experts of ML on Kubernetes
Yahoo! Japan
- For historical reasons, they run very large-scale on-prem clusters
- Excellence in managing bare-metal servers and Kubernetes
- Huge investment in Kubernetes
A Kubernetes Operator for continuous model monitoring https://www.slideshare.net/techblogyahoo/kubernetes-operator-251612755

Experts of ML on Kubernetes
PFN
- Power users of machine learning (ML researchers)
- They need bare-metal servers to:
  1. use GPUs and CPUs to the fullest
  2. create their own chip (accelerator) and test it on their own servers

Rakuten
- For historical reasons, they run very large-scale on-prem clusters
- Excellence in managing bare-metal servers and Kubernetes
A machine learning platform on Kubernetes: use cases at Rakuten, 覃子麟 (チンツーリン) / Rakuten, Inc. https://www.slideshare.net/rakutentech/kubernetes-144707493?from_action=save
Rakuten's scale and the role of the Cloud Platform Department https://www.slideshare.net/rakutentech/ss-253221883

Requirements to use Kubernetes as an ML platform
- Kubernetes is used as a platform everywhere in the organization
- Capability to customize Kubernetes & Kubeflow
- A strong heart to bear the pain caused by breaking changes

Summary
- Containers are a good practice for ML to avoid training/serving skew
- Be careful about adopting Kubernetes & Kubeflow as an ML platform
- The minimum requirement for using Kubernetes as an ML platform is the capability to customize Kubernetes to fit your use cases
- Consider a hybrid approach: a managed service for training and an inference service on Kubernetes