Slide 1

Slide 1 text

Kubernetes-based GPU-as-a-Service Platform - How CyberAgent Uses GPUs -

Slide 2

Slide 2 text

Who Are We?
Lee Yeongjae / AI Platform Project Manager
After joining CyberAgent in 2016, worked as a solutions architect for ad products and was in charge of private cloud and GKE-compatible container platform development. As AI infrastructure project manager, engaged in AI platform development.
Masaya Aoyama / Software Engineer
Hired by CyberAgent out of college in 2016. Built a private cloud and a GKE-compatible container platform from scratch with OpenStack. Co-chair of Japan's biggest cloud native conference, official CNCF community organizer, etc.
Daisuke Takahashi / Infrastructure Engineer
Hired by CyberAgent out of college in 2019. After working on private cloud operation and container infrastructure development, now leads the ML/3DCG/ad-tech domains as a Solutions Architect. Handled physical layer design and operation for this project.

Slide 3

Slide 3 text

Agenda 1. CyberAgent 2. The Necessity of GPUs 3. Hardware 4. Kubernetes-based GPU-as-a-Service Platform 5. AI Platform 6. Wrap & Future Outlook

Slide 4

Slide 4 text

"To create the 21st Century's Leading Company" Media Offers many services tailored to the changing internet industry ➔ AbemaTV ➔ AWA ➔ WinTicket Advertising Applies operational and creative capabilities for maximum ad impact to offer comprehensive solutions, including AI-powered adtech Video Games Offers about 50 smartphone games, including 8 major titles ➔ Umamusume: Pretty Derby ➔ GRANBLUE FANTASY ➔ PRINCESS CONNECT! Re:Dive 3 Main Segments *"Abema": © Abema TV, Inc. **"Umamusume: Pretty Derby," "GRANBLUE FANTASY": © Cygames, Inc.

Slide 5

Slide 5 text

Agenda 1. CyberAgent 2. The Necessity of GPUs 3. Hardware 4. Kubernetes-based GPU-as-a-Service Platform 5. AI Platform 6. Wrap & Future Outlook

Slide 6

Slide 6 text

Why Do We Need AI Solutions? Reducing costs ● Gain business domain knowledge ● Automated ad generation Improving impact ● Identify unexpectedly effective ads ● Improve ad impact analysis Reducing risk ● Avoid risk of controversies from ad placement

Slide 7

Slide 7 text

Why Do We Have to Use GPUs? Processing of massive, complex info ● The combination of ads and posted media is huge ● Enormous info referencing population statistics (area, age, etc.) Fast execution ● Trends change fast ● Get real-time info

Slide 8

Slide 8 text

AI Solutions Execution Environment
Users: researchers, data scientists, MLOps engineers
Jupyter Notebook: execution environment for interactive GUI programs
Google AI Platform: manages ML workflows with client tools/GUI
Workflow: implement code and prepare input data → train & evaluate the model → deploy the model → make inferences with the model → monitor inferences → manage model versions

Slide 9

Slide 9 text

System Architecture
(Stack diagram: AI Platform on top of GPUaaS (Kubernetes), on top of DGX A100 + AFF A800)
● DGX A100 + AFF A800 → offer high-performance GPUs & storage
● GPU-as-a-Service → offers Jupyter Notebook
● Original AI Platform → offers infrastructure equivalent to Google AI Platform

Slide 10

Slide 10 text

Why On-Premises? Features ● Flexible software stack assembly ● Easy connections with existing services Costs ● High cloud expenses ● Inexpensive over the long term

Slide 11

Slide 11 text

Why On-Premises?
(Chart: cost of cloud GPUs alone, at one of our divisions)

Slide 12

Slide 12 text

Agenda 1. CyberAgent 2. The Necessity of GPUs 3. Hardware 4. Kubernetes-based GPU-as-a-Service Platform 5. AI Platform 6. Wrap & Future Outlook

Slide 13

Slide 13 text

GPUaaS Hardware

Slide 14

Slide 14 text

History of GPUaaS
v1: GPU containers
● Implemented central management of GPU resources for researchers
○ Assigned an entire host exclusively to each researcher
v2: GPU containers + Jupyter Notebook
● Managed Notebook environment for researchers
● Or primitive GPU containers, as in v1
v3: GPU containers + Jupyter Notebook + AI Platform
● Expanded availability to developers in addition to researchers
● Hosts the AI Platform (GCP-compatible) on top of GPUaaS

Slide 15

Slide 15 text

GPUaaS v1
Efficient use of GPU resources
● Centralized management of researchers' workstations
○ Assigns 1 host (node) per user
● Located in a server room in our office
○ GPU: 20x NVIDIA GeForce GTX 1080 Ti 11 GB (220 GB total)
○ CPU: 324 cores
○ Memory: 1.28 TB
Environment for easier reproduction
● Simplified recreation of experiment environments with container virtualization
● Adopted Kubernetes, a proven option at CyberAgent
○ Offers direct access to the Kubernetes API

Slide 16

Slide 16 text

GPUaaS v2
Successor to GPUaaS v1
● Migrated GPU resources from v1
● Changed the assignment policy to shared use (multi-tenancy)
NEW: Shared storage for datasets
● Containers can mount the same data
● Software-defined storage on Kubernetes
○ NFS service by Rook (Ceph)
○ Usable capacity: 48 TB with SATA SSDs
NEW: Managed training environment
● Launch Jupyter Notebook without Kubernetes knowledge
● Users can optionally bring custom container images

Slide 17

Slide 17 text

Operational Issues for v2
Location / Site
● The office building is NOT a datacenter
○ Reached the limits of power and cooling
○ Regular blackouts for legally mandated inspections
● Poor connection quality
○ Site-to-site VPN only
○ Non-redundant network
Machine maintenance
● Lack of remote management features
○ No BMC equipped (field ops required)
○ Office access restricted due to COVID-19
Performance
● Insufficient GPU memory
○ The GeForce series is not designed for ML
● Outdated hardware
○ Newer CPUs and GPUs have come out
○ Increasing rate of hardware failures

Slide 18

Slide 18 text

Considerations for Improving GPUaaS
Developers also want to use the platform researchers favor for their own services. To achieve the required quality, we had to address the issues in v2.
Location / Site
● Escape from the office building
● Use the existing Tokyo datacenter that hosts our private cloud
Specs
● Brand-new servers for GPUaaS (IPMI required)
● Enterprise-grade GPUs with massive memory
○ Tesla V100, T4, etc.

Slide 19

Slide 19 text

NVIDIA A100
Ampere architecture
● Up to 20x faster than its predecessor (V100)
New hardware features
● Multi-Instance GPU, Sparsity, etc.
Faster GPU-to-GPU interconnection
● 3rd-gen NVLink, 2nd-gen NVSwitch
● Up to 16 GPUs
● Full-mesh topology at 600 GB/s per GPU

Slide 20

Slide 20 text

MIG: Multi-Instance GPU MIG mode in the NVIDIA Ampere architecture can run seven jobs in parallel on an A100 GPU (NVIDIA Blog) Multi-tenancy ● For DGX A100, its 8 GPUs can be sliced into 56 GPU instances ● Administrators can assign right-sized GPUs for each job Guaranteed QoS ● All GPU instances include isolated memory (capacity/bandwidth) and cores
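As an aside not on the slide: MIG partitioning is typically driven with nvidia-smi on the host. A minimal sketch, assuming an A100 40 GB where the 3g.20gb profile has ID 9 (verify with nvidia-smi mig -lgip on your system):
# Enable MIG mode on GPU 0 (takes effect after a GPU reset)
$ sudo nvidia-smi -i 0 -mig 1
# Create two 3g.20gb GPU instances and their default compute instances
$ sudo nvidia-smi mig -i 0 -cgi 9,9 -C
# List the resulting GPU instances
$ sudo nvidia-smi mig -i 0 -lgi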

Slide 21

Slide 21 text

GPUaaS v3
Renewed GPU hardware
● Adopted NVIDIA DGX A100
○ GPU: 8x NVIDIA A100 40 GB (320 GB total)
○ CPU: 128 cores
○ Memory: 1 TB
○ Tested combinations of the new hardware features with Kubernetes (thanks to the people at NVIDIA for helping out!)
■ Details on software later
● Installed in a DGX-Ready datacenter
○ Location verified for power, cooling, and installation

Slide 22

Slide 22 text

“Storage should also be re-designed.”

Slide 23

Slide 23 text

Storage Improvements
Constraints of the v2 storage (hardware specs)
● Low capacity efficiency per rack space
→ Should introduce large-capacity drives and/or chassis with many disk slots
● Insufficient throughput for transferring datasets (compared with the A100 GPU's performance)
→ Should improve disk and network performance
Focus on using storage, not operating it
● Rook (Ceph) was a suitable option for reusing existing resources
○ Not motivated to operate SDS, since the purpose is providing storage space
→ Should consider appliances, not just SDS
Additional features
● Want block access for GPUaaS-internal metadata DBs

Slide 24

Slide 24 text

GPUaaS v3 (contd.)
Revamped storage for datasets
● Adopted NetApp AFF A800
○ 62 TB of NVMe SSDs (all-flash)
■ Can scale out/up by adding:
● Disks (into empty bays)
● Disk shelves
● Controllers
○ Multi-protocol access
■ File (NFS, SMB), Block (iSCSI, etc.), Object (S3)
○ Details on Kubernetes integration later
● Selected with NVIDIA DGX POD in mind
○ Scalable reference architecture for DGX systems and storage
○ Announced by NetApp as ONTAP AI
* Photo of the evaluation system. Some configurations differ.

Slide 25

Slide 25 text

Reference: v3 Hardware Overview 100 GbE 25 GbE Compute NVIDIA DGX A100 Network Mellanox SN2010 Storage NetApp AFF A800

Slide 26

Slide 26 text

Agenda 1. CyberAgent 2. The Necessity of GPUs 3. Hardware 4. Kubernetes-based GPU-as-a-Service Platform 5. AI Platform 6. Wrap & Future Outlook

Slide 27

Slide 27 text

GPU-as-a-Service Overview
In-house infrastructure providing a GPU interface to users and upper services
(Minimum) requirements
● Provision of desired resources cut out from pooled computing resources
● Isolated GPUs to prevent interference during task execution
● High-performance storage allowing simultaneous connections
Adopted infrastructure: containers + Kubernetes (computing resource pool + storage pool)
Container icons: https://icons8.jp/icons/set/video-card

Slide 28

Slide 28 text

Why Did We Select Containers?

Slide 29

Slide 29 text

Containers vs. VMs vs. Bare Metal
● Advantages of containers
○ Easy to create an image of the execution environment (cf. VM, bare metal)
○ Low overhead, starts in a short time (cf. VM)
○ Enables multi-tenant environments for multiple users (cf. bare metal)
● Disadvantages of containers
○ Lower isolation than VMs
○ Short environment life cycle (cf. VM, bare metal)

Slide 34

Slide 34 text

Why Did We Choose Kubernetes?

Slide 35

Slide 35 text

Kubernetes
In a production environment, we must manage containers across multiple host machines. That means we need container orchestration tools, and Kubernetes is one such tool.
What it manages over the computing resource pool and storage pool:
● Scheduling
● Rolling updates
● Health checks
● (Auto) scaling
● Self-healing after malfunctions
● Authentication & authorization
● Service discovery
● Load balancing
● Attaching confidential info
● Multi-tenancy
● Integration with storage systems
○ Block
○ Shared filesystem
○ Others

Slide 36

Slide 36 text

GPU and Kubernetes
Kubernetes supports device plugins, which are pluggable, so Kubernetes can handle various devices:
● GPU (NVIDIA/AMD/Intel)
● TPU (Tensor Processing Unit)
● FPGA
● etc.
We use Prometheus + DCGM Exporter for monitoring. Users have a dashboard to check GPU usage, etc.
Note: You can optimize for NUMA and GPU topology by using InfiniBand with Kubernetes (the container runtime must be compatible).
containers:
  - name: mljob
    image: my-mljob:v0.1
    resources:
      limits:
        nvidia.com/gpu: 1
        nvidia.com/mig-3g.20gb: 1

Slide 37

Slide 37 text

Kubernetes' Advantages
1. Reconciliation loop to bring things into a declared state
2. Huge ecosystem
3. Highly extendable platform

Slide 38

Slide 38 text

1. Reconciliation Loop to Bring into a Declared State
Kubernetes uses programs called controllers to control the system. Multiple controllers bring things into a declared state:
○ Maintain the specified number of replicas
○ Recover containers shut down by broken nodes
○ Auto reload when confidential info or config files change
○ Auto manage load balancer members
○ etc.
Example: the ReplicaSet Controller watches the desired state below and the actual ReplicaSet (replicas=3):
kind: ReplicaSet
spec:
  replicas: 3
  template:
    spec:
      containers:
        - image: nginx:1.16

Slide 39

Slide 39 text

2. Huge Ecosystem
The CNCF and Kubernetes community promote open technology and develop and release various OSS projects integrated with Kubernetes. By using OSS extension controllers built around the reconciliation loop, you can let Kubernetes handle most routine operations.
● Prometheus / Grafana: monitor GPUs and a wide range of middleware
● cert-manager: auto-generates certificates using ACME; auto-integrates with load balancers
● external-dns: manages DNS records for provisioned IP addresses
● oauth2-proxy + NGINX Ingress: integrates OAuth2 with requests to the WebUI
● Others: auto scaling, progressive delivery, data transmission between projects, etc. (see the sketch below)
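To illustrate how these pieces compose (our example, not from the deck): a single Ingress can drive both cert-manager and external-dns through annotations. The hostname, issuer, and Service name below are hypothetical:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: notebook-ui
  annotations:
    # cert-manager issues a TLS certificate via the referenced ACME issuer
    cert-manager.io/cluster-issuer: letsencrypt-prod
    # external-dns creates/updates the DNS record for the host below
    external-dns.alpha.kubernetes.io/hostname: notebook.example.com
spec:
  tls:
    - hosts: ["notebook.example.com"]
      secretName: notebook-tls
  rules:
    - host: notebook.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: notebook
                port:
                  number: 80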

Slide 40

Slide 40 text

3. Highly Extendable Platform
Kubernetes is built to be easily extendable and lets us add features tailored to our company's specific domains. It also offers a framework for implementing controllers (even reusing general OSS).
Examples:
● Auto data load + cache from S3/GCS (custom controller)
● Auto injection of cloud credentials (Mutating Webhook; see the sketch below)
● Storing application metadata in Secret/ConfigMap instead of a database
● Billing system based on info retrieved from Kubernetes
Our implementation also keeps pace with standardization: container runtime (OCI/CRI), networking (CNI), storage (CSI), etc.
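As a sketch of the credential-injection example above: in Kubernetes, such a webhook is registered with a MutatingWebhookConfiguration that points the API server at an in-cluster webhook Service. All names below are hypothetical:
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: cloud-credential-injector     # hypothetical name
webhooks:
  - name: inject.gpuaas.internal      # hypothetical webhook name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    clientConfig:
      service:
        name: credential-injector     # webhook server Service (hypothetical)
        namespace: gpuaas
        path: /mutate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]           # mutate pods at creation time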

Slide 41

Slide 41 text

Kubernetes' Advantages
1. Reconciliation loop to bring things into a declared state
2. Huge ecosystem
3. Highly extendable platform
● Restoration capability
● Easy management
● Observability
● Frequent updates with robust automation
● etc.
⇒ As we track the evolution of OSS, we guide business to success by continuously improving our platforms.

Slide 42

Slide 42 text

Key Points When Selecting Storage While Using Kubernetes

Slide 43

Slide 43 text

Linking Storage with Kubernetes
● We use the Container Storage Interface (CSI) to integrate storage with Kubernetes.
● CSI is an interface that connects container orchestrators with storage. It supports multiple orchestrators and multiple storage products. CSI only defines open specifications; the features available differ according to the CSI driver.
○ https://github.com/container-storage-interface/spec
○ https://kubernetes-csi.github.io/docs/drivers.html
CSI-managed operations:
● Volume creation/deletion
● Volume attachment/detachment
● Volume expansion
● Volume cloning
● Snapshot & restore (see the sketch below)
● Topology designation
● Raw block volume creation
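For instance, snapshot & restore looks like this on the Kubernetes side. A minimal sketch, assuming a CSI driver with snapshot support, the v1 snapshot API installed, and an existing VolumeSnapshotClass and PVC (names are illustrative):
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: dataset-snap
spec:
  volumeSnapshotClassName: csi-snapclass      # hypothetical class backed by the CSI driver
  source:
    persistentVolumeClaimName: dataset-pvc    # PVC to snapshot
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dataset-restore
spec:
  dataSource:                                 # restore by creating a PVC from the snapshot
    name: dataset-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 100Gi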

Slide 44

Slide 44 text

CSI Feature Set
The CSI driver in Kubernetes is divided into multiple sub-features. Even adequate storage features are unusable without a compatible CSI driver, and missing infrastructure features could prevent upper services from providing value.
CSI driver considerations when selecting storage:
1. Speed at tracking Kubernetes upstream features (release frequency, upstream participation)
2. CSI driver quality (including bug-fix speed)
NetApp has developed Trident since before CSI existed. It has very good release frequency, features, and quality.

Slide 45

Slide 45 text

Trident as OSS
Trident is released as OSS, so the CSI driver implementation is not a black box.
● Our team wants to avoid always having to wait when a problem occurs
Excellent development organization
● Three-month release cycle (since Dec 2016)
● Proactive contributions to the community mean we can expect fast upstream responses
Supports both ReadWriteOnce and ReadWriteMany (with AFF at CyberAgent; see the sketch below)
● NAS: for ML workloads
● SAN: for the system applications building GPUaaS (databases, Prometheus, etc.)
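A sketch of how a Trident-backed shared volume is typically requested; class and claim names are illustrative, and backendType follows Trident's ontap-nas convention:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ontap-nas
provisioner: csi.trident.netapp.io
parameters:
  backendType: "ontap-nas"        # NAS backend -> NFS volumes, ReadWriteMany capable
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dataset-pvc
spec:
  storageClassName: ontap-nas
  accessModes: ["ReadWriteMany"]  # shared dataset mounted by many training pods
  resources:
    requests:
      storage: 1Ti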

Slide 46

Slide 46 text

Implementing Multi-tenancy and Providing GPUs with Kubernetes

Slide 47

Slide 47 text

Two Ways to Use GPU-as-a-Service
GPUaaS is operated with a dedicated web console or kubectl.
A. Web Console → GPUaaS API Server
● Launch notebooks
● Manage volumes
● Show billing info
● Manage projects
● etc.
B. kubectl → Kubernetes API Server
$ kubectl ...

Slide 48

Slide 48 text

Multi-tenancy in Kubernetes
The Kubernetes Namespace concept maps directly to tenants. We use ClusterRoles and RoleBindings to manage permissions: "add a member to a project from the WebUI console" = "add a RoleBinding" (see the sketch below).
This allows seamless management, much as if Kubernetes itself were the user database. Other operations are also possible on the WebUI console, depending on the user's role.
(Diagram: UserA, UserB, TeamX, and TeamY namespaces, with "admin" and "member" ClusterRoles bound per namespace via RoleBindings.)
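A minimal sketch of the "add RoleBinding" operation behind the WebUI, with illustrative user and namespace names:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: member-userA          # illustrative name
  namespace: teamX            # the tenant's namespace
subjects:
  - kind: User
    name: userA               # the user added from the WebUI console
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: member                # shared "member" ClusterRole from the slide
  apiGroup: rbac.authorization.k8s.io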

Slide 49

Slide 49 text

Multi-tenant Management with Hierarchical Namespaces
The Hierarchical Namespace Controller propagates shared settings to all child namespaces: policy, config data, shared cloud account credentials, metadata (CustomResources), etc.
(Diagram: a GPUaaS namespace propagates policy down to the TeamX, UserA, and UserB namespaces. A sketch of the anchor resource follows.)
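With HNC, creating a child namespace under a team is done through a SubnamespaceAnchor; a minimal sketch with illustrative names. Policies and other propagated objects in the parent then flow into the child automatically:
apiVersion: hnc.x-k8s.io/v1alpha2
kind: SubnamespaceAnchor
metadata:
  name: userA        # child namespace to create
  namespace: teamX   # parent namespace; shared settings propagate downward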

Slide 50

Slide 50 text

Providing GPU Instances
1. Offer Jupyter Notebooks with the original web console
For users unfamiliar with Kubernetes: data scientists, researchers, etc.
2. SSH-like environment using the Kubernetes CLI
$ kubectl exec -it PODNAME-0 -- bash
PODNAME-0 #

Slide 51

Slide 51 text

Providing GPU Instances
3. Provide infrastructure to upper services
For developing machine-learning infrastructure based on Kubernetes (the AI Platform on GPUaaS, backed by DGX A100 + AFF A800).
We can't implement what we want if lower-layer features are inadequate
⇒ storage features and CSI driver functionality are important.
Considering a multi-DC rollout by leveraging Kubernetes portability.

Slide 52

Slide 52 text

Agenda 1. CyberAgent 2. The Necessity of GPUs 3. Hardware 4. Kubernetes-based GPU-as-a-Service Platform 5. AI Platform 6. Wrap & Future Outlook

Slide 53

Slide 53 text

Why We Need an Original AI Platform
What do you not like about the currently available GPUaaS? (Multiple responses)
● "Development on it is slower, so we want to do everything on GCP/AWS."
● "We can't train as easily as on an AI platform."
● "We don't use it, since migration from the cloud is hard."

Slide 54

Slide 54 text

AI Platform Overview
Our own infrastructure, equivalent to Google AI Platform, for managing ML workflows
● Overview of the training system
● Overview of the inference system

Slide 55

Slide 55 text

AI Platform Requirements
Machine learning & inferences
● Can offload Google AI Platform ML workflows
○ Object storage as the hub
● Same operability as Google AI Platform
○ kubectl plugin features
● Can reuse Google AI Platform configuration files
Training
● Capable of hyperparameter tuning
○ Uses Katib, a Kubeflow component
Inferences
● Can create inference endpoints
○ Uses KFServing, a Kubeflow component
● Model version management
○ Uses our original model metadata management infrastructure
● External access to inference endpoints
○ Authorization with Istio and External Authorization

Slide 56

Slide 56 text

Training & Inferences

Slide 57

Slide 57 text

Object Storage as a Hub

Slide 58

Slide 58 text

Consistent Operability
Cloud resource (Google AI Platform definition):
gcloud ai-platform jobs submit training...
gcloud ai-platform jobs submit prediction...
gcloud ai-platform predict...
gcloud ai-platform models...
gcloud ai-platform versions...
On-prem resource:
kubectl ai-platform jobs submit training...
kubectl ai-platform jobs submit prediction...
kubectl ai-platform predict...
kubectl ai-platform models...
kubectl ai-platform versions...
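As an illustration of the intended operability, a hypothetical invocation; the flags mirror gcloud's --package-path and --module-name and are not confirmed by the deck:
$ kubectl ai-platform jobs submit training my_job \
    --package-path ./trainer \
    --module-name trainer.task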

Slide 59

Slide 59 text

Training

Slide 60

Slide 60 text

Kubeflow
Toolkit for building ML workflows on Kubernetes
https://www.kubeflow.org/docs/started/kubeflow-overview/
● Not tied to a specific deployment environment
○ Can be deployed on-premises
● Resource management with Kubernetes
○ Extensible; big ecosystem
○ Objects controlled with manifests
● Hyperparameter tuning
○ Katib
● Inference endpoint creation
○ KFServing

Slide 61

Slide 61 text

Katib Components
Experiment
● An individual execution of hyperparameter tuning
● Holds all settings (algorithms, etc.)
Suggestion
● Generates hyperparameters
Trial
● A training run executed with the suggested hyperparameters (a TFJob/PyTorchJob/Job whose pod runs the worker container)
Metrics Container
● Saves the model's prediction accuracy during training as metrics
Metrics Collector
● Writes the metrics to the Katib DB and completes the tuning
(A sketch of an Experiment manifest follows.)
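A minimal sketch of a Katib v1beta1 Experiment tying these components together; the image, metric name, and parameter are illustrative:
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-search-example
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy    # metric emitted by the training code
  algorithm:
    algorithmName: random            # the Suggestion generates hyperparameters
  parallelTrialCount: 3
  maxTrialCount: 12
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.1"
  trialTemplate:                     # each Trial runs a Job with the suggested values
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        reference: lr
        description: learning rate
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
              - name: training-container
                image: my-trainer:latest   # hypothetical image
                command: ["python", "train.py", "--lr=${trialParameters.learningRate}"]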

Slide 62

Slide 62 text

AI Platform Training System Configuration icons: https://icons8.jp/icons/set/video-card

Slide 63

Slide 63 text

Job Submit Flow for AI Platform Training
1. Execute the submit command
2. Compress the model code
3. Send the compressed code to GCS/S3
4. Create a Katib Experiment
5. The Trial creates pods (+ PV attachment)
6. Download the compressed code
7. Execute training
8. Send the model to GCS/S3
icons: https://icons8.jp/icons/set/video-card

Slide 64

Slide 64 text

Inferences

Slide 65

Slide 65 text

https://www.kubeflow.org/docs/started/kubeflow-overview/ Kubeflow

Slide 66

Slide 66 text

KFServing
Overview
● Provides inference features (defined with the InferenceService resource)
● Abstracts model serving (compatible with TensorFlow, PyTorch, XGBoost, etc.)
● Manages serving containers (Knative)
● Manages traffic routing (Istio)
Special features
● Auto scaling
● Canary rollouts, A/B tests
● Prediction, preprocessing, post-processing, etc.
https://github.com/kubeflow/kfserving/blob/master/docs/diagrams/kfserving.png

Slide 67

Slide 67 text

InferenceService
● The KFServing custom resource definition
● The inference endpoint entity
○ Containers that load models
○ Provides the inference endpoint (FQDN)
● Supports both preprocessing and post-processing
● PodSpec descriptions allow custom containers (see the sketch below)
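A minimal sketch of an InferenceService as used around KFServing v0.x; the model URI and name are illustrative:
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: my-model                          # becomes part of the endpoint FQDN
spec:
  predictor:
    tensorflow:
      storageUri: s3://models/my-model/   # model pulled from object storage (GCS/S3)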

Slide 68

Slide 68 text

Original Model Metadata Management Infrastructure
Overview
● Manages metadata attached to models
● Originally built as infrastructure for follow-up tests and reproduction
● Developed/operated by another department
Special features
● Saves model version histories
● Can tie metadata to models
○ Code, datasets, etc.
● Controls model access rights
○ 3 patterns: Read, Write, RW
● Can designate the model location
○ Compatible with GCS/S3

Slide 69

Slide 69 text

Istio and External Authorization
External authorization
● An Envoy feature
● Delegates authorization of incoming requests to an external authorization server
● Custom authorization logic can be implemented
● The authorization server implements a predefined REST API/gRPC service
● Using Istio simplifies the settings
Register the authorization policy:
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: ext-authz
spec:
  selector:
    matchLabels:
      app: istio-ingressgateway
  action: CUSTOM
  provider:
    name: "ext-authz-service"
  rules:
    - to:
        - operation:
            paths: ["/auth/*"]
Register the authorization server (mesh config):
extensionProviders:
  - name: "ext-authz-service"
    envoyExtAuthzGrpc:
      service: "ext-authz.default.svc.cluster.local"
      port: 9000

Slide 70

Slide 70 text

AI Platform Prediction System Configuration icons: https://icons8.jp/icons/set/video-card

Slide 71

Slide 71 text

AI Platform Prediction Flow
Register model
1. Execute the "models create" command
2. Generate a model creation request on the API server
3. Register the model on the metadata server
Create model version
4. Execute the "versions create" command
5. Generate an InferenceService creation request on the API server
6. Apply the InferenceService manifest
7. Create the InferenceService (+ download the model)
Execute prediction
8. Execute the "predict" command
9. Authorization on the external authorization server
10. Transfer the request to the InferenceService
11. Execute inference
12. Return the inference response
icons: https://icons8.jp/icons/set/video-card

Slide 72

Slide 72 text

Agenda 1. CyberAgent 2. The Necessity of GPUs 3. Hardware 4. Kubernetes-based GPU-as-a-Service Platform 5. AI Platform 6. Wrap & Future Outlook

Slide 73

Slide 73 text

Wrap Why are GPUs necessary? ★ To quickly process massive, complex data On-premises advantages Features ● Flexible software stack assembly ● Easy connections with existing services Costs ● High cloud expenses ● Inexpensive over the long term

Slide 74

Slide 74 text

Wrap
● DGX A100 + AFF A800 → ultra-high-performance GPUs/storage
● GPUaaS (Kubernetes) → maximizes Kubernetes' advantages; features always improving with OSS
● AI Platform → Google AI Platform compatibility
By making aggressive use of OSS and continuously improving our platforms, we can make application development more agile and have a big impact on business.

Slide 75

Slide 75 text

Future Outlook
GPUaaS
● Automated MIG partitioning features
AI Platform
● Pipeline features
○ ML workflow automation
● Black-box optimization features
Physical hardware
● Add more DGX A100 / A100 GPUs and AFF A800 capacity
● Improve cost-effectiveness by also integrating other GPUs (T4, etc.)

Slide 76

Slide 76 text

Thank you for your attention