Kubernetes-based GPU as a Service Platform at CyberAgent [INSIGHT 2021 Digital]
We will introduce an example of building and providing a platform that continues to evolve flexibly on NetApp® AFF A800 and NVIDIA DGX A100, adopting OSS such as Kubernetes with Trident.
Speakers:
• Lee Yeongjae: Joined CyberAgent in 2016 and worked as a solutions architect for ad products, in charge of private cloud and GKE-compatible container platform development. As AI infrastructure project manager, engaged in AI platform development.
• Masaya Aoyama, Software Engineer: Hired by CyberAgent out of college in 2016. Built a private cloud and a GKE-compatible container platform from scratch with OpenStack. Co-chair of Japan's biggest cloud native conference, official CNCF community organizer, etc.
• Daisuke Takahashi, Infrastructure Engineer: Hired by CyberAgent out of college in 2019. After working on private cloud operation and container infrastructure development, now leads the ML/3DCG/ad-tech domains as a Solutions Architect. Handled physical layer design and operation for this project.
Business domain knowledge
• Automated ad generation
Improving impact
• Identify unexpectedly effective ads
• Improve ad impact analysis
Reducing risk
• Avoid risk of controversies from ad placement
Complex info
• The combination of ads and posted media is huge
• Enormous info referencing population statistics (area, age, etc.)
Fast execution
• Trends change fast
• Get real-time info
Jupyter Notebook: execution environment for interactive GUI programs
Google AI Platform: manages ML workflows with client tools/GUI
• Implement code, prepare input data
• Train & evaluate model
• Deploy model
• Get inferences from model
• Monitor inferences
• Manage model versions
v1: Exclusive assignment of GPU resources to researchers
◦ Assigns an entire host to each researcher exclusively
v2: GPU containers + Jupyter Notebook
• Managed Notebook environment for researchers
• Or primitive GPU containers, just as in v1
v3: GPU containers + Jupyter Notebook + AI Platform
• Expanded availability to developers in addition to researchers
• Hosting an AI Platform (GCP-compatible) on top of GPUaaS
Consolidation of researchers' workstations
◦ Assigns 1 host (node) per user
• Located in the server room at our office
◦ GPU: 20x NVIDIA GeForce GTX 1080 Ti 11 GB (220 GB total)
◦ CPU: 324 cores
◦ Memory: 1.28 TB
Environment for easier reproduction
• Simplified recreation of experiment environments with container virtualization
• Adopted Kubernetes, a proven option at CyberAgent
◦ Offers direct access to the Kubernetes API
Changes from v1
• Changed assignment policy to shared use (multi-tenancy)
NEW: Shared storage for datasets
• Can mount the same data from multiple containers
• Software-defined storage on Kubernetes
◦ NFS service by Rook (Ceph)
◦ Usable capacity: 48 TB with SATA SSDs
NEW: Managed training environment
• Launch Jupyter Notebook without Kubernetes knowledge
• Can optionally bring custom container images
Our office is NOT a datacenter
◦ Reached the limit of power and cooling
◦ Regular blackouts for legal inspection
• Poor connection quality
◦ Site-to-site VPN only
◦ Non-redundant network
Machine maintenance
• Lack of remote management features
◦ BMC not equipped (field ops required)
◦ Restricted office access due to COVID-19
Performance
• Insufficient GPU memory
◦ GeForce series not designed for ML
• Outdated hardware
◦ Newer CPUs and GPUs have come out
◦ Increasing rate of hardware failures
Developers also wanted to use the platform for their services. To achieve the required quality, we had to address the issues in v2.
Location / Site
• Escape from the office building
• Use the existing datacenter in Tokyo that hosts our private cloud
Specs
• Brand-new servers for GPUaaS (IPMI required)
• Enterprise-grade GPUs with massive memory
◦ Tesla V100, T4, etc.
Performance improvements over the previous generation (V100)
New hardware features
• Multi-Instance GPU, Sparsity, etc.
Faster GPU-to-GPU interconnect
• 3rd Gen NVLink, 2nd Gen NVSwitch
• Up to 16 GPUs
• Full mesh topology at 600 GB/s per GPU
Multi-Instance GPU (MIG) can run seven jobs in parallel on an A100 GPU (NVIDIA Blog)
Multi-tenancy
• For DGX A100, its 8 GPUs can be sliced into up to 56 GPU instances
• Administrators can assign right-sized GPUs to each job (a CLI sketch follows)
Guaranteed QoS
• Every GPU instance has isolated memory (capacity/bandwidth) and cores
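As a rough illustration of how an administrator carves up an A100, a minimal sketch using nvidia-smi's MIG commands (available profile names can be listed with nvidia-smi mig -lgip; exact profiles depend on driver and GPU):

$ sudo nvidia-smi -i 0 -mig 1                        # enable MIG mode on GPU 0
$ sudo nvidia-smi mig -i 0 -cgi 3g.20gb,1g.5gb -C    # create two right-sized GPU instances (+ compute instances)
$ nvidia-smi -L                                      # list the GPU and its MIG devices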
NVIDIA DGX A100
◦ GPU: 8x NVIDIA A100 40 GB (320 GB total)
◦ CPU: 128 cores
◦ Memory: 1 TB
◦ Testing the combination of new HW features and Kubernetes (thanks to the people at NVIDIA for helping out!)
▪ Details on software later
• Installed in a DGX-Ready datacenter
◦ Verified location for power, cooling, and installation
Capacity & performance
• Low capacity efficiency per rack space
→ Should introduce large-capacity drives and/or chassis with many disk slots
• Insufficient throughput for transferring datasets (compared with the A100 GPU's performance)
→ Should improve disk and network performance
Focus on using storage, not operating it
• Rook (Ceph) was a suitable option for reusing existing resources
◦ Not motivated to operate SDS, since the purpose is providing storage space
→ Should consider appliances, not just SDS
Additional features
• Want block access for the internal metadata DBs of GPUaaS
NetApp AFF A800
◦ NVMe SSD 62 TB (all-flash)
▪ Capable of scale-out/scale-up by adding:
• Disks (into empty bays)
• Disk shelves
• Controllers
◦ Multi-protocol access
▪ File (NFS, SMB), Block (iSCSI, etc.), Object (S3)
◦ Details on Kubernetes integration later
• Selected with NVIDIA DGX POD in mind
◦ Scalable reference architecture for DGX systems and storage
◦ Announced by NetApp as ONTAP AI
* Photo of the evaluation system. Some configurations differ.
Advantages of containers
◦ Can easily create an image of the execution environment (cf. VM, metal)
◦ Low overhead, loads in a short time (cf. VM)
◦ Can implement multi-tenant environments for multiple users (cf. metal)
Disadvantages of containers
◦ Low isolation compared to VMs (cf. VM)
◦ Short environment life cycle (cf. VM, metal)
Real workloads run on multiple containers, which means we need a container orchestration tool; Kubernetes is one such tool. It pools computing resources and storage:
• Storage system
◦ Block
◦ Shared filesystem
◦ Others
Kubernetes provides (a minimal Deployment illustrating several of these follows this list):
• Scheduling
• Rolling updates
• Health checks
• Autoscaling
• Self-healing on malfunction
• Authentication & authorization
• Service discovery
• Load balancing
• Attaching confidential info
• Multi-tenancy
• Integration with storage
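A minimal sketch of how several of these features are declared; the /healthz endpoint and port are assumptions for illustration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mljob
spec:
  replicas: 3                     # scheduling + self-healing keep 3 Pods running
  strategy:
    type: RollingUpdate           # rolling updates replace Pods gradually
  selector:
    matchLabels:
      app: mljob
  template:
    metadata:
      labels:
        app: mljob
    spec:
      containers:
      - name: mljob
        image: my-mljob:v0.1
        livenessProbe:            # health check; the kubelet restarts the container on failure
          httpGet:
            path: /healthz        # assumed endpoint
            port: 8080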
Device plugins are pluggable, so Kubernetes can handle various devices:
• GPU (NVIDIA/AMD/Intel)
• TPU (Tensor Processing Unit)
• FPGA
• etc.
We use Prometheus + DCGM Exporter for monitoring; users have a dashboard to check GPU usage, etc.
Note: NUMA and GPU topology can be optimized, e.g. when using InfiniBand with Kubernetes (the container runtime must be compatible).
GPUs and MIG instances are requested through resource limits:

containers:
- name: mljob
  image: my-mljob:v0.1
  resources:
    limits:
      nvidia.com/gpu: 1
      nvidia.com/mig-3g.20gb: 1
Kubernetes uses programs called controllers to control the system. Multiple controllers bring the actual state in line with the declared desired state:
◦ Maintain the specified number of replicas
◦ Recover containers shut down by broken nodes
◦ Auto-reload when confidential info or config files change
◦ Auto-manage load balancer members
◦ etc.
For example, the ReplicaSet controller watches both the desired state below and the actual Pods, and reconciles them (replicas=3):

kind: ReplicaSet
spec:
  replicas: 3
  template:
    spec:
      containers:
      - image: nginx:1.16
We keep up with this technology and develop and release various OSS integrated with Kubernetes. By using OSS extension controllers built on the reconciliation loop, you can let Kubernetes handle most routine operations (a sketch combining several of them follows this list):
• Prometheus / Grafana: monitors GPUs and wide-ranging middleware
• cert-manager: automates certificate issuance using ACME; auto-integrates with load balancers
• external-dns: manages provided IP addresses and DNS records
• oauth2-proxy + nginx ingress: integrates OAuth2 with requests to the WebUI
• Others: autoscaling, progressive delivery, data transmission between projects, etc.
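For instance, a single Ingress can tie several of these together; a minimal sketch, assuming hypothetical hostnames, issuer, and service names:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gpuaas-webui
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt                    # cert-manager issues the TLS cert via ACME
    external-dns.alpha.kubernetes.io/hostname: gpuaas.example.com  # external-dns manages the DNS record
    nginx.ingress.kubernetes.io/auth-url: "https://auth.example.com/oauth2/auth"  # oauth2-proxy check per request
spec:
  tls:
  - hosts: [gpuaas.example.com]
    secretName: gpuaas-webui-tls
  rules:
  - host: gpuaas.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: webui
            port:
              number: 80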
Kubernetes is highly extendable and enables features tailored to our company's specific domains. It also offers a framework for implementing controllers (even reusing general OSS). Examples:
• Auto data load + cache from S3/GCS (Custom Controller)
• Auto-injection of cloud authentication info (Mutating Webhook; a sketch follows this list)
• Metadata storage in Kubernetes resources (Secret/ConfigMap) instead of a database
• Billing system based on info retrieved from Kubernetes
Our implementation also keeps pace with standardization: container runtime (OCI/CRI), networking (CNI), storage (CSI), etc.
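As an illustration of the Mutating Webhook pattern, a sketch of how such a credential injector could be registered (all names hypothetical; the webhook server itself is a separate component):

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: cloud-credential-injector          # hypothetical injector
webhooks:
- name: inject.gpuaas.example.com          # hypothetical qualified name
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Ignore
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]                    # mutate Pods at creation time
  clientConfig:
    service:
      namespace: gpuaas-system             # hypothetical namespace/service
      name: credential-injector
      path: /mutate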
Huge ecosystem / Highly extendable platform
• Restoration capability
• Easy management
• Observability
• Frequent updates with robust automation
• etc.
⇒ As we track the evolution of OSS, we guide business to success by continuing to improve our platforms.
We use the Container Storage Interface (CSI) to integrate storage with Kubernetes.
• CSI is an interface that connects container orchestrators with storage; it supports multiple orchestrators and multiple storage products. CSI only defines open specifications, and the features available differ according to the CSI driver.
◦ https://github.com/container-storage-interface/spec
◦ https://kubernetes-csi.github.io/docs/drivers.html
CSI driver capabilities include (a snapshot/restore sketch follows this list):
• Volume creation/deletion
• Volume attachment/detachment
• Volume expansion
• Volume cloning
• Snapshot & restore
• Topology designation
• Raw block volume creation
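For example, snapshot & restore are driven from Kubernetes like this; a minimal sketch with assumed class and claim names:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: dataset-snap
spec:
  volumeSnapshotClassName: csi-snapclass   # assumed snapshot class backed by the CSI driver
  source:
    persistentVolumeClaimName: dataset     # snapshot an existing PVC
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dataset-restored
spec:
  dataSource:                              # restore through the CSI driver
    name: dataset-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: [ReadWriteMany]
  resources:
    requests:
      storage: 100Gi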
Storage integration breaks down into multiple sub-features. Even if the storage's own features are adequate, they are unusable without a compatible CSI driver, and missing infrastructure features can prevent upper services from providing value.
CSI driver considerations when selecting storage:
1. Tracking speed of Kubernetes upstream features (release frequency, upstream participation)
2. CSI driver quality (including bug-fix speed)
NetApp has developed Trident since before CSI existed, and it has very good release frequency, features, and quality.
The driver implementation is not a black box (OSS)
• Our team wants to avoid always having to wait when a problem occurs
Excellent development organization
• Three-month release cycle (since Dec 2016)
• Proactive contributions to the community mean we can expect fast upstream responses
Supports both ReadWriteOnce and ReadWriteMany (AFF at CyberAgent; a StorageClass/PVC sketch follows)
• NAS: for ML workloads
• SAN: for the system applications underpinning GPUaaS (databases, Prometheus, etc.)
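A minimal sketch of a Trident-backed NAS class and a shared-dataset claim (class/claim names and sizes are illustrative):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ontap-nas
provisioner: csi.trident.netapp.io    # Trident CSI driver
parameters:
  backendType: "ontap-nas"            # NFS on AFF; ontap-san would give block volumes
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dataset
spec:
  storageClassName: ontap-nas
  accessModes: [ReadWriteMany]        # same dataset mountable from many training Pods
  resources:
    requests:
      storage: 1Ti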
Users operate GPUaaS through a dedicated web console or kubectl; both paths go through the GPUaaS API server to the Kubernetes API server.
• Launch notebooks
• Manage volumes
• Show billing info
• Manage projects
• etc.
We use the Kubernetes Namespace concept for tenants, and ClusterRole and RoleBinding to manage permissions.
"Add member to project from the WebUI console" = "Add RoleBinding"
This allows seamless management; Kubernetes itself effectively serves as the user database. Other operations are also possible from the WebUI console, depending on the user's role.
Each user and team gets its own namespace, with shared admin/member ClusterRoles bound per namespace via RoleBindings (a sketch follows).
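A minimal sketch of the RoleBinding created when a member is added (namespace and user names are illustrative):

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: userb-member
  namespace: teamx                    # the project (tenant) namespace
subjects:
- kind: User
  name: userB                         # the member added from the WebUI console
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: member                        # shared "member" ClusterRole, bound per namespace
  apiGroup: rbac.authorization.k8s.io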
Two ways to use the platform:
1. Web console: for users unfamiliar with Kubernetes (data scientists, researchers, etc.)
2. SSH-like environment using the Kubernetes CLI:

$ kubectl exec -it PODNAME-0 -- bash
PODNAME-0 #
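The -0 suffix suggests the notebook Pod is backed by a StatefulSet; a minimal sketch under that assumption (image and resource names illustrative):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: notebook                       # yields a stable Pod name, "notebook-0"
spec:
  serviceName: notebook
  replicas: 1
  selector:
    matchLabels:
      app: notebook
  template:
    metadata:
      labels:
        app: notebook
    spec:
      containers:
      - name: notebook
        image: jupyter/tensorflow-notebook   # users may optionally bring custom images
        resources:
          limits:
            nvidia.com/mig-1g.5gb: 1         # a right-sized MIG slice instead of a whole GPU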
We are developing machine learning infrastructure based on Kubernetes. We can't implement what we want if lower-layer features are inadequate.
⇒ Storage features and CSI driver functionality are important.
The stack: AI Platform on GPUaaS (Kubernetes), backed by DGX A100 and AFF A800.
We are considering a multi-DC rollout using Kubernetes portability.
Why We Need an Original AI Platform
"What do you not like about the currently available GPUaaS?" (multiple responses)
• "Can't train easily like on an AI platform"
• "Don't use it, since migration from the cloud is hard"
Same ML workflows as Google AI Platform
◦ Object storage as the hub
• Same operability as Google AI Platform
◦ kubectl plugin features
• Can reuse Google AI Platform configuration files
Training
• Capable of hyperparameter tuning
◦ Uses Katib, a Kubeflow component
Inference
• Can create inference endpoints
◦ Uses KFServing, a Kubeflow component
• Model version management
◦ Uses our original model metadata management infrastructure
• External access to inference endpoints
◦ Authorization with Istio External Authorization
Katib hyperparameter tuning components (an Experiment sketch follows this list):
• Experiment: one individual execution of hyperparameter tuning; holds all settings (algorithm, etc.)
• Suggestion: generates hyperparameters
• Trial: training executed with the suggested hyperparameters
• Metrics Container: saves the model's prediction accuracy during training as metrics
• Metrics Collector: writes metrics to the Katib DB and completes the tuning
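A minimal Katib v1beta1 Experiment sketch tying these components together (image, script, and metric names are illustrative):

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-search
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy     # metric the collector reads from each Trial
  algorithm:
    algorithmName: random             # the Suggestion service generates hyperparameters
  maxTrialCount: 12
  parallelTrialCount: 3
  parameters:
  - name: lr
    parameterType: double
    feasibleSpace:
      min: "0.001"
      max: "0.1"
  trialTemplate:
    primaryContainerName: training
    trialParameters:
    - name: learningRate
      reference: lr
    trialSpec:                        # each Trial runs as a Job with injected parameters
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: training
              image: my-mljob:v0.1
              command: ["python", "train.py", "--lr=${trialParameters.learningRate}"]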
KFServing InferenceService
◦ Containers that load models
◦ Provides an inference endpoint (FQDN)
• Both preprocessing and post-processing are supported
• PodSpec-style descriptions allow custom containers (a sketch follows)
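A minimal InferenceService sketch from the KFServing v1beta1 era (model URI and transformer image are illustrative; the API group was later renamed under KServe):

apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: mymodel                        # the endpoint FQDN is derived from this name
spec:
  predictor:
    tensorflow:
      storageUri: s3://models/mymodel/ # model pulled from object storage (GCS/S3)
  transformer:                         # optional pre/post-processing step
    containers:
    - name: kfserving-container
      image: my-transformer:v0.1       # custom container, described PodSpec-style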
Metadata management tied to models
• Originally an infrastructure for follow-up tests and reproduction
• Developed/operated by other departments
Special features
• Saves model version histories
• Can tie metadata to models
◦ Code, datasets, etc.
• Controllable model access rights
◦ 3 patterns: Read, Write, RW
• Can designate model location
◦ Compatible with GCS/S3
Create model
1. Execute "models create" command
2. Generate a model creation request on the API server
3. Register the model on the metadata server
Create model version
4. Execute "versions create" command
5. Generate an InferenceServer creation request on the API server
6. Apply the InferenceServer manifest
7. Create the InferenceServer (+ download the model)
Execute prediction
8. Execute "predict" command
9. Authorize on the external authorization server
10. Transfer the request to the InferenceServer
11. Execute inference
12. Return the inference response
(A hypothetical CLI sketch follows. Icons: https://icons8.jp/icons/set/video-card)
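The commands above suggest a gcloud-like workflow; a purely hypothetical invocation of such a kubectl plugin (command and flag names are illustrative, not the actual tool):

$ kubectl ml models create mymodel                                                  # steps 1-3
$ kubectl ml versions create v1 --model mymodel --origin s3://models/mymodel/v1/    # steps 4-7
$ kubectl ml predict --model mymodel --version v1 --json-instances instances.json   # steps 8-12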
On-premises advantages (our workloads handle complex data)
Features
• Flexible software stack assembly
• Easy connections with existing services
Costs
• Cloud expenses are high
• On-premises is inexpensive over the long term
The stack: AI Platform (Google AI Platform compatibility) on GPUaaS (Kubernetes), backed by DGX A100 and AFF A800 (ultra-high-performance GPUs and storage).
By making aggressive use of OSS and continuously improving the platform, we can make application development more agile and have a big impact on business.
AI Platform
• Pipeline features
◦ ML workflow automation
• Black-box optimization features
Physical hardware
• Add more DGX A100 / A100 GPUs and AFF A800 capacity
• Improve cost-effectiveness by also integrating other GPUs (T4, etc.)