Slide 1

Slide 1 text

Kubernetes-based GPU as a Service Platform using Open Source Software

Slide 2

Slide 2 text

Who are we?

Lee Yeongjae — AI Category Owner: Joined CyberAgent in 2016. Contributes to improving in-house products as a Solution Architect and to platform development (e.g., our OpenStack and container services). Also develops an AI platform as AI category owner.

Masaya Aoyama — K8s aaS Product Owner: Implemented GKE-like Kubernetes as a Service on our private cloud as product owner and supports the "Developer Experts" for Kubernetes projects at CyberAgent. Co-chair of the largest Cloud Native conference in Japan.

Shuichiro Makigaki — ML/Backend Engineer: Joined CyberAgent in 2016. Mainly works on in-house system development as a backend engineer and architect. Also works in platform development (OpenStack and container services) and is developing an AI platform.

Daisuke Takahashi — Infrastructure Engineer: Mainly responsible for development of the private OpenStack platform and Kubernetes-as-a-Service, as well as effective utilization of various accelerator devices. Building the underlying physical infrastructure for the GPUaaS/AI platform.

Slide 3

Slide 3 text

Agenda
1. Overview of CyberAgent, Inc.
2. Why we decided to use an on-premises environment
3. Kubernetes-based GPU-as-a-Service Platform
4. AI Platform
5. Physical layer around GPU
6. Conclusion

Slide 4

Slide 4 text

“To create the 21st century’s leading company”

3 Main Segments:

Media — a variety of media services enjoyed by countless people
➔ AbemaTV ➔ AWA ➔ WinTicket

Advertisement — offering comprehensive advertising solutions from agency business to ad technologies
➔ Dynalyst ➔ CA Wise ➔ AIR TRACK

Game — developing 50+ smartphone games (including eight major titles on various platforms)
➔ GRANBLUE FANTASY ➔ PRINCESS CONNECT! Re:Dive ➔ Shadowverse

※「ABEMA」: © AbemaTV, Inc.
※※「GRANBLUE FANTASY」「PRINCESS CONNECT! Re:Dive」: © Cygames, Inc.

Slide 5

Slide 5 text

Agenda
1. Overview of CyberAgent, Inc.
2. Why we decided to use an on-premises environment
3. Kubernetes-based GPU-as-a-Service Platform
4. AI Platform
5. Physical layer around GPU
6. Conclusion

Slide 6

Slide 6 text

Why an AI solution for advertising?
● To reduce the time needed to create effective ads and the domain knowledge of the customer's business that this requires
● To discover new, highly effective ad creatives
● To predict the performance of ad creatives and prioritize them by ranking
● To help analyze and improve the effectiveness of ad creatives
● To identify and avoid ads that cause negative reactions
[Illustration: a creative scored 97 points; a similar ad is detected]
※「GRANBLUE FANTASY」: © Cygames, Inc.

Slide 7

Slide 7 text

Why GPUs? We must perform high processing volumes at high speed, and GPU power can contribute to our business.
● There is a huge number of combinations of advertisements and media.
● Computational complexity increases as more demographic information (e.g., region, age, and gender) is considered.
● A fast learning cycle is required because advertisements change rapidly in response to changing consumer interests.
● The advertising system handles bidding; increased inference latency therefore affects our business critically, so the latency requirements are severe.

Slide 8

Slide 8 text

Why on-premises?
Functionalities
● To build a flexible software stack
● To link with existing services
Costs
● Cloud fees remain high
● Total on-premises costs will be lower in the long term

Slide 9

Slide 9 text

Why on-premises? [Chart: monthly cost ($) of GPU-only usage on cloud, for part of the business segment]

Slide 10

Slide 10 text

Agenda 1. Overview of CyberAgent, Inc. 2. Why we decided to use an on-premise environment 3. Kubernetes-based GPU-as-a-Service Platform 4. AI Platform 5. Physical layer around GPU 6. Conclusion

Slide 11

Slide 11 text

GPUaaS architecture overview and minimal requirements
● Provide GPU instances for users
○ Multiple instances
○ Multiple GPUs per instance
● Isolate GPUs between processes
● Provide shared volumes for each task
[Diagram: computing resource pool and storage pool]
Container icons: https://icons8.jp/icons/set/video-card
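As a minimal sketch of the first requirement: a pod on such a platform can request dedicated GPUs through the nvidia.com/gpu extended resource exposed by the NVIDIA device plugin. The pod name and image below are illustrative, not our actual configuration:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-instance-0               # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: nvidia/cuda:11.0-base   # any CUDA-enabled image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 2          # two whole GPUs, isolated from other pods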

Slide 12

Slide 12 text

Container-based vs. VM-based vs. bare-metal-based
● Pros of container-based
○ Easy image packaging of the runtime environment [cf. VM, metal]
○ Low overhead and short launch time [cf. VM]
○ Environment isolation for multi-tenancy [cf. metal]
● Cons of container-based
○ Weaker runtime isolation [cf. VM]
○ Short lifecycle [cf. VM, metal]

Slide 13

Slide 13 text

Kubernetes aggregates computing resources and orchestrates containers, volumes, etc. — that is, it aggregates GPUs and assigns them to processes together with volumes.
● Computing resource pool
● Storage pool: storage systems
○ Block
○ Shared filesystem
○ Others
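On the storage side, a shared filesystem can be requested as a ReadWriteMany PersistentVolumeClaim that many pods mount simultaneously. A sketch — the claim name and StorageClass are assumptions, not the platform's actual settings:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-datasets         # hypothetical name
spec:
  accessModes:
    - ReadWriteMany             # shared filesystem: mountable by many pods at once
  resources:
    requests:
      storage: 500Gi
  storageClassName: shared-fs   # assumption: a shared-filesystem StorageClass exists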

Slide 14

Slide 14 text

Isolation for multi-tenancy: Kubernetes namespaces can be isolated per tenant (e.g., a namespace for user A and another for user B).
NOTE: The container runtime (Docker / runc) cannot be completely isolated.
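To keep each tenant within bounds, a per-namespace ResourceQuota can cap GPU consumption. A minimal sketch, assuming a hypothetical tenant namespace named user-a:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: user-a               # hypothetical tenant namespace
spec:
  hard:
    requests.nvidia.com/gpu: "4"  # at most four GPUs requested in this namespace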

Slide 15

Slide 15 text

User authentication/authorization on Kubernetes
● Authentication
○ Kubernetes service accounts
○ OIDC integration
○ Cloud provider user/service account integration
● Authorization
○ Role-based access control (RBAC)
■ CRUD on specific resources only (see the sketch below)
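A sketch of the "CRUD on specific resources only" idea: a namespaced Role grants verbs on an allow-list of resources, and a RoleBinding ties it to a user (here assumed to be an OIDC identity). All names are illustrative; pods/exec is included so the kubectl exec access shown on a later slide works:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gpu-user                   # hypothetical role name
  namespace: user-a
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "pods/exec", "services", "configmaps", "statefulsets"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gpu-user-binding
  namespace: user-a
subjects:
  - kind: User
    name: user-a@example.com       # e.g., an identity asserted via OIDC
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: gpu-user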

Slide 16

Slide 16 text

Accessing GPU instances (containers)
1. Access via Jupyter notebook from a web browser
2. SSH-like access via the Kubernetes client tool:

$ kubectl exec -it PODNAME-0 -- bash
PODNAME-0 #

Slide 17

Slide 17 text

Why Kubernetes? For "Cloud Native".
Cloud Native means (https://github.com/cncf/toc/blob/master/DEFINITION.md):
● Resiliency
● Easy management
● Observability
● Fast updates
● Others
Methods:
A. Reconciliation by Kubernetes
B. Ecosystem
C. Extending and customizing
⇒ Continue to improve the platform with OSS for business success

Slide 18

Slide 18 text

A: Reconciliation loop
● Automatic recovery (convergence) to the desired state by many controllers
○ Re-launch containers (processes) quickly
○ Replace configs and credentials with the latest versions
○ Reassign load balancer members
For example, the ReplicaSet controller watches the desired ReplicaSet and drives the actual state toward it (replicas = 3):

kind: ReplicaSet
spec:
  replicas: 3
  template:
    spec:
      containers:
        - image: nginx:1.16

Slide 19

Slide 19 text

B: Automate with the Kubernetes ecosystem
● Prometheus/Grafana
○ Monitor GPU and server metrics
● cert-manager
○ Create and renew certificates with ACME
● ExternalDNS
○ Associate IP addresses with hostnames
● oauth2-proxy + NGINX Ingress
○ OAuth2 authentication for the WebUI
● Others
○ Autoscaling, templating settings, etc.
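For instance, ExternalDNS watches annotated Services and Ingresses and creates the matching DNS records automatically. A sketch with a hypothetical service and hostname:

apiVersion: v1
kind: Service
metadata:
  name: notebook                    # hypothetical name
  annotations:
    external-dns.alpha.kubernetes.io/hostname: notebook.example.com  # record to create
spec:
  type: LoadBalancer
  selector:
    app: notebook
  ports:
    - port: 443
      targetPort: 8888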

Slide 20

Slide 20 text

C: Extending and customizing Kubernetes
1. Implement custom controllers with the reconciliation model
e.g., S3 image caching for volumes
2. Mutate container settings with a webhook
e.g., automatically inject credentials
3. Access any status via the Kubernetes API
e.g., collect usage status for billing
4. Store metadata in Kubernetes using a ConfigMap or Secret
e.g., a user's container image references for the web UI (see the sketch below)
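As a sketch of item 4: per-user metadata such as image references can live in an ordinary ConfigMap that the web UI reads back through the Kubernetes API. Names and values are illustrative only:

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-a-images               # hypothetical name
  namespace: user-a
data:
  images: |                         # image references shown in the web UI
    registry.example.com/user-a/notebook:latest
    registry.example.com/user-a/training:v2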

Slide 21

Slide 21 text

Why Kubernetes? For "Cloud Native".
Cloud Native means (https://github.com/cncf/toc/blob/master/DEFINITION.md):
● Resiliency
● Easy management
● Observability
● Fast updates
● Others
Methods:
A. Reconciliation by Kubernetes
B. Ecosystem
C. Extending and customizing
⇒ Continue to improve the platform with OSS for business success

Slide 22

Slide 22 text

NVIDIA and OSS
● Kubernetes GPU device plugin: https://github.com/NVIDIA/k8s-device-plugin
● OSS monitoring stack (KubeCon EU 2020 presentation): https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/program/schedule/

Slide 23

Slide 23 text

Agenda
1. Overview of CyberAgent, Inc.
2. Why we decided to use an on-premises environment
3. Kubernetes-based GPU-as-a-Service Platform
4. AI Platform
5. Physical layer around GPU
6. Conclusion

Slide 24

Slide 24 text

Users' voices about our GPUaaS — "Please identify your dissatisfaction with GPUaaS" (multiple answers allowed):
● "Development speed drops, so I want to complete all processing on GCP or AWS."
● "It's not as easy to do machine learning as on the GCP AI Platform."
● "I don't use it because it's difficult to migrate from the public cloud."

Slide 25

Slide 25 text

Why an on-premises AI platform? The public cloud already provides many machine learning platforms — why should we build our own?
● To use computational resources in the right place: we should select what we use and where.
● To create a cutting-edge environment for innovative products: it is important to be best friends with our environment.

Slide 26

Slide 26 text

Example: AI Platform Training on Google Cloud — a service to train models with different customization options. It supports different machine types, distributed training, hyperparameter tuning, and GPU/TPU acceleration. (https://cloud.google.com/ai-platform)
Four simple steps:
1. Package the training code
2. Prepare a job definition in YAML (with hyperparameter tuning if required)
3. Upload the code & YAML to Google Cloud Storage
4. Submit: gcloud ai-platform jobs submit training
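The job definition in step 2 looks roughly like the following — a sketch based on AI Platform Training's documented config.yaml format, with the machine type, metric, and parameter ranges chosen purely for illustration:

trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu            # machine type with one GPU
  region: us-central1
  hyperparameters:                    # optional hyperparameter tuning section
    goal: MAXIMIZE
    hyperparameterMetricTag: accuracy
    maxTrials: 10
    maxParallelTrials: 2
    params:
      - parameterName: learning_rate
        type: DOUBLE
        minValue: 0.0001
        maxValue: 0.1
        scaleType: UNIT_LOG_SCALE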

Slide 27

Slide 27 text

Idea: a GCP AI Platform-compatible on-premises AI Platform — remove the barriers between on-premises & cloud. Ease of use is the justification: many users, a good I/O interface, continuous improvement, easy to introduce, etc.
Same configuration and code
● Introducing Kubeflow is reasonable
● Treat a GCP AI Platform Job as a Kubeflow (Katib) resource
● Abstract TFJob/PyTorchJob/K8s Job, etc.
Same commands
● Implement compatible commands as kubectl plugins

Slide 28

Slide 28 text

What is Kubeflow? An army knife for machine learning on Kubernetes (https://www.kubeflow.org/docs/started/kubeflow-overview/)
● On-premises deployment
● Resource usage control by Kubernetes
● Hyperparameter tuning by Katib

Slide 29

Slide 29 text

What is Katib (in Kubeflow)? The hyperparameter tuning component.
● Hyperparameter optimization
● Neural architecture search: optimize the neural network structure
● Multi-framework support: TensorFlow, PyTorch, etc.

Slide 30

Slide 30 text

Katib resources
[Diagram: Experiment → Suggestion → Trials → TFJob/PyTorchJob/Job → Pod (worker container + metrics container) → Metrics Collector → Katib DB]
● Experiment: the execution unit of hyperparameter tuning; contains all settings (e.g., algorithms)
● Suggestion: contains a hyperparameter pair according to the algorithm specified in the Experiment
● Trial: coordinates each hyperparameter from Suggestions
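In manifest form, an Experiment bundles the objective, the search algorithm, and the parameter space. A minimal sketch assuming Katib's v1beta1 API; the trial template that launches the TFJob/PyTorchJob workers is elided:

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: example-tuning              # hypothetical name
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy   # reported via the metrics collector
  algorithm:
    algorithmName: random           # drives the Suggestions
  maxTrialCount: 12
  parallelTrialCount: 3
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.0001"
        max: "0.1"
  # trialTemplate: ...              # defines the TFJob/PyTorchJob/Job run per Trial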

Slide 31

Slide 31 text

Overview of our AI platform

Slide 32

Slide 32 text

Same configuration/codes/commands — the same GCP-style job definition drives a cloud resource via gcloud and an on-premises resource via our kubectl plugin:

gcloud ai-platform jobs submit training ⇔ kubectl ai-platform jobs submit training
gcloud ai-platform jobs list|get ⇔ kubectl ai-platform jobs list|get
gcloud ai-platform jobs describe ⇔ kubectl ai-platform jobs describe
gcloud ai-platform jobs stream-logs ⇔ kubectl ai-platform jobs stream-logs
gcloud ai-platform jobs cancel ⇔ kubectl ai-platform jobs cancel

Slide 33

Slide 33 text

kubectl plugin implementation: abstract TFJob/K8s Job, etc. behind a Katib Experiment.
● Treat a GCP AIP Job as a Katib Experiment: parse the GCP-style job definition on the client side and convert it into a Katib Experiment.
● The operation is transparent to end users: when a user creates/deletes a Job, the plugin creates/deletes an Experiment internally, which in turn creates/deletes the TFJob/PyTorchJob resources.
[Diagram: Experiment → Suggestion → Trials → TFJob/PyTorchJob/Job → Pod (worker container + metrics container) → Metrics Collector → Katib DB]

Slide 34

Slide 34 text

kubectl plugin implementation (cont.): run a job without hyperparameter tuning. If the job definition has no hyperparameter tuning section, substitute a tuning section whose feasible space is limited to a single dummy value:

Parameters:
  - Name: dummy
    ParameterType: discrete
    FeasibleSpace:
      List: [0.02]

[Diagram: Experiment → Suggestion → Trials → TFJob/PyTorchJob/Job → Pod (worker container + metrics container) → Metrics Collector → Katib DB]

Slide 35

Slide 35 text

Serving should be in the right place. Serving can often run with fewer resources than training.
● Private cloud — pros: close to the data source, suitable for private tests; CPU on virtual machines and NVIDIA T4 are available
● Public cloud — pros: flexibility and availability via a global platform; CPU+GPU and TPU are available

Slide 36

Slide 36 text

Agenda
1. Overview of CyberAgent, Inc.
2. Why we decided to use an on-premises environment
3. Kubernetes-based GPU-as-a-Service Platform
4. AI Platform
5. Physical layer around GPU
6. Conclusion

Slide 37

Slide 37 text

Workstations in the MDF room (2019)
● Clustered unused GeForce GTX 1080 Tis with Kubernetes for researchers
○ Demand was much higher than expected, with many requests from developers for a similar service

Slide 38

Slide 38 text

Issues with the workstation cluster
1. Facility
○ Poor power and cooling capabilities of the MDF room for high-power devices
■ e.g., annual power outage
○ High-latency connection to our datacenter network (site-to-site VPN)
■ Not suited for inference-serving applications
2. Workstation
○ Lack of BMC/IPMI (remote management feature) on our machines
■ We would like to maintain them remotely due to the COVID-19 pandemic
3. GPU
○ Limited memory capacity of GeForce cards
■ Insufficient for some workloads

Slide 39

Slide 39 text

Infrastructure considerations (2020)
1. Location
○ Our datacenter in Tokyo
■ Sufficient power, cooling, and network capabilities
2. Hardware
○ Rack-mount servers (with IPMI)
■ Convenient maintenance
○ NVIDIA data center GPUs
■ Sufficient GPU memory
We began looking for GPU-accelerated servers at the end of April.

Slide 40

Slide 40 text

NVIDIA A100 / DGX A100
● Ampere architecture
○ Notable performance improvements over “Volta”
■ Up to 20x faster with sparsity
● 3rd-gen NVLink / 2nd-gen NVSwitch
○ Seamlessly scalable up to 16 GPUs
○ 2x faster GPU-to-GPU connection bandwidth than its predecessors
● Announcement/release timing (May 14th)
○ Announced while we were compiling our list of candidate GPU servers
■ which included DGX-1 and DGX-2

Slide 41

Slide 41 text

MIG: Multi-Instance GPU — MIG mode in the NVIDIA Ampere architecture can run seven jobs in parallel on a single A100 GPU (NVIDIA blog).
● Multi-tenancy
○ On DGX A100, its 8 GPUs can be sliced into up to 56 GPU instances
○ Administrators can assign a right-sized GPU to each job
● Guaranteed QoS
○ Every GPU instance has isolated memory and cores
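Once the device plugin's MIG support (being evaluated on the next slide) is in place, a MIG slice can be requested like any other extended resource. A sketch assuming the plugin's "mixed" strategy, which exposes per-profile resources such as nvidia.com/mig-1g.5gb:

apiVersion: v1
kind: Pod
metadata:
  name: mig-example                  # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:11.0-base
      command: ["nvidia-smi", "-L"]  # lists the single MIG device visible to the pod
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1   # one 1g.5gb slice of an A100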

Slide 42

Slide 42 text

DGX A100
● 1 node (for now)
○ Scale out if required
● Almost ready
☑ Setup (OS, Kubernetes, etc.)
☑ Benchmark
☐ Evaluate MIG support in the Kubernetes device plugin

Slide 43

Slide 43 text

Hardware around DGX A100
● Compute: NVIDIA DGX A100
● Network: Mellanox SN2010 (100GbE / 25GbE)
● Storage: NetApp AFF A800

Slide 44

Slide 44 text

Agenda
1. Overview of CyberAgent, Inc.
2. Why we decided to use an on-premises environment
3. Kubernetes-based GPU-as-a-Service Platform
4. AI Platform
5. Physical layer around GPU
6. Conclusion

Slide 45

Slide 45 text

Conclusion: purpose
Why do we need GPUs? We must perform high processing volumes at high speed, and GPU power can contribute to our business.
Advantages of our on-premises resources:
Functionalities
● To build a flexible software stack
● To link with existing services
Costs
● Cloud fees remain high
● Total on-premises costs will be lower in the long term

Slide 46

Slide 46 text

Conclusion: our solutions
● DGX A100 + AFF A800: high-performance GPU and storage
● GPUaaS (Kubernetes): operation automation with Kubernetes
● AI Platform: compatible with the GCP AI Platform
⇒ Improve the platform with the OSS stack. The agility of application development is increased by actively using OSS and improving the platform, which has a significant impact on the business.

Slide 47

Slide 47 text

ToDos
GPUaaS
● Automatic slicing of GPU instances (MIG)
On-premises AI Platform
● Serving implementation
● Pipeline implementation
A100 GPU / DGX A100
● Add more DGX A100 systems as our business grows
● Explore new possibilities of MIG and Kubernetes
● Integrate the A100 with other GPUs (e.g., T4) for cost-efficiency

Slide 48

Slide 48 text

Thank you for listening