Slide 1

Slide 1 text

Toward Cloud Native HPC

Slide 2

Slide 2 text

Outline
• Cloud Native Paradigm
• CNCF Ecosystem
• HPC Adoption
• Public Cloud Use Cases
• What’s Next?

Slide 3

Slide 3 text

Non-Business Use
Adoption of Public Clouds in HPC Sites
• 2011: 13%
• 2018: 74%
Source: Hyperion Research study “Cloud Computing Comes of Age”, 2019

Slide 4

Slide 4 text

Cloud Native
• Develop, Deploy & Run
• Mostly Open Source
• Cloud Computing Model

Slide 5

Slide 5 text

Microservices
Architectural design that breaks an application into independent, loosely coupled, individually deployable services.
• Portability was a challenge.
Pillars: Microservices, Containers, Orchestration

Slide 6

Slide 6 text

Containers
Bundling of an application and all its dependencies as a package to be deployed regardless of environment.

Slide 7

Slide 7 text

Orchestration
Automation of the operational effort required to run the lifecycle of a container, its workloads, and its services.
• Provisioning, deployment, scaling (up and down), networking, load balancing, and more.
• Enabling DevOps and CI/CD
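The scaling automation listed above is commonly built as a loop that compares a desired state with the observed state and emits corrective actions. The following is a minimal sketch of that idea only; `ServiceState` and `reconcile` are hypothetical names invented for illustration and are not part of Kubernetes or any other orchestrator's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceState:
    """Hypothetical record: replicas wanted vs. replicas currently running."""
    name: str
    desired: int
    running: int


def reconcile(state: ServiceState) -> List[str]:
    """Return the actions that move `running` toward `desired`."""
    diff = state.desired - state.running
    if diff > 0:
        # Scale up: start the missing replicas, numbered after the existing ones.
        return [f"start {state.name}-{state.running + i}" for i in range(diff)]
    if diff < 0:
        # Scale down: stop the highest-numbered replicas first.
        return [f"stop {state.name}-{state.running - 1 - i}" for i in range(-diff)]
    return []  # Already converged; nothing to do.


if __name__ == "__main__":
    print(reconcile(ServiceState("solver", desired=3, running=1)))
    print(reconcile(ServiceState("solver", desired=1, running=3)))
```

A real orchestrator would run such a comparison continuously and also handle failures, networking, and placement; this sketch covers only the scale-up and scale-down decision.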

Slide 8

Slide 8 text

CNCF (Cloud Native Computing Foundation)
• Google & Linux Foundation Project
• Founded in 2015
• Advance Container Technology

Slide 9

Slide 9 text

CNCF (Cloud Native Computing Foundation)
• Google & Linux Foundation Project
• Founded in 2015
• Advance Container Technology

Slide 10

Slide 10 text

CNCF (Cloud Native Computing Foundation)
• Google & Linux Foundation Project
• Founded in 2015
• Advance Container Technology
Landscape categories
• App Definition & Development: Database, Streaming & Messaging, App Def & Image building, CI/CD
• Orchestration & Management: Scheduling & Orchestration, Coordination & Service Discovery, Remote Procedure Call, Service Proxy, API Gateway, Service Mesh
• Runtime: Cloud Native Storage, Container Runtime, Cloud Native Network
• Provisioning: Automation & Configuration, Container Registry, Security & Compliance, Key Management
• Special: Kubernetes Certified Service Provider, Kubernetes Training Partner
• Platform: Certified Kubernetes Distribution, Host, Installer
• Observability & Analysis: Monitoring, Logging, Tracing, Chaos Engineering, Continuous Optimization
• Serverless

Slide 11

Slide 11 text

CNCF (Cloud Native Computing Foundation)
• Google & Linux Foundation Project
• Founded in 2015
• Advance Container Technology
Landscape categories
• App Definition & Development: Database, Streaming & Messaging, App Def & Image building, CI/CD
• Orchestration & Management: Scheduling & Orchestration, Coordination & Service Discovery, Remote Procedure Call, Service Proxy, API Gateway, Service Mesh
• Runtime: Cloud Native Storage, Container Runtime, Cloud Native Network
• Provisioning: Automation & Configuration, Container Registry, Security & Compliance, Key Management
• Special: Kubernetes Certified Service Provider, Kubernetes Training Partner
• Platform: Certified Kubernetes Distribution, Host, Installer
• Observability & Analysis: Monitoring, Logging, Tracing, Chaos Engineering, Continuous Optimization
• Serverless
High Performance Computing: Scheduling, Observability, Storage, Network, UX

Slide 12

Slide 12 text

Evolution Timeline
Phases: Cloud Native, Distributed Cloud, Distributed Cloud Native
Years on axis: 2011, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022
• 2011 Cycle Computing: Running cloud HPC around 8 regions
• Kubernetes: CNCF launched, v1.0 GA
• Huawei Cloud Container Engine (CCE)
• Google Kubernetes Engine (GKE)
• CERN: 1000 node POC
• Slurmnetes: Batch scheduling failed attempts
• KubeFlow: Machine learning framework for operations, pipelines, training & deployment.
• KubeEdge: CNCF’s first intelligent edge computing project
• Volcano: CNCF’s first batch scheduling project
• MindSpore: Deep Learning framework for mobile, edge, cloud scenarios
• Karmada: CNCF’s first multi-cloud container orchestration project
• Kueue: Kubernetes-native job queueing
Expanded upon chart from https://bit.ly/FrontiersCloudNative

Slide 13

Slide 13 text

HPC Cloud Adoption Challenges (Special Hardware, Data Gravity, Paradigm Shift)
Special Hardware
• Network latency, as in specialized InfiniBand (IB) interconnects
• GPUs, accelerators, NUMA, etc.
• CPU architecture and topology
TOP500

Slide 14

Slide 14 text

HPC Cloud Adoption Challenges (Special Hardware, Data Gravity, Paradigm Shift)
Data Gravity
• Data governance
• Data residency
• Egress cost
• The higher the availability, the higher the cost
Diagram labels: Services, Data, Apps, Throughput, Latency

Slide 15

Slide 15 text

HPC Cloud Adoption Challenges (Special Hardware, Data Gravity, Paradigm Shift)
Paradigm Shift
• Both learning and adoption
• Distributing workload as images (registry)
Diagram: Kubernetes Control Plane, three K8s Kubelets, Image Registry, Persistent Storage
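To make "distributing workload as images" concrete, the sketch below builds a Kubernetes `batch/v1` Job manifest that points at a container image in a registry; the control plane schedules the pods and each node's kubelet pulls the image. The helper name `batch_job_manifest`, the job name, the command, and the registry path `registry.example.org/hpc/solver:1.0` are hypothetical placeholders, not taken from the slides.

```python
import json


def batch_job_manifest(name, image, command, parallelism=1):
    """Build a Kubernetes batch/v1 Job manifest as a plain dict (illustrative helper)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Run `parallelism` pods at once and require that many completions.
            "parallelism": parallelism,
            "completions": parallelism,
            "backoffLimit": 0,  # do not retry failed pods in this sketch
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        # The image reference is what ties the workload to the registry.
                        {"name": name, "image": image, "command": command}
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = batch_job_manifest(
        "solver",
        "registry.example.org/hpc/solver:1.0",  # hypothetical image path
        ["./solver", "--input", "/data/case1"],  # hypothetical command
        parallelism=4,
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and submitted with `kubectl apply -f`, assuming access to a cluster that can pull the image. An HPC-oriented setup would typically add resource requests and persistent storage mounts, which are omitted here.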

Slide 16

Slide 16 text

Research End User: CERN
https://bit.ly/HPCSAUDI-cern-org
CERN is the European Organization for Nuclear Research.
• Kubernetes use case: Particle Physics
• Experimented with virtualization early to enable ease of management and automation.
• 2017: first Kubernetes POC
• 1000 worker nodes
• Data: 330 PB
• Hybrid on-demand infra
• 3 hrs > 15 min

Slide 17

Slide 17 text

Public Cloud Use Cases
“Focus on your application and results”
• Dynamically provision resources
• Plans, schedules, and executes
• Fully managed “Serverless”
• Free
• Integration with AWS services
2020 Statistics
• Largest cluster: 1,243,000 vCPUs
• Largest container image: 30 GB
• No. of simultaneous jobs: 500,000
• Customers: Thousands

Slide 18

Slide 18 text

The CNCF Community
“It's very hard right now to justify developing a new product in-house. There is really no real reason to keep doing that. It's much easier for us to try it out, and if we see it's a good solution, we try to reach out to the community and start working with that community.”

Slide 19

Slide 19 text

Where to next?
• Kubernetes Batch HPC Day North America 2022
• SC22 Containers and New Orchestration Paradigms for Isolated Environments in HPC
• CNCF Research User Group
• CNCF Technical Advisory Group for Runtime
• Kubernetes Community: Batch Working Group
• CNCF Batch System Initiative Working Group