Slide 1

Slide 1 text

Scalable Kubernetes Service 03-2021

Slide 2

Slide 2 text

Outline ● Intro ● Software at Exoscale ● Kubernetes at Exoscale ● Challenges met ● SKS: Scalable Kubernetes Service

Slide 3

Slide 3 text

Intro and context

Slide 4

Slide 4 text

Exoscale in a nutshell ● Infrastructure as a Service, 6 zones throughout Europe ● Now part of A1 group ● Public cloud in Geneva since 2013

Slide 5

Slide 5 text

The product

Slide 6

Slide 6 text

Software at Exoscale

Slide 7

Slide 7 text

What’s in a cloud provider? ● Datacenter & Network Operations ● Security ● Automation ● DBA ● Software development

Slide 8

Slide 8 text

The software we write ● Object Storage controller ● Internal SDN ● Compute orchestrator ● Load-balancer orchestrator ● Kubernetes orchestrator ● Web portal ● Customer Management ● Usage Metering ● Billing ● Integration tooling (CLI, terraform provider, …) ● Command and control, automation support

Slide 9

Slide 9 text

Things that didn’t exist in 2012 ● Ansible ● Terraform ● Docker ● Kubernetes ● Wifi ● Television

Slide 10

Slide 10 text

Initial stack ● Puppet for configuration management, in-house command and control ● 5 large external facing services, databases, a number of batch processing tools ● VM profiles per role, horizontal scaling where possible

Slide 11

Slide 11 text

Why container orchestration then?
● Puppet becomes a hot spot of activity
  ○ Hard to convey the entire infrastructure need of an application in one place
  ○ Configuration scattered across different places (load-balancing, firewalling, software, monitoring)
● Always making allocation decisions “on what class of machines should this run?”
● Overall low utilization, but contention during peaks!
● Large MTTR for failed nodes

Slide 12

Slide 12 text

Kubernetes at Exoscale

Slide 13

Slide 13 text

Initial exploration ● Strong interest in Apache Mesos (not tied to Docker, distributed systems building toolbox) ● Witnessed the fast adoption of Kubernetes ● Swarm and Nomad didn’t fit the bill for a number of reasons

Slide 14

Slide 14 text

Going for Kubernetes
● Traction
● The kicker was the open-ended abstractions: Service, Ingress, CRI, CNI, CSI
  ○ These allow providers to step in and provide a best-in-class implementation of the abstraction
  ○ The abstraction allows for a much better shot at expressing infrastructure independently of the location
● We decided to start with our API gateway
  ○ One of the most active projects at the time
  ○ Extremely sensitive to disruption

Slide 15

Slide 15 text

Challenges met

Slide 16

Slide 16 text

Keeping our promises in a containerized world
● Config management
  ○ Now next to the application: huge progress
  ○ Added internal tooling to generate manifests
● Deployments
  ○ Registries vs. Debian repositories
  ○ ArgoCD for managing deployments
● Security
  ○ Network and security policies
  ○ OPA (WIP)

Slide 17

Slide 17 text

Container networking
● Network used to be boring
  ○ A public IP per VM
  ○ Security groups to provide isolation
● Exoscale private networks not ready for CNI
● Performance analysis led to the use of Calico

Slide 18

Slide 18 text

SKS: Scalable Kubernetes Service

Slide 19

Slide 19 text

Redux of what we learnt
● Network
  ○ Calico
● Security
  ○ Several certificate authorities per cluster
  ○ Encryption key for secrets, per cluster
  ○ WireGuard available on the template
  ○ Cluster access using certificates (support for users, groups, TTL)
● Exoscale Cloud Controller Manager
  ○ To validate worker nodes
  ○ Network Load Balancer integration
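
Certificate-based cluster access builds on the way Kubernetes reads X.509 client certificates: the subject's Common Name is taken as the user name and each Organization entry as a group. The following is a minimal sketch of that mapping, with the TTL turned into an expiry time. The function name, the example user and the group are hypothetical, and this is not the SKS implementation.

```python
from datetime import datetime, timedelta, timezone


def client_cert_request(user, groups, ttl_seconds, now=None):
    """Describe the client certificate to request for a user.

    Kubernetes maps the subject Common Name (CN) to the user name and
    every Organization (O) entry to a group. The TTL bounds validity.
    """
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    now = now or datetime.now(timezone.utc)
    # Subject in the slash-separated form accepted by `openssl req -subj`.
    subject = "".join(f"/O={g}" for g in groups) + f"/CN={user}"
    return {
        "subject": subject,
        "not_after": now + timedelta(seconds=ttl_seconds),
    }


# Hypothetical example: a credential valid for one hour.
req = client_cert_request("alice", ["system:masters"], 3600)
print(req["subject"], req["not_after"].isoformat())
```

A short TTL limits the damage of a leaked kubeconfig, since client certificates cannot be revoked individually by the API server.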

Slide 20

Slide 20 text

Full integration in the Exoscale stack
● Network Load Balancer
  ○ “LoadBalancer” Kubernetes services
  ○ Configuration using annotations
● Instance Pools
  ○ We rely on instance pools for the nodepools
  ○ Same properties (nodes cycling…)
● Security groups (per nodepool)
● Anti-affinity groups (per nodepool)
● API and tooling
  ○ CLI, Terraform
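
To make the “LoadBalancer” service and annotation points concrete, here is a small sketch that builds a Service manifest of type LoadBalancer and prints it as JSON, which `kubectl apply -f` accepts. The helper name, the service name and the annotation key are placeholders chosen for illustration; the real annotation keys are defined by the Exoscale Cloud Controller Manager and are not listed in these slides.

```python
import json


def load_balancer_service(name, app_label, port, annotations=None):
    """Build a Kubernetes Service manifest of type LoadBalancer."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "annotations": dict(annotations or {})},
        "spec": {
            "type": "LoadBalancer",
            "selector": {"app": app_label},
            "ports": [{"port": port, "targetPort": port, "protocol": "TCP"}],
        },
    }


# The annotation key below is a placeholder, not a real Exoscale key.
manifest = load_balancer_service(
    "web", "web", 80, {"example.invalid/load-balancer-name": "web-nlb"}
)
print(json.dumps(manifest, indent=2))
```

Keeping the provider-specific settings in annotations leaves the rest of the manifest portable across clusters.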

Slide 21

Slide 21 text

Product objectives
● Speed
  ○ Create clusters in ~100 seconds
  ○ New nodes in the cluster in ~120 seconds (available in “kubectl get nodes”)
  ○ Should be faster in the future
● Seamless start
● CNCF compliance
● Reliability: two offerings
  ○ starter: no SLA, non-HA control plane, free
  ○ pro: SLA, HA control plane
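
One way to check the "available in kubectl get nodes" objective is to look at each node's Ready condition in the output of `kubectl get nodes -o json`. The sketch below parses that structure; the function name and the sample data are made up for illustration, and the sample is trimmed to the fields being read.

```python
import json


def ready_nodes(node_list):
    """Return the names of nodes whose Ready condition is "True"."""
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        if any(c.get("type") == "Ready" and c.get("status") == "True"
               for c in conditions):
            names.append(node["metadata"]["name"])
    return names


# Trimmed, made-up sample of `kubectl get nodes -o json` output.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "pool-a-1"},
   "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
  {"metadata": {"name": "pool-a-2"},
   "status": {"conditions": [{"type": "Ready", "status": "False"}]}}
]}
""")
print(ready_nodes(sample))
```

Polling this function until the expected node count is reached gives a simple measurement of the time a new node takes to join.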

Slide 22

Slide 22 text

Demo!

Slide 23

Slide 23 text

[Diagram labels: Outside world, Exoscale Load Balancer, Kubernetes “LoadBalancer” Service, Kubernetes dashboard, Kubernetes Cluster]

Slide 24

Slide 24 text

[Diagram labels: Outside world, Exoscale Load Balancer (×2), Kubernetes “LoadBalancer” Service (×2), Kubernetes dashboard, Nginx ingress controller, App (×2), Kubernetes Cluster]

Slide 25

Slide 25 text

Additional notes and future work

Slide 26

Slide 26 text

Advanced use cases
● Cluster lifecycle management
  ○ Cluster upgrades (next patch, next minor)
● Certificate management
  ○ You can retrieve various CA certificates in order to configure some components
● Multiple nodepools
  ○ Each nodepool is independent
  ○ Can have different disk sizes, offerings, anti-affinity groups, networking rules…
  ○ Can be scaled independently
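
The "next patch, next minor" upgrade rule can be read as: from a given version, a cluster may move to a newer patch of the same minor release or to the following minor release. The sketch below encodes that reading. It is an assumption about the rule, not the SKS implementation, and the function names and version list are hypothetical.

```python
def parse(version):
    """Split "1.20.4" into a tuple of ints (1, 20, 4)."""
    major, minor, patch = (int(p) for p in version.split("."))
    return major, minor, patch


def upgrade_targets(current, available):
    """Return versions that are a newer patch of the same minor,
    or any patch of the next minor, sorted oldest first."""
    cur = parse(current)
    targets = []
    for candidate in available:
        v = parse(candidate)
        same_minor_newer_patch = v[:2] == cur[:2] and v[2] > cur[2]
        next_minor = v[0] == cur[0] and v[1] == cur[1] + 1
        if same_minor_newer_patch or next_minor:
            targets.append(candidate)
    return sorted(targets, key=parse)


# Hypothetical version list, for illustration only.
print(upgrade_targets("1.20.2", ["1.19.8", "1.20.2", "1.20.4", "1.21.0", "1.22.0"]))
```

Downgrades and jumps of more than one minor release are excluded, which matches the usual Kubernetes guidance of upgrading one minor version at a time.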

Slide 27

Slide 27 text

Ongoing work
● Cluster autoscaler (short-term)
  ○ Automatically scale nodepools based on Kubernetes metrics
● Web portal (short-term)
● Blueprints (short-term)
  ○ Example manifests for common things
● GPU nodepools
● More add-ons: dashboard, ingress, metrics-server
  ○ metrics-server should arrive soon
● Persistent volumes: specific add-on
● Automatic security group management
● Managed container registry
● Advanced IAM integration