Slide 1

Co-Evolution of Kubernetes and GCP Networking
Purvi Desai, Tim Hockin

Slide 2

Why did Kubernetes take off?
● Focused on app owners and app problems
● “Opinionated enough”
● Assumes platform implementations will vary
● Designed to work with popular OSS
● Follows understood conventions (mostly)

Slide 3

Networking is at the heart of Kubernetes
● Almost every k8s-deployed app needs it
● Networking can be complex
● Details vary a lot between environments
● App developers shouldn’t have to be networking experts

Slide 4

In the beginning
● Lineage of Borg
● Survey of container networking

Slide 5

Borg model
[Diagram: two machines, 10.240.0.1 and 10.240.0.2; Task A serves on host port 3306 on the first, Task C on host port 8000 on the second, with Task B alongside; tasks share their machine's IP]

Slide 6

Borg model
[Diagram: one machine, 10.240.0.1; Task A holds host port 3306 while Task B also wants 3306, so ports must be brokered]

Slide 7

Original Docker model
[Diagram: containers get private netns IPs that repeat across hosts: Container A at 172.16.1.1 and Container B at 172.16.1.2 on 10.240.0.1, Container C again at 172.16.1.1 on 10.240.0.2; cross-host traffic must use brokered host ports (9376, 11878), with SNAT and DNAT applied in each direction]

Slide 8

Kubernetes network model
Users should never have to worry about collisions that they themselves didn’t cause.
App developers shouldn’t have to be networking experts.

Slide 9

A real IP for every Pod
● Pod IPs are accessible from other pods, regardless of which VM they are on
● No brokering of port numbers
● Focus on the UX we want

Slide 10

Kubernetes model
[Diagram: the same two hosts, but pod IPs are now unique across the cluster: Container A at 172.16.1.1 and Container B at 172.16.1.2 on 10.240.0.1, Container C at 172.16.2.1 on 10.240.0.2; no NAT or port brokering between pods]

Slide 11

Proof of concept
Early experiments on GCP

Slide 12

Cloud networking
● VM-centric
● Containers were not really part of the design space
● What were the possibilities?
[Diagram: a VPC containing VMs]

Slide 13

Found a toehold
● The “Routes” API
● Every VM claims to be a router
● Disable IP spoofing protection
[Diagram: a GKE cluster in a VPC; each node runs a cbr0 bridge with IP spoofing checks off; the VPC routes 10.1.1.0/24 to Node A and 10.1.2.0/24 to Node B as per-node pod IP space]
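
A minimal sketch of that toehold, with hypothetical instance and range names; in reality the Kubernetes cloud provider drove this, not hand-typed commands:

# Disable the source/destination IP check ("spoofing protection")
# so the VM may send and receive pod IPs.
gcloud compute instances create node-a --can-ip-forward

# Claim a pod /24 for this node: the VPC forwards 10.1.1.0/24
# to node-a, which bridges it onto cbr0.
gcloud compute routes create node-a-pods \
    --destination-range=10.1.1.0/24 \
    --next-hop-instance=node-a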

Slide 14

The beginning of co-evolution
● Foundations were set
● UX was good: IP-per-Pod worked!
● We were able to push limits to 100 routes
● Does anyone remember how many nodes Kubernetes 1.0 supported?

Slide 15

Co-evolution journey
● Cluster networking
● Services and L4 load balancers
● L7 load balancers

Slide 16

Cluster networking: Routes model
● Drove major architectural changes to scale GCP’s Routes subsystem
● Rapid scaling over 2 years
[Chart: supported route counts over 2 years: 50, 100, 250, 500, 1000, 2000, 5000]

Slide 17

What’s the catch?
● IP spoofing disabled
● Semi-hidden allocations: potential for collisions with future uses of IPs
● Overlapping routes caused real confusion, hard to debug
[Diagram: Node A claims x.y.z.0/24 with IP spoofing off, while an overlapping x.y.z.0/24 route also exists in the VPC]

Slide 18

We can do better
● Better integration with other products
● Easy to reason about & debug
● Need a deeper concept: Alias IPs

Slide 19

Alias IPs & integrated networking
● Allocate a range for nodes
[Diagram: GKE Nodes A and B in a VPC, drawing from an RFC-1918 node range]

Slide 20

Alias IPs & integrated networking
● Allocate a range for nodes
● Allocate ranges for pods and services
[Diagram: the VPC now holds node, pod, and services ranges]

Slide 21

Alias IPs & integrated networking
● Allocate a range for nodes
● Allocate ranges for pods and services
● Carve off per-VM pod-ranges automatically as alias IPs
● The SDN understands alias IPs
● Per-node IPAM is in the cloud
[Diagram: node, pod, and services ranges in the VPC]

Slide 22

Alias IPs & integrated networking
● Allocate a range for nodes
● Allocate ranges for pods and services
● Carve off per-VM pod-ranges automatically as alias IPs
● The SDN understands alias IPs
● Per-node IPAM is in the cloud, on-node IPAM is on-node
● No VPC collisions, now or in the future
[Diagram: pods on Nodes A and B now draw their IPs from the pod range, alongside the node and services ranges in the VPC]
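
A sketch of how this surfaces, with hypothetical network, subnet, and range names: the pod and services ranges live in the VPC as secondary ranges, and a VPC-native cluster carves per-node alias slices out of them.

# The pod and services ranges are first-class VPC secondary ranges.
gcloud compute networks subnets create my-subnet \
    --network=my-vpc --range=10.0.0.0/20 \
    --secondary-range=pods=10.4.0.0/14,services=10.8.0.0/20

# --enable-ip-alias makes the cluster VPC-native: each node gets
# its pod range as an alias IP range the SDN understands.
gcloud container clusters create my-cluster \
    --subnetwork=my-subnet \
    --enable-ip-alias \
    --cluster-secondary-range-name=pods \
    --services-secondary-range-name=services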

Slide 23

Services & load-balancers
● LB support centered around clouds
● Implemented by the cloud provider controller

Slide 24

VIP-like LBs
● The LB delivers the packet from the original client IP to the original VIP
● iptables is programmed to capture the VIP, just like a ClusterIP
● iptables takes care of the rest
● GCP’s Network LB is VIP-like
● The LB only knows Nodes; k8s translates to Services and Pods
[Diagram: a VIP-like LB forwards packets (src: client IP, dst: VIP:port) unchanged to Nodes A and B, where iptables steers them to pods]
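
A rough sketch of the kind of nat-table rules kube-proxy programs for this; chain names, VIP, and pod IP are illustrative, and the real chains are auto-generated:

# Capture traffic to the VIP, exactly as for a ClusterIP.
iptables -t nat -A KUBE-SERVICES -d 35.193.47.73/32 -p tcp --dport 80 \
    -j KUBE-SVC-FRONTEND
# Pick one endpoint at random (one rule per backend pod)...
iptables -t nat -A KUBE-SVC-FRONTEND \
    -m statistic --mode random --probability 0.3333 -j KUBE-SEP-POD1
# ...and DNAT to that pod's IP and port.
iptables -t nat -A KUBE-SEP-POD1 -p tcp -j DNAT --to-destination 172.16.1.2:80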

Slide 25

Proxy-like LBs
● The LB acts as a proxy and delivers the packet from the proxy to a Node or Pod
● AWS’s ELB is proxy-like
● Again, LBs only understand Nodes, not Pods or Services
● How to indicate which Service?
[Diagram: the LB receives (src: client IP, dst: VIP:port) but re-sends (src: LB IP pool, dst: node IP:???); which node port identifies the Service?]

Slide 26

Introduction of NodePorts
● Allocate a static port across all nodes, one for each LB’ed Service
● A simple model to understand
● Portable: no external dependencies
[Diagram: the proxy-like LB targets :31234 on every node; (src: client IP, dst: VIP:port) becomes (src: LB IP pool, dst: node IP:nodePort)]
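
A minimal sketch of the API shape (nodePort is normally auto-allocated from a fixed range; it is pinned here only to match the diagram):

apiVersion: v1
kind: Service
metadata:
  name: frontend
spec:
  type: NodePort
  selector:
    app: guestbook
  ports:
  - port: 80          # the Service (ClusterIP) port
    targetPort: 80    # the port the pods listen on
    nodePort: 31234   # the same static port opened on every node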

Slide 27

What about portability?

apiVersion: v1
kind: Service
metadata:
  name: frontend
spec:
  type: LoadBalancer
  ports:
  - port: 80
  selector:
    app: guestbook
    tier: frontend

Slide 28

LoadBalancer builds on NodePort, which builds on ClusterIP

Slide 29

What about portability?

What you write:

apiVersion: v1
kind: Service
metadata:
  name: frontend
spec:
  type: LoadBalancer
  ports:
  - port: 80
  selector:
    app: guestbook
    tier: frontend

What the system fills in:

apiVersion: v1
kind: Service
metadata:
  name: frontend
spec:
  type: LoadBalancer
  clusterIP: 10.15.251.118
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
    nodePort: 30669
  selector:
    app: guestbook
    tier: frontend
status:
  loadBalancer:
    ingress:
    - ip: 35.193.47.73

Slide 30

Ingress: L7 LB
● All (or almost all) L7 LBs are proxy-like
● NodePorts are a decent starting point

Slide 31

Ingress: a portable L7 LB abstraction
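
For reference, a minimal Ingress routing all HTTP traffic to the frontend Service (shown with the current networking.k8s.io/v1 API; the talk-era API group differed):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: frontend
spec:
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: frontend
            port:
              number: 80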

Slide 32

Advancing LBs
Moving on from:
● Two levels of load balancing
● Inaccurate cloud health checks
● Inaccurate load balancing
● Multiple network hops
● Loss of LB features

Slide 33

Example: Cookie Affinity
● A feature of GCP’s HTTP LB
● The LB returns a cookie to the client
● Ensures repeated connections go to the same backend
[Diagram: a client, the LB, and pods on Nodes A and B]

Slide 34

Example: Cookie Affinity
● A feature of GCP’s HTTP LB
● The LB returns a cookie to the client
● Ensures repeated connections go to the same backend
[Diagram: the first connection flows through the LB to Node A, where iptables picks a pod]

Slide 35

Example: Cookie Affinity
● A feature of GCP’s HTTP LB
● The LB returns a cookie to the client
● Ensures repeated connections go to the same backend
[Diagram: the response carries a cookie for Node A]

Slide 36

Example: Cookie Affinity
● A feature of GCP’s HTTP LB
● The LB returns a cookie to the client
● Ensures repeated connections go to the same backend
[Diagram: the second connection goes to Node A, because of the cookie]

Slide 37

Example: Cookie Affinity
● A feature of GCP’s HTTP LB
● The LB returns a cookie to the client
● Ensures repeated connections go to the same backend
● But the second hop is not cookie-aware
[Diagram: the cookie pins the client to Node A, yet iptables on Node A may still forward to a pod on Node B]

Slide 38

Why can’t we load balance to Pod IPs?
[Diagram: the LB targeting pods on Nodes A and B directly]

Slide 39

Network Endpoint Groups in GCE LB
● Now the HTTP LB can target pod IPs, not just VMs
● Features like cookie affinity “Just Work”
● Balances the load without the downsides of a second hop
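
On GKE this is enabled per Service with an annotation (Service name and ports hypothetical); the NEG controller then registers pod IPs directly with the LB:

apiVersion: v1
kind: Service
metadata:
  name: frontend
  annotations:
    # Create Network Endpoint Groups so the HTTP LB targets
    # pod IPs instead of node:nodePort.
    cloud.google.com/neg: '{"ingress": true}'
spec:
  selector:
    app: guestbook
  ports:
  - port: 80
    targetPort: 8080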

Slide 40

Containers as first-class GCP SDN endpoints
Alias IPs made Pods first-class endpoints on the VPC.
Network Endpoint Groups made load balancing for containers as efficient and feature-rich as for VMs.

Slide 41

Problems when load-balancing to Pods
● Programming external LBs is slower than iptables
● A rolling update that goes faster than the LB can cause an outage

Slide 42

Rolling update
[Diagram: the LB in front of three v1 pods, with one new v2 pod coming up]
ReplicaSet my-app-v1: replicas: 3, selector: {app: MyApp, version: v1}
ReplicaSet my-app-v2: replicas: 1, selector: {app: MyApp, version: v2}

Slide 43

Rolling update
● Pod liveness: the state of the application in the pod, alive or not
● Pod readiness: ready to receive traffic
● Infra: ?
[Diagram: as before, with each pod shown as live and ready, but the infrastructure state unknown]
ReplicaSet my-app-v1: replicas: 3, selector: {app: MyApp, version: v1}
ReplicaSet my-app-v2: replicas: 1, selector: {app: MyApp, version: v2}

Slide 44

Wait for infrastructure?
● The LB is not yet programmed, but the Pod reports ready
● A Pod from the previous ReplicaSet is removed
● Capacity is reduced!
[Diagram: pods report live and ready while the infrastructure state is still unknown]
ReplicaSet my-app-v1: replicas: 3, selector: {app: MyApp, version: v1}
ReplicaSet my-app-v2: replicas: 1, selector: {app: MyApp, version: v2}

Slide 45

Pod Ready++
● A new state in the Pod lifecycle: wait for the infrastructure
[Diagram: the pod is live and ready; infra: wait]
ReplicaSet my-app-v1: replicas: 3, selector: {app: MyApp, version: v1}
ReplicaSet my-app-v2: replicas: 1, selector: {app: MyApp, version: v2}

Slide 46

Pod Ready++
● A new state in the Pod lifecycle: wait for the infrastructure
[Diagram: the pod is live and ready; infra: ready, so the rollout can safely proceed]
ReplicaSet my-app-v1: replicas: 3, selector: {app: MyApp, version: v1}
ReplicaSet my-app-v2: replicas: 1, selector: {app: MyApp, version: v2}
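
This shipped as Pod readiness gates. A sketch of the GKE flavor (pod name and image hypothetical; the condition type is the one GKE's NEG controller manages): the Pod does not count as Ready until the LB has actually been programmed with it.

apiVersion: v1
kind: Pod
metadata:
  name: my-app-v2-pod
spec:
  readinessGates:
  # The pod stays un-Ready until the NEG controller confirms
  # the load balancer backend includes it.
  - conditionType: "cloud.google.com/load-balancer-neg-ready"
  containers:
  - name: app
    image: gcr.io/example/my-app:v2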

Slide 47

What about all the features?
● Every LB has features not expressed by Kubernetes
● Principle: most implementations must be able to support most features

Slide 48

Express GCP’s LB features
● CRDs to the rescue
  ○ Linked from the Service
  ○ Implementation-specific
● BackendConfig
  ○ Allows us to expose features to GCP users without bothering anyone else
[Diagram: an Ingress fronting Services X and Y, each linked to its own BackendConfig, all programmed into GCLB]

Slide 49

BackendConfig

apiVersion: cloud.google.com/v1beta1
kind: BackendConfig
metadata:
  name: config-http
spec:
  cdn:
    enabled: true
    cachePolicy:
      includeHost: true
      includeProtocol: true
  iap:
    enabled: false
  timeoutSec: 5
  sessionAffinity:
    affinityType: GENERATED_COOKIE
    affinityCookieTtlSec: 180

apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    beta.cloud.google.com/backend-config: '{"ports": {"http":"config-http"}}'
spec:
  type: NodePort
  selector:
    app: my-app
  ports:
  - name: http
    port: 80
    targetPort: 8080

Slide 50

Mistakes in abstractions?
● Too flexible?
● Not flexible enough?
● Too monolithic?

Slide 51

Too flexible?
● Service is a very flexible abstraction
  ○ Target ports
  ○ Named ports
● Makes it hard to implement in some fabrics
  ○ DSR is incompatible with port remapping
● Inspired by Docker’s port-mapping model
● Hindsight: should probably have made it simpler
[Diagram: the VIP forwards :80 to pod port “http”, which is 8080 on Pod X, 8000 on Pod Y, and 8001 on Pod Z]
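
A sketch of that remapping (names hypothetical): the Service forwards :80 to a named port, which can resolve to a different number on every pod.

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app
  ports:
  - port: 80          # the VIP port
    targetPort: http  # resolved per-pod via the named container port
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-x
  labels:
    app: my-app
spec:
  containers:
  - name: app
    image: gcr.io/example/my-app:v1
    ports:
    - name: http          # another pod could name a different number "http"
      containerPort: 8080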

Slide 52

Not flexible enough?
● Service is not flexible enough in other ways
  ○ Can’t forward port ranges
  ○ Can’t forward a whole IP
● Makes it hard for some apps to use Services
  ○ Dynamic ports
  ○ Large numbers of ports
[Diagram: VIP :80 forwards to pod :8080 and VIP :443 to pod :8443; every port must be listed individually]

Slide 53

Too monolithic?
● The Service API is monolithic and complex
  ○ The `type` field does not capture all variants
  ○ Headless vs VIP
  ○ Selector vs manual endpoints
● External LB support is built-in but primitive
  ○ Should have had readiness gates long ago
  ○ No meaningful status

Slide 54

Looking ahead

Slide 55

Want more? Come to the SIG-Network Intro & Deep-Dive on Thursday!

Slide 56

Thank You!
Purvi Desai (@purvid)
Tim Hockin (@thockin)