The Co-Evolution of Kubernetes and GCP Networking, KubeCon EU 2019

Tim Hockin

May 22, 2019

Transcript

  1. Purvi Desai, Tim Hockin: Co-Evolution of Kubernetes and GCP Networking

  2. Why did Kubernetes take off?
     • Focused on app owners and app problems
     • "Opinionated enough"
     • Assumes platform implementations will vary
     • Designed to work with popular OSS
     • Follows understood conventions (mostly)

  3. Networking is at the heart of Kubernetes
     • Almost every k8s-deployed app needs it
     • Networking can be complex
     • Details vary a lot between environments
     • App developers shouldn't have to be networking experts

  4. In the beginning: Lineage of Borg, Survey of Container Networking

  5. Borg model
     [Diagram: hosts 10.240.0.1 and 10.240.0.2 running Task A (port 3306), Task B, and Task C (port 8000) directly on the host IPs]

  6. Borg model
     [Diagram: host 10.240.0.1 with Task A on port 3306 and Task B also on port 3306: two tasks sharing one host IP collide on the same port]

  7. Original docker model
     [Diagram: host 10.240.0.1 with Container A (netns 172.16.1.1, port 80) and Container B (172.16.1.2), host 10.240.0.2 with Container C (netns 172.16.1.1, port 8000); both hosts reuse the same private range, and traffic is mapped through host ports 9376 and 11878 via DNAT/SNAT]

  8. Kubernetes network model
     • Users should never have to worry about collisions that they themselves didn't cause
     • App developers shouldn't have to be networking experts

  9. A real IP for every Pod
     • Pod IPs are accessible from other pods, regardless of which VM they are on
     • No brokering of port numbers
     • Focus on the UX we want

  10. Kubernetes model
     [Diagram: host 10.240.0.1 with Container A (netns 172.16.1.1, port 80) and Container B (172.16.1.2), host 10.240.0.2 with Container C (netns 172.16.2.1, port 8000); every pod gets a routable IP from a non-overlapping per-node range, with no NAT between pods]

  11. Proof of concept: Early Experiments on GCP

  12. Cloud networking
     • VM-centric
     • Containers are not really a part of the design space
     • What were the possibilities?

  13. Found a toehold
     • The "Routes" API
     • Every VM claims to be a router
     • Disable IP spoofing protection
     [Diagram: GKE nodes on a VPC, each with a cbr0 bridge and IP spoofing protection off; Node A owns pod range 10.1.1.0/24, Node B owns 10.1.2.0/24, and VPC routes send 10.1.1.0/24 to Node A and 10.1.2.0/24 to Node B]

  14. The beginning of co-evolution
     • Foundations were set
     • UX was good - IP-per-Pod worked!
     • We were able to push limits to 100 routes
     • Does anyone remember how many nodes Kubernetes 1.0 supported?

  15. Co-evolution Journey
     • Cluster Networking
     • Services and L4 Load Balancers
     • L7 Load Balancer

  16. Cluster Networking: Routes model
     • Drove major architectural changes to scale GCP's Routes subsystem
     • Rapid scaling over 2 years
     [Chart: supported node counts grew over 2 years: 50, 100, 250, 500, 1000, 2000, 5000]

  17. What's the catch?
     • IP spoofing disabled
     • Semi-hidden allocations: potential for collisions with future uses of IPs
     • Overlapping routes caused real confusion, hard to debug
     [Diagram: overlapping x.y.z.0/24 routes pointing at Node A in the VPC, with IP spoofing protection off]

  18. We can do better
     • Better integration with other products
     • Easy to reason about & debug
     • Need a deeper concept: Alias IPs

  19. Alias IPs & integrated networking
     • Allocate range for nodes
     [Diagram: GKE Node A and Node B on a VPC, with an RFC-1918 node range allocated]

  20. Alias IPs & integrated networking
     • Allocate range for nodes
     • Allocate ranges for pods and services
     [Diagram: as before, plus a pod range and a services range allocated in the VPC]

  21. Alias IPs & integrated networking
     • Allocate range for nodes
     • Allocate ranges for pods and services
     • Carve off per-VM pod ranges automatically as alias IPs
     • SDN understands Alias IPs
     • Per-node IPAM is in the cloud

  22. Alias IPs & integrated networking
     • Allocate range for nodes
     • Allocate ranges for pods and services
     • Carve off per-VM pod ranges automatically as alias IPs
     • SDN understands Alias IPs
     • Per-node IPAM is in the cloud, on-node IPAM is on-node
     • No VPC collisions, now or in the future (how this surfaces in the Kubernetes API is sketched below)
     [Diagram: GKE Node A and Node B on a VPC with node, pod, and services ranges; each node hosts pods out of its own alias range]
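     A rough sketch of how the per-node carving surfaces in the Kubernetes API (the node name and CIDR below are hypothetical): each Node object records the pod range assigned to it, and with Alias IPs that same range is also programmed on the VM as an alias IP range, so the VPC knows where every pod IP lives without any Routes entries.

       # Sketch only: names and CIDRs are made up for illustration.
       apiVersion: v1
       kind: Node
       metadata:
         name: node-a
       spec:
         # Pod range carved off for this node. With Alias IPs, the same
         # range is attached to the VM's network interface as an alias
         # IP range, so the GCP SDN routes pod IPs natively.
         podCIDR: 10.4.1.0/24
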
  23. Services & load-balancers
     • LB support centered around clouds
     • Implemented by the cloud provider controller

  24. VIP-like LBs
     • The LB delivers the packet from the original client IP to the original VIP
     • iptables rules are programmed to capture the VIP just like a Cluster IP
     • iptables takes care of the rest
     • GCP's Network LB is VIP-like
     • The LB only knows Nodes; k8s translates to Services and Pods
     [Diagram: a VIP-like LB forwards packets (src: client IP, dst: VIP:port) unchanged to nodes, where iptables handles them]

  25. Proxy-like LBs
     • The LB acts as a proxy and delivers the packet from the proxy to a Node or Pod
     • AWS's ELB is proxy-like
     • Again, LBs only understand Nodes, not Pods or Services
     • How to indicate which Service?
     [Diagram: a proxy-like LB receives src: client IP, dst: VIP:port and re-sends as src: LB IP (pool), dst: node IP:??? (which port to use is the open question)]

  26. Introduction of NodePorts
     • Allocate a static port across all nodes, one for each LB'ed Service (example manifest below)
     • Simple-to-understand model
     • Portable: no external dependencies
     [Diagram: a proxy-like LB sends traffic to :31234 on any node; src: client IP, dst: VIP:port becomes src: LB IP (pool), dst: node IP:nodeport]
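     For reference, a minimal Service of type NodePort looks roughly like this (a sketch; the names and the nodePort value mirror the slides and are illustrative, and Kubernetes picks a port from its default 30000-32767 range if nodePort is omitted):

       apiVersion: v1
       kind: Service
       metadata:
         name: frontend
       spec:
         type: NodePort
         selector:
           app: guestbook
           tier: frontend
         ports:
         - port: 80        # the Service's cluster-internal port
           targetPort: 80  # the container port in the backing Pods
           nodePort: 31234 # opened on every node; an external proxy LB targets node:31234
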
  27. What about portability?
     apiVersion: v1
     kind: Service
     metadata:
       name: frontend
     spec:
       type: LoadBalancer
       ports:
       - port: 80
       selector:
         app: guestbook
         tier: frontend

  28. LoadBalancer builds on NodePort, which builds on ClusterIP

  29. What about portability?
     After the cloud provider fills in defaults and status:
     apiVersion: v1
     kind: Service
     metadata:
       name: frontend
     spec:
       type: LoadBalancer
       clusterIP: 10.15.251.118
       ports:
       - port: 80
         protocol: TCP
         targetPort: 80
         nodePort: 30669
       selector:
         app: guestbook
         tier: frontend
     status:
       loadBalancer:
         ingress:
         - ip: 35.193.47.73

     As originally written by the user:
     apiVersion: v1
     kind: Service
     metadata:
       name: frontend
     spec:
       type: LoadBalancer
       ports:
       - port: 80
       selector:
         app: guestbook
         tier: frontend

  30. Ingress: L7 LB
     • All (or almost all) L7 LBs are proxy-like
     • NodePorts are a decent starting point

  31. Portable L7 LB Abstraction Ingress
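     A minimal Ingress from roughly this era, as a sketch (the API group has since moved to networking.k8s.io; the service name and port are illustrative). On GKE this provisions an HTTP(S) load balancer whose backends are, by default, the Service's NodePort on every node:

       apiVersion: extensions/v1beta1   # the Ingress API group in use at the time
       kind: Ingress
       metadata:
         name: frontend-ingress
       spec:
         rules:
         - http:
             paths:
             - path: /
               backend:
                 serviceName: frontend  # traffic is sent to this Service's NodePort
                 servicePort: 80
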

  32. Advancing LBs from:
     • Two levels of load balancing
     • Inaccurate cloud health checks
     • Inaccurate load balancing
     • Multiple network hops
     • Loss of LB features

  33. Example: Cookie Affinity
     • A feature of GCP's HTTP LB
     • LB returns a cookie to the client
     • Ensures repeated connections go to the same backend
     [Diagram: client, LB, and pods on Node A and Node B]

  34. Example: Cookie Affinity
     • A feature of GCP's HTTP LB
     • LB returns a cookie to the client
     • Ensures repeated connections go to the same backend
     [Diagram: the first connection goes from the client through the LB to a node, then through iptables to a pod]

  35. Example: Cookie Affinity
     • A feature of GCP's HTTP LB
     • LB returns a cookie to the client
     • Ensures repeated connections go to the same backend
     [Diagram: the response returns with a cookie for Node A]

  36. Example: Cookie Affinity
     • A feature of GCP's HTTP LB
     • LB returns a cookie to the client
     • Ensures repeated connections go to the same backend
     [Diagram: the second connection goes to Node A because of the cookie]

  37. Example: Cookie Affinity
     • A feature of GCP's HTTP LB
     • LB returns a cookie to the client
     • Ensures repeated connections go to the same backend
     • The second hop is not cookie-aware
     [Diagram: iptables on the node picks a backend pod without seeing the cookie]

  38. Why can't we load balance to Pod IPs?
     [Diagram: client and LB, with pods on Node A and Node B]

  39. Network Endpoint Groups in GCE LB
     • Now the HTTP LB can target pod IPs, not just VMs (example below)
     • Features like cookie affinity "Just Work"
     • Balances the load without the downsides of a second hop
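     On GKE, a Service opts into container-native load balancing with the cloud.google.com/neg annotation; the sketch below shows the general shape (treat the exact annotation value and ports as illustrative and verify against current GKE documentation):

       apiVersion: v1
       kind: Service
       metadata:
         name: frontend
         annotations:
           # Ask GKE to create Network Endpoint Groups so the HTTP(S) LB
           # targets Pod IPs directly instead of node:nodePort.
           cloud.google.com/neg: '{"ingress": true}'
       spec:
         type: ClusterIP     # no NodePort needed once the LB targets Pods
         selector:
           app: guestbook
           tier: frontend
         ports:
         - port: 80
           targetPort: 8080
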
  40. Containers as first-class GCP SDN endpoints
     • Alias IPs made Pods first-class endpoints on the VPC
     • Network Endpoint Groups made load balancing for containers as efficient and feature-rich as for VMs

  41. Problems when load-balancing to Pods
     • Programming external LBs is slower than programming iptables
     • A rolling update that moves faster than the LB can be reprogrammed can cause an outage

  42. Rolling Update
     [Diagram: an LB in front of three Pods; ReplicaSet my-app-v1 (replicas: 3, selector: app=MyApp, version=v1) and ReplicaSet my-app-v2 (replicas: 1, selector: app=MyApp, version=v2)]

  43. Rolling Update
     • Pod Liveness: the state of the application in the pod (alive or not)
     • Pod Readiness: ready to receive traffic
     [Diagram: LB in front of four Pods; ReplicaSet my-app-v1 (replicas: 3, selector: app=MyApp, version=v1) and ReplicaSet my-app-v2 (replicas: 1, selector: app=MyApp, version=v2); per-pod state shown as Pod - live, Pod - ready, Infra - ?]

  44. Wait for Infrastructure?
     • The LB is not yet programmed, but the Pod reports ready
     • A Pod from the previous ReplicaSet is removed
     • Capacity is reduced!
     [Diagram: LB in front of ReplicaSet my-app-v1 (replicas: 3) and my-app-v2 (replicas: 1); per-pod state shown as Pod - live, Pod - ready, Infra - ?]

  45. Pod Ready++
     • A new state in the Pod lifecycle: wait for infrastructure
     [Diagram: LB in front of ReplicaSet my-app-v1 (replicas: 3) and my-app-v2 (replicas: 1); per-pod state shown as Pod - live, Pod - ready, Infra - wait]

  46. Pod Ready++
     • A new state in the Pod lifecycle: wait for infrastructure (see the sketch below)
     [Diagram: as before, but the per-pod state is now Pod - live, Pod - ready, Infra - ready]
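     "Pod Ready++" shipped as Pod readiness gates: a Pod declares extra conditions that must become True before it counts as Ready, and an external controller (here the LB controller) sets them once the infrastructure is actually programmed. A minimal sketch; the condition type shown is assumed to be GKE's NEG one, and the image is a placeholder:

       apiVersion: v1
       kind: Pod
       metadata:
         name: my-app-v2-pod
         labels:
           app: MyApp
           version: v2
       spec:
         # The Pod is not Ready until this condition is set to True by the
         # load-balancer controller, in addition to its readiness probe passing.
         readinessGates:
         - conditionType: "cloud.google.com/load-balancer-neg-ready"  # assumed condition type
         containers:
         - name: my-app
           image: gcr.io/example/my-app:v2   # placeholder image
           ports:
           - containerPort: 8080
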
  47. What about all the features?
     • Every LB has features not expressed by Kubernetes
     • Principle: most implementations must be able to support most features

  48. Express GCP's LB features
     • CRD to the rescue
       ◦ Linked from the Service
       ◦ Implementation-specific
     • BackendConfig
       ◦ Allows us to expose features to GCP users without bothering anyone else
     [Diagram: an Ingress in front of Service X and Service Y, each linked to its own BackendConfig (X and Y), programming the GCLB]

  49. BackendConfig
     apiVersion: cloud.google.com/v1beta1
     kind: BackendConfig
     metadata:
       name: config-http
     spec:
       cdn:
         enabled: true
         cachePolicy:
           includeHost: true
           includeProtocol: true
       iap:
         enabled: false
       timeoutSec: 5
       sessionAffinity:
         affinityType: GENERATED_COOKIE
         affinityCookieTtlSec: 180

     apiVersion: v1
     kind: Service
     metadata:
       name: my-service
       annotations:
         beta.cloud.google.com/backend-config: '{"ports": {"http":"config-http"}}'
     spec:
       type: NodePort
       selector:
         app: my-app
       ports:
       - name: http
         port: 80
         targetPort: 8080

  50. Mistakes in Abstractions? Too Flexible? Not Flexible Enough? Too Monolithic?

  51. Too flexible?
     • Service is a very flexible abstraction
       ◦ Target ports
       ◦ Named ports (see the sketch below)
     • Makes it hard to implement in some fabrics
       ◦ DSR is incompatible with port remapping
     • Inspired by Docker's port-mapping model
     • Hindsight: should probably have made it simpler
     [Diagram: VIP :80 -> pod :http, where Pod X names http = 8080, Pod Y http = 8000, Pod Z http = 8001]
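     The remapping described above looks like this in the API (a sketch mirroring the slide's numbers; the image is a placeholder): the Service forwards its port 80 to whichever container port each Pod names "http", so different Pods can bind different numbers.

       apiVersion: v1
       kind: Service
       metadata:
         name: my-service
       spec:
         selector:
           app: my-app
         ports:
         - port: 80          # VIP :80 ...
           targetPort: http  # ... forwards to whatever port the Pod names "http"
       ---
       apiVersion: v1
       kind: Pod
       metadata:
         name: pod-x
         labels:
           app: my-app
       spec:
         containers:
         - name: app
           image: gcr.io/example/my-app:v1  # placeholder image
           ports:
           - name: http
             containerPort: 8080            # Pod Y could use 8000, Pod Z 8001
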
  52. Not flexible enough?
     • Service is not flexible enough in other ways
       ◦ Can't forward ranges
       ◦ Can't forward a whole IP
     • Makes it hard for some apps to use Services
       ◦ Dynamic ports
       ◦ Large numbers of ports
     [Diagram: VIP :80 -> pod :8080 and VIP :443 -> pod :8443 for Pods X, Y, and Z]

  53. Too monolithic?
     • The Service API is monolithic and complex
       ◦ The `type` field does not capture all variants
       ◦ Headless vs VIP
       ◦ Selector vs manual endpoints (contrasted in the sketch below)
     • External LB support is built-in but primitive
       ◦ Should have had readiness gates long ago
       ◦ No meaningful status
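     Two of the variants the single `type` field doesn't capture, sketched for contrast (names and addresses are illustrative): a headless Service, where no VIP is allocated and DNS returns Pod IPs directly, and a selector-less Service whose Endpoints are managed by hand.

       # Headless: clusterIP: None, so DNS resolves to the Pod IPs.
       apiVersion: v1
       kind: Service
       metadata:
         name: my-headless-svc
       spec:
         clusterIP: None
         selector:
           app: my-app
         ports:
         - port: 80
       ---
       # No selector: Kubernetes does not manage endpoints for this Service.
       apiVersion: v1
       kind: Service
       metadata:
         name: external-db
       spec:
         ports:
         - port: 5432
       ---
       # The endpoints are supplied manually and must share the Service's name.
       apiVersion: v1
       kind: Endpoints
       metadata:
         name: external-db
       subsets:
       - addresses:
         - ip: 10.240.0.42   # illustrative manually-managed endpoint
         ports:
         - port: 5432
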
  54. Looking ahead

  55. Want more? Come to the SIG-Network Intro & Deep-Dive on Thursday!

  56. Thank You! Purvi Desai @purvid Tim Hockin @thockin