Kubernetes: Container Orchestration at Scale

Talk at Container Days Boston 2015.

Maxwell Forbes

June 05, 2015

Transcript

  1. 1.

    Google confidential │ Do not distribute

    Max Forbes <maxforbes@google.com> Container Days Boston 2015. Thanks to Brendan Burns and Tim Hockin for nearly all of the slides. Kubernetes: Container Orchestration at Scale
  2. 2-3.

    Everything at Google runs in containers:
    • Gmail, Web Search, Maps, ...
    • MapReduce, batch, ...
    • GFS, Colossus, ...
    • Even GCE itself: VMs in containers
    We launch over 2 billion containers per week.
  4. 5-13.

    More than just “running” containers
    Scheduling: Where should my job be run?
    Lifecycle: Keep my job running
    Discovery: Where is my job now?
    Constituency: Who is part of my job?
    Scale-up: Making my jobs bigger or smaller
    Auth{n,z}: Who can do things to my job?
    Monitoring: What’s happening with my job?
    Health: How is my job feeling?
    ...
  14. 15.

    Kubernetes: Greek for “helmsman”; also the root of the word “governor”
    • Container orchestration
    • Runs Docker containers
    • Supports multiple cloud and bare-metal environments
    • Inspired and informed by Google’s experiences and internal systems
    • Open source, written in Go
    Manage applications, not machines
  15. 17-26.

    A 50,000-foot view: users drive the master (CLI, API, UI → apiserver); the master drives the nodes (kubelets), with a scheduler deciding placement.
    1. User → apiserver: Run X (Replicas = 2, Memory = 4Gi, CPU = 2.5)
    2. apiserver → user: SUCCESS, UID=8675309
    3. scheduler: Which nodes for X?
    4. apiserver → kubelets: Run X
    5. kubelets pull X from the Registry
    6. kubelets → apiserver: Status X (X now running on two nodes)
    7. User: GET X → apiserver returns Status X
    All you really care about: Run X → Master → Container Cluster → Status X
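    The scheduling step above ("Which nodes for X?") can be sketched as a first-fit placement over node capacities. This is purely illustrative: the node names, resource fields, and first-fit policy are assumptions for the sketch, not the real scheduler's algorithm.

    ```python
    # Toy version of the scheduler answering "Which nodes for X?":
    # place `replicas` copies of a pod on nodes with enough free capacity.

    def schedule(pod_request, nodes, replicas):
        """Return one node name per replica, or raise if the cluster
        lacks capacity. `nodes` maps node name -> free resources."""
        placements = []
        free = {name: dict(res) for name, res in nodes.items()}  # don't mutate input
        for _ in range(replicas):
            # first fit: pick the first node that can hold the pod
            for name, res in free.items():
                if res["cpu"] >= pod_request["cpu"] and res["mem_gi"] >= pod_request["mem_gi"]:
                    res["cpu"] -= pod_request["cpu"]
                    res["mem_gi"] -= pod_request["mem_gi"]
                    placements.append(name)
                    break
            else:
                raise RuntimeError("no node can fit the pod")
        return placements

    nodes = {
        "node-1": {"cpu": 3.0, "mem_gi": 4},
        "node-2": {"cpu": 3.0, "mem_gi": 4},
    }
    # "Run X": Replicas = 2, Memory = 4Gi, CPU = 2.5 (the request from the slide)
    print(schedule({"cpu": 2.5, "mem_gi": 4}, nodes, replicas=2))  # -> ['node-1', 'node-2']
    ```

    Note that the second replica spills to node-2 because node-1 has no memory left after the first placement: capacity, not round-robin, drives the choice.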
  25. 28-37.

    Design principles
    Declarative > imperative: State your desired results, let the system actuate
    Control loops: Observe, rectify, repeat
    Simple > Complex: Try to do as little as possible
    Modularity: Components, interfaces, & plugins
    Legacy compatible: Requiring apps to change is a non-starter
    No grouping: Labels are the only groups
    Cattle > Pets: Manage your workload in bulk
    Open > Closed: Open Source, standards, REST, JSON, etc.
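    The "declarative + control loops" pairing above can be sketched in a few lines: you state desired results, and a loop observes current state and rectifies the difference. This is a toy over plain dicts; real Kubernetes controllers follow the same shape against the API server.

    ```python
    # One iteration of a control loop: observe, rectify, repeat.
    # The desired/current field names here are illustrative.

    def reconcile(desired, current):
        """Observe `current`, drive it toward `desired`, and report
        what was changed as (key, had, wanted) tuples."""
        actions = []
        for key, want in desired.items():
            have = current.get(key)
            if have != want:
                current[key] = want          # rectify
                actions.append((key, have, want))
        return actions

    desired = {"replicas": 4, "image": "nifty:v2"}
    current = {"replicas": 3, "image": "nifty:v1"}
    print(reconcile(desired, current))  # -> [('replicas', 3, 4), ('image', 'nifty:v1', 'nifty:v2')]
    print(current == desired)           # -> True
    ```

    The key property is idempotence: running the loop again when nothing has drifted does nothing, so "repeat" is always safe.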
  34. 40-44.

    Primary concepts
    0. Container: A sealed application package (Docker)
    1. Pod: A small group of tightly coupled Containers (example: content syncer & web server)
    2. Controller: A loop that drives current state towards desired state (example: replication controller)
    3. Service: A set of running pods that work together (example: load-balanced backends)
    4. Labels: Identifying metadata attached to other objects (example: phase=canary vs. phase=prod)
    5. Selector: A query against labels, producing a set result (example: all pods where label phase == prod)
  40. 48-53.

    Pods
    Small group of containers & volumes
    Tightly coupled
    The atom of cluster scheduling & placement
    Shared namespace
    • share IP address & localhost
    Ephemeral
    • can die and be replaced
    Example: data puller & web server - a Pod holding a File Puller and a Web Server that share a Volume, with a Content Manager feeding it and Consumers reading from it
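    The "data puller & web server" example above boils down to two workers sharing a volume. A minimal sketch, with the shared volume as a temp directory and the two containers as plain functions (in a real pod they would also share an IP address and localhost):

    ```python
    import os
    import tempfile

    def file_puller(volume, name, content):
        # "pulls" content (e.g. from a content manager) into the shared volume
        with open(os.path.join(volume, name), "w") as f:
            f.write(content)

    def web_server(volume, name):
        # serves whatever the puller placed in the shared volume
        with open(os.path.join(volume, name)) as f:
            return f.read()

    volume = tempfile.mkdtemp()          # stands in for the pod's shared volume
    file_puller(volume, "index.html", "<h1>hello</h1>")
    print(web_server(volume, "index.html"))  # -> <h1>hello</h1>
    ```

    Neither worker knows about the other; the volume is the whole contract, which is why such pairs compose so cleanly inside a pod.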
  45. 55-56.

    Why pods? (figure: Pod with File Puller and Web Server sharing a Volume; Content Manager, Consumers)
    • infeasible for provider to build and maintain all variants of this “as a service”
  47. 63.

    Why not put everything in one container?
    - transparency
    - decouple software dependencies
    - ease of use
    - efficiency
  48. 64.

    Why not something besides pods, like co-scheduling?
    - simpler to have a single scheduling atom
    - other benefits of pods
      - resource sharing
      - IPC
      - shared fate
      - simplified management
  49. 66-68.

    Pod lifecycle
    Once scheduled to a node, pods do not move
    • restart policy means restart in-place
    Pods can be observed pending, running, succeeded, or failed
    • failed is really the end - no more restarts
    • no complex state machine logic
    Pods are not rescheduled by the scheduler or apiserver
    • even if a node dies
    • controllers are responsible for this
    • keeps the scheduler simple
    Apps should consider these rules
    • Services hide this
    • Makes pod-to-pod communication more formal
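    The deliberately simple lifecycle above (pending, running, succeeded, failed; no resurrection from failed) is small enough to write out as a transition table. The table itself is an illustrative reading of the slide, not an official state machine:

    ```python
    # Allowed pod phase transitions, per the lifecycle rules above.
    ALLOWED = {
        "pending":   {"running", "failed"},
        "running":   {"running", "succeeded", "failed"},  # restart in-place stays "running"
        "succeeded": set(),   # terminal
        "failed":    set(),   # terminal: "failed is really the end"
    }

    def advance(phase, new_phase):
        """Move to new_phase if the transition is legal."""
        if new_phase not in ALLOWED[phase]:
            raise ValueError(f"illegal transition {phase} -> {new_phase}")
        return new_phase

    p = "pending"
    p = advance(p, "running")
    p = advance(p, "succeeded")
    print(p)  # -> succeeded

    try:
        advance("failed", "running")   # no more restarts after failure
    except ValueError as e:
        print(e)
    ```

    Keeping the terminal states truly terminal is what lets controllers, not pods, own replacement: a dead pod is replaced, never revived.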
  53. 73-76.

    Labels
    - "release" : "stable", "canary", …
    - "environment" : "dev", "qa", "production", ...
    - "tier" : "frontend", "backend", "middleware", …
    - "partition" : "customerA", "customerB", …
    - "track" : "daily", "weekly", ...
  58. 78.

    Labels
    Arbitrary metadata
    Attached to any API object
    Generally represent identity
    Queryable by selectors • think SQL ‘select ... where ...’
    The only grouping mechanism
    • pods under a ReplicationController
    • pods in a Service
    • capabilities of a node (constraints)
    Example: “phase: canary”
  59. 79-84.

    Selectors
    Four pods, all labeled App: Nifty, each with a Phase (Dev or Test) and a Role (FE or BE)
    Example queries and their results:
    • App == Nifty → all four pods
    • App == Nifty, Role == FE → the two FE pods
    • App == Nifty, Role == BE → the two BE pods
    • App == Nifty, Phase == Dev → the two Dev pods
    • App == Nifty, Phase == Test → the two Test pods
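    The selector queries above, run over the four Nifty pods from the slides. A selector is just "all objects whose labels contain these key/value pairs" (the slide's SQL 'select ... where ...' analogy):

    ```python
    # The four pods from the selector slides, as label dicts.
    pods = [
        {"App": "Nifty", "Phase": "Dev",  "Role": "FE"},
        {"App": "Nifty", "Phase": "Test", "Role": "FE"},
        {"App": "Nifty", "Phase": "Dev",  "Role": "BE"},
        {"App": "Nifty", "Phase": "Test", "Role": "BE"},
    ]

    def select(selector, objects):
        """Return objects whose labels include every key=value in selector."""
        return [o for o in objects if all(o.get(k) == v for k, v in selector.items())]

    print(len(select({"App": "Nifty"}, pods)))                  # -> 4
    print(len(select({"App": "Nifty", "Role": "FE"}, pods)))    # -> 2
    print(len(select({"App": "Nifty", "Phase": "Dev"}, pods)))  # -> 2
    ```

    Because a selector produces a set rather than naming members, the group updates itself as labeled objects come and go, which is exactly what controllers and services rely on.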
  65. 85.

    Replication Controllers
    Canonical example of control loops
    Runs out-of-process wrt API server
    Have 1 job: ensure N copies of a pod
    • if too few, start new ones
    • if too many, kill some
    • group == selector
    Cleanly layered on top of the core
    • all access is by public APIs
    Replicated pods are fungible
    • No implied ordinality or identity
    Figure: a Replication Controller (Name = “nifty-rc”, Selector = {“App”: “Nifty”}, PodTemplate = { ... }, NumReplicas = 4) asks the API Server: How many? 3. Start 1 more. OK. How many? 4.
  66. 86-91.

    Replication Controllers in action: pods f0118, d9376, b0111, a1209 spread over nodes 1-4
    • Desired = 4, Current = 4: steady state
    • node 2 is lost (taking d9376): Desired = 4, Current = 3
    • the controller starts a replacement, c9bad: Desired = 4, Current = 4
    • node 2 comes back with d9376: Desired = 4, Current = 5
    • the controller kills one pod: Desired = 4, Current = 4
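    The walkthrough above is one reconcile rule applied twice: too few pods → start new ones; too many → kill some. A sketch, with illustrative pod names (the real controller defines the group via a label selector):

    ```python
    import itertools

    _ids = itertools.count(1)   # generator for fresh pod names (illustrative)

    def ensure_replicas(pods, desired):
        """One reconcile pass: mutate `pods` (a set of pod names) toward
        `desired` copies; return what was started and what was killed."""
        started, killed = [], []
        while len(pods) < desired:              # too few -> start new ones
            name = f"pod-{next(_ids)}"
            pods.add(name)
            started.append(name)
        while len(pods) > desired:              # too many -> kill some
            victim = sorted(pods)[0]            # pods are fungible: any will do
            pods.remove(victim)
            killed.append(victim)
        return started, killed

    pods = {"f0118", "b0111", "a1209"}          # node 2 died; d9376 is gone
    ensure_replicas(pods, 4)                    # controller starts a replacement
    print(len(pods))                            # -> 4
    pods.add("d9376")                           # node 2 comes back: now 5 copies
    ensure_replicas(pods, 4)                    # controller kills one
    print(len(pods))                            # -> 4
    ```

    Because replicated pods carry no ordinality or identity, the controller never cares *which* pod it kills, only how many remain.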
  72. 92.

    Pod networking
    Pod IPs are routable
    • Docker default is private IP
    Pods can reach each other without NAT
    • even across nodes
    No brokering of port numbers
    This is a fundamental requirement
    • several SDN solutions
  73. 93.

    Services
    A group of pods that act as one == Service
    • group == selector
    Defines access policy
    • only “load balanced” for now
    Gets a stable virtual IP and port
    • called the service portal
    • also a DNS name
    VIP is captured by kube-proxy
    • watches the service constituency
    • updates when backends change
    Hide complexity - ideal for non-native apps
  74. 94.

    Services (figure): a Client connects to 10.0.0.1:9376; kube-proxy, fed by an apiserver watch, knows the Service (Name = “nifty-svc”, Selector = {“App”: “Nifty”}, Port = 9376, ContainerPort = 8080); the Portal IP is assigned, iptables DNAT redirects the TCP/UDP traffic, and it is forwarded to backends 10.240.1.1:8080, 10.240.2.2:8080, 10.240.3.3:8080.
  76. 97-108.

    How a Service comes together, step by step (kube-proxy WATCHes Services and Endpoints on the apiserver):
    1. POST pods: a Pod (Name = “pod1”, Labels = {“App”: “Nifty”}, Port = 9376) is created
    2. run pods: pod1 10.240.1.1:9376, pod2 10.240.2.2:9376, pod3 10.240.3.3:9376 are running
    3. POST service: a Service (Name = “nifty-svc”, Selector = {“App”: “Nifty”}, Port = 80, TargetPort = 9376, PortalIP = 10.9.8.7) is created
    4. new service! kube-proxy sees it through its watch
    5. kube-proxy listens on a random local port X
    6. kube-proxy programs iptables to redirect 10.9.8.7:80 to localhost:X
    7. new endpoints! kube-proxy learns the backend pod addresses
    8. a Client connects to 10.9.8.7:80
    9. iptables redirects the connection to localhost:X
    10. kube-proxy proxies for the client to one of the backend pods
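    Stripped of iptables and watches, the steps above amount to a stable portal address mapped to a changing backend set, with the proxy picking one backend per connection. A toy sketch assuming round-robin selection (the actual balancing policy is an assumption here, not stated on the slides):

    ```python
    import itertools

    class ServiceProxy:
        """Toy stand-in for kube-proxy's per-service forwarding."""

        def __init__(self, portal, backends):
            self.portal = portal                   # stable VIP:port, e.g. "10.9.8.7:80"
            self.backends = list(backends)         # would be updated via the endpoints watch
            self._rr = itertools.cycle(self.backends)

        def connect(self, dest):
            """Client connects to the portal; proxy forwards to a backend."""
            if dest != self.portal:
                raise ConnectionError("not this service's portal")
            return next(self._rr)                  # round-robin over backends

    svc = ServiceProxy("10.9.8.7:80",
                       ["10.240.1.1:9376", "10.240.2.2:9376", "10.240.3.3:9376"])
    print(svc.connect("10.9.8.7:80"))   # -> 10.240.1.1:9376
    print(svc.connect("10.9.8.7:80"))   # -> 10.240.2.2:9376
    ```

    The client only ever sees the portal address, which is why backends can come and go without any client-side changes - the point of the "hide complexity" claim above.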
  88. 109.

    Events
    A central place for information about your cluster
    • filed by any component: kubelet, scheduler, etc.
    Real-time information on the current state of your pod
    • kubectl describe pod foo
    Real-time information on the current state of your cluster
    • kubectl get --watch-only events
    • You can also ask only for events that mention some object you care about.
  89. 110.

    Monitoring
    Optional add-on to Kubernetes clusters
    Run cAdvisor as a pod on each node
    • gather stats from all containers
    • export via REST
    Run Heapster as a pod in the cluster
    • just another pod, no special access
    • aggregate stats
    Run InfluxDB and Grafana in the cluster
    • more pods
    • alternately: store in Google Cloud Monitoring
  90. 111.

    Logging
    Optional add-on to Kubernetes clusters
    Run fluentd as a pod on each node
    • gather logs from all containers
    • export to Elasticsearch
    Run Elasticsearch as a pod in the cluster
    • just another pod, no special access
    • aggregate logs
    Run Kibana in the cluster
    • yet another pod
    • alternately: store in Google Cloud Logging
  92. 113.

    Kubernetes is Open Source
    We want your help!
    http://kubernetes.io
    https://github.com/GoogleCloudPlatform/kubernetes
    irc.freenode.net #google-containers
    @kubernetesio