Slide 1

SCALE 13x: Container Management at Google Scale
Tim Hockin, Senior Staff Software Engineer
@thockin

Slide 2

SCALE 13x: Container Management at Google Scale
Tim Hockin, Senior Staff Software Engineer
@thockin

Slide 3

Old Way: Shared machines
• No isolation
• No namespacing
• Common libs
• Highly coupled apps and OS
[Diagram: several apps sharing common libs on a single kernel]

Slide 4

Old Way: Virtual machines
• Some isolation
• Expensive and inefficient
• Still highly coupled to the guest OS
• Hard to manage
[Diagram: each app bundled with its own libs and guest kernel, stacked on the host kernel]

Slide 5

New Way: Containers
[Diagram: each app bundled with its own libs, all sharing one kernel]

Slide 6

But what ARE they?
Lightweight VMs
• no guest OS, lower overhead than VMs, no virtualization hardware needed
Better packages
• no DLL hell
Hermetically sealed static binaries
• no external dependencies
Provide isolation (from each other and from the host)
• Resources (CPU, RAM, Disk, etc.)
• Users
• Filesystem
• Network

Slide 7

How?
Implemented by a number of (unrelated) Linux APIs:
• cgroups: restrict the resources a process can consume
  • CPU, memory, disk IO, ...
• namespaces: change a process’s view of the system
  • network interfaces, PIDs, users, mounts, ...
• capabilities: limit what a user can do
  • mount, kill, chown, ...
• chroots: determine what parts of the filesystem a user can see

Slide 8

Google has been developing and using containers to manage our applications for over 10 years.
Images by Connie Zhou

Slide 9

Everything at Google runs in containers:
• Gmail, Web Search, Maps, ...
• MapReduce, batch, ...
• GFS, Colossus, ...
• Even GCE itself: VMs in containers

Slide 10

Everything at Google runs in containers:
• Gmail, Web Search, Maps, ...
• MapReduce, batch, ...
• GFS, Colossus, ...
• Even GCE itself: VMs in containers
We launch over 2 billion containers per week.

Slide 11

Why containers?
• Performance
• Repeatability
• Isolation
• Quality of service
• Accounting
• Visibility
• Portability
A fundamentally different way of managing applications
Images by Connie Zhou

Slide 12

Docker
[Chart: Google Trends interest in “Docker”]
Source: Google Trends

Slide 13

But what IS Docker?
An implementation of the container idea
A package format
An ecosystem
A company
An open-source juggernaut
A phenomenon
Hoorah! The world is starting to adopt containers!

Slide 14

LMCTFY
Also an implementation of the container idea (from Google)
Also open-source
Literally the same code that Google uses internally
“Let Me Contain That For You”

Slide 15

LMCTFY
Also an implementation of the container idea (from Google)
Also open-source
Literally the same code that Google uses internally
“Let Me Contain That For You”
Probably NOT what you want to use!

Slide 16

Docker vs. LMCTFY
Docker is primarily about namespacing: control what you can see
• resource and performance isolation were afterthoughts
LMCTFY is primarily about performance isolation: jobs cannot hurt each other
• namespacing was an afterthought
Docker focused on making things simple and self-contained
• “sealed” images, a repository of pre-built images, simple tooling
LMCTFY focused on solving the isolation problem very thoroughly
• totally ignored images and tooling

Slide 17

About isolation
Principles:
• Apps must not be able to affect each other’s perf
  • if so, it is an isolation failure
• Repeated runs of the same app should see ~equal perf
• Graduated QoS drives resource decisions in real-time
• Correct in all cases, optimal in some
  • reduce unreliable components
• SLOs are the lingua franca
[Diagram: App 1 and App 2]

Slide 18

Strong isolation
[Chart: one machine’s resources, memory axis 0–8192 MB, CPU axis 0–4 cores]

Slide 19

Strong isolation
[Chart: the same machine with one allocation: RAM=2GB, CPU=1.0]

Slide 20

Strong isolation
[Chart: two allocations: RAM=2GB/CPU=1.0 and RAM=4GB/CPU=2.5]

Slide 21

Strong isolation
[Chart: three allocations: RAM=2GB/CPU=1.0, RAM=4GB/CPU=2.5, and RAM=1GB/CPU=0.5]

Slide 22

Strong isolation
[Chart: the same three allocations; the remaining RAM=1GB is stranded!]

Slide 23

Strong isolation
Pros:
• Sharing - users don’t worry about interference (aka the noisy neighbor problem)
• Predictable - allows us to offer strong SLAs to apps
Cons:
• Stranding - arbitrary slices mean some resources get lost
• Confusing - how do I know how much I need?
  • analog: what size VM should I use?
  • smart auto-scaling is needed!
• Expensive - you pay for certainty
In reality this is a multi-dimensional bin-packing problem: CPU, memory, disk space, IO bandwidth, network bandwidth, ...

Slide 24

A dose of reality
The kernel itself uses some resources “off the top”
• We can estimate it statistically, but we can’t really limit it

Slide 25

A dose of reality
[Chart: the OS takes its own slice of the machine; with allocations RAM=4GB/CPU=2.5, RAM=2GB/CPU=1.0, and RAM=1GB/CPU=0.5, the machine is now over-committed!]

Slide 26

A dose of reality
The kernel itself uses some resources “off the top”
• We can estimate it statistically, but we can’t really limit it
System daemons (e.g. our node agent) use some resources
• We can (and do) limit these, but the failure modes are not always great

Slide 27

A dose of reality
[Chart: the machine now has OS and Sys (system daemon) slices alongside the RAM=4GB/CPU=2.5 and RAM=2GB/CPU=1.0 allocations]

Slide 28

A dose of reality
The kernel itself uses some resources “off the top”
• We can estimate it statistically, but we can’t really limit it
System daemons (e.g. our node agent) use some resources
• We can (and do) limit these, but the failure modes are not always great
If ANYONE is uncontained, then all SLOs are void. We pretend that the kernel is contained, but only because we have no real choice. Experience shows this holds up most of the time.
Hold this thought for later...

Slide 29

Results
Overall this works VERY well for latency-sensitive serving jobs
Shortcomings:
• There are still some things that cannot easily be contained in real time
  • e.g. cache (see CPI2)
• Some resource dimensions are really hard to schedule
  • e.g. disk IO - so little of it, so bursty, and SO SLOW
• Low utilization: nobody uses 100% of what they request
• Not well tuned for compute-heavy work (e.g. batch)
• Users don’t really know how much CPU/RAM/etc. to request

Slide 30

Usage vs. bookings
[Chart: the machine’s memory and CPU axes, comparing actual usage against booked resources]

Slide 31

Making better use of it all
Proposition: re-sell unused resources with lower SLOs
• Perfect for batch work
• Probabilistically “good enough”
Shortcomings:
• Even more emphasis on isolation failures
  • we can’t let batch hurt “paying” customers
• Requires a lot of smarts in the lowest parts of the stack
  • e.g. deterministic OOM killing by priority
  • we have a number of kernel patches we want to mainline, but we have had a hard time getting the upstream kernel community on board

Slide 32

Usage vs. bookings
[Chart: the same machine with batch jobs packed into the unused capacity between bookings and actual usage]

Slide 33

Back to Docker
Container isolation today:
• ...does not handle most of this
• ...is fundamentally voluntary
• ...is an obvious area for improvement in the coming year(s)

Slide 34

More than just isolation
Scheduling: Where should my job be run?
Lifecycle: Keep my job running
Discovery: Where is my job now?
Constituency: Who is part of my job?
Scale-up: Making my jobs bigger or smaller
Auth{n,z}: Who can do things to my job?
Monitoring: What’s happening with my job?
Health: How is my job feeling?
...

Slide 35

Enter Kubernetes
Greek for “Helmsman”; also the root of the word “Governor”
• Container orchestrator
• Runs Docker containers
• Supports multiple cloud and bare-metal environments
• Inspired and informed by Google’s experiences and internal systems
• Open source, written in Go
Manage applications, not machines

Slide 36

Design principles
Declarative > imperative: State your desired results, let the system actuate
Control loops: Observe, rectify, repeat
Simple > Complex: Try to do as little as possible
Modularity: Components, interfaces, & plugins
Legacy compatible: Requiring apps to change is a non-starter
Network-centric: IP addresses are cheap
No grouping: Labels are the only groups
Cattle > Pets: Manage your workload in bulk
Open > Closed: Open Source, standards, REST, JSON, etc.

Slide 37

Pets vs. Cattle

Slide 38

High-level design
[Diagram: users drive the CLI, API, and UI, which talk to the apiserver and scheduler on the master; the master manages kubelets on the nodes]

Slide 39

Primary concepts
Container: A sealed application package (Docker)
Pod: A small group of tightly coupled containers
  example: content syncer & web server
Controller: A loop that drives current state towards desired state
  example: replication controller
Service: A set of running pods that work together
  example: load-balanced backends
Labels: Identifying metadata attached to other objects
  example: phase=canary vs. phase=prod
Selector: A query against labels, producing a set result
  example: all pods where label phase == prod

Slide 40

Pods

Slide 41

Pods

Slide 42

Pods
Small group of containers & volumes
Tightly coupled
The atom of cluster scheduling & placement
Shared namespace
• share IP address & localhost
Ephemeral
• can die and be replaced
Example: data puller & web server
[Diagram: a Pod containing a File Puller and a Web Server that share a Volume; a Content Manager and Consumers sit outside the pod]
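As an illustration (not part of the original slides), the pod in this diagram might be written roughly like the manifest below. The names and images are hypothetical, and the fields follow the later v1 API rather than the pre-1.0 API current at the time of this talk.

apiVersion: v1
kind: Pod
metadata:
  name: nifty-web
  labels:
    App: Nifty
spec:
  volumes:
  - name: content                 # shared scratch space; lives and dies with the pod
    emptyDir: {}
  containers:
  - name: file-puller             # hypothetical image that syncs content into the volume
    image: example.com/file-puller:latest
    volumeMounts:
    - name: content
      mountPath: /data
  - name: web-server              # serves the shared volume over HTTP on the pod's IP
    image: nginx
    ports:
    - containerPort: 80
    volumeMounts:
    - name: content
      mountPath: /usr/share/nginx/html
      readOnly: true

Both containers share the pod’s IP address and can reach each other over localhost, which is what “shared namespace” means in practice.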

Slide 43

Docker networking
[Diagram: three nodes on subnets 10.1.1.0/24, 10.1.2.0/24, and 10.1.3.0/24; the containers on each node get private addresses such as 172.16.1.1 and 172.16.1.2, which repeat from node to node]

Slide 44

Docker networking
[Diagram: the same three nodes, with NAT on every path between containers on different nodes]

Slide 45

Pod networking
Pod IPs are routable
• the Docker default is a private IP
Pods can reach each other without NAT
• even across nodes
No brokering of port numbers
This is a fundamental requirement
• several SDN solutions can provide it

Slide 46

Pod networking
[Diagram: pods get routable IPs from their node’s subnet: 10.1.1.93 and 10.1.1.113 on 10.1.1.0/24, 10.1.2.118 on 10.1.2.0/24, and 10.1.3.129 on 10.1.3.0/24]

Slide 47

Labels
Arbitrary metadata
Attached to any API object
Generally represent identity
Queryable by selectors
• think SQL ‘select ... where ...’
The only grouping mechanism
• pods under a ReplicationController
• pods in a Service
• capabilities of a node (constraints)
Example: “phase: canary”
[Diagram: four pods labeled App: Nifty, with Phase: Dev or Test and Role: FE or BE]
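To make this concrete (an illustrative sketch, not from the slides), labels are plain key/value pairs in an object’s metadata; one of the pods pictured here could carry labels like this, using hypothetical names and later v1-style YAML:

apiVersion: v1
kind: Pod
metadata:
  name: nifty-fe-dev
  labels:                         # arbitrary key/value metadata, queryable by selectors
    App: Nifty
    Phase: Dev
    Role: FE
spec:
  containers:
  - name: frontend
    image: example.com/nifty-fe:dev     # hypothetical image

Controllers and services then select against these labels rather than naming pods directly.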

Slide 48

Selectors
[Diagram: the four Nifty pods (Phase: Dev/Test crossed with Role: FE/BE), with no selector applied]

Slide 49

Selectors
App == Nifty
[Diagram: the selector matches all four Nifty pods]

Slide 50

Selectors
App == Nifty, Role == FE
[Diagram: the selector matches the two FE pods]

Slide 51

Selectors
App == Nifty, Role == BE
[Diagram: the selector matches the two BE pods]

Slide 52

Selectors
App == Nifty, Phase == Dev
[Diagram: the selector matches the two Dev pods]

Slide 53

Selectors
App == Nifty, Phase == Test
[Diagram: the selector matches the two Test pods]

Slide 54

Replication Controllers
The canonical example of control loops
Runs out-of-process wrt the API server
Has one job: ensure N copies of a pod
• if too few, start new ones
• if too many, kill some
• group == selector
Cleanly layered on top of the core
• all access is via public APIs
Replicated pods are fungible
• no implied ordinality or identity
[Diagram: a Replication Controller (Name = “nifty-rc”, Selector = {“App”: “Nifty”}, PodTemplate = { ... }, NumReplicas = 4) asks the API Server “How many?”, hears “3”, says “Start 1 more”, gets “OK”, and on the next check hears “4”]
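A sketch of what the “nifty-rc” controller in the diagram could look like as a manifest. This uses later v1-style YAML and a hypothetical image; the exact field names in the pre-1.0 API of this talk were different.

apiVersion: v1
kind: ReplicationController
metadata:
  name: nifty-rc
spec:
  replicas: 4                     # desired count; the control loop converges toward this
  selector:
    App: Nifty                    # group == selector
  template:                       # pod template used to stamp out replacement replicas
    metadata:
      labels:
        App: Nifty                # must match the selector above
    spec:
      containers:
      - name: nifty
        image: example.com/nifty:latest   # hypothetical image
        ports:
        - containerPort: 8080

Because replicas are fungible, any pod matching the selector counts toward the total, and any one of them can be killed to scale down.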

Slide 55

Replication Controllers
[Diagram: a Replication Controller (Desired = 4, Current = 4) with pods f0118, d9376, b0111, and a1209 spread across nodes 1–4]

Slide 56

Replication Controllers
[Diagram: the same four pods across nodes 1–4; Desired = 4, Current = 4]

Slide 57

Replication Controllers
[Diagram: node 2 disappears, taking pod d9376 with it; Desired = 4, Current = 3]

Slide 58

Replication Controllers
[Diagram: the controller starts a replacement pod, c9bad; Desired = 4, Current = 4]

Slide 59

Replication Controllers
[Diagram: node 2 returns with pod d9376, so Desired = 4 but Current = 5]

Slide 60

Replication Controllers
[Diagram: the controller kills one of the five pods to converge; Desired = 4, Current = 4]

Slide 61

Services
A group of pods that act as one == Service
• group == selector
Defines access policy
• only “load balanced” for now
Gets a stable virtual IP and port
• called the service portal
• also a DNS name
The VIP is captured by kube-proxy
• watches the service constituency
• updates when backends change
Hides complexity - ideal for non-native apps
[Diagram: a Client talks to the service Portal (VIP)]

Slide 62

Services
[Diagram: a Client connects to the portal IP 10.0.0.1:9376; kube-proxy, watching the apiserver and programming iptables DNAT, forwards the TCP/UDP traffic to the backends 10.240.1.1:8080, 10.240.2.2:8080, and 10.240.3.3:8080. The Service is Name = “nifty-svc”, Selector = {“App”: “Nifty”}, Port = 9376, ContainerPort = 8080; the Portal IP is assigned automatically]
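The “nifty-svc” in this diagram might be written roughly as follows (v1-style YAML; what the slide calls ContainerPort is spelled targetPort in the later API):

apiVersion: v1
kind: Service
metadata:
  name: nifty-svc
spec:
  selector:
    App: Nifty            # the service’s constituency is whatever matches this selector
  ports:
  - protocol: TCP
    port: 9376            # the stable portal (virtual IP) port clients connect to
    targetPort: 8080      # the port the backend containers actually listen on

Clients only ever see the portal IP and port; kube-proxy keeps the backend list up to date as pods come and go.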

Slide 63

Kubernetes status & plans
Open-sourced in June 2014
• won the Black Duck “Rookie of the Year” award
• so did cAdvisor :)
Google launched Google Container Engine (GKE)
• hosted Kubernetes
• https://cloud.google.com/container-engine/
Roadmap:
• https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/roadmap.md
Driving towards a 1.0 release in O(months)
• O(100) nodes, O(50) pods per node
• focus on web-like app-serving use cases

Slide 64

Monitoring
Optional add-on to Kubernetes clusters
Run cAdvisor as a pod on each node
• gather stats from all containers
• export via REST
Run Heapster as a pod in the cluster
• just another pod, no special access
• aggregate stats
Run InfluxDB and Grafana in the cluster
• more pods
• alternately: store in Google Cloud Monitoring
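For flavor only, a per-node cAdvisor pod might look something like the sketch below; the image name, port, and host-path mounts are assumptions for illustration, not the actual add-on manifest that shipped with the project.

apiVersion: v1
kind: Pod
metadata:
  name: cadvisor
  labels:
    App: cadvisor
spec:
  containers:
  - name: cadvisor
    image: google/cadvisor:latest       # assumed public cAdvisor image
    ports:
    - containerPort: 8080               # cAdvisor’s REST/UI port
    volumeMounts:                       # read-only views of the host for stats collection
    - name: rootfs
      mountPath: /rootfs
      readOnly: true
    - name: sys
      mountPath: /sys
      readOnly: true
    - name: docker
      mountPath: /var/lib/docker
      readOnly: true
  volumes:
  - name: rootfs
    hostPath:
      path: /
  - name: sys
    hostPath:
      path: /sys
  - name: docker
    hostPath:
      path: /var/lib/docker

Heapster then aggregates these per-node stats for the cluster as a whole.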

Slide 65

Logging
Optional add-on to Kubernetes clusters
Run fluentd as a pod on each node
• gather logs from all containers
• export to Elasticsearch
Run Elasticsearch as a pod in the cluster
• just another pod, no special access
• aggregate logs
Run Kibana in the cluster
• yet another pod
• alternately: store in Google Cloud Logging

Slide 66

Kubernetes and isolation
We support isolation...
• ...inasmuch as Docker does
We want better isolation
• issues are open with Docker
  • parent cgroups, GIDs, in-place updates, ...
• will also need kernel work
• we have lots of tricks we want to share!
We have to meet users where they are
• strong isolation is new to most people
• we’ll all have to grow into it

Slide 67

Example: nested cgroups
[Diagram: a machine cgroup (CPU: 8 cores, Memory: 16 GB) contains per-pod cgroups, which in turn contain per-container cgroups:
• pod1 cgroup (CPU: 4 cores, Memory: 8 GB) with c1 (CPU: 2 cores, Memory: 4 GB) and c2 (CPU: 1 core, Memory: 4 GB)
• pod2 cgroup (CPU: 3 cores, Memory: 5 GB) with c1 (CPU: 3 cores, Memory: 5 GB)
• pod3 cgroup with no limits set, containing an unlimited c1
• leftovers: CPU: 1 core, Memory: 3 GB]
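In Kubernetes terms, this nesting is driven by per-container resource limits; pod1 from the diagram might be declared roughly like this (an illustrative v1-style sketch with hypothetical images; the pod-level cgroup itself is managed by the node agent, not written in the manifest):

apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  containers:
  - name: c1
    image: example.com/app-c1:latest    # hypothetical image
    resources:
      limits:
        cpu: "2"                        # 2 cores
        memory: 4Gi
  - name: c2
    image: example.com/app-c2:latest    # hypothetical image
    resources:
      limits:
        cpu: "1"                        # 1 core
        memory: 4Gi

On the node, each container gets its own cgroup nested under a pod-level cgroup (the slide’s pod1 at 4 cores / 8 GB), which in turn nests under the machine’s root cgroup.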

Slide 68

The Goal: Shake things up
Containers are a new way of working
They require new concepts and new tools
Google has a lot of experience...
...but we are listening to the users
Workload portability is important!

Slide 69

Kubernetes is Open Source
We want your help!
http://kubernetes.io
https://github.com/GoogleCloudPlatform/kubernetes
irc.freenode.net #google-containers
@kubernetesio

Slide 70

Questions?
http://kubernetes.io
Images by Connie Zhou

Slide 71

Backup Slides

Slide 72

Control loops
Drive current state -> desired state
Act independently
APIs - no shortcuts or back doors
Observed state is truth
Recurring pattern in the system
Example: ReplicationController
[Diagram: observe -> diff -> act, repeating]

Slide 73

Modularity
Loose coupling is a goal everywhere
• simpler
• composable
• extensible
Code-level plugins where possible
Multi-process where possible
Isolate risk by interchangeable parts
Example: ReplicationController
Example: Scheduler

Slide 74

Atomic storage
Backing store for all master state
Hidden behind an abstract interface
Stateless means scalable
Watchable
• this is a fundamental primitive
• don’t poll, watch
Using CoreOS etcd

Slide 75

Volumes
Pod scoped
Share the pod’s lifetime & fate
Support various types of volumes
• Empty directory (default)
• Host file/directory
• Git repository
• GCE Persistent Disk
• ...more to come, suggestions welcome
[Diagram: a Pod whose containers mount an Empty volume, a Host volume backed by the host’s filesystem, a Git volume backed by GitHub, and a GCE volume backed by a GCE PD]
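A hedged sketch of how those volume types are spelled in a manifest (later v1-style field names, hypothetical images and paths; the gitRepo volume type was eventually deprecated):

apiVersion: v1
kind: Pod
metadata:
  name: volume-demo
spec:
  containers:
  - name: app
    image: example.com/app:latest       # hypothetical image
    volumeMounts:
    - name: scratch
      mountPath: /scratch
    - name: hostlogs
      mountPath: /var/log/host
    - name: source
      mountPath: /src
    - name: data
      mountPath: /data
  volumes:
  - name: scratch
    emptyDir: {}                        # empty directory; shares the pod’s lifetime and fate
  - name: hostlogs
    hostPath:
      path: /var/log                    # a file or directory from the host
  - name: source
    gitRepo:
      repository: https://github.com/example/app.git   # hypothetical repository
  - name: data
    gcePersistentDisk:
      pdName: my-data-disk              # hypothetical pre-existing GCE Persistent Disk
      fsType: ext4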

Slide 76

Pod lifecycle
Once scheduled to a node, pods do not move
• restart policy means restart in-place
Pods can be observed as pending, running, succeeded, or failed
• failed is really the end - no more restarts
• no complex state machine logic
Pods are not rescheduled by the scheduler or apiserver
• even if a node dies
• controllers are responsible for this
• keeps the scheduler simple
Apps should consider these rules
• Services hide this
• Makes pod-to-pod communication more formal
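As a small, assumed example of that restart-in-place rule, the policy is a per-pod field (v1 spelling shown; values are Always, OnFailure, or Never):

apiVersion: v1
kind: Pod
metadata:
  name: one-shot-task
spec:
  restartPolicy: OnFailure              # restarts happen in place, on the same node
  containers:
  - name: task
    image: example.com/task:latest      # hypothetical image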

Slide 77

Cluster services
Logging, Monitoring, DNS, etc.
All run as pods in the cluster - no special treatment, no back doors
Open-source solutions for everything
• cAdvisor + InfluxDB + Heapster == cluster monitoring
• fluentd + Elasticsearch + Kibana == cluster logging
• SkyDNS + kube2sky == cluster DNS
Can be easily replaced by custom solutions
• Modular clusters to fit your needs