▣ Kubernetes
□ Platform to manage containers in a cluster
▣ Understand its core functionality
□ Mechanisms and policies
▣ Major questions
□ Scheduling policy
□ Admission control
□ Autoscaling policy
□ Effect of failures
Goals
Slide 5
Slide 5 text
▣ Monitor state changes
□ Force system into initial state
□ Introduce stimuli
□ Observe the change towards the final state
▣ Requirements
□ Small Kubernetes cluster with resource monitoring
□ Simple workloads to drive the changes
Our Approach
Slide 6
Slide 6 text
▣ Kubernetes tries to be simple and minimal
▣ Scheduling and admission control
□ Based on resource requirements
□ Spreading across nodes
▣ Response to failures
□ Timeout and restart
□ Can push to undesirable states
▣ Autoscaling as expected
□ Control loop with damping
Observations
▣ Workloads have shifted from using VMs to containers
□ Better resource utilization
□ Faster deployment
□ Simplifies config and portability
▣ More than just scheduling
□ Load balancing
□ Replication for services
□ Application health checking
□ Ease of use for
■ Scaling
■ Rolling updates
Need for Container Management
Slide 9
Slide 9 text
High Level Design
(Diagram: the User talks to the Master, which runs the api-server and scheduler; each Node runs a kubelet that manages its pods.)
Slide 10
Slide 10 text
Pods
▣ Small group of containers
▣ Shared namespace
□ Share IP and localhost
□ Volume: shared directory
▣ Scheduling unit
▣ Resource quotas
□ Limit
□ Min request
▣ Once scheduled, pods do not move
(Diagram: a Pod containing a File Puller and a Web Server that share a Volume, serving a Content Consumer.)
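To make the limit and min request above concrete, here is a hypothetical pod spec written as a Python dict that mirrors the manifest fields; the container name, image, and values are illustrative, not taken from the experiments.

```python
# Hypothetical pod spec as a Python dict (illustrative names and values).
# "requests" is the min request the scheduler reasons about; "limits" is the hard cap.
pod_spec = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web-server"},
    "spec": {
        "containers": [{
            "name": "web",
            "image": "nginx",
            "resources": {
                "requests": {"cpu": "100m", "memory": "64Mi"},   # min request
                "limits": {"cpu": "200m", "memory": "128Mi"},    # hard limit
            },
        }],
    },
}
```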
Slide 11
Slide 11 text
General Concepts
▣ Replication Controller
□ Maintain count of pod replicas
▣ Service
□ A set of running pods accessible by virtual IP
▣ Network model
□ IP for every pod, service and node
□ Makes all to all communication easy
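As a concrete sketch of the Replication Controller and Service concepts above, here are two hypothetical specs written as Python dicts that mirror the manifest fields; the names, labels, replica count, and port are illustrative.

```python
# Hypothetical ReplicationController: keeps 3 replicas of the pod template running.
replication_controller = {
    "apiVersion": "v1",
    "kind": "ReplicationController",
    "metadata": {"name": "web-rc"},
    "spec": {
        "replicas": 3,                      # maintain this count of pod replicas
        "selector": {"app": "web"},
        "template": {
            "metadata": {"labels": {"app": "web"}},
            "spec": {"containers": [{"name": "web", "image": "nginx"}]},
        },
    },
}

# Hypothetical Service: the set of pods matching the selector, behind one virtual IP.
service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "web-svc"},
    "spec": {
        "selector": {"app": "web"},
        "ports": [{"port": 80, "targetPort": 80}],
    },
}
```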
Slide 12
Slide 12 text
3.
Experimental
Setup
Slide 13
Slide 13 text
Experimental Setup
▣ Google Compute Engine cluster
□ 1 master, 6 nodes
▣ Limited by the free trial
□ Could not perform experiments on scalability
Google Compute Engine
Slide 14
Slide 14 text
Simplified Workloads
□ Low request - Low usage
□ Low request - High usage
□ High request - Low usage
□ High request - High usage
▣ Simple scripts running in containers (a sketch follows below)
▣ Consume specified amounts of CPU and Memory
▣ Set the request and usage
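A minimal sketch of the kind of load script described above: it holds a fixed amount of memory and burns a chosen fraction of one CPU with a simple duty cycle. The duty-cycle approach and the parameter values are assumptions for illustration, not the exact scripts used in the experiments.

```python
import time

def consume(cpu_fraction=0.3, memory_mb=200, period=0.1):
    # Hold roughly memory_mb of RAM for the lifetime of the process.
    ballast = bytearray(memory_mb * 1024 * 1024)
    busy = period * cpu_fraction
    while True:                                   # runs until the container is killed
        start = time.monotonic()
        while time.monotonic() - start < busy:    # spin for the "on" part of the cycle
            pass
        time.sleep(period - busy)                 # idle for the "off" part

if __name__ == "__main__":
    consume(cpu_fraction=0.3, memory_mb=200)      # roughly 30% of one CPU, ~200 MB RAM
```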
Slide 15
Slide 15 text
5.
Experiments
Scheduling Behavior
● Scheduling based on min-request or actual usage?
Slide 16
Slide 16 text
Scheduling based on min-request or actual usage?
▣ Initial experiments showed that the scheduler tries to spread the load
□ Is this based on actual usage or on min-request?
▣ Set up two nodes with no background containers
□ Node A has high CPU usage but a low request
□ Node B has low CPU usage but a higher request
▣ See where a new pod gets scheduled
Slide 17
Slide 17 text
Scheduling based on Min-Request or Actual Usage (CPU)? - Before
Node A: Pod1 (Request: 10%, Usage: 67%)
Node B: Pod2 (Request: 10%, Usage: 1%), Pod3 (Request: 10%, Usage: 1%)
Slide 18
Slide 18 text
Scheduling based on Min-Request or Actual Usage (CPU)? - After
Node A: Pod1 (Request: 10%, Usage: 42%), new Pod4 (Request: 10%, Usage: 43%)
Node B: Pod2 (Request: 10%, Usage: 1%), Pod3 (Request: 10%, Usage: 1%)
Slide 19
Slide 19 text
Scheduling based on Min-Request or Actual Usage (Memory)?
▣ We saw the same results when running pods with varying memory usage and requests
▣ Scheduling is based on min-request, not actual usage (see the sketch below)
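A toy model consistent with this result: a least-requested spreading score that looks only at the sum of requests on each node and never at measured usage. This is our own sketch of the observed policy, not the scheduler's actual code; the numbers mirror the CPU experiment above.

```python
def least_requested_score(capacity, requested):
    """Higher score means less of the node's capacity is already requested."""
    return (capacity - requested) / capacity

nodes = {
    # node: (CPU capacity, sum of CPU requests); measured usage is deliberately absent
    "A": (1.0, 0.10),   # Pod1 requests 10% (its 67% usage is ignored)
    "B": (1.0, 0.20),   # Pod2 + Pod3 request 10% + 10%
}
best = max(nodes, key=lambda n: least_requested_score(*nodes[n]))
print("schedule the new pod on node", best)   # -> A, despite its higher actual usage
```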
Slide 20
Slide 20 text
5.
Experiments
Scheduling Behavior
● Are Memory and CPU given equal weightage for making scheduling decisions?
Slide 21
Slide 21 text
Are Memory and CPU given Equal
Weightage?
▣ First Experiment (15 trials):
□ Both nodes have 20% CPU request and 20%
Memory request
□ Average request 20%
▣ New pod equally likely to get scheduled on
both nodes.
Slide 22
Slide 22 text
New Pod with 20% CPU and 20% Memory Request
(Diagram: Node A and Node B each hold one pod, Pod1 and Pod2, each with CPU Request 20% and Memory Request 20%; the new Pod3 also requests 20% CPU and 20% Memory.)
Slide 23
Slide 23 text
New Pod with 20% CPU and 20% Memory Request
(Diagram: Node A and Node B each hold one pod, Pod1 and Pod2, each with CPU Request 20% and Memory Request 20%; the new Pod3 also requests 20% CPU and 20% Memory.)
Slide 24
Slide 24 text
▣ Second Experiment (15 trials):
□ Node A has 20% CPU request and 10% Memory
request
■ Average request 15%
□ Node B has 20% CPU request and 20% Memory
request
■ Average request 20%
▣ New pod should always be scheduled on
Node A
Are Memory and CPU given Equal
Weightage?
Slide 25
Slide 25 text
New Pod with 20% CPU and 20% Memory Request
(Diagram: Node A holds Pod1 (CPU Request 20%, Memory Request 10%); Node B holds Pod2 (CPU Request 20%, Memory Request 20%); the new Pod3 requests 20% CPU and 20% Memory.)
Slide 26
Slide 26 text
New Pod with 20% CPU and 20% Memory Request
(Diagram: Node A holds Pod1 (CPU Request 20%, Memory Request 10%); Node B holds Pod2 (CPU Request 20%, Memory Request 20%); the new Pod3 requests 20% CPU and 20% Memory.)
Slide 27
Slide 27 text
Are Memory and CPU given Equal
Weightage?
▣ Third Experiment (15 trials):
□ Node A has 20% CPU request and 10% Memory
request.
■ Average 15%
□ Node B has 10% CPU request and 20% Memory
request
■ Average 15%
▣ Equally likely to get scheduled on both again
Slide 28
Slide 28 text
New Pod with 20% CPU and 20% Memory Request
(Diagram: Node A holds Pod1 (CPU Request 20%, Memory Request 10%); Node B holds Pod2 (CPU Request 10%, Memory Request 20%); the new Pod3 requests 20% CPU and 20% Memory.)
Slide 29
Slide 29 text
New Pod with 20% CPU and 20% Memory Request
(Diagram: Node A holds Pod1 (CPU Request 20%, Memory Request 10%); Node B holds Pod2 (CPU Request 10%, Memory Request 20%); the new Pod3 requests 20% CPU and 20% Memory.)
Slide 30
Slide 30 text
Are Memory and CPU given Equal Weightage?
▣ From the experiments we can see that Memory and CPU requests are given equal weightage in scheduling decisions (see the sketch below)
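A toy scoring function consistent with these three experiments: the CPU and Memory least-requested scores are simply averaged, i.e. given equal weightage. Again this is our sketch of the observed behavior, not the scheduler's real implementation.

```python
def node_score(cpu_requested, mem_requested, cpu_capacity=1.0, mem_capacity=1.0):
    cpu_score = (cpu_capacity - cpu_requested) / cpu_capacity
    mem_score = (mem_capacity - mem_requested) / mem_capacity
    return (cpu_score + mem_score) / 2.0    # CPU and Memory weighted equally

# Second experiment: Node A has 20% CPU / 10% Memory requested, Node B has 20% / 20%.
print(node_score(0.20, 0.10))   # Node A -> 0.85
print(node_score(0.20, 0.20))   # Node B -> 0.80, so the new pod lands on Node A
```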
Slide 31
Slide 31 text
5.
Experiments
Admission Control
● Is admission control based on resource usage or resource request?
Slide 32
Slide 32 text
Is Admission Control based on Resource Usage or Request?
Node A: Pod1, Pod2, Pod3, Pod4, each with Request: 1% and Usage: 21%
Slide 33
Slide 33 text
Is Admission Control based on Actual Usage?: 70% CPU request
Node A: Pod1, Pod2, Pod3, Pod4, each with Request: 1% and Usage: 2%; plus Pod5 (Request: 70%, Usage: 78%)
Slide 34
Slide 34 text
Is Admission Control based on Actual Usage?: 98% CPU request
Node A: Pod1, Pod2, Pod3, Pod4, each with Request: 1% and Usage: 21%; plus a new pod with Request: 98%, Usage: 1%
Slide 35
Slide 35 text
Is Admission Control based on Actual Usage?: 98% CPU request
Node A: Pod1, Pod2, Pod3, Pod4, each with Request: 1% and Usage: 21%; plus a new pod with Request: 98%, Usage: 1%
Slide 36
Slide 36 text
Is Admission Control based on Actual Usage?
▣ From the previous 2 slides we can see that admission control is also based on min-request and not actual usage (see the sketch below)
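A simplified model of what these results suggest: admission checks whether the summed min-requests fit within the node's capacity and ignores measured usage entirely. The function below is our sketch of that idea, not Kubernetes' actual admission code.

```python
def admits(existing_requests, new_request, capacity=1.0):
    """True if the pod fits by summed requests; measured usage never enters the check."""
    return sum(existing_requests) + new_request <= capacity

existing = [0.01, 0.01, 0.01, 0.01]   # the four 1%-request pods above
print(admits(existing, 0.70))         # True: fits by request even though node usage was high
print(admits(existing, 0.98))         # False: 102% of capacity would be requested
```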
Slide 37
Slide 37 text
5.
Experiments
Does Kubernetes always guarantee minimum request?
Slide 38
Slide 38 text
Before Background Load
Node A: Pod1 (Request: 70%, Usage: 75%)
Slide 39
Slide 39 text
After Background Load (100 Processes)
Node A: Pod1 (Request: 70%, Usage: 27%), plus a high-load background process
Slide 40
Slide 40 text
Does Kubernetes always guarantee Min Request?
▣ Background processes on the node are not part of any pod, so Kubernetes has no control over them
▣ This can prevent pods from getting their min-request
Slide 41
Slide 41 text
5.
Experiments
Fault Tolerance and effect of failures
● Container and Node crash
Slide 42
Slide 42 text
Response to Failure
▣ Container crash
□ Detected via the Docker daemon on the node
□ More sophisticated probes can detect slowdown or deadlock (a probe sketch follows)
▣ Node crash
□ Detected via the node controller, with a 40-second heartbeat
□ Pods of the failed node are rescheduled after 5 minutes
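As an example of the "more sophisticated probes" mentioned above, here is a hypothetical HTTP liveness probe written as a Python dict that mirrors the manifest fields; the path, port, and timing values are illustrative.

```python
# Hypothetical liveness probe for a container; the kubelet restarts the container
# if the probe keeps failing, which catches slowdowns and deadlocks, not just crashes.
liveness_probe = {
    "httpGet": {"path": "/healthz", "port": 8080},
    "initialDelaySeconds": 15,   # give the container time to start
    "periodSeconds": 10,         # probe every 10 seconds
    "failureThreshold": 3,       # restart after 3 consecutive failures
}
```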
Slide 43
Slide 43 text
5.
Experiments
Fault Tolerance and effect of failures
● Interesting consequence of crash, reboot
Slide 44
Slide 44 text
Pod Layout before Crash
Node A: Pod1 (Request: 10%, Usage: 35%)
Node B: Pod2 (Request: 10%, Usage: 45%), Pod3 (Request: 10%, Usage: 40%)
Slide 45
Slide 45 text
Pod Layout after Crash
Node A: Pod1 (Request: 10%, Usage: 35%)
Node B: Pod2 (Request: 10%, Usage: 45%), Pod3 (Request: 10%, Usage: 40%)
Slide 46
Slide 46 text
Pod Layout after Crash & before
Recovery
Node A Node B
Pod2
Request: 10%
Usage : 27%
Pod3
Request: 10%
Usage : 26%
Pod1
Request: 10%
Usage : 29%
Slide 47
Slide 47 text
Pod Layout after Crash & after
Recovery
Node A Node B
Pod2
Request: 10%
Usage : 27%
Pod3
Request: 10%
Usage : 26%
Pod1
Request: 10%
Usage : 29%
Slide 48
Slide 48 text
Interesting Consequence of Crash, Reboot
▣ Can shift the container placement into an undesirable or less optimal state
▣ Multiple ways to mitigate this
□ Have Kubernetes reschedule
■ Increases complexity
□ Users set their requirements carefully so as not to get into that situation
□ Reset the entire system to get back to the desired configuration
Slide 49
Slide 49 text
5.
Experiments
Autoscaling
● How does Kubernetes do autoscaling?
Slide 50
Slide 50 text
Autoscaling
▣ Control loop (sketched below)
□ Set a target CPU utilization for a pod
□ Check the CPU utilization of all pods
□ Adjust the number of replicas to meet the target utilization
□ Here utilization is a percentage of the pod's request
▣ What does normal autoscaling behavior look like for a stable load?
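A sketch of the per-iteration calculation in this control loop, using the formula from the backup slides: target pods = ceil(sum of pod utilizations / target utilization), where a pod's utilization is its actual usage divided by its request. The helper name and example numbers are ours.

```python
import math

def desired_replicas(pod_utilizations, target_utilization):
    # Target Num Pods = Ceil( Sum( All Pods Util ) / Target Util )
    return math.ceil(sum(pod_utilizations) / target_utilization)

# Two pods running at 110% and 90% of their request, with a 50% target:
print(desired_replicas([1.10, 0.90], 0.50))   # -> 4 replicas
```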
Slide 51
Slide 51 text
Normal Behavior of Autoscaler
(Plot: pod CPU utilization over time against the 50% target utilization.)
Slide 52
Slide 52 text
Normal Behavior of Autoscaler
(Plot: target utilization 50%. High load is added to the system; the CPU usage and the number of pods increase.)
Slide 53
Slide 53 text
Normal Behavior of Autoscaler
(Plot: target utilization 50%. The load is now spread across nodes, and the measured CPU usage is now the average CPU usage of the 4 nodes.)
Slide 54
Slide 54 text
Normal Behavior of Autoscaler
(Plot: target utilization 50%. The load was removed and pods get removed.)
Slide 55
Slide 55 text
Autoscaling Parameters
▣ The autoscaler has two important parameters (sketched below)
▣ Scale up
□ Delayed until 3 minutes after the last scaling event
▣ Scale down
□ Delayed until 5 minutes after the last scaling event
▣ How does the autoscaler react to a more transient load?
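A sketch of how the two delays might gate scaling decisions: scale up only if at least 3 minutes have passed since the last scaling event, scale down only after 5 minutes. The class and its bookkeeping are our own illustration of the damping, not the autoscaler's real code.

```python
import time

SCALE_UP_DELAY = 3 * 60     # seconds since the last scaling event before scaling up
SCALE_DOWN_DELAY = 5 * 60   # seconds since the last scaling event before scaling down

class DampedScaler:
    def __init__(self):
        self.last_scale = float("-inf")

    def decide(self, current, desired, now=None):
        now = time.monotonic() if now is None else now
        since_last = now - self.last_scale
        if desired > current and since_last >= SCALE_UP_DELAY:
            self.last_scale = now
            return desired
        if desired < current and since_last >= SCALE_DOWN_DELAY:
            self.last_scale = now
            return desired
        return current          # inside the damping window: keep the current count

scaler = DampedScaler()
print(scaler.decide(current=2, desired=4))   # 4: first event scales up immediately
print(scaler.decide(current=4, desired=2))   # 4: scale-down is held back for 5 minutes
```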
Slide 56
Slide 56 text
Autoscaling Parameters
(Plot: pod CPU utilization over time against the 50% target utilization, for a more transient load.)
Slide 57
Slide 57 text
Autoscaling Parameters
(Plot: target utilization 50%. The load went down.)
Slide 58
Slide 58 text
Autoscaling Parameters
(Plot: target utilization 50%. The number of pods doesn't scale down as quickly.)
Slide 59
Slide 59 text
Autoscaling Parameters
(Plot: target utilization 50%. The number of pods doesn't scale down as quickly; this is repeated in other runs too.)
Slide 60
Slide 60 text
Autoscaling Parameters
▣ Needs to be tuned for the nature of the
workload
▣ Generally conservative
□ Scales up faster
□ Scales down slower
▣ Tries to avoid thrashing
Slide 61
Slide 61 text
5.
Summary
Slide 62
Slide 62 text
▣ Scheduling and admission control policy is based on the min-request of resources
□ CPU and Memory are given equal weightage
▣ Crashes can drive the system towards undesirable states
▣ Autoscaler works as expected
□ Has to be tuned for the workload
Summary
Slide 63
Slide 63 text
6.
Conclusion
Slide 64
Slide 64 text
▣ Philosophy of control loops
□ Observe, rectify, repeat
□ Drive system towards desired state
▣ Kubernetes tries to do as little as possible
□ Not a lot of policies
□ Makes it easier to reason about
□ But can be too simplistic in some cases
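A minimal sketch of that control-loop philosophy; observe and rectify are placeholder callables, not Kubernetes APIs.

```python
import time

def reconcile(desired_state, observe, rectify, interval_seconds=10):
    """Observe, rectify, repeat: keep driving the system toward the desired state."""
    while True:
        observed = observe()
        if observed != desired_state:
            rectify(observed, desired_state)
        time.sleep(interval_seconds)
```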
Conclusion
Slide 65
Slide 65 text
Thanks!
Any questions?
Slide 66
Slide 66 text
References
▣ http://kubernetes.io/
▣ http://blog.kubernetes.io/
▣ Verma, Abhishek, et al. "Large-scale cluster management at
Google with Borg." Proceedings of the Tenth European
Conference on Computer Systems. ACM, 2015.
Slide 67
Slide 67 text
Backup slides
Slide 68
Slide 68 text
4.
Experiments
Scheduling Behavior
● Is the policy based on spreading load across resources?
Slide 69
Slide 69 text
Is the Policy based on Spreading Load across Resources?
▣ Launch a Spark cluster on Kubernetes
▣ Increase the number of workers one at a time
▣ Expect to see them scheduled across the nodes
▣ Shows the spreading policy of the scheduler
Slide 70
Slide 70 text
Individual Node Memory Usage
Slide 71
Slide 71 text
Increase in Memory Usage across Nodes
Slide 72
Slide 72 text
Final Pod Layout after Scheduling
Node A: Worker 1, Worker 2, Worker 3, DNS, Logging
Node B: Worker 4, Worker 5, Worker 6, Graphana, Logging, Master
Node C: Worker 7, Worker 8, Worker 9, LB Controller, Logging, Kube-UI
Node D: Worker 10, Worker 11, Worker 12, Heapster, Logging, KubeDash
Slide 73
Slide 73 text
Is the Policy based on Spreading Load across Resources?
▣ Exhibits spreading behaviour
▣ Inconclusive
□ Based on resource usage or request?
□ Background pods add noise
□ Spark workload is hard to gauge
Slide 74
Slide 74 text
▣ CPU Utilization of pod
□ Actual usage / Amount requested
Target Num Pods = Ceil( Sum( All Pods Util ) / Target Util )
Autoscaling Algorithm
Slide 75
Slide 75 text
Control Plane Components
Master
▣ API Server
□ Client access to the master
▣ etcd
□ Distributed consistent storage using Raft
▣ Scheduler
▣ Controller
□ Replication
Node
▣ Kubelet
□ Manages pods and containers
▣ Kube-proxy
□ Load balances among replicas of a pod for a service
Slide 76
Slide 76 text
Detailed Architecture
Slide 77
Slide 77 text
Autoscaling for Long Stable Loads (10 high, 10 low)
Slide 78
Slide 78 text
New Pod with 20% CPU and 20% Memory Request
(Diagram: Node A and Node B each hold one pod, Pod1 and Pod2, each with CPU Request 20% and Memory Request 20%; the new Pod3 (Iter 1) requests 20% CPU and 20% Memory.)
Slide 79
Slide 79 text
New Pod with 20% CPU and 20% Memory Request
(Diagram: Node A and Node B each hold one pod, Pod1 and Pod2, each with CPU Request 20% and Memory Request 20%; the new Pod3 (Iter 2) requests 20% CPU and 20% Memory.)
Slide 80
Slide 80 text
New Pod with 20% CPU and 20% Memory Request
(Diagram: Node A and Node B each hold one pod, Pod1 and Pod2, each with CPU Request 20% and Memory Request 20%; the new Pod3 (Iter 3) requests 20% CPU and 20% Memory.)
Slide 81
Slide 81 text
New Pod with 20% CPU and 20% Memory Request
(Diagram: Node A holds Pod1 (CPU Request 20%, Memory Request 10%); Node B holds Pod2 (CPU Request 20%, Memory Request 20%); the new Pod3 (Iter 1) requests 20% CPU and 20% Memory.)
Slide 82
Slide 82 text
New Pod with 20% CPU and 20% Memory Request
(Diagram: Node A holds Pod1 (CPU Request 20%, Memory Request 10%); Node B holds Pod2 (CPU Request 20%, Memory Request 20%); the new Pod3 (Iter 2) requests 20% CPU and 20% Memory.)
Slide 83
Slide 83 text
New Pod with 20% CPU and 20% Memory Request
(Diagram: Node A holds Pod1 (CPU Request 20%, Memory Request 10%); Node B holds Pod2 (CPU Request 20%, Memory Request 20%); the new Pod3 (Iter 3) requests 20% CPU and 20% Memory.)
Slide 84
Slide 84 text
New Pod with 20% CPU and 20% Memory Request
(Diagram: Node A holds Pod1 (CPU Request 20%, Memory Request 10%); Node B holds Pod2 (CPU Request 10%, Memory Request 20%); the new Pod3 (Iter 1) requests 20% CPU and 20% Memory.)
Slide 85
Slide 85 text
New Pod with 20% CPU and 20% Memory Request
(Diagram: Node A holds Pod1 (CPU Request 20%, Memory Request 10%); Node B holds Pod2 (CPU Request 10%, Memory Request 20%); the new Pod3 (Iter 2) requests 20% CPU and 20% Memory.)
Slide 86
Slide 86 text
New Pod with 20% CPU and 20% Memory Request
(Diagram: Node A holds Pod1 (CPU Request 20%, Memory Request 10%); Node B holds Pod2 (CPU Request 10%, Memory Request 20%); the new Pod3 (Iter 3) requests 20% CPU and 20% Memory.)