Slide 1

Slide 1 text

Understanding Kubernetes Through Real-World Phenomena and Analogies
Lucas Käldström - CNCF Ambassador
May 9, 2023 – Helsinki
Image credit: CNCF

Slide 2

Slide 2 text

© 2023 Lucas Käldström

$ whoami
 - Lucas Käldström, 4th-year BSc student at Aalto University, Finland
 - CNCF Ambassador, Certified Kubernetes Administrator and Emeritus Kubernetes WG/SIG Lead
 - KubeCon Speaker in Berlin, Austin, Copenhagen, Shanghai, Seattle, San Diego & Valencia
 - KubeCon Keynote Speaker in Barcelona
 - Former Kubernetes approver and subproject owner, active in the OSS community for 7+ years. Worked on e.g. SIG Cluster Lifecycle => kubeadm to GA.
 - Weaveworks contractor, Weave Ignite & libgitops author
 - Cloud Native Nordics co-founder & meetup organizer
 - Guild of Automation and Systems Technology corporate relations & CFO

Slide 3

Slide 3 text

Credits to Simon Sinek

Slide 4

Slide 4 text

Credits to Simon Sinek

Slide 5

Slide 5 text

Let’s start by defining it

Slide 6

Slide 6 text

A Container Orchestrator? Yes

Slide 7

Slide 7 text

A Container Orchestrator? Yes
But in fact, even more than that

Slide 8

Slide 8 text

Kubernetes: A Control Plane for (any) infrastructure

Slide 9

Slide 9 text

Kubernetes: A Control Plane for (any) infrastructure
= A set of automated controllers with operational knowledge of how to control a target system

Slide 10

Slide 10 text


Slide 11

Slide 11 text

What?
 - Run anywhere
 - Self-healing
 - Scalable workload scheduling
 - Service discovery + config mgmt

Slide 12

Slide 12 text

Specify once; Kubernetes makes your dream come true
[Diagram: JSON container workload specification -> HTTP POST JSON object -> REST API server]
*The process doesn’t look exactly like this; it is a simplified mental model for now

Slide 13

Slide 13 text

Specify once; Kubernetes makes your dream come true
[Diagram: JSON container workload specification -> HTTP POST JSON object -> REST API server; the Container Workload Controller reads the desired state]
*The process doesn’t look exactly like this; it is a simplified mental model for now

Slide 14

Slide 14 text

Specify once; Kubernetes makes your dream come true
[Diagram: JSON container workload specification -> HTTP POST JSON object -> REST API server; the Container Workload Controller reads the desired state, then pulls, starts, monitors and re-starts the workload]
*The process doesn’t look exactly like this; it is a simplified mental model for now
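
The simplified mental model on this slide can be sketched in a few lines of Python. This is a toy illustration only: `FakeAPIServer`, `WorkloadController` and the spec fields `name` and `image` are hypothetical names chosen here, not the real Kubernetes API, and the "container runtime" is just a dictionary.

```python
import json


class FakeAPIServer:
    """Hypothetical stand-in for the REST API server: stores posted JSON objects."""

    def __init__(self):
        self.objects = {}

    def post(self, body: str) -> None:
        spec = json.loads(body)            # the HTTP POST body is a JSON object
        self.objects[spec["name"]] = spec  # remember it as desired state


class WorkloadController:
    """Hypothetical controller: reads desired state, then pulls, starts, monitors, re-starts."""

    def __init__(self, api: FakeAPIServer):
        self.api = api
        self.pulled = set()   # images already pulled
        self.running = {}     # workload name -> image
        self.log = []         # actions taken, in order

    def sync(self) -> None:
        for name, spec in self.api.objects.items():  # read desired state
            image = spec["image"]
            if image not in self.pulled:
                self.pulled.add(image)
                self.log.append(f"pull {image}")
            if name not in self.running:              # monitor: is it running?
                started_before = f"start {name}" in self.log
                verb = "re-start" if started_before else "start"
                self.running[name] = image
                self.log.append(f"{verb} {name}")


api = FakeAPIServer()
api.post(json.dumps({"name": "web", "image": "example/web:1.0"}))

controller = WorkloadController(api)
controller.sync()               # first sync: pull and start
controller.running.pop("web")   # simulate a crash of the workload
controller.sync()               # next sync notices and re-starts
print(controller.log)
```

The user specifies the workload once; every later `sync()` call compares what is stored with what is running and fills the gap.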

Slide 15

Slide 15 text

 - Run anywhere
 - Self-healing
 - Scalable workload scheduling
 - Service discovery + config mgmt
How?
 - Closed-loop controllers
 - Uniform, declarative and extensible API

Slide 16

Slide 16 text

Credits to Simon Sinek

Slide 17

Slide 17 text

What problem are we trying to solve?

Slide 18

Slide 18 text

Based on decades of experience

Slide 19

Slide 19 text

Comes with 24 pages of API design guidelines!

Slide 20

Slide 20 text

But is it inherently “too complex” for most?

Slide 21

Slide 21 text

Problems hiding in plain sight
It just takes longer for small-scale users to notice problems, due to e.g. randomness.
[Figure: servers vs. time. Small-scale users: 3 servers over 100 days. Large-scale users: 100 servers over 3 days.]
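
A small worked calculation makes the point concrete. It is a sketch under loud assumptions that are not from the talk: each server fails independently on any given day with the same probability, and the value `P_DAILY = 0.01` is made up for illustration. Both users accumulate the same 300 server-days, so they are equally likely to have seen a failure by the end, but the large-scale user meets the first one far sooner.

```python
def p_failure_seen(servers: int, days: int, p_daily: float) -> float:
    """Probability of at least one failure, assuming independent server-days."""
    return 1.0 - (1.0 - p_daily) ** (servers * days)


def expected_days_to_first_failure(servers: int, p_daily: float) -> float:
    """Mean geometric waiting time until any server in the fleet fails."""
    p_any_today = 1.0 - (1.0 - p_daily) ** servers
    return 1.0 / p_any_today


P_DAILY = 0.01  # assumed per-server daily failure probability, illustrative only

small = p_failure_seen(servers=3, days=100, p_daily=P_DAILY)
large = p_failure_seen(servers=100, days=3, p_daily=P_DAILY)
wait_small = expected_days_to_first_failure(3, P_DAILY)
wait_large = expected_days_to_first_failure(100, P_DAILY)

print(f"P(failure seen): small={small:.3f}, large={large:.3f}")
print(f"expected days to first failure: small={wait_small:.1f}, large={wait_large:.1f}")
```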

Slide 22

Slide 22 text

=> unknown unknowns for small systems

Slide 23

Slide 23 text

Chaos is Inevitable

Slide 24

Slide 24 text

Google Finding: “Failure is the Norm”

Slide 25

Slide 25 text

Google Finding: “Failure is the Norm”
“deliberately leave significant headroom for workload growth, occasional ‘black swan’ events, load spikes, machine failures, hardware upgrades, and large-scale partial failures (e.g., a power supply bus duct)”
Source: (Verma et al., 2015)

Slide 26

Slide 26 text

Entropy: Systems become less ordered
[Figure: entropy over time, from Start to Stop, moving from Order towards Chaos; the area of uncertainty grows!]

Slide 27

Slide 27 text

Entropy: Putting order to chaos
[Figure: entropy over time, from Start to Stop, between Order and Chaos, with a reversing, ordering process]

Slide 28

Slide 28 text

Kubernetes: The dishwasher of servers
[Figure: entropy over time, from Start to Stop, between Order and Chaos, with a reversing, ordering process]

Slide 29

Slide 29 text

What does this mean for server systems?
Example: Sysadmin A gets three new servers and installs the same operating system on all of them, with exactly the same configuration. In the beginning, the system is completely ordered: all instances are identically configured.
[Figure legend: Operating System v1/v2/v3; Configuration A/B/C; Power On/Off/Out]
Server 1: OS v1, Config A, Power On
Server 2: OS v1, Config A, Power On
Server 3: OS v1, Config A, Power On

Slide 30

Slide 30 text

What does this mean for server systems?
After some time, a critical “v2” security upgrade to the operating system becomes available. Sysadmin A upgrades servers 2 and 3, but not server 1: it runs a critical database service that A is afraid to disturb.
Server 1: OS v1, Config A, Power On
Server 2: OS v2, Config A, Power On
Server 3: OS v2, Config A, Power On

Slide 31

Slide 31 text

What does this mean for server systems?
Server 1 complains about slow disk access time, due to a misconfiguration in the operating system. Sysadmin A fixes it imperatively on the complaining server until the complaint stops, but not on any of the other servers.
Server 1: OS v1, Config B, Power On (slow disk access time)
Server 2: OS v2, Config A, Power On
Server 3: OS v2, Config A, Power On

Slide 32

Slide 32 text

What does this mean for server systems?
Sysadmin A has noticed that the number of users has dropped because of a seasonal trend, so A decides to turn server 2 off to save on energy costs.
Server 1: OS v1, Config B, Power On
Server 2: OS v2, Config A, Power Off
Server 3: OS v2, Config A, Power On

Slide 33

Slide 33 text

What does this mean for server systems?
The next week, when sysadmin A is on vacation, server 3 complains about the same error as server 1 did earlier. Sysadmin B “solves” the issue (differently from how A fixed server 1), but does nothing to the other servers.
Server 1: OS v1, Config B, Power On
Server 2: OS v2, Config A, Power Off
Server 3: OS v2, Config C, Power On (slow disk access time)

Slide 34

Slide 34 text

What does this mean for server systems?
Now, a new version of the operating system is released with a very cool feature that would be useful to the sysadmins. However, upgrading is risky because of incompatibilities, so they only upgrade server 3 to try it out.
Server 1: OS v1, Config B, Power On
Server 2: OS v2, Config A, Power Off
Server 3: OS v3, Config C, Power On (desired state change)

Slide 35

Slide 35 text

What does this mean for server systems?
Suddenly, a thunderstorm reaches the area where the servers are, and lightning strikes. Due to the lack of overvoltage protection, server 3’s power supply becomes unusable, and the server shuts down.
Server 1: OS v1, Config B, Power On
Server 2: OS v2, Config A, Power Off
Server 3: OS v3, Config C, Power Out (emergent state change)
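
The end of this walkthrough can be expressed as data: three servers, three different states. The sketch below measures that disorder as drift from a declared desired state. The `desired` value is an assumption made here for illustration (the talk does not state one), and the field names `os`, `config` and `power` are hypothetical.

```python
# Final observed state of the three servers in the walkthrough.
observed = {
    1: {"os": "v1", "config": "B", "power": "On"},
    2: {"os": "v2", "config": "A", "power": "Off"},
    3: {"os": "v3", "config": "C", "power": "Out"},
}

# Assumption: what the sysadmins might declare as desired state if asked.
desired = {"os": "v2", "config": "A", "power": "On"}


def drift(actual: dict, wanted: dict) -> dict:
    """Return {field: (actual, wanted)} for every field that deviates."""
    return {k: (actual[k], v) for k, v in wanted.items() if actual[k] != v}


report = {server: drift(state, desired) for server, state in observed.items()}

# A crude disorder measure: how many distinct states exist across the fleet.
distinct_states = len({tuple(sorted(s.items())) for s in observed.values()})

for server, fields in report.items():
    print(f"server {server}: {fields}")
print("distinct states:", distinct_states)
```

At the start of the example the same measure would be one distinct state and no drift; each manual, per-server fix adds to it.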

Slide 36

Slide 36 text

Entropy: Systems become less ordered
[Figure: entropy over time, from Start to Stop, moving from Order towards Chaos]

Slide 37

Slide 37 text

Kubernetes: The dishwasher of servers

Slide 38

Slide 38 text

Kubernetes: The dishwasher of servers

Slide 39

Slide 39 text


Slide 40

Slide 40 text


Slide 41

Slide 41 text

Game Theory: An Infinite Game against Chaos

Slide 42

Slide 42 text

Key Takeaways
a) Systems inevitably become less ordered, and thus
b) need some periodic corrective action to steer the course towards
c) some declared desired state of the system.

Slide 43

Slide 43 text

Declarative + reconciler vs imperative
Imperative flow:
[Diagram: admin clicks in the Web UI -> corrective action -> system operated (e.g. webservers), in faulty conditions, e.g. too few replicas]

Slide 44

Slide 44 text

Declarative + reconciler vs imperative
Imperative flow:
[Diagram: Web UI; system operated, correct state; admin goes home / to sleep]

Slide 45

Slide 45 text

Declarative + reconciler vs imperative
Imperative flow:
[Diagram: Web UI; half of the replicas go down :( ; admin is home / asleep]
Nothing happens; either the admin has to wake up, or the system has to wait until the next morning. Area of uncertainty grows!

Slide 46

Slide 46 text

Declarative + reconciler vs imperative
Declarative flow:
[Diagram: admin defines desired state in the Web UI -> Desired State Store]

Slide 47

Slide 47 text

Declarative + reconciler vs imperative
Declarative flow:
[Diagram: Web UI -> Desired State Store; shortly thereafter, a reconciler sync gets the desired state and scales up; system operating]

Slide 48

Slide 48 text

Declarative + reconciler vs imperative
Declarative flow:
[Diagram: Web UI; Desired State Store; system operating; admin goes home / to sleep]

Slide 49

Slide 49 text

Declarative + reconciler vs imperative
Declarative flow:
[Diagram: Web UI; Desired State Store; half of the replicas go down :( ; admin is home / asleep]

Slide 50

Slide 50 text

Declarative + reconciler vs imperative
Declarative flow:
[Diagram: the periodic reconciler sync gets the desired state from the Desired State Store, sees drift, and scales up; replicas scaled up to full health; admin is home / asleep]

Slide 51

Slide 51 text

Declarative + reconciler vs imperative
Declarative flow:
[Diagram: Web UI; Desired State Store; system operating in good condition; admin is home / asleep]
This design philosophy is why e.g. Kubernetes is called “self-healing”.
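
The contrast between the two flows can be simulated in a few lines. This is a minimal sketch, not a model of any real system: the function name `simulate`, the replica count of 4 and the five "ticks" of the night are all assumptions chosen for illustration.

```python
from typing import List


def simulate(reconciler_enabled: bool, desired_replicas: int = 4, ticks: int = 5) -> List[int]:
    """Return the replica count observed at each tick while the admin is asleep."""
    actual = desired_replicas  # the admin left the system healthy
    history = []
    for tick in range(ticks):
        if tick == 1:
            actual //= 2  # half of the replicas go down
        elif reconciler_enabled and actual < desired_replicas:
            actual = desired_replicas  # periodic sync sees drift and scales up
        history.append(actual)
    return history


imperative = simulate(reconciler_enabled=False)
declarative = simulate(reconciler_enabled=True)
print("imperative: ", imperative)
print("declarative:", declarative)
```

In the imperative run the system stays degraded until a human acts; in the declarative run the stored desired state lets the next sync repair it without anyone waking up.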

Slide 52

Slide 52 text

WHAT

Slide 53

Slide 53 text

HOW

Slide 54

Slide 54 text

HOW

Slide 55

Slide 55 text

“If you don’t know where you’re going, any road will take you there”

Slide 56

Slide 56 text

controllers + extensible API = abstraction layer

Slide 57

Slide 57 text

Abstraction Layers: Pluggable interfaces
Cloud Native is all about pluggable APIs forming consistent abstractions that projects can implement and/or rely on. These CNCF/LF projects contain only a specification, no implementation:
[Image: logos of the specification-only projects]

Slide 58

Slide 58 text

Kubernetes is a “platform for platforms”
[Diagram: Platform A, Platform B, Platform C, Platform D]

Slide 59

Slide 59 text

Kubernetes is a “platform for platforms”
[Diagram: Platform A, Platform B, Platform C, Platform D]

Slide 60

Slide 60 text

*except for the problem of too many layers of indirection :D

Slide 61

Slide 61 text

Controllers, or reconcile loops, fulfill the claim(s)
[Diagram components: Desired State Source, Observe and diff, Target System]
1: Desired State
2: Actual State

Slide 62

Slide 62 text

Controllers, or reconcile loops, fulfill the claim(s)
[Diagram components: Desired State Source, Observe and diff, Act, Target System]
1: Desired State
2: Actual State
3: Action Plan

Slide 63

Slide 63 text

Controllers, or reconcile loops, fulfill the claim(s)
[Diagram components: Desired State Source, Observe and diff, Act, Target System]
1: Desired State
2: Actual State
3: Action Plan
4: Action

Slide 64

Slide 64 text

Controllers, or reconcile loops, fulfill the claim(s)
[Diagram components: Desired State Source, Observe and diff, Act, Report (Actual State Sink), Target System]
1: Desired State
2, (6): Actual State
3: Action Plan
4: Action
5: Result

Slide 65

Slide 65 text

Controllers, or reconcile loops, fulfill the claim(s)
[Diagram components: Desired State Source, Observe and diff, Act, Report (Actual State Sink), Target System]
1: Desired State
2, (6): Actual State
3: Action Plan
4: Action
5: Result
7: Requeue
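
The numbered steps on this slide map naturally onto a single function. The sketch below is one possible reading of the diagram, not how any real Kubernetes controller is written: the target system is a plain dictionary, and `reconcile`, `Result` and the "requeue whenever something changed" rule are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Result:
    actions: List[str]  # what was done in this pass
    requeue: bool       # whether another pass should be scheduled


def reconcile(desired: Dict[str, str], target: Dict[str, str], status: Dict[str, str]) -> Result:
    # 1: desired state is read from its source (passed in here).
    # 2: observe the actual state of the target system.
    actual = dict(target)

    # 3: diff desired against actual to build an action plan.
    plan = []
    for key, value in desired.items():
        if actual.get(key) != value:
            plan.append(("set", key, value))
    for key in actual:
        if key not in desired:
            plan.append(("delete", key, None))

    # 4: act on the target system; 5: collect the result of each action.
    actions = []
    for verb, key, value in plan:
        if verb == "set":
            target[key] = value
        else:
            del target[key]
        actions.append(f"{verb} {key}")

    # 6: report the new actual state to the status sink.
    status.clear()
    status.update(target)

    # 7: requeue if anything changed, so the next pass can verify convergence.
    return Result(actions=actions, requeue=bool(plan))


desired = {"a": "1", "b": "2"}
target = {"a": "0", "c": "9"}
status: Dict[str, str] = {}

first = reconcile(desired, target, status)
second = reconcile(desired, target, status)
print(first.actions, first.requeue)
print(second.actions, second.requeue)
```

The second pass finds nothing to do, which is the property that makes running the loop periodically safe.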

Slide 66

Slide 66 text

Operators: Encode human-like knowledge

Slide 67

Slide 67 text

Operators: Encode human-like knowledge
= Automated reconcile loops with “human-like” operational knowledge
Coined in 2016 by Brandon Philips, back then at CoreOS

Slide 68

Slide 68 text

Example: Cilium
 - Implements Kubernetes APIs: Endpoints, Services, Ingress & Gateway
 - Registers its custom Network APIs with Kubernetes for advanced features
 - “Compiles” routing and eBPF rules for you on the fly, based on the desired state you specified in the cluster => you never have to write detailed rules
 - Encodes human-like operational knowledge about configuring networks into a reusable tool controlled by declarative APIs

Slide 69

Slide 69 text

Not: Humans Operating Machines

Slide 70

Slide 70 text

Instead: Humans Operating Automation that in turn Operate Machines

Slide 71

Slide 71 text

Further Reading

Slide 72

Slide 72 text

Check out my thesis for more details!
Encoding human-like operational knowledge using declarative Kubernetes operator patterns
Available openly on GitHub: https://github.com/luxas/research
CC-BY-SA 4.0 licensed

Slide 73

Slide 73 text

Control Theory (Vallery Lancey, QCon, 2018)
My talk on control theory + declarative APIs = Kubernetes

Slide 74

Slide 74 text

Promise Theory (The Kubernetes Documentary, Honeypot, 2022)

Slide 75

Slide 75 text

Wrap-up: The 4 Whys:
1. “Control through choreography” based on experience

Slide 76

Slide 76 text

Wrap-up: The 4 Whys:
1. “Control through choreography” based on experience
2. Periodic controller action for fighting inevitable chaos

Slide 77

Slide 77 text

Wrap-up: The 4 Whys:
1. “Control through choreography” based on experience
2. Periodic controller action for fighting inevitable chaos
3. Declarativeness allows defining a (portable) end goal

Slide 78

Slide 78 text

Wrap-up: The 4 Whys:
1. “Control through choreography” based on experience
2. Periodic controller action for fighting inevitable chaos
3. Declarativeness allows defining a (portable) end goal
4. Control loops + extensible declarative APIs = operators

Slide 79

Slide 79 text

Kubernetes = 1 database + 1 REST API + 30 operators
Uniform, declarative and extensible REST API

Slide 80

Slide 80 text

Summary
Thank you!
 - @luxas on GitHub
 - @luxas on LinkedIn
 - @luxas on SpeakerDeck
 - @kubernetesonarm on Twitter
Image credit: Baim Hanif on Unsplash