Understanding Kubernetes Through Real-World Phenomena and Analogies

Lucas Käldström - CNCF Ambassador
May 9, 2023 – Helsinki

© 2023 Lucas Käldström

$ whoami

Lucas Käldström, 4th-year BSc student at Aalto University, Finland
- CNCF Ambassador, Certified Kubernetes Administrator and Emeritus Kubernetes WG/SIG Lead
- KubeCon Speaker in Berlin, Austin, Copenhagen, Shanghai, Seattle, San Diego & Valencia
- KubeCon Keynote Speaker in Barcelona
- Former Kubernetes approver and subproject owner, active in the OSS community for 7+ years. Worked on e.g. SIG Cluster Lifecycle => kubeadm to GA.
- Weaveworks contractor, Weave Ignite & libgitops author
- Cloud Native Nordics co-founder & meetup organizer
- Guild of Automation and Systems Technology corporate relations & CFO

A Container Orchestrator? Yes. But in fact, even more than that.

Kubernetes: A Control Plane for (any) infrastructure
= A set of automated controllers with operational knowledge of how to control a target system

What?
- Run anywhere
- Self-healing
- Scalable workload scheduling
- Service discovery + config mgmt

Specify once; Kubernetes makes your dream true

- You write a JSON container workload specification.
- The JSON object is sent with an HTTP POST to the REST API server.
- The Container Workload Controller reads the desired state.
- The controller then acts on it: pull, start, monitor, re-start.

*The process doesn’t look exactly like this; it is a simplified mental model for now.
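The simplified mental model above can be sketched as a toy program. This is only an illustration and not how Kubernetes is implemented: `FakeApiServer`, `WorkloadController` and the spec fields `name`, `image` and `replicas` are hypothetical names, and an in-memory dict stands in for the REST API server.

```python
import json


class FakeApiServer:
    """In-memory stand-in for the REST API server (hypothetical)."""

    def __init__(self):
        self.objects = {}

    def post(self, body: str) -> None:
        # Accept a JSON object, as an HTTP POST handler would, and store it.
        spec = json.loads(body)
        self.objects[spec["name"]] = spec


class WorkloadController:
    """Reads desired state and decides what to pull, start or restart."""

    def __init__(self, api: FakeApiServer):
        self.api = api
        self.pulled = set()   # images already pulled
        self.running = {}     # workload name -> number of running containers

    def sync(self) -> list:
        actions = []
        for name, spec in self.api.objects.items():
            if spec["image"] not in self.pulled:
                self.pulled.add(spec["image"])
                actions.append(f"pull {spec['image']}")
            # Toy rule: anything missing after the first sync counts as a restart.
            verb = "restart" if name in self.running else "start"
            missing = spec["replicas"] - self.running.get(name, 0)
            actions.extend(f"{verb} {name}" for _ in range(max(missing, 0)))
            self.running[name] = max(self.running.get(name, 0), spec["replicas"])
        return actions


api = FakeApiServer()
controller = WorkloadController(api)
api.post(json.dumps({"name": "web", "image": "example/web:1.0", "replicas": 2}))
print(controller.sync())        # first sync: pull the image, start two containers
controller.running["web"] -= 1  # simulate one container crashing
print(controller.sync())        # monitoring finds the gap and restarts one
```

The user specifies the workload once; every later `sync` compares what is running against that stored specification, which is why a crash needs no new input from the user.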

Run anywhere, self-healing, scalable workload scheduling, service discovery + config mgmt. How?
- Closed-loop controllers
- Uniform, declarative and extensible API

What problem are we trying to solve?

Comes with 24 pages of API design guidelines!

But is it inherently “too complex” for most?

Problems hiding in plain sight

It just takes longer for small-scale users to notice problems due to e.g. randomness.

[Figure: servers vs. time. Small-scale users: 3 servers over 100 days. Large-scale users: 100 servers over 3 days.]
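The comparison above comes down to arithmetic on server-days: 3 servers over 100 days accumulate the same exposure as 100 servers over 3 days. A minimal sketch, assuming a hypothetical, independent per-server daily probability `p` of hitting a given problem (the value 0.001 is made up for illustration and does not come from the talk):

```python
def server_days(servers: int, days: int) -> int:
    """Total exposure: every server contributes one server-day per day."""
    return servers * days


def p_seen(p: float, servers: int, days: int) -> float:
    """Probability that the problem shows up at least once in the fleet."""
    return 1.0 - (1.0 - p) ** server_days(servers, days)


def days_until_likely(p: float, servers: int, threshold: float = 0.5) -> int:
    """Smallest number of days after which p_seen reaches the threshold."""
    days = 0
    while p_seen(p, servers, days) < threshold:
        days += 1
    return days


p = 0.001  # assumed value, for illustration only
print(p_seen(p, servers=3, days=100), p_seen(p, servers=100, days=3))
print(days_until_likely(p, servers=3), days_until_likely(p, servers=100))
```

Under these assumptions both fleets face the same problem with the same probability per server-day; the small fleet simply has to wait many times longer before it is likely to have seen it.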

=> unknown unknowns for small systems

Chaos is Inevitable

Google Finding: “Failure is the Norm”

“deliberately leave significant headroom for workload growth, occasional ‘black swan’ events, load spikes, machine failures, hardware upgrades, and large-scale partial failures (e.g., a power supply bus duct)”
Source: (Verma et al., 2015)

Entropy: Systems become less ordered

[Figure: entropy over time; labels: Order, Start, Stop, Chaos. The area of uncertainty grows!]

Entropy: Putting order to chaos

[Figure: entropy over time; labels: Order, Start, Stop, Chaos, and a reversing, ordering process.]

Kubernetes: The dishwasher of servers

[Figure: entropy over time; labels: Order, Start, Stop, Chaos, and a reversing, ordering process.]

What does this mean for server systems?

Example: Sysadmin A gets three new servers and installs the same operating system onto all of them, with exactly the same configuration. In the beginning, the system is completely ordered; all instances are identically configured.

State: Server 1: OS v1, Config A, Power On. Server 2: OS v1, Config A, Power On. Server 3: OS v1, Config A, Power On.

After some time, a critical “v2” security upgrade to the operating system becomes available, and sysadmin A upgrades servers 2 and 3, but not 1, as it is running a critical database service, so A is afraid to disturb it.

State: Server 1: OS v1, Config A, Power On. Server 2: OS v2, Config A, Power On. Server 3: OS v2, Config A, Power On.

Server 1 complains about slow disk access time, due to a misconfiguration in the operating system. Sysadmin A fixes it imperatively on the server that complains, until the complaint stops, but on none of the other servers.

State: Server 1: OS v1, Config B, Power On. Server 2: OS v2, Config A, Power On. Server 3: OS v2, Config A, Power On.

Sysadmin A has noticed that the number of users has dropped because of a seasonal trend, so A decides to turn server 2 off to save on energy costs.

State: Server 1: OS v1, Config B, Power On. Server 2: OS v2, Config A, Power Off. Server 3: OS v2, Config A, Power On.

The next week, when sysadmin A is on vacation, server 3 complains about the same error (slow disk access time) as server 1 did earlier. Sysadmin B “solves” the issue, in a different way than A did for server 1, but does nothing to the other servers.

State: Server 1: OS v1, Config B, Power On. Server 2: OS v2, Config A, Power Off. Server 3: OS v2, Config C, Power On.

Desired state change: Now, a new version of the operating system is released with a very cool feature that would be useful to the sysadmins. However, upgrading is risky because of incompatibilities, so they only upgrade server 3 to try it out.

State: Server 1: OS v1, Config B, Power On. Server 2: OS v2, Config A, Power Off. Server 3: OS v3, Config C, Power On.

Emergent state change: Suddenly, a thunderstorm enters the area where the servers are, and lightning strikes. Due to the lack of overvoltage protection, server 3’s power supply becomes unusable, and the server shuts down.

State: Server 1: OS v1, Config B, Power On. Server 2: OS v2, Config A, Power Off. Server 3: OS v3, Config C, Power Out.
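The drift in this example can be made measurable. The sketch below replays the seven states of the three servers and, for each step, sums the Shannon entropy of each attribute (OS version, configuration, power) across the fleet. Using Shannon entropy as the yardstick is an assumption made here for illustration; the talk uses entropy as an analogy and does not prescribe a formula.

```python
import math
from collections import Counter

# One entry per step; each lists (OS, configuration, power) for servers 1, 2, 3.
STEPS = [
    [("v1", "A", "On"), ("v1", "A", "On"), ("v1", "A", "On")],
    [("v1", "A", "On"), ("v2", "A", "On"), ("v2", "A", "On")],
    [("v1", "B", "On"), ("v2", "A", "On"), ("v2", "A", "On")],
    [("v1", "B", "On"), ("v2", "A", "Off"), ("v2", "A", "On")],
    [("v1", "B", "On"), ("v2", "A", "Off"), ("v2", "C", "On")],
    [("v1", "B", "On"), ("v2", "A", "Off"), ("v3", "C", "On")],
    [("v1", "B", "On"), ("v2", "A", "Off"), ("v3", "C", "Out")],
]


def attribute_entropy(values) -> float:
    """Shannon entropy (in bits) of one attribute across the fleet."""
    total = len(values)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(values).values())


def fleet_entropy(servers) -> float:
    """Sum of per-attribute entropies; 0 means all servers are identical."""
    # zip(*servers) regroups the tuples into one column per attribute.
    return sum(attribute_entropy(column) for column in zip(*servers))


for step, servers in enumerate(STEPS):
    print(step, round(fleet_entropy(servers), 3))
```

With this measure the fleet starts at zero and grows with every change described above, whether the change was a deliberate decision or an accident.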

Entropy: Systems become less ordered

[Figure: entropy over time; labels: Order, Start, Stop, Chaos.]

Game Theory: An Infinite Game against Chaos

Key Takeaways
a) Systems inevitably become less ordered, and thus
b) need some periodic corrective action to steer the course towards
c) some declared desired state of the system.

Declarative + reconciler vs imperative

Imperative flow:
- The system operated (e.g. webservers) is in faulty conditions, e.g. too few replicas. The admin clicks a corrective action in the Web UI.
- The system operated is in the correct state. The admin goes home / to sleep.
- Half of the replicas go down :( while the admin is at home / asleep. Nothing happens; either the admin needs to wake up, or the fix has to wait until the next morning. The area of uncertainty grows!

Declarative flow:
- The admin defines the desired state through the Web UI; it is kept in the Desired State Store.
- Shortly thereafter, the reconciler syncs: it gets the desired state and scales the system up. The system is operating.
- The admin goes home / to sleep. The system keeps operating.
- Half of the replicas go down :( while the admin is at home / asleep.
- The periodic reconciler sync sees the drift, gets the desired state and scales up. The replicas are scaled up to full health.
- The system is operating in good condition, and the admin is still at home / asleep. This design philosophy is why e.g. Kubernetes is called “self-healing”.
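The difference between the two flows can be replayed as a tiny simulation. This is a sketch under stated assumptions: the desired replica count of 4, the hourly time step and the `simulate` function are all hypothetical, and the admin is asleep for the whole run.

```python
DESIRED_REPLICAS = 4  # hypothetical desired state


def simulate(hours: int, failure_hour: int, reconcile: bool) -> list:
    """Return the replica count at the end of each hour.

    Half of the replicas go down at failure_hour. With reconcile=True a
    periodic sync restores the desired count at the next hour; with
    reconcile=False nothing happens, because the admin is asleep.
    """
    replicas = DESIRED_REPLICAS
    history = []
    for hour in range(hours):
        if reconcile and replicas != DESIRED_REPLICAS:
            replicas = DESIRED_REPLICAS  # reconciler sees drift and scales up
        if hour == failure_hour:
            replicas //= 2               # half of the replicas go down
        history.append(replicas)
    return history


print("imperative: ", simulate(hours=6, failure_hour=2, reconcile=False))
print("declarative:", simulate(hours=6, failure_hour=2, reconcile=True))
```

In the imperative run the system stays degraded until a human acts; in the declarative run the degradation lasts only until the next sync, because the desired state is stored and not just remembered by the admin.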

“If you don’t know where you’re going, any road will
take you there”

Abstraction Layers: Pluggable interfaces

Cloud Native is all about pluggable APIs forming consistent abstractions that projects can implement and/or rely on. These CNCF/LF projects contain only a specification, no implementation:

*except for the problem of too many layers of indirection
:D

Controllers, or reconcile loops, fulfill the claim(s)

[Diagram: a controller with the stages “Observe and diff”, “Act” and “Report”, connected to a Desired State Source, a Target System and an Actual State Sink.]

1: Desired State, read from the Desired State Source
2: Actual State, observed from the Target System
3: Action Plan, passed from “Observe and diff” to “Act”
4: Action, applied to the Target System
5: Result, returned from the Target System
(6): Actual State, reported to the Actual State Sink
7: Requeue
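The numbered steps of the loop can be sketched in a few lines. This is a minimal illustration, not the Kubernetes controller machinery: plain dicts stand in for the Desired State Source, the Target System and the Actual State Sink, state is a replica count per key, and requeueing only after a change is a simplifying assumption.

```python
from collections import deque


def reconcile(key, desired_source: dict, target: dict, sink: dict) -> bool:
    """Run one pass of the loop for one key; return True if it acted."""
    desired = desired_source[key]        # 1: desired state
    actual = target.get(key, 0)          # 2: actual state
    plan = desired - actual              # 3: action plan (observe and diff)
    if plan != 0:
        target[key] = actual + plan      # 4: action; 5: result is the new state
    sink[key] = target.get(key, 0)       # (6): report actual state
    return plan != 0


def run(desired_source: dict, target: dict, sink: dict, max_passes: int = 10) -> list:
    """Process keys from a work queue until nothing needs to change."""
    queue = deque(desired_source)
    log = []
    while queue and len(log) < max_passes:
        key = queue.popleft()
        acted = reconcile(key, desired_source, target, sink)
        log.append((key, acted))
        if acted:
            queue.append(key)            # 7: requeue to verify the result
    return log


desired_source = {"web": 3}
target = {"web": 1}
sink = {}
print(run(desired_source, target, sink))
print(target, sink)
```

The second pass for the same key finds nothing to do, which is the point of requeueing: the controller checks its own work against the desired state and does not assume the action succeeded.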

Operators: Encode human-like knowledge

= Automated reconcile loops with “human-like” operational knowledge

Coined in 2016 by Brandon Philips, back then at CoreOS

Example: Cilium
- Implements Kubernetes APIs: Endpoints, Services, Ingress & Gateway
- Registers its custom Network APIs with Kubernetes for advanced features
- “Compiles” routing and eBPF rules for you on the fly, based on the desired state you specified in the cluster => you never have to write detailed rules
- Encodes human-like operational knowledge about configuring networks into a reusable tool controlled by declarative APIs

Not: Humans Operating Machines

Instead: Humans Operating Automation that in turn Operates Machines
