Understanding Kubernetes Through Real-World Phenomena and Analogies

Presented at the DevOps Finland meetup: https://www.meetup.com/DevOps-Finland

Lucas Käldström

May 09, 2023

Transcript

  1. Understanding Kubernetes Through Real-World Phenomena and Analogies

    Lucas Käldström, CNCF Ambassador. May 9, 2023, Helsinki. Image credit: CNCF
  2. $ whoami (© 2023 Lucas Käldström)

    - Lucas Käldström, 4th-year BSc student at Aalto University, Finland
    - CNCF Ambassador, Certified Kubernetes Administrator and Emeritus Kubernetes WG/SIG Lead
    - KubeCon Speaker in Berlin, Austin, Copenhagen, Shanghai, Seattle, San Diego & Valencia
    - KubeCon Keynote Speaker in Barcelona
    - Former Kubernetes approver and subproject owner, active in the OSS community for 7+ years. Worked on e.g. SIG Cluster Lifecycle => kubeadm to GA.
    - Weaveworks contractor, Weave Ignite & libgitops author
    - Cloud Native Nordics co-founder & meetup organizer
    - Guild of Automation and Systems Technology corporate relations & CFO
  3. Kubernetes: A Control Plane for (any) infrastructure

    = A set of automated controllers with operational knowledge of how to control a target system
  4. What?

    - Run anywhere
    - Self-healing
    - Scalable workload scheduling
    - Service discovery + config mgmt
  5. Specify once; Kubernetes makes your dream come true

    [Diagram: JSON container workload specification => HTTP POST JSON object => REST API server]
    *The process doesn’t look exactly like this; it is a simplified mental model for now.
  6. Specify once; Kubernetes makes your dream come true

    [Diagram: JSON container workload specification => HTTP POST JSON object => REST API server; the Container Workload Controller reads the desired state]
    *The process doesn’t look exactly like this; it is a simplified mental model for now.
  7. Specify once; Kubernetes makes your dream come true

    [Diagram: JSON container workload specification => HTTP POST JSON object => REST API server; the Container Workload Controller reads the desired state, then pulls, starts, monitors and re-starts the workload]
    *The process doesn’t look exactly like this; it is a simplified mental model for now.
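The simplified mental model on slides 5 to 7 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ApiServer` and `WorkloadController` are hypothetical in-memory stand-ins, the `name` and `image` fields are made up for the example, and none of this is the real Kubernetes API or schema.

```python
import json


class ApiServer:
    """Hypothetical in-memory stand-in for the REST API server."""

    def __init__(self):
        self._objects = {}

    def post(self, body):
        """Accept a JSON object, as an HTTP POST handler would, and store it."""
        obj = json.loads(body)
        self._objects[obj["name"]] = obj

    def get(self, name):
        return self._objects[name]


class WorkloadController:
    """Toy controller: reads desired state, then pulls, starts, monitors, re-starts."""

    def __init__(self, api):
        self.api = api
        self.pulled = set()   # images already pulled
        self.running = {}     # workload name -> is it currently up?
        self.log = []         # actions taken, in order

    def sync(self, name):
        spec = self.api.get(name)              # read desired state
        image = spec["image"]
        if image not in self.pulled:           # pull the image once
            self.pulled.add(image)
            self.log.append(("pull", image))
        if not self.running.get(name, False):  # monitor: is the workload up?
            action = "re-start" if name in self.running else "start"
            self.running[name] = True
            self.log.append((action, name))


api = ApiServer()
# The "JSON container workload specification" (hypothetical fields).
api.post(json.dumps({"name": "web", "image": "example/web:1.0"}))

controller = WorkloadController(api)
controller.sync("web")             # first sync: pull + start
controller.running["web"] = False  # simulate a crash
controller.sync("web")             # next sync notices and re-starts
print(controller.log)
```

The user specifies the workload once; every later `sync` call only compares what was specified with what is running, which is why the crash is repaired without a new request.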
  8. How?

    - Run anywhere
    - Self-healing
    - Scalable workload scheduling
    - Service discovery + config mgmt

    Closed-loop controllers. Uniform, declarative and extensible API.
  9. Problems hiding in plain sight

    It just takes longer for small-scale users to notice problems due to e.g. randomness.
    [Diagram: small-scale users run 3 servers over 100 days; large-scale users run 100 servers over 3 days]
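The comparison on slide 9 can be put into numbers. The sketch below assumes independent failures and a made-up failure probability per server per day (`P = 0.001` is purely illustrative, not a figure from the talk). Both users accumulate the same 300 server-days, but the large-scale user gets there in 3 days instead of 100.

```python
def server_days(servers, days):
    """Total exposure: each server contributes one server-day per day."""
    return servers * days


def prob_at_least_one_failure(p_per_server_day, servers, days):
    """Chance of seeing at least one failure, assuming independent failures."""
    return 1.0 - (1.0 - p_per_server_day) ** server_days(servers, days)


P = 0.001  # hypothetical failure probability per server per day

small_total = server_days(servers=3, days=100)
large_total = server_days(servers=100, days=3)

# Within the same 3-day window, the large fleet is far more likely to hit a problem.
small_in_3_days = prob_at_least_one_failure(P, servers=3, days=3)
large_in_3_days = prob_at_least_one_failure(P, servers=100, days=3)

print(small_total, large_total)
print(small_in_3_days, large_in_3_days)
```

Under these assumptions the problem is equally present for both users; the small-scale user simply needs much more calendar time before it shows up.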
  10. Google Finding: “Failure is the Norm”

    “deliberately leave significant headroom for workload growth, occasional ‘black swan’ events, load spikes, machine failures, hardware upgrades, and large-scale partial failures (e.g., a power supply bus duct)”
    Source: (Verma et al., 2015)
  11. Entropy: Systems become less ordered

    [Diagram: entropy plotted against time; labels: Order, Start, Stop, Chaos. Area of uncertainty grows!]
  12. Entropy: Putting order to chaos

    [Diagram: entropy plotted against time; labels: Order, Start, Stop, Chaos. Reversing, ordering process]
  13. Kubernetes: The dishwasher of servers

    [Diagram: entropy plotted against time; labels: Order, Start, Stop, Chaos. Reversing, ordering process]
  14. What does this mean for server systems?

    Legend: Operating System v1/v2/v3; Configuration A/B/C; Power On/Off/Out
    Server 1: OS v1, Config A, Power On
    Server 2: OS v1, Config A, Power On
    Server 3: OS v1, Config A, Power On

    Example: Sysadmin A gets three new servers, and installs the same operating system onto all of them, with exactly the same configuration. In the beginning, the system is completely ordered; all instances are identically configured.
  15. What does this mean for server systems?

    Server 1: OS v1, Config A, Power On
    Server 2: OS v2, Config A, Power On
    Server 3: OS v2, Config A, Power On

    After some time, a critical “v2” security upgrade to the operating system becomes available, and sysadmin A upgrades servers 2 and 3, but not 1, as it is running a critical database service, so A is afraid to disturb it.
  16. What does this mean for server systems?

    Server 1: OS v1, Config B, Power On (slow disk access time)
    Server 2: OS v2, Config A, Power On
    Server 3: OS v2, Config A, Power On

    Server 1 complains about slow disk access time, due to a misconfiguration in the operating system. Sysadmin A fixes it imperatively on the complaining server until the complaint stops, but does not touch the other servers.
  17. What does this mean for server systems?

    Server 1: OS v1, Config B, Power On
    Server 2: OS v2, Config A, Power Off
    Server 3: OS v2, Config A, Power On

    Sysadmin A has noticed that the number of users has dropped because of a seasonal trend, so A decides to turn server 2 off to save on energy costs.
  18. What does this mean for server systems?

    Server 1: OS v1, Config B, Power On
    Server 2: OS v2, Config A, Power Off
    Server 3: OS v2, Config C, Power On (slow disk access time)

    The next week, when sysadmin A is on vacation, server 3 complains about the same error as server 1 earlier. Sysadmin B “solves” the issue (in a different way than A did for server 1), but does nothing to the other servers.
  19. What does this mean for server systems?

    Server 1: OS v1, Config B, Power On
    Server 2: OS v2, Config A, Power Off
    Server 3: OS v3, Config C, Power On (desired state change)

    Now, a new version of the operating system is released with a very cool feature that would be useful to the sysadmins. However, upgrading is risky because of incompatibilities, so they only upgrade server 3 to try it out.
  20. What does this mean for server systems?

    Server 1: OS v1, Config B, Power On
    Server 2: OS v2, Config A, Power Off
    Server 3: OS v3, Config C, Power Out (emergent state change)

    Suddenly, a thunderstorm enters the area where the servers are, and lightning strikes. Due to the lack of overvoltage protection, server 3’s power supply becomes unusable, and the server shuts down.
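The drift that builds up over slides 14 to 20 can be made explicit by comparing each server with a declared baseline. The sketch below takes the initial state (OS v1, Config A, Power On) as the desired state, which is an assumption made for illustration; the slides themselves mark the v3 upgrade as a desired state change. The names `ServerState` and `drift` are hypothetical.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ServerState:
    os_version: str
    config: str
    power: str


def drift(desired, actual):
    """Return {field: (desired, actual)} for every field that differs."""
    want, have = asdict(desired), asdict(actual)
    return {field: (want[field], have[field]) for field in want if want[field] != have[field]}


# Assumed baseline: the identical state all three servers started in.
DESIRED = ServerState(os_version="v1", config="A", power="On")

# Actual state of the fleet as of slide 20.
FLEET = {
    1: ServerState(os_version="v1", config="B", power="On"),
    2: ServerState(os_version="v2", config="A", power="Off"),
    3: ServerState(os_version="v3", config="C", power="Out"),
}

for server_id, actual in FLEET.items():
    print(server_id, drift(DESIRED, actual))
```

No two servers report the same drift, which is the point of the example: a handful of individually reasonable manual changes leaves a fleet where every machine is different.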
  21. Key Takeaways

    a) Systems are inevitably becoming less ordered, and thus
    b) need some periodic corrective action to steer the course towards
    c) some declared desired state of the system.
  22. Declarative + reconciler vs imperative

    Imperative flow: the system operated (e.g. webservers) is in faulty conditions, e.g. too few replicas. The admin clicks a corrective action in the web UI.
  23. Declarative + reconciler vs imperative

    Imperative flow: the system operated is in the correct state. The admin goes home / to sleep.
  24. Declarative + reconciler vs imperative

    Imperative flow: half of the replicas go down :( The admin is home / asleep. Nothing happens; either the admin has to wake up, or the fix waits until the next morning. Area of uncertainty grows!
  25. Declarative + reconciler vs imperative

    Declarative flow: the admin defines the desired state in the web UI, and it is saved in the Desired State Store.
  26. Declarative + reconciler vs imperative

    Declarative flow: shortly thereafter, the reconciler syncs: it gets the desired state from the Desired State Store and scales up. System operating.
  27. Declarative + reconciler vs imperative

    Declarative flow: system operating. The admin goes home / to sleep.
  28. Declarative + reconciler vs imperative

    Declarative flow: half of the replicas go down :( The admin is home / asleep.
  29. Declarative + reconciler vs imperative

    Declarative flow: the periodic reconciler sync sees the drift, gets the desired state and scales up. Replicas scaled up to full health. The admin is home / asleep.
  30. Declarative + reconciler vs imperative

    Declarative flow: system operating in good condition. The admin is home / asleep. This design philosophy is why e.g. Kubernetes is called “self-healing”.
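The declarative flow on slides 25 to 30 can be sketched as follows. `DesiredStateStore`, `System` and `sync` are hypothetical names chosen for this illustration, and the sketch only handles the scale-up case shown on the slides.

```python
class DesiredStateStore:
    """Holds the declared replica count; stands in for the Desired State Store."""

    def __init__(self, replicas):
        self.replicas = replicas


class System:
    """The operated system, e.g. a set of webserver replicas."""

    def __init__(self):
        self.running = 0

    def scale_up(self, count):
        self.running += count

    def fail(self, count):
        self.running = max(0, self.running - count)


def sync(store, system):
    """One reconciler pass: get desired state, compare, scale up if short.

    Returns the number of replicas started in this pass.
    """
    missing = store.replicas - system.running
    if missing > 0:
        system.scale_up(missing)
        return missing
    return 0


store = DesiredStateStore(replicas=4)  # admin defines the desired state once
system = System()

started_first = sync(store, system)    # shortly thereafter: scale up to 4
# ... admin goes home / to sleep ...
system.fail(2)                         # half of the replicas go down
started_later = sync(store, system)    # periodic sync sees drift, scales up
started_idle = sync(store, system)     # nothing to do when state matches

print(started_first, started_later, started_idle, system.running)
```

In a real deployment `sync` would run on a timer or in response to change notifications; the admin is not involved after the initial declaration.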
  31. Abstraction Layers: Pluggable interfaces

    Cloud Native is all about pluggable APIs forming consistent abstractions that projects can implement and/or rely on. These CNCF/LF projects contain only a specification, no implementation:
  32. Kubernetes is a “platform for platforms”

    Platform A, Platform B, Platform C, Platform D
  33. Kubernetes is a “platform for platforms”

    Platform A, Platform B, Platform C, Platform D
  34. Controllers, or reconcile loops, fulfill the claim(s)

    Components: Desired State Source, Observe and diff, Target System
    Flows: 1: Desired State; 2: Actual State
  35. Controllers, or reconcile loops, fulfill the claim(s)

    Components: Desired State Source, Observe and diff, Act, Target System
    Flows: 1: Desired State; 2: Actual State; 3: Action Plan
  36. Controllers, or reconcile loops, fulfill the claim(s)

    Components: Desired State Source, Observe and diff, Act, Target System
    Flows: 1: Desired State; 2: Actual State; 3: Action Plan; 4: Action
  37. Controllers, or reconcile loops, fulfill the claim(s)

    Components: Desired State Source, Observe and diff, Act, Report (Actual State Sink), Target System
    Flows: 1: Desired State; 2, (6): Actual State; 3: Action Plan; 4: Action; 5: Result
  38. Controllers, or reconcile loops, fulfill the claim(s)

    Components: Desired State Source, Observe and diff, Act, Report (Actual State Sink), Target System
    Flows: 1: Desired State; 2, (6): Actual State; 3: Action Plan; 4: Action; 5: Result; 7: Requeue
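The seven numbered flows on slide 38 map onto a small loop. The sketch below is one possible reading of the diagram, not the author's implementation: plain dictionaries stand in for the Desired State Source, the Target System and the Actual State Sink, and `reconcile` and `run` are hypothetical names.

```python
import copy
from collections import deque


def reconcile(key, desired_source, target, status_sink):
    """One pass for a single object; returns True if the key should be requeued."""
    desired = desired_source.get(key)  # 1: desired state
    actual = target.get(key)           # 2: actual state
    changed = False
    if desired != actual:              # observe and diff
        # 3: action plan
        plan = ("delete", None) if desired is None else ("apply", desired)
        # 4: action against the target system
        if plan[0] == "delete":
            target.pop(key, None)
        else:
            target[key] = copy.deepcopy(plan[1])
        changed = True                 # 5: result of the action
    observed = target.get(key)
    # 6: report the actual state to the sink
    status_sink[key] = {"actual": observed, "in_sync": observed == desired}
    return changed                     # 7: requeue after a change to verify it


def run(queue, desired_source, target, status_sink, max_passes=10):
    """Drain the work queue, requeueing keys whose state was just changed."""
    passes = 0
    while queue and passes < max_passes:
        key = queue.popleft()
        passes += 1
        if reconcile(key, desired_source, target, status_sink):
            queue.append(key)
    return passes


desired_source = {"web": {"replicas": 3}}
target = {"web": {"replicas": 1}, "old": {"replicas": 2}}
status = {}

passes = run(deque(["web", "old"]), desired_source, target, status)
print(passes, target, status)
```

Each key is visited twice here: once to act, and once more after the requeue to confirm that the target now matches the desired state. The `max_passes` bound is a safety net for this toy loop so that it always terminates.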
  39. Operators: Encode human-like knowledge

    = Automated reconcile loops with “human-like” operational knowledge. Coined in 2016 by Brandon Philips, back then at CoreOS.
  40. Example: Cilium

    - Implements Kubernetes APIs: Endpoints, Services, Ingress & Gateway
    - Registers its custom Network APIs with Kubernetes for advanced features
    - “Compiles” routing and eBPF rules for you on the fly, based on the desired state you specified in the cluster => you never have to write detailed rules
    - Encodes human-like operational knowledge about configuring networks into a reusable tool controlled by declarative APIs
  41. Check out my thesis for more details!

    Encoding human-like operational knowledge using declarative Kubernetes operator patterns
    Available openly on GitHub: https://github.com/luxas/research (CC-BY-SA 4.0 licensed)
  42. Control Theory (Vallery Lancey, QCon, 2018)

    My talk on control theory + declarative APIs = Kubernetes
  43. Wrap-up: The 4 Whys

    1. “Control through choreography” based on experience
  44. Wrap-up: The 4 Whys

    1. “Control through choreography” based on experience
    2. Periodic controller action for fighting inevitable chaos
  45. Wrap-up: The 4 Whys

    1. “Control through choreography” based on experience
    2. Periodic controller action for fighting inevitable chaos
    3. Declarativeness allows defining a (portable) end goal
  46. Wrap-up: The 4 Whys

    1. “Control through choreography” based on experience
    2. Periodic controller action for fighting inevitable chaos
    3. Declarativeness allows defining a (portable) end goal
    4. Control loops + extensible declarative APIs = operators
  47. Kubernetes = 1 database + 1 REST API + 30 operators

    Uniform, declarative and extensible REST API
  48. Summary

    Thank you!
    - @luxas on GitHub
    - @luxas on LinkedIn
    - @luxas on SpeakerDeck
    - @kubernetesonarm on Twitter
    - [email protected]
    Image credit: Baim Hanif on Unsplash