Understanding Kubernetes Through Real-World Phenomena and Analogies

Presented at the DevOps Finland meetup: https://www.meetup.com/DevOps-Finland

Lucas Käldström

May 09, 2023

Transcript

  1. Understanding Kubernetes Through
    Real-World Phenomena and Analogies
    Lucas Käldström - CNCF Ambassador
    May 9, 2023 – Helsinki
    Image credit: CNCF

  2. © 2023 Lucas Käldström
    $ whoami
    Lucas Käldström, 4th-year BSc student at Aalto University, Finland
    CNCF Ambassador, Certified Kubernetes Administrator
    and Emeritus Kubernetes WG/SIG Lead
    KubeCon Speaker in Berlin, Austin,
    Copenhagen, Shanghai, Seattle, San Diego & Valencia
    KubeCon Keynote Speaker in Barcelona
    Former Kubernetes approver and subproject owner,
    active in the OSS community for 7+ years.
    Worked on e.g. SIG Cluster Lifecycle => kubeadm to GA.
    Weaveworks contractor, Weave Ignite & libgitops author
    Cloud Native Nordics co-founder & meetup organizer
    Guild of Automation and Systems Technology corporate relations & CFO

  3. Credits to Simon Sinek

  4. Credits to Simon Sinek

  5. Let’s start by defining it

  6. A Container Orchestrator? Yes

  7. A Container Orchestrator? Yes
    But in fact, even more than that

  8. Kubernetes: A Control Plane for (any) infrastructure

  9. Kubernetes: A Control Plane for (any) infrastructure
    = A set of automated controllers with operational
    knowledge of how to control a target system

  11. Run anywhere
    Self-healing
    Scalable workload scheduling
    Service discovery + config mgmt
    What?

  12. Specify once; Kubernetes makes your dream come true
    [Diagram: JSON container workload specification -> HTTP POST JSON object -> REST API server]
    *The process doesn’t look exactly like this; it is a simplified mental model for now

  13. Specify once; Kubernetes makes your dream come true
    [Diagram: JSON container workload specification -> HTTP POST JSON object -> REST API server;
    the Container Workload Controller reads the desired state]
    *The process doesn’t look exactly like this; it is a simplified mental model for now

  14. Specify once; Kubernetes makes your dream come true
    [Diagram: JSON container workload specification -> HTTP POST JSON object -> REST API server;
    the Container Workload Controller reads the desired state,
    then performs: pull, start, restart, monitor]
    *The process doesn’t look exactly like this; it is a simplified mental model for now
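
As a minimal sketch of the "specify once" step above, the following Python builds a JSON workload specification of the kind that would be sent in an HTTP POST to the API server. The object fields and the `/apis/apps/v1/namespaces/<namespace>/deployments` path follow the public Kubernetes Deployment API; the name `hello-web`, the image, the replica count and the helpers `build_deployment` and `post_target` are illustrative assumptions that do not appear in the talk. No request is actually sent.

```python
import json


def build_deployment(name: str, image: str, replicas: int) -> dict:
    """Return a Deployment object describing the desired state (hypothetical helper)."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            # "I want this many copies running"; nothing says *how* to get there.
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


def post_target(namespace: str) -> str:
    """Return the REST path a Deployment would be POSTed to (hypothetical helper)."""
    return f"/apis/apps/v1/namespaces/{namespace}/deployments"


# Assumed example values, chosen only for illustration.
desired = build_deployment("hello-web", "nginx:1.25", replicas=3)
body = json.dumps(desired, indent=2)

print("POST", post_target("default"))
print(body)
```

The specification only states the end goal; pulling, starting, restarting and monitoring are left to the controller that reads it.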

  15. Run anywhere
    Self-healing
    Scalable workload scheduling
    Service discovery + config mgmt
    How?
    Closed-loop controllers
    Uniform, declarative and extensible API

  16. Credits to Simon Sinek

  17. What problem are we trying to solve?

  18. Based on decades of experience

  19. Comes with 24 pages of API design guidelines!

  20. But is it inherently “too complex” for most?

  21. Problems hiding in plain sight
    It just takes longer for small-scale users to
    notice problems due to e.g. randomness
    [Figure: servers vs. time.
    Small-scale users: 3 servers over 100 days.
    Large-scale users: 100 servers over 3 days.]
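
The two panels above cover the same total exposure, which a few lines of arithmetic make explicit. This is a minimal sketch: the constant per-server failure rate `ASSUMED_FAILURES_PER_SERVER_DAY` is a made-up number used only for illustration, and the function names are hypothetical.

```python
def server_days(servers: int, days: int) -> int:
    """Total exposure: how many server-days of operation have accumulated."""
    return servers * days


def expected_failures(servers: int, days: int, failures_per_server_day: float) -> float:
    """Expected number of failures, assuming a constant, independent failure rate."""
    return server_days(servers, days) * failures_per_server_day


# Assumption for illustration only; not a figure from the talk.
ASSUMED_FAILURES_PER_SERVER_DAY = 0.01

small = expected_failures(3, 100, ASSUMED_FAILURES_PER_SERVER_DAY)   # 3 servers, 100 days
large = expected_failures(100, 3, ASSUMED_FAILURES_PER_SERVER_DAY)   # 100 servers, 3 days

print("small-scale exposure:", server_days(3, 100), "server-days")
print("large-scale exposure:", server_days(100, 3), "server-days")
print("same expected number of failures:", small == large)
```

Under this assumption both users meet the same problems on average; the small-scale user simply needs about 100 days to see what the large-scale user sees in 3.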

  22. => unknown unknowns for small systems

  23. Chaos is Inevitable

  24. Google Finding: “Failure is the Norm”

  25. Google Finding: “Failure is the Norm”
    “deliberately leave significant headroom for
    workload growth, occasional ‘black swan’ events,
    load spikes, machine failures, hardware upgrades,
    and large-scale partial failures
    (e.g., a power supply bus duct)”
    Source: (Verma et al., 2015)

  26. Entropy: Systems become less ordered
    [Figure: entropy vs. time, from Start to Stop; the system moves from Order towards Chaos]
    area of uncertainty grows!

  27. Entropy: Putting order to chaos
    [Figure: entropy vs. time, from Start to Stop; the system moves from Order towards Chaos]
    Reversing, ordering process

  28. Kubernetes: The dishwasher of servers
    [Figure: entropy vs. time, from Start to Stop; the system moves from Order towards Chaos]
    Reversing, ordering process

  29. What does this mean for server systems?
    [Figure: servers 1-3 plotted on three axes: Operating System (v1, v2, v3),
    Configuration (A, B, C), Power (On, Off, Out)]
    Server 1: OS v1, Config A, Power On
    Server 2: OS v1, Config A, Power On
    Server 3: OS v1, Config A, Power On
    Example: Sysadmin A gets three new servers and installs the
    same operating system on all of them, with exactly the
    same configuration.
    In the beginning, the system is completely ordered: all
    instances are identically configured.

  30. What does this mean for server systems?
    [Figure: servers 1-3 plotted on three axes: Operating System (v1, v2, v3),
    Configuration (A, B, C), Power (On, Off, Out)]
    Server 1: OS v1, Config A, Power On
    Server 2: OS v2, Config A, Power On
    Server 3: OS v2, Config A, Power On
    After some time, a critical “v2” security upgrade to the
    operating system becomes available. Sysadmin A upgrades
    servers 2 and 3, but not server 1: it runs a critical
    database service, and A is afraid to disturb it.

  31. What does this mean for server systems?
    [Figure: servers 1-3 plotted on three axes: Operating System (v1, v2, v3),
    Configuration (A, B, C), Power (On, Off, Out); server 1 reports “Slow disk access time”]
    Server 1: OS v1, Config B, Power On
    Server 2: OS v2, Config A, Power On
    Server 3: OS v2, Config A, Power On
    Server 1 complains about slow disk access time, due to a
    misconfiguration in the operating system. Sysadmin A fixes it
    imperatively on the complaining server until the complaint stops,
    but does not apply the fix to any of the other servers.

  32. What does this mean for server systems?
    [Figure: servers 1-3 plotted on three axes: Operating System (v1, v2, v3),
    Configuration (A, B, C), Power (On, Off, Out)]
    Server 1: OS v1, Config B, Power On
    Server 2: OS v2, Config A, Power Off
    Server 3: OS v2, Config A, Power On
    Sysadmin A has noticed that the number of users has dropped
    because of a seasonal trend, so A decides to turn server 2 off
    to save on energy costs.

  33. What does this mean for server systems?
    [Figure: servers 1-3 plotted on three axes: Operating System (v1, v2, v3),
    Configuration (A, B, C), Power (On, Off, Out); server 3 reports “Slow disk access time”]
    Server 1: OS v1, Config B, Power On
    Server 2: OS v2, Config A, Power Off
    Server 3: OS v2, Config C, Power On
    The next week, when sysadmin A is on vacation, server 3
    complains about the same error as server 1 did earlier. Sysadmin B
    “solves” the issue (in a different way than A did for server 1), but
    does nothing to the other servers.

  34. What does this mean for server systems?
    [Figure: servers 1-3 plotted on three axes: Operating System (v1, v2, v3),
    Configuration (A, B, C), Power (On, Off, Out); annotation: desired state change]
    Server 1: OS v1, Config B, Power On
    Server 2: OS v2, Config A, Power Off
    Server 3: OS v3, Config C, Power On
    Now, a new version of the operating system is released with a
    very cool feature that would be useful to the sysadmins.
    However, upgrading is risky because of incompatibilities, so
    they only upgrade server 3 to try it out.

  35. What does this mean for server systems?
    [Figure: servers 1-3 plotted on three axes: Operating System (v1, v2, v3),
    Configuration (A, B, C), Power (On, Off, Out); annotation: emergent state change]
    Server 1: OS v1, Config B, Power On
    Server 2: OS v2, Config A, Power Off
    Server 3: OS v3, Config C, Power Out
    Suddenly, a thunderstorm enters the area where the servers
    are, and lightning strikes. Due to the lack of overvoltage
    protection, server 3’s power supply becomes unusable, and
    the server shuts down.
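
The drift that has built up over this example can be measured by comparing each server with a declared baseline. The sketch below uses the server states from the final slide of the walkthrough; treating the initial state (OS v1, Config A, Power On) as the baseline is an assumption made for illustration, and the `ServerState` and `drift` names are hypothetical.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ServerState:
    os: str
    config: str
    power: str


# Assumed baseline: the identical state all three servers started in.
BASELINE = ServerState(os="v1", config="A", power="On")

# Actual states after the upgrades, ad-hoc fixes, power-down and lightning strike.
ACTUAL = {
    1: ServerState(os="v1", config="B", power="On"),
    2: ServerState(os="v2", config="A", power="Off"),
    3: ServerState(os="v3", config="C", power="Out"),
}


def drift(desired: ServerState, actual: ServerState) -> dict:
    """Return {field: (desired, actual)} for every field that differs."""
    d, a = asdict(desired), asdict(actual)
    return {field: (d[field], a[field]) for field in d if d[field] != a[field]}


report = {server: drift(BASELINE, state) for server, state in ACTUAL.items()}
distinct_states = len(set(ACTUAL.values()))

for server, diff in report.items():
    print(f"server {server}: {len(diff)} drifted field(s): {diff}")
print("distinct server states:", distinct_states)
```

Three servers that started out identical now occupy three different states, and no single person made all of the changes that led there.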

  36. Entropy: Systems become less ordered
    [Figure: entropy vs. time, from Start to Stop; the system moves from Order towards Chaos]

  37. Kubernetes: The dishwasher of servers

  38. Kubernetes: The dishwasher of servers

  41. Game Theory: An Infinite Game against Chaos

  42. Key Takeaways
    a) Systems inevitably become less ordered, and thus
    b) need some periodic corrective action to steer the course towards
    c) some declared desired state of the system.

  43. Declarative + reconciler vs imperative
    Imperative flow:
    [Diagram: Click in Web UI -> Corrective Action -> System operated (e.g. webservers),
    in faulty conditions, e.g. too few replicas]

  44. Declarative + reconciler vs imperative
    Imperative flow:
    [Diagram: Web UI; System operated, correct state]
    Admin goes home / to sleep

  45. Declarative + reconciler vs imperative
    Imperative flow:
    [Diagram: Web UI; half of the replicas go down :(]
    Admin is home / asleep
    Nothing happens; either the admin has to wake up,
    or the fix has to wait until the next morning
    Area of uncertainty grows!

  46. Declarative + reconciler vs imperative
    Declarative flow:
    [Diagram: Web UI -> Define Desired State -> Desired State Store]

  47. Declarative + reconciler vs imperative
    Declarative flow:
    [Diagram: Web UI; Desired State Store; System operating]
    Shortly thereafter, reconciler sync: get desired state, scale up

  48. Declarative + reconciler vs imperative
    Declarative flow:
    [Diagram: Web UI; Desired State Store; System operating]
    Admin goes home / to sleep

  49. Declarative + reconciler vs imperative
    Declarative flow:
    [Diagram: Web UI; Desired State Store; half of the replicas go down :(]
    Admin is home / asleep

  50. Declarative + reconciler vs imperative
    Declarative flow:
    [Diagram: Web UI; Desired State Store; replicas scaled up to full health]
    Periodic reconciler sync sees drift: get desired state, scale up
    Admin is home / asleep

  51. Declarative + reconciler vs imperative
    Declarative flow:
    [Diagram: Web UI; Desired State Store; System operating in good condition]
    Admin is home / asleep
    This design philosophy is why e.g. Kubernetes is
    called “self-healing”.
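
The difference between the two flows above can be sketched as a tiny simulation. This is a toy model under stated assumptions, not how Kubernetes is implemented: the replica count of 4, the four time steps, and the `reconcile` and `simulate` helpers are all hypothetical, and the reconciler here only scales up.

```python
DESIRED_REPLICAS = 4  # assumed desired state, kept in the "Desired State Store"


def reconcile(desired_replicas: int, running: int) -> int:
    """One periodic sync: return how many replicas to start (never negative)."""
    return max(desired_replicas - running, 0)


def simulate(with_reconciler: bool) -> list[int]:
    """Return the number of running replicas after each time step."""
    running = DESIRED_REPLICAS
    history = []
    for tick in range(4):
        if tick == 1:
            running //= 2  # half of the replicas go down while the admin is asleep
        elif with_reconciler:
            running += reconcile(DESIRED_REPLICAS, running)  # sees drift, scales up
        history.append(running)
    return history


imperative = simulate(with_reconciler=False)
declarative = simulate(with_reconciler=True)

print("imperative: ", imperative)
print("declarative:", declarative)
```

In the imperative run the replica count stays low until a human intervenes; in the declarative run the next sync after the failure restores it without anyone being awake.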

  52. WHAT

  53. HOW

  54. HOW

  55. “If you don’t know where you’re going,
    any road will take you there”

  56. controllers + extensible API = abstraction layer

  57. Abstraction Layers: Pluggable interfaces
    Cloud Native is all about pluggable APIs forming consistent abstractions that
    projects can implement and/or rely on.
    These CNCF/LF projects contain only a specification, no implementation:

  58. Kubernetes is a “platform for platforms”
    Platform A, Platform B, Platform C, Platform D

  59. Kubernetes is a “platform for platforms”
    Platform A, Platform B, Platform C, Platform D

  60. *except for the problem of too many layers of indirection :D

  61. Controllers, or reconcile loops, fulfill the claim(s)
    [Diagram: an “Observe and diff” stage connected to a
    Desired State Source and a Target System]
    1: Desired State
    2: Actual State

  62. Controllers, or reconcile loops, fulfill the claim(s)
    [Diagram: “Observe and diff” and “Act” stages connected to a
    Desired State Source and a Target System]
    1: Desired State
    2: Actual State
    3: Action Plan

  63. Controllers, or reconcile loops, fulfill the claim(s)
    [Diagram: “Observe and diff” and “Act” stages connected to a
    Desired State Source and a Target System]
    1: Desired State
    2: Actual State
    3: Action Plan
    4: Action

  64. Controllers, or reconcile loops, fulfill the claim(s)
    [Diagram: “Observe and diff”, “Act” and “Report (Actual State Sink)” stages
    connected to a Desired State Source and a Target System]
    1: Desired State
    2, (6): Actual State
    3: Action Plan
    4: Action
    5: Result

  65. Controllers, or reconcile loops, fulfill the claim(s)
    [Diagram: “Observe and diff”, “Act” and “Report (Actual State Sink)” stages
    connected to a Desired State Source and a Target System]
    1: Desired State
    2, (6): Actual State
    3: Action Plan
    4: Action
    5: Result
    7: Requeue
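
The numbered steps in the diagram above can be sketched as a small loop. This is a toy model, not the Kubernetes controller machinery: `TargetSystem`, `reconcile_once` and `run` are hypothetical names, state is reduced to a single replica count, and requeueing only after a change was made is a simplifying assumption.

```python
from collections import deque


class TargetSystem:
    """Toy target system: nothing but a replica counter (hypothetical)."""

    def __init__(self, replicas: int) -> None:
        self.replicas = replicas


def reconcile_once(desired_source: dict, target: TargetSystem, status_sink: dict) -> bool:
    """Run one pass of the loop; return True if the pass should be requeued."""
    desired = desired_source["replicas"]        # 1: desired state
    actual = target.replicas                    # 2: actual state
    plan = desired - actual                     # 3: action plan (the diff)
    if plan != 0:
        target.replicas += plan                 # 4: action
    result = "in sync" if plan == 0 else f"changed by {plan:+d}"  # 5: result
    status_sink["replicas"] = target.replicas   # 6: report actual state
    status_sink["lastResult"] = result
    return plan != 0                            # 7: requeue to verify the change


def run(desired_source: dict, target: TargetSystem, status_sink: dict, max_rounds: int = 10) -> int:
    """Process the work queue until nothing is requeued; return the rounds used."""
    queue = deque(["sync"])
    rounds = 0
    while queue and rounds < max_rounds:
        queue.popleft()
        rounds += 1
        if reconcile_once(desired_source, target, status_sink):
            queue.append("sync")
    return rounds


desired_source = {"replicas": 3}
target = TargetSystem(replicas=1)
status_sink = {}
rounds = run(desired_source, target, status_sink)

print("rounds:", rounds)
print("status:", status_sink)
```

The first pass acts and requeues; the second pass observes no difference and stops. A long-running controller would typically also re-run the loop periodically, so that drift is noticed even when nothing triggered a sync.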

  66. Operators: Encode human-like knowledge

  67. Operators: Encode human-like knowledge
    = Automated reconcile loops
    with “human-like” operational knowledge
    Coined in 2016 by Brandon Philips, back then at CoreOS

  68. Example: Cilium
    - Implements Kubernetes APIs: Endpoints, Services, Ingress & Gateway
    - Registers its custom Network APIs with Kubernetes for advanced features
    - “Compiles” routing and eBPF rules for you on the fly, based on the
      desired state you specified in the cluster
      => you never have to write detailed rules
    - Encodes human-like operational knowledge about configuring networks
      into a reusable tool controlled by declarative APIs
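
The idea of “compiling” detailed rules from a declared desired state can be illustrated with a deliberately simple toy. Nothing below is Cilium code or a Cilium API: the desired-state shape, the rule text format, the addresses and the `compile_rules` function are all hypothetical and exist only to show the pattern of deriving low-level rules from a high-level declaration.

```python
def compile_rules(services: dict[str, dict]) -> list[str]:
    """Derive one forwarding rule per backend from the declared services (toy model)."""
    rules = []
    for name, svc in sorted(services.items()):
        for backend in svc["backends"]:
            rules.append(
                f"{name}: forward {svc['vip']}:{svc['port']} -> {backend}:{svc['port']}"
            )
    return rules


# Hypothetical desired state: what the user declares.
desired_state = {
    "web": {"vip": "10.0.0.10", "port": 80, "backends": ["10.1.0.4", "10.1.0.5"]},
}

rules = compile_rules(desired_state)
for rule in rules:
    print(rule)

# When the desired state changes, the rules are regenerated rather than hand-edited.
desired_state["web"]["backends"].append("10.1.0.6")
updated_rules = compile_rules(desired_state)
print(len(updated_rules), "rules after adding a backend")
```

The user only ever edits the declaration; the detailed rules are an output that is recomputed whenever the declaration changes.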

  69. Not: Humans Operating Machines

  70. Instead: Humans Operating Automation
    that in turn Operates Machines

  71. Further Reading

  72. Check out my thesis for more details!
    Available openly on GitHub:
    https://github.com/luxas/research
    CC-BY-SA 4.0 licensed
    Encoding human-like operational knowledge using declarative
    Kubernetes operator patterns

  73. Control Theory
    (Vallery Lancey, QCon, 2018)
    My talk on control theory + declarative APIs = Kubernetes

  74. Promise Theory
    (The Kubernetes Documentary, Honeypot, 2022)

  75. Wrap-up: The 4 Whys:
    1. “Control through choreography” based on experience

  76. Wrap-up: The 4 Whys:
    1. “Control through choreography” based on experience
    2. Periodic controller action for fighting inevitable chaos

  77. Wrap-up: The 4 Whys:
    1. “Control through choreography” based on experience
    2. Periodic controller action for fighting inevitable chaos
    3. Declarativeness allows defining a (portable) end goal

  78. Wrap-up: The 4 Whys:
    1. “Control through choreography” based on experience
    2. Periodic controller action for fighting inevitable chaos
    3. Declarativeness allows defining a (portable) end goal
    4. Control loops + extensible declarative APIs = operators

  79. Kubernetes = 1 database + 1 REST API + 30 operators
    Uniform, declarative and extensible REST API

  80. Summary
    Image credit: Baim Hanif on Unsplash
    Thank you!
    @luxas on GitHub
    @luxas on LinkedIn
    @luxas on SpeakerDeck
    @kubernetesonarm on Twitter
    [email protected]