Managing scalable database clusters with the TiDB Operator

Presented during HTAP Summit 2023 in San Francisco.

Website: https://www.pingcap.com/htap-summit
Abstract page: https://events.bizzabo.com/474592/agenda/speakers/3096751
Recording TBA

Location: Computer History Museum, 1401 N Shoreline Blvd, Mountain View, CA 94043, USA

Abstract:
Why are Kubernetes and other popular cloud native projects designed so differently from previous-generation “VM-era” systems? How have the second law of thermodynamics and control theory shaped cloud native designs? How is the shift from traditionally managing servers to using Kubernetes operators (such as TiDB Operator) similar to the Industrial Revolution?

This talk offers the audience a unique perspective on some common cloud native patterns. Kubernetes and Google Spanner, for example, are often described as designed from “decades of experience”, but what that means in practice is mentioned far less often. Conversely, many newcomers find Kubernetes and similar technologies “too complex”. Why is that, or why is that the impression?

After this talk, the audience will have an improved vocabulary of cloud native philosophy terms, gained by learning the fundamental design philosophies of Kubernetes and cloud native through well-known phenomena and real-world analogies.

This talk also relates the concepts presented to features in TiKV and TiDB, such as consistency control and self-healing. After the concepts are introduced, the TiDB Operator is presented as a case study of the theory.

Lucas Käldström

September 21, 2023

Transcript

  1. © 2023 Lucas Käldström
    Managing scalable database clusters with the TiDB Operator
    Lucas Käldström – CNCF Ambassador
    Mountain View – September 21, 2023

  2. or similarly,
    Cloud Native Philosophy: Why Do We Now Design Software the Way We Do?
    Lucas Käldström – CNCF Ambassador
    Mountain View – September 21, 2023

  3. $ whoami
    Lucas Käldström, 1st-year MSc student at Aalto University, Finland
    CNCF Ambassador, Certified Kubernetes Administrator and Emeritus Kubernetes WG/SIG Lead
    KubeCon Speaker in Berlin, Austin, Copenhagen, Shanghai, Seattle, San Diego & Valencia
    KubeCon Keynote Speaker in Barcelona
    Former Kubernetes approver and subproject owner, active in the OSS community for 7+ years.
    Worked on e.g. SIG Cluster Lifecycle => kubeadm to GA.
    Former Weaveworks contractor, Weave Ignite & libgitops author
    Cloud Native Nordics co-founder & meetup organizer
    Guild of Automation and Systems Technology corporate relations & CFO

  4. Agenda

  5. Agenda
    - Database Sysadmin Complexities
    - Kubernetes Design Architecture
    - A Sysadmin’s Best Friend: The Operator
    - The TiDB Operator
    - Demo Screenshots (not enough time for live demo)

  6. Database Sysadmin Complexities

  7. Why are we here?

  8. Why are we here?
    We want a database for both transaction processing and analytical processing

  9. What does this require?

  10. What does this require?
    - Failure Tolerance and Capacity Demand => Multiple Replicas
    - Multiple Replicas => Consistency Control (Paxos / Raft)
    - Capacity Demands => Sharding
    - and much more!
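
The link between multiple replicas and consistency control comes down to majority arithmetic: Raft and Paxos can make progress only while more than half of the voting replicas can reach each other. Below is a minimal sketch of that arithmetic in Python. The function names are illustrative assumptions and come neither from the talk nor from TiKV.

```python
def quorum_size(replicas: int) -> int:
    """Smallest majority among `replicas` voting members."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Number of members that can fail while a majority still exists."""
    return replicas - quorum_size(replicas)


for n in (1, 2, 3, 4, 5):
    print(f"{n} replicas: quorum={quorum_size(n)}, tolerates={tolerated_failures(n)} failure(s)")
```

Going from 3 to 4 replicas does not add failure tolerance in this model, which is one reason odd replica counts such as 3 or 5 are common.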

  11. What does this mean?
    - Multiple Nodes => Need scheduling logic
    - Consensus Algorithms => We need to take care when:
      - Scaling: Need some kind of “learner mode”
      - Upgrading: Avoid killing the consensus leader; give a proper handoff first
    - Sharding => Nodes hold varying sets of data; one node doesn’t necessarily hold all the data
    - Quickly-changing business requirements => Lots of sysadmin work
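
The upgrade concern on this slide can be pictured as an ordering problem: restart the followers first, then hand leadership to an already-upgraded member before the old leader goes down. The sketch below is a simplified model under those assumptions, not a description of how tidb-operator or PD implement it. `Member`, `rolling_upgrade` and the `pd-N` names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Member:
    name: str
    version: str
    is_leader: bool = False


def rolling_upgrade(members: list[Member], target: str) -> list[str]:
    """Upgrade all members to `target` one at a time and return the step log."""
    log: list[str] = []
    # Followers (is_leader=False) sort before the leader; sorted() is stable.
    for member in sorted(members, key=lambda m: m.is_leader):
        if member.version == target:
            continue  # already upgraded, nothing to do
        if member.is_leader:
            # Hand off to a member that already runs the target version.
            successor = next(
                (m for m in members if m is not member and m.version == target),
                None,
            )
            if successor is not None:
                member.is_leader, successor.is_leader = False, True
                log.append(f"transfer leadership {member.name} -> {successor.name}")
        member.version = target
        log.append(f"restart {member.name} at {target}")
    return log


cluster = [
    Member("pd-0", "v7.1.0", is_leader=True),
    Member("pd-1", "v7.1.0"),
    Member("pd-2", "v7.1.0"),
]
steps = rolling_upgrade(cluster, "v7.1.1")
print("\n".join(steps))
```

In this model the leader is always restarted last, and only after leadership has moved, so the group is never left without a leader because of the upgrade itself.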

  12. (no text on slide)

  13. ⇒ Every non-trivial system requires non-trivial operations

  14. Required sysadmin work grows faster than scale
    (Chart labels: Business scaling requirement, Sysadmin work)

  15. We want to find a generic solution

  16. Kubernetes Design Philosophy

  17. Kubernetes Primer
    - Kubernetes is an open source container orchestration system.
    - Project to solve sysadmin operational challenges of app orchestration
    - Already a decade old (!), the founding project of CNCF, 80,000+ contributors
    - Runs in all environments from own DC to cloud (even on Raspberry Pis!)
    - Super extensible system, you can configure literally everything

  18. Based on decades of experience

  19. Kubernetes Architecture
    (Diagram labels)
    - Raft key-value store: single source of truth, stateful
    - REST API: stateless, declarative and extensible
    - Stateless controllers: these controllers “make stuff happen” (<- reconcile ->)
    - Node, Node, Node

  20. Kubernetes: A Control Plane for (any) infrastructure

  21. Kubernetes: A Control Plane for (any) infrastructure
    = A set of automated controllers with operational knowledge of how to control a target system

  22. Kubernetes: A Control Plane for (any) infrastructure
    = A set of automated controllers with operational knowledge of how to control a target system
    Around 45 (!) of them in Kubernetes v1.28

  23. Kubernetes: A Control Plane for (any) infrastructure
    = A set of automated controllers with operational knowledge of how to control a target system
    “I know how to efficiently schedule workloads to nodes”
    “I know how to heal applications that were on failed nodes”
    “I know how to configure dynamic service discovery”

  24. Kubernetes: A Control Plane for (any) infrastructure
    = A set of automated controllers with operational knowledge of how to control a target system

  25. “Control Through Choreography”
    All user intent is stored in the API server.
    Business logic is split into controllers that make user intent a reality

  26. Why a controller-centric model?

  27. Chaos is Inevitable

  28. Google Finding: “Failure is the Norm”

  29. Google Finding: “Failure is the Norm”
    “deliberately leave significant headroom for workload growth, occasional ‘black swan’ events, load spikes, machine failures, hardware upgrades, and large-scale partial failures (e.g., a power supply bus duct)”
    Source: (Verma et al., 2015)

  30. Entropy: Systems become less ordered
    (Chart labels: Time, Entropy; Order, Chaos; Start, Stop)

  31. Entropy: Putting order to chaos
    (Chart labels: Time, Entropy; Order, Chaos; Start, Stop; Reversing, ordering process)

  32. Kubernetes: The dishwasher of servers
    (Chart labels: Time, Entropy; Order, Chaos; Start, Stop; Reversing, ordering process)

  33. Defining “order” and “chaos”

  34. WHAT

  35. HOW

  36. “If you don’t know where you’re going, any road will take you there”

  37. Key Takeaways
    a) Systems are inevitably becoming less ordered, thus
    b) need some periodic corrective action to steer the course towards
    c) some declared desired state of the system.
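
Takeaways a) to c) can be illustrated with a toy loop: failures push the actual state away from the declared one, and a periodic check computes and applies the correction. This is a deliberately tiny sketch. The replica counts and the failure schedule are made-up values, and `correction` is a hypothetical helper, not a Kubernetes API.

```python
def correction(desired: int, actual: int) -> int:
    """Replicas to add (positive) or remove (negative) to reach the desired state."""
    return desired - actual


desired_replicas = 3  # c) the declared desired state
actual_replicas = 3
# Hypothetical failure schedule: replicas lost at each tick.
failures_per_tick = [0, 1, 0, 2, 0]

history = []
for lost in failures_per_tick:
    actual_replicas = max(0, actual_replicas - lost)       # a) the system drifts
    delta = correction(desired_replicas, actual_replicas)  # b) periodic check
    actual_replicas += delta                               # b) corrective action
    history.append((lost, delta, actual_replicas))

for lost, delta, actual in history:
    print(f"lost={lost} corrected={delta:+d} actual={actual}")
```

The loop never needs to know why a replica disappeared. It only compares the declared state with the observed one, which is what makes the approach generic.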

  38. A sysadmin’s best friend, the operator

  39. Operators: Encode human-like knowledge
    = Automated reconcile loops with “human-like” operational knowledge
    Coined in 2016 by Brandon Philips, back then at CoreOS

  40. Operators: Encode human-like knowledge
    = Automated reconcile loops with “human-like” operational knowledge
    Coined in 2016 by Brandon Philips, back then at CoreOS
    Delegate “repetitive human activities that are devoid of lasting value”

  41. What should an operator do?
    - Keep infrastructure in control: continuously minimizing drift between the desired and actual state,
    - Resource scalability: codify and automate “repetitive human activities that are devoid of lasting value”, by encoding domain-specific knowledge,
    - Monitoring scalability: observe application health, metrics and logs, such that configuration can be adaptively tuned and alerts of any abnormal behavior can be sent seldom but with high importance, and
    - Knowledge scalability: provide a high-level abstraction interface such that the application can be operated by engineers without the domain-specific knowledge otherwise required

  42. Avoids sysadmin management overload

  43. Not: Humans Operating Machines

  44. Instead: Humans Operating Automation that in turn Operate Machines

  45. The TiDB Operator

  46. TiDB Operator Capabilities
    The tidb-operator provides you with TiDB as a Service in your own cluster
    It offers features such as:
    - Multi-Cluster Creation
    - Online up- and downgrades
    - Online up- and downscaling of replicas, even automatically
    - Automatic failover/self-healing
    - Dynamic monitoring
    - Re-configuration of the database
    - Backup and Restore

  47. Operator fulfils the user’s desires
    (Diagram) Components: Desired State Source, Observe and diff, Target System
    Arrows: 1: Desired State; 2, 6: Actual State

  48. Operator fulfils the user’s desires
    (Diagram) Components: Desired State Source, Observe and diff, Act, Target System
    Arrows: 1: Desired State; 2, 6: Actual State; 3: Action Plan; 4: Action

  49. Operator fulfils the user’s desires
    (Diagram) Components: Desired State Source, Observe and diff, Act, Report (Actual State Sink), Target System
    Arrows: 1: Desired State; 2, 6: Actual State; 3: Action Plan; 4: Action; 5: Result; 7: Requeue

  50. Operator fulfils the user’s desires
    (Diagram) Components: Desired State Source, Observe and diff, Act, Report (Actual State Sink), Target System
    Arrows: 1: Desired State; 2, 6: Actual State; 3: Action Plan; 4: Action; 5: Result; 7: Requeue
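
The numbered arrows in the diagram can be read as one pass of a reconcile function. The sketch below models the desired state source, the target system and the actual state sink as plain dictionaries. It is an assumption-laden illustration of the pattern, not tidb-operator code, and the `version` and `replicas` keys are hypothetical field names.

```python
from __future__ import annotations


def plan(desired: dict, actual: dict) -> list[tuple[str, str, object]]:
    """Diff desired against actual state and return an action plan (step 3)."""
    actions: list[tuple[str, str, object]] = []
    for key, want in desired.items():
        if actual.get(key) != want:
            actions.append(("set", key, want))
    for key in sorted(actual.keys() - desired.keys()):
        actions.append(("delete", key, None))
    return actions


def reconcile(desired_source: dict, target: dict, status_sink: dict) -> bool:
    """Run one pass of the loop; return True if another pass should be queued."""
    desired = dict(desired_source)        # 1: read the desired state
    actual = dict(target)                 # 2: observe the actual state
    actions = plan(desired, actual)       # 3: compute the action plan
    for verb, key, value in actions[:1]:  # 4: act, one change per pass
        if verb == "set":
            target[key] = value           # 5: the target returns its result
        else:
            del target[key]
    status_sink.clear()                   # 6: report the new actual state
    status_sink.update(target)
    return bool(actions)                  # 7: requeue to verify convergence


desired = {"version": "v7.1.1", "replicas": 3}
cluster = {"version": "v7.1.0", "replicas": 3}
status: dict = {}

passes = 0
while reconcile(desired, cluster, status):
    passes += 1
print(f"converged after {passes} acting pass(es): {status}")
```

Applying one change per pass and requeueing until the diff is empty keeps each pass small and makes the loop safe to rerun at any time, since a pass with nothing to do changes nothing.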

  51. Demo

  52. Hardened tidb-operator setup
    In this demo, we will initially configure 3 cloud VMs for TiDB, 3 cloud VMs for PD, and 3 cloud VMs for TiKV.
    Further, we will
    1) install the tidb-operator through the CNCF GitOps engine, Flux
    2) set up the monitoring stack (Prometheus, Grafana) to watch performance
    3) create one TiDBCluster with the operator
    4) apply advanced configuration such as topology and upgrade tuning
    This demo is running on UpCloud; thanks for donating cloud credits for this cause!

  53. Upgrading a cluster with a 60k QPS load
    In this demo, we will:
    1) bump the version number from v7.1.0 to v7.1.1 using a GitHub Pull Request,
    2) ⇒ operator upgrades the 3*3-TiDB cluster gracefully,
    3) while serving 60k requests per second (without any reconnects!),
    4) while monitoring TiDB performance
    This demo is running on UpCloud; thanks for donating cloud credits for this cause!

  54. Architecture
    (Diagram: steps numbered 0 to 5; labeled components: tidb-operator, k8s controllers, Node)

  55. Step 1: Change desired configuration in GitHub
    bump to v7.1.1

  56. Step 2: Relax and watch the upgrade
    let the upgrade do the work!

  57. Operator reconciles desired and actual state

  58. We didn’t skip a beat

  59. Recap

  60. What do we **not** have to do?
    - Manual service discovery (for peers, backup and monitoring)
    - Manual TLS setup
    - Manual scaling
    - Manual version upgrades
    - Manual re-configuration
    - Manual disaster recovery
    Real-life footage of sysadmin not having to run 1002 commands to upgrade the database: (image)

  61. Not: Humans Operating Machines

  62. Instead: Humans Operating Automation that in turn Operate Machines

  63. Further Reading

  64. Check out my thesis for more details!
    Encoding human-like operational knowledge using declarative Kubernetes operator patterns
    Available openly on GitHub: https://github.com/luxas/research
    CC-BY-SA 4.0 licensed

  65. Control Theory
    (Vallery Lancey, QCon, 2018)
    I have another talk on control theory + declarative APIs = Kubernetes
    Also check out Vallery Lancey’s great talk on the subject.

  66. Thank you!