Managing scalable database clusters with the TiDB Operator

Slide 1

Slide 1 text

Slide 2

Slide 2 text

Slide 3

Slide 3 text

© 2023 Lucas Käldström

$ whoami

Lucas Käldström
- 1st-year MSc student at Aalto University, Finland
- CNCF Ambassador, Certified Kubernetes Administrator and Emeritus Kubernetes WG/SIG Lead
- KubeCon Speaker in Berlin, Austin, Copenhagen, Shanghai, Seattle, San Diego & Valencia
- KubeCon Keynote Speaker in Barcelona
- Former Kubernetes approver and subproject owner, active in the OSS community for 7+ years. Worked on e.g. SIG Cluster Lifecycle => kubeadm to GA.
- Former Weaveworks contractor, Weave Ignite & libgitops author
- Cloud Native Nordics co-founder & meetup organizer
- Guild of Automation and Systems Technology corporate relations & CFO

Slide 4

Slide 4 text

Slide 5

Slide 5 text

Agenda
- Database Sysadmin Complexities
- Kubernetes Design Architecture
- A Sysadmin’s Best Friend: The Operator
- The TiDB Operator
- Demo Screenshots (not enough time for live demo)

Slide 6

Slide 6 text

Slide 7

Slide 7 text

Slide 8

Slide 8 text

Slide 9

Slide 9 text

Slide 10

Slide 10 text

What does this require?
- Failure Tolerance and Capacity Demand => Multiple Replicas
- Multiple Replicas => Consistency Control (Paxos / Raft)
- Capacity Demands => Sharding
- and much more!
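The link between replica count and failure tolerance can be made concrete with the majority-quorum arithmetic that Raft and Paxos rely on. This is a minimal sketch; the function names are illustrative and not taken from the talk.

```python
def quorum_size(replicas: int) -> int:
    """Smallest majority of a replica group (needed to commit a write)."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas may fail while a majority is still reachable."""
    return replicas - quorum_size(replicas)


for n in (1, 2, 3, 4, 5):
    print(f"{n} replicas: quorum={quorum_size(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

Going from 3 to 4 replicas adds no failure tolerance under this rule, which is one reason odd group sizes are common.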

Slide 11

Slide 11 text

What does this mean?
- Multiple Nodes => Need scheduling logic
- Consensus Algorithms => We need to take care when:
  - Scaling: Need some kind of “learner mode”
  - Upgrading: Avoid killing the consensus leader; give a proper handoff first
- Sharding => Nodes hold varying sets of data; one node doesn’t necessarily hold all the data
- Quickly-changing business requirements => Lots of sysadmin work
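The "avoid killing the consensus leader" point can be sketched as an upgrade ordering: followers first, then a leadership handoff, then the old leader. This is an illustrative toy model, not how any particular database or operator implements it; the `Member` type and the `pd-0`-style names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Member:
    name: str
    version: str
    is_leader: bool = False


def rolling_upgrade(members: list[Member], target: str) -> list[str]:
    """Upgrade followers first; hand leadership off before touching the leader.

    Returns a log of the steps taken, in order.
    """
    log: list[str] = []
    followers = [m for m in members if not m.is_leader]
    leaders = [m for m in members if m.is_leader]

    for m in followers:
        m.version = target
        log.append(f"upgrade {m.name}")

    for old in leaders:
        if followers:
            # Graceful handoff: an already-upgraded follower takes over.
            new = followers[0]
            old.is_leader, new.is_leader = False, True
            log.append(f"transfer leadership {old.name} -> {new.name}")
        old.version = target
        log.append(f"upgrade {old.name}")
    return log


group = [
    Member("pd-0", "v7.1.0", is_leader=True),
    Member("pd-1", "v7.1.0"),
    Member("pd-2", "v7.1.0"),
]
steps = rolling_upgrade(group, "v7.1.1")
for step in steps:
    print(step)
```

The leader is restarted last and only after another member has taken over, so the group never loses its leader abruptly.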

Slide 12

Slide 12 text

Slide 13

Slide 13 text

Slide 14

Slide 14 text

Slide 15

Slide 15 text

Slide 16

Slide 16 text

Slide 17

Slide 17 text

Kubernetes Primer
- Kubernetes is an open source container orchestration system.
- A project to solve the sysadmin’s operational challenges of app orchestration
- Already a decade old (!), the founding project of CNCF, 80,000+ contributors
- Runs in all environments, from your own DC to the cloud (even on Raspberry Pis!)
- Super extensible system; you can configure literally everything

Slide 18

Slide 18 text

Slide 19

Slide 19 text

Kubernetes Architecture (diagram)
- Raft key-value store: stateful, the single source of truth
- REST API: stateless, declarative and extensible
- Stateless controllers: these controllers “make stuff happen” (<- reconcile ->)
- Node, Node, …

Slide 20

Slide 20 text

Slide 21

Slide 21 text

Slide 22

Slide 22 text

Kubernetes: A Control Plane for (any) infrastructure
= A set of automated controllers with operational knowledge of how to control a target system
Around 45 (!) of them in Kubernetes v1.28

Slide 23

Slide 23 text

Kubernetes: A Control Plane for (any) infrastructure
= A set of automated controllers with operational knowledge of how to control a target system
- “I know how to efficiently schedule workloads to nodes”
- “I know how to heal applications that were on failed nodes”
- “I know how to configure dynamic service discovery”

Slide 24

Slide 24 text

Slide 25

Slide 25 text

“Control Through Choreography”
- All user intent is stored in the API server.
- Business logic is split into controllers that make user intent a reality.
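A toy model may help show what choreography means here: two controllers that never call each other and only read and write shared state. The dict below stands in for the API server; the controller functions, node names and pod names are all hypothetical and far simpler than the real Kubernetes scheduler or node lifecycle logic.

```python
from __future__ import annotations

# Shared state standing in for the API server: the only thing controllers touch.
store = {
    "nodes": {"node-a": "Ready", "node-b": "Ready"},
    "pods": {"web-0": None, "web-1": None},  # pod -> assigned node (None = unscheduled)
}


def scheduler(state: dict) -> bool:
    """Place unscheduled pods on the least-loaded ready node. True if it changed state."""
    changed = False
    for pod, node in list(state["pods"].items()):
        ready = [n for n, status in state["nodes"].items() if status == "Ready"]
        if node is None and ready:
            load = {n: sum(1 for v in state["pods"].values() if v == n) for n in ready}
            state["pods"][pod] = min(ready, key=lambda n: (load[n], n))
            changed = True
    return changed


def healer(state: dict) -> bool:
    """Release pods from nodes that are no longer ready. True if it changed state."""
    changed = False
    for pod, node in list(state["pods"].items()):
        if node is not None and state["nodes"].get(node) != "Ready":
            state["pods"][pod] = None  # hand the pod back; the scheduler places it again
            changed = True
    return changed


def run_until_settled(state: dict, controllers) -> None:
    """Run every controller repeatedly until a full pass changes nothing."""
    while any([controller(state) for controller in controllers]):
        pass


run_until_settled(store, [scheduler, healer])
placement_before = dict(store["pods"])

store["nodes"]["node-b"] = "NotReady"  # simulate a node failure
run_until_settled(store, [scheduler, healer])
placement_after = dict(store["pods"])
print(placement_before, placement_after)
```

Neither controller knows the other exists; the healer only clears an assignment, and the scheduler picks the pod up again on its next pass.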

Slide 26

Slide 26 text

Slide 27

Slide 27 text

Chaos is Inevitable

Slide 28

Slide 28 text

Google Finding: “Failure is the Norm”

Slide 29

Slide 29 text

Google Finding: “Failure is the Norm”
“deliberately leave significant headroom for workload growth, occasional ‘black swan’ events, load spikes, machine failures, hardware upgrades, and large-scale partial failures (e.g., a power supply bus duct)”
Source: (Verma et al., 2015)

Slide 30

Slide 30 text

Slide 31

Slide 31 text

Slide 32

Slide 32 text

Slide 33

Slide 33 text

Slide 34

Slide 34 text

Slide 35

Slide 35 text

Slide 36

Slide 36 text

“If you don’t know where you’re going, any road will take you there”

Slide 37

Slide 37 text

Key Takeaways
a) Systems are inevitably becoming less ordered, thus
b) need some periodic corrective action to steer the course towards
c) some declared desired state of the system.

Slide 38

Slide 38 text

Slide 39

Slide 39 text

Operators: Encode human-like knowledge
= Automated reconcile loops with “human-like” operational knowledge
Coined in 2016 by Brandon Philips, back then at CoreOS

Slide 40

Slide 40 text

Operators: Encode human-like knowledge
= Automated reconcile loops with “human-like” operational knowledge
Coined in 2016 by Brandon Philips, back then at CoreOS
Delegate “repetitive human activities that are devoid of lasting value”

Slide 41

Slide 41 text

What should an operator do?
- Keep infrastructure in control: continuously minimizing drift between the desired and actual state,
- Resource scalability: codify and automate “repetitive human activities that are devoid of lasting value”, by encoding domain-specific knowledge,
- Monitoring scalability: observe application health, metrics and logs, such that configuration can be adaptively tuned and alerts of any abnormal behavior can be sent seldom but with high importance, and
- Knowledge scalability: provide a high-level abstraction interface such that the application can be operated by engineers without the domain-specific knowledge otherwise required

Slide 42

Slide 42 text

Slide 43

Slide 43 text

Not: Humans Operating Machines

Slide 44

Slide 44 text

Instead: Humans Operating Automation that in turn Operates Machines

Slide 45

Slide 45 text

Slide 46

Slide 46 text

TiDB Operator Capabilities
The tidb-operator provides you with TiDB as a Service in your own cluster.
It offers features such as:
- Multi-Cluster Creation
- Online up- and downgrades
- Online up- and downscaling of replicas, even automatically
- Automatic failover/self-healing
- Dynamic monitoring
- Re-configuration of the database
- Backup and Restore

Slide 47

Slide 47 text

Slide 48

Slide 48 text

Slide 49

Slide 49 text

Operator fulfils the user’s desires (reconcile loop diagram)
Components: Desired State Source, Observe and diff, Act, Target System, Report (Actual State Sink)
1: Desired State
2, 6: Actual State
3: Action Plan
4: Action
5: Result
7: Requeue

Slide 50

Slide 50 text

Operator fulfils the user’s desires (reconcile loop diagram)
Components: Desired State Source, Observe and diff, Act, Target System, Report (Actual State Sink)
1: Desired State
2, 6: Actual State
3: Action Plan
4: Action
5: Result
7: Requeue
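The numbered steps in the diagram can be read as one iteration of a reconcile loop. The sketch below is one possible reading under simplifying assumptions: desired state, target system and status sink are plain dicts, and all function and key names are hypothetical.

```python
from __future__ import annotations


def observe_and_diff(desired: dict, actual: dict) -> list[tuple[str, str, object]]:
    """Steps 1-3: compare desired and actual state, produce an action plan."""
    plan: list[tuple[str, str, object]] = []
    for key, want in desired.items():
        if actual.get(key) != want:
            plan.append(("set", key, want))
    for key in actual:
        if key not in desired:
            plan.append(("delete", key, None))
    return plan


def act(target: dict, plan: list[tuple[str, str, object]]) -> list[str]:
    """Steps 4-5: apply each action to the target system and collect results."""
    results = []
    for op, key, value in plan:
        if op == "set":
            target[key] = value
        else:
            target.pop(key, None)
        results.append(f"{op} {key}")
    return results


def reconcile(desired: dict, target: dict, status: dict, max_rounds: int = 5) -> int:
    """Repeat the loop until nothing is left to do; returns the rounds used."""
    for round_no in range(1, max_rounds + 1):
        plan = observe_and_diff(desired, dict(target))  # steps 1-3
        act(target, plan)                               # steps 4-5
        status["observed"] = dict(target)               # step 6: report actual state
        if not plan:
            return round_no                             # converged, no requeue needed
        # step 7: requeue, i.e. go around the loop once more
    raise RuntimeError("did not converge")


desired = {"tidb.replicas": 3, "version": "v7.1.1"}
target = {"tidb.replicas": 2, "version": "v7.1.0", "stale-setting": True}
status: dict = {}
rounds = reconcile(desired, target, status)
print(rounds, target, status)
```

The second round finds an empty plan, which is how the loop confirms that the previous round's actions actually took effect.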

Slide 51

Slide 51 text

Slide 52

Slide 52 text

Hardened tidb-operator setup
In this demo, we will initially configure 3 cloud VMs for TiDB, 3 cloud VMs for PD, and 3 cloud VMs for TiKV. Further, we will:
1) install the tidb-operator through the CNCF GitOps engine, Flux
2) set up the monitoring stack (Prometheus, Grafana) to watch performance
3) create one TiDBCluster with the operator
4) apply advanced configuration such as topology and upgrade tuning
This demo is running on UpCloud; thanks for donating cloud credits for this cause!
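To give a feel for the desired state behind step 3, here is a rough sketch of a cluster manifest built as a Python dict. The field layout follows the TidbCluster custom resource as commonly documented for TiDB Operator (`pingcap.com/v1alpha1`), but it is an assumption here and should be checked against the installed CRD; the cluster name `demo` and the storage sizes are hypothetical, and the demo's real manifests are not shown in the slides.

```python
from __future__ import annotations

import json


def tidb_cluster(name: str, version: str, replicas: int = 3) -> dict:
    """Build a TidbCluster-style manifest as a plain dict (assumed field layout)."""

    def component(image: str, storage: str | None = None) -> dict:
        spec: dict = {"baseImage": image, "replicas": replicas, "config": {}}
        if storage:
            # Stateful components request persistent storage (size is hypothetical).
            spec["requests"] = {"storage": storage}
        return spec

    return {
        "apiVersion": "pingcap.com/v1alpha1",
        "kind": "TidbCluster",
        "metadata": {"name": name},
        "spec": {
            "version": version,
            "pd": component("pingcap/pd", "10Gi"),
            "tikv": component("pingcap/tikv", "100Gi"),
            "tidb": component("pingcap/tidb"),
        },
    }


manifest = tidb_cluster("demo", "v7.1.0")
print(json.dumps(manifest, indent=2))
```

The point of the sketch is the level of abstraction: three replica counts and one version, rather than per-node installation steps.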

Slide 53

Slide 53 text

Upgrading a cluster with a 60k QPS load
In this demo, we will:
1) bump the version number from v7.1.0 to v7.1.1 using a GitHub Pull Request,
2) ⇒ the operator upgrades the 3*3 TiDB cluster gracefully,
3) while serving 60k requests per second (without any reconnects!),
4) while monitoring TiDB performance
This demo is running on UpCloud; thanks for donating cloud credits for this cause!
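From the human's side, the whole upgrade is a one-field change to the declared state. A minimal sketch of that change, assuming the version lives at `spec.version` of a dict-shaped manifest; the helper names are hypothetical and the Pull Request mechanics are left out.

```python
from __future__ import annotations

import copy


def parse_version(version: str) -> tuple[int, ...]:
    """Turn 'v7.1.0' into (7, 1, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def change_kind(old: str, new: str) -> str:
    """Classify a version change as 'upgrade', 'downgrade' or 'no-op'."""
    a, b = parse_version(old), parse_version(new)
    if b > a:
        return "upgrade"
    if b < a:
        return "downgrade"
    return "no-op"


def bump_version(manifest: dict, new_version: str) -> dict:
    """Return a copy of the manifest with only spec.version changed."""
    updated = copy.deepcopy(manifest)
    updated["spec"]["version"] = new_version
    return updated


current = {"spec": {"version": "v7.1.0", "tidb": {"replicas": 3}}}
proposed = bump_version(current, "v7.1.1")
kind = change_kind(current["spec"]["version"], proposed["spec"]["version"])
print(kind, proposed)
```

Everything after this change (ordering, handoffs, health checks) is the operator's job rather than the person's.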

Slide 54

Slide 54 text

Slide 55

Slide 55 text

Slide 56

Slide 56 text

Slide 57

Slide 57 text

Slide 58

Slide 58 text

Slide 59

Slide 59 text

Slide 60

Slide 60 text

What do we **not** have to do?
- Manual service discovery (for peers, backup and monitoring)
- Manual TLS setup
- Manual scaling
- Manual version upgrades
- Manual re-configuration
- Manual disaster recovery
Real-life footage of a sysadmin not having to run 1002 commands to upgrade the database:

Slide 61

Slide 61 text

Not: Humans Operating Machines

Slide 62

Slide 62 text

Instead: Humans Operating Automation that in turn Operates Machines

Slide 63

Slide 63 text

Slide 64

Slide 64 text

Check out my thesis for more details!
Encoding human-like operational knowledge using declarative Kubernetes operator patterns
Available openly on GitHub: https://github.com/luxas/research
CC-BY-SA 4.0 licensed

Slide 65

Slide 65 text

Slide 66

Slide 66 text