Managing scalable database clusters with the TiDB Operator

© 2023 Lucas Käldström
Managing scalable database clusters with the TiDB Operator
Lucas Käldström – CNCF Ambassador
Mountain View – September 21, 2023

or similarly,
Cloud Native Philosophy: Why Do We Now Design Software the Way We Do?

$ whoami
Lucas Käldström
- 1st-year MSc student at Aalto University, Finland
- CNCF Ambassador, Certified Kubernetes Administrator and Emeritus Kubernetes WG/SIG Lead
- KubeCon Speaker in Berlin, Austin, Copenhagen, Shanghai, Seattle, San Diego & Valencia
- KubeCon Keynote Speaker in Barcelona
- Former Kubernetes approver and subproject owner, active in the OSS community for 7+ years. Worked on e.g. SIG Cluster Lifecycle => kubeadm to GA.
- Former Weaveworks contractor, Weave Ignite & libgitops author
- Cloud Native Nordics co-founder & meetup organizer
- Guild of Automation and Systems Technology corporate relations & CFO

Agenda
- Database Sysadmin Complexities
- Kubernetes Design Architecture
- A Sysadmin’s Best Friend: The Operator
- The TiDB Operator
- Demo Screenshots (not enough time for live demo)

Why are we here?
We want a database for both transaction processing and analytical processing

What does this require?
- Failure Tolerance and Capacity Demand => Multiple Replicas
- Multiple Replicas => Consistency Control (Paxos / Raft)
- Capacity Demands => Sharding
- and much more!
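The link between replica count and failure tolerance can be made concrete. A minimal sketch, assuming a majority-quorum protocol such as Raft; the function names are illustrative and do not come from the talk:

```python
def quorum_size(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Replicas that may fail while a majority can still be reached."""
    return replicas - quorum_size(replicas)


for n in (1, 3, 4, 5):
    print(f"replicas={n} quorum={quorum_size(n)} tolerated={tolerated_failures(n)}")
```

Going from 3 to 4 replicas does not raise the number of tolerated failures, which is one reason odd group sizes are the usual choice.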

What does this mean?
- Multiple Nodes => Need scheduling logic
- Consensus Algorithms => We need to take care when:
  - Scaling: Need some kind of “learner mode”
  - Upgrading: Avoid killing the consensus leader; give a proper handoff first
- Sharding => Nodes hold varying sets of data; one node doesn’t necessarily have all the data
- Quickly-changing business requirements => Lots of sysadmin work
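The upgrade caveat above (avoid killing the consensus leader; hand off first) can be sketched as an ordering rule. This is a toy model under stated assumptions, not the TiDB Operator's implementation: `Member` and `rolling_upgrade` are hypothetical names, and a real system would also wait for each member to become healthy before moving on.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Member:
    name: str
    version: str
    is_leader: bool = False


def rolling_upgrade(members: list[Member], target: str) -> list[str]:
    """Upgrade all members to `target`, followers first, leader last.

    Leadership is handed to another member before the leader is restarted.
    Returns the steps taken, in order.
    """
    log: list[str] = []
    while True:
        pending = [m for m in members if m.version != target]
        if not pending:
            return log
        # Prefer a follower; fall back to the leader only when it is all that is left.
        member = next((m for m in pending if not m.is_leader), pending[0])
        if member.is_leader:
            successor = next((m for m in members if m is not member), None)
            if successor is not None:  # a single-member group has nobody to hand off to
                member.is_leader, successor.is_leader = False, True
                log.append(f"transfer leadership {member.name} -> {successor.name}")
        member.version = target
        log.append(f"upgrade {member.name} to {target}")


group = [
    Member("pd-0", "v7.1.0", is_leader=True),
    Member("pd-1", "v7.1.0"),
    Member("pd-2", "v7.1.0"),
]
steps = rolling_upgrade(group, "v7.1.1")
for step in steps:
    print(step)
```

The leader is touched last, and only after another, already upgraded member has taken over.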

⇒ Every non-trivial system requires non-trivial operations

Required sysadmin work grows faster than scale
[Figure labels: Business scaling requirement, Sysadmin work]

We want to find a generic solution

Kubernetes Primer
- Kubernetes is an open source container orchestration system.
- A project to solve the sysadmin operational challenges of app orchestration
- Already a decade old (!), the founding project of CNCF, 80000+ contributors
- Runs in all environments, from your own DC to the cloud (even on Raspberry Pis!)
- Super extensible system; you can configure literally everything

Kubernetes Architecture
[Diagram: a stateful Raft key-value store as the single source of truth; a stateless, declarative and extensible REST API; stateless controllers that reconcile against the API (these controllers “make stuff happen”); and the Nodes]

Kubernetes: A Control Plane for (any) infrastructure

A control plane for (any) infrastructure = a set of automated controllers with operational knowledge of how to control a target system

Around 45 (!) of them in Kubernetes v1.28

“I know how to efficiently schedule workloads to nodes”
“I know how to heal applications that were on failed nodes”
“I know how to configure dynamic service discovery”


“Control Through Choreography”
All user intent is stored in the API server. Business logic is split into controllers that make user intent a reality.
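A toy illustration of choreography: two controllers never call each other and only read and write a shared store, yet together they converge on the intended outcome. This is a sketch under simplifying assumptions, not how Kubernetes implements scheduling or healing; the `store` layout and both controller functions are hypothetical.

```python
# Shared state, standing in for the API server.
store = {
    "nodes": {"node-a": "Ready", "node-b": "Failed"},
    "pods": {"web-0": "node-b", "web-1": None},  # pod name -> assigned node
}


def healer(store: dict) -> bool:
    """Unassign pods from nodes that are not Ready. Returns True if it changed anything."""
    changed = False
    for pod, node in store["pods"].items():
        if node is not None and store["nodes"].get(node) != "Ready":
            store["pods"][pod] = None
            changed = True
    return changed


def scheduler(store: dict) -> bool:
    """Assign unassigned pods to a Ready node. Returns True if it changed anything."""
    ready = sorted(n for n, status in store["nodes"].items() if status == "Ready")
    changed = False
    for pod, node in store["pods"].items():
        if node is None and ready:
            store["pods"][pod] = ready[0]
            changed = True
    return changed


# Run every controller each round until none of them has anything left to do.
controllers = [scheduler, healer]
while any([controller(store) for controller in controllers]):
    pass

print(store["pods"])
```

Neither controller knows the other exists; the healer frees `web-0`, and the scheduler picks it up on its next pass.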

Chaos is Inevitable

Google Finding: “Failure is the Norm”

“deliberately leave significant headroom for workload growth, occasional ‘black swan’ events, load spikes, machine failures, hardware upgrades, and large-scale partial failures (e.g., a power supply bus duct)”
Source: (Verma et al., 2015)

Entropy: Systems become less ordered
[Figure: Entropy plotted against Time (Start to Stop), ranging from Order to Chaos]

Entropy: Putting order to chaos
[Figure: Entropy plotted against Time (Start to Stop), ranging from Order to Chaos, with a “Reversing, ordering process”]

Kubernetes: The dishwasher of servers
[Figure: Entropy plotted against Time (Start to Stop), ranging from Order to Chaos, with a “Reversing, ordering process”]

“If you don’t know where you’re going, any road will take you there”

Key Takeaways
a) Systems inevitably become less ordered, thus
b) they need some periodic corrective action to steer the course towards
c) some declared desired state of the system.

A sysadmin’s best friend, the operator

Operators: Encode human-like knowledge
= Automated reconcile loops with “human-like” operational knowledge
Coined in 2016 by Brandon Philips, back then at CoreOS

Operators delegate “repetitive human activities that are devoid of lasting value”

What should an operator do?
- Keep infrastructure in control: continuously minimize drift between the desired and actual state,
- Resource scalability: codify and automate “repetitive human activities that are devoid of lasting value” by encoding domain-specific knowledge,
- Monitoring scalability: observe application health, metrics and logs, such that configuration can be adaptively tuned and alerts about abnormal behavior are sent seldom but with high importance, and
- Knowledge scalability: provide a high-level abstraction interface such that the application can be operated by engineers without the domain-specific knowledge otherwise required

Not: Humans Operating Machines

Instead: Humans Operating Automation that in turn Operates Machines

TiDB Operator Capabilities
The tidb-operator provides you with TiDB as a Service in your own cluster. It offers features such as:
- Multi-Cluster Creation
- Online up- and downgrades
- Online up- and downscaling of replicas, even automatically
- Automatic failover/self-healing
- Dynamic monitoring
- Re-configuration of the database
- Backup and Restore

Operator fulfils the user’s desires
[Diagram: a reconcile loop connecting a Desired State Source, an “Observe and diff” stage, an “Act” stage, the Target System, and a Report (Actual State Sink). Numbered steps: 1: Desired State; 2: Actual State; 3: Action Plan; 4: Action; 5: Result; (6): Actual State; 7: Requeue]
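One pass through the numbered steps of the loop can be sketched in a few lines. This is a minimal sketch, not the tidb-operator's code: the mapping of step numbers to lines is my reading of the diagram labels, the function and variable names are hypothetical, and requeueing whenever something was changed is an assumption.

```python
def diff(desired: dict, actual: dict) -> dict:
    """Action plan: every desired key whose actual value differs."""
    return {key: value for key, value in desired.items() if actual.get(key) != value}


def reconcile(desired_source, target: dict, status_sink: list) -> bool:
    """Run one pass of the loop; return True if another pass should be queued."""
    desired = desired_source()              # 1: desired state
    actual = dict(target)                   # 2: actual state
    plan = diff(desired, actual)            # 3: action plan
    for key, value in plan.items():
        target[key] = value                 # 4: action
    status_sink.append({                    # 5: result, (6): actual state
        "applied": plan,
        "observed": dict(target),
    })
    return bool(plan)                       # 7: requeue if anything changed


target_system = {"replicas": 3, "version": "v7.1.0"}
reports: list = []


def desired_state() -> dict:
    return {"replicas": 3, "version": "v7.1.1"}


# Keep reconciling until a pass finds no drift.
while reconcile(desired_state, target_system, reports):
    pass

print(target_system)
print(len(reports))
```

The first pass applies the version change and asks for a requeue; the second pass observes no drift, reports it, and the loop goes quiet until the desired state changes again.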

Hardened tidb-operator setup
In this demo, we will initially configure 3 cloud VMs for TiDB, 3 cloud VMs for PD, and 3 cloud VMs for TiKV. Further, we will
1) install the tidb-operator through the CNCF GitOps engine, Flux
2) set up the monitoring stack (Prometheus, Grafana) to watch performance
3) create one TiDBCluster with the operator
4) apply advanced configuration such as topology and upgrade tuning
This demo runs on UpCloud; thanks for donating cloud credits for this cause!

Upgrading a cluster with a 60k QPS load
In this demo, we will:
1) bump the version number from v7.1.0 to v7.1.1 using a GitHub Pull Request,
2) ⇒ the operator upgrades the 3*3-TiDB cluster gracefully,
3) while serving 60k requests per second (without any reconnects!),
4) while monitoring TiDB performance
This demo runs on UpCloud; thanks for donating cloud credits for this cause!
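From the user's side, the upgrade is a one-field change to the declared desired state. The sketch below models the manifest as a Python dict to show this. The field names (`apiVersion: pingcap.com/v1alpha1`, `kind: TidbCluster`, `spec.version`, per-component `replicas`) follow my recollection of the operator's custom resource and should be checked against its documentation. The manifest is deliberately incomplete, and the cluster name `demo` is hypothetical.

```python
import copy


def tidb_cluster(name: str, version: str, replicas: int = 3) -> dict:
    """Trimmed-down desired state for a 3*3 cluster (assumed field names)."""
    return {
        "apiVersion": "pingcap.com/v1alpha1",
        "kind": "TidbCluster",
        "metadata": {"name": name},
        "spec": {
            "version": version,
            "pd": {"replicas": replicas},
            "tikv": {"replicas": replicas},
            "tidb": {"replicas": replicas},
        },
    }


def changed_paths(old: dict, new: dict, prefix: str = "") -> list:
    """Dotted paths of all leaf values that differ between two nested dicts."""
    paths = []
    for key in sorted(set(old) | set(new)):
        path = f"{prefix}.{key}" if prefix else key
        a, b = old.get(key), new.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            paths += changed_paths(a, b, path)
        elif a != b:
            paths.append(path)
    return paths


before = tidb_cluster("demo", "v7.1.0")
after = copy.deepcopy(before)
after["spec"]["version"] = "v7.1.1"   # the whole content of the pull request

print(changed_paths(before, after))   # ['spec.version']
```

Everything else, such as ordering, leader handoff and health checks, is left to the operator.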

What do we **not** have to do?
- Manual service discovery (for peers, backup and monitoring)
- Manual TLS setup
- Manual scaling
- Manual version upgrades
- Manual re-configuration
- Manual disaster recovery
[Image: real-life footage of a sysadmin not having to run 1002 commands to upgrade the database]

Not: Humans Operating Machines

Instead: Humans Operating Automation that in turn Operates Machines

Check out my thesis for more details!
Encoding human-like operational knowledge using declarative Kubernetes operator patterns
Available openly on GitHub: https://github.com/luxas/research
CC-BY-SA 4.0 licensed

Control Theory
I have another talk on control theory + declarative APIs = Kubernetes.
Also check out Vallery Lancey’s great talk on the subject (Vallery Lancey, QCon, 2018).
