Slide 1

Slide 1 text

Exploring the Latest TiDB Features and Future Roadmap - Part I

Slide 2

Slide 2 text

Introduction. Ed Huang, Co-founder & CTO, PingCAP. A graduate of Northeastern University, Ed worked at Microsoft Research Asia and Wandou Labs. He is a senior infrastructure software engineer and system architect, an expert in distributed systems and database development, with rich experience and a unique understanding of distributed storage. Ed is the co-author of Codis, a widely used distributed Redis solution, and TiDB, a distributed HTAP database. Projects: https://github.com/pingcap/tidb https://github.com/pingcap/tikv https://github.com/CodisLabs/codis

Slide 3

Slide 3 text

A little bit of history

Slide 4

Slide 4 text

We don't want to shard 🤕😥🥱 ● Hundreds of TB ● Resharding / rebalancing (at midnight) ● Keeping strong consistency when a MySQL node is lost ● Query conditions that don't include the sharding key ● Explaining to angry application developers who don't understand why they can't do things the way they used to ● …

Slide 5

Slide 5 text

A database for modern data-intensive applications. We decided to build a database from scratch: Strong Consistency & Cross-Row Transactions, Scalable, Highly Available, 100% OLTP + Ad-Hoc OLAP, Cloud-Native, Easy to use with Full-featured SQL, Open Source!

Slide 6

Slide 6 text

A Scalable OLTP database

Slide 7

Slide 7 text

1,000 ft Architecture Overview of TiDB Everything distributed!

Slide 8

Slide 8 text

Then we brought it to life and named it TiDB. The architecture during 2017-2018: ● Elastic scaling and distributed ACID transaction processing ● Strong consistency across nodes ● Built-in HA

Slide 9

Slide 9 text

From OLTP to HTAP

Slide 10

Slide 10 text

If you've got a scalable OLTP database with SQL support, what's your next move? Why not OLAP queries? One TiDB cluster takes the OLTP workload (Insert / Update / Delete / …) alongside OLAP queries (Join / Group By / …).
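To make the mix concrete, here is a minimal SQL sketch of both workload shapes hitting the same cluster; the orders table and its columns are hypothetical:

-- OLTP: small, latency-sensitive writes
INSERT INTO orders (id, user_id, amount, created_at) VALUES (1001, 42, 99.90, NOW());
UPDATE orders SET amount = 89.90 WHERE id = 1001;

-- OLAP: an ad-hoc aggregation over the same, always-fresh rows
SELECT user_id, COUNT(*) AS order_cnt, SUM(amount) AS total
FROM orders
WHERE created_at >= NOW() - INTERVAL 7 DAY
GROUP BY user_id
ORDER BY total DESC
LIMIT 10;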

Slide 11

Slide 11 text

HTAP hit the road in 2020-2021. TiFlash is the columnar extension for TiDB: TiDB + TiFlash isolates OLTP and OLAP workloads and processes fresh data in real time.
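As a rough sketch of how a table is exposed to TiFlash (the orders table is hypothetical; the statements follow TiDB's documented syntax), you add a columnar replica and check its replication status:

-- Add one TiFlash (columnar) replica for the table
ALTER TABLE orders SET TIFLASH REPLICA 1;

-- Check replica availability and progress
SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica
WHERE TABLE_NAME = 'orders';

The optimizer can then route analytical scans to TiFlash while OLTP traffic keeps hitting the row store.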

Slide 12

Slide 12 text

We kept the evolution momentum on Cloud

Slide 13

Slide 13 text

Why TiDB Cloud ● Operation and maintenance costs can't be ignored ● Compliance and security are a must for your data platform, and they are hard to get right if you build your own ● The trend towards multi-cloud is irreversible ● A fully managed database brings more value, letting customers focus on their business instead of database technical details

Slide 14

Slide 14 text

Horizontal Scaling Strong Resilience Geo-Distribution Easy Upgrades Real-Time Analytics Industry-Leading Security Monitoring 24/7 Support TiDB Cloud

Slide 15

Slide 15 text

TiDB Cloud Architecture (diagram): Dashboard UI; within a cloud provider region, 🔒 Your VPC (service consumer, Subnet A 10.1.10.0/24, Subnet B 10.1.20.0/24) connects to the 🔒 TiDB VPC (service provider) hosting Cluster-1 and Cluster-2 across AZ 1/2/3 (TiDB and TiKV nodes), plus Operator, Logging, Monitoring, and Gateway; central services include the Control Plane, Billing, Alerting, Meta Store, Chat2Query, and Data API.

Slide 16

Slide 16 text

Next big thing for TiDB Cloud - Serverless

Slide 17

Slide 17 text

What kind of database will be future-proof for the age of AI? ● Elastic scalability: >PB level ● Simple and intuitive: user friendly ● Consolidated: an all-in-one solution ● Always on and always fresh: cloud-native ● Open ecosystem: OpenAPI support and an open source ecosystem ● Low cost: pay-as-you-go, with scale to zero

Slide 18

Slide 18 text

Serverless TiDB ● Launch without worrying about deployment details (no configuration!) ○ and blazing fast: what about ~XX or even ~X secs? ● Elastic: scales as the workload scales ○ Scale to zero, resume time ~X secs ● Pay-as-you-go at a finer granularity ○ Charged by workload metrics (RCU / WCU, Read / Write Capacity Units) ○ Especially useful for OLAP workloads ● Better, deeper integration with modern application development processes

Slide 19

Slide 19 text

TiDB Cloud Serverless - how it works ● Complete compute-storage separation ● Multi-tenancy by nature ● Second-level launch time ● Pay-as-you-go. Architecture (diagram): shared gateways front an isolated SQL layer of per-tenant virtual clusters (Tenant 1 … Tenant n, each with TiDB nodes and MPP workers drawn from an on-demand pool); beneath them sit a shared storage layer (row engines, columnar engines, and a shared storage pool on S3) and shared services (compaction, analyze, DDL, remote coprocessor); the same layout repeats in other regions.

Slide 20

Slide 20 text

TiDB Serverless already has over 10,000 active clusters. It went GA yesterday 🎉

Slide 21

Slide 21 text

Serverless TiDB

Slide 22

Slide 22 text

Build a Scalable Real-Time GitHub Insight WebApp in 1 Week: the OSSInsight Story. ~6 billion GitHub events (and growing), answering questions like: ● What country is your favorite open source project most popular in? ● What's the most popular JS framework? ● Which companies contribute most to open source databases? ● …

Slide 23

Slide 23 text

Build a Scalable Real-Time GitHub Insight WebApp in 1 Week: the OSSInsight Story. THE "best practices" for data-intensive apps 😒 TiDB Serverless just makes your life easier. Huge data volume: ~12TB, 6 billion rows, growing in real time. Mixed workloads: serving online transactions plus analytical queries. Short deadline: we were asked to make it happen in 1 week! Unpredictable traffic spikes: featured on Hacker News with 7x traffic in one day. Visit us: https://ossinsight.io/ Source code: https://github.com/pingcap/ossinsight
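For flavor, a simplified sketch of the kind of analytical query behind those questions; the github_events table and its columns are modeled loosely on OSSInsight's event data and should be read as illustrative:

-- Which repos gained the most stars in the last 30 days?
SELECT repo_name, COUNT(*) AS stars
FROM github_events
WHERE type = 'WatchEvent'
  AND created_at >= NOW() - INTERVAL 30 DAY
GROUP BY repo_name
ORDER BY stars DESC
LIMIT 10;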

Slide 24

Slide 24 text

All in TiDB Serverless now, with 70% cost savings. Automatic scaling up and down handles unpredictable traffic, with no need for manual intervention. (Build a Scalable Real-Time GitHub Insight WebApp in 1 Week: the OSSInsight Story)

Slide 25

Slide 25 text

Offerings: TiDB Self-hosted, TiDB Cloud Dedicated, TiDB Cloud Serverless. Capabilities across the offerings: Elastic Scalability, Hybrid workload support, ACID Transactions, Full-featured SQL, Multi-tenancy, Everything under control, Fully managed, Enterprise-level Security, Multi-cloud platform, Auto-Failover, No/little manual deployment, Programmatic management via OpenAPI, Automated scaling / billing based on real-time consumption, Instant starts / scale to zero.

Slide 26

Slide 26 text

Vision: TiDB Self-hosted + HTAP + Serverless = Frictionless Developer Experience. Elastic Scalability, Auto-Failover, ACID Transactions, Full-featured SQL, Fully managed / Multi-tenancy, … Real-time, high-performance OLAP while consistently processing OLTP, Hybrid workload support, No/little manual deployment, Programmatic management via OpenAPI, Automated scaling / billing based on real-time consumption, Instant starts / scale to zero.

Slide 27

Slide 27 text

Exploring the Latest TiDB Features and Future Roadmap - Part II

Slide 28

Slide 28 text

Introduction Sunny Bains Senior Architect, PingCAP ● Working on database internals since 2001. ● Was MySQL/InnoDB team lead at Oracle.

Slide 29

Slide 29 text

TiDB's unique value - Easy to set up and start - MySQL compatible - Scalable by design - Versatile - Reliable - Open source

Slide 30

Slide 30 text

Agenda Partitioned Raft KV Turbocharge TiKV with Unparalleled Performance and Scalability Online DDL Enhancing Database Agility with Lightning-Fast Schema Changes Resource Control Empowering Consolidated Workloads with Precision Resource Allocation 01 03 02

Slide 31

Slide 31 text

Reference Architecture • OLTP and OLAP storage • Raft for consensus • Data consistency • 96MB fixed size region • Fault tolerance ○ Across Availability Zones
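If you want to check the region-size settings yourself, a hedged example (the config item names are taken from TiKV's coprocessor section and may differ between versions):

-- Inspect TiKV's region split/max size configuration from SQL
SHOW CONFIG WHERE type = 'tikv' AND name LIKE 'coprocessor.region%';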

Slide 32

Slide 32 text

Region: Logical vs Physical. A Region is TiKV's logical scaling unit ● Operations such as load balancing, scale-out, and scale-in are all region-based ● User traffic is also partitioned by region. Physically, there is no clear boundary for a region ● We have to scan all the SSTs to prepare the snapshot for a region (read amplification) ● A few hot region writes can easily trigger compaction in RocksDB (write amplification) ● Large TiKV instances magnify both read and write amplification ● A few hot regions can impact the performance of all regions within the same RocksDB instance
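A hedged way to look at the logical layout from SQL (the orders table is hypothetical):

-- List the regions a table is split into, and which stores/leaders hold them
SHOW TABLE orders REGIONS;

-- Regions can also be pre-split explicitly, e.g. into 16 ranges over the key space
SPLIT TABLE orders BETWEEN (0) AND (1000000) REGIONS 16;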

Slide 33

Slide 33 text

Problems with a single RocksDB instance. RocksDB does not perform well on (very) large datasets: ● Write amplification (LSM compaction) ● Latency jitter ● Lacks I/O isolation ● Write throughput is limited. Scaling is slow, both in and out. With too many regions: ● Wasted resources (e.g. heartbeats) ● High pressure on the Placement Driver (PD)

Slide 34

Slide 34 text

Partitioned Raft KV solution (diagram): in the single-RocksDB design, a TiKV instance has one store whose RocksDB holds Region 1, Region 2, …; in the multi-RocksDB design, each region gets its own tablet (Tablet Region 1, Tablet Region 2, …) beneath the KV API, Coprocessor API, Transaction, and Raft layers.

Slide 35

Slide 35 text

Multi-Rocks solution ● Unifies logical and physical partitioning ● Region size is now variable and can range up to 10GB+ ● The hotter the region, the smaller its size ● No WAL in the region's RocksDB; the raft log is used instead ● No region metadata CF in RocksDB; metadata moved to Raft Engine ● No change in SST data format, so it stays compatible with older versions

Slide 36

Slide 36 text

Benefits of multi-RocksDB ● Larger region size reduces heartbeat cost by 99% ● Better QoS per region ● Removes the bottleneck in the apply stage ● Super fast and cheap snapshot generation and apply ● More predictable throughput over time ● Natural hot/cold data separation ● Can potentially offload cold data to cheaper storage such as S3 ● Physical isolation between tables/databases/users

Slide 37

Slide 37 text

Multi-Rocks benefit: Sysbench on AWS M5.2xlarge
Workload | Multi-RocksDB | Single-RocksDB | Throughput ratio
Prepare | 30 MB/s | 7.5 MB/s | 380%
Insert | QPS: 21K, P95: 8.43ms | QPS: 16K, P95: 13.46ms | 131%
Point-Select | QPS: 75.7K, P95: 3.96ms | QPS: 77K, P95: 4.1ms | 98.3%
Read-Write | QPS: 34.3K, avg: 74ms, P95: 123ms | QPS: 36K, P95: 121ms | 95.4%
Read-Only | QPS: 40K, P95: 87ms | QPS: 40.4K, P95: 86ms | 99%

Slide 38

Slide 38 text

Multi-Rocks benefit: Sysbench on AWS M5.2xlarge (1TB dataset, block-size=16KB)
Workload | Multi-RocksDB | Single-RocksDB¹ | Throughput ratio
Insert | QPS: 21K, P95: 7.98ms | QPS: 14.5K, P95: 12ms | 144.8%
Point-Select | QPS: 76K, P95: 3.89ms | QPS: 27.4K, P95: 20.37ms | 277%
Read-Write | QPS: 34.1K, avg: 75ms, P95: 132ms | QPS: 20.2K, avg: 186ms, P95: 253ms | 168%
Read-Only | QPS: 38K, P95: 142ms | QPS: 23.1K, P95: 200ms | 164%
¹ After compaction

Slide 39

Slide 39 text

Agenda Partitioned Raft KV Turbocharge TiKV with Unparalleled Performance and Scalability Online DDL Enhancing Database Agility with Lightning-Fast Schema Changes Resource Control Empowering Consolidated Workloads with Precision Resource Allocation 01 03 02

Slide 40

Slide 40 text

Why TiDB Resource Control: Flow control (resource quotas), Schedule control (job priority), Resource segregation (CPU / IO), Fine-grained resource abstraction (RU / RG), Usage tracking (tune / apportion).

Slide 41

Slide 41 text

TiDB Resource Control. When there are multiple apps/databases: cost increases, they are hard to maintain, and cross-database joins are hard. Consolidate? Then workloads interfere with each other's QoS, and one change can turn into a disaster.

Slide 42

Slide 42 text

A typical microservice architecture: Database per service

Slide 43

Slide 43 text

What is Resource Control? Manage multiple workloads in a TiDB cluster: isolate, manage, and schedule access to resources that share the same cluster. Workloads include OLTP (short queries, small updates, …), OLAP (large batches, ad-hoc queries, …), and maintenance jobs (backup, automatic tasks, …), possibly from different apps/users (App1/User1, App2/User2, App3/User3). Examples: xx workloads use too many resources and this impacts the P99 latency of small queries; limit the resources allocated to app xxx / user xxx; allocate more resources to higher-priority apps/users when the system is overloaded.

Slide 44

Slide 44 text

What is a Resource Group? A resource group is a logical container for managing CPU and I/O. There are 3 important options for each resource group: RU_PER_SEC: the rate of RU backfilling per second; must be specified when creating a resource group. PRIORITY: the absolute priority of tasks to be processed on TiKV; the default value is MEDIUM. BURSTABLE: if set, the group may use available free system resources even when its quota is exceeded.
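Putting the three options together, a hedged example (the group name is arbitrary):

CREATE RESOURCE GROUP IF NOT EXISTS rg_oltp
  RU_PER_SEC = 2000   -- RU backfill rate per second
  PRIORITY = HIGH     -- absolute priority on TiKV (default MEDIUM)
  BURSTABLE;          -- may borrow free capacity beyond its quota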

Slide 45

Slide 45 text

Request Unit (RU) and Scheduling. A Request Unit (RU) is an abstract unit for measuring system resource usage. TiDB uses mClock, a weight- and constraint-based scheduler: "...constraint-based scheduler ensures that [tasks] receive at least their minimum reserved service and no more than the upper limit in a time interval, while the weight-based scheduler allocates the remaining throughput to achieve proportional sharing."

Slide 46

Slide 46 text

Evaluate system capacity
● Estimate capacity based on hardware deployment and standard workloads
● Estimate capacity based on actual workloads

CALIBRATE RESOURCE;
+--------+
| QUOTA  |
+--------+
| 190470 |
+--------+
1 row in set (0.01 sec)

CALIBRATE RESOURCE WORKLOAD OLTP_WRITE_ONLY;
+-------+
| QUOTA |
+-------+
| 27444 |
+-------+
1 row in set (0.01 sec)

CALIBRATE RESOURCE START_TIME '2023-04-18 08:00:00' DURATION '20m';
+-------+
| QUOTA |
+-------+
| 27969 |
+-------+
1 row in set (0.01 sec)

CALIBRATE RESOURCE START_TIME '2023-04-18 08:00:00' END_TIME '2023-04-18 08:20:00';
+-------+
| QUOTA |
+-------+
| 27969 |
+-------+
1 row in set (0.01 sec)

Slide 47

Slide 47 text

Manage resource groups

Create a resource group:
CREATE RESOURCE GROUP IF NOT EXISTS rg1 RU_PER_SEC = 1000 BURSTABLE;

Alter a resource group:
ALTER RESOURCE GROUP rg1 RU_PER_SEC = 20000 PRIORITY = HIGH;

Drop a resource group:
DROP RESOURCE GROUP rg1;

Query resource group(s):
SHOW CREATE RESOURCE GROUP rg1;
SELECT * FROM information_schema.resource_groups WHERE NAME = 'rg1';

Slide 48

Slide 48 text

Bind resource groups (Statement (hint) level > Session level > User level)

User-level mapping:
CREATE USER 'user1'@'%' RESOURCE GROUP rg1;
ALTER USER 'user1' RESOURCE GROUP rg2;
SELECT User, User_attributes FROM mysql.user WHERE User = 'user1';

Session-level mapping:
SET RESOURCE GROUP rg1;
SELECT current_resource_group();

Statement-level mapping, via the hint /*+ resource_group( ${GROUP_NAME} ) */:
SELECT /*+ resource_group(rg1) */ * FROM t1;
INSERT /*+ resource_group(rg2) */ INTO t2 VALUES(2);

Slide 49

Slide 49 text

Resource Control Architecture (diagram: each TiDB node runs a resource meter and a token bucket client with a Local Admission Controller; PD runs token bucket servers with the Global Admission Controller; TiKV nodes run scheduling control; token bucket requests flow between TiDB and PD, while KV/DistSQL requests carry group resource information to TiKV for mClock-based scheduling).
Admission Control Layer ● Quota limits by Request Unit ● GAC ○ Maintains global token buckets ● LAC ○ Measures resources used by TiKV and TiDB (CPU + IO -> RU -> tokens) and consumes tokens allocated by the GAC.
Scheduling Control Layer ● Enhanced mClock-based scheduling ● Weight inputs ○ RU quota defined in resource groups ○ Priority defined in resource groups ○ BURSTABLE

Slide 50

Slide 50 text

Agenda Partitioned Raft KV Turbocharge TiKV with Unparalleled Performance and Scalability Online DDL Enhancing Database Agility with Lightning-Fast Schema Changes Resource Control Empowering Consolidated Workloads with Precision Resource Allocation 01 03 02

Slide 51

Slide 51 text

MySQL solves DDL with MDL (MDL = Meta Data Lock). The table is locked for all sessions while the metadata (data dictionary) is being changed. For non-instant DDL, e.g. ADD INDEX, the metadata change still needs to block! ● It's a single-instance/single-writer model ● It causes problems with replication ● Each replica asynchronously runs the DDL under its own MDL ● And if it's not an instant DDL, it makes replication lag worse
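For reference, a hedged sketch of the MySQL-side statement being discussed (table and index names are made up); even with the online algorithm, committing the metadata change still needs a brief exclusive MDL:

-- MySQL online DDL: builds the index without blocking DML,
-- but the metadata switch still serializes behind the MDL
ALTER TABLE t1 ADD INDEX idx_k (k), ALGORITHM=INPLACE, LOCK=NONE;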

Slide 52

Slide 52 text

Is a distributed database different? Client connections see and act on the same data. Issues to solve (ADD INDEX as an example): ● No synchronous update of metadata/schemas for all cluster nodes. ● Need to create index entries for all existing rows in the table. ● Need to update entries for concurrent user changes.

Slide 53

Slide 53 text

The Solution Version all schemas. Allow sessions to use current or the previous schema version. Use sub-state transitions, so that version N-1 is compatible with version N. Create states that will allow the full transition from state ‘None/Start’ to state ‘Public’.
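In TiDB you can watch a job walk through these sub-states; a hedged example (table and index names are made up):

-- Start an online index build in one session
ALTER TABLE t1 ADD INDEX idx_k (k);

-- In another session, observe the schema state transitions
-- (delete only -> write only -> write reorganization -> public)
ADMIN SHOW DDL JOBS;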

Slide 54

Slide 54 text

Performance Test on Cloud. Environment: TiDB Cloud on AWS ● TiDB: (16c + 32G) * 2 ● TiKV: (16c + 64G + 1T) * 9. Sysbench tables with 100M, 1B, and 5B rows; adding an index on k (INT) ● tidb_ddl_reorg_batch_size = 1024 ● tidb_ddl_reorg_worker_cnt = 8. Timings with no concurrent DML load:
Rows | Old Time | New Time
100 Million | 11 min 20 sec | 1 min 18 sec
1 Billion | 2 h 33 min 1 sec | 13 min 42 sec
5 Billion | 10 h 27 min 43 sec | 1 h 5 min 5 sec
(Chart comparing TiDB v6.1.0, Aurora, Cockroach, and TiDB v6.5.0.)
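A hedged example of the two backfill knobs used in this test; larger values trade more cluster resources for faster index creation:

-- Tune the online ADD INDEX backfill
SET GLOBAL tidb_ddl_reorg_batch_size = 1024;  -- rows copied per batch
SET GLOBAL tidb_ddl_reorg_worker_cnt = 8;     -- concurrent backfill workers

-- Verify the current settings
SHOW GLOBAL VARIABLES LIKE 'tidb_ddl_reorg%';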

Slide 55

Slide 55 text

Product Roadmap Fall/Winter ‘23 Spring ‘24 Background job resource control Index usage view Query plan stability enhancements New optimizer framework Distributed background task framework Plan cache enhancements MySQL 8.0 SQL blocklist Multi-rocks GA (PB scale) Physical incremental backup TiCDC integration w/ big data systems TiCDC plugable sink January

Slide 56

Slide 56 text

THANK YOU.