
TiDB, MySQL, and the Future (TiDBとMySQLと未来) - PingCAP株式会社 - TiDB User Day 2023

Event date: July 7, 2023
Speakers: Ed Huang, Co-founder and CTO, PingCAP Inc.
          Sunny Bains, Senior Architect, PingCAP Inc.

In these slides, Ed Huang, co-founder and CTO, and Sunny Bains, Senior Architect, a former MySQL/InnoDB developer and development lead for HeatWave, visiting Japan from PingCAP's US headquarters, deliver the latest on TiDB. In Ed's part, he covers database industry trends overseas and in the US, PingCAP's vision and strategy, the latest developments in TiDB, and international case studies. In Sunny's part, he shares the potential of NewSQL from a MySQL expert's perspective and explains TiDB's latest features in more detail, including a new storage engine that improves QPS and supports low latency, resource control that guarantees resources, and the status of MySQL 8.0 compatibility.

Archive videos:
[Part I: Ed Huang]: https://youtu.be/KCbDLIUOclI
[Part II: Sunny Bains]: https://youtu.be/O-pfx5SHigw

PingCAP-Japan

July 11, 2023

Transcript

  1. Introduction — Ed Huang, Co-founder and CTO, PingCAP. A graduate of Northeastern University, Ed worked for Microsoft Research Asia and Wandou Labs. He is a senior infrastructure software engineer and system architect, an expert in distributed systems and database development, with rich experience and a unique understanding of distributed storage. Ed is the co-author of Codis, a widely used distributed Redis solution, and TiDB, a distributed HTAP database. Projects: https://github.com/pingcap/tidb https://github.com/pingcap/tikv https://github.com/CodisLabs/codis

  2. We don't want to shard 🤕😥🥱 • Hundreds of TB • Resharding / rebalancing (at midnight) • Keeping strong consistency when a MySQL node is lost • Query conditions that do not include the sharding key • Explaining to angry application developers who don't understand why they can't do things like in the old days • …

  3. A database for modern data-intensive applications — we decided to build a database from scratch: strong consistency & cross-row transactions, scalable, highly available, 100% OLTP + ad-hoc OLAP, cloud-native, easy to use with full-featured SQL, open source!

  4. Then we brought it to life and named it TiDB. The architecture during 2017-2018: • Elastic scaling, distributed ACID transaction processing • Strong consistency across nodes • Built-in HA

  5. TiDB Cluster: OLTP workload (Insert / Update / Delete / …) and OLAP queries (Join / Group By / …). If you have a scalable OLTP database with SQL support, what's your next move? Why not OLAP queries?

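As a minimal sketch of the two shapes of traffic (the orders/users schema below is a placeholder, not from the deck):

    -- OLTP: short point writes
    INSERT INTO orders (id, user_id, amount) VALUES (1001, 42, 99.90);
    UPDATE orders SET amount = 109.90 WHERE id = 1001;
    -- Ad-hoc OLAP: scan-heavy join and aggregation over the same data
    SELECT u.country, SUM(o.amount) AS revenue
    FROM orders o JOIN users u ON o.user_id = u.id
    GROUP BY u.country
    ORDER BY revenue DESC;
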
  6. HTAP hit the road in 2020-2021. TiDB + TiFlash isolates OLTP and OLAP workloads and processes fresh data in real time. TiFlash: the columnar extension for TiDB.

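As a small sketch (the table name is a placeholder), adding a TiFlash replica to an existing table and checking its replication status looks like this in TiDB:

    ALTER TABLE orders SET TIFLASH REPLICA 2;
    SELECT TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
    FROM information_schema.tiflash_replica
    WHERE TABLE_NAME = 'orders';

Once AVAILABLE reports 1, the optimizer can route analytical queries to the TiFlash replica while transactional traffic keeps hitting TiKV.
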
  7. Why TiDB Cloud • Operation and maintenance costs cannot be ignored • Compliance and security are a must for your data platform, and they are hard to achieve if you build your own • The trend towards multi-cloud is irreversible • A fully managed database brings more value, allowing customers to focus on their business instead of database technical details.

  8. TiDB Cloud architecture (diagram): the customer's VPC (service consumer, e.g. Subnet A 10.1.10.0/24, Subnet B 10.1.20.0/24) connects through a gateway to the TiDB VPC (service provider) in the same cloud provider region, where clusters run TiDB and TiKV nodes across AZ 1-3; the control plane provides central services such as the dashboard UI, Operator, logging, monitoring, billing, alerting, meta store, Chat2Query, and Data API.

  9. What kind of database will be future-proof for the age of AI? • Elastic scalability: > PB level • User friendly: simple and intuitive • Consolidated: all-in-one solution • Cloud-native: always on and always fresh • Open ecosystem: supports OpenAPI and the open source ecosystem • Low cost: pay-as-you-go, supports scale-to-zero

  10. Serverless TiDB • Launch without concern for deployment details (no configuration!) ◦ and blazing fast — what about ~XX or even ~X secs? • Elastic: scales as the workload scales ◦ scale to zero, resume time ~X secs • Pay-as-you-go at a finer granularity ◦ charged by workload metrics (RCU / WCU, Read / Write Capacity Units) ◦ especially for OLAP workloads • Better and deeper integration with modern application development processes

  11. TiDB Cloud Serverless - how it works (architecture diagram: a shared gateway and an isolated SQL layer with per-tenant virtual clusters of TiDB and MPP workers, an on-demand pool, a shared storage layer of row and columnar engines backed by S3, and shared services such as compaction, analyze, DDL, and remote coprocessor). • Complete compute-storage separation • Multi-tenancy by nature • Second-level launch time • Pay-as-you-go

  12. Build a Scalable Real-Time GitHub Insight WebApp in 1 Week - the OSSInsight Story. ~6 billion GitHub events (and growing), answering questions like (see the query sketch below): • What country is your favorite open source project most popular in? • What's the most popular JS framework? • Which companies contribute most to open source databases? • …

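As an illustration only, a query of that shape might look like the following, assuming a github_events table with type and repo_name columns (this schema is an assumption here, not taken from the deck):

    -- Which of these JS frameworks gained the most stars?
    SELECT repo_name, COUNT(*) AS stars
    FROM github_events
    WHERE type = 'WatchEvent'
      AND repo_name IN ('facebook/react', 'vuejs/vue', 'angular/angular', 'sveltejs/svelte')
    GROUP BY repo_name
    ORDER BY stars DESC;
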
  13. Build a Scalable Real-Time GitHub Insight WebApp in 1 Week - the OSSInsight Story. THE "best practices" for data-intensive apps 😒 — TiDB Serverless just makes your life easier. Huge data volume - ~12TB, 6 billion rows of data, growing in real time. Mixed workloads - serving online transactions plus analytical queries. Short deadline - we were asked to make it happen in 1 week! Unpredictable traffic spikes - featured on Hacker News with 7x traffic in one day. Visit us - https://ossinsight.io/ Source code - https://github.com/pingcap/ossinsight

  14. Build a Scalable Real-Time GitHub Insight WebApp in 1 Week - the OSSInsight Story. All in TiDB Serverless now, with 70% cost savings. Automatic scaling up and down to handle unpredictable traffic, without the need for manual intervention.

  15. Offerings: TiDB Self-hosted, TiDB Cloud Dedicated, TiDB Cloud Serverless. Capabilities compared across the three: elastic scalability, hybrid workload support, ACID transactions, full-featured SQL, multi-tenancy, everything under control, fully managed, enterprise-level security, multi-cloud platform, auto-failover, no/little manual deployment, programmatic management via OpenAPI, automated scaling / billing based on real-time consumption, instant starts / scale to zero.

  16. Vision: TiDB Self-hosted + HTAP + Serverless = frictionless developer experience. Elastic scalability, auto-failover, ACID transactions, full-featured SQL, fully managed / multi-tenancy, … Real-time high-performance OLAP while processing OLTP consistently, hybrid workload support, no/little manual deployment, programmatic management via OpenAPI, automated scaling / billing based on real-time consumption, instant starts / scale to zero.

  17. Introduction — Sunny Bains, Senior Architect, PingCAP • Working on database internals since 2001 • Was the MySQL/InnoDB team lead at Oracle

  18. TiDB's unique value - Easy to set up and start - MySQL compatible - Scalable by design - Versatile - Reliable - Open source

  19. Agenda 01 Partitioned Raft KV - Turbocharge TiKV with unparalleled performance and scalability 02 Resource Control - Empowering consolidated workloads with precision resource allocation 03 Online DDL - Enhancing database agility with lightning-fast schema changes

  20. Reference architecture • OLTP and OLAP storage • Raft for consensus • Data consistency • 96MB fixed-size regions (see the config check below) • Fault tolerance ◦ across availability zones

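The 96MB default is TiKV's region split threshold; as a quick sketch (assuming the standard coprocessor.region-split-size configuration item), it can be inspected from the SQL layer:

    SHOW CONFIG WHERE type = 'tikv' AND name = 'coprocessor.region-split-size';
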
  21. Region: logical vs physical. A Region is TiKV's logical scaling unit • Operations such as load balancing, scale-out, and scale-in are all based on regions • User traffic is also partitioned by region (see the region listing sketch below). Physically there is no clear boundary for a region • We have to scan all the SSTs to prepare the snapshot for a region - read amplification • Writes to a few hot regions can easily trigger compaction in RocksDB - write amplification • Large TiKV instances magnify both the read and write amplification • A few hot regions can impact the performance of all regions within the same RocksDB instance

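To see how a table's rows map onto regions, one quick sketch from SQL (the table name is a placeholder):

    SHOW TABLE orders REGIONS;  -- lists each region's ID, start/end key, leader, and peers for the table
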
  22. Problems with a single RocksDB. RocksDB does not perform well on (very) large datasets • Write amplification (LSM compaction) • Latency jitter • Lacks I/O isolation • Write throughput is limited. Scaling is slow, for both scale-in and scale-out. With too many regions • Wasted resources (e.g. heartbeats) • High pressure on the Placement Driver (PD)

  23. Partitioned Raft KV solution (diagram): the TiKV instance stack (KV API, Coprocessor API, Transaction, Raft) stays the same; the store changes from a single RocksDB holding Region 1, Region 2, … to multiple RocksDB instances with one tablet per region (Tablet Region 1, Tablet Region 2, …).

  24. Multi-Rocks solution • Unifies logical and physical partitioning • The size of a region is now variable and can range up to 10GB+ • The hotter the region, the smaller its size • No WAL in the region's RocksDB; the raft log is used instead • No region metadata CF in RocksDB; the metadata moved to Raft Engine • No change in the SST data format, compatible with older versions

  25. Benefits of multi-RocksDB • Larger region size reduces heartbeat cost by 99% • Better QoS per region • Removes the bottleneck during the apply stage • Super fast/cheap snapshot generation and apply • More predictable throughput over time • Natural hot and cold data separation • Can potentially offload cold data to cheaper storage such as S3 • Physical isolation for different tables/databases/users

  26. Multi-Rocks benefit - Sysbench on AWS M5.2xlarge (workload: Multi-RocksDB vs Single-RocksDB, throughput ratio)
    • Prepare: 30 MB/s vs 7.5 MB/s - 380%
    • Insert: QPS 21K, P95 8.43ms vs QPS 16K, P95 13.46ms - 131%
    • Point-Select: QPS 75.7K, P95 3.96ms vs QPS 77K, P95 4.1ms - 98.3%
    • Read-Write: QPS 34.3K, avg 74ms, P95 123ms vs QPS 36K, P95 121ms - 95.4%
    • Read-Only: QPS 40K, P95 87ms vs QPS 40.4K, P95 86ms - 99%

  27. Multi-Rocks benefit - Sysbench on AWS M5.2xlarge (1TB size, block-size=16KB); Single-RocksDB measured after compaction (workload: Multi-RocksDB vs Single-RocksDB, throughput ratio)
    • Insert: QPS 21K, P95 7.98ms vs QPS 14.5K, P95 12ms - 144.8%
    • Point Select: QPS 76K, P95 3.89ms vs QPS 27.4K, P95 20.37ms - 277%
    • Read Write: QPS 34.1K, avg 75ms, P95 132ms vs QPS 20.2K, avg 186ms, P95 253ms - 168%
    • Read Only: QPS 38K, P95 142ms vs QPS 23.1K, P95 200ms - 164%

  28. Agenda 01 Partitioned Raft KV - Turbocharge TiKV with unparalleled performance and scalability 02 Resource Control - Empowering consolidated workloads with precision resource allocation 03 Online DDL - Enhancing database agility with lightning-fast schema changes

  29. Why TiDB Resource Control • Flow control - resource quota • Schedule control - job priority • Resource segregation - CPU / IO • Fine-grained resource abstraction - RU / RG • Usage tracking - tune / apportion

  30. TiDB Resource Control - when there are multiple apps/databases: costs increase, they are hard to maintain, and cross-database joins are hard. Consolidate? Then workloads interfere with each other's QoS, and one change can turn into a disaster.

  31. What is Resource Control? Manage multiple workloads in a TiDB cluster: isolate, manage, and schedule access to the resources shared by the same TiDB cluster - OLTP workloads (short queries, small updates…), OLAP workloads (large batches, ad-hoc queries…), and maintenance jobs (backups, automatic tasks…) from App1/User1, App2/User2, App3/User3. E.g. xx workloads use too many resources and this impacts the P99 latency of small queries; limit the resources allocated to app xxx / user xxx; allocate more resources to higher-priority apps/users when the system is overloaded.

  32. What is a Resource Group? A resource group is a logical container for managing CPU and I/O. There are 3 important options for each resource group: • RU_PER_SEC - the rate of RU backfilling per second; must be specified when creating a resource group • PRIORITY - the absolute priority of tasks to be processed on TiKV; the default value is MEDIUM • BURSTABLE - if the BURSTABLE attribute is set, the group may use available free system resources even when its quota is exceeded. A minimal example combining all three appears below.

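For instance, a low-priority background group that can borrow idle capacity might be defined as follows (the group name and quota are placeholders):

    CREATE RESOURCE GROUP IF NOT EXISTS rg_batch
        RU_PER_SEC = 500   -- refill rate of the group's token bucket
        PRIORITY = LOW     -- scheduled behind MEDIUM/HIGH groups on TiKV
        BURSTABLE;         -- may borrow free capacity beyond its quota
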
  33. Request Unit (RU) and scheduling. A Request Unit (RU) is an abstract unit for measuring system resource usage. TiDB uses mClock, a weight- and constraint-based scheduler: "...constraint-based scheduler ensures that [tasks] receive at least their minimum reserved service and no more than the upper limit in a time interval, while the weight-based scheduler allocates the remaining throughput to achieve proportional sharing."

  34. Evaluate system capacity
    • Estimate capacity based on hardware deployment and standard workloads:
      CALIBRATE RESOURCE; -> QUOTA 190470 (1 row in set, 0.01 sec)
      CALIBRATE RESOURCE WORKLOAD OLTP_WRITE_ONLY; -> QUOTA 27444 (1 row in set, 0.01 sec)
    • Estimate capacity based on actual workloads:
      CALIBRATE RESOURCE START_TIME '2023-04-18 08:00:00' DURATION '20m'; -> QUOTA 27969 (1 row in set, 0.01 sec)
      CALIBRATE RESOURCE START_TIME '2023-04-18 08:00:00' END_TIME '2023-04-18 08:20:00'; -> QUOTA 27969 (1 row in set, 0.01 sec)

  35. Manage resource groups
    • Create: CREATE RESOURCE GROUP IF NOT EXISTS rg1 RU_PER_SEC = 1000 BURSTABLE;
    • Alter: ALTER RESOURCE GROUP rg1 RU_PER_SEC = 20000 PRIORITY = HIGH;
    • Drop: DROP RESOURCE GROUP rg1;
    • Query: SHOW CREATE RESOURCE GROUP rg1; SELECT * FROM information_schema.resource_groups WHERE NAME = 'rg1';

  36. Bind resource groups
    • User-level mapping: CREATE USER 'user1'@'%' RESOURCE GROUP rg1; ALTER USER 'user1' RESOURCE GROUP rg2; SELECT User, User_attributes FROM mysql.user WHERE User = 'user1';
    • Session-level mapping: SET RESOURCE GROUP <group name>; SELECT CURRENT_RESOURCE_GROUP();
    • Statement-level mapping with the hint /*+ resource_group(${GROUP_NAME}) */: SELECT /*+ resource_group(rg1) */ * FROM t1; INSERT /*+ resource_group(rg2) */ INTO t2 VALUES(2);
    Precedence: statement (hint) level > session level > user level

  37. Resource Control architecture (diagram: each TiDB node runs a resource meter, a token bucket client, and a local admission controller; PD runs the token bucket server and global admission controller; TiKV nodes run scheduling control; token bucket requests flow between TiDB and PD, KV/DistSQL requests carry group resource information, and TiKV applies mClock-based scheduling). Admission control layer • Quota limits by Request Unit • GAC ◦ maintains the global token buckets • LAC ◦ measures the resources used by TiKV and TiDB (CPU + IO -> RU -> tokens) and consumes tokens allocated by the GAC. Scheduling control layer • Enhanced mClock-based scheduling • Weight inputs ◦ RU quota defined in resource groups ◦ priority defined in resource groups ◦ BURSTABLE

  38. Agenda 01 Partitioned Raft KV - Turbocharge TiKV with unparalleled performance and scalability 02 Resource Control - Empowering consolidated workloads with precision resource allocation 03 Online DDL - Enhancing database agility with lightning-fast schema changes

  39. MySQL solves DDL with MDL. MDL = metadata lock: the table is locked for all sessions while the metadata (DD) is being changed. For non-instant DDL, e.g. ADD INDEX, the metadata change still needs to block! • It's a single-instance/single-writer model • It causes problems with replication • Each replica asynchronously runs the DDL under an MDL • And if the DDL is not instant, it makes replication lag worse. A MySQL example is sketched below.

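For illustration only (the table and index names are placeholders): even MySQL's online DDL path still takes brief exclusive metadata locks when the operation starts and finishes, and each replica repeats the whole index build when the statement replicates.

    -- Online in-place index build on the primary; still acquires a short
    -- exclusive MDL at the start and end of the operation.
    ALTER TABLE t1 ADD INDEX idx_k (k), ALGORITHM=INPLACE, LOCK=NONE;
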
  40. Is a distributed database different? Client connections see and act on the same data. Issues to solve (ADD INDEX as an example): • There is no synchronous update of metadata/schemas across all cluster nodes • Index entries need to be created for all existing rows in the table • Entries need to be updated for concurrent user changes

  41. The solution: version all schemas. Allow sessions to use the current or the previous schema version. Use sub-state transitions, so that version N-1 is compatible with version N. Create states that allow the full transition from state 'None/Start' to state 'Public'.

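From the user's point of view the whole state machine is hidden behind a single online statement; as a small sketch (table and index names are placeholders), you can run the DDL and watch its progress:

    ALTER TABLE t1 ADD INDEX idx_k (k);  -- runs online; reads and writes continue during the backfill
    ADMIN SHOW DDL JOBS;                 -- shows the job state (e.g. "write reorganization") and its progress
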
  42. Performance test on cloud. Environment: TiDB Cloud on AWS • TiDB: (16c + 32G) * 2 • TiKV: (16c + 64G + 1T) * 9. Sysbench tables with 100M, 1B, and 5B rows; add an index on the int column k • tidb_ddl_reorg_batch_size = 1024 • tidb_ddl_reorg_worker_cnt = 8. Timings with no concurrent DML load:
    • 100 million rows: old 11 min 20 sec, new 1 min 18 sec
    • 1 billion rows: old 2 h 33 min 1 sec, new 13 min 42 sec
    • 5 billion rows: old 10 h 27 min 43 sec, new 1 h 5 min 5 sec
    (The slide also shows a chart comparing TiDB v6.1.0, Aurora, Cockroach, and TiDB v6.5.0.)

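The two reorg settings on the slide are ordinary global system variables; a sketch of how such a run is driven (sbtest1 is the usual sysbench table name and an assumption here):

    SET GLOBAL tidb_ddl_reorg_batch_size = 1024;  -- rows copied per backfill batch
    SET GLOBAL tidb_ddl_reorg_worker_cnt = 8;     -- concurrent backfill workers
    ALTER TABLE sbtest1 ADD INDEX idx_k (k);      -- the measured ADD INDEX on the int column k
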
  43. Product roadmap (a timeline running from Fall/Winter '23 across January into Spring '24): background job resource control, index usage view, query plan stability enhancements, new optimizer framework, distributed background task framework, plan cache enhancements, MySQL 8.0, SQL blocklist, Multi-Rocks GA (PB scale), physical incremental backup, TiCDC integration with big data systems, TiCDC pluggable sink.