
TiDB, MySQL, and the Future - PingCAP K.K. - TiDB User Day 2023

Event date: July 7, 2023
Speakers: Ed Huang, Co-founder & CTO, PingCAP Inc.
    Sunny Bains, Senior Architect, PingCAP Inc.

In this deck, Ed Huang, co-founder and CTO, and Sunny Bains, a senior architect who was formerly an InnoDB developer on MySQL and the development lead for HeatWave, visit Japan from PingCAP's US headquarters to share the latest on TiDB. In Ed's part, he covers database industry trends overseas and in the US, PingCAP's vision and strategy, the latest TiDB developments, and overseas case studies. In Sunny's part, he shares the potential of NewSQL from a MySQL expert's perspective and explains TiDB's newest features in more detail, including a new storage engine that improves QPS and supports low latency, resource control that guarantees resources, and the status of MySQL 8.0 support.

Archive videos:
[Part I_Ed Huang編]:https://youtu.be/KCbDLIUOclI
[Part II_Sunny Bains編]:https://youtu.be/O-pfx5SHigw

PingCAP-Japan

July 11, 2023

Transcript

  1. Exploring the Latest TiDB
    Features and Future
    Roadmap - Part I

  2. Ed Huang, Co-founder CTO, PingCAP
    Graduated from Northeastern University,
    Ed worked for Microsoft Research Asia, and Wandou Labs.
    Ed is a senior infrastructure software engineer and system
    architect, an expert in distributed systems and database
    development, with rich experience and a unique understanding of
    distributed storage.
    Ed is the co-author of Codis, a widely used distributed Redis
    solution, and of TiDB, a distributed HTAP database.
    Projects:
    https://github.com/pingcap/tidb
    https://github.com/pingcap/tikv
    https://github.com/CodisLabs/codis
    Introduction

  3. A little bit of history

  4. We don't want to shard
    🤕😥
    🥱
    ● Hundreds of TB
    ● Reshard / Rebalance (at midnight)
    ● Keeping strong consistency when a MySQL node is lost
    ● Query conditions that don't include the sharding key
    ● Explaining to angry application developers who don't
    understand why they can't do things like in the old days
    ● …

  5. A database for modern data-intensive applications
    We decided to build a database from scratch
    ● Strong consistency & cross-row transactions
    ● Scalable
    ● Highly available
    ● 100% OLTP + ad-hoc OLAP
    ● Cloud-native
    ● Easy to use, with full-featured SQL
    ● Open source!

  6. A Scalable OLTP database

  7. 1,000 ft Architecture Overview of TiDB
    Everything
    distributed!

  8. The architecture during
    2017-2018
    ● Elastic scaling, distributed
    ACID transaction processing
    ● Strong consistency across
    nodes
    ● Built-in HA
    TiDB
    Then, we brought it to life, and named it TiDB

  9. From OLTP to HTAP

  10. TiDB Cluster
    OLTP Workload
    (Insert / Update / Delete /…)
    OLAP Query
    (Join / Group By / …)
    If you've got a scalable OLTP database with SQL support,
    what's your next move? Why not OLAP queries?

  11. TiDB + TiFlash: isolate OLTP and
    OLAP workloads, and process
    fresh data in real time.
    TiFlash: Columnar
    extension for TiDB
    HTAP hit the road in 2020 - 2021
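    As a quick illustration, TiFlash replicas are enabled per table; a
    minimal SQL sketch (the table name is hypothetical):
    ALTER TABLE orders SET TIFLASH REPLICA 1;
    -- Check replication progress of the columnar replica:
    SELECT TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
      FROM information_schema.tiflash_replica;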

  12. We kept the evolution
    momentum on Cloud

  13. ● Operation and maintenance costs are not negligible
    ● Compliance and security are a must for your data platform,
    and they are hard to get right if you build your own
    ● An irreversible trend towards multi-cloud
    ● A fully managed database brings more value, allowing
    customers to focus on their business instead of
    database technical details.
    Why TiDB Cloud

  14. TiDB Cloud
    Horizontal Scaling · Strong Resilience · Geo-Distribution · Easy Upgrades
    Real-Time Analytics · Industry-Leading Security · Monitoring · 24/7 Support

  15. TiDB Cloud Architecture
    (Diagram: within a cloud provider region, the customer's VPC (the
    service consumer, with subnets 10.1.10.0/24 and 10.1.20.0/24)
    connects to the TiDB VPC (the service provider), which hosts
    clusters of TiDB and TiKV nodes spread across AZ 1-3 together with
    the Operator, Logging, Monitoring, and Gateway components; a
    control plane of central services provides Billing, Alerting, the
    Meta Store, Chat2Query, the Data API, and the Dashboard UI.)

  16. Next big thing for
    TiDB Cloud - Serverless

  17. What kind of database will be future-proof for the
    age of AI?
    ● Elastic Scalability: > PB level
    ● Simple and intuitive: user friendly
    ● Consolidated: all-in-one solution
    ● Always on and always fresh: cloud-native
    ● Open ecosystem: supports OpenAPI and the open source ecosystem
    ● Low cost: pay-as-you-go, supports scale to zero

  18. Serverless TiDB
    ● Launch without concern for deployment details (no
    configuration!)
    ○ and blazing fast: what about ~XX or even ~X secs?
    ● Elastic scaling as the workload scales
    ○ Scale to zero, resume time ~X secs
    ● Pay-as-you-go at a finer granularity
    ○ Charged by workload metrics (RCU / WCU, Read / Write Capacity Units)
    ○ Especially for OLAP workloads
    ● Better and deeper integration with modern application
    development processes

  19. TiDB Cloud Serverless - how it works
    ● Full compute-storage separation
    ● Multi-tenancy by nature
    ● Second-level launch time
    ● Pay-as-you-go
    (Diagram: a shared gateway layer routes each tenant's virtual
    cluster (Tenant 1 … Tenant n) to an isolated SQL layer of TiDB
    nodes and MPP workers drawn from an on-demand pool, on top of a
    shared storage layer of row and columnar engines backed by a
    shared storage pool on S3; shared services handle compaction,
    analyze, DDL, and remote coprocessor requests.)

  20. TiDB Serverless already has
    over 10,000 active clusters
    It went GA yesterday 🎉

  21. Serverless TiDB

  22. Build a Scalable Real-Time Github Insight WebApp in
    1 week - the OSSInsight Story
    ~6 billion GitHub event records (and
    growing), answering questions like:
    ● What country is your favorite open source
    project most popular in?
    ● What's the most popular JS framework?
    ● Which companies contribute most to open
    source databases?
    ● …
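    A sketch of the kind of ad-hoc analytical query behind these
    answers, assuming a simplified github_events(type, repo_name,
    created_at) table (the real schema lives in the ossinsight
    repository referenced on the next slide):
    SELECT repo_name, COUNT(*) AS stars
      FROM github_events
     WHERE type = 'WatchEvent'
       AND created_at >= '2023-01-01'
     GROUP BY repo_name
     ORDER BY stars DESC
     LIMIT 10;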

  23. Huge Data Volume - ~12TB, 6 Billion rows of data, and it grows in real-time.
    Mixed Workloads - Serving online transactions plus analytical queries.
    Short Deadline - We were asked to make it happen in 1 week!
    Unpredictable Traffic Spikes - Featured on HackerNews with 7x traffic in one day. Visit us - https://ossinsight.io/
    THE “best practices” for data-intensive apps
    TiDB Serverless just makes your life easier
    Source code - https://github.com/pingcap/ossinsight
    😒

    Build a Scalable Real-Time Github Insight WebApp in
    1 week - the OSSInsight Story

  24. All in TiDB Serverless
    now with 70%
    cost-saving
    Automatic scaling up
    and down to handle
    unpredictable traffic,
    without the need for
    manual intervention
    Build a Scalable Real-Time Github Insight WebApp in
    1 week - the OSSInsight Story

  25. Offerings
    TiDB Self-hosted TiDB Cloud Dedicated TiDB Cloud Serverless
    Elastic Scalability
    Hybrid workload support
    ACID Transaction
    Full-featured SQL
    Multi-tenancy
    Everything under control
    Fully managed
    Enterprise-level Security
    Multi Cloud platform
    Auto-Failover
    No/Little manual deployment
    Programmatic management
    via OpenAPI
    Automated scale / billing
    based on real-time
    consumption
    Instant starts /
    scale to zero

  26. Vision
    TiDB
    Self-hosted
    HTAP Serverless
    + + =
    Frictionless
    Developer
    Experience
    Elastic Scalability
    Auto-Failover
    ACID Transaction
    Full-featured SQL
    Fully managed /
    Multi-tenancy

    Real-time
    high-performance OLAP
    while processing OLTP
    consistently
    Hybrid workload support
    No/Little manual deployment
    Programmatic
    management via OpenAPI
    Automated scale / billing
    based on real-time
    consumption
    Instant starts /
    scale to zero

  27. Exploring the Latest TiDB
    Features and Future
    Roadmap - Part II

  28. Introduction
    Sunny Bains
    Senior Architect, PingCAP
    ● Working on database
    internals since 2001.
    ● Was MySQL/InnoDB team lead
    at Oracle.

  29. TiDB’s unique value
    - Easy to set up and start
    - MySQL compatible
    - Scalable by Design
    - Versatile
    - Reliable
    - Open source

  30. Agenda
    01 Partitioned Raft KV
       Turbocharge TiKV with Unparalleled Performance and Scalability
    02 Online DDL
       Enhancing Database Agility with Lightning-Fast Schema Changes
    03 Resource Control
       Empowering Consolidated Workloads with Precision Resource Allocation

  31. Reference Architecture
    • OLTP and OLAP storage
    • Raft for consensus
    • Data consistency
    • 96 MB fixed-size regions
    • Fault tolerance
    ○ Across Availability
    Zones

  32. Region: Logical vs Physical
    A Region is TiKV’s logical scale unit
    ● Operations such as load balancing, scale-out, and scale-in are all based on regions
    ● User traffic is also partitioned by region
    Physically there’s no clear boundary for a region
    ● We have to scan all the SSTs to prepare the snapshot for a region - Read amplification
    ● Several hot region writes can easily trigger compaction in RocksDB - Write amplification
    ● Large TiKV instances magnify both the read and write amplification
    ● A few hot regions can impact the performance of all regions within the same RocksDB
    instance
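    A quick way to see this partitioning from SQL; a sketch, with a
    hypothetical table name:
    SHOW TABLE orders REGIONS;   -- one row per region: start/end key, leader, size
    -- Pre-split a new table to spread write load before traffic arrives:
    SPLIT TABLE orders BETWEEN (0) AND (1000000) REGIONS 16;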

  33. Problems with Single-RocksDB
    RocksDB does not perform well on (very) large datasets
    ● Write amplification
    ● Latency Jitter
    ● Lacks I/O isolation
    ● Write throughput is limited
    Scaling is slow, both in and out
    With too many regions:
    ● Wasted resources (e.g. heartbeats)
    ● High pressure on the Placement Driver (PD)
    (Diagram: LSM compaction)

  34. Partitioned Raft KV Solution
    (Diagram: Single RocksDB vs Multi RocksDB. In both, a TiKV instance
    stacks the KV API and Coprocessor API over the Transaction and Raft
    layers. With a single RocksDB, one store keeps Region 1, Region 2, …
    inside the same RocksDB instance; with Partitioned Raft KV, each
    region gets its own tablet, i.e. its own RocksDB instance.)

  35. Multi-Rocks solution
    ● Unifies the logical and physical partition
    ● The size of a region is now variable and can range up to 10GB+
    ● The hotter the region, the smaller its size
    ● No WAL in a region's RocksDB; the raft log is used instead
    ● No region metadata CF in RocksDB; metadata moved to the
    Raft-Engine
    ● No change in the SST data format, compatible with old versions
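    The engine is chosen per TiKV node at deployment time; a minimal
    sketch, assuming the configuration key used in recent releases
    (storage.engine = "partitioned-raft-kv" in the TiKV config file),
    with SQL used only to verify what a running cluster is using:
    SHOW CONFIG WHERE type = 'tikv' AND name = 'storage.engine';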

  36. Benefits of multi-RocksDB
    ● Larger region size reduces heartbeat cost by 99%
    ● Better QoS per region
    ● Removes the bottleneck during the apply stage
    ● Super fast/cheap snapshot generation and apply
    ● More predictable throughput over time
    ● Natural hot/cold data separation
    ● Can potentially offload cold data to cheaper storage such as S3
    ● Physical isolation for different tables/databases/users

  37. Multi-Rocks benefit
    Sysbench on AWS M5.2xlarge

    Workload     | Multi-RocksDB                     | Single-RocksDB          | Throughput Ratio
    Prepare      | 30 MB/s                           | 7.5 MB/s                | 380%
    Insert       | QPS: 21K, P95: 8.43ms             | QPS: 16K, P95: 13.46ms  | 131%
    Point-Select | QPS: 75.7K, P95: 3.96ms           | QPS: 77K, P95: 4.1ms    | 98.3%
    Read-Write   | QPS: 34.3K, avg: 74ms, P95: 123ms | QPS: 36K, P95: 121ms    | 95.4%
    Read-Only    | QPS: 40K, P95: 87ms               | QPS: 40.4K, P95: 86ms   | 99%

  38. Multi-Rocks benefit
    Sysbench on AWS M5.2xlarge (1TB size, block-size=16KB)

    Workload     | Multi-RocksDB                     | Single-RocksDB¹                    | Throughput Ratio
    Insert       | QPS: 21K, P95: 7.98ms             | QPS: 14.5K, P95: 12ms              | 144.8%
    Point Select | QPS: 76K, P95: 3.89ms             | QPS: 27.4K, P95: 20.37ms           | 277%
    Read Write   | QPS: 34.1K, avg: 75ms, P95: 132ms | QPS: 20.2K, avg: 186ms, P95: 253ms | 168%
    Read Only    | QPS: 38K, P95: 142ms              | QPS: 23.1K, P95: 200ms             | 164%
    ¹After compaction

  39. Agenda
    01 Partitioned Raft KV
       Turbocharge TiKV with Unparalleled Performance and Scalability
    02 Online DDL
       Enhancing Database Agility with Lightning-Fast Schema Changes
    03 Resource Control
       Empowering Consolidated Workloads with Precision Resource Allocation

  40. Why TiDB Resource Control
    ● Flow control: resource quotas
    ● Schedule control: job priority
    ● Resource segregation: CPU / IO
    ● Fine-grained resource abstraction: RU / RG
    ● Usage tracking: tune / apportion

  41. When there are multiple apps/databases
    Keep them separate: cost increase, hard to maintain, hard to do
    cross-database joins.
    Consolidate? Workloads interfere with each other's QoS, and a
    change can become a disaster.
    TiDB Resource Control

  42. A typical microservice architecture: Database per service

  43. What is Resource Control?
    Manage multiple workloads in a TiDB cluster.
    Isolate, manage and schedule access to resources sharing the
    same TiDB cluster: OLTP workloads (short queries, small updates…),
    OLAP workloads (large batches, ad-hoc queries…), and maintenance
    jobs (backup, auto tasks…), coming from App1/User1, App2/User2,
    App3/User3.
    e.g. xx workloads use too many resources and this impacts
    the P99 latency of small queries.
    e.g. Limit the resources allocated to app xxx / user xxx.
    e.g. Allocate more resources to higher-priority apps/users
    when the system is overloaded.
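    A sketch of that scenario (group and user names are hypothetical):
    cap the ad-hoc OLAP app and give the latency-sensitive OLTP app
    priority.
    CREATE RESOURCE GROUP IF NOT EXISTS rg_olap RU_PER_SEC = 2000;
    CREATE RESOURCE GROUP IF NOT EXISTS rg_oltp RU_PER_SEC = 8000 PRIORITY = HIGH BURSTABLE;
    ALTER USER 'olap_user' RESOURCE GROUP rg_olap;
    ALTER USER 'oltp_user' RESOURCE GROUP rg_oltp;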

  44. What is a Resource Group?
    A resource group is a logical container for managing CPU and I/O.
    There are 3 important options for each resource group:

    Option     | Description
    RU_PER_SEC | Rate of RU backfilling per second.
               | Must be specified when creating a resource group.
    PRIORITY   | The absolute priority of tasks to be processed on TiKV.
               | The default value is MEDIUM.
    BURSTABLE  | If the BURSTABLE attribute is set, the group may use
               | available free system resources even if its quota is exceeded.
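    A sketch combining all three options (the group name is hypothetical):
    CREATE RESOURCE GROUP IF NOT EXISTS rg_batch
      RU_PER_SEC = 500    -- token (RU) refill rate per second
      PRIORITY = LOW      -- absolute priority on TiKV (default MEDIUM)
      BURSTABLE;          -- may borrow idle capacity beyond the quota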

  45. Request Unit (RU) and Scheduling
    A Request Unit (RU) is an abstract unit for
    measuring system resource usage.
    TiDB uses mClock, which is a weight and constraint
    based scheduler.
    “...constraint-based scheduler ensures that [tasks] receive at least
    their minimum reserved service and no more than the upper limit
    in a time interval, while the weight-based scheduler allocates the
    remaining throughput to achieve proportional sharing.”

  46. Evaluate system capacity
    ● Estimate capacity based on hardware deployment and standard workloads
    ● Estimate capacity based on actual workloads

    CALIBRATE RESOURCE;
    +--------+
    | QUOTA  |
    +--------+
    | 190470 |
    +--------+
    1 row in set (0.01 sec)

    CALIBRATE RESOURCE WORKLOAD OLTP_WRITE_ONLY;
    +-------+
    | QUOTA |
    +-------+
    | 27444 |
    +-------+
    1 row in set (0.01 sec)

    CALIBRATE RESOURCE START_TIME '2023-04-18 08:00:00' DURATION '20m';
    +-------+
    | QUOTA |
    +-------+
    | 27969 |
    +-------+
    1 row in set (0.01 sec)

    CALIBRATE RESOURCE START_TIME '2023-04-18 08:00:00' END_TIME '2023-04-18 08:20:00';
    +-------+
    | QUOTA |
    +-------+
    | 27969 |
    +-------+
    1 row in set (0.01 sec)

  47. Manage resource groups
    Create Resource Group
    CREATE RESOURCE GROUP IF NOT EXISTS rg1 RU_PER_SEC = 1000 BURSTABLE;
    Alter Resource Group
    ALTER RESOURCE GROUP rg1 RU_PER_SEC=20000 PRIORITY = HIGH;
    Drop Resource Group
    DROP RESOURCE GROUP rg1;
    Query Resource Group(s)
    SHOW CREATE RESOURCE GROUP rg1;
    SELECT * FROM information_schema.resource_groups WHERE NAME = 'rg1';

  48. Bind resource groups
    User Level Mapping
    CREATE USER 'user1'@'%' RESOURCE GROUP rg1;
    ALTER USER 'user1' RESOURCE GROUP rg2;
    SELECT User, User_attributes FROM mysql.user WHERE User = 'user1';
    Session Level Mapping
    SET RESOURCE GROUP rg1;
    SELECT current_resource_group();
    Statement Level Mapping
    Hint: /*+ resource_group( ${GROUP_NAME} ) */
    SELECT /*+ resource_group(rg1) */ * FROM t1;
    INSERT /*+ resource_group(rg2) */ INTO t2 VALUES(2);
    Statement (Hint) Level > Session Level > User Level

  49. Resource Control Architecture
    (Diagram: each TiDB node runs a Resource Meter and a token bucket
    client as the Local Admission Controller; PD runs the token bucket
    server as the Global Admission Controller; each TiKV node runs
    scheduling control. Token bucket requests flow from TiDB to PD,
    and KV/DistSQL requests flow from TiDB to TiKV, which schedules
    them with mClock based on the resource groups.)
    Admission Control Layer
    ● Quota limits by Request Unit
    ● GAC (Global Admission Controller)
    ○ Maintains global token buckets
    ● LAC (Local Admission Controller)
    ○ Measures resources used by TiKV and
    TiDB (CPU + IO -> RU -> tokens),
    consumes tokens allocated by the GAC
    Scheduling Control Layer
    ● Enhanced mClock-based scheduling
    ● Weight input
    ○ RU quota defined in resource groups
    ○ Priority defined in resource groups

  50. Agenda
    01 Partitioned Raft KV
       Turbocharge TiKV with Unparalleled Performance and Scalability
    02 Online DDL
       Enhancing Database Agility with Lightning-Fast Schema Changes
    03 Resource Control
       Empowering Consolidated Workloads with Precision Resource Allocation

  51. MySQL solves DDL with MDL
    MDL = Meta Data Lock
    The table is locked for all sessions while the metadata (DD) is
    being changed.
    For non-instant DDL, e.g. ADD INDEX, the metadata change still
    needs to block!
    ● It's a single instance/writer model.
    ● Causes problems with replication
    ● Each replica will asynchronously run the DDL with an MDL
    ● If it's not an instant DDL, it also makes replication lag worse.
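    A sketch of that blocking behaviour in MySQL (table and column
    names are hypothetical):
    -- Session 1: a long-running transaction keeps an MDL on t.
    START TRANSACTION;
    SELECT * FROM t WHERE id = 1;
    -- Session 2: the ALTER waits for session 1's MDL ...
    ALTER TABLE t ADD INDEX idx_k (k);
    -- Session 3: ... and even plain reads now queue behind the ALTER,
    -- showing up as "Waiting for table metadata lock" in SHOW PROCESSLIST.
    SELECT COUNT(*) FROM t;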

  52. Is a distributed database different?
    Client connections see and act on the same data.
    Issues to solve (ADD INDEX as an example):
    ● No synchronous update of metadata/schemas for all cluster
    nodes.
    ● Need to create index entries for all existing rows in the table.
    ● Need to update entries for concurrent user changes.

  53. The Solution
    Version all schemas.
    Allow sessions to use the current or the previous schema version.
    Use sub-state transitions, so that version N-1 is compatible with
    version N.
    Create states that will allow the full transition from state
    ‘None/Start’ to state ‘Public’.
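    In TiDB these sub-state transitions can be watched while a job runs;
    a sketch (table and index names are hypothetical):
    ALTER TABLE t ADD INDEX idx_k (k);   -- run from one session
    ADMIN SHOW DDL JOBS;                 -- run from another session; the
    -- SCHEMA_STATE column moves through delete only -> write only ->
    -- write reorganization -> public as the job progresses.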

  54. Performance Test on Cloud
    Environment: TiDB Cloud on AWS
    ● TiDB: (16c + 32G) * 2
    ● TiKV: (16c + 64G + 1T) * 9
    Sysbench 100M, 1B, 5B rows; add an index on k (int)
    ● tidb_ddl_reorg_batch_size = 1024
    ● tidb_ddl_reorg_worker_cnt = 8

    Rows        | Old Time           | New Time
    100 Million | 11 min 20 sec      | 1 min 18 sec
    1 Billion   | 2 h 33 min 1 sec   | 13 min 42 sec
    5 Billion   | 10 h 27 min 43 sec | 1 h 5 min 5 sec

    Timings with no concurrent DML load
    (Chart: comparison across TiDB v6.1.0, Aurora, Cockroach, and TiDB v6.5.0)
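    A sketch of the knobs used above, plus the switch for the
    accelerated index build introduced around v6.5 (on by default
    there); the table name is the standard sysbench one:
    SET GLOBAL tidb_ddl_reorg_worker_cnt = 8;
    SET GLOBAL tidb_ddl_reorg_batch_size = 1024;
    SET GLOBAL tidb_ddl_enable_fast_reorg = ON;
    ALTER TABLE sbtest1 ADD INDEX idx_k (k);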

  55. Product Roadmap
    Fall/Winter ‘23 Spring ‘24
    Background job resource
    control
    Index usage view
    Query plan stability enhancements
    New optimizer framework
    Distributed background
    task framework
    Plan cache
    enhancements
    MySQL 8.0 SQL blocklist
    Multi-rocks GA (PB scale)
    Physical incremental backup
    TiCDC integration w/ big data systems TiCDC pluggable sink
    January
