How to build a serverless database cloud service

TiDB: Inside the Journey to Build a Cloud-Native, Distributed SQL Database
Li Shen @ PingCAP

“Future users of large data banks must be protected from having to know how the data is organized in the machine.”
Edgar Codd, 1970. ACM Turing Award (1981)
Relational databases are still critical to most applications after 50 years.

Challenges from Real World
• Character.ai web app was topping 200 million visits per month, Character.AI claims, and users were spending on average 29 minutes per visit. (From TechCrunch.com)
• “They also use Cloud Spanner to ingest terabytes of data every day with reliability and global scale.”

Efforts to Address These Challenges
• NoSQL
• Primary-Secondary Replication
• Read-Write Separation
• Sharding

[Table comparing RDBMS, NoSQL, and Distributed SQL Database on: ACID Transaction, Query/Index, Data Integrity, Scalability, High Availability, Flexibility]
Distributed SQL melds the scalability and reliability of NoSQL with the familiar experience of traditional SQL databases.

Building a Distributed SQL Database

TiDB v0.1
[Diagram: SQL → AST → Execution Plan (SQL Layer) → KV API → Local Storage Engines: BoltDB, LevelDB, RocksDB, Memory]

TiDB v0.5
[Diagram: SQL → AST → Logical Plan → Physical Plan (SQL Layer) → KV API + Coprocessor API → Storage Engines: HBase]

TiKV: a Distributed KV Storage
• ACID Transaction
• High Performance
• Scalable
• Reliable

Data is “Auto-Sharded”
• SQL data to KV data
• Split data into small chunks/regions
  ◦ 96MB by default
• Each region has multiple replicas as a consensus group
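The mapping from rows to keys and from keys to regions can be sketched as follows. This is a minimal illustration, not TiDB's or TiKV's actual code: the key layout is a simplified stand-in loosely modeled on a table-prefix plus row-id scheme, ids are assumed non-negative, and `RegionMap`, `encode_row_key`, and the tiny `max_region_bytes` threshold are hypothetical names and values chosen so that splits happen quickly.

```python
import bisect
import struct


def encode_row_key(table_id: int, row_id: int) -> bytes:
    """Encode (table, row) so that byte order matches (table_id, row_id) order.

    Simplified, hypothetical layout; assumes non-negative ids.
    """
    return b"t" + struct.pack(">Q", table_id) + b"_r" + struct.pack(">Q", row_id)


class RegionMap:
    """Keeps key ranges ("regions") and splits one when it grows too large."""

    def __init__(self, max_region_bytes: int):
        self.max_region_bytes = max_region_bytes
        # Region i covers [start_keys[i], start_keys[i + 1]).
        self.start_keys = [b""]
        self.regions = [{}]

    def locate(self, key: bytes) -> int:
        """Return the index of the region whose range contains key."""
        return bisect.bisect_right(self.start_keys, key) - 1

    def get(self, key: bytes):
        return self.regions[self.locate(key)].get(key)

    def put(self, key: bytes, value: bytes) -> None:
        idx = self.locate(key)
        region = self.regions[idx]
        region[key] = value
        size = sum(len(k) + len(v) for k, v in region.items())
        if size > self.max_region_bytes:
            self._split(idx)

    def _split(self, idx: int) -> None:
        """Split region idx at its median key into two adjacent regions."""
        region = self.regions[idx]
        keys = sorted(region)
        if len(keys) < 2:
            return
        mid = keys[len(keys) // 2]
        self.regions[idx] = {k: v for k, v in region.items() if k < mid}
        self.regions.insert(idx + 1, {k: v for k, v in region.items() if k >= mid})
        self.start_keys.insert(idx + 1, mid)


# Usage: a toy threshold stands in for the 96MB default.
region_map = RegionMap(max_region_bytes=290)
for row_id in range(100):
    region_map.put(encode_row_key(1, row_id), b"x" * 10)
```

Because the encoding preserves order, rows with nearby ids land in the same or adjacent regions, which is what makes range scans over a table map onto a small number of regions.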

Data is Replicated
• Raft
  ◦ Strong consistency
  ◦ Leader election
  ◦ Efficient and Understandable
• Replica placement
  ◦ Spread the replicas among TiKV nodes
  ◦ Consider data safety and load balancing
• Auto healing
  ◦ Short-term failures: Raft leader re-election
  ◦ Longer-term failures: Rebuild the replicas on the missing nodes
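The majority rule behind Raft's strong consistency can be illustrated with a short sketch. It is not TiKV's implementation: the function names are hypothetical, and the sketch omits Raft's additional requirement that a leader only commits entries from its own term by counting replicas.

```python
def quorum_size(replica_count: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replica_count // 2 + 1


def tolerated_failures(replica_count: int) -> int:
    """Replicas that can be lost while a majority is still reachable."""
    return replica_count - quorum_size(replica_count)


def commit_index(match_index: list[int]) -> int:
    """Highest log index stored on a majority of replicas.

    match_index holds, per replica (leader included), the last log
    index known to be replicated there.
    """
    ranked = sorted(match_index, reverse=True)
    return ranked[quorum_size(len(ranked)) - 1]


# Usage: with three replicas at log indexes 7, 5, and 3, entries up to
# index 5 are on two of three replicas and can be committed.
committed = commit_index([7, 5, 3])
```

With three replicas per region, one replica can fail without losing availability; with five, two can fail.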

Distributed SQL - Distributed Computing
SELECT COUNT(*) FROM t WHERE id > 10 AND c2 = 'foo';
Physical Plan on TiDB
• Final Aggregate SUM(COUNT(*))
• DistSQL Scan
Physical Plan on TiKV Coprocessor
• Partial Aggregate COUNT(*)
• Filter c2 = 'foo'
• TableScan Range: id(10, +∞)
[Diagram: each TiKV node returns a partial COUNT(*) to TiDB]
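The two-stage plan for this query can be mimicked in a few lines. This is only a sketch of the partial/final aggregation idea: regions are plain in-memory lists, and the function names and sample rows are made up for illustration.

```python
Row = tuple[int, str]  # (id, c2)


def coprocessor_partial_count(region_rows: list[Row], min_id: int, c2_value: str) -> int:
    """Runs next to the data: scan id > min_id, filter on c2, count locally."""
    return sum(1 for row_id, c2 in region_rows if row_id > min_id and c2 == c2_value)


def final_count(regions: list[list[Row]], min_id: int, c2_value: str) -> int:
    """Runs in the SQL layer: sum the partial counts returned by each region."""
    partials = [coprocessor_partial_count(rows, min_id, c2_value) for rows in regions]
    return sum(partials)


# Hypothetical data spread over three regions.
regions = [
    [(1, "foo"), (5, "bar"), (11, "foo")],
    [(12, "foo"), (20, "bar"), (25, "foo")],
    [(30, "baz"), (31, "foo")],
]

# Equivalent of: SELECT COUNT(*) FROM t WHERE id > 10 AND c2 = 'foo';
total = final_count(regions, min_id=10, c2_value="foo")
```

Only one integer per region crosses the network instead of every matching row, which is the point of pushing the filter and partial aggregate down to the storage nodes.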

Scale-out easily
[Diagram: apps send traffic to TiDB nodes backed by TiKV nodes; with more traffic, more TiDB and TiKV nodes are added (scale-out)]

Rich Features
• Online Schema Change (DDL)
• Columnar Storage Engine
• Massively Parallel Processing (MPP)
• Vectorized Computing Engine
• Data Placement Control
• CDC (Change Data Capture)
• Flashback / PITR
• Automatic Hotspot Detection and Elimination
• SQL Index Advisor
• Zero-Downtime Rolling Upgrades
• …

Distributed SQL Database in the Real World
In the past 8 years, the most diverse workloads from tens of thousands of clusters

Dashboard screenshot of a TiDB cluster
• 200+ nodes
• > 500TB data
• 1.8 trillion records

• 1M reads/sec
• 12k writes/sec

Challenges Yet To Be Fully Addressed
• How to improve stability at scale?
  ◦ Copying data and compaction may slow down your workloads
  ◦ Indexing a large table (10 billion rows): trade-off between speed and stability
• How to take scalability to the next level?
  ◦ Exploding data volume
  ◦ Multiple applications share a single cluster for greater efficiency and lower costs
• How to make it easier and more cost-effective for users?
  ◦ Maintenance burden of a distributed system; trade-offs become “knobs”
  ◦ Scale down when the workload reduces

These are perfect problems for Cloud
Cloud Provides
• Better (and more varied) capabilities
• Elasticity and resilience
• Hiding the complexity of a distributed system

A cloud-native database is not just the capability of running on cloud infra
• Truly integrate the database with the capabilities of cloud
• Truly leverage cloud’s promise of elasticity and resilience
• Cloud Augmented Database -> Cloud-Native Database

TiDB Serverless Architecture
Multi-tenant architecture
[Diagram labels: Virtual Cluster - Tenant 1 … Virtual Cluster - Tenant n, each with SQL nodes (SQL Layer) and Storage nodes (Storage Cache Layer); Resource Pool with DDL worker, MPP Compute service, Data Ingestion service; Shared Storage Pool on Cloud Storage (S3)]
• Shared resource pool for background or heavy compute
• New storage engine built on top of cloud storage

Cloud-Based Storage Engine
[Diagram: Before / After, showing Remote Storage and Services]

Separated Frontend/Background Compute
[Diagram labels: Virtual Cluster - Tenant 1 … Virtual Cluster - Tenant n, each with SQL nodes (Isolated SQL Layer, for OLTP) and Storage nodes (Storage Cache Layer, for OLTP); Resource Pool with DDL service, MPP Compute service, Data Ingestion service; Shared Storage Pool on Cloud Storage (S3)]
Elasticity of Cloud
• Shared resource pool
• Elastic resource allocation
• Spot instance
Benefit
• More stable
• Faster
• Lower cost

Indexing a Large Table
• Distributed indexing with resource pool
  ◦ Linear performance scaling
  ◦ Lower impact on production workloads
[Diagram labels: TiDB / TiKV /……; Records; Resource Pool with multiple DDL workers; Index; Index Data; Shared Storage Pool; Cloud Storage (S3)]
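One way to picture distributed indexing is as a parallel sort-and-merge. The sketch below is an assumption about the general technique, not a description of how TiDB's DDL workers operate: each hypothetical worker turns a chunk of records into a sorted run of index entries, and the runs are then merged into a single ordered index. Threads stand in for separate worker machines.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

Record = tuple[int, str]      # (row_id, indexed column value)
IndexEntry = tuple[str, int]  # (indexed column value, row_id)


def build_sorted_run(chunk: list[Record]) -> list[IndexEntry]:
    """One worker: turn its chunk of records into a sorted run of index entries."""
    return sorted((value, row_id) for row_id, value in chunk)


def build_index(records: list[Record], num_workers: int) -> list[IndexEntry]:
    """Split records across workers, then merge their sorted runs."""
    chunks = [records[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        runs = list(pool.map(build_sorted_run, chunks))
    # heapq.merge lazily merges already-sorted inputs into one sorted stream.
    return list(heapq.merge(*runs))


# Usage with made-up records.
records = [(row_id, f"user{(row_id * 7) % 10}") for row_id in range(50)]
index = build_index(records, num_workers=4)
```

Because each run is built independently, adding workers shortens the sorting phase, and the work happens outside the nodes that serve production queries.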

From 0 to 1,000,000 QPS

AI Makes Things Even Easier
• Monitoring & Diagnosis
• Parameters Tuning
  ◦ Not One-size-fits-all
  ◦ Example: 20% improvement compared with the best TiDB expert
• Write SQL Queries
  ◦ Talk to database with natural language

[Diagram labels: AI Augmented; Happy app developer; Wire Protocol; SQL; Hotspot; Network Partition; Schema; Transaction; Data Corruption; VM Failure; Data Rebalancing; Latency/Performance Trade-off; Data Replication; Runaway Query; Scale-out/in; Index]

Distributed SQL Database for AI Applications

Distributed SQL Database for AI
[Diagram labels: Operational Apps (User Management, Order Management, Messaging, Marketing); Real-time Data Serving Layer; End Users; Data Lake; Machine Learning (Training); Consuming Operational Apps; Data Warehouse; Online Systems; Offline Systems; Write Back; ETL/CDC; Machine Learning (Serving); AI-enriched Apps; Inference Store; Customer 360; Real-time Data; 3rd-party Data; 3rd-Party API; Refresh, Right Data; Storage; Inference]

Thanks! Q&A

