Slide 1

Slide 1 text

Comparison and Selection of DB Architectures (a cloud-native storage service for bulk-load & random-lookup workloads). Sadayuki Furuhashi, Chief Architect, Treasure Data, Inc. Database Engineering Meetup #1

Slide 2

Slide 2 text

About me A founder of Treasure Data, Inc. Located in Tokyo, Japan. OSS Hacker. Github: @frsyuki OSS projects I initially designed:

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

DB technology is evolving at a dizzying pace. Data models: • Relational • Object • Graph • Document • KVS • Inverted index • kNN • … Distributed techniques: • Shared disk • Object store • Master-slave • Partitioning • Distributed streams • Quorum • Distributed consensus • High-precision clock synchronization • … Data structures: • B+Tree • LSM Tree • Bitcask • … Storage: • HDD • SSD • NVMe • …

Slide 5

Slide 5 text

DB technology is evolving at a dizzying pace. Data models: • Relational • Object • Graph • Document • KVS • Inverted index • kNN • … Distributed techniques: • Shared disk • Object store • Master-slave • Partitioning • Distributed streams • Quorum • Distributed consensus • High-precision clock synchronization • … The options have diversified → understanding and choosing among them is hard. Data structures: • B+Tree • LSM Tree • Bitcask • … Storage: • HDD • SSD • NVMe • …

Slide 6

Slide 6 text

Why you should understand and choose a DB architecture. If your understanding is insufficient or your choice is inappropriate… • Development effort for the system grows unreasonably • You cannot run appropriate benchmark tests • You cannot estimate costs in advance • The system cannot withstand high load and fails • When a failure occurs, you cannot anticipate the likely problem points, so the response is delayed

Slide 7

Slide 7 text

The workload assumed in this talk

Slide 8

Slide 8 text

Challenges with DynamoDB

Slide 9

Slide 9 text

DynamoDB's auto-scaling doesn't scale in time. Request failure! Load spikes right after noon.

Slide 10

Slide 10 text

Expensive Write Capacity cost. Request failure! Already too expensive; a bigger margin = even more expensive.

Slide 11

Slide 11 text

Workload analysis. API: Read = random lookup by ID; Write = bulk write & append, no delete. Temporal locality: Read = high (repeating visitors); Write = low (daily or hourly batch). Spatial locality: Read = moderate (hot & cold data sets); Write = high (data sets rewritten by batch). Size of a record: 10 bytes. Size of total records: 3 TB. Read traffic: 50 requests/sec.

Slide 12

Slide 12 text

Ideas (2018) (A) Alternative distributed KVS (Aerospike) (B) Storage Hierarchy on KVS (C) Edit log shipping & Indexed archive

Slide 13

Slide 13 text

Idea (A) Alternative Distributed KVS

Slide 14

Slide 14 text

(A) Alternative Distributed KVS. Architecture diagram: CDP KVS Servers, DynamoDB + DAX, Ignite, Presto writers, and a cluster of Aerospike nodes.

Slide 15

Slide 15 text

Aerospike: Pros & Cons • Good: Very fast lookup • In-memory index + Direct IO on SSDs • Bad: Expensive (hardware & operation) • Same cost for both cold & hot data (large memory overhead for cold data) • No spatial locality for writes (a batch-write becomes random-writes)

Slide 16

Slide 16 text

Aerospike: Storage Architecture. Diagram: on each Aerospike node, DRAM holds the primary-key index (hash(k01) → addr 01, size=3; hash(k02) → addr 09, size=3; hash(k03) → addr 76, size=3) and the values (addr 01: k01 = v01, addr 09: k02 = v02, addr 76: k03 = v03) live on the SSD (/dev/sdb); a GET starts from hash(k01); the index is loaded at startup (cold-start). ✓ Primary keys (hashes) are always in memory => always fast lookup ✓ Data is always on SSD => always durable ✓ IO on SSD is direct IO (no filesystem cache) => consistently fast without warm-up
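
To make the lookup path concrete, here is a minimal sketch of the idea in Python (illustrative only, not Aerospike's implementation; class and file names are made up): a DRAM-resident index maps each key's hash to an (offset, size) pair, and the value is read straight from the SSD-backed file, so a GET is one in-memory lookup plus one disk read.

```python
import os

# Minimal sketch of "in-memory primary-key index + values on SSD".
# Hypothetical code: real Aerospike uses direct IO on the raw device and a
# collision-safe index; this sketch ignores hash collisions for brevity.
class TinyKV:
    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT)
        self.index = {}   # hash(key) -> (offset, size), always in DRAM
        self.end = 0      # append position in the data file

    def put(self, key: bytes, value: bytes):
        os.pwrite(self.fd, value, self.end)            # value goes to SSD
        self.index[hash(key)] = (self.end, len(value))
        self.end += len(value)

    def get(self, key: bytes):
        entry = self.index.get(hash(key))              # in-memory index lookup
        if entry is None:
            return None
        offset, size = entry
        return os.pread(self.fd, size, offset)         # one read from SSD

kv = TinyKV("/tmp/data.bin")
kv.put(b"k01", b"v01")
print(kv.get(b"k01"))   # b'v01'
```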

Slide 17

Slide 17 text

Aerospike: System Architecture. Diagram: two bulk writes ({k01: v01, k02: v02, k03: v03, k04: v04, k05: v05} and {k06: v06, k07: v07, k08: v08, k09: v09, k0a: v0a}) are routed record by record to Aerospike nodes by hash(key) = node ID. Batch write => random write: no locality, no compression, more overhead. Note: compressing 10-byte data isn't efficient.
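
A small sketch of why a batch write degrades into random writes under key-hash placement (the node count and hash function below are illustrative, not Aerospike's actual partitioning): consecutive records from one bulk write scatter across all nodes, so no node receives them as one sequential, compressible chunk.

```python
import hashlib
from collections import defaultdict

NUM_NODES = 4   # illustrative cluster size

def node_of(key: str) -> int:
    # Place each record by key hash (Aerospike actually hashes keys to
    # partitions and assigns partitions to nodes, but the effect is similar).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_NODES

bulk_write = [f"k{i:02x}" for i in range(1, 11)]   # one batch of 10 records

per_node = defaultdict(list)
for key in bulk_write:
    per_node[node_of(key)].append(key)

# One contiguous batch becomes small, unrelated writes on every node.
for node, keys in sorted(per_node.items()):
    print(f"node {node}: {keys}")
```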

Slide 18

Slide 18 text

Aerospike: Cost estimation • 1 record needs 64 bytes of DRAM for primary key indexing • Storing 100 billion records (our use case) needs
 6.4 TB of DRAM. • With replication-factor=3, our system needs
 19.2TB of DRAM. • It needs r5.24xlarge × 26 instances on EC2. • It costs $89,000/month (1-year reserved, convertible). • Cost structure: • Very high DRAM cost per GB • Moderate IOPS cost • Low storage & CPU cost • High operational cost
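
The arithmetic behind these figures, as a back-of-the-envelope check (768 GiB per r5.24xlarge is the AWS spec; the slide's 26 instances presumably adds headroom for the OS and the Aerospike runtime on top of the raw index size):

```python
# Back-of-the-envelope check of the DRAM-driven cost estimate.
records            = 100_000_000_000    # 100 billion records
index_bytes_per_pk = 64                 # DRAM per primary-key index entry
replication_factor = 3

index_dram = records * index_bytes_per_pk       # 6.4e12 bytes  = 6.4 TB
total_dram = index_dram * replication_factor    # 19.2e12 bytes = 19.2 TB

# r5.24xlarge provides 768 GiB of DRAM (~768e9 bytes used for this rough cut).
per_instance = 768e9
print(index_dram / 1e12, "TB")        # 6.4
print(total_dram / 1e12, "TB")        # 19.2
print(total_dram / per_instance)      # ~25 instances for the index alone
```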

Slide 19

Slide 19 text

Idea (B) Storage Hierarchy on KVS

Slide 20

Slide 20 text

Analyzing a cause of expensive DynamoDB WCU. Diagram: an item with PK Key1 and columns Col1, Col2 totals 3.2 KB, which is billed as four 1 KB write units: it consumes 4 Write Capacity (0.8 WCU wasted).

Slide 21

Slide 21 text

DynamoDB with record size <<< 1 KB. Diagram: four items (Key1: Val1, Key2: Val2, Key3: Val3, Key4: Val4), each only 10 bytes, each consume 1 Write Capacity (the 1 KB minimum). 4 Write Capacity consumed to store 40 bytes: 99% of the WCU is wasted!

Slide 22

Slide 22 text

Solution: Optimizing DynamoDB WCU overhead. Diagram: instead of four 10-byte items (Key1: Val1 … Key4: Val4) consuming 1 Write Capacity each, pack them into a single item with PK = Part ID and Value = {Key1: Val1, Key2: Val2, Key3: Val3, Key4: Val4} (~30 bytes), which consumes 1 Write Capacity. (Note: expected 5x - 10x compression ratio.)
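
The write-capacity arithmetic of the last three slides as a worked example, using DynamoDB's standard-write billing rule of 1 WCU per 1 KB of item size, rounded up per item (the byte sizes are the ones on the slides):

```python
import math

def wcu_for_item(item_size_bytes: int) -> int:
    # DynamoDB standard writes: 1 WCU per 1 KB of item size, rounded up,
    # with a minimum of 1 WCU per item.
    return max(1, math.ceil(item_size_bytes / 1024))

print(wcu_for_item(3277))                        # 3.2 KB item          -> 4 WCU
print(sum(wcu_for_item(10) for _ in range(4)))   # 4 x 10-byte items    -> 4 WCU
print(wcu_for_item(30))                          # packed ~30-byte item -> 1 WCU
```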

Slide 23

Slide 23 text

Architecture overview. Diagram: bulk write 1 ({k01: v01, k03: v03, k06: v06, k08: v08, k0a: v0a}) and bulk write 2 ({k02: v02, k04: v04, k05: v05, k07: v07, k09: v09}) are compressed & written to DynamoDB (fronted by DAX) with PK = hash(partition id).

Slide 24

Slide 24 text

Pros & Cons analysis • Good: Very scalable write & storage cost • Data compression (10x less write & storage cost) • Far fewer primary keys (1 / 100,000 with 100k records in a partition) • Bad: Complex to understand & use • More difficult to understand • Writer (Presto) must partition data by partition id

Slide 25

Slide 25 text

DynamoDB Implementation - read. Diagram: the table has PK and Split 1, Split 2, Split 3 columns (rows with PK 71 and 69). hash(key) yields the partition id and split id; GET k06 resolves to partition id=71, split id=2, so the encoded split at (71, Split 2) is fetched through DAX (cache) and scanned (k03: v03, k06: v06) to return {k06: v06}.
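
A hedged sketch of this read path (the table name, attribute names, and partitioning constants are hypothetical, the split decoding is simplified to a plain compressed map instead of the bucketed format shown in the appendix slide, and the real path goes through DAX):

```python
import hashlib
import boto3
import msgpack
import zstandard

NUM_PARTITIONS = 1000   # hypothetical sizing
NUM_SPLITS = 3          # Split 1..3 attributes per item, as on the slide

table = boto3.resource("dynamodb").Table("cdp_kvs")   # hypothetical table name

def lookup(key: bytes):
    # hash(key) determines the partition id and the split id.
    h = int(hashlib.md5(key).hexdigest(), 16)
    partition_id = h % NUM_PARTITIONS                    # e.g. k06 -> 71
    split_id = (h // NUM_PARTITIONS) % NUM_SPLITS + 1    # e.g. k06 -> 2

    # One GetItem (served by DAX in production) returns the encoded split.
    item = table.get_item(Key={"PK": partition_id})["Item"]
    blob = item[f"Split{split_id}"].value                # binary attribute

    # Simplified decode: zstd + a MessagePack map of key -> value.
    records = msgpack.unpackb(zstandard.ZstdDecompressor().decompress(blob))
    return records.get(key)                              # e.g. b"v06"
```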

Slide 26

Slide 26 text

Implementation - write. Diagram: the original data set {k01: v01 … k0a: v0a} is partitioned using Presto (GROUP BY + array_agg query) into partition id=71 (k01: v01, k03: v03, k06: v06, k08: v08, k0a: v0a) and partition id=69 (k02: v02, k04: v04, k05: v05, k07: v07, k09: v09); each partition is encoded & compressed into Split 1, Split 2, Split 3 and stored in DynamoDB under PK 71 and 69, so that hash(key) = partition id + split id.
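
In production the partitioning is a Presto GROUP BY + array_agg query; purely for illustration, the same grouping step looks like this in Python (the partition count and hash function are hypothetical and must match whatever the readers use):

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 1000   # hypothetical; readers must derive the same ids

def partition_of(key: bytes) -> int:
    return int(hashlib.md5(key).hexdigest(), 16) % NUM_PARTITIONS

original = {b"k01": b"v01", b"k02": b"v02", b"k03": b"v03",
            b"k04": b"v04", b"k05": b"v05", b"k06": b"v06"}

# Equivalent of GROUP BY partition_of(key) + array_agg(key, value):
partitions = defaultdict(dict)
for key, value in original.items():
    partitions[partition_of(key)][key] = value

# Each partition is then encoded & compressed into splits and stored in
# DynamoDB under its partition id (see the split-format sketch below).
for pid, records in sorted(partitions.items()):
    print(pid, records)
```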

Slide 27

Slide 27 text

Appendix: Split format. A split is a hash table serialized with MessagePack and compressed with Zstd: zstd(msgpack([bucket 1, bucket 2, …, bucket N])), where each bucket is itself msgpack([[keyLen 1, keyLen 2, keyLen 3, …], "key1key2key3…", [valLen 1, valLen 2, valLen 3, …], "val1val2val3…"]). Size of a split: approx. 200 KB (100,000 records). The nested MessagePack lets a lookup deserialize only the bucket it needs, omitting unnecessary deserialization.
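
A minimal sketch of this split format (the bucket count is illustrative and details such as hashing and framing may differ from the real encoder): the outer layer is zstd(msgpack([...])) whose elements are individually msgpack-encoded buckets, so a lookup decompresses the split but deserializes only the one bucket that can contain the key.

```python
import hashlib
import msgpack
import zstandard

NUM_BUCKETS = 64   # illustrative bucket count per split

def bucket_of(key: bytes) -> int:
    return int(hashlib.md5(key).hexdigest(), 16) % NUM_BUCKETS

def encode_split(records: dict) -> bytes:
    # Each bucket is [key lengths, concatenated keys, value lengths, values].
    grouped = [[[], bytearray(), [], bytearray()] for _ in range(NUM_BUCKETS)]
    for key, value in records.items():
        b = grouped[bucket_of(key)]
        b[0].append(len(key));   b[1] += key
        b[2].append(len(value)); b[3] += value
    packed_buckets = [
        msgpack.packb([klens, bytes(keys), vlens, bytes(vals)])
        for klens, keys, vlens, vals in grouped
    ]
    return zstandard.ZstdCompressor().compress(msgpack.packb(packed_buckets))

def lookup_in_split(blob: bytes, key: bytes):
    buckets = msgpack.unpackb(zstandard.ZstdDecompressor().decompress(blob))
    # Deserialize only the bucket the key hashes to.
    klens, keys, vlens, vals = msgpack.unpackb(buckets[bucket_of(key)])
    k_off = v_off = 0
    for k_len, v_len in zip(klens, vlens):
        if keys[k_off:k_off + k_len] == key:
            return vals[v_off:v_off + v_len]
        k_off += k_len
        v_off += v_len
    return None

split = encode_split({b"k03": b"v03", b"k06": b"v06"})
print(lookup_in_split(split, b"k06"))   # b'v06'
```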

Slide 28

Slide 28 text

Bulk write performance: 6x less total time; a single bulk-write (which loops 18 times) is 8x faster.

Slide 29

Slide 29 text

DynamoDB Write Capacity Consumption. 921k Write Capacity in 45 minutes; 170 Write Capacity per second on average (≒ 170 WCU).

Slide 30

Slide 30 text

Idea (C) Edit log shipping & Indexed archive

Slide 31

Slide 31 text

(C) Edit log shipping & Indexed Archive. Architecture diagram: Writer API nodes accept bulk-writes and publish a stream of bulk-write data sets to Kafka / Kinesis (+ S3 for backup & cold-start); Indexing & Storage nodes running RocksDB (shards 0,1 / 1,2 / 2,3 / 3,0) subscribe to the stream; Reader API nodes serve reads from the storage nodes; etcd / consul handle shard & node list management.

Slide 32

Slide 32 text

Architecture of RocksDB (figure from "Optimization of RocksDB for Redis on Flash", Keren Ouaknine, Oran Agra, and Zvika Guz)

Slide 33

Slide 33 text

Pros & Cons • Good: Very scalable write & storage cost • Data compression (10x less write & storage cost) • Bad: Expensive to implement & operate • Implementing 3 custom server components
 (Stateless: Writer, Reader. Stateful: Storage) • Operating stateful servers - more work to implement backup, restoring, monitoring, alerting, etc. • Others: • Flexible indexing • Eventually-consistent

Slide 34

Slide 34 text

New Idea (D) (2023) Indexed archive on a distributed filesystem

Slide 35

Slide 35 text

Embedded databases • Log-structured Merge Tree • LevelDB • RocksDB • Badger • B+Tree on mmap • LMDB • BoltDB / Bbolt • Bitcask - good for our workload • Rosedb • Sparkey • pogreb

Slide 36

Slide 36 text

Sparkey architecture. Diagram: the index file maps each key to visibility and offset (key1: Active, offset 0 bytes; key30: Active, offset 321 bytes; key30: Deleted); the log file is a sequence of zstd-compressed segments, Segment 1 = zstd(["key1", "val1", "key2", "val2", "key3", "val3", "key4", "val4", …]), Segment 2 = zstd(["key10", "val10", "key20", "val20", "key30", "val30", "key40", "val40", …]), …, Segment N. Approximately two filesystem IO commands per lookup.
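
A rough sketch of that two-IO lookup (not Sparkey's real on-disk format or API, just the shape of the idea; MessagePack stands in for the flat [key, value, key, value, ...] segment layout): one index access yields the offset and length of the compressed segment that holds the key, and one log-file read fetches the segment, which is decompressed and scanned.

```python
import os
import msgpack
import zstandard

def lookup(index: dict, log_path: str, key: bytes):
    # IO #1 in a real implementation: read the key's entry from the index
    # file; here the index is an already-loaded dict of
    # key -> {"visibility", "offset", "length"}.
    entry = index.get(key)
    if entry is None or entry["visibility"] == "Deleted":
        return None

    # IO #2: read the compressed segment from the log file.
    fd = os.open(log_path, os.O_RDONLY)
    try:
        segment = os.pread(fd, entry["length"], entry["offset"])
    finally:
        os.close(fd)

    # Segment body: zstd-compressed flat [key1, val1, key2, val2, ...] list.
    items = msgpack.unpackb(zstandard.ZstdDecompressor().decompress(segment))
    pairs = dict(zip(items[0::2], items[1::2]))
    return pairs.get(key)
```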

Slide 37

Slide 37 text

Storage service architecture: embedded databases on a distributed filesystem (NFS v4, SMB, Lustre, Ceph; e.g. AWS EFS, FSx, File Cache). Client nodes R/W-mount the filesystem, which holds one directory per data set: /data_set_1/index.mph, /data_set_1/log.db, /data_set_1/.writer.lock, /data_set_1/.reader.lock; /data_set_2 has the same layout. With AWS EFS: < 1 ms latency & 250k Max Read IOPS.

Slide 38

Slide 38 text

Pros & Cons • Good: Simple & efficient • Data compression (10x less write & storage cost) • Bad: Limited future scalability • AWS EFS (NFS v4) has a limit on the number of concurrent clients • AWS FileCache / FSx (Lustre) can't have AWS Lambda as a client • Self-hosting Ceph is complicated

Slide 39

Slide 39 text

New Idea (E) (2023) Indexed archive on S3 / object storage

Slide 40

Slide 40 text

Quickwit. Architecture diagram: Indexer Nodes read from external storage / streams, upload optimized split files to Object Storage, and update file references in a Lock / Metadata DB; Query Nodes query the references and scan the split files.

Slide 41

Slide 41 text

Tantivy: a full-text search engine library written in Rust. See "S3 Backed Full-Text Search with Tantivy (Part 1)" and github:jakejscott/dynamodb-email-indexer.

Slide 42

Slide 42 text

Other architectures

Slide 43

Slide 43 text

Aurora

Slide 44

Slide 44 text

YugabyteDB

Slide 45

Slide 45 text

Log shipping

Slide 46

Slide 46 text

Log shipping + external WAL

Slide 47

Slide 47 text

AWS MemoryDB

Slide 48

Slide 48 text

Summary

Slide 49

Slide 49 text

Findings • Distributed filesystems are more practical than they used to be • In cloud environments, the cost-performance of network IO has improved • Protocols such as NFSv4 and Lustre have lightweight locking and caching • High-performance NICs and drivers designed to run under a hypervisor • Object storage is still a strong option, but exclusive-access control and metadata management must be handled separately • Storage, search, and ML systems that combine the two will likely become more common • The main drawback is implementation complexity, and the benefits (especially operational) are large • Distributed RAID implementations using Raft/quorum are becoming popular • A welcome consequence: partition tolerance • Distributed WALs that split the WAL out into a separate service are becoming popular • A "protect only one place" strategy keeps operations simple