
DB Architecture Comparison and Selection (DBアーキテクチャの比較と選択)

Database Engineering Meetup #1
Cloud-native storage service for bulk load & random lookup workload

https://scalar.connpass.com/event/298887/

Sadayuki Furuhashi

December 20, 2023

Transcript

  1. DB Architecture Comparison and Selection (DBアーキテクチャの比較と選択)
     Cloud-native storage service for bulk load & random lookup workload
     Sadayuki Furuhashi, Chief Architect, Treasure Data, Inc.
     Database Engineering Meetup #1
  2. About me
     A founder of Treasure Data, Inc. Located in Tokyo, Japan. OSS Hacker.
     GitHub: @frsyuki
     OSS projects I initially designed:
  3. DB technology is evolving at a dizzying pace
     Data models: • Relational • Object • Graph • Document • KVS • Inverted index • kNN • …
     Distributed techniques: • Shared disk • Object store • Master-slave • Partitioning • Distributed streams • Quorum • Distributed consensus • High-precision clock synchronization • …
     Data structures: • B+Tree • LSM Tree • Bitcask • …
     Storage: • HDD • SSD • NVMe • …
  4. DB technology is evolving at a dizzying pace (same taxonomy as the previous slide)
     Options have diversified → understanding them and choosing among them is hard
  5. Workload analysis
                            Read                              Write
     API                    Random lookup by ID               Bulk write & append, no delete
     Temporal locality      High (repeating visitors)         Low (daily or hourly batch)
     Spatial locality       Moderate (hot & cold data sets)   High (rewrite data sets by batch)
     Size of a record: 10 bytes. Size of total records: 3 TB. Read traffic: 50 requests/sec.
  6. (A) Alternative: Distributed KVS
     (Architecture diagram: Presto writers, CDP KVS Servers, and candidate stores: DynamoDB + DAX, Ignite, and a cluster of Aerospike nodes.)
  7. Aerospike: Pros & Cons
     • Good: Very fast lookup
       • In-memory index + direct IO on SSDs
     • Bad: Expensive (hardware & operation)
       • Same cost for both cold & hot data (large memory overhead for cold data)
       • No spatial locality for writes (a batch write becomes random writes)
  8. Aerospike: Storage architecture
     Each Aerospike node keeps a hash index in DRAM (hash(k01) → addr 01, size=3; hash(k02) → addr 09, size=3; hash(k03) → addr 76, size=3) that points at the records stored on the SSD /dev/sdb (k01 = v01 at addr 01, k02 = v02 at addr 09, k03 = v03 at addr 76). The index is loaded at startup (cold start).
     ✓ Primary keys (hashes) are always in memory ⇒ always fast lookups
     ✓ Data is always on SSD ⇒ always durable
     ✓ SSD IO is direct IO (no filesystem cache) ⇒ consistently fast without warm-up
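     A minimal sketch (not Aerospike code; the hashing, record layout, and file handling here are illustrative assumptions) of the storage layout above: the primary-key index lives entirely in DRAM and maps a key hash to an (offset, size) pair on the SSD, so a GET costs one in-memory probe plus one device read.

         import hashlib, os

         index = {}  # hash(key) -> (offset, size), kept entirely in memory

         def key_hash(key: bytes) -> bytes:
             return hashlib.sha1(key).digest()[:8]   # stand-in for Aerospike's key digest

         def put(fd: int, key: bytes, value: bytes) -> None:
             offset = os.lseek(fd, 0, os.SEEK_END)   # append the record to the device/file
             os.write(fd, value)
             index[key_hash(key)] = (offset, len(value))

         def get(fd: int, key: bytes) -> bytes | None:
             entry = index.get(key_hash(key))        # in-memory lookup, no disk IO yet
             if entry is None:
                 return None
             offset, size = entry
             # Aerospike additionally opens the device with direct IO, bypassing the
             # filesystem cache, which is why latency stays flat without warm-up.
             return os.pread(fd, size, offset)       # exactly one read from the SSD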
  9. Aerospike: System architecture
     hash(key) = Node ID. The records of bulk write 1 ({k01: v01 … k05: v05}) and bulk write 2 ({k06: v06 … k0a: v0a}) are scattered individually across the Aerospike nodes.
     Batch write ⇒ random writes: no locality, no compression, more overhead.
     Note: compressing 10-byte data isn't efficient.
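     A small illustration (plain Python; the built-in hash is a stand-in for Aerospike's partition map) of why a bulk write degrades into random writes: every key is routed independently by its hash, so neighbouring keys in one batch fan out across nodes and no write locality survives.

         nodes = ["node-a", "node-b", "node-c"]
         bulk_write = {f"k{i:02x}": f"v{i:02x}" for i in range(1, 11)}   # one batch, k01..k0a

         for key, value in bulk_write.items():
             node = nodes[hash(key) % len(nodes)]    # hash(key) -> node id
             print(node, key, value)                 # adjacent keys land on different nodes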
  10. Aerospike: Cost estimation
      • 1 record needs 64 bytes of DRAM for primary-key indexing
      • Storing 100 billion records (our use case) needs 6.4 TB of DRAM
      • With replication-factor=3, our system needs 19.2 TB of DRAM
      • It needs r5.24xlarge × 26 instances on EC2
      • It costs $89,000/month (1-year reserved, convertible)
      • Cost structure: very high DRAM cost per GB, moderate IOPS cost, low storage & CPU cost, high operational cost
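      Re-deriving the DRAM sizing on this slide (record count, 64 bytes per key, and replication factor are the slide's numbers; the instance count and pricing are taken as given):

          records     = 100_000_000_000      # 100 billion records
          index_bytes = 64                   # DRAM per primary key
          replication = 3

          print(records * index_bytes / 1e12)                  # 6.4  -> 6.4 TB of DRAM
          print(records * index_bytes * replication / 1e12)    # 19.2 -> 19.2 TB of DRAM
          # Spread across r5.24xlarge instances (768 GiB of DRAM each), the slide
          # sizes this at 26 instances, roughly $89,000/month (1-year reserved).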
  11. Analyzing a cause of expensive DynamoDB WCU
      A 3.2 KB item (PK Key1 with Col1 and Col2) is billed in 1 KB units, so it is rounded up to 4 × 1 KB and consumes 4 Write Capacity units (0.8 WCU wasted).
  12. DynamoDB with record size ≪ 1 KB
      With 10-byte records (Key1 → Val1 … Key4 → Val4), each write still consumes 1 Write Capacity unit, so 4 Write Capacity units are consumed to store 40 bytes: 99% of the WCU is wasted!
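      The waste follows directly from DynamoDB's pricing rule that a standard write consumes one WCU per 1 KB of item size, rounded up:

          import math

          def write_capacity(item_bytes: int) -> int:
              # 1 WCU per started 1 KB of item size (standard writes)
              return max(1, math.ceil(item_bytes / 1024))

          print(write_capacity(3200))     # 3.2 KB item -> 4 WCU (slide 11)
          print(4 * write_capacity(10))   # four 10-byte items -> 4 WCU for 40 bytes
          print(1 - 40 / (4 * 1024))      # ~0.99 -> ~99% of the paid capacity is wasted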
  13. Solution: Optimizing DynamoDB WCU overhead
      Instead of four separate 10-byte items (Key1 … Key4) costing 1 Write Capacity unit each, pack them into a single item keyed by partition ID whose value is {Key1: Val1, Key2: Val2, Key3: Val3, Key4: Val4}, about 30 bytes once encoded ⇒ 1 Write Capacity unit. (Note: expected 5x - 10x compression ratio.)
  14. Architecture overview
      Bulk write 1 ({k01: v01, k03: v03, k06: v06, k08: v08, k0a: v0a}) and bulk write 2 ({k02: v02, k04: v04, k05: v05, k07: v07, k09: v09}) are compressed & written to DynamoDB, fronted by DAX, with PK = hash(partition id).
  15. Pros & Cons analysis
      • Good: Very scalable write & storage cost
        • Data compression (10x less write & storage cost)
        • Far fewer primary keys (1 / 100,000 with 100k records in a partition)
      • Bad: Complex to understand & use
        • More difficult to understand
        • Writer (Presto) must partition data by partition id
  16. DynamoDB Implementation - read
      GET k06 → hash(key) = partition id + split id → k06 is at partition id=71, split id=2. The item with PK 71 is fetched through DAX (cache), the encoded split 2 (k03 v03, k06 v06) is scanned, and {k06: v06} is returned.
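      A hedged sketch of this read path. The table name ("cdp_kvs"), attribute names ("pk", "split1".."split3"), partition/split counts, and the hash function are assumptions, not the production schema; plain boto3 is used here, whereas the real system reads through DAX.

          import hashlib
          import boto3

          NUM_PARTITIONS = 100_000          # assumed, not stated on the slide
          NUM_SPLITS = 3

          table = boto3.resource("dynamodb").Table("cdp_kvs")   # hypothetical table

          def locate(key: str) -> tuple[int, int]:
              # hash(key) -> partition id + split id, as in the slide's "GET k06" example
              digest = hashlib.md5(key.encode()).digest()
              partition_id = int.from_bytes(digest[:4], "big") % NUM_PARTITIONS
              split_id = int.from_bytes(digest[4:8], "big") % NUM_SPLITS
              return partition_id, split_id

          def fetch_split(key: str) -> bytes:
              # One GetItem returns the partition's row; only the relevant encoded
              # split attribute is kept and then scanned (format: see slide 18).
              partition_id, split_id = locate(key)
              item = table.get_item(Key={"pk": partition_id})["Item"]
              return item[f"split{split_id + 1}"].value   # Binary attribute bytes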
  17. Implementation - write
      The original data set {k01: v01 … k0a: v0a} is partitioned using Presto (GROUP BY + array_agg query) into partition id=71 (k01 v01, k03 v03, k06 v06, k08 v08, k0a v0a) and partition id=69 (k02 v02, k04 v04, k05 v05, k07 v07, k09 v09). Each partition is encoded & compressed into Split 1, Split 2, Split 3 and stored into DynamoDB as one row per partition (PK = 71, 69) with one attribute per split.
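      A hedged sketch of the partitioning step that Presto performs with GROUP BY + array_agg, written in Python for illustration. locate() and encode_split() stand for the key routing and the slide-18 split encoding; the one-item-per-partition layout matches the slide, the rest is assumed.

          from collections import defaultdict

          def build_items(records: dict[str, str], locate, encode_split) -> list[dict]:
              # Group the bulk-write data set by partition id, then by split id.
              partitions = defaultdict(lambda: defaultdict(dict))
              for key, value in records.items():
                  partition_id, split_id = locate(key)
                  partitions[partition_id][split_id][key] = value
              # One DynamoDB item per partition, one attribute per encoded split.
              items = []
              for partition_id, splits in partitions.items():
                  item = {"pk": partition_id}
                  for split_id, kvs in splits.items():
                      item[f"split{split_id + 1}"] = encode_split(kvs)
                  items.append(item)
              return items          # written with BatchWriteItem / batch_writer()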
  18. Appendix: Split format
      Each split attribute is a hash table of buckets, serialized with MessagePack and compressed with Zstd:
      zstd( msgpack( [ msgpack([[keyLen 1, keyLen 2, keyLen 3, …], "key1key2key3…", [valLen 1, valLen 2, valLen 3, …], "val1val2val3…"]), … one inner MessagePack blob per bucket (bucket 1 … bucket N) … ] ) )
      Size of a split: approx. 200 KB (100,000 records). The nested MessagePack lets a lookup skip deserializing buckets it doesn't need.
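      A hedged sketch of this split format using the msgpack and zstandard Python packages. The bucket count and bucket hashing are assumptions; the column-wise [key lengths, concatenated keys, value lengths, concatenated values] layout and the nested MessagePack follow the slide.

          import hashlib
          import msgpack        # pip install msgpack
          import zstandard      # pip install zstandard

          NUM_BUCKETS = 64      # assumed

          def bucket_of(key: str) -> int:
              return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_BUCKETS

          def encode_split(kvs: dict[str, str]) -> bytes:
              buckets = [{} for _ in range(NUM_BUCKETS)]
              for k, v in kvs.items():
                  buckets[bucket_of(k)][k] = v
              packed = []
              for bucket in buckets:
                  keys, vals = list(bucket.keys()), list(bucket.values())
                  packed.append(msgpack.packb([          # inner MessagePack per bucket
                      [len(k) for k in keys], "".join(keys),
                      [len(v) for v in vals], "".join(vals),
                  ]))
              return zstandard.ZstdCompressor().compress(msgpack.packb(packed))

          def lookup_in_split(split: bytes, key: str) -> str | None:
              outer = msgpack.unpackb(zstandard.ZstdDecompressor().decompress(split))
              # Only the one bucket that can hold the key is deserialized.
              key_lens, keys, val_lens, vals = msgpack.unpackb(outer[bucket_of(key)])
              kpos = vpos = 0
              for klen, vlen in zip(key_lens, val_lens):
                  if keys[kpos:kpos + klen] == key:
                      return vals[vpos:vpos + vlen]
                  kpos += klen
                  vpos += vlen
              return None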
  19. DynamoDB Write Capacity consumption
      (Chart of consumed Write Capacity.) 921k Write Capacity in 45 minutes; 170 Write Capacity per second on average (≒ 170 WCU).
  20. (C) Edit log shipping & Indexed Archive
      Writer API nodes accept bulk writes and publish a stream of bulk-write data sets to Kafka / Kinesis (+ S3). Indexing & Storage nodes subscribe to the stream and index their shards into local RocksDB (shards 0,1 / 1,2 / 2,3 / 3,0). A Reader API node reads from the storage nodes. etcd / consul handles shard & node list management; S3 is used for backup & cold start.
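      A hedged sketch of an Indexing & Storage node's main loop, using the confluent-kafka and python-rocksdb bindings purely for illustration. The topic name, message format, and shard routing are assumptions; in the real system the owned shard list would come from etcd / consul and S3 backups would support cold start.

          import json

          import rocksdb                        # pip install python-rocksdb
          from confluent_kafka import Consumer  # pip install confluent-kafka

          OWNED_SHARDS = {0, 1}                 # would be assigned via etcd / consul
          db = rocksdb.DB("/var/lib/storage-node/shards-0-1",
                          rocksdb.Options(create_if_missing=True))

          consumer = Consumer({"bootstrap.servers": "kafka:9092",
                               "group.id": "storage-node-0"})
          consumer.subscribe(["bulk-writes"])   # hypothetical topic name

          while True:
              msg = consumer.poll(1.0)
              if msg is None or msg.error():
                  continue
              data_set = json.loads(msg.value())          # {"shard": 0, "records": {...}}
              if data_set["shard"] not in OWNED_SHARDS:
                  continue
              batch = rocksdb.WriteBatch()                # index one data set atomically
              for key, value in data_set["records"].items():
                  batch.put(key.encode(), value.encode())
              db.write(batch)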
  21. Pros & Cons
      • Good: Very scalable write & storage cost
        • Data compression (10x less write & storage cost)
      • Bad: Expensive to implement & operate
        • Implementing 3 custom server components (stateless: Writer, Reader; stateful: Storage)
        • Operating stateful servers: more work to implement backup, restore, monitoring, alerting, etc.
      • Others:
        • Flexible indexing
        • Eventually consistent
  22. Embedded databases
      • Log-structured merge tree: LevelDB, RocksDB, Badger
      • B+Tree on mmap: LMDB, BoltDB / bbolt
      • Bitcask (good for our workload): rosedb, Sparkey, pogreb
  23. Sparkey architecture
      An index file maps keys to offsets in a log file (key1: Active, offset 0 bytes; key30: Active, offset 321 bytes; key30: Deleted). The log file holds Zstd-compressed segments: zstd(["key1", "val1", "key2", "val2", "key3", "val3", "key4", "val4", …]), zstd(["key10", "val10", "key20", "val20", "key30", "val30", "key40", "val40", …]), … segment 1 … segment N.
      Approximately two filesystem IO commands per lookup.
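      A minimal sketch of the lookup pattern above, not Sparkey's actual on-disk format: an index entry gives the offset and size of the compressed segment that holds the key, so a lookup needs roughly one index read plus one log read. A dict stands in for the index file here, and the segment payload is treated as the flat key/value list drawn on the slide.

          import os
          import zstandard

          def lookup(index: dict[str, tuple[int, int]], log_fd: int, key: str) -> str | None:
              entry = index.get(key)                     # IO #1 in Sparkey: the index file
              if entry is None:
                  return None
              offset, size = entry
              segment = os.pread(log_fd, size, offset)   # IO #2: one read from the log file
              items = zstandard.ZstdDecompressor().decompress(segment).decode().split("\0")
              pairs = dict(zip(items[::2], items[1::2])) # ["key1", "val1", "key2", ...]
              return pairs.get(key)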
  24. Storage service architecture
      Embedded databases on a distributed filesystem: NFS v4, SMB, Lustre, Ceph (AWS EFS, FSx, File Cache). Each data set directory (/data_set_1, /data_set_2, …) contains /index.mph, /log.db, /.writer.lock, and /.reader.lock, and is R/W-mounted by the client nodes.
      With AWS EFS: < 1 ms latency & 250k max read IOPS.
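      A hedged sketch of how the per-data-set lock files above could be used from a client node. The slide does not describe the locking protocol; this just shows POSIX byte-range locks, which NFS v4 / EFS support, taken on the .writer.lock and .reader.lock files.

          import fcntl

          def acquire_writer_lock(data_set_dir: str):
              lock = open(f"{data_set_dir}/.writer.lock", "a+")
              fcntl.lockf(lock, fcntl.LOCK_EX)     # exclusive lock: one writer per data set
              return lock                          # released on close() or process exit

          def acquire_reader_lock(data_set_dir: str):
              lock = open(f"{data_set_dir}/.reader.lock", "a+")
              fcntl.lockf(lock, fcntl.LOCK_SH)     # shared lock: many concurrent readers
              return lock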
  25. Pros & Cons
      • Good: Simple & efficient
        • Data compression (10x less write & storage cost)
      • Bad: Limited future scalability
        • AWS EFS (NFS v4) limits the number of concurrent clients
        • AWS File Cache / FSx (Lustre) can't have AWS Lambda as a client
        • Self-hosting Ceph is complicated
  26. Quickwit
      Indexer nodes take data from external storage / streams, upload optimized split files to object storage, and update file references in the lock / metadata DB. Query nodes query the references and scan the split files.
  27. Findings
      • Distributed filesystems are more practical than they used to be
        • In cloud environments, the cost-performance of network IO has improved
        • NFSv4, Lustre, and similar protocols keep locking and caching lightweight
        • High-performance NICs and drivers designed to run under hypervisors
      • Object storage is still a strong option, but exclusive control and metadata management have to be added separately
        • Storage, search, and ML systems that combine the two will likely keep increasing
        • The main drawback is implementation complexity; the benefits are large (especially for operations)
      • Distributed RAID implementations built on Raft / quorum are in fashion
        • A welcome consequence: partition tolerance
      • Distributed WALs that split the WAL out as a separate service are in fashion
        • A "you only have to protect one place" strategy makes operations much easier