Slide 1

Slide 1 text

Accelerating NoSQL: Running Voldemort on HailDB • Sunny Gleason • March 11, 2011

Slide 2

Slide 2 text

whoami • Sunny Gleason, human • passion: distributed systems engineering • previous... Ning : custom social networks Amazon.com : infra & web services • now... building cloud infrastructure

Slide 3

Slide 3 text

whereami • twitter : twitter.com/sunnygleason • github : github.com/sunnygleason • linkedin : linkedin.com/in/sunnygleason

Slide 4

Slide 4 text

what’s in this presentation? • NoSQL Roundup • Voldemort who? • HailDB wha? • Results & Next Steps • Special Bonus Material

Slide 5

Slide 5 text

NoSQL • “Not Only” SQL • What’s the point? • Proponent: “reaching next level of scale” • Cynic: “cloud is hype, ops nightmare”

Slide 6

Slide 6 text

what does it gain? • Higher performance, scalability, availability • More robust fault-tolerance • Simplified systems design • Easier operations

Slide 7

Slide 7 text

what does it lose? • Reduced / simplified programming model • No ad-hoc queries, no joins, no txns • Not ACID: Atomicity / Consistency / Isolation / Durability • Operations / management is still evolving • Challenging to quantify health of system • Fewer domain experts

Slide 8

Slide 8 text

NoSQL Map • Key-Value Store: KV Stores (durable): Dynamo, Voldemort, Riak; KV Stores (volatile): Memcached, Redis • Column Store: Cassandra, BigTable, HBase • Document Store: CouchDB, MongoDB • Graph Store: Neo4J

Slide 9

Slide 9 text

NoSQL Map • Key-Value Store: KV Stores (durable): Dynamo, Voldemort, Riak; KV Stores (volatile) • Column Store • Document Store • Graph Store

Slide 10

Slide 10 text

motivation • database on 1 box : ok • database with master/slave replication : ok • database on cluster : tricky • database on SAN : time bomb

Slide 11

Slide 11 text

performance [Chart: complexity vs. aggregate operations / sec (scale: 1K, 10K, 100K, 1M); series shown: MySQL, +SSD, +FusionIO, +Sharding, Memcached, +Cluster, Voldemort]

Slide 12

Slide 12 text

dynamo case study • Amazon : high read throughput, always-accessible writes • Shopping cart application • ‘Glitches’ ok, duplicate or missing item • Data loss or unavailability is unacceptable • Solution: K-V schema plus smart routing & data placement

Slide 13

Slide 13 text

key-value storage • Essentially, a gigantic hash table • Typically assign byte[] values to byte[] keys • Plus versioning mixed in to handle failures and conflicts • Yes, you *can* do range partitioning; in practice, avoid it because of hot spots
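The "versioning mixed in" bullet can be illustrated with a vector clock, the versioning scheme named later in the deck under schema refinements. This is a minimal sketch, not Voldemort's implementation: the `compare` and `increment` helpers and the dict-based clock layout are assumptions made for illustration.

```python
from typing import Dict

# Hypothetical representation: node id -> number of writes coordinated by that node.
Clock = Dict[str, int]


def increment(clock: Clock, node: str) -> Clock:
    """Return a copy of the clock with the given node's counter bumped by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def compare(a: Clock, b: Clock) -> str:
    """Describe clock a relative to b: 'equal', 'before', 'after', or 'concurrent'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # b has seen everything a has: b supersedes a
    if b_le_a:
        return "after"       # a supersedes b
    return "concurrent"      # neither supersedes the other: a conflict to resolve


if __name__ == "__main__":
    v1 = increment({}, "node1")
    v2 = increment(v1, "node1")   # a later write through the same node
    v3 = increment(v1, "node2")   # a write through another node that only saw v1
    print(compare(v1, v2), compare(v2, v3))
```

Two versions that compare as "concurrent" are the shopping-cart style glitch from the Dynamo case study: the store keeps both and lets the client reconcile them.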

Slide 14

Slide 14 text

k-v: durable vs. volatile • RAM is ridiculous speed (ns), not durable • Disk is persistent and slow (3-7ms) • RAID eases the pain a bit (4-8x throughput) • SSD is providing good promise (100-300us) • FusionIO is redefining the space (30-100us)
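The latency figures above translate directly into a ceiling on serial operations per second (one request at a time, no batching or parallelism). The sketch below does that arithmetic; the latency ranges are the slide's, while the `LATENCY` table and function names are hypothetical.

```python
# Latency ranges in seconds (fastest, slowest), copied from the slide.
LATENCY = {
    "disk": (3e-3, 7e-3),
    "ssd": (100e-6, 300e-6),
    "fusionio": (30e-6, 100e-6),
}


def serial_ops_per_sec(latency_s: float) -> float:
    """Upper bound on operations per second when each one waits for the previous."""
    if latency_s <= 0:
        raise ValueError("latency must be positive")
    return 1.0 / latency_s


def ops_range(device: str):
    """Return (lowest, highest) serial ops/sec for a device in LATENCY."""
    fastest, slowest = LATENCY[device]
    return serial_ops_per_sec(slowest), serial_ops_per_sec(fastest)


if __name__ == "__main__":
    for name in LATENCY:
        low, high = ops_range(name)
        print(f"{name}: {low:,.0f} to {high:,.0f} ops/sec")
```

By this estimate a single disk tops out in the low hundreds of serial operations per second, while SSD and FusionIO reach the thousands to tens of thousands, which is the gap the rest of the deck is concerned with.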

Slide 15

Slide 15 text

dynamo clones • Voldemort : from LinkedIn, dynamo implementation in Java (default: BDB-JE) • Riak : from Basho, dynamo implementation in Erlang (default: embedded InnoDB)

Slide 16

Slide 16 text

Voldemort • Developed at LinkedIn • Scalable Key-Value Storage • Based on Amazon Dynamo model • High Read Throughput • Always Writable

Slide 17

Slide 17 text

Voldemort features • Consistent Hashing • Quorum settings : R, W, N • Auto-sharding & rebalancing • Pluggable storage engines

Slide 18

Slide 18 text

Consistent Hashing • Arrange keys around ring • Compute token in ring using hash function • Determine nodes responsible for token using live set
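The three steps above can be sketched as a small ring. This is a generic consistent-hashing illustration, not Voldemort's routing code: the `Ring` class, the MD5-based `token` function, and the `points_per_node` parameter are all assumptions chosen for the example.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (first 8 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, points_per_node=16):
        # Each node owns several points on the ring to even out the load.
        self._points = sorted(
            (token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(points_per_node)
        )
        self._tokens = [t for t, _ in self._points]

    def preference_list(self, key, n, live=None):
        """Walk clockwise from the key's token and collect n distinct live nodes."""
        all_nodes = {node for _, node in self._points}
        live = all_nodes if live is None else set(live)
        start = bisect.bisect_right(self._tokens, token(key))
        result = []
        for i in range(len(self._points)):
            if len(result) >= n:
                break
            node = self._points[(start + i) % len(self._points)][1]
            if node in live and node not in result:
                result.append(node)
        return result


if __name__ == "__main__":
    ring = Ring(["node-a", "node-b", "node-c"])
    print(ring.preference_list("hello", 2))
    print(ring.preference_list("hello", 2, live={"node-a", "node-b"}))
```

Because the walk simply skips nodes that are not in the live set, a key whose first choice is still alive keeps that choice when some other node fails; only keys owned by the failed node move.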

Slide 19

Slide 19 text

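R/W/N • N : replication factor, the number of nodes that hold each key (and the most that are queried for an operation) • R : read quorum, the number of nodes that must answer a read • W : write quorum, the number of nodes that must acknowledge a write • Can adjust ‘quorum’ to balance throughput and fault-tolerance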
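The trade-off in the last bullet comes down to simple arithmetic on R, W and N. The sketch below is illustrative only; the function names are hypothetical and are not part of any Voldemort API.

```python
def check(n: int, r: int, w: int) -> None:
    """Reject settings that cannot be satisfied by n replicas."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("need 1 <= R <= N and 1 <= W <= N")


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True when every read quorum shares at least one node with every write quorum."""
    check(n, r, w)
    return r + w > n


def tolerated_failures(n: int, r: int, w: int) -> dict:
    """How many of the N replicas may be down while reads / writes still succeed."""
    check(n, r, w)
    return {"read": n - r, "write": n - w}


if __name__ == "__main__":
    print(quorums_overlap(3, 2, 2))      # True: 2 + 2 > 3
    print(quorums_overlap(3, 1, 1))      # False: a read may miss the latest write
    print(tolerated_failures(3, 2, 2))
```

Lower R and W mean fewer nodes on the critical path (more throughput, more failures tolerated) at the cost of reads that may not see the most recent write.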

Slide 20

Slide 20 text

setting up Voldemort 1
Step 1: Download the code
    Download either a recent stable release or, for those who like to live more dangerously, the up-to-the-minute build from the build server.
Step 2: Start single node cluster
    > bin/voldemort-server.sh config/single_node_cluster > /tmp/voldemort.log &
Step 3: Start commandline test client and do some operations
    > bin/voldemort-shell.sh test tcp://localhost:6666
    Established connection to test via tcp://localhost:6666
    > put "hello" "world"
    > get "hello"
    version(0:1): "world"
    > delete "hello"
    > get "hello"
    null
    > exit
    k k thx bye.

Slide 21

Slide 21 text

setting up Voldemort 2 • For a cluster, use cloud startup scripts • Works with Amazon EC2 • See https://github.com/voldemort/voldemort/wiki/EC2-Testing-Infrastructure

Slide 22

Slide 22 text

Voldemort client libraries • Java, Scala, Clojure • Ruby • Python • C++

Slide 23

Slide 23 text

storage engines • BDB-JE (Oracle Sleepycat, the original) • Krati (LinkedIn, pretty new) • HailDB (new!) • MySQL (old / dated)

Slide 24

Slide 24 text

BDB-JE • Log-Structured B-Tree • Fast Storage When Mostly Cached • Configured without fsync() by default - writes are batched and flushed periodically

Slide 25

Slide 25 text

Krati • Fast Hash-Oriented Storage • Uses memory-mapped files for speed • Configured without fsync() by default - writes are batched and flushed periodically

Slide 26

Slide 26 text

HailDB • Fork of MySQL InnoDB plugin (contributors : Oracle, Google, Facebook, Percona) • Higher stability for large data sets • Fast crash recovery • Lives outside the Java heap (eases GC pain) • apt-get install haildb (from launchpad PPA) • Use “flush-once-per-second” mode

Slide 27

Slide 27 text

HailDB, Java & Voldemort [Diagram: layered stack, in the order listed on the slide: HailDB (log, buffer pool, tablespace); JNA; g414-haildb; v-storage-inno; Voldemort Node (x3); Voldemort Client]

Slide 28

Slide 28 text

HailDB & Java • g414-haildb : where the magic happens • uses JNA: Java Native Access • dynamic binding to libhaildb shared library • auto-generated from .h file (w/ JNAerator) • Pointer classes & other shenanigans

Slide 29

Slide 29 text

HailDB schema
    _key      VARBINARY(200)
    _version  VARBINARY(200)
    _value    BLOB
    PRIMARY KEY(_key, _version)

Slide 30

Slide 30 text

implementation gotchas • InnoDB API-level usage is unclear • Synchronization & locking is unclear • Therefore... I learned to love reading C • Error handling is *nasty* • Installation a bit of a pain

Slide 31

Slide 31 text

experimental setup • OS X: 8-Core Xeon, 32GB RAM, 200GB OWC SSD • Faban Benchmark : PUT 64-byte key, 1024-byte value • Scenarios: 1, 2, 4, 8 threads • 512M Java Heap

Slide 32

Slide 32 text

Perf: BDB Put 100%

Slide 33

Slide 33 text

Perf: BDB Put 20%/Get 80%

Slide 34

Slide 34 text

Perf: BDB Put 20% Detail

Slide 35

Slide 35 text

Perf: Krati Put 100%

Slide 36

Slide 36 text

Perf: Krati Put 20%/Get 80%

Slide 37

Slide 37 text

Perf: Krati Put 20% Detail

Slide 38

Slide 38 text

Perf: HailDB Put 100%

Slide 39

Slide 39 text

Perf: HailDB Put 20%/Get 80%

Slide 40

Slide 40 text

Perf: HailDB Put 20% Detail

Slide 41

Slide 41 text

future work • Improve Packaging / Installation • Schema refinements & perf enhancements • Online backup/export with XtraBackup • JNI Bindings

Slide 42

Slide 42 text

schema refinements • Build upon Nokia work on fast k-v schema • 8-byte ‘long’ key hash vs. full key bytes • Smart use of secondary indexes • Native representation of vector clocks • Delayed / soft deletion • Expect 40-50% performance boost

Slide 43

Slide 43 text

InnoDB tuning • Skinny columns, skinny rows! (esp. Primary Key) • Varchar enum ‘bad’, int or smallint ‘good’ • Fixed-width rows allow in-place updates • Use covering indexes strategically • More data per page means faster index scans, more efficient buffer pool utilization • You only get so many trx’s on a given CPU/RAM configuration - benchmark this!

Slide 44

Slide 44 text

refined schema
    _id        BIGINT (auto increment)
    _key_hash  BIGINT
    _key       VARBINARY(200)
    _version   VARBINARY(200)
    _value     BLOB
    PRIMARY KEY(_id)
    KEY(_key_hash)
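To make the refined schema concrete, the sketch below models it with Python's built-in sqlite3 module as a stand-in. It does not use HailDB or its API, and SQLite's column types only approximate the ones on the slide. The table name `kv`, the helper names, and the choice of hash (first 8 bytes of MD5, read as a signed 64-bit integer) are assumptions; the deck does not say which hash is used.

```python
import hashlib
import sqlite3
import struct


def key_hash(key: bytes) -> int:
    """8-byte 'long' hash of the key: first 8 MD5 bytes as a signed 64-bit int."""
    return struct.unpack(">q", hashlib.md5(key).digest()[:8])[0]


def open_store() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE kv ("
        " _id INTEGER PRIMARY KEY AUTOINCREMENT,"   # skinny surrogate primary key
        " _key_hash INTEGER NOT NULL,"
        " _key BLOB NOT NULL,"
        " _version BLOB NOT NULL,"
        " _value BLOB)"
    )
    # Secondary index on the fixed-width hash instead of the full key bytes.
    db.execute("CREATE INDEX kv_key_hash ON kv (_key_hash)")
    return db


def put(db: sqlite3.Connection, key: bytes, version: bytes, value: bytes) -> None:
    db.execute(
        "INSERT INTO kv (_key_hash, _key, _version, _value) VALUES (?, ?, ?, ?)",
        (key_hash(key), key, version, value),
    )


def get(db: sqlite3.Connection, key: bytes):
    """Find rows by hash, then compare the full key to rule out hash collisions."""
    rows = db.execute(
        "SELECT _version, _value FROM kv"
        " WHERE _key_hash = ? AND _key = ? ORDER BY _id",
        (key_hash(key), key),
    )
    return rows.fetchall()


if __name__ == "__main__":
    db = open_store()
    put(db, b"hello", b"v1", b"world")
    print(get(db, b"hello"))
```

The design choice being illustrated is that the index entries are 8 bytes wide regardless of key length, so more of them fit per page; the full `_key` comparison in `get` is still needed because two different keys can share a hash.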

Slide 45

Slide 45 text

online backup • hot backup of data to another machine / destination • test Percona XtraBackup with HailDB • next step: backup/export to Hadoop/HDFS (similar to Cloudera Sqoop tool)

Slide 46

Slide 46 text

JNI bindings • JNI can get 2-5x perf boost vs. JNA • ... at the expense of nasty code • Will go for schema optimizations and InnoDB tuning tips *first*

Slide 47

Slide 47 text

resources • github.com/voldemort/voldemort • freenode #voldemort • github.com/sunnygleason/v-storage-haildb • github.com/sunnygleason/v-storage-bench • github.com/sunnygleason/g414-haildb • jna.dev.java.net

Slide 48

Slide 48 text

more resources • Amazon Dynamo • Faban / XFaban • HailDB • Drizzle • PBXT

Slide 49

Slide 49 text

Thank You!