Optimizing LevelDB for Performance and Scale (RICON East 2013)

Presented by Matthew Von-Maszewski at RICON East 2013

LevelDB is a flexible key-value store written by Google and open sourced in August 2011. LevelDB provides an ordered mapping of binary keys to binary values. Various companies and individuals use LevelDB on cell phones and servers alike. The problem, however, is that it does not run optimally on either as shipped.

This presentation outlines the basic internal mechanisms of LevelDB and then proceeds to discuss the tuning opportunities in the source code for each mechanism. This talk will draw heavily from our experiences optimizing LevelDB for use in Riak, which is handy for running sufficiently large clusters.

About Matthew

Matthew is a high tech migrant worker. He is currently a Software Engineer at Basho Technologies, working on the C/C++ aspects of Riak's storage and VM layers. Prior to Basho, Matthew was a contributing developer at Intuit, Akamai, Nuview, SmarterTravel Media, and on miscellaneous contracts. His delivered projects range from 4-bit microcontroller toys and ROM-based Quicken to high-volume, user-specific content delivery and distributed retail inventory planning/control. Weekends find him either with his family or out participating in marathons or Half-Iron triathlons.

Basho Technologies

May 13, 2013

Transcript

  1. Optimizing leveldb for Performance and Scale

  2. leveldb throughput
     [chart omitted: only axis tick labels survive in the transcript]
  3. leveldb throughput
     [second chart added, labeled: tuned as a server]
     github.com/basho/leveldb
  4. key/value lifecycle
     Write()
     Skip list
     Recovery log
     Immutable memory
     Level-0 .sst (overlapping)
     Level-1 .sst (sorted/overlapping)
     Level-2 .sst (sorted/overlapping)
     Level-3 .sst (sorted)
     Level-4 .sst (sorted)
     Level-5 .sst (sorted)
     Level-6 .sst (sorted)
     MANIFEST
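The write path on slide 4 can be sketched as a toy model. This is only an illustration under stated assumptions: `ToyStore`, `memtable_limit`, and the method names are hypothetical, a dict stands in for the skip list, a Python list stands in for the recovery log, and the flush to Level-0 happens inline rather than on a compaction thread.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyStore:
    """Toy model of the slide-4 write path (hypothetical names throughout)."""
    memtable_limit: int = 3
    log: list = field(default_factory=list)       # stands in for the recovery log
    memtable: dict = field(default_factory=dict)  # stands in for the skip list
    immutable: Optional[dict] = None              # "immutable memory" awaiting flush
    level0: list = field(default_factory=list)    # sorted files whose key ranges may overlap

    def write(self, key, value):
        # Every write is logged first so it can be replayed after a crash.
        self.log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._rotate()

    def _rotate(self):
        # The full memtable becomes read-only; new writes go to a fresh one.
        self.immutable = self.memtable
        self.memtable = {}
        self._flush_immutable()

    def _flush_immutable(self):
        # The immutable table is written out as one sorted Level-0 "file".
        self.level0.append(sorted(self.immutable.items()))
        self.immutable = None
        # Simplification: once the data is in a file, the log is no longer needed.
        self.log.clear()
```

Because each flush produces a file covering whatever keys happened to be in memory, Level-0 files can overlap one another, which is why the slide marks that level as overlapping.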
  5. .sst file anatomy
     trailer
     block index
     filter table (bloom)
     data block
     data block
     data block
     File position 0
     metadata index
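The layout on slide 5 implies that a reader starts at the end of the file: the trailer says where the index lives, and the index says where each data block lives. A minimal sketch of that idea follows. The byte layout here is an assumption made for illustration (JSON payloads, a 16-byte trailer), not the real .sst encoding, and the bloom filter and metadata index are left out.

```python
import io
import json
import struct

# Hypothetical trailer: index offset and index length as two little-endian uint64s.
TRAILER = struct.Struct("<QQ")


def write_toy_table(blocks):
    """Serialize a list of non-empty {key: value} dicts as data blocks, index, trailer."""
    buf = io.BytesIO()
    index = []
    for block in blocks:
        payload = json.dumps(block, sort_keys=True).encode()
        # The index records the last key of each block plus where to find it.
        index.append({"last_key": max(block), "offset": buf.tell(), "size": len(payload)})
        buf.write(payload)
    index_bytes = json.dumps(index).encode()
    index_offset = buf.tell()
    buf.write(index_bytes)
    buf.write(TRAILER.pack(index_offset, len(index_bytes)))
    return buf.getvalue()


def read_index(table):
    """Read the fixed-size trailer from the end, then the index it points to."""
    offset, size = TRAILER.unpack(table[-TRAILER.size:])
    return json.loads(table[offset:offset + size])


def read_block(table, entry):
    """Fetch one data block using an index entry."""
    start = entry["offset"]
    return json.loads(table[start:start + entry["size"]])
```

Data blocks start at file position 0 and the trailer sits last, so the file can be written in one forward pass while still being readable from a fixed-size record at the end.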
  6. stalls
     imm (immutable memory)
     level 0 full

  7. compaction
     [diagram: keys laid out across Overlap Level, Sorted Level, and Sorted Level+1]
     Before Compaction: C2 F M B C1 E A C3 H1 G H0 L / C0 D J K N
     After Compaction: A B C3 E F G H1 L M / C0 D J K N
     • Write Amplification: the silent performance killer
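To make the write amplification point on slide 7 concrete, here is a small sketch of a merge-style compaction. It assumes a simplified model in which each file is a sorted list of (key, value) pairs and the newer file wins on duplicate keys; `compact` is a hypothetical helper, and the ratio counts entries rather than bytes.

```python
def compact(newer, older):
    """Merge two sorted files; values from `newer` replace those in `older`."""
    merged = dict(older)
    merged.update(newer)
    return sorted(merged.items())


older = [("a", "old"), ("c", "old"), ("d", "old")]
newer = [("b", "new"), ("c", "new")]

result = compact(newer, older)

# Entries written to disk by this compaction per entry of new data absorbed.
amplification = len(result) / len(newer)
```

In this toy case two new entries force four entries to be rewritten, a ratio of 2.0 for a single compaction. A key that migrates through several levels pays a cost like this at each step, which is how the total disk traffic can grow well beyond what the application wrote.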
  8. stall sources
     • Single Database
        • Level 0 full and IMM compactions occur too often
        • Level 0 full and blocked by any higher level compaction
     • Multiple Databases
        • IMM / Level 0 full and blocked by any other active compaction
        • IMM / Level 0 full and waiting on queue
  9. compaction management
     Global Thread block 1 (of 5)
     Tiered Lock 0
     Tiered Lock 1
     IMM to Level 0 compaction thread
     Level 0 to Level 1 compaction thread
     Levels 1+ compaction thread
     Backpressure: Write Throttle
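The backpressure idea on slide 9 is to slow writers gradually as compaction falls behind instead of stalling them outright. The sketch below shows one way such a throttle could be shaped. It is not the algorithm used in basho/leveldb: the function name, the linear ramp, and every parameter and default are assumptions chosen for illustration.

```python
def throttle_delay(level0_files, slowdown_at=8, stop_at=12, max_delay=0.001):
    """Return a per-write delay in seconds that grows with the Level-0 backlog.

    Hypothetical policy: no delay below `slowdown_at` files, then a linear
    ramp that reaches `max_delay` at `stop_at` files and stays there.
    """
    fraction = (level0_files - slowdown_at) / (stop_at - slowdown_at)
    fraction = min(1.0, max(0.0, fraction))
    return max_delay * fraction
```

A writer would sleep for the returned duration before each write, so a growing backlog trades a little latency on every write for fewer long stalls.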
  10. key/value retrieval
      Get()
      Skip list
      Immutable memory
      Use manifest to find files covering key range by level
      File in file cache (no: the open file song)
      Bloom filter suggests exists
      Use index to identify block with key range
      Block in read cache (no: see open file song, verse 4)
      Sequentially walk block to find key
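The lookup order on slide 10 can be sketched in a few lines. This is a simplified model under stated assumptions: `ToyBloom`, `ToyFile`, and `get` are hypothetical names, the bloom filter is a generic textbook construction rather than LevelDB's, each file holds a single block, the file and block caches are omitted, and deletions are ignored.

```python
import hashlib


class ToyBloom:
    """Small bloom filter: may report false positives, never false negatives."""

    def __init__(self, keys, bits=64, hashes=3):
        self.bits, self.hashes, self.field = bits, hashes, 0
        for key in keys:
            for pos in self._positions(key):
                self.field |= 1 << pos

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def may_contain(self, key):
        return all((self.field >> pos) & 1 for pos in self._positions(key))


class ToyFile:
    """Stands in for one .sst file holding unique, sorted keys."""

    def __init__(self, items):
        self.items = sorted(items)
        self.bloom = ToyBloom(k for k, _ in self.items)

    def covers(self, key):
        # The manifest step: is the key inside this file's key range?
        return self.items[0][0] <= key <= self.items[-1][0]

    def get(self, key):
        if not self.bloom.may_contain(key):
            return None  # bloom filter says the key is definitely absent
        for k, v in self.items:  # sequentially walk the block
            if k == key:
                return v
        return None


def get(key, memtable, immutable, levels):
    """Check memory first, then each level from newest data to oldest."""
    for table in (memtable, immutable):
        if table and key in table:
            return table[key]
    for files in levels:
        for f in reversed(files):  # later files in a level are treated as newer
            if f.covers(key):
                value = f.get(key)
                if value is not None:
                    return value
    return None
```

The bloom filter check is what lets a lookup skip reading a data block for most keys a file does not contain, which matters most in the overlapping levels where several files can cover the same key range.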
  11. the open file song
      Open .sst file
      Read and validate trailer
      Request block index
      Chorus:
          Read block to user space
          CRC scan block
          Compression’s checksum block scan
          Decompress block into malloc memory
      Request metadata index
      Chorus:
      Request bloom filter
      Chorus:
      Chorus:
      Request data block
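The "chorus" on slide 11 is the work repeated for every block read from an .sst file: read it, verify its checksum, and decompress it. A minimal sketch of that sequence follows. As an assumption for the sake of a self-contained example, it uses `zlib.crc32` and `zlib.compress` from the Python standard library in place of the CRC32C and Snappy that LevelDB uses, and the 4-byte checksum suffix is a hypothetical layout; `seal_block` and `open_block` are illustrative names.

```python
import struct
import zlib


def seal_block(raw):
    """Compress a block and append a checksum of the compressed bytes."""
    body = zlib.compress(raw)
    return body + struct.pack("<I", zlib.crc32(body))


def open_block(stored):
    """The chorus: scan the block to verify its CRC, then decompress it."""
    body = stored[:-4]
    (expected,) = struct.unpack("<I", stored[-4:])
    if zlib.crc32(body) != expected:
        raise ValueError("block checksum mismatch")
    # Decompression allocates a fresh buffer for the uncompressed block.
    return zlib.decompress(body)
```

Each step passes over the block's bytes again, and the chorus repeats for the block index, the metadata index, the bloom filter, and the data block, which is why a lookup that misses the file cache is so much more expensive than one that hits it.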
  12. time fillers
      • Q&A
      • Repair
      • Level directories for tiered storage
      • Linux and grace of posix_fadvise
      • Performance counters
      • Independent cache types
      • FusionIO / SSD / SATA / AWS