Building a High-Performance Key/Value Store in Go

In this talk we explore the internals of a high-performance key/value store written in Go. The audience will learn the basic design used to store and retrieve data, as well as the techniques used to achieve high performance.

Marty Schoch

July 14, 2017

Transcript

  1. Marty Schoch GopherCon 2017

  2. None
  3. The Problem

  4. Primary Data Source

  5. Index Primary Data Source

  6. Index Primary Data Source Search

  7. Key/Value Store

  8. General Purpose Key-Value Stores Key Value []byte []byte
  9. General Purpose Key-Value Stores Key Value hatter mad head attached

    rabbit early Get / Set / Delete Values by Key
  10. General Purpose Key-Value Stores Key Value hatter mad head attached

    rabbit early Get / Set / Delete Values by Key GET hatter → mad
  11. General Purpose Key-Value Stores Key Value hatter mad head attached

    rabbit early late Get / Set / Delete Values by Key SET rabbit late
  12. General Purpose Key-Value Stores Key Value hatter mad rabbit late

    head attached Get / Set / Delete Values by Key DELETE head
  13. General Purpose Key-Value Stores Key Value hatter mad rabbit late

    Iterate Ranges of Key-Value Pairs
  14. General Purpose Key-Value Stores Key Value hatter mad rabbit late

    Iterate Ranges of Key-Value Pairs
  15. General Purpose Key-Value Stores Key Value hatter mad rabbit late

    Atomic Batch Updates tea party cat grinning hare march
  16. General Purpose Key-Value Stores Key Value cat grinning hare march

    hatter mad rabbit late tea party Atomic Batch Updates
  17. General Purpose Key-Value Stores Isolated Read Snapshots Iterator Started Key

    Value cat grinning hare march hatter mad rabbit late tea party
  18. General Purpose Key-Value Stores Isolated Read Snapshots Iterator Started Key

    Value cat grinning caterpillar smoking hare march hatter mad rabbit late tea party Concurrent Mutation
  19. General Purpose Key-Value Stores Isolated Read Snapshots Iterator Started Key

    Value cat grinning caterpillar smoking hare march hatter mad rabbit late tea party Concurrent Mutation (not seen)
  20. General Purpose Key-Value Stores Persistence to Disk

  21. Off-The-Shelf Solutions https://github.com/boltdb/bolt

  22. Off-The-Shelf Solutions BoltDB? https://github.com/boltdb/bolt

  23. Off-The-Shelf Solutions BoltDB? • Go • b+tree • Great Read

    Performance https://github.com/boltdb/bolt
  24. Off-The-Shelf Solutions RocksDB? https://github.com/facebook/rocksdb/

  25. Off-The-Shelf Solutions RocksDB? • C++ • LSM (leveldb) • Better

    Read/Write Perf Balance • cgo https://github.com/facebook/rocksdb/
  26. Off-The-Shelf Solutions GoLevelDB? https://github.com/syndtr/goleveldb

  27. Off-The-Shelf Solutions GoLevelDB? • Go • LSM (leveldb) • Unable

    to Tune Adequately https://github.com/syndtr/goleveldb
  28. Off-The-Shelf Solutions Badger? https://github.com/dgraph-io/badger

  29. Off-The-Shelf Solutions Badger? • Go • WiscKey • Not Available

    at the Time https://github.com/dgraph-io/badger
  30. Build our Own Key-Value Store?

  31. Simplicity Damian Gryski - Slices https://go-talks.appspot.com/github.com/dgryski/talks/dotgo-2016/slices.slide Performance through cache friendliness

  32. Simplicity Damian Gryski - Slices https://go-talks.appspot.com/github.com/dgryski/talks/dotgo-2016/slices.slide Performance through cache friendliness

    Rule 3. Fancy algorithms are slow when n is small, and n is usually small. Fancy algorithms have big constants. Until you know that n is frequently going to be big, don't get fancy. Rule 4. Fancy algorithms are buggier than simple ones, and they're much harder to implement. Use simple algorithms as well as simple data structures. "Notes on C Programming" (Rob Pike, 1989)
  33. Simplicity Dave Cheney - Simplicity and Collaboration https://dave.cheney.net/2015/03/08/simplicity-and-collaboration

  34. Simplicity Dave Cheney - Simplicity and Collaboration https://dave.cheney.net/2015/03/08/simplicity-and-collaboration "Simplicity cannot

    be added later"
  35. Special Purpose Key-Value Store Index

  36. Special Purpose Key-Value Store Index Write Throughput

  37. Special Purpose Key-Value Store Index Write Throughput Read after Write

    Latency
  38. Special Purpose Key-Value Store Index Write Throughput Read after Write

    Latency
  39. Special Purpose Key-Value Store Index Write Throughput Read after Write

    Latency • Persistence Decoupled from Read and Write
  40. Special Purpose Key-Value Store Index Write Throughput Read after Write

    Latency • Persistence Decoupled from Read and Write • Willing to use All System RAM* * Bounded by some Quota
  41. The Rabbit Hole

  42. type Collection interface {
          ExecuteBatch(b Batch) error
          Snapshot() (Snapshot, error)
      }
  43. type Batch interface {
          Set(key, value []byte) error
          Delete(key []byte) error
      }
  44. Batch of Changes

  45. Batch of Changes alice dodo knave cat duchess queen hare

    bill king rabbit
  46. type Snapshot interface {
          Get(key []byte) ([]byte, error)
          Iterator(start, end []byte) (Iterator, error)
      }
  47. Batch of Changes alice dodo knave cat duchess queen hare

    bill king rabbit
  48. Batch of Changes alice dodo knave cat duchess queen hare

    bill king rabbit
  49. Sort the Keys alice dodo knave cat duchess queen hare

    bill king rabbit
  50. Get – Binary Search for Key

  51. Get – Binary Search for Key

  52. Get – Binary Search for Key

  53. Get – Binary Search for Key

  54. Get – Binary Search for Key K/V pair

  55. Iterate Key-Value Pairs – For Loop K/V pair

  56. Iterate Key-Value Pairs – For Loop K/V pair for-loop i++
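Slides 49 to 56 describe sorting a batch by key, answering Get with a binary search, and iterating with a plain for loop. A minimal sketch of that idea follows. The `kv` and `sortedBatch` types and the `Range` helper are hypothetical names chosen for illustration, not types from the talk, and duplicate keys within one batch are not handled.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// kv is one key/value pair (hypothetical type name).
type kv struct {
	key, val []byte
}

// sortedBatch holds pairs ordered by key so lookups can use binary search.
type sortedBatch []kv

// newSortedBatch copies the pairs and sorts the copy by key.
func newSortedBatch(pairs []kv) sortedBatch {
	s := append(sortedBatch(nil), pairs...)
	sort.Slice(s, func(i, j int) bool { return bytes.Compare(s[i].key, s[j].key) < 0 })
	return s
}

// Get binary-searches for key and reports whether it was found.
func (s sortedBatch) Get(key []byte) ([]byte, bool) {
	i := sort.Search(len(s), func(i int) bool { return bytes.Compare(s[i].key, key) >= 0 })
	if i < len(s) && bytes.Equal(s[i].key, key) {
		return s[i].val, true
	}
	return nil, false
}

// Range visits every pair with start <= key < end, in key order.
func (s sortedBatch) Range(start, end []byte, visit func(k, v []byte)) {
	i := sort.Search(len(s), func(i int) bool { return bytes.Compare(s[i].key, start) >= 0 })
	for ; i < len(s) && bytes.Compare(s[i].key, end) < 0; i++ {
		visit(s[i].key, s[i].val)
	}
}

func main() {
	s := newSortedBatch([]kv{
		{[]byte("rabbit"), []byte("late")},
		{[]byte("hatter"), []byte("mad")},
		{[]byte("cat"), []byte("grinning")},
	})
	v, ok := s.Get([]byte("hatter"))
	fmt.Println(string(v), ok)
	s.Range([]byte("a"), []byte("i"), func(k, v []byte) {
		fmt.Println(string(k), string(v))
	})
}
```

Once the keys are sorted, a range scan is one binary search to find the start followed by `i++`, which is the "for loop" the slides refer to.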

  57. Batch → Segment Unsorted Mutable Execute Sorted Immutable

  58. type segment struct {
          data []byte
          meta []uint64
      }

  59. Set( []byte("name"), []byte("marty") ) data []byte 0 20

  60. Set( []byte("name"), []byte("marty") ) data []byte Append Key and Value

    []byte to data slice 0 20
  61. Set( []byte("name"), []byte("marty") ) data []byte Append Key and Value

    []byte to data slice name 0 20 24
  62. Set( []byte("name"), []byte("marty") ) data []byte Append Key and Value

    []byte to data slice name marty 0 20 24 29
  63. Set( []byte("name"), []byte("marty") ) data []byte name marty 0 20

    24 29 Build 2 uint64s of metadata
  64. Set( []byte("name"), []byte("marty") ) data []byte name marty 0 20

    24 29 Build 2 uint64s of metadata 20 Data Offset 64 bits
  65. Set( []byte("name"), []byte("marty") ) data []byte name marty 0 20

    24 29 Build 2 uint64s of metadata 20 Data Offset Operation 4 bits s 64 bits
  66. Set( []byte("name"), []byte("marty") ) data []byte name marty 0 20

    24 29 Build 2 uint64s of metadata 20 Data Offset Operation 4 bits 24 bits s 4 64 bits Key Len
  67. Set( []byte("name"), []byte("marty") ) data []byte name marty 0 20

    24 29 Build 2 uint64s of metadata 20 Data Offset Operation 4 bits 24 bits 28 bits s 4 5 64 bits Key Len Val Len
  68. Set( []byte("name"), []byte("marty") ) data []byte name marty 0 20

    24 29 Build 2 uint64s of metadata: first uint64 = Data Offset 20 (64 bits); second uint64 = Operation s (4 bits), Key Len 4 (24 bits), Val Len 5 (28 bits), unused (8 bits)
  69. Set( []byte("name"), []byte("marty") ) data []byte name marty 0 20

    24 29 20 s 4 5 meta []uint64 Append uint64s to meta slice
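Slides 59 to 69 build up how one Set becomes appended bytes plus two uint64s of metadata. The sketch below follows the field widths shown on the slides (64-bit offset; 4-bit operation, 24-bit key length, 28-bit value length, 8 bits unused). The exact bit positions, the numeric operation codes, and the helper names `packLens`, `unpackLens`, and `entry` are assumptions made for illustration and may not match the real moss encoding.

```go
package main

import (
	"errors"
	"fmt"
)

// Operation codes are hypothetical; the slide only shows "s" for a set.
const (
	opSet uint64 = 1
	opDel uint64 = 2
)

// Assumed layout of the second word, from high bits to low bits:
// 4 bits operation | 24 bits key length | 28 bits value length | 8 bits unused.
const (
	keyBits  = 24
	valBits  = 28
	opShift  = 60
	keyShift = 36
	valShift = 8
)

type segment struct {
	data []byte
	meta []uint64
}

// packLens combines operation and lengths into one uint64.
func packLens(op uint64, keyLen, valLen int) (uint64, error) {
	if op >= 1<<4 || keyLen >= 1<<keyBits || valLen >= 1<<valBits {
		return 0, errors.New("field does not fit its bit width")
	}
	return op<<opShift | uint64(keyLen)<<keyShift | uint64(valLen)<<valShift, nil
}

// unpackLens reverses packLens.
func unpackLens(w uint64) (op uint64, keyLen, valLen int) {
	op = w >> opShift
	keyLen = int((w >> keyShift) & (1<<keyBits - 1))
	valLen = int((w >> valShift) & (1<<valBits - 1))
	return op, keyLen, valLen
}

// Set appends key and value bytes to data and two uint64s to meta.
func (s *segment) Set(key, val []byte) error {
	lens, err := packLens(opSet, len(key), len(val))
	if err != nil {
		return err
	}
	offset := uint64(len(s.data))
	s.data = append(s.data, key...)
	s.data = append(s.data, val...)
	s.meta = append(s.meta, offset, lens)
	return nil
}

// entry decodes the i-th operation back into slices of data.
func (s *segment) entry(i int) (uint64, []byte, []byte) {
	off := int(s.meta[2*i])
	op, kl, vl := unpackLens(s.meta[2*i+1])
	return op, s.data[off : off+kl], s.data[off+kl : off+kl+vl]
}

func main() {
	// 20 bytes are already in use, as in the slide's example.
	s := &segment{data: make([]byte, 20)}
	if err := s.Set([]byte("name"), []byte("marty")); err != nil {
		panic(err)
	}
	op, k, v := s.entry(0)
	fmt.Println(s.meta[0], op, string(k), string(v))
}
```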
  70. type segment struct { data []byte meta []uint64 }

  71. type segment struct { data []byte meta []uint64 } Sorting

    only shuffles integers, not bytes!
  72. type segment struct { data []byte meta []uint64 }

  73. type segment struct { data []byte meta []uint64 } implements…

  74. type segment struct { data []byte meta []uint64 } implements…

    • Batch
  75. type segment struct { data []byte meta []uint64 } implements…

    • Batch • Snapshot
  76. type segment struct { data []byte meta []uint64 } implements…

    • Batch • Snapshot • Collection
  77. Another Batch?

  78. More Batches, a stack of Segments

  79. More Batches, a stack of Segments segmentStack []*segment

  80. More Batches, a stack of Segments Incoming Batch segmentStack []*segment

  81. More Batches, a stack of Segments Incoming Batch segmentStack sort

    in place []*segment
  82. More Batches, a stack of Segments segmentStack sort in place

    []*segment
  83. More Batches, a stack of Segments segmentStack []*segment

  84. More Batches, a stack of Segments segmentStack []*segment

  85. More Batches, a stack of Segments segmentStack []*segment

  86. Get segmentStack Newer Segments Shadow Older Ones

  87. Get segmentStack Newer Segments Shadow Older Ones Binary Search Down

    the Stack
  88. Get segmentStack Newer Segments Shadow Older Ones Binary Search Down

    the Stack
  89. Get segmentStack Newer Segments Shadow Older Ones Binary Search Down

    the Stack
  90. Get segmentStack Newer Segments Shadow Older Ones Binary Search Down

    the Stack
  91. Get segmentStack Newer Segments Shadow Older Ones Binary Search Down

    the Stack
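Slides 86 to 91 describe Get as a binary search in each segment, walking from the newest segment down so that newer entries shadow older ones. A minimal sketch, assuming string keys and a `deleted` marker for brevity (the talk uses `[]byte` and `[]*segment`); all type names are hypothetical:

```go
package main

import (
	"fmt"
	"sort"
)

// entry is one operation; deleted marks a Delete (hypothetical layout).
type entry struct {
	key, val string
	deleted  bool
}

// segment is an immutable slice of entries sorted by key.
type segment []entry

// find binary-searches one segment.
func (s segment) find(key string) (entry, bool) {
	i := sort.Search(len(s), func(i int) bool { return s[i].key >= key })
	if i < len(s) && s[i].key == key {
		return s[i], true
	}
	return entry{}, false
}

// segmentStack is ordered oldest first; the last element is the newest.
type segmentStack []segment

// Get searches from the newest segment down; the first hit wins.
func (st segmentStack) Get(key string) (string, bool) {
	for i := len(st) - 1; i >= 0; i-- {
		if e, ok := st[i].find(key); ok {
			if e.deleted {
				return "", false // a newer delete shadows older values
			}
			return e.val, true
		}
	}
	return "", false
}

func main() {
	st := segmentStack{
		{{key: "hatter", val: "mad"}, {key: "head", val: "attached"}, {key: "rabbit", val: "early"}},
		{{key: "head", deleted: true}, {key: "rabbit", val: "late"}},
	}
	fmt.Println(st.Get("rabbit"))
	fmt.Println(st.Get("head"))
	fmt.Println(st.Get("hatter"))
}
```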
  92. Iteration - using container/heap segmentStack Iterators for each segment alice

    cat hare
  93. Iteration - using container/heap segmentStack Iterators for each segment alice

    cat hare
  94. Iteration - using container/heap segmentStack Iterator returns 'alice' dodo cat

    hare
  95. Iteration - using container/heap segmentStack Use heap Fix() method cat

    hare dodo
  96. Iteration - using container/heap segmentStack Iterator returns 'cat' king hare

    dodo
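Slides 92 to 96 show iteration as a merge of one iterator per segment, kept in a `container/heap` and re-ordered with `heap.Fix` after the top iterator advances. The sketch below is one possible way to write that merge. The `cursor` type, the `level` tie-breaker that lets newer segments win, and the `merge` function are assumptions for illustration.

```go
package main

import (
	"container/heap"
	"fmt"
)

type entry struct{ key, val string }

// cursor walks one sorted segment; level is its position in the stack
// (higher means newer).
type cursor struct {
	seg   []entry
	pos   int
	level int
}

func (c *cursor) cur() entry { return c.seg[c.pos] }

// cursorHeap orders cursors by current key, newest segment first on ties.
type cursorHeap []*cursor

func (h cursorHeap) Len() int { return len(h) }
func (h cursorHeap) Less(i, j int) bool {
	a, b := h[i].cur(), h[j].cur()
	if a.key != b.key {
		return a.key < b.key
	}
	return h[i].level > h[j].level
}
func (h cursorHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *cursorHeap) Push(x any)   { *h = append(*h, x.(*cursor)) }
func (h *cursorHeap) Pop() any {
	old := *h
	c := old[len(old)-1]
	*h = old[:len(old)-1]
	return c
}

// merge returns every key once, in key order, taking the newest version.
func merge(stack [][]entry) []entry {
	h := &cursorHeap{}
	for level, seg := range stack {
		if len(seg) > 0 {
			*h = append(*h, &cursor{seg: seg, level: level})
		}
	}
	heap.Init(h)
	var out []entry
	for h.Len() > 0 {
		top := (*h)[0]
		e := top.cur()
		if len(out) == 0 || out[len(out)-1].key != e.key {
			out = append(out, e) // first sighting of a key is its newest version
		}
		top.pos++
		if top.pos < len(top.seg) {
			heap.Fix(h, 0) // cursor advanced: restore heap order in place
		} else {
			heap.Pop(h) // cursor exhausted: drop it
		}
	}
	return out
}

func main() {
	out := merge([][]entry{
		{{"alice", "old"}, {"dodo", "d"}},
		{{"cat", "c"}, {"king", "k"}},
		{{"alice", "new"}, {"hare", "h"}},
	})
	for _, e := range out {
		fmt.Println(e.key, e.val)
	}
}
```

Using `heap.Fix` on the root avoids a Pop followed by a Push each time an iterator advances.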
  97. Atomic Batches segmentStack

  98. Isolated Snapshots collection segmentStack

  99. Isolated Snapshots collection segmentStack

  100. Isolated Snapshots collection segmentStack snapshot segmentStack

  101. Isolated Snapshots collection segmentStack snapshot segmentStack

  102. Isolated Snapshots collection segmentStack snapshot segmentStack
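Slides 97 to 102 suggest that atomic batches and isolated snapshots both fall out of the segment stack: a batch becomes visible as one new segment, and a snapshot holds its own view of the stack. A minimal sketch under stated assumptions: segments are immutable, the collection never modifies a stack slice that a snapshot may hold, and a map stands in for a sorted segment. The locking scheme and all names are hypothetical, not taken from moss.

```go
package main

import (
	"fmt"
	"sync"
)

// segment is a stand-in for an immutable sorted segment.
type segment struct{ kv map[string]string }

type collection struct {
	mu    sync.Mutex
	stack []*segment
}

// ExecuteBatch publishes the whole batch as one new segment. It builds a
// fresh slice, so slices held by existing snapshots never change.
func (c *collection) ExecuteBatch(kv map[string]string) {
	seg := &segment{kv: kv}
	c.mu.Lock()
	defer c.mu.Unlock()
	next := make([]*segment, len(c.stack), len(c.stack)+1)
	copy(next, c.stack)
	c.stack = append(next, seg)
}

// snapshot is a frozen view of the stack.
type snapshot struct{ stack []*segment }

// Snapshot captures the current stack; later batches are not seen.
func (c *collection) Snapshot() *snapshot {
	c.mu.Lock()
	defer c.mu.Unlock()
	return &snapshot{stack: c.stack}
}

// Get searches the snapshot's segments from newest to oldest.
func (s *snapshot) Get(key string) (string, bool) {
	for i := len(s.stack) - 1; i >= 0; i-- {
		if v, ok := s.stack[i].kv[key]; ok {
			return v, true
		}
	}
	return "", false
}

func main() {
	c := &collection{}
	c.ExecuteBatch(map[string]string{"cat": "grinning", "hare": "march"})
	snap := c.Snapshot()
	c.ExecuteBatch(map[string]string{"caterpillar": "smoking"}) // concurrent mutation
	_, seen := snap.Get("caterpillar")
	fmt.Println("old snapshot sees caterpillar:", seen)
	_, seen = c.Snapshot().Get("caterpillar")
	fmt.Println("new snapshot sees caterpillar:", seen)
}
```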

  103. Stack Too Tall?

  104. Background Merger segmentStack

  105. Background Merger segmentStack Iterate over Segments

  106. Background Merger segmentStack Iterate over Segments Build up New Segment

  107. Background Merger segmentStack Build up New Segment Iterate over Segments

  108. Background Merger segmentStack Swap in New Segment

  109. Background Merger segmentStack Swap in New Segment

  110. Background Merger segmentStack Swap in New Segment

  111. Background Merger segmentStack
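Slides 103 to 111 describe a background merger that iterates over segments, builds up a new segment, and swaps it in. The following is a single-threaded sketch of the merge step only. It assumes string keys and a `deleted` marker, and it merges the whole stack so that deletion markers can be dropped; the function names are hypothetical and the concurrency of the real merger is not modeled.

```go
package main

import "fmt"

type entry struct {
	key, val string
	deleted  bool
}

// mergeTwo merges two key-sorted segments into a new one; on equal keys
// the entry from the newer segment wins.
func mergeTwo(older, newer []entry) []entry {
	out := make([]entry, 0, len(older)+len(newer))
	i, j := 0, 0
	for i < len(older) && j < len(newer) {
		switch {
		case older[i].key < newer[j].key:
			out = append(out, older[i])
			i++
		case older[i].key > newer[j].key:
			out = append(out, newer[j])
			j++
		default:
			out = append(out, newer[j])
			i++
			j++
		}
	}
	out = append(out, older[i:]...)
	return append(out, newer[j:]...)
}

// mergeStack folds the stack (oldest first) into a single segment.
// Deletion markers are dropped, which is only safe because no older
// segment remains underneath the result.
func mergeStack(stack [][]entry) [][]entry {
	var merged []entry
	for _, seg := range stack {
		merged = mergeTwo(merged, seg)
	}
	live := merged[:0]
	for _, e := range merged {
		if !e.deleted {
			live = append(live, e)
		}
	}
	return [][]entry{live}
}

func main() {
	stack := [][]entry{
		{{key: "hatter", val: "mad"}, {key: "head", val: "attached"}, {key: "rabbit", val: "early"}},
		{{key: "head", deleted: true}, {key: "rabbit", val: "late"}},
	}
	stack = mergeStack(stack) // swap in the new, shorter stack
	fmt.Println(len(stack), stack[0])
}
```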

  112. It Works! But…

  113. Append-Only File Format File

  114. Append-Only File Format File Data

  115. Append-Only File Format File Data

  116. Append-Only File Format File Data Data

  117. Append-Only File Format File Data Data

  118. Append-Only File Format File Data Data Simple

  119. Append-Only File Format File Data Data Simple Safe

  120. Header Header

  121. Persist each Segment in the Segment Stack Header meta []uint64

    data []byte func (f *File) Write(b []byte) (n int, err error)
  122. func Uint64SliceToByteSlice(in []uint64) ([]byte, error) {
           inHeader := (*reflect.SliceHeader)(unsafe.Pointer(&in))
           var out []byte
           outHeader := (*reflect.SliceHeader)(unsafe.Pointer(&out))
           outHeader.Data = inHeader.Data
           outHeader.Len = inHeader.Len * 8
           outHeader.Cap = inHeader.Cap * 8
           return out, nil
       }
  123. Persist the Segment Stack Header meta []uint64 data []byte []byte

    func (f *File) Write(b []byte) (n int, err error)
  124. Persist the Segment Stack Header meta []uint64 data []byte []byte

    not portable func (f *File) Write(b []byte) (n int, err error)
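Slide 124 notes that reinterpreting `[]uint64` as `[]byte` is not portable, because the bytes on disk then depend on the machine's byte order. For comparison, here is a portable alternative using `encoding/binary` with a fixed little-endian order. This is not the approach shown in the talk, it copies the data rather than aliasing it, and the function names are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// uint64sToBytesLE encodes each value as 8 little-endian bytes.
// Unlike the unsafe conversion, this allocates and copies.
func uint64sToBytesLE(in []uint64) []byte {
	out := make([]byte, 8*len(in))
	for i, v := range in {
		binary.LittleEndian.PutUint64(out[8*i:], v)
	}
	return out
}

// bytesToUint64sLE decodes a buffer written by uint64sToBytesLE.
func bytesToUint64sLE(in []byte) ([]uint64, error) {
	if len(in)%8 != 0 {
		return nil, fmt.Errorf("length %d is not a multiple of 8", len(in))
	}
	out := make([]uint64, len(in)/8)
	for i := range out {
		out[i] = binary.LittleEndian.Uint64(in[8*i:])
	}
	return out, nil
}

func main() {
	meta := []uint64{20, 0x0102}
	b := uint64sToBytesLE(meta)
	back, err := bytesToUint64sLE(b)
	fmt.Println(len(b), back, err)
}
```

The trade-off is the one the talk points at: the unsafe version avoids the copy, while this version produces the same file bytes on every architecture.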
  125. Append Footer Header Footer

  126. Append Footer Header Footer

  127. Footer Pointers to Older Segments Header Footer

  128. Footer Pointers to Older Segments Header Footer

  129. Footer Pointers to Older Segments Header Footer

  130. Footer Pointers to Older Segments Header Footer Footer

  131. Opening File Header Footer Footer EOF Seek Backwards Find Start

    of Valid Footer
  132. Opening File Header Footer Footer EOF Seek Backwards Find Start

    of Valid Footer
  133. Opening File Header Footer Footer EOF Seek Backwards Find Start

    of Valid Footer
  134. Opening File Header Footer Footer EOF Seek Backwards Find Start

    of Valid Footer
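Slides 131 to 134 describe opening the append-only file by seeking backwards from EOF until the start of a valid footer is found. The sketch below works on an in-memory `[]byte` standing in for the file. The footer layout (a 4-byte marker, an 8-byte segment offset, and a CRC32 over both) is entirely hypothetical; the slides do not specify the real format.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// Hypothetical footer: 4-byte marker + 8-byte segment offset + 4-byte CRC32.
var magic = []byte("FOOT")

const footerSize = 4 + 8 + 4

// appendFooter appends a footer pointing at segmentOffset.
func appendFooter(file []byte, segmentOffset uint64) []byte {
	start := len(file)
	file = append(file, magic...)
	file = binary.LittleEndian.AppendUint64(file, segmentOffset)
	sum := crc32.ChecksumIEEE(file[start:])
	return binary.LittleEndian.AppendUint32(file, sum)
}

// findLastFooter scans backwards from the end of the file and returns the
// position and payload of the last footer whose checksum is valid.
func findLastFooter(file []byte) (pos int, segmentOffset uint64, ok bool) {
	for pos = len(file) - footerSize; pos >= 0; pos-- {
		f := file[pos : pos+footerSize]
		if !bytes.Equal(f[:4], magic) {
			continue
		}
		if crc32.ChecksumIEEE(f[:12]) != binary.LittleEndian.Uint32(f[12:]) {
			continue // marker found, but the footer is torn or corrupt
		}
		return pos, binary.LittleEndian.Uint64(f[4:12]), true
	}
	return 0, 0, false
}

func main() {
	file := []byte("HEADER")
	file = append(file, "segment-1"...)
	file = appendFooter(file, 6)
	second := len(file)
	file = append(file, "segment-2"...)
	file = appendFooter(file, uint64(second))
	file = append(file, "FOOT\x00\x01"...) // simulated partial write at EOF

	pos, off, ok := findLastFooter(file)
	fmt.Println(pos, off, ok)
}
```

A partial write at the tail fails the checksum, so the scan falls back to the last complete footer, which is what makes the append-only format safe.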
  135. Reading From File meta []byte data []byte mmap segment bytes

    os https://github.com/edsrzf/mmap-go
  136. Reading From File meta []byte data []byte mmap segment bytes

    os []uint64
  137. In-memory and Disk Segments Get/Iterator implementations the same, no new

    code collection segmentStack []byte []byte mmap
  138. Compaction

  139. Reuse Background Merger to Compact Files Header Footer Footer data-0000123.moss

  140. Reuse Background Merger to Compact Files Header Footer Footer data-0000123.moss

    Header data-0000124.moss
  141. Reuse Background Merger to Compact Files Header Footer Footer data-0000123.moss

    Header data-0000124.moss
  142. Reuse Background Merger to Compact Files Header Footer Footer data-0000123.moss

    Header Footer data-0000124.moss
  143. Reuse Background Merger to Compact Files Header Footer data-0000124.moss

  144. Who Are You?

  145. None
  146. None
  147. MOSS

  148. Memory Oriented Sorted Segments

  149. None
  150. None
  151. Approachable Codebase Code 39% Tests 61% • Code - 4262

    • Test - 6580
  152. High Performance

  153. High Performance - Minimize GC Impact type segment struct {

    data []byte meta []uint64 }
  154. High Performance - Minimize GC Impact • Slices of uint64

    and byte type segment struct { data []byte meta []uint64 }
  155. High Performance - Minimize GC Impact • Slices of uint64

    and byte • Integer offsets into slices, not pointers type segment struct { data []byte meta []uint64 } data []byte name marty 20 s 4 5 – meta []uint64
  156. High Performance - Memory Allocation type segment struct { data

    []byte meta []uint64 }
  157. High Performance - Memory Allocation • 2 uints of meta

    per op type segment struct { data []byte meta []uint64 }
  158. High Performance - Memory Allocation • 2 uints of meta

    per op • len(data) = len(key) + len(val) type segment struct { data []byte meta []uint64 }
  159. High Performance - Memory Allocation
       type Collection interface {
           NewBatch(ops, bytes int) (Batch, error)
       }
  160. High Performance - Memory Allocation
       type Batch interface {
           AllocSet(key, val []byte) error
           AllocDel(key []byte) error
       }
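Slides 156 to 160 observe that a batch needs two uint64s of metadata per operation and `len(key) + len(val)` bytes of data, so a caller that knows both numbers up front can size the batch once. The sketch below shows only that preallocation idea. It is not the moss `AllocSet` implementation; `newBatch`, `set`, and the simplified metadata encoding are hypothetical.

```go
package main

import "fmt"

type batch struct {
	data []byte
	meta []uint64
}

// newBatch reserves room for ops operations and totalBytes of key/value
// data, so later appends do not have to grow the slices.
func newBatch(ops, totalBytes int) *batch {
	return &batch{
		data: make([]byte, 0, totalBytes),
		meta: make([]uint64, 0, 2*ops), // 2 uint64s of meta per op
	}
}

// set appends one pair using a simplified meta encoding:
// offset, then keyLen<<32 | valLen.
func (b *batch) set(key, val []byte) {
	b.meta = append(b.meta, uint64(len(b.data)), uint64(len(key))<<32|uint64(len(val)))
	b.data = append(b.data, key...)
	b.data = append(b.data, val...)
}

func main() {
	// Two operations: 4+5 bytes and 4+6 bytes, 19 bytes in total.
	b := newBatch(2, 19)
	b.set([]byte("name"), []byte("marty"))
	b.set([]byte("lang"), []byte("golang"))
	fmt.Println(len(b.data), cap(b.data), len(b.meta), cap(b.meta))
}
```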
  161. High Performance - Unsafe

  162. High Performance - Unsafe • Faster serialization, but loss of

    portability
  163. High Performance - Unsafe • Faster serialization, but loss of

    portability • Deeply unsatisfying https://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html Rob Pike – The byte order fallacy "Whenever I see code that asks what the native byte order is, it's almost certain the code is either wrong or misguided."
  164. None
  165. None
  166. One Key Written Entire Store Rewritten

  167. Conclusions

  168. Moss Competitor

  169. Moss Competitor @mossbro Moss Fanboy Moss is the best, fastest!

    #winning
  170. Moss Competitor @mossbro Moss Fanboy Moss is the best, fastest!

    #winning @mossbro Moss Fanboy Competition is slow. Sad!
  171. Moss Competitor @mossbro Moss Fanboy Moss is the best, fastest!

    #winning @mossbro Moss Fanboy Competition is slow. Sad!
  172. moss Write Throughput Read after Write Latency

  173. Implementation Benchmark Analyze and Profile

  174. Epilogue Steve Yen

  175. Couchbase Moss Team Abhinav Dangeti Sundar Sridharan Sreekanth Sivasankaran Scott

    Lashley Marty Schoch Alex Gyryk Aruna Piravi
  176. Thanks github.com/couchbase/moss marty@couchbase.com @mschoch

  177. Thanks github.com/couchbase/moss marty@couchbase.com @mschoch