Building a High-Performance Key/Value Store in Go

In this talk we explore the internals of a high-performance key/value store written in Go. The audience will learn the basic design used to store and retrieve data, as well as the techniques used to achieve high performance.

Marty Schoch

July 14, 2017

Transcript

  1. Marty Schoch
    GopherCon 2017

  3. The Problem

  4. Primary
    Data Source

  5. Index
    Primary
    Data Source

  6. Index
    Primary
    Data Source Search

  7. Key/Value Store

  8. General Purpose Key-Value Stores
    Key Value
    []byte []byte

  9. General Purpose Key-Value Stores
    Key Value
    hatter mad
    head attached
    rabbit early
    Get / Set / Delete Values by Key

  10. General Purpose Key-Value Stores
    Key Value
    hatter mad
    head attached
    rabbit early
    Get / Set / Delete Values by Key
    GET hatter → mad

  11. General Purpose Key-Value Stores
    Key Value
    hatter mad
    head attached
    rabbit early → late
    Get / Set / Delete Values by Key
    SET rabbit late

  12. General Purpose Key-Value Stores
    Key Value
    hatter mad
    rabbit late
    head attached
    Get / Set / Delete Values by Key
    DELETE head

  13. General Purpose Key-Value Stores
    Key Value
    hatter mad
    rabbit late
    Iterate Ranges of Key-Value Pairs

  14. General Purpose Key-Value Stores
    Key Value
    hatter mad
    rabbit late
    Iterate Ranges of Key-Value Pairs

  15. General Purpose Key-Value Stores
    Key Value
    hatter mad
    rabbit late
    Atomic Batch Updates
    tea party
    cat grinning
    hare march

  16. General Purpose Key-Value Stores
    Key Value
    cat grinning
    hare march
    hatter mad
    rabbit late
    tea party
    Atomic Batch Updates

  17. General Purpose Key-Value Stores
    Isolated Read Snapshots
    Iterator Started
    Key Value
    cat grinning
    hare march
    hatter mad
    rabbit late
    tea party

  18. General Purpose Key-Value Stores
    Isolated Read Snapshots
    Iterator Started
    Key Value
    cat grinning
    caterpillar smoking
    hare march
    hatter mad
    rabbit late
    tea party
    Concurrent Mutation

  19. General Purpose Key-Value Stores
    Isolated Read Snapshots
    Iterator Started
    Key Value
    cat grinning
    caterpillar smoking
    hare march
    hatter mad
    rabbit late
    tea party
    Concurrent Mutation (not seen)

  20. General Purpose Key-Value Stores
    Persistence to Disk

  21. Off-The-Shelf Solutions
    https://github.com/boltdb/bolt

  22. Off-The-Shelf Solutions
    BoltDB?
    https://github.com/boltdb/bolt

  23. Off-The-Shelf Solutions
    BoltDB?
    • Go
    • b+tree
    • Great Read Performance
    https://github.com/boltdb/bolt

  24. Off-The-Shelf Solutions
    RocksDB?
    https://github.com/facebook/rocksdb/

  25. Off-The-Shelf Solutions
    RocksDB?
    • C++
    • LSM (leveldb)
    • Better Read/Write Perf Balance
    • cgo
    https://github.com/facebook/rocksdb/

  26. Off-The-Shelf Solutions
    GoLevelDB?
    https://github.com/syndtr/goleveldb

  27. Off-The-Shelf Solutions
    GoLevelDB?
    • Go
    • LSM (leveldb)
    • Unable to Tune Adequately
    https://github.com/syndtr/goleveldb

  28. Off-The-Shelf Solutions
    Badger?
    https://github.com/dgraph-io/badger

  29. Off-The-Shelf Solutions
    Badger?
    • Go
    • WiscKey
    • Not Available at the Time
    https://github.com/dgraph-io/badger

  30. Build our Own
    Key-Value Store?

  31. Simplicity
    Damian Gryski - Slices
    https://go-talks.appspot.com/github.com/dgryski/talks/dotgo-2016/slices.slide
    Performance through cache friendliness

  32. Simplicity
    Damian Gryski - Slices
    https://go-talks.appspot.com/github.com/dgryski/talks/dotgo-2016/slices.slide
    Performance through cache friendliness
    Rule 3. Fancy algorithms are slow when n is small, and n is usually small.
    Fancy algorithms have big constants. Until you know that n is frequently going
    to be big, don't get fancy.
    Rule 4. Fancy algorithms are buggier than simple ones, and they're much
    harder to implement. Use simple algorithms as well as simple data structures.
    "Notes on C Programming" (Rob Pike, 1989)

  33. Simplicity
    Dave Cheney - Simplicity and Collaboration
    https://dave.cheney.net/2015/03/08/simplicity-and-collaboration

  34. Simplicity
    Dave Cheney - Simplicity and Collaboration
    https://dave.cheney.net/2015/03/08/simplicity-and-collaboration
    "Simplicity cannot be added later"

  35. Special Purpose Key-Value Store
    Index

  36. Special Purpose Key-Value Store
    Index
    Write Throughput

  37. Special Purpose Key-Value Store
    Index
    Write Throughput
    Read after Write Latency

  38. Special Purpose Key-Value Store
    Index
    Write Throughput
    Read after Write Latency

  39. Special Purpose Key-Value Store
    Index
    Write Throughput
    Read after Write Latency
    • Persistence Decoupled from Read and Write

  40. Special Purpose Key-Value Store
    Index
    Write Throughput
    Read after Write Latency
    • Persistence Decoupled from Read and Write
    • Willing to use All System RAM*
    * Bounded by some Quota

  41. The Rabbit Hole

  42. type Collection interface {
    ExecuteBatch(b Batch) error
    Snapshot() (Snapshot, error)
    }

  43. type Batch interface {
    Set(key, value []byte) error
    Delete(key []byte) error
    }

  44. Batch of Changes

  45. Batch of Changes
    alice
    dodo
    knave
    cat
    duchess
    queen
    hare
    bill
    king
    rabbit

  46. type Snapshot interface {
    Get(key []byte) ([]byte, error)
    Iterator(start, end []byte) (Iterator, error)
    }

  47. Batch of Changes
    alice
    dodo
    knave
    cat
    duchess
    queen
    hare
    bill
    king
    rabbit

  48. Batch of Changes
    alice
    dodo
    knave
    cat
    duchess
    queen
    hare
    bill
    king
    rabbit

  49. Sort the Keys
    alice
    dodo
    knave
    cat
    duchess
    queen
    hare
    bill
    king
    rabbit

  50. Get – Binary Search for Key

  51. Get – Binary Search for Key

  52. Get – Binary Search for Key

  53. Get – Binary Search for Key

  54. Get – Binary Search for Key
    K/V pair

  55. Iterate Key-Value Pairs – For Loop
    K/V pair

  56. Iterate Key-Value Pairs – For Loop
    K/V pair
    for-loop i++
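
The sorted-batch idea on the preceding slides (sort the keys once, binary-search for Get, walk forward with a for loop to iterate) can be sketched roughly as below. This is only an illustration: the `kv` and `sortedBatch` types and their method names are hypothetical, keys are assumed to be unique within a batch, and it is not the layout the talk arrives at later.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// kv is a hypothetical key/value pair used only in this sketch.
type kv struct {
	key, val []byte
}

// sortedBatch is a slice of pairs kept in ascending key order.
type sortedBatch []kv

// newSortedBatch copies the incoming pairs and sorts the copy by key.
func newSortedBatch(pairs []kv) sortedBatch {
	s := append(sortedBatch(nil), pairs...)
	sort.Slice(s, func(i, j int) bool { return bytes.Compare(s[i].key, s[j].key) < 0 })
	return s
}

// get binary-searches for key and reports whether it was found.
func (s sortedBatch) get(key []byte) ([]byte, bool) {
	i := sort.Search(len(s), func(i int) bool { return bytes.Compare(s[i].key, key) >= 0 })
	if i < len(s) && bytes.Equal(s[i].key, key) {
		return s[i].val, true
	}
	return nil, false
}

// scan visits every pair with start <= key < end, in key order.
func (s sortedBatch) scan(start, end []byte, visit func(key, val []byte)) {
	i := sort.Search(len(s), func(i int) bool { return bytes.Compare(s[i].key, start) >= 0 })
	for ; i < len(s) && bytes.Compare(s[i].key, end) < 0; i++ {
		visit(s[i].key, s[i].val)
	}
}

func main() {
	b := newSortedBatch([]kv{
		{[]byte("rabbit"), []byte("late")},
		{[]byte("cat"), []byte("grinning")},
		{[]byte("hatter"), []byte("mad")},
	})
	if v, ok := b.get([]byte("rabbit")); ok {
		fmt.Printf("rabbit = %s\n", v)
	}
	b.scan([]byte("a"), []byte("i"), func(k, v []byte) {
		fmt.Printf("%s = %s\n", k, v)
	})
}
```

Once the batch is sorted, both operations only read from it, which is what lets a sorted batch be treated as immutable.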

  57. Batch → Segment
    Unsorted, Mutable → Execute → Sorted, Immutable

  58. type segment struct {
    data []byte
    meta []uint64
    }

  59. Set( []byte("name"), []byte("marty") )
    data []byte
    0 20

  60. Set( []byte("name"), []byte("marty") )
    data []byte
    Append Key and Value []byte to data slice
    0 20

  61. Set( []byte("name"), []byte("marty") )
    data []byte
    Append Key and Value []byte to data slice
    name
    0 20 24

  62. Set( []byte("name"), []byte("marty") )
    data []byte
    Append Key and Value []byte to data slice
    name marty
    0 20 24 29

  63. Set( []byte("name"), []byte("marty") )
    data []byte name marty
    0 20 24 29
    Build 2 uint64s of metadata

  64. Set( []byte("name"), []byte("marty") )
    data []byte name marty
    0 20 24 29
    Build 2 uint64s of metadata
    20
    Data Offset
    64 bits

  65. Set( []byte("name"), []byte("marty") )
    data []byte name marty
    0 20 24 29
    Build 2 uint64s of metadata
    20
    Data Offset
    Operation
    4 bits
    s
    64 bits

  66. Set( []byte("name"), []byte("marty") )
    data []byte name marty
    0 20 24 29
    Build 2 uint64s of metadata
    20
    Data Offset
    Operation
    4 bits 24 bits
    s 4
    64 bits
    Key Len

  67. Set( []byte("name"), []byte("marty") )
    data []byte name marty
    0 20 24 29
    Build 2 uint64s of metadata
    20
    Data Offset
    Operation
    4 bits 24 bits 28 bits
    s 4 5
    64 bits
    Key Len
    Val Len

  68. Set( []byte("name"), []byte("marty") )
    data []byte name marty
    0 20 24 29
    Build 2 uint64s of metadata
    Word 1 (64 bits): Data Offset = 20
    Word 2 (64 bits): Operation = s (4 bits), Key Len = 4 (24 bits), Val Len = 5 (28 bits), – (8 bits)

  69. Set( []byte("name"), []byte("marty") )
    data []byte name marty
    0 20 24 29
    20 s 4 5
    meta []uint64
    Append uint64s to meta slice
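
A minimal sketch of the `Set` walkthrough above: append the key and value bytes to `data`, then append two `uint64` words to `meta`. The field widths (4, 24, 28 and 8 bits) come from the slides, but the exact bit positions, the operation codes `opSet`/`opDel`, and the helper names are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	opSet uint64 = 1 // hypothetical operation codes
	opDel uint64 = 2

	maxKeyLen = 1<<24 - 1 // key length must fit in 24 bits
	maxValLen = 1<<28 - 1 // value length must fit in 28 bits
)

type segment struct {
	data []byte
	meta []uint64
}

// packLens packs op (top 4 bits), key length (24 bits) and value length
// (28 bits) into one word; the low 8 bits stay zero. Bit order is assumed.
func packLens(op uint64, keyLen, valLen int) uint64 {
	return op<<60 | uint64(keyLen)<<36 | uint64(valLen)<<8
}

// unpackLens reverses packLens.
func unpackLens(w uint64) (op uint64, keyLen, valLen int) {
	return w >> 60, int(w >> 36 & maxKeyLen), int(w >> 8 & maxValLen)
}

// set appends key and value to data and two metadata words to meta.
func (s *segment) set(key, val []byte) error {
	if len(key) > maxKeyLen || len(val) > maxValLen {
		return errors.New("key or value too large for metadata word")
	}
	offset := uint64(len(s.data))
	s.data = append(s.data, key...)
	s.data = append(s.data, val...)
	s.meta = append(s.meta, offset, packLens(opSet, len(key), len(val)))
	return nil
}

// entry decodes the i-th operation back into op, key and value.
func (s *segment) entry(i int) (uint64, []byte, []byte) {
	offset := int(s.meta[2*i])
	op, keyLen, valLen := unpackLens(s.meta[2*i+1])
	return op, s.data[offset : offset+keyLen], s.data[offset+keyLen : offset+keyLen+valLen]
}

func main() {
	s := &segment{data: make([]byte, 20)} // 20 bytes already in use, as on the slide
	if err := s.set([]byte("name"), []byte("marty")); err != nil {
		panic(err)
	}
	op, key, val := s.entry(0)
	fmt.Println(s.meta[0], op, string(key), string(val), len(s.data))
}
```

With 20 bytes already in `data`, the key lands at offset 20, the value at 24, and the slice ends at 29, matching the offsets shown on the slide.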

  70. type segment struct {
    data []byte
    meta []uint64
    }

  71. type segment struct {
    data []byte
    meta []uint64
    }
    Sorting only shuffles integers, not bytes!

  72. type segment struct {
    data []byte
    meta []uint64
    }

  73. type segment struct {
    data []byte
    meta []uint64
    }
    implements…

  74. type segment struct {
    data []byte
    meta []uint64
    }
    implements…
    • Batch

  75. type segment struct {
    data []byte
    meta []uint64
    }
    implements…
    • Batch
    • Snapshot

  76. type segment struct {
    data []byte
    meta []uint64
    }
    implements…
    • Batch
    • Snapshot
    • Collection

  77. Another Batch?

  78. More Batches, a stack of Segments

  79. More Batches, a stack of Segments
    segmentStack
    []*segment

  80. More Batches, a stack of Segments
    Incoming Batch
    segmentStack
    []*segment

  81. More Batches, a stack of Segments
    Incoming Batch
    segmentStack
    sort in place
    []*segment

  82. More Batches, a stack of Segments
    segmentStack
    sort in place
    []*segment

  83. More Batches, a stack of Segments
    segmentStack
    []*segment

  84. More Batches, a stack of Segments
    segmentStack
    []*segment

  85. More Batches, a stack of Segments
    segmentStack
    []*segment

  86. Get
    segmentStack
    Newer Segments
    Shadow Older Ones

  87. Get
    segmentStack
    Newer Segments
    Shadow Older Ones
    Binary Search
    Down the Stack

  88. Get
    segmentStack
    Newer Segments
    Shadow Older Ones
    Binary Search
    Down the Stack

  89. Get
    segmentStack
    Newer Segments
    Shadow Older Ones
    Binary Search
    Down the Stack

  90. Get
    segmentStack
    Newer Segments
    Shadow Older Ones
    Binary Search
    Down the Stack

  91. Get
    segmentStack
    Newer Segments
    Shadow Older Ones
    Binary Search
    Down the Stack
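
The lookup described on the Get slides (binary search each segment from newest to oldest, so newer segments shadow older ones) might look like the sketch below. The `entry` type with string fields and the `deleted` tombstone flag are simplifications assumed here; the slides only establish that each operation carries an operation code.

```go
package main

import (
	"fmt"
	"sort"
)

// entry is a simplified record; deleted marks a hypothetical tombstone.
type entry struct {
	key, val string
	deleted  bool
}

// segment is an immutable slice of entries sorted by key.
type segment []entry

// find binary-searches one segment.
func (s segment) find(key string) (entry, bool) {
	i := sort.Search(len(s), func(i int) bool { return s[i].key >= key })
	if i < len(s) && s[i].key == key {
		return s[i], true
	}
	return entry{}, false
}

// segmentStack holds segments with index 0 as the oldest.
type segmentStack []segment

// get walks from the newest segment down; the first hit wins.
func (ss segmentStack) get(key string) (string, bool) {
	for i := len(ss) - 1; i >= 0; i-- {
		if e, ok := ss[i].find(key); ok {
			if e.deleted {
				return "", false // a newer delete hides older values
			}
			return e.val, true
		}
	}
	return "", false
}

func main() {
	stack := segmentStack{
		{{key: "hatter", val: "mad"}, {key: "head", val: "attached"}, {key: "rabbit", val: "early"}},
		{{key: "rabbit", val: "late"}},
		{{key: "head", deleted: true}},
	}
	for _, k := range []string{"hatter", "head", "rabbit"} {
		v, ok := stack.get(k)
		fmt.Println(k, v, ok)
	}
}
```

The cost of a lookup grows with the number of segments, which is the motivation for the merger introduced a few slides later.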

  92. Iteration - using container/heap
    segmentStack
    Iterators for each segment
    alice
    cat
    hare

  93. Iteration - using container/heap
    segmentStack
    Iterators for each segment
    alice
    cat hare

  94. Iteration - using container/heap
    segmentStack
    Iterator returns 'alice'
    dodo
    cat hare

  95. Iteration - using container/heap
    segmentStack
    Use heap Fix() method
    cat
    hare
    dodo

  96. Iteration - using container/heap
    segmentStack
    Iterator returns 'cat'
    king
    hare
    dodo
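
The heap-based iteration on these slides can be sketched with `container/heap`: keep one cursor per segment in a min-heap ordered by current key, emit the top, advance it, and call `heap.Fix` to restore order. The `cursor` type, the `age` tie-breaker (so the newest segment wins on equal keys) and the keys-only output are assumptions made to keep the example short.

```go
package main

import (
	"container/heap"
	"fmt"
)

// cursor walks the sorted keys of one segment.
type cursor struct {
	keys []string
	pos  int
	age  int // larger means newer segment (hypothetical tie-breaker)
}

func (c *cursor) key() string { return c.keys[c.pos] }

// cursorHeap orders cursors by current key, newest segment first on ties.
type cursorHeap []*cursor

func (h cursorHeap) Len() int { return len(h) }
func (h cursorHeap) Less(i, j int) bool {
	if h[i].key() != h[j].key() {
		return h[i].key() < h[j].key()
	}
	return h[i].age > h[j].age
}
func (h cursorHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *cursorHeap) Push(x interface{}) { *h = append(*h, x.(*cursor)) }
func (h *cursorHeap) Pop() interface{} {
	old := *h
	c := old[len(old)-1]
	*h = old[:len(old)-1]
	return c
}

// mergeKeys returns every distinct key across segments in sorted order.
// segments[0] is the oldest segment.
func mergeKeys(segments [][]string) []string {
	h := &cursorHeap{}
	for age, keys := range segments {
		if len(keys) > 0 {
			*h = append(*h, &cursor{keys: keys, age: age})
		}
	}
	heap.Init(h)
	var out []string
	for h.Len() > 0 {
		top := (*h)[0]
		if k := top.key(); len(out) == 0 || out[len(out)-1] != k {
			out = append(out, k) // first occurrence comes from the newest segment
		}
		top.pos++
		if top.pos < len(top.keys) {
			heap.Fix(h, 0) // the top cursor moved; restore heap order
		} else {
			heap.Pop(h) // this segment is exhausted
		}
	}
	return out
}

func main() {
	fmt.Println(mergeKeys([][]string{
		{"alice", "dodo"},
		{"cat", "king"},
		{"hare", "rabbit"},
	}))
}
```

Using `heap.Fix` on the root avoids a separate pop and push for every key returned.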

  97. Atomic Batches
    segmentStack

  98. Isolated Snapshots
    collection
    segmentStack

  99. Isolated Snapshots
    collection
    segmentStack

  100. Isolated Snapshots
    collection
    segmentStack
    snapshot
    segmentStack

  101. Isolated Snapshots
    collection
    segmentStack
    snapshot
    segmentStack

  102. Isolated Snapshots
    collection
    segmentStack
    snapshot
    segmentStack
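
Because segments are immutable, an isolated snapshot can be as cheap as holding on to the current slice of segment pointers. The sketch below assumes a mutex-guarded `collection` that builds a fresh slice on every batch (copy-on-write), so a snapshot taken earlier never observes later pushes. The type and method names are hypothetical and the segment contents are elided.

```go
package main

import (
	"fmt"
	"sync"
)

// segment stands in for an immutable sorted segment; contents elided.
type segment struct{ name string }

// collection guards the current stack with a mutex.
type collection struct {
	mu    sync.Mutex
	stack []*segment
}

// executeBatch publishes a new segment atomically by swapping in a new slice.
// The old slice is never modified, so existing snapshots are unaffected.
func (c *collection) executeBatch(s *segment) {
	c.mu.Lock()
	defer c.mu.Unlock()
	next := make([]*segment, len(c.stack)+1)
	copy(next, c.stack)
	next[len(c.stack)] = s
	c.stack = next
}

// snapshot returns the stack as it is right now.
func (c *collection) snapshot() []*segment {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.stack
}

func main() {
	c := &collection{}
	c.executeBatch(&segment{name: "batch-1"})
	snap := c.snapshot()
	c.executeBatch(&segment{name: "batch-2"}) // concurrent mutation, not seen by snap
	fmt.Println(len(snap), len(c.snapshot()))
}
```

The same swap is what makes a batch atomic in this sketch: readers see either the stack without the new segment or the stack with all of it.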

  103. Stack
    Too
    Tall?

  104. Background Merger
    segmentStack

  105. Background Merger
    segmentStack
    Iterate over Segments

  106. Background Merger
    segmentStack
    Iterate over Segments
    Build up New Segment

  107. Background Merger
    segmentStack
    Build up New Segment
    Iterate over Segments

  108. Background Merger
    segmentStack
    Swap in New Segment

  109. Background Merger
    segmentStack
    Swap in New Segment

  110. Background Merger
    segmentStack
    Swap in New Segment

  111. Background Merger
    segmentStack
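
A rough sketch of the merger steps above: iterate over existing segments, build one new sorted segment in which newer values win, then swap it in for the segments it replaces. The `entry` type and the function names are hypothetical, deletes are left out, and the sketch assumes that segments pushed while the merge was running must be kept on top of the merged result.

```go
package main

import "fmt"

type entry struct{ key, val string }

// mergeSegments merges two sorted segments; newer wins on equal keys.
func mergeSegments(older, newer []entry) []entry {
	out := make([]entry, 0, len(older)+len(newer))
	i, j := 0, 0
	for i < len(older) && j < len(newer) {
		switch {
		case older[i].key < newer[j].key:
			out = append(out, older[i])
			i++
		case older[i].key > newer[j].key:
			out = append(out, newer[j])
			j++
		default: // same key: keep the newer value, drop the older one
			out = append(out, newer[j])
			i++
			j++
		}
	}
	out = append(out, older[i:]...)
	return append(out, newer[j:]...)
}

// mergeAll folds a stack (index 0 oldest) into a single segment.
func mergeAll(stack [][]entry) []entry {
	var merged []entry
	for _, seg := range stack {
		merged = mergeSegments(merged, seg)
	}
	return merged
}

// swapIn replaces the n oldest segments with merged and keeps any newer
// segments that arrived while the merge was running.
func swapIn(stack [][]entry, n int, merged []entry) [][]entry {
	out := make([][]entry, 0, len(stack)-n+1)
	out = append(out, merged)
	return append(out, stack[n:]...)
}

func main() {
	stack := [][]entry{
		{{"hatter", "mad"}, {"rabbit", "early"}},
		{{"rabbit", "late"}},
	}
	merged := mergeAll(stack[:2])
	stack = append(stack, []entry{{"cat", "grinning"}}) // arrives mid-merge
	stack = swapIn(stack, 2, merged)
	fmt.Println(len(stack), stack[0])
}
```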

  112. It Works!
    But…

  113. Append-Only File Format
    File

  114. Append-Only File Format
    File Data

  115. Append-Only File Format
    File Data

  116. Append-Only File Format
    File Data Data

  117. Append-Only File Format
    File Data Data

  118. Append-Only File Format
    File Data Data
    Simple

  119. Append-Only File Format
    File Data Data
    Simple
    Safe

  120. Header
    Header

  121. Persist each Segment in the Segment Stack
    Header
    meta []uint64
    data []byte
    func (f *File) Write(b []byte) (n int, err error)

  122. func Uint64SliceToByteSlice(in []uint64) ([]byte, error) {
    inHeader := (*reflect.SliceHeader)(unsafe.Pointer(&in))
    var out []byte
    outHeader := (*reflect.SliceHeader)(unsafe.Pointer(&out))
    outHeader.Data = inHeader.Data
    outHeader.Len = inHeader.Len * 8
    outHeader.Cap = inHeader.Cap * 8
    return out, nil
    }
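
For comparison, here is a sketch of two alternatives to the `reflect.SliceHeader` conversion above. `uint64sAsBytes` does the same zero-copy reinterpretation using `unsafe.Slice` (available from Go 1.17, so it postdates the talk) and writes whatever byte order the host uses. `encodeLittleEndian` copies through `encoding/binary` and produces the same bytes on every platform. Both function names are made up for this example.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unsafe"
)

// uint64sAsBytes returns a byte view over the same memory as in.
// No copy is made; the byte order is whatever the host CPU uses.
func uint64sAsBytes(in []uint64) []byte {
	if len(in) == 0 {
		return nil
	}
	return unsafe.Slice((*byte)(unsafe.Pointer(&in[0])), len(in)*8)
}

// encodeLittleEndian copies in into a new buffer with a fixed byte order.
func encodeLittleEndian(in []uint64) []byte {
	out := make([]byte, 8*len(in))
	for i, v := range in {
		binary.LittleEndian.PutUint64(out[8*i:], v)
	}
	return out
}

func main() {
	meta := []uint64{20, 4}
	view := uint64sAsBytes(meta)
	fmt.Println(len(view))                // 8 bytes per word
	fmt.Println(encodeLittleEndian(meta)) // identical on every platform
}
```

The view aliases the original slice, so it is only valid while `meta` is alive and unchanged in length. The portable version pays for a copy, which is the trade-off behind the "not portable" note on the following slides.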

  123. Persist the Segment Stack
    Header
    meta []uint64
    data []byte
    []byte
    func (f *File) Write(b []byte) (n int, err error)

  124. Persist the Segment Stack
    Header
    meta []uint64
    data []byte
    []byte
    not
    portable
    func (f *File) Write(b []byte) (n int, err error)

  125. Append Footer
    Header Footer

  126. Append Footer
    Header Footer

  127. Footer Pointers to Older Segments
    Header Footer

  128. Footer Pointers to Older Segments
    Header Footer

  129. Footer Pointers to Older Segments
    Header Footer

  130. Footer Pointers to Older Segments
    Header Footer Footer

  131. Opening File
    Header Footer Footer
    EOF
    Seek Backwards
    Find Start of Valid Footer

  132. Opening File
    Header Footer Footer
    EOF
    Seek Backwards
    Find Start of Valid Footer

  133. Opening File
    Header Footer Footer
    EOF
    Seek Backwards
    Find Start of Valid Footer

  134. Opening File
    Header Footer Footer
    EOF
    Seek Backwards
    Find Start of Valid Footer
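
The recovery step on the "Opening File" slides (start at EOF, seek backwards, find the start of a valid footer) could work along the lines below. The slides do not describe the footer encoding, so everything about the layout here is invented for illustration: an 8-byte magic marker, a 4-byte payload length, the payload, and a CRC-32 of the payload. The file is modelled as a byte slice to keep the example self-contained.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

var magic = []byte("FOOTER!!") // hypothetical marker, not the real format

// appendFooter appends magic, payload length, payload and checksum.
func appendFooter(file, payload []byte) []byte {
	var n, sum [4]byte
	binary.LittleEndian.PutUint32(n[:], uint32(len(payload)))
	binary.LittleEndian.PutUint32(sum[:], crc32.ChecksumIEEE(payload))
	file = append(file, magic...)
	file = append(file, n[:]...)
	file = append(file, payload...)
	return append(file, sum[:]...)
}

// lastValidFooter scans backwards from the end of the file and returns the
// payload of the newest footer whose length and checksum both check out.
func lastValidFooter(file []byte) ([]byte, bool) {
	end := len(file)
	for {
		i := bytes.LastIndex(file[:end], magic)
		if i < 0 {
			return nil, false
		}
		rest := file[i+len(magic):]
		if len(rest) >= 4 {
			n := uint64(binary.LittleEndian.Uint32(rest))
			if uint64(len(rest)) >= 8+n {
				payload := rest[4 : 4+n]
				if binary.LittleEndian.Uint32(rest[4+n:]) == crc32.ChecksumIEEE(payload) {
					return payload, true
				}
			}
		}
		end = i + len(magic) - 1 // torn or corrupt footer: keep looking earlier
	}
}

func main() {
	file := []byte("HEADER")
	file = appendFooter(file, []byte("v1"))
	file = append(file, "segment bytes"...)
	file = appendFooter(file, []byte("v2"))

	p, _ := lastValidFooter(file)
	fmt.Printf("%s\n", p)
	p, _ = lastValidFooter(file[:len(file)-2]) // simulate a torn final write
	fmt.Printf("%s\n", p)
}
```

In an append-only file, a partially written footer at the end is skipped and the previous complete footer still describes a consistent state.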

  135. Reading From File
    meta []byte data []byte
    mmap segment bytes
    os
    https://github.com/edsrzf/mmap-go

  136. Reading From File
    meta []byte data []byte
    mmap segment bytes
    os
    []uint64

  137. In-memory and Disk Segments
    Get/Iterator implementations the same, no new code
    collection
    segmentStack
    []byte
    []byte
    mmap

  138. Compaction

  139. Reuse Background Merger to Compact Files
    Header Footer Footer
    data-0000123.moss

  140. Reuse Background Merger to Compact Files
    Header Footer Footer
    data-0000123.moss
    Header
    data-0000124.moss

  141. Reuse Background Merger to Compact Files
    Header Footer Footer
    data-0000123.moss
    Header
    data-0000124.moss

  142. Reuse Background Merger to Compact Files
    Header Footer Footer
    data-0000123.moss
    Header Footer
    data-0000124.moss

  143. Reuse Background Merger to Compact Files
    Header Footer
    data-0000124.moss

  144. Who
    Are
    You?

  147. MOSS

  148. Memory
    Oriented
    Sorted
    Segments

  151. Approachable Codebase
    Code
    39%
    Tests
    61%
    • Code - 4262
    • Test - 6580

  152. High
    Performance

  153. High Performance - Minimize GC Impact
    type segment struct {
    data []byte
    meta []uint64
    }

  154. High Performance - Minimize GC Impact
    • Slices of uint64 and byte
    type segment struct {
    data []byte
    meta []uint64
    }

  155. High Performance - Minimize GC Impact
    • Slices of uint64 and byte
    • Integer offsets into slices, not pointers
    type segment struct {
    data []byte
    meta []uint64
    }
    data []byte name marty
    20 s 4 5 –
    meta []uint64

  156. High Performance - Memory Allocation
    type segment struct {
    data []byte
    meta []uint64
    }

  157. High Performance - Memory Allocation
    • 2 uints of meta per op
    type segment struct {
    data []byte
    meta []uint64
    }

  158. High Performance - Memory Allocation
    • 2 uints of meta per op
    • len(data) = len(key) + len(val)
    type segment struct {
    data []byte
    meta []uint64
    }

  159. High Performance - Memory Allocation
    type Collection interface {
    NewBatch(ops, bytes int) (Batch, error)
    }

  160. High Performance - Memory Allocation
    type Batch interface {
    AllocSet(key, val []byte) error
    AllocDel(key []byte) error
    }
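
The memory-allocation slides suggest sizing a batch up front: two metadata words per operation and `len(key) + len(val)` bytes of data. A sketch of that idea follows. The `batch` type, `newBatch`, the `set` method and the simplified metadata word are assumptions for illustration, not the library's actual `NewBatch` or `AllocSet` implementation.

```go
package main

import (
	"errors"
	"fmt"
)

var errBatchFull = errors.New("batch: preallocated space exhausted")

// batch holds two slices that are allocated once, up front.
type batch struct {
	data []byte
	meta []uint64
}

// newBatch reserves room for ops operations and totalBytes of key/value data.
func newBatch(ops, totalBytes int) *batch {
	return &batch{
		data: make([]byte, 0, totalBytes),
		meta: make([]uint64, 0, 2*ops), // 2 uint64s of meta per op
	}
}

// set copies key and val into the preallocated buffers and never grows them.
func (b *batch) set(key, val []byte) error {
	if len(b.meta)+2 > cap(b.meta) || len(b.data)+len(key)+len(val) > cap(b.data) {
		return errBatchFull
	}
	// Simplified metadata word: key length in the high half, value length in the low half.
	b.meta = append(b.meta, uint64(len(b.data)), uint64(len(key))<<32|uint64(len(val)))
	b.data = append(b.data, key...)
	b.data = append(b.data, val...)
	return nil
}

func main() {
	b := newBatch(1, len("name")+len("marty"))
	fmt.Println(b.set([]byte("name"), []byte("marty")))
	fmt.Println(b.set([]byte("extra"), []byte("x")))
	fmt.Println(len(b.data), cap(b.data), len(b.meta), cap(b.meta))
}
```

Since the caller states the number of operations and bytes in advance, filling the batch costs two allocations in total, regardless of how many keys it holds.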

  161. High Performance - Unsafe

  162. High Performance - Unsafe
    • Faster serialization, but loss of portability

  163. High Performance - Unsafe
    • Faster serialization, but loss of portability
    • Deeply unsatisfying
    https://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html
    Rob Pike – The byte order fallacy
    "Whenever I see code that asks what the native byte order is, it's almost certain
    the code is either wrong or misguided."

  166. One Key Written
    Entire Store Rewritten

  167. Conclusions

  168. Moss Competitor

  169. Moss Competitor
    @mossbro
    Moss Fanboy
    Moss is the best, fastest!
    #winning

  170. Moss Competitor
    @mossbro
    Moss Fanboy
    Moss is the best, fastest!
    #winning
    @mossbro
    Moss Fanboy
    Competition is slow. Sad!

  171. Moss Competitor
    @mossbro
    Moss Fanboy
    Moss is the best, fastest!
    #winning
    @mossbro
    Moss Fanboy
    Competition is slow. Sad!

  172. moss
    Write Throughput
    Read after Write Latency

  173. Implementation
    Benchmark
    Analyze and Profile

  174. Epilogue
    Steve Yen

  175. Couchbase
    Moss
    Team
    Abhinav Dangeti
    Sundar Sridharan
    Sreekanth Sivasankaran
    Scott Lashley
    Marty Schoch
    Alex Gyryk
    Aruna Piravi

  176. Thanks
    github.com/couchbase/moss
    [email protected]
    @mschoch

  177. Thanks
    github.com/couchbase/moss
    [email protected]
    @mschoch
