Slide 1

Slide 1 text

Marty Schoch GopherCon 2017

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

The Problem

Slide 4

Slide 4 text

Primary Data Source

Slide 5

Slide 5 text

Index Primary Data Source

Slide 6

Slide 6 text

Index Primary Data Source Search

Slide 7

Slide 7 text

Key/Value Store

Slide 8

Slide 8 text

General Purpose Key-Value Stores Key Value [ ] byte [ ] byte

Slide 9

Slide 9 text

General Purpose Key-Value Stores Key Value hatter mad head attached rabbit early Get / Set / Delete Values by Key

Slide 10

Slide 10 text

General Purpose Key-Value Stores Key Value hatter mad head attached rabbit early Get / Set / Delete Values by Key GET hatter → mad

Slide 11

Slide 11 text

General Purpose Key-Value Stores Key Value hatter mad head attached rabbit early late Get / Set / Delete Values by Key SET rabbit late

Slide 12

Slide 12 text

General Purpose Key-Value Stores Key Value hatter mad rabbit late head attached Get / Set / Delete Values by Key DELETE head

Slide 13

Slide 13 text

General Purpose Key-Value Stores Key Value hatter mad rabbit late Iterate Ranges of Key-Value Pairs

Slide 14

Slide 14 text

General Purpose Key-Value Stores Key Value hatter mad rabbit late Iterate Ranges of Key-Value Pairs

Slide 15

Slide 15 text

General Purpose Key-Value Stores Key Value hatter mad rabbit late Atomic Batch Updates tea party cat grinning hare march

Slide 16

Slide 16 text

General Purpose Key-Value Stores Key Value cat grinning hare march hatter mad rabbit late tea party Atomic Batch Updates

Slide 17

Slide 17 text

General Purpose Key-Value Stores Isolated Read Snapshots Iterator Started Key Value cat grinning hare march hatter mad rabbit late tea party

Slide 18

Slide 18 text

General Purpose Key-Value Stores Isolated Read Snapshots Iterator Started Key Value cat grinning caterpillar smoking hare march hatter mad rabbit late tea party Concurrent Mutation

Slide 19

Slide 19 text

General Purpose Key-Value Stores Isolated Read Snapshots Iterator Started Key Value cat grinning caterpillar smoking hare march hatter mad rabbit late tea party Concurrent Mutation (not seen)

Slide 20

Slide 20 text

General Purpose Key-Value Stores Persistence to Disk

Slide 21

Slide 21 text

Off-The-Shelf Solutions https://github.com/boltdb/bolt

Slide 22

Slide 22 text

Off-The-Shelf Solutions BoltDB? https://github.com/boltdb/bolt

Slide 23

Slide 23 text

Off-The-Shelf Solutions
BoltDB?
• Go
• b+tree
• Great Read Performance
https://github.com/boltdb/bolt

Slide 24

Slide 24 text

Off-The-Shelf Solutions RocksDB? https://github.com/facebook/rocksdb/

Slide 25

Slide 25 text

Off-The-Shelf Solutions
RocksDB?
• C++
• LSM (leveldb)
• Better Read/Write Perf Balance
• cgo
https://github.com/facebook/rocksdb/

Slide 26

Slide 26 text

Off-The-Shelf Solutions GoLevelDB? https://github.com/syndtr/goleveldb

Slide 27

Slide 27 text

Off-The-Shelf Solutions
GoLevelDB?
• Go
• LSM (leveldb)
• Unable to Tune Adequately
https://github.com/syndtr/goleveldb

Slide 28

Slide 28 text

Off-The-Shelf Solutions Badger? https://github.com/dgraph-io/badger

Slide 29

Slide 29 text

Off-The-Shelf Solutions
Badger?
• Go
• WiscKey
• Not Available at the Time
https://github.com/dgraph-io/badger

Slide 30

Slide 30 text

Build our Own Key-Value Store?

Slide 31

Slide 31 text

Simplicity Damian Gryski - Slices https://go-talks.appspot.com/github.com/dgryski/talks/dotgo-2016/slices.slide Performance through cache friendliness

Slide 32

Slide 32 text

Simplicity
Damian Gryski - Slices
https://go-talks.appspot.com/github.com/dgryski/talks/dotgo-2016/slices.slide
Performance through cache friendliness
"Rule 3. Fancy algorithms are slow when n is small, and n is usually small. Fancy algorithms have big constants. Until you know that n is frequently going to be big, don't get fancy.
Rule 4. Fancy algorithms are buggier than simple ones, and they're much harder to implement. Use simple algorithms as well as simple data structures."
"Notes on Programming in C" (Rob Pike, 1989)

Slide 33

Slide 33 text

Simplicity Dave Cheney - Simplicity and Collaboration https://dave.cheney.net/2015/03/08/simplicity-and-collaboration

Slide 34

Slide 34 text

Simplicity Dave Cheney - Simplicity and Collaboration https://dave.cheney.net/2015/03/08/simplicity-and-collaboration "Simplicity cannot be added later"

Slide 35

Slide 35 text

Special Purpose Key-Value Store Index

Slide 36

Slide 36 text

Special Purpose Key-Value Store Index Write Throughput

Slide 37

Slide 37 text

Special Purpose Key-Value Store Index Write Throughput Read after Write Latency

Slide 38

Slide 38 text

Special Purpose Key-Value Store Index Write Throughput Read after Write Latency

Slide 39

Slide 39 text

Special Purpose Key-Value Store Index Write Throughput Read after Write Latency • Persistence Decoupled from Read and Write

Slide 40

Slide 40 text

Special Purpose Key-Value Store
Index: Write Throughput, Read after Write Latency
• Persistence Decoupled from Read and Write
• Willing to use All System RAM*
* Bounded by some Quota

Slide 41

Slide 41 text

The Rabbit Hole

Slide 42

Slide 42 text

type Collection interface {
	ExecuteBatch(b Batch) error
	Snapshot() (Snapshot, error)
}

Slide 43

Slide 43 text

type Batch interface {
	Set(key, value []byte) error
	Delete(key []byte) error
}

Slide 44

Slide 44 text

Batch of Changes

Slide 45

Slide 45 text

Batch of Changes alice dodo knave cat duchess queen hare bill king rabbit

Slide 46

Slide 46 text

type Snapshot interface {
	Get(key []byte) ([]byte, error)
	Iterator(start, end []byte) (Iterator, error)
}

Slide 47

Slide 47 text

Batch of Changes alice dodo knave cat duchess queen hare bill king rabbit

Slide 48

Slide 48 text

Batch of Changes alice dodo knave cat duchess queen hare bill king rabbit

Slide 49

Slide 49 text

Sort the Keys alice dodo knave cat duchess queen hare bill king rabbit

Slide 50

Slide 50 text

Get – Binary Search for Key

Slide 51

Slide 51 text

Get – Binary Search for Key

Slide 52

Slide 52 text

Get – Binary Search for Key

Slide 53

Slide 53 text

Get – Binary Search for Key

Slide 54

Slide 54 text

Get – Binary Search for Key K/V pair

Slide 55

Slide 55 text

Iterate Key-Value Pairs – For Loop K/V pair

Slide 56

Slide 56 text

Iterate Key-Value Pairs – For Loop K/V pair for-loop i++

Slide 57

Slide 57 text

Batch → Segment Unsorted Mutable Execute Sorted Immutable

Slide 58

Slide 58 text

type segment struct {
	data []byte
	meta []uint64
}

Slide 59

Slide 59 text

Set( []byte("name"), []byte("marty") ) data []byte 0 20

Slide 60

Slide 60 text

Set( []byte("name"), []byte("marty") ) data []byte Append Key and Value []byte to data slice 0 20

Slide 61

Slide 61 text

Set( []byte("name"), []byte("marty") ) data []byte Append Key and Value []byte to data slice name 0 20 24

Slide 62

Slide 62 text

Set( []byte("name"), []byte("marty") ) data []byte Append Key and Value []byte to data slice name marty 0 20 24 29

Slide 63

Slide 63 text

Set( []byte("name"), []byte("marty") ) data []byte name marty 0 20 24 29 Build 2 uint64s of metadata

Slide 64

Slide 64 text

Set( []byte("name"), []byte("marty") ) data []byte name marty 0 20 24 29 Build 2 uint64s of metadata 20 Data Offset 64 bits

Slide 65

Slide 65 text

Set( []byte("name"), []byte("marty") ) data []byte name marty 0 20 24 29 Build 2 uint64s of metadata 20 Data Offset Operation 4 bits s 64 bits

Slide 66

Slide 66 text

Set( []byte("name"), []byte("marty") ) data []byte name marty 0 20 24 29 Build 2 uint64s of metadata 20 Data Offset Operation 4 bits 24 bits s 4 64 bits Key Len

Slide 67

Slide 67 text

Set( []byte("name"), []byte("marty") ) data []byte name marty 0 20 24 29 Build 2 uint64s of metadata 20 Data Offset Operation 4 bits 24 bits 28 bits s 4 5 64 bits Key Len Val Len

Slide 68

Slide 68 text

Set( []byte("name"), []byte("marty") )
data []byte: "name" at offsets 20 to 24, "marty" at offsets 24 to 29
Build 2 uint64s of metadata:
• Data Offset (64 bits): 20
• Operation (4 bits): s, Key Len (24 bits): 4, Val Len (28 bits): 5, – (8 bits)
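The packed metadata word on this slide can be sketched with shifts and masks. The field widths (4, 24, 28, and 8 unused bits) come from the slide; the exact bit positions, the numeric operation codes, and the names packMeta and unpackMeta are assumptions made for illustration and may not match the moss source.

```go
package main

// Field widths as shown on the slide. The placement of each field inside
// the word (operation in the top bits, 8 unused low bits) is an assumption.
const (
	opBits  = 4
	keyBits = 24
	valBits = 28

	valShift = 8                  // skip the 8 unused low bits
	keyShift = valShift + valBits // 36
	opShift  = keyShift + keyBits // 60
)

// Hypothetical operation codes; the slide only shows "s" for a set.
const (
	opSet uint64 = 1
	opDel uint64 = 2
)

// packMeta combines the operation and both lengths into one uint64.
// Callers must keep keyLen below 1<<24 and valLen below 1<<28.
func packMeta(op uint64, keyLen, valLen int) uint64 {
	return op<<opShift | uint64(keyLen)<<keyShift | uint64(valLen)<<valShift
}

// unpackMeta reverses packMeta.
func unpackMeta(m uint64) (op uint64, keyLen, valLen int) {
	op = (m >> opShift) & (1<<opBits - 1)
	keyLen = int((m >> keyShift) & (1<<keyBits - 1))
	valLen = int((m >> valShift) & (1<<valBits - 1))
	return op, keyLen, valLen
}
```

For the slide's example, packMeta(opSet, 4, 5) would sit next to a second uint64 holding the data offset 20.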

Slide 69

Slide 69 text

Set( []byte("name"), []byte("marty") ) data []byte name marty 0 20 24 29 20 s 4 5 meta []uint64 Append uint64s to meta slice

Slide 70

Slide 70 text

type segment struct { data []byte meta []uint64 }

Slide 71

Slide 71 text

type segment struct {
	data []byte
	meta []uint64
}
Sorting only shuffles integers, not bytes!

Slide 72

Slide 72 text

type segment struct { data []byte meta []uint64 }

Slide 73

Slide 73 text

type segment struct { data []byte meta []uint64 } implements…

Slide 74

Slide 74 text

type segment struct { data []byte meta []uint64 } implements… • Batch

Slide 75

Slide 75 text

type segment struct { data []byte meta []uint64 } implements… • Batch • Snapshot

Slide 76

Slide 76 text

type segment struct {
	data []byte
	meta []uint64
}
implements…
• Batch
• Snapshot
• Collection

Slide 77

Slide 77 text

Another Batch?

Slide 78

Slide 78 text

More Batches, a stack of Segments

Slide 79

Slide 79 text

More Batches, a stack of Segments segmentStack []*segment

Slide 80

Slide 80 text

More Batches, a stack of Segments Incoming Batch segmentStack []*segment

Slide 81

Slide 81 text

More Batches, a stack of Segments Incoming Batch segmentStack sort in place []*segment

Slide 82

Slide 82 text

More Batches, a stack of Segments segmentStack sort in place []*segment

Slide 83

Slide 83 text

More Batches, a stack of Segments segmentStack []*segment

Slide 84

Slide 84 text

More Batches, a stack of Segments segmentStack []*segment

Slide 85

Slide 85 text

More Batches, a stack of Segments segmentStack []*segment

Slide 86

Slide 86 text

Get segmentStack Newer Segments Shadow Older Ones

Slide 87

Slide 87 text

Get segmentStack Newer Segments Shadow Older Ones Binary Search Down the Stack

Slide 88

Slide 88 text

Get segmentStack Newer Segments Shadow Older Ones Binary Search Down the Stack

Slide 89

Slide 89 text

Get segmentStack Newer Segments Shadow Older Ones Binary Search Down the Stack

Slide 90

Slide 90 text

Get segmentStack Newer Segments Shadow Older Ones Binary Search Down the Stack

Slide 91

Slide 91 text

Get segmentStack Newer Segments Shadow Older Ones Binary Search Down the Stack
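The lookup described on these slides can be sketched as a loop from the newest segment to the oldest, with a binary search in each one. The types below (string keys, a del flag acting as a tombstone) are simplified stand-ins invented for this example, not the moss types.

```go
package main

import "sort"

// op is one entry in a segment. A del entry is a tombstone that hides
// any value for the same key in older segments.
type op struct {
	key string
	val string
	del bool
}

// segment is immutable and sorted by key.
type segment []op

// find binary-searches one segment for key.
func (s segment) find(key string) (op, bool) {
	i := sort.Search(len(s), func(i int) bool { return s[i].key >= key })
	if i < len(s) && s[i].key == key {
		return s[i], true
	}
	return op{}, false
}

// segmentStack holds segments oldest first; the last element is the newest.
type segmentStack []segment

// Get walks the stack from newest to oldest. The first segment that
// mentions the key decides the answer, so newer segments shadow older ones.
func (st segmentStack) Get(key string) (string, bool) {
	for i := len(st) - 1; i >= 0; i-- {
		if o, ok := st[i].find(key); ok {
			if o.del {
				return "", false
			}
			return o.val, true
		}
	}
	return "", false
}
```

With an older segment holding hatter, head, and rabbit, and a newer one that deletes head and sets rabbit to late, Get returns late for rabbit and reports head as missing.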

Slide 92

Slide 92 text

Iteration - using container/heap segmentStack Iterators for each segment alice cat hare

Slide 93

Slide 93 text

Iteration - using container/heap segmentStack Iterators for each segment alice cat hare

Slide 94

Slide 94 text

Iteration - using container/heap segmentStack Iterator returns 'alice' dodo cat hare

Slide 95

Slide 95 text

Iteration - using container/heap segmentStack Use heap Fix() method cat hare dodo

Slide 96

Slide 96 text

Iteration - using container/heap segmentStack Iterator returns 'cat' king hare dodo
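The heap-based iteration can be sketched as a k-way merge: one cursor per segment sits in a container/heap, the smallest key is at the top, and heap.Fix restores order after the top cursor advances. The kv and cursor types and the mergeStack function are hypothetical names for this example; ties on the same key are broken in favor of the newer segment.

```go
package main

import "container/heap"

type kv struct{ key, val string }

// cursor walks one sorted segment. level is the segment's position in the
// stack; a higher level means a newer segment.
type cursor struct {
	seg   []kv
	pos   int
	level int
}

type cursorHeap []*cursor

func (h cursorHeap) Len() int { return len(h) }
func (h cursorHeap) Less(i, j int) bool {
	a, b := h[i].seg[h[i].pos], h[j].seg[h[j].pos]
	if a.key != b.key {
		return a.key < b.key
	}
	return h[i].level > h[j].level // same key: newer segment first
}
func (h cursorHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *cursorHeap) Push(x interface{}) { *h = append(*h, x.(*cursor)) }
func (h *cursorHeap) Pop() interface{} {
	old := *h
	c := old[len(old)-1]
	*h = old[:len(old)-1]
	return c
}

// advance moves the top cursor forward, then repairs the heap with Fix,
// or removes the cursor when its segment is exhausted.
func advance(h *cursorHeap) {
	top := (*h)[0]
	top.pos++
	if top.pos < len(top.seg) {
		heap.Fix(h, 0)
	} else {
		heap.Pop(h)
	}
}

// mergeStack returns all pairs in key order; stack[0] is the oldest segment.
func mergeStack(stack [][]kv) []kv {
	h := &cursorHeap{}
	for level, seg := range stack {
		if len(seg) > 0 {
			*h = append(*h, &cursor{seg: seg, level: level})
		}
	}
	heap.Init(h)
	var out []kv
	for h.Len() > 0 {
		top := (*h)[0]
		cur := top.seg[top.pos]
		out = append(out, cur)
		advance(h)
		// Skip older entries shadowed by the key just emitted.
		for h.Len() > 0 && (*h)[0].seg[(*h)[0].pos].key == cur.key {
			advance(h)
		}
	}
	return out
}
```

The same loop could feed a background merger: instead of returning the pairs to a caller, it would append them to a new segment.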

Slide 97

Slide 97 text

Atomic Batches segmentStack

Slide 98

Slide 98 text

Isolated Snapshots collection segmentStack

Slide 99

Slide 99 text

Isolated Snapshots collection segmentStack

Slide 100

Slide 100 text

Isolated Snapshots collection segmentStack snapshot segmentStack

Slide 101

Slide 101 text

Isolated Snapshots collection segmentStack snapshot segmentStack

Slide 102

Slide 102 text

Isolated Snapshots collection segmentStack snapshot segmentStack
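One way to get the isolation pictured here, given that segments are immutable, is for a snapshot to keep its own view of the stack while the collection publishes a fresh slice on every push. This is a sketch of that idea under assumed names (collection, push, snapshot); it is not necessarily how moss manages its stacks.

```go
package main

import "sync"

// segment is a stand-in for an immutable sorted segment.
type segment struct{ keys []string }

type collection struct {
	mu    sync.Mutex
	stack []*segment // oldest first
}

// push publishes a new stack that includes s. The old backing array is
// never modified, so snapshots taken earlier keep seeing the old stack.
func (c *collection) push(s *segment) {
	c.mu.Lock()
	defer c.mu.Unlock()
	next := make([]*segment, len(c.stack), len(c.stack)+1)
	copy(next, c.stack)
	c.stack = append(next, s)
}

// snapshot returns the current stack. Callers must treat it as read-only.
func (c *collection) snapshot() []*segment {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.stack
}
```

Copying only the slice of pointers keeps a snapshot cheap: no key or value bytes are duplicated.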

Slide 103

Slide 103 text

Stack Too Tall?

Slide 104

Slide 104 text

Background Merger segmentStack

Slide 105

Slide 105 text

Background Merger segmentStack Iterate over Segments

Slide 106

Slide 106 text

Background Merger segmentStack Iterate over Segments Build up New Segment

Slide 107

Slide 107 text

Background Merger segmentStack Build up New Segment Iterate over Segments

Slide 108

Slide 108 text

Background Merger segmentStack Swap in New Segment

Slide 109

Slide 109 text

Background Merger segmentStack Swap in New Segment

Slide 110

Slide 110 text

Background Merger segmentStack Swap in New Segment

Slide 111

Slide 111 text

Background Merger segmentStack

Slide 112

Slide 112 text

It Works! But…

Slide 113

Slide 113 text

Append-Only File Format File

Slide 114

Slide 114 text

Append-Only File Format File Data

Slide 115

Slide 115 text

Append-Only File Format File Data

Slide 116

Slide 116 text

Append-Only File Format File Data Data

Slide 117

Slide 117 text

Append-Only File Format File Data Data

Slide 118

Slide 118 text

Append-Only File Format File Data Data Simple

Slide 119

Slide 119 text

Append-Only File Format File Data Data Simple Safe

Slide 120

Slide 120 text

Header Header

Slide 121

Slide 121 text

Persist each Segment in the Segment Stack Header meta []uint64 data []byte func (f *File) Write(b []byte) (n int, err error)

Slide 122

Slide 122 text

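// Uint64SliceToByteSlice reinterprets the memory of a []uint64 as a []byte
// without copying. The bytes are in the host's native byte order.
func Uint64SliceToByteSlice(in []uint64) ([]byte, error) {
	inHeader := (*reflect.SliceHeader)(unsafe.Pointer(&in))

	var out []byte
	outHeader := (*reflect.SliceHeader)(unsafe.Pointer(&out))
	outHeader.Data = inHeader.Data
	outHeader.Len = inHeader.Len * 8 // 8 bytes per uint64
	outHeader.Cap = inHeader.Cap * 8

	return out, nil
}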

Slide 123

Slide 123 text

Persist the Segment Stack Header meta []uint64 data []byte []byte func (f *File) Write(b []byte) (n int, err error)

Slide 124

Slide 124 text

Persist the Segment Stack Header meta []uint64 data []byte []byte not portable func (f *File) Write(b []byte) (n int, err error)
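For contrast with the unsafe conversion, a portable alternative is to write each uint64 in a fixed byte order with encoding/binary. This is not what the slides do; it is a swapped-in technique shown only to make the trade-off concrete. The function names are hypothetical, and the choice of little-endian is arbitrary.

```go
package main

import "encoding/binary"

// encodeUint64s copies in to a new byte slice, 8 little-endian bytes per
// value. Unlike the unsafe cast, this allocates and copies, but the output
// is identical on every architecture.
func encodeUint64s(in []uint64) []byte {
	out := make([]byte, 8*len(in))
	for i, v := range in {
		binary.LittleEndian.PutUint64(out[8*i:], v)
	}
	return out
}

// decodeUint64s reverses encodeUint64s. Trailing bytes that do not form a
// full 8-byte word are ignored.
func decodeUint64s(in []byte) []uint64 {
	out := make([]uint64, len(in)/8)
	for i := range out {
		out[i] = binary.LittleEndian.Uint64(in[8*i:])
	}
	return out
}
```

The cost is one copy of meta on write and another on read, which is the serialization speed the unsafe version avoids.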

Slide 125

Slide 125 text

Append Footer Header Footer

Slide 126

Slide 126 text

Append Footer Header Footer

Slide 127

Slide 127 text

Footer Pointers to Older Segments Header Footer

Slide 128

Slide 128 text

Footer Pointers to Older Segments Header Footer

Slide 129

Slide 129 text

Footer Pointers to Older Segments Header Footer

Slide 130

Slide 130 text

Footer Pointers to Older Segments Header Footer Footer

Slide 131

Slide 131 text

Opening File Header Footer Footer EOF Seek Backwards Find Start of Valid Footer

Slide 132

Slide 132 text

Opening File Header Footer Footer EOF Seek Backwards Find Start of Valid Footer

Slide 133

Slide 133 text

Opening File Header Footer Footer EOF Seek Backwards Find Start of Valid Footer

Slide 134

Slide 134 text

Opening File Header Footer Footer EOF Seek Backwards Find Start of Valid Footer

Slide 135

Slide 135 text

Reading From File meta []byte data []byte mmap segment bytes os https://github.com/edsrzf/mmap-go

Slide 136

Slide 136 text

Reading From File meta []byte data []byte mmap segment bytes os []uint64

Slide 137

Slide 137 text

In-memory and Disk Segments Get/Iterator implementations the same, no new code collection segmentStack []byte []byte mmap

Slide 138

Slide 138 text

Compaction

Slide 139

Slide 139 text

Reuse Background Merger to Compact Files Header Footer Footer data-0000123.moss

Slide 140

Slide 140 text

Reuse Background Merger to Compact Files Header Footer Footer data-0000123.moss Header data-0000124.moss

Slide 141

Slide 141 text

Reuse Background Merger to Compact Files Header Footer Footer data-0000123.moss Header data-0000124.moss

Slide 142

Slide 142 text

Reuse Background Merger to Compact Files Header Footer Footer data-0000123.moss Header Footer data-0000124.moss

Slide 143

Slide 143 text

Reuse Background Merger to Compact Files Header Footer data-0000124.moss

Slide 144

Slide 144 text

Who Are You?

Slide 145

Slide 145 text

No content

Slide 146

Slide 146 text

No content

Slide 147

Slide 147 text

MOSS

Slide 148

Slide 148 text

Memory Oriented Sorted Segments

Slide 149

Slide 149 text

No content

Slide 150

Slide 150 text

No content

Slide 151

Slide 151 text

Approachable Codebase
• Code - 4262 (39%)
• Test - 6580 (61%)

Slide 152

Slide 152 text

High Performance

Slide 153

Slide 153 text

High Performance - Minimize GC Impact type segment struct { data []byte meta []uint64 }

Slide 154

Slide 154 text

High Performance - Minimize GC Impact • Slices of uint64 and byte type segment struct { data []byte meta []uint64 }

Slide 155

Slide 155 text

High Performance - Minimize GC Impact • Slices of uint64 and byte • Integer offsets into slices, not pointers type segment struct { data []byte meta []uint64 } data []byte name marty 20 s 4 5 – meta []uint64

Slide 156

Slide 156 text

High Performance - Memory Allocation type segment struct { data []byte meta []uint64 }

Slide 157

Slide 157 text

High Performance - Memory Allocation • 2 uints of meta per op type segment struct { data []byte meta []uint64 }

Slide 158

Slide 158 text

High Performance - Memory Allocation
• 2 uints of meta per op
• len(data) = len(key) + len(val)
type segment struct {
	data []byte
	meta []uint64
}

Slide 159

Slide 159 text

High Performance - Memory Allocation
type Collection interface {
	NewBatch(ops, bytes int) (Batch, error)
}
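Because each operation needs exactly two uint64s of meta and len(key) + len(val) bytes of data, a caller that knows the operation count and total byte size up front lets both slices be allocated once. The sketch below shows that sizing; newSegment and set are hypothetical helpers, and the meta layout is the same simplified one assumed for illustration, not the moss implementation of NewBatch.

```go
package main

type segment struct {
	data []byte
	meta []uint64
}

// newSegment sizes both slices up front: two meta words per operation and
// totalBytes of key plus value data.
func newSegment(ops, totalBytes int) *segment {
	return &segment{
		data: make([]byte, 0, totalBytes),
		meta: make([]uint64, 0, 2*ops),
	}
}

// set appends one operation. If the caller's estimates were right, neither
// append needs to grow its slice.
func (s *segment) set(key, val []byte) {
	s.meta = append(s.meta, uint64(len(s.data)), uint64(len(key))<<32|uint64(len(val)))
	s.data = append(s.data, key...)
	s.data = append(s.data, val...)
}
```

If the estimates are too small, append still works; it just falls back to growing the slices.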

Slide 160

Slide 160 text

High Performance - Memory Allocation
type Batch interface {
	AllocSet(key, val []byte) error
	AllocDel(key []byte) error
}

Slide 161

Slide 161 text

High Performance - Unsafe

Slide 162

Slide 162 text

High Performance - Unsafe • Faster serialization, but loss of portability

Slide 163

Slide 163 text

High Performance - Unsafe
• Faster serialization, but loss of portability
• Deeply unsatisfying
Rob Pike – The byte order fallacy
https://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html
"Whenever I see code that asks what the native byte order is, it's almost certain the code is either wrong or misguided."

Slide 164

Slide 164 text

No content

Slide 165

Slide 165 text

No content

Slide 166

Slide 166 text

One Key Written Entire Store Rewritten

Slide 167

Slide 167 text

Conclusions

Slide 168

Slide 168 text

Moss Competitor

Slide 169

Slide 169 text

Moss Competitor @mossbro Moss Fanboy Moss is the best, fastest! #winning

Slide 170

Slide 170 text

Moss Competitor @mossbro Moss Fanboy Moss is the best, fastest! #winning @mossbro Moss Fanboy Competition is slow. Sad!

Slide 171

Slide 171 text

Moss Competitor @mossbro Moss Fanboy Moss is the best, fastest! #winning @mossbro Moss Fanboy Competition is slow. Sad!

Slide 172

Slide 172 text

moss Write Throughput Read after Write Latency

Slide 173

Slide 173 text

Implementation Benchmark Analyze and Profile

Slide 174

Slide 174 text

Epilogue Steve Yen

Slide 175

Slide 175 text

Couchbase Moss Team Abhinav Dangeti Sundar Sridharan Sreekanth Sivasankaran Scott Lashley Marty Schoch Alex Gyryk Aruna Piravi

Slide 176

Slide 176 text

Thanks github.com/couchbase/moss marty@couchbase.com @mschoch

Slide 177

Slide 177 text

Thanks github.com/couchbase/moss marty@couchbase.com @mschoch