Slide 1

Slide 1 text

Radix Trees, Transactions, and MemDB

Slide 2

Slide 2 text

Armon Dadgar @armon

Slide 4

Slide 4 text

MemDB • Used in Consul, Nomad, Docker Swarm • Built on Immutable Radix Trees • Inspired by Radix Trees

Slide 5

Slide 5 text

Radix Trees

Slide 6

Slide 6 text

Radix Trees • Tree Data Structure, used as a Dictionary / Map • Directed (parent / child relationship) • Acyclic (cannot contain a cycle) • Keys are strings* • Values can be arbitrary

Slide 7

Slide 7 text

Properties • O(K) operations instead of O(log N) for most trees • K is length of the input Key • Hash functions also O(K), can be deceptive for Hash Tables • Tunable sparsity vs depth

Slide 8

Slide 8 text

Operations • CRUD (Create, Read, Update, Delete) • Find predecessor / successor of a key • Min / Max Value • Find common prefix of keys • Find longest matching prefix • Ordered Iteration

Slide 9

Slide 9 text

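Radix Structure (diagram) • root has edges "zip" and "fooba" • "zip" leads to the leaf zip • "fooba" leads to the node fooba, which has edges "r" and "z" • those edges lead to the leaves foobar and foobaz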

Slide 10

Slide 10 text

Basic Operations • Start at the root with the input key K • Follow the pointers from the current node using the offset into the key • Number of iterations is linear in the length of the key • May need to split nodes on Insert or merge on Delete
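
The walk described above can be sketched in Go. This is a minimal illustration, not the go-radix implementation: the `node` type, `newNode`, and `commonLen` are names invented for this sketch, and Delete (with node merging) is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// node is one radix-tree node; prefix is the label of the edge leading to it.
// (Hypothetical type for this sketch, not the go-radix API.)
type node struct {
	prefix   string
	value    any
	hasValue bool
	children map[byte]*node // keyed by the first byte of each child's prefix
}

func newNode(prefix string) *node {
	return &node{prefix: prefix, children: map[byte]*node{}}
}

// commonLen returns the length of the longest common prefix of a and b.
func commonLen(a, b string) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// Insert starts at the root and consumes the key one edge at a time,
// splitting an edge when the key diverges part-way along it.
func (root *node) Insert(key string, value any) {
	cur := root
	for {
		if key == "" {
			cur.value, cur.hasValue = value, true
			return
		}
		child, ok := cur.children[key[0]]
		if !ok {
			leaf := newNode(key)
			leaf.value, leaf.hasValue = value, true
			cur.children[key[0]] = leaf
			return
		}
		n := commonLen(key, child.prefix)
		if n < len(child.prefix) {
			// Split: a new intermediate node takes the shared part of the edge.
			mid := newNode(child.prefix[:n])
			child.prefix = child.prefix[n:]
			mid.children[child.prefix[0]] = child
			cur.children[key[0]] = mid
			child = mid
		}
		key = key[n:]
		cur = child
	}
}

// Get follows edges until the key is consumed; the work is bounded by len(key).
func (root *node) Get(key string) (any, bool) {
	cur := root
	for key != "" {
		child, ok := cur.children[key[0]]
		if !ok || !strings.HasPrefix(key, child.prefix) {
			return nil, false
		}
		key = key[len(child.prefix):]
		cur = child
	}
	return cur.value, cur.hasValue
}

func main() {
	root := newNode("")
	root.Insert("foobar", 1)
	root.Insert("foobaz", 2) // splits the "foobar" edge into "fooba" + "r" / "z"
	root.Insert("zip", 3)

	v, ok := root.Get("foobaz")
	fmt.Println(v, ok) // 2 true
	_, ok = root.Get("fooba")
	fmt.Println(ok) // false: "fooba" is an internal node with no value
}
```

Inserting "foobaz" after "foobar" produces the same shape as the Radix Structure slide: a shared "fooba" node with "r" and "z" children.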

Slide 11

Slide 11 text

Use Cases at HashiCorp • Consul / Vault ACLs • Vault Request Routing • CLI Library • etc.

Slide 12

Slide 12 text

Vault ACLs
    path "secret/*" { capabilities = ["read"] }
    path "secret/child" { capabilities = ["read", "write"] }
    path "mysql/creds/*" { capabilities = ["read"] }

Slide 13

Slide 13 text

ACL Structure [“read”] secret/ mysql/creds/ root secret/ (nil) child [“read”, “write”] [“read”]

Slide 14

Slide 14 text

Vault Request Routing
    $ vault mount -path=other generic
    Successfully mounted 'generic' at 'other'!
    $ vault mount aws
    Successfully mounted 'aws' at 'aws'!

Slide 15

Slide 15 text

Routing Structure Generic Backend root Generic Backend AWS Backend aws/ secret/ other/

Slide 16

Slide 16 text

Request Routing • $ vault read secret/foobar • Uses the longest prefix (secret/*) on ACLs to determine which policy is applicable and if the operation should be allowed • Uses the Routing tree to find longest prefix (secret/) to determine the backend that services the request
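
A longest-prefix lookup of this kind might look like the following. To keep it short, the sketch uses an uncompressed trie (one byte per edge) rather than a compressed radix tree, and `Router`, `Mount`, and `LongestPrefix` are hypothetical names, not Vault's internal API.

```go
package main

import "fmt"

// trieNode is one byte-per-edge trie node (a simplification of a radix node).
type trieNode struct {
	children map[byte]*trieNode
	backend  string // non-empty when a mount point ends at this node
}

// Router maps mount prefixes to backend names (hypothetical type).
type Router struct{ root *trieNode }

func NewRouter() *Router {
	return &Router{root: &trieNode{children: map[byte]*trieNode{}}}
}

// Mount registers backend under the given path prefix.
func (r *Router) Mount(prefix, backend string) {
	cur := r.root
	for i := 0; i < len(prefix); i++ {
		next, ok := cur.children[prefix[i]]
		if !ok {
			next = &trieNode{children: map[byte]*trieNode{}}
			cur.children[prefix[i]] = next
		}
		cur = next
	}
	cur.backend = backend
}

// LongestPrefix walks the path and remembers the deepest mount point seen.
func (r *Router) LongestPrefix(path string) (prefix, backend string, ok bool) {
	cur := r.root
	for i := 0; i < len(path); i++ {
		next, found := cur.children[path[i]]
		if !found {
			break
		}
		cur = next
		if cur.backend != "" {
			prefix, backend, ok = path[:i+1], cur.backend, true
		}
	}
	return
}

func main() {
	r := NewRouter()
	r.Mount("secret/", "generic")
	r.Mount("other/", "generic")
	r.Mount("aws/", "aws")

	fmt.Println(r.LongestPrefix("secret/foobar")) // secret/ generic true
}
```

The same walk serves the ACL lookup: store a policy at each prefix instead of a backend name.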

Slide 17

Slide 17 text

Immutable Radix Tree

Slide 18

Slide 18 text

Immutability • The inability to be changed, i.e. not mutable • Every modification returns a new tree, existing tree is unmodified • Uses more memory, reduces need for read coordination

Slide 19

Slide 19 text

Immutable Radix • Same operations and properties as mutable Radix • Every modification returns a new root • Mutable: Insert(root, key, value) = (void) • Immutable: Insert(root, key, value) = root’

Slide 20

Slide 20 text

Copy On Write • Any time a node or leaf is going to be modified, we copy the node and update the copy • K nodes updated per modification
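
Copy-on-write along the modified path (path copying) can be sketched as below. This is an illustration on an uncompressed trie, not the go-immutable-radix code; `inode`, `Insert`, and `Get` are names made up for the sketch.

```go
package main

import "fmt"

// inode is a trie node that is never modified once it is reachable from a
// published root (hypothetical type for this sketch).
type inode struct {
	children map[byte]*inode
	value    []string
}

// clone makes a shallow copy: the children map is copied, but the child
// nodes themselves are shared with the original.
func (n *inode) clone() *inode {
	c := &inode{value: n.value, children: make(map[byte]*inode, len(n.children))}
	for k, v := range n.children {
		c.children[k] = v
	}
	return c
}

// Insert returns a new root. Only the nodes on the path to key are copied;
// every other subtree is shared between the old and the new tree.
func Insert(root *inode, key string, value []string) *inode {
	newRoot := root.clone()
	cur := newRoot
	for i := 0; i < len(key); i++ {
		var next *inode
		if old, ok := cur.children[key[i]]; ok {
			next = old.clone()
		} else {
			next = &inode{children: map[byte]*inode{}}
		}
		cur.children[key[i]] = next
		cur = next
	}
	cur.value = value
	return newRoot
}

// Get reads from whichever root it is handed; no locking is needed.
func Get(root *inode, key string) []string {
	cur := root
	for i := 0; i < len(key); i++ {
		next, ok := cur.children[key[i]]
		if !ok {
			return nil
		}
		cur = next
	}
	return cur.value
}

func main() {
	root := &inode{children: map[byte]*inode{}}
	root = Insert(root, "secret/child", []string{"read", "write"})
	root = Insert(root, "mysql/creds/", []string{"read"})

	root2 := Insert(root, "secret/child", []string{"read", "write", "delete"})

	fmt.Println(Get(root, "secret/child"))                 // [read write]
	fmt.Println(Get(root2, "secret/child"))                // [read write delete]
	fmt.Println(root.children['m'] == root2.children['m']) // true: untouched subtree is shared
}
```

Readers holding the old root keep seeing the old value, which is the property the update slides walk through step by step.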

Slide 21

Slide 21 text

Original Tree [“read”] secret/ mysql/creds/ root secret/ (nil) child [“read”, “write”] [“read”]

Slide 22

Slide 22 text

Update secret/child [“read”] secret/ mysql/creds/ root secret/ (nil) child [“read”, “write”] [“read”] [“read”, “write”,”delete”]

Slide 23

Slide 23 text

Update secret/child [“read”] secret/ mysql/creds/ root secret/ (nil) child [“read”, “write”] [“read”] [“read”, “write”,”delete”] secret/ child (nil)

Slide 24

Slide 24 text

Update secret/child [“read”] secret/ mysql/creds/ root secret/ (nil) child [“read”, “write”] [“read”] [“read”, “write”,”delete”] secret/ child (nil) root’ secret/

Slide 25

Slide 25 text

Update secret/child [“read”] mysql/creds/ [“read”] [“read”, “write”,”delete”] secret/ child (nil) root’ secret/

Slide 26

Slide 26 text

Immutable vs Mutable • Mutable Radix requires synchronization for reads/writes • Concurrent reads allowed • Concurrent read/writes disallowed • Immutable Radix requires synchronization for writes only • Concurrent read/writes allowed • Each write returns a new tree, existing tree is unmodified • Good for heavy read, low write workloads

Slide 27

Slide 27 text

Use Cases at HashiCorp • MemDB (Consul, Nomad, Docker Swarm) • Vault Enterprise

Slide 28

Slide 28 text

Transactions

Slide 29

Slide 29 text

Transaction • Standard usage is RDBMS (ACID) • Atomicity: Completely fails or completely succeeds • Consistency: Does not result in any integrity violations (e.g. a User ID does not map to a blank e-mail) • Isolation: Transaction is not visible to others until completed • Durability: Once completed, the changes are permanent

Slide 30

Slide 30 text

Immutable Radix • We can use an immutable radix tree to implement in-memory transactions! • Provides us with A and I properties • Consistency is domain specific • In-memory only, so not Durable in the ACID sense • Can be used to build ACID system (e.g. Consul, Nomad)

Slide 31

Slide 31 text

Atomicity and Isolation • Many keys can be Created, Updated, Deleted in a single transaction • Atomicity: transaction creates new root on commit, retains existing root on abort. Check-And-Set (CAS) operation to swap root pointers. • Isolation: Copy-On-Write of each transaction prevents readers of the existing root from witnessing any of the changes.

Slide 32

Slide 32 text

MemDB

Slide 33

Slide 33 text

MemDB Goals • MVCC: Multi-Version Concurrency Control. Support multiple versions of an object so that you can have concurrent read/writes. • Transaction Support: Update many objects in a transaction to support richer high level APIs. Should be atomic and isolated. • Rich Indexing: Allow a single object to be indexed in multiple ways (e.g. User ID, email, DOB, etc)

Slide 34

Slide 34 text

Why those requirements? • Consul needs to be able to snapshot current state to disk while accepting new writes. Long running read cannot block writes. • A single event such as a node failure may need to update multiple pieces of state (Health Checks, Sessions, K/V locks) • Many different query paths. Services by node, services by name, services in a failing state, etc.

Slide 35

Slide 35 text

MemDB Structure Root Tree MemDB Write Lock Schema

Slide 36

Slide 36 text

Schema • Schema defines tables and indexes at creation time • Allows for efficient storage and indexing of objects • Sanity checking of objects (ensure Consistency)

Slide 37

Slide 37 text

Example Schema
    &DBSchema{
        Tables: map[string]*TableSchema{
            "people": &TableSchema{
                Name: "people",
                Indexes: map[string]*IndexSchema{
                    // Primary index: unique, keyed by UUID.
                    "id": &IndexSchema{
                        Name:    "id",
                        Unique:  true,
                        Indexer: &UUIDFieldIndex{Field: "ID"},
                    },
                    // Secondary, non-unique indexes.
                    "name": &IndexSchema{
                        Name:    "name",
                        Indexer: &StringFieldIndex{Field: "Name"},
                    },
                    "email": &IndexSchema{
                        Name:    "email",
                        Indexer: &StringFieldIndex{Field: "Email"},
                    },
                },
            },
        },
    }

Slide 38

Slide 38 text

MemDB Tree Structure person_email Root Tree person_name person_id ID: abc123… Name: Armon Email: armon@… abc123… Armon armon@…

Slide 39

Slide 39 text

MemDB Tree Structure • Each table has a primary tree, keyed by a unique ID • Each table can have 0+ indexes, unique or non-unique • Single copy of the object is stored in the primary tree, indexes point to the object

Slide 40

Slide 40 text

Indexes • Each index has an Indexer which extracts a value from an object and turns it into an index key • StringFieldIndex: Extracts string value field • UUIDFieldIndex: Extracts string or []byte field • FieldSetIndex: Checks if a field has non-zero value (is set) • ConditionalIndex: Extracts field as boolean value • CompoundIndex: Combines multiple indexes
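
As a rough illustration of what an indexer does, the sketch below extracts a string field by name and turns it into a NUL-terminated index key. `stringFieldIndexer` and `Person` are simplified stand-ins written for this example, not the go-memdb types; a Go map per index stands in for a radix tree per index, and the e-mail address is a placeholder.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// Person is an example object (field values below are placeholders).
type Person struct{ ID, Name, Email string }

// stringFieldIndexer is a hypothetical, simplified indexer for string fields.
type stringFieldIndexer struct{ Field string }

// FromObject extracts the named string field and returns it as a
// NUL-terminated index key. ok is false when the field is empty.
func (s stringFieldIndexer) FromObject(obj any) (ok bool, key []byte, err error) {
	v := reflect.Indirect(reflect.ValueOf(obj))
	if v.Kind() != reflect.Struct {
		return false, nil, errors.New("object is not a struct")
	}
	f := v.FieldByName(s.Field)
	if !f.IsValid() || f.Kind() != reflect.String {
		return false, nil, fmt.Errorf("field %q is not a string field", s.Field)
	}
	if f.String() == "" {
		return false, nil, nil
	}
	return true, append([]byte(f.String()), 0), nil
}

func main() {
	p := &Person{ID: "abc123", Name: "Armon", Email: "user@example.com"}

	indexers := map[string]stringFieldIndexer{
		"name":  {Field: "Name"},
		"email": {Field: "Email"},
	}

	// One map per index; every index stores a pointer to the same object.
	indexes := map[string]map[string]*Person{}
	for name, ix := range indexers {
		ok, key, err := ix.FromObject(p)
		if err != nil || !ok {
			continue
		}
		if indexes[name] == nil {
			indexes[name] = map[string]*Person{}
		}
		indexes[name][string(key)] = p
	}

	byName := indexes["name"]["Armon\x00"]
	byEmail := indexes["email"]["user@example.com\x00"]
	fmt.Println(byName == byEmail) // true: both indexes point at one object
}
```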

Slide 41

Slide 41 text

Compound Index • CompoundIndex{StringFieldIndex{“First”}, StringFieldIndex{“Last”}} • Extracts {“First”: “Armon”, “Last”: “Dadgar”} as “Armon\x00Dadgar\x00” • Queries like “first = ‘Armon’ and last starts with ‘D’”

Slide 42

Slide 42 text

Read-only Transactions • Snapshot MemDB, retain a copy of the root pointer • Read against the Snapshot • Immutable trees let us avoid locking across reads and provide isolation from other transactions

Slide 43

Slide 43 text

Read-only Transaction Root Tree MemDB Write Lock Schema Read Txn

Slide 44

Slide 44 text

Mixed Transactions • Acquire the write lock, which serializes writes • Write to the root, creating a new root • Atomically swap the root pointer on commit, do nothing on abort • Release the write lock
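
Put together with read-only snapshots, the write path might be sketched like this. It is not the go-memdb implementation: `DB`, `WriteTxn`, and `tree` are invented for the example, and `tree` is a copied map standing in for an immutable radix root. `atomic.Pointer` needs Go 1.19+.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// tree stands in for an immutable radix root; insert never modifies the receiver.
type tree struct{ data map[string]string }

func (t *tree) insert(k, v string) *tree {
	next := &tree{data: make(map[string]string, len(t.data)+1)}
	for key, val := range t.data {
		next.data[key] = val
	}
	next.data[k] = v
	return next
}

// DB holds the current root and the single write lock (hypothetical type).
type DB struct {
	root   atomic.Pointer[tree]
	writer sync.Mutex
}

func NewDB() *DB {
	db := &DB{}
	db.root.Store(&tree{data: map[string]string{}})
	return db
}

// Snapshot is a read-only transaction: just a copy of the root pointer.
func (db *DB) Snapshot() *tree { return db.root.Load() }

// WriteTxn accumulates changes on a private root until Commit or Abort.
type WriteTxn struct {
	db   *DB
	root *tree
}

// Begin acquires the write lock, so at most one write transaction is open.
func (db *DB) Begin() *WriteTxn {
	db.writer.Lock()
	return &WriteTxn{db: db, root: db.root.Load()}
}

func (t *WriteTxn) Insert(k, v string) { t.root = t.root.insert(k, v) }

// Commit atomically publishes the new root, then releases the write lock.
func (t *WriteTxn) Commit() {
	t.db.root.Store(t.root)
	t.db.writer.Unlock()
}

// Abort discards the private root; the published root was never touched.
func (t *WriteTxn) Abort() { t.db.writer.Unlock() }

func main() {
	db := NewDB()
	before := db.Snapshot()

	txn := db.Begin()
	txn.Insert("service/web", "healthy")
	txn.Insert("service/db", "healthy")
	fmt.Println(len(db.Snapshot().data)) // 0: nothing visible before commit
	txn.Commit()

	fmt.Println(len(db.Snapshot().data), len(before.data)) // 2 0

	txn = db.Begin()
	txn.Insert("service/cache", "failing")
	txn.Abort()
	fmt.Println(len(db.Snapshot().data)) // 2: aborted write left no trace
}
```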

Slide 45

Slide 45 text

Mixed Transaction (Progress) Root Tree MemDB Write Lock Schema Write Txn New Root

Slide 46

Slide 46 text

Mixed Transaction (Commit) Root Tree MemDB Write Lock Schema Write Txn New Root

Slide 47

Slide 47 text

Mixed Transaction (Abort) Root Tree MemDB Write Lock Schema Write Txn New Root

Slide 48

Slide 48 text

Use Cases • Consul • Nomad • Docker Swarm

Slide 49

Slide 49 text

Consensus Based Systems stage write API Raft Log MemDB Raft Snapshot read snapshot apply write read

Slide 50

Slide 50 text

MemDB • Allows highly concurrent reads to state • Long running reads to snapshot without blocking writes • Single threaded writer from Raft has no write contention • Raft ensures consistent state for all copies of MemDB

Slide 51

Slide 51 text

Nomad Advanced Usage • Schedulers use snapshots of state to determine placement • Leader provides coordination through evaluation queue and plan queue • Evaluation Queue: Dequeues work to schedulers, provides at-least-once semantics • Plan Queue: Controls placement to prevent data races and over-allocation

Slide 52

Slide 52 text

Plan Queue • Receives placement plans from schedulers • Verifies plan and writes to Raft to commit the plan • Read, Verify, Write loop causes a stall while we are waiting for Raft to commit • MemDB allows us to optimistically evaluate plans while we wait!

Slide 53

Slide 53 text

No Overlapping Time Verify Plan 1 Stall Apply Plan 1 Verify Plan 2 Apply Plan 2

Slide 54

Slide 54 text

Plan Overlapping Time Verify Plan 1 Stall Apply Plan 1 Verify Plan 2 Apply Plan 2

Slide 55

Slide 55 text

Plan Overlapping • Plan 1 is applied to a snapshot of the state • Plan 2 is verified against the optimistic state copy • Once plan 1 commits, we can submit plan 2 • Allows CPU to verify plan while waiting on I/O to apply writes
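
The overlap can be illustrated with a small simulation. Everything here is an assumption made for the sketch: `State`, `Plan`, `verify`, `apply`, and `commitAsync` are hypothetical, the Raft write is replaced by a goroutine that always succeeds, and this is not Nomad's plan queue code.

```go
package main

import "fmt"

// State maps node name to free capacity; it is treated as immutable.
type State map[string]int

// Plan asks for Need units of capacity on Node.
type Plan struct {
	Node string
	Need int
}

// verify checks a plan against a given view of the state.
func verify(s State, p Plan) bool { return s[p.Node] >= p.Need }

// apply returns a new state with the plan applied; s itself is not modified.
func apply(s State, p Plan) State {
	next := make(State, len(s))
	for k, v := range s {
		next[k] = v
	}
	next[p.Node] -= p.Need
	return next
}

// commitAsync stands in for a Raft write. In this sketch it always succeeds;
// a real commit would block on network and disk I/O.
func commitAsync(p Plan) <-chan struct{} {
	done := make(chan struct{})
	go func() { close(done) }()
	return done
}

// run verifies each plan against the optimistic state while the previous
// plan's commit is still in flight, and submits it once that commit finishes.
func run(committed State, plans []Plan) (State, []Plan) {
	optimistic := committed
	var accepted []Plan
	var inFlight <-chan struct{}
	for _, p := range plans {
		ok := verify(optimistic, p) // CPU work, overlaps the in-flight commit
		if inFlight != nil {
			<-inFlight // wait for the previous plan to commit
			inFlight = nil
		}
		if !ok {
			continue
		}
		inFlight = commitAsync(p)
		optimistic = apply(optimistic, p) // next plan is verified against this copy
		accepted = append(accepted, p)
	}
	if inFlight != nil {
		<-inFlight
	}
	return optimistic, accepted
}

func main() {
	final, accepted := run(State{"n1": 10}, []Plan{{"n1", 6}, {"n1", 6}, {"n1", 4}})
	fmt.Println(final["n1"], len(accepted)) // 0 2: the second plan would over-allocate
}
```

Because the second plan is checked against the optimistic copy rather than the last committed state, it is rejected even though the first plan has not finished committing.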

Slide 56

Slide 56 text

Conclusion

Slide 57

Slide 57 text

Radix Trees • High performance tree data structure • Comparable to Hash Tables usually, richer set of operations supported • I’ve used them in probably every project I’ve ever worked on

Slide 58

Slide 58 text

Immutable Radix Trees • Similar to mutable radix tree • Simplifies concurrency • Allows for highly scalable reads

Slide 59

Slide 59 text

MemDB • Abstracts radix trees to provide object store • Provides MVCC, transactions, and rich indexing • Simplifies complex state management • Allows for highly scalable reads

Slide 60

Slide 60 text

Thanks! • go-radix: https://github.com/armon/go-radix • go-immutable-radix: https://github.com/hashicorp/go-immutable-radix • MemDB: https://github.com/hashicorp/go-memdb • Q/A