Radix Trees, Transactions, and MemDB

The talk introduces the radix tree data structure, its properties, and its use cases. Immutable radix trees are then presented, along with their advantages and construction. Transactions are briefly introduced (what ACID means) to set context for MemDB, a Go library that provides a transactional in-memory database built on immutable radix trees. The MemDB library is used by HashiCorp Consul, Vault, Nomad, and Docker Swarm.

Armon Dadgar

April 19, 2017

Transcript

  1. Radix Trees, Transactions, and MemDB

  2. Armon Dadgar @armon

  3. None
  4. MemDB • Used in Consul, Nomad, Docker Swarm • Built

    on Immutable Radix Trees • Inspired by Radix Trees
  5. Radix Trees

  6. Radix Trees • Tree Data Structure, used as a Dictionary

    / Map • Directed (parent / child relationship) • Acyclic (cannot contain a cycle) • Keys are strings* • Values can be arbitrary
  7. Properties • O(K) operations instead of O(log N) for most

    trees • K is length of the input Key • Hash functions also O(K), can be deceptive for Hash Tables • Tunable sparsity vs depth
  8. Operations • CRUD (Create, Read, Update, Delete) • Find predecessor

    / successor of a key • Min / Max Value • Find common prefix of keys • Find longest matching prefix • Ordered Iteration
  9. Radix Structure (diagram) • root has edges “fooba” and “zip”

    • “fooba” has edges “r” (key foobar) and “z” (key foobaz) • “zip” is a leaf (key zip)
  10. Basic Operations • Start at the root and with the

    input key K • Follow the pointers from the current node using the offset into the key • Number of iterations linear with length of key • May need to split nodes on Insert or merge on Delete
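
The walk and the node split described on this slide can be sketched as follows. This is a minimal illustration, not the go-radix implementation: the `node` type, `newNode`, `Insert`, and `Get` are hypothetical names, and merging on delete is left out for brevity.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical edge-compressed radix tree node (illustration only).
type node struct {
	prefix   string         // key fragment on the edge leading into this node
	value    any            // payload, meaningful only when hasValue is true
	hasValue bool           // distinguishes stored keys from internal split points
	children map[byte]*node // children indexed by the first byte of their prefix
}

func newNode(prefix string) *node {
	return &node{prefix: prefix, children: map[byte]*node{}}
}

// commonPrefixLen returns how many leading bytes a and b share.
func commonPrefixLen(a, b string) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// Insert walks from the root, consuming the key one edge at a time,
// and splits an edge when the key diverges in the middle of it.
func (n *node) Insert(key string, value any) {
	cur := n
	for {
		if key == "" {
			cur.value, cur.hasValue = value, true
			return
		}
		child, ok := cur.children[key[0]]
		if !ok {
			leaf := newNode(key)
			leaf.value, leaf.hasValue = value, true
			cur.children[key[0]] = leaf
			return
		}
		l := commonPrefixLen(key, child.prefix)
		if l < len(child.prefix) {
			// Split: a new intermediate node keeps the shared part,
			// the old child keeps the remainder of its prefix.
			mid := newNode(child.prefix[:l])
			child.prefix = child.prefix[l:]
			mid.children[child.prefix[0]] = child
			cur.children[key[0]] = mid
			child = mid
		}
		key = key[l:]
		cur = child
	}
}

// Get follows edges using the current offset into the key; the number of
// iterations is bounded by the key length.
func (n *node) Get(key string) (any, bool) {
	cur := n
	for key != "" {
		child, ok := cur.children[key[0]]
		if !ok || !strings.HasPrefix(key, child.prefix) {
			return nil, false
		}
		key = key[len(child.prefix):]
		cur = child
	}
	return cur.value, cur.hasValue
}

func main() {
	root := newNode("")
	for i, k := range []string{"foobar", "foobaz", "zip"} {
		root.Insert(k, i)
	}
	v, ok := root.Get("foobaz")
	fmt.Println(v, ok) // 1 true
	_, ok = root.Get("fooba")
	fmt.Println(ok) // false: "fooba" is only an internal split point
}
```

Inserting foobar, foobaz, and zip in that order produces the same shape as the Radix Structure slide: a shared “fooba” node with children “r” and “z”, plus a separate “zip” leaf.
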
  11. Use Cases at HashiCorp • Consul / Vault ACLs •

    Vault Request Routing • CLI Library • etcetera
  12. Vault ACLs path "secret/*" { capabilities = ["read"] } path

    "secret/child" { capabilities = ["read", "write"] } path "mysql/creds/*" { capabilities = ["read"] }
  13. ACL Structure [“read”] secret/ mysql/creds/ root secret/ (nil) child [“read”,

    “write”] [“read”]
  14. Vault Request Routing $ vault mount -path=other generic Successfully mounted

    'generic' at 'other'! $ vault mount aws Successfully mounted 'aws' at 'aws'!
  15. Routing Structure Generic Backend root Generic Backend AWS Backend aws/

    secret/ other/
  16. Request Routing • $ vault read secret/foobar • Uses the

    longest prefix (secret/*) on ACLs to determine which policy is applicable and if the operation should be allowed • Uses the Routing tree to find longest prefix (secret/) to determine the backend that services the request
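
The longest-prefix lookup used for routing can be sketched like this. It is only an illustration, assuming a plain byte-per-node trie rather than a compressed radix tree, and the `Router`, `Mount`, and `LongestPrefix` names are hypothetical, not Vault's code. The `secret/team/` mount is also made up, to show that the longest match wins.

```go
package main

import "fmt"

// trieNode is a hypothetical uncompressed trie node: one byte per edge.
type trieNode struct {
	children map[byte]*trieNode
	backend  string // non-empty when a mount point ends at this node
}

// Router maps mount prefixes to backend names (illustration only).
type Router struct{ root *trieNode }

func NewRouter() *Router {
	return &Router{root: &trieNode{children: map[byte]*trieNode{}}}
}

// Mount registers a backend under the given path prefix.
func (r *Router) Mount(prefix, backend string) {
	cur := r.root
	for i := 0; i < len(prefix); i++ {
		next, ok := cur.children[prefix[i]]
		if !ok {
			next = &trieNode{children: map[byte]*trieNode{}}
			cur.children[prefix[i]] = next
		}
		cur = next
	}
	cur.backend = backend
}

// LongestPrefix walks the path byte by byte and remembers the deepest
// mount point seen, so the most specific mount is returned.
func (r *Router) LongestPrefix(path string) (prefix, backend string, ok bool) {
	cur := r.root
	for i := 0; i < len(path); i++ {
		next, found := cur.children[path[i]]
		if !found {
			break
		}
		cur = next
		if cur.backend != "" {
			prefix, backend, ok = path[:i+1], cur.backend, true
		}
	}
	return
}

func main() {
	r := NewRouter()
	r.Mount("secret/", "generic")
	r.Mount("other/", "generic")
	r.Mount("aws/", "aws")
	r.Mount("secret/team/", "team-backend") // hypothetical nested mount

	prefix, backend, ok := r.LongestPrefix("secret/foobar")
	fmt.Println(prefix, backend, ok) // secret/ generic true

	prefix, backend, ok = r.LongestPrefix("secret/team/db")
	fmt.Println(prefix, backend, ok) // secret/team/ team-backend true
}
```

The same lookup shape applies to the ACL tree: the value stored at the matched prefix would be a capability list instead of a backend name.
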
  17. Immutable Radix Tree

  18. Immutability • The inability to be changed, i.e. not mutable

    • Every modification returns a new tree, existing tree is unmodified • Uses more memory, reduces need for read coordination
  19. Immutable Radix • Same operations and properties as mutable Radix

    • Every modification returns a new root • Mutable: Insert(root, key, value) = (void) • Immutable: Insert(root, key, value) = root’
  20. Copy On Write • Any time a node or leaf

    is going to be modified, we copy the node and update the copy • K nodes updated per modification
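
A minimal sketch of copy-on-write insertion, assuming an uncompressed byte trie so that the path copying stays easy to follow. The `node`, `Insert`, and `Get` names are hypothetical and this is not the go-immutable-radix API; it only shows that nodes on the path to the key are copied while every other subtree is shared between the old and new roots.

```go
package main

import "fmt"

// node is a hypothetical immutable trie node: never modified once it is
// reachable from a published root.
type node struct {
	children map[byte]*node
	value    []string
	hasValue bool
}

// clone copies the node itself; the child pointers are shared, not copied.
func (n *node) clone() *node {
	c := &node{
		value:    n.value,
		hasValue: n.hasValue,
		children: make(map[byte]*node, len(n.children)+1),
	}
	for b, child := range n.children {
		c.children[b] = child
	}
	return c
}

// Insert returns a new root. Only the nodes along the key's path are
// copied; the tree reachable from the old root is left untouched.
func Insert(root *node, key string, value []string) *node {
	newRoot := root.clone()
	cur := newRoot
	for i := 0; i < len(key); i++ {
		var next *node
		if child, ok := cur.children[key[i]]; ok {
			next = child.clone() // copy before modifying
		} else {
			next = &node{children: map[byte]*node{}}
		}
		cur.children[key[i]] = next
		cur = next
	}
	cur.value, cur.hasValue = value, true
	return newRoot
}

// Get reads from whichever root the caller holds.
func Get(root *node, key string) ([]string, bool) {
	cur := root
	for i := 0; i < len(key); i++ {
		next, ok := cur.children[key[i]]
		if !ok {
			return nil, false
		}
		cur = next
	}
	return cur.value, cur.hasValue
}

func main() {
	t0 := &node{}
	t1 := Insert(t0, "secret/child", []string{"read", "write"})
	t1 = Insert(t1, "mysql/creds/", []string{"read"})
	t2 := Insert(t1, "secret/child", []string{"read", "write", "delete"})

	oldCaps, _ := Get(t1, "secret/child")
	newCaps, _ := Get(t2, "secret/child")
	fmt.Println(len(oldCaps), len(newCaps)) // 2 3

	// The untouched mysql/ subtree is shared between both versions.
	fmt.Println(t1.children['m'] == t2.children['m']) // true
}
```
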
  21. Original Tree [“read”] secret/ mysql/creds/ root secret/ (nil) child [“read”,

    “write”] [“read”]
  22. Update secret/child [“read”] secret/ mysql/creds/ root secret/ (nil) child [“read”,

    “write”] [“read”] [“read”, “write”, “delete”]
  23. Update secret/child [“read”] secret/ mysql/creds/ root secret/ (nil) child [“read”,

    “write”] [“read”] [“read”, “write”, “delete”] secret/ child (nil)
  24. Update secret/child [“read”] secret/ mysql/creds/ root secret/ (nil) child [“read”,

    “write”] [“read”] [“read”, “write”, “delete”] secret/ child (nil) root’ secret/
  25. Update secret/child [“read”] mysql/creds/ [“read”] [“read”, “write”, “delete”] secret/ child (nil)

    root’ secret/
  26. Immutable vs Mutable • Mutable Radix requires synchronization for reads/writes

    • Concurrent reads allowed • Concurrent read/writes disallowed • Immutable Radix requires synchronization for writes only • Concurrent read/writes allowed • Each write returns a new tree, existing tree is unmodified • Good for heavy read, low write workloads
  27. Use Cases at HashiCorp • MemDB (Consul, Nomad, Docker Swarm)

    • Vault Enterprise
  28. Transactions

  29. Transaction • Standard usage is RDBMS (ACID) • Atomicity: Completely

    fails or completely succeeds • Consistency: Does not result in any integrity violations (e.g. a User ID does not map to a blank e-mail) • Isolation: Transaction is not visible to others until completed • Durability: Once completed, the changes are permanent
  30. Immutable Radix • We can use an immutable radix tree

    to implement in-memory transactions! • Provides us with A and I properties • Consistency is domain specific • In-memory only, so not Durable in the ACID sense • Can be used to build an ACID system (e.g. Consul, Nomad)
  31. Atomicity and Isolation • Many keys can be Created, Updated,

    Deleted in a single transaction • Atomicity: transaction creates new root on commit, retains existing root on abort. Check-And-Set (CAS) operation to swap root pointers. • Isolation: Copy-On-Write of each transaction prevents readers of the existing root from witnessing any of the changes.
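
The commit-by-CAS idea can be sketched as below. To keep the example short, a copied Go map stands in for the immutable radix tree (an assumption of this sketch; a real tree would copy only the modified paths), and `Store`, `Txn`, `Begin`, `Commit`, and `Abort` are hypothetical names. It needs Go 1.19 or later for `atomic.Pointer`.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// version is an immutable snapshot; it is never mutated after publication.
// A copied map stands in for an immutable radix tree here.
type version struct{ data map[string]string }

// Store holds the current root pointer (hypothetical type).
type Store struct{ root atomic.Pointer[version] }

func NewStore() *Store {
	s := &Store{}
	s.root.Store(&version{data: map[string]string{}})
	return s
}

// Snapshot returns the current root; readers use it without any locking.
func (s *Store) Snapshot() *version { return s.root.Load() }

// Txn accumulates changes against a private copy of the root it started from.
type Txn struct {
	store *Store
	base  *version // root observed when the transaction began
	next  *version // private working copy, invisible to readers
}

func (s *Store) Begin() *Txn {
	base := s.root.Load()
	next := &version{data: make(map[string]string, len(base.data))}
	for k, v := range base.data {
		next.data[k] = v
	}
	return &Txn{store: s, base: base, next: next}
}

func (t *Txn) Set(k, v string) { t.next.data[k] = v }
func (t *Txn) Delete(k string) { delete(t.next.data, k) }

// Commit publishes every change with one compare-and-swap of the root.
// It reports false if another transaction committed first.
func (t *Txn) Commit() bool {
	return t.store.root.CompareAndSwap(t.base, t.next)
}

// Abort has nothing to undo: the working copy is simply dropped.
func (t *Txn) Abort() {}

func main() {
	s := NewStore()
	before := s.Snapshot()

	tx := s.Begin()
	tx.Set("health/web", "failing")
	tx.Set("session/web", "invalidated")
	fmt.Println(len(s.Snapshot().data)) // 0: uncommitted changes are invisible

	fmt.Println(tx.Commit())            // true
	fmt.Println(len(s.Snapshot().data)) // 2: both keys appear together
	fmt.Println(len(before.data))       // 0: the old snapshot is unchanged
}
```
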
  32. MemDB

  33. MemDB Goals • MVCC: Multi-Version Concurrency Control. Support multiple versions

    of an object so that you can have concurrent read/writes. • Transaction Support: Update many objects in a transaction to support richer high level APIs. Should be atomic and isolated. • Rich Indexing: Allow a single object to be indexed in multiple ways (e.g. User ID, email, DOB, etc)
  34. Why those requirements? • Consul needs to be able to

    snapshot current state to disk while accepting new writes. Long running read cannot block writes. • A single event such as a node failure may need to update multiple pieces of state (Health Checks, Sessions, K/V locks) • Many different query paths. Services by node, services by name, services in a failing state, etc.
  35. MemDB Structure Root Tree MemDB Write Lock Schema

  36. Schema • Schema defines tables and indexes at creation time

    • Allows for efficient storage and indexing of objects • Sanity checking of objects (ensure Consistency)
  37. Example Schema &DBSchema{ Tables: map[string]*TableSchema{ "people": &TableSchema{ Name: "people", Indexes:

    map[string]*IndexSchema{ "id": &IndexSchema{ Name: "id", Unique: true, Indexer: &UUIDFieldIndex{Field: "ID"}, }, "name": &IndexSchema{ Name: "name", Indexer: &StringFieldIndex{Field: "Name"}, }, "email": &IndexSchema{ Name: "email", Indexer: &StringFieldIndex{Field: "Email"}, }, }, }, }, }
  38. MemDB Tree Structure person_email Root Tree person_name person_id ID: abc123…

    Name: Armon Email: armon@… abc123… Armon armon@hashicorp.com
  39. MemDB Tree Structure • Each table has a primary tree,

    keyed by a unique ID • Each table can have 0+ indexes, unique or non-unique • Single copy of the object is stored in the primary tree, indexes point to the object
  40. Indexes • Each index has an Indexer which extracts a

    value from an object and turns it into an index key • StringFieldIndex: Extracts string value field • UUIDFieldIndex: Extracts string or []byte field • FieldSetIndex: Checks if a field has non-zero value (is set) • ConditionalIndex: Extracts field as boolean value • CompoundIndex: Combines multiple indexes
  41. Compound Index • CompoundIndex{StringFieldIndex{“First”}, StringFieldIndex{“Last”}} • Extracts {“First”: “Armon”, “Last”:

    “Dadgar”} as “Armon\x00Dadgar\x00” • Queries like “first = ‘Armon’ and last starts with ‘D’”
  42. Read-only Transactions • Snapshot MemDB, retain a copy of the

    root pointer • Read against the Snapshot • Immutable trees allow us to avoid locking across reads and isolation from other transactions
  43. Read-only Transaction Root Tree MemDB Write Lock Schema Read Txn

  44. Mixed Transactions • Acquire the write lock, serializes writes •

    Write to the root, creating a new root • Atomic swap the root pointers on commit, do nothing on abort • Release the write lock
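
The read-only and mixed transaction flow can be sketched with a write lock plus an atomically published root. This is an illustration under stated assumptions, not the go-memdb implementation: `DB`, `Txn`, and `snapshot` are hypothetical types, there is a single table of string rows, and a copied map again stands in for the immutable radix tree.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// snapshot is an immutable root (a copied map stands in for the tree).
type snapshot struct{ rows map[string]string }

// DB pairs the published root with a lock that serializes writers.
type DB struct {
	root   atomic.Pointer[snapshot]
	writer sync.Mutex
}

func NewDB() *DB {
	db := &DB{}
	db.root.Store(&snapshot{rows: map[string]string{}})
	return db
}

// Txn reads from the root it captured; write transactions also hold the lock.
type Txn struct {
	db    *DB
	write bool
	view  *snapshot
	dirty bool // true once view is a private copy
	done  bool
}

// Txn starts a transaction. Readers take no lock; writers wait their turn.
func (db *DB) Txn(write bool) *Txn {
	if write {
		db.writer.Lock()
	}
	return &Txn{db: db, write: write, view: db.root.Load()}
}

func (t *Txn) Get(key string) (string, bool) {
	v, ok := t.view.rows[key]
	return v, ok
}

// Insert copies the root on first write, then modifies the private copy.
func (t *Txn) Insert(key, val string) {
	if !t.write || t.done {
		panic("insert requires an open write transaction")
	}
	if !t.dirty {
		cp := &snapshot{rows: make(map[string]string, len(t.view.rows)+1)}
		for k, v := range t.view.rows {
			cp.rows[k] = v
		}
		t.view, t.dirty = cp, true
	}
	t.view.rows[key] = val
}

// Commit swaps in the new root and releases the write lock.
func (t *Txn) Commit() {
	if !t.write || t.done {
		return
	}
	t.db.root.Store(t.view)
	t.done = true
	t.db.writer.Unlock()
}

// Abort leaves the root alone and releases the write lock.
func (t *Txn) Abort() {
	if !t.write || t.done {
		return
	}
	t.done = true
	t.db.writer.Unlock()
}

func main() {
	db := NewDB()
	w := db.Txn(true)
	w.Insert("armon", "armon@hashicorp.com")

	r := db.Txn(false) // snapshot taken while the write is in progress
	_, seen := r.Get("armon")
	fmt.Println(seen) // false

	w.Commit()
	_, seen = db.Txn(false).Get("armon")
	fmt.Println(seen) // true
}
```
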
  45. Mixed Transaction (Progress) Root Tree MemDB Write Lock Schema Write

    Txn New Root
  46. Mixed Transaction (Commit) Root Tree MemDB Write Lock Schema Write

    Txn New Root
  47. Mixed Transaction (Abort) Root Tree MemDB Write Lock Schema Write

    Txn New Root
  48. Use Cases • Consul • Nomad • Docker Swarm

  49. Consensus Based Systems stage write API Raft Log MemDB Raft

    Snapshot read snapshot apply write read
  50. MemDB • Allows highly concurrent reads to state • Long

    running reads to snapshot without blocking writes • Single threaded writer from Raft has no write contention • Raft ensures consistent state for all copies of MemDB
  51. Nomad Advanced Usage • Schedulers use snapshots of state to

    determine placement • Leader provides coordination through evaluation queue and plan queue • Evaluation Queue: Dequeues work to schedulers, provides at-least-once semantics • Plan Queue: Controls placement to prevent data races and over-allocation
  52. Plan Queue • Receives placement plans from schedulers • Verifies

    plan and writes to Raft to commit the plan • Read, Verify, Write loop causes a stall while we are waiting for Raft to commit • MemDB allows us to optimistically evaluate plans while we wait!
  53. No Overlapping Time Verify Plan 1 Stall Apply Plan 1

    Verify Plan 2 Apply Plan 2
  54. Plan Overlapping Time Verify Plan 1 Stall Apply Plan 1

    Verify Plan 2 Apply Plan 2
  55. Plan Overlapping • Plan 1 is applied to a snapshot

    of the state • Plan 2 is verified against the optimistic state copy • Once plan 1 commits, we can submit plan 2 • Allows CPU to verify plan while waiting on I/O to apply writes
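
Plan overlapping can be sketched as below. This is a heavily simplified model, not Nomad's plan queue: `State`, `Plan`, `verify`, and `apply` are hypothetical names, capacity is a single integer per node, and a goroutine that closes a channel stands in for waiting on the Raft commit.

```go
package main

import "fmt"

// State maps a node name to its free CPU; values are treated as immutable.
type State map[string]int

// Plan maps a node name to the CPU a placement wants to reserve there.
type Plan map[string]int

// verify reports whether every placement in the plan fits the state.
func verify(s State, p Plan) bool {
	for node, want := range p {
		if s[node] < want {
			return false
		}
	}
	return true
}

// apply returns a new state with the plan's reservations subtracted;
// the input state is not modified.
func apply(s State, p Plan) State {
	next := make(State, len(s))
	for node, free := range s {
		next[node] = free
	}
	for node, want := range p {
		next[node] -= want
	}
	return next
}

func main() {
	committed := State{"node1": 4, "node2": 2}
	plan1 := Plan{"node1": 3}
	plan2 := Plan{"node1": 2} // would over-allocate node1 after plan 1
	plan3 := Plan{"node2": 2}

	if !verify(committed, plan1) {
		return
	}

	// Stand-in for submitting plan 1 and waiting on the Raft commit.
	done := make(chan struct{})
	go func() { close(done) }()

	// While plan 1 is in flight, later plans are verified against an
	// optimistic copy of the state that already includes plan 1.
	optimistic := apply(committed, plan1)
	ok2 := verify(optimistic, plan2)
	ok3 := verify(optimistic, plan3)

	<-done                 // plan 1 has committed
	committed = optimistic // later plans can now be submitted

	fmt.Println(ok2, ok3, committed["node1"]) // false true 1
}
```
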
  56. Conclusion

  57. Radix Trees • High performance tree data structure • Comparable

    to Hash Tables usually, richer set of operations supported • I’ve used them in probably every project I’ve ever worked on
  58. Immutable Radix Trees • Similar to mutable radix tree •

    Simplifies concurrency • Allows for highly scalable reads
  59. MemDB • Abstracts radix trees to provide object store •

    Provides MVCC, transactions, and rich indexing • Simplifies complex state management • Allows for highly scalable reads
  60. Thanks! go-radix: https://github.com/armon/go-radix go-immutable-radix: https://github.com/hashicorp/go-immutable-radix MemDB: https://github.com/hashicorp/go-memdb Q/A