Radix Trees, Transactions, and MemDB

This talk introduces the radix tree data structure, its properties, and its use cases. Immutable radix trees are then presented, along with their advantages and construction. Transactions are briefly introduced (what ACID means) to set context for MemDB, a Go library that provides a transactional in-memory database built on immutable radix trees. The MemDB library is used by HashiCorp Consul, Vault, Nomad, and Docker Swarm.

Armon Dadgar

April 19, 2017

Transcript

  1. Radix Trees, Transactions, and MemDB

  2. Armon Dadgar
    @armon

  3. View Slide

  4. MemDB
    • Used in Consul, Nomad, Docker Swarm
    • Built on Immutable Radix Trees
    • Inspired by Radix Trees

  5. Radix Trees

  6. Radix Trees
    • Tree Data Structure, used as a Dictionary / Map
    • Directed (parent / child relationship)
    • Acyclic (cannot contain a cycle)
    • Keys are strings*
    • Values can be arbitrary

  7. Properties
    • O(K) operations instead of O(log N) for most trees
    • K is length of the input Key
    • Hash functions also O(K), can be deceptive for Hash Tables
    • Tunable sparsity vs depth

  8. Operations
    • CRUD (Create, Read, Update, Delete)
    • Find predecessor / successor of a key
    • Min / Max Value
    • Find common prefix of keys
    • Find longest matching prefix
    • Ordered Iteration

  9. Radix Structure
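    [Diagram: root has two edges, "fooba" and "zip". The "zip" edge leads to
    the key zip. The "fooba" node has two edges, "r" and "z", leading to the
    keys foobar and foobaz.]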

  10. Basic Operations
    • Start at the root and with the input key K
    • Follow the pointers from the current node using the offset into the key
    • Number of iterations linear with length of key
    • May need to split nodes on Insert or merge on Delete
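
The walk described above can be sketched in Go. This is a minimal illustration written for this transcript, not the code of go-radix or go-immutable-radix; the `node` type and its `Insert` and `Get` methods are hypothetical names, and Delete (with node merging) is left out for brevity.

```go
package main

import (
	"fmt"
	"strings"
)

// node is one radix tree node (hypothetical type for illustration).
// prefix is the label on the edge leading into this node; children are
// indexed by the first byte of their prefix.
type node struct {
	prefix   string
	value    interface{}
	hasValue bool
	children map[byte]*node
}

func newNode(prefix string) *node {
	return &node{prefix: prefix, children: map[byte]*node{}}
}

// commonPrefixLen returns how many leading bytes a and b share.
func commonPrefixLen(a, b string) int {
	i := 0
	for i < len(a) && i < len(b) && a[i] == b[i] {
		i++
	}
	return i
}

// Insert walks from the root, consuming the key edge by edge, and splits
// an edge when the key diverges partway through its label.
func (n *node) Insert(key string, value interface{}) {
	cur := n
	for len(key) > 0 {
		child, ok := cur.children[key[0]]
		if !ok {
			// No edge starts with this byte: attach the rest as a new leaf.
			leaf := newNode(key)
			leaf.value, leaf.hasValue = value, true
			cur.children[key[0]] = leaf
			return
		}
		cp := commonPrefixLen(key, child.prefix)
		if cp < len(child.prefix) {
			// Split: a new intermediate node takes the shared part of the label.
			mid := newNode(child.prefix[:cp])
			child.prefix = child.prefix[cp:]
			mid.children[child.prefix[0]] = child
			cur.children[key[0]] = mid
			child = mid
		}
		key = key[cp:]
		cur = child
	}
	cur.value, cur.hasValue = value, true
}

// Get follows edges using the remaining key; the number of iterations is
// bounded by the key length.
func (n *node) Get(key string) (interface{}, bool) {
	cur := n
	for len(key) > 0 {
		child, ok := cur.children[key[0]]
		if !ok || !strings.HasPrefix(key, child.prefix) {
			return nil, false
		}
		key = key[len(child.prefix):]
		cur = child
	}
	return cur.value, cur.hasValue
}

func main() {
	root := newNode("")
	root.Insert("foobar", 1)
	root.Insert("foobaz", 2) // splits the "foobar" edge into "fooba" + "r"/"z"
	root.Insert("zip", 3)

	v, ok := root.Get("foobaz")
	fmt.Println(v, ok) // 2 true
	_, ok = root.Get("fooba")
	fmt.Println(ok) // false: "fooba" is an internal node with no value
}
```

The keys match the Radix Structure slide, so inserting foobaz after foobar produces the shared "fooba" node shown there.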

  11. Use Cases at HashiCorp
    • Consul / Vault ACLs
    • Vault Request Routing
    • CLI Library
    • etcetera

  12. Vault ACLs
    path "secret/*" {
      capabilities = ["read"]
    }
    path "secret/child" {
      capabilities = ["read", "write"]
    }
    path "mysql/creds/*" {
      capabilities = ["read"]
    }

  13. ACL Structure
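    [Diagram: root has two edges, "secret/" and "mysql/creds/". The
    "mysql/creds/" edge leads to ["read"]. The "secret/" node has a (nil)
    edge leading to ["read"] and a "child" edge leading to ["read", "write"].]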
  14. Vault Request Routing
    $ vault mount -path=other generic
    Successfully mounted 'generic' at 'other'!
    $ vault mount aws
    Successfully mounted 'aws' at 'aws'!

  15. Routing Structure
    [Diagram: root has three edges, "aws/", "secret/" and "other/". "aws/"
    leads to the AWS Backend; "secret/" and "other/" each lead to a Generic
    Backend.]

  16. Request Routing
    • $ vault read secret/foobar
    • Uses the longest prefix (secret/*) on ACLs to determine which policy is
    applicable and if the operation should be allowed
    • Uses the Routing tree to find longest prefix (secret/) to determine the
    backend that services the request
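
A longest-prefix lookup for routing might look like the sketch below. To stay short it uses a plain trie with one byte per edge rather than a compressed radix tree, and it is not Vault's router; `trieNode`, `Mount`, and `LongestPrefix` are hypothetical names, and the backend names are only labels.

```go
package main

import "fmt"

// trieNode is a plain (uncompressed) trie node: one byte per edge.
// A radix tree would collapse single-child chains into one edge.
type trieNode struct {
	children map[byte]*trieNode
	backend  string // hypothetical backend label; empty means no mount ends here
}

func newTrieNode() *trieNode {
	return &trieNode{children: map[byte]*trieNode{}}
}

// Mount registers a backend under a path prefix such as "secret/".
func (t *trieNode) Mount(prefix, backend string) {
	cur := t
	for i := 0; i < len(prefix); i++ {
		next, ok := cur.children[prefix[i]]
		if !ok {
			next = newTrieNode()
			cur.children[prefix[i]] = next
		}
		cur = next
	}
	cur.backend = backend
}

// LongestPrefix walks the path byte by byte and remembers the deepest
// mount seen so far, so the most specific prefix wins.
func (t *trieNode) LongestPrefix(path string) (prefix, backend string, ok bool) {
	cur := t
	for i := 0; i < len(path); i++ {
		next, found := cur.children[path[i]]
		if !found {
			break
		}
		cur = next
		if cur.backend != "" {
			prefix, backend, ok = path[:i+1], cur.backend, true
		}
	}
	return prefix, backend, ok
}

func main() {
	router := newTrieNode()
	router.Mount("secret/", "generic")
	router.Mount("other/", "generic")
	router.Mount("aws/", "aws")

	prefix, backend, ok := router.LongestPrefix("secret/foobar")
	fmt.Println(prefix, backend, ok) // secret/ generic true
}
```

The same lookup applied to a tree of ACL paths would select the most specific policy for a request path.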

  17. Immutable Radix Tree

  18. Immutability
    • The inability to be changed, i.e. not mutable
    • Every modification returns a new tree, existing tree is unmodified
    • Uses more memory, reduces need for read coordination

  19. Immutable Radix
    • Same operations and properties of mutable Radix
    • Every modification returns a new root
    • Mutable: Insert(root, key, value) = (void)
    • Immutable: Insert(root, key, value) = root’

  20. Copy On Write
    • Any time a node or leaf is going to be modified, we copy the node and
    update the copy
    • K nodes updated per modification
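
Copy-on-write can be illustrated with path copying. The sketch below is an assumption-laden simplification: it uses a one-byte-per-edge trie instead of a compressed radix tree, and `inode`, `Insert`, and `Get` are hypothetical names rather than the go-immutable-radix API. Only the nodes on the path to the key are copied; everything else is shared between the old and new roots.

```go
package main

import "fmt"

// inode is an immutable trie node (hypothetical type). Once a node is
// reachable from a returned root it is never modified again.
type inode struct {
	children map[byte]*inode
	value    string
	hasValue bool
}

// clone makes a shallow copy: the child map is new, but the child nodes
// themselves are shared with the original. A nil receiver yields an
// empty node, which lets Insert create missing path nodes the same way.
func (n *inode) clone() *inode {
	c := &inode{children: map[byte]*inode{}}
	if n == nil {
		return c
	}
	c.value, c.hasValue = n.value, n.hasValue
	for b, child := range n.children {
		c.children[b] = child
	}
	return c
}

// Insert returns a new root and leaves the tree under root untouched.
// It copies one node per byte of the key, plus the root.
func Insert(root *inode, key, value string) *inode {
	newRoot := root.clone()
	cur := newRoot
	for i := 0; i < len(key); i++ {
		child := cur.children[key[i]].clone()
		cur.children[key[i]] = child
		cur = child
	}
	cur.value, cur.hasValue = value, true
	return newRoot
}

// Get reads from whichever root it is given; no locking is needed.
func Get(root *inode, key string) (string, bool) {
	cur := root
	for i := 0; i < len(key) && cur != nil; i++ {
		cur = cur.children[key[i]]
	}
	if cur == nil {
		return "", false
	}
	return cur.value, cur.hasValue
}

func main() {
	v1 := Insert(nil, "secret/child", "read,write")
	v1 = Insert(v1, "mysql/creds/", "read")
	v2 := Insert(v1, "secret/child", "read,write,delete")

	before, _ := Get(v1, "secret/child")
	after, _ := Get(v2, "secret/child")
	fmt.Println(before) // read,write
	fmt.Println(after)  // read,write,delete

	// The untouched "mysql/creds/" subtree is shared, not copied.
	fmt.Println(v1.children['m'] == v2.children['m']) // true
}
```

Readers that still hold `v1` keep seeing the old value, which is the property the transaction slides later rely on.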

  21. Original Tree
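    [Diagram: root has two edges, "secret/" and "mysql/creds/". The
    "mysql/creds/" edge leads to ["read"]. The "secret/" node has a (nil)
    edge leading to ["read"] and a "child" edge leading to ["read", "write"].]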
  22. Update secret/child
    [Diagram: the original tree, plus a new leaf ["read", "write", "delete"]
    created as the updated copy of the "child" leaf.]
  23. Update secret/child
    [Diagram: the original tree, plus a copied "secret/" node whose "child"
    edge leads to the new leaf ["read", "write", "delete"] and whose (nil)
    edge leads to the existing ["read"] leaf.]

  24. Update secret/child
    [Diagram: the original tree, plus a new root' whose "secret/" edge leads
    to the copied "secret/" node, which leads to the new leaf
    ["read", "write", "delete"] and the existing (nil) leaf ["read"].]

  25. Update secret/child
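    [Diagram: the resulting tree under root'. The "mysql/creds/" edge leads
    to ["read"]. The "secret/" node has a (nil) edge leading to ["read"] and
    a "child" edge leading to ["read", "write", "delete"].]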
  26. Immutable vs Mutable
    • Mutable Radix requires synchronization for reads/writes
    • Concurrent reads allowed
    • Concurrent read/writes disallowed
    • Immutable Radix requires synchronization for writes only
    • Concurrent read/writes allowed
    • Each write returns a new tree, existing tree is unmodified
    • Good for heavy read, low write workloads

  27. Use Cases at HashiCorp
    • MemDB (Consul, Nomad, Docker Swarm)
    • Vault Enterprise

  28. Transactions

  29. Transaction
    • Standard usage is RDBMS (ACID)
    • Atomicity: Completely fails or completely succeeds
    • Consistency: Does not result in any integrity violations (e.g. a User ID
    does not map to a blank e-mail)
    • Isolation: Transaction is not visible to others until completed
    • Durability: Once completed, the changes are permanent

  30. Immutable Radix
    • We can use an immutable radix tree to implement in-memory
    transactions!
    • Provides us with A and I properties
    • Consistency is domain specific
    • In-memory only, so not Durable in the ACID sense
    • Can be used to build ACID system (e.g. Consul, Nomad)

  31. Atomicity and Isolation
    • Many keys can be Created, Updated, Deleted in a single transaction
    • Atomicity: transaction creates new root on commit, retains existing root
    on abort. Check-And-Set (CAS) operation to swap root pointers.
    • Isolation: Copy-On-Write of each transaction prevents readers of the
    existing root from witnessing any of the changes.

  32. MemDB

  33. MemDB Goals
    • MVCC: Multi-Version Concurrency Control. Support multiple versions of an
    object so that you can have concurrent read/writes.
    • Transaction Support: Update many objects in a transaction to support
    richer high level APIs. Should be atomic and isolated.
    • Rich Indexing: Allow a single object to be indexed in multiple ways (e.g.
    User ID, email, DOB, etc)

  34. Why those requirements?
    • Consul needs to be able to snapshot current state to disk while accepting
    new writes. Long running read cannot block writes.
    • A single event such as a node failure may need to update multiple pieces
    of state (Health Checks, Sessions, K/V locks)
    • Many different query paths. Services by node, services by name, services in
    a failing state, etc.

  35. MemDB Structure
    [Diagram: MemDB, containing a Root Tree, a Write Lock, and a Schema.]

  36. Schema
    • Schema defines tables and indexes at creation time
    • Allows for efficient storage and indexing of objects
    • Sanity checking of objects (ensure Consistency)

  37. Example Schema
    &DBSchema{
        Tables: map[string]*TableSchema{
            "people": &TableSchema{
                Name: "people",
                Indexes: map[string]*IndexSchema{
                    // Primary index: unique ID per object
                    "id": &IndexSchema{
                        Name:    "id",
                        Unique:  true,
                        Indexer: &UUIDFieldIndex{Field: "ID"},
                    },
                    // Secondary, non-unique indexes
                    "name": &IndexSchema{
                        Name:    "name",
                        Indexer: &StringFieldIndex{Field: "Name"},
                    },
                    "email": &IndexSchema{
                        Name:    "email",
                        Indexer: &StringFieldIndex{Field: "Email"},
                    },
                },
            },
        },
    }

  38. MemDB Tree Structure
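    [Diagram: the Root Tree has three edges, person_id, person_name and
    person_email, each leading to an index tree. The index trees are keyed by
    abc123…, Armon and [email protected] respectively, and all point to the
    same object: ID: abc123…, Name: Armon, Email: [email protected]]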

  39. MemDB Tree Structure
    • Each table has a primary tree, keyed by a unique ID
    • Each table can have 0+ indexes, unique or non-unique
    • Single copy of the object is stored in the primary tree, indexes point to the
    object

  40. Indexes
    • Each index has an Indexer which extracts a value from an object and turns it
    into an index key
    • StringFieldIndex: Extracts string value field
    • UUIDFieldIndex: Extracts string or []byte field
    • FieldSetIndex: Checks if a field has non-zero value (is set)
    • ConditionalIndex: Extracts field as boolean value
    • CompoundIndex: Combines multiple indexes

  41. Compound Index
    • CompoundIndex{StringFieldIndex{"First"},
    StringFieldIndex{"Last"}}
    • Extracts {"First": "Armon", "Last": "Dadgar"} as
    "Armon\x00Dadgar\x00"
    • Queries like "first = 'Armon' and last starts with 'D'"

  42. Read-only Transactions
    • Snapshot MemDB, retain a copy of the root pointer
    • Read against the Snapshot
    • Immutable trees let us avoid locking across reads and give us isolation
    from other transactions

  43. Read-only Transaction
    [Diagram: MemDB (Root Tree, Write Lock, Schema) with a Read Txn
    referencing the Root Tree.]

  44. Mixed Transactions
    • Acquire the write lock, serializes writes
    • Write to the root, creating a new root
    • Atomic swap the root pointers on commit, do nothing on abort
    • Release the write lock
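
The four steps above, together with lock-free snapshots for readers, can be sketched like this. It is not the go-memdb implementation: `DB`, `Txn`, and their methods are hypothetical, and a fully copied Go map stands in for the immutable radix tree, so starting a write costs O(N) here instead of copying only the touched path.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// state is one immutable version of the data. A copied map stands in
// for an immutable radix tree root (an assumption made for brevity).
type state map[string]string

// DB holds the current root and the lock that serializes writers.
type DB struct {
	root   atomic.Value // always holds a state
	writer sync.Mutex
}

// Txn is a write transaction working on a private copy of the state.
// It must not be used after Commit or Abort.
type Txn struct {
	db      *DB
	working state
}

func NewDB() *DB {
	db := &DB{}
	db.root.Store(state{})
	return db
}

// Snapshot returns the current root. Readers take no lock, and the
// returned version is never modified afterwards.
func (db *DB) Snapshot() state {
	return db.root.Load().(state)
}

// Begin acquires the write lock and prepares a new, private root.
func (db *DB) Begin() *Txn {
	db.writer.Lock()
	old := db.Snapshot()
	working := make(state, len(old))
	for k, v := range old {
		working[k] = v
	}
	return &Txn{db: db, working: working}
}

func (t *Txn) Set(key, value string) { t.working[key] = value }
func (t *Txn) Delete(key string)     { delete(t.working, key) }

// Commit atomically publishes the new root, then releases the lock.
func (t *Txn) Commit() {
	t.db.root.Store(t.working)
	t.db.writer.Unlock()
}

// Abort drops the private root; the published root is untouched.
func (t *Txn) Abort() {
	t.db.writer.Unlock()
}

func main() {
	db := NewDB()

	txn := db.Begin()
	txn.Set("node1/health", "passing")
	txn.Set("node1/session", "abc")
	before := db.Snapshot() // taken mid-transaction: sees none of the writes
	txn.Commit()
	fmt.Println(len(before), len(db.Snapshot())) // 0 2

	txn = db.Begin()
	txn.Delete("node1/session")
	txn.Abort()
	fmt.Println(len(db.Snapshot())) // 2
}
```

Both writes become visible together at the root swap, and the snapshot taken before the commit keeps reading the old version.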

  45. Mixed Transaction (Progress)
    [Diagram: MemDB (Root Tree, Write Lock, Schema) alongside a Write Txn
    and its New Root.]

  46. Mixed Transaction (Commit)
    [Diagram: MemDB (Root Tree, Write Lock, Schema) alongside a Write Txn
    and its New Root.]

  47. Mixed Transaction (Abort)
    [Diagram: MemDB (Root Tree, Write Lock, Schema) alongside a Write Txn
    and its New Root.]

  48. Use Cases
    • Consul
    • Nomad
    • Docker Swarm

  49. Consensus Based Systems
    [Diagram: API, Raft Log, MemDB, and Raft Snapshot, connected by arrows
    labeled write, stage, apply, read, and snapshot.]

  50. MemDB
    • Allows highly concurrent reads to state
    • Long running reads to snapshot without blocking writes
    • Single threaded writer from Raft has no write contention
    • Raft ensures consistent state for all copies of MemDB

  51. Nomad Advanced Usage
    • Schedulers use snapshots of state to determine placement
    • Leader provides coordination through evaluation queue and plan queue
    • Evaluation Queue: Dequeues work to schedulers, provides at-least-once
    semantics
    • Plan Queue: Controls placement to prevent data races and
    over-allocation

  52. Plan Queue
    • Receives placement plans from schedulers
    • Verifies plan and writes to Raft to commit the plan
    • Read, Verify, Write loop causes a stall while we are waiting for Raft to
    commit
    • MemDB allows us to optimistically evaluate plans while we wait!

  53. No Overlapping
    [Timeline diagram with the labels: Verify Plan 1, Apply Plan 1, Stall,
    Verify Plan 2, Apply Plan 2.]

  54. Plan Overlapping
    [Timeline diagram with the labels: Verify Plan 1, Apply Plan 1, Stall,
    Verify Plan 2, Apply Plan 2.]

  55. Plan Overlapping
    • Plan 1 is applied to a snapshot of the state
    • Plan 2 is verified against the optimistic state copy
    • Once plan 1 commits, we can submit plan 2
    • Allows CPU to verify plan while waiting on I/O to apply writes

  56. Conclusion

  57. Radix Trees
    • High performance tree data structure
    • Usually comparable to Hash Tables in performance, with a richer set of
    supported operations
    • I’ve used them in probably every project I’ve ever worked on

  58. Immutable Radix Trees
    • Similar to mutable radix tree
    • Simplifies concurrency
    • Allows for highly scalable reads

  59. MemDB
    • Abstracts radix trees to provide object store
    • Provides MVCC, transactions, and rich indexing
    • Simplifies complex state management
    • Allows for highly scalable reads

  60. Thanks!
    go-radix: https://github.com/armon/go-radix
    go-immutable-radix: https://github.com/hashicorp/go-immutable-radix
    MemDB: https://github.com/hashicorp/go-memdb
    Q/A
