
Heart of the SwarmKit: Distributed Data Store

Aaron Lehmann
October 07, 2016

Transcript

  1. What we store
     • State of the cluster
     • User-defined configuration
     • Organized into objects:
       ◦ Cluster
       ◦ Node
       ◦ Service
       ◦ Task
       ◦ Network
       ◦ etc.
  2. Why embed the distributed data store?
     • Ease of setup
     • Fewer round trips
     • Can maintain local indices
  3. In-memory data structures
     • Objects are protocol buffer messages
     • go-memdb is used as the in-memory database: https://github.com/hashicorp/go-memdb
     • Underlying data structure: radix trees
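     A minimal sketch of a go-memdb setup, with an illustrative table and
     index (not SwarmKit's actual schema):

       package main

       import (
           "fmt"

           memdb "github.com/hashicorp/go-memdb"
       )

       type Task struct {
           ID string
       }

       func main() {
           // One "task" table, indexed by the ID field.
           schema := &memdb.DBSchema{
               Tables: map[string]*memdb.TableSchema{
                   "task": {
                       Name: "task",
                       Indexes: map[string]*memdb.IndexSchema{
                           "id": {
                               Name:    "id",
                               Unique:  true,
                               Indexer: &memdb.StringFieldIndex{Field: "ID"},
                           },
                       },
                   },
               },
           }

           db, err := memdb.NewMemDB(schema)
           if err != nil {
               panic(err)
           }

           // Write transaction: insert an object and commit.
           txn := db.Txn(true)
           if err := txn.Insert("task", &Task{ID: "abcd"}); err != nil {
               panic(err)
           }
           txn.Commit()

           // Read transaction: look the object up through the index.
           raw, _ := db.Txn(false).First("task", "id", "abcd")
           fmt.Println(raw.(*Task).ID) // abcd
       }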
  4. Radix trees for indexing
     (Diagram: a radix tree whose keys index the same objects two ways, by
     ID: id:abcd, id:efgh, id:ijkl, id:mnop, and by node assignment:
     node:1234:abcd, node:1234:efgh, node:5678:ijkl, node:5678:mnop; shared
     prefixes such as node:1234 and node:5678 become internal tree nodes.)
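     The same two-way indexing can be sketched with go-immutable-radix, the
     radix-tree package underneath go-memdb (keys taken from the diagram):

       package main

       import (
           "fmt"

           iradix "github.com/hashicorp/go-immutable-radix"
       )

       func main() {
           // Every Insert returns a new immutable tree; the old root is untouched.
           tree := iradix.New()
           for _, key := range []string{
               "id:abcd", "id:efgh", "id:ijkl", "id:mnop",
               "node:1234:abcd", "node:1234:efgh",
               "node:5678:ijkl", "node:5678:mnop",
           } {
               tree, _, _ = tree.Insert([]byte(key), nil)
           }

           // Prefix walk: enumerate the tasks assigned to node 1234.
           tree.Root().WalkPrefix([]byte("node:1234:"), func(k []byte, v interface{}) bool {
               fmt.Println(string(k)) // node:1234:abcd, node:1234:efgh
               return false           // false = keep walking
           })
       }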
  5. Transactions
     • We provide a transactional interface to read or write data in the store
     • Read transactions are just atomic snapshots
     • Write transaction:
       ◦ Take a snapshot
       ◦ Make changes
       ◦ Replace the tree root with the modified tree's root (atomic pointer swap; see the sketch below)
     • Only one write transaction is allowed at once
     • Commit of a write transaction blocks until the changes are committed to Raft
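     A sketch of the snapshot/pointer-swap mechanic on top of the immutable
     radix tree (simplified; as noted above, real commits also block on Raft):

       package main

       import (
           "sync/atomic"

           iradix "github.com/hashicorp/go-immutable-radix"
       )

       func main() {
           // The store holds a single root pointer to an immutable tree.
           var root atomic.Value
           root.Store(iradix.New())

           // Read transaction: loading the root yields a consistent snapshot,
           // because committed writes only ever swap in a whole new root.
           snapshot := root.Load().(*iradix.Tree)

           // Write transaction (one at a time): build a modified copy of the
           // tree, then publish it with a single atomic store of the new root.
           newTree, _, _ := snapshot.Insert([]byte("id:qrst"), nil)
           root.Store(newTree)
       }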
  6. Transaction example: Read

       dataStore.View(func(tx store.ReadTx) {
           tasks, err = store.FindTasks(tx, store.ByServiceID(serviceID))
           if err == nil {
               for _, t := range tasks {
                   fmt.Println(t.ID)
               }
           }
       })
  7. Transaction example: Write

       err := dataStore.Update(func(tx store.Tx) error {
           t := store.GetTask(tx, "id1")
           if t == nil {
               return errors.New("task not found")
           }
           t.DesiredState = api.TaskStateRunning
           return store.UpdateTask(tx, t)
       })
  8. Watches
     • Code can register to receive specific creation, update, or deletion events on a Go channel
     • Selectors on particular fields in the objects
     • Currently an internal feature; it will be exposed through the API in the future
  9. Watches

       watch, cancelWatch = state.Watch(
           r.store.WatchQueue(),
           state.EventUpdateTask{
               Task: &api.Task{
                   ID:     oldTask.ID,
                   Status: api.TaskStatus{State: api.TaskStateRunning},
               },
               Checks: []state.TaskCheckFunc{
                   state.TaskCheckID,
                   state.TaskCheckStateGreaterThan,
               },
           },
           ...
  10. Watches (continued)

           state.EventUpdateNode{
               Node: &api.Node{
                   ID:     oldTask.NodeID,
                   Status: api.NodeStatus{State: api.NodeStatus_DOWN},
               },
               Checks: []state.NodeCheckFunc{
                   state.NodeCheckID,
                   state.NodeCheckState,
               },
           },
           state.EventDeleteNode{
               Node:   &api.Node{ID: oldTask.NodeID},
               Checks: []state.NodeCheckFunc{state.NodeCheckID},
           },
       })
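      Events then arrive on the returned channel. A minimal consumption loop
      might look like this (a sketch; the exact event plumbing is internal to
      SwarmKit):

        for event := range watch {
            switch v := event.(type) {
            case state.EventUpdateTask:
                fmt.Println("task updated:", v.Task.ID)
            case state.EventUpdateNode:
                fmt.Println("node updated:", v.Node.ID)
            case state.EventDeleteNode:
                fmt.Println("node deleted:", v.Node.ID)
            }
        }
        // cancelWatch() unregisters the watch when it is no longer needed.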
  11. Replication
      • Only the Raft leader does writes
      • During a write transaction, every change is logged as well as applied to the radix tree
      • The transaction log is serialized and replicated through Raft
      • Since our internal types are protobuf types, serialization is very easy
      • Followers replay the log entries into their radix trees (see the sketch below)
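      A sketch of the follower side; the storeAction shape below is purely
      illustrative (SwarmKit's real wire types live in its api package), and
      only Task is shown:

        // Hypothetical log-entry shape, for illustration only.
        type storeAction struct {
            kind string    // "create", "update", or "remove"
            task *api.Task // the affected object
        }

        // replay applies replicated actions to the local store, rebuilding
        // the same radix tree the leader produced.
        func replay(tx store.Tx, actions []storeAction) error {
            for _, a := range actions {
                switch a.kind {
                case "create":
                    if err := store.CreateTask(tx, a.task); err != nil {
                        return err
                    }
                case "update":
                    if err := store.UpdateTask(tx, a.task); err != nil {
                        return err
                    }
                case "remove":
                    if err := store.DeleteTask(tx, a.task.ID); err != nil {
                        return err
                    }
                }
            }
            return nil
        }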
  12. Sequencer
      • Every object in the store has a Version field
      • Version stores the Raft index when the object was last updated
      • Updates must provide a base Version; they are rejected if it is out of date
      • Similar to compare-and-swap (CAS)
      • Also exposed through API calls that change objects in the store
  13. Sequencer example: update accepted
      Original object:  Service ABC, Spec { Replicas = 4, Image = registry:2.3.0, ... }, Version = 189
      Update request:   Service ABC, Spec { Replicas = 4, Image = registry:2.4.0, ... }, Version = 189
      The request's base Version (189) matches the stored object's, so the update is accepted.
  14. Sequencer example: update rejected
      Updated object:   Service ABC, Spec { Replicas = 4, Image = registry:2.4.0, ... }, Version = 190
      Update request:   Service ABC, Spec { Replicas = 5, Image = registry:2.3.0, ... }, Version = 189
      The request's base Version (189) no longer matches the stored object's (190), so this update is rejected.
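      The check itself is CAS-like. A minimal sketch, assuming the Version
      field carries the Raft index as described above (not SwarmKit's exact
      code):

        // updateService applies an update only if the caller's base version
        // still matches the version of the stored object.
        func updateService(tx store.Tx, update *api.Service) error {
            current := store.GetService(tx, update.ID)
            if current == nil {
                return errors.New("service not found")
            }
            if update.Meta.Version.Index != current.Meta.Version.Index {
                return errors.New("update out of sequence: base version is stale")
            }
            // Accepted: committing through Raft will stamp the new Raft
            // index into the object's Version.
            return store.UpdateService(tx, update)
        }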
  15. Write batching
      • Every write transaction involves a Raft round trip to get consensus
      • Doing many small transactions is costly, but we also want to limit the size of each write to Raft
      • The Batch primitive lets the store automatically split a group of changes across multiple writes to Raft
  16. Write batching

       _, err = d.store.Batch(func(batch *store.Batch) error {
           for _, n := range nodes {
               err := batch.Update(func(tx store.Tx) error {
                   node := store.GetNode(tx, n.ID)
                   node.Status = api.NodeStatus{
                       State:   api.NodeStatus_UNKNOWN,
                       Message: `Node moved to "unknown" state`,
                   }
                   return store.UpdateNode(tx, node)
               })
               if err != nil {
                   return err
               }
           }
           return nil
       })