
Heart of the SwarmKit: Topology Management

Presented at the Docker Distributed Systems Summit, Berlin

Deep dive into the SwarmKit Topology Management internals.

Andrea Luzzardi

October 07, 2016

Transcript

  1. Push vs Pull
     [Diagram] Push: the Worker registers (1) with a discovery service (ZooKeeper), the Manager discovers (2) the Worker through it, then pushes the payload (3) to the Worker. Pull: the Worker connects to the Manager, and registration and payload travel over that single connection.
  2. Push vs Pull
     Push
     • Pros: Provides better control over the communication rate
       − Managers decide when to contact Workers
     • Cons: Requires a discovery mechanism
       − More failure scenarios
       − Harder to troubleshoot
     Pull
     • Pros: Simpler to operate
       − Workers connect to Managers and don't need to bind
       − Can easily traverse networks
       − Easier to secure
       − Fewer moving parts
     • Cons: Workers must maintain a connection to Managers at all times
  3. Push vs Pull
     • SwarmKit adopted the Pull model
     • Favored operational simplicity
     • Engineered solutions to provide rate control in pull mode
  4. Rate Control: Heartbeats
     • Manager dictates the heartbeat rate to Workers
     • Rate is configurable
     • Managers agree on the same rate by consensus (Raft)
     • Managers add jitter so pings are spread over time (avoids bursts); a sketch follows below
     [Diagram] Worker asks "Ping?" and the Manager replies "Pong! Ping me back in 5.2 seconds"
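
     A minimal Go sketch of the jittered heartbeat scheduling described above. The base rate, the 25% jitter factor, and the function names are assumptions for illustration; this is not SwarmKit's actual dispatcher code.

```go
// Sketch: a manager computes the delay it hands back to a worker after a ping.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// nextHeartbeat returns the delay the worker should wait before pinging again.
// A random jitter (up to 25% of the base period here, an assumed factor) is
// added so that pings from many workers spread out instead of arriving in bursts.
func nextHeartbeat(base time.Duration) time.Duration {
	jitter := time.Duration(rand.Int63n(int64(base) / 4))
	return base + jitter
}

func main() {
	base := 5 * time.Second // hypothetical cluster-wide rate agreed on by the managers
	for i := 0; i < 3; i++ {
		fmt.Printf("ping me back in %v\n", nextHeartbeat(base))
	}
}
```
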
  5. Rate Control: Workloads
     • Worker opens a gRPC stream to receive workloads
     • Manager can send data whenever it wants to
     • Manager sends data in batches
     • Changes are buffered and sent in batches of 100 or every 100 ms, whichever occurs first (sketched below)
     • Adds little delay (at most 100 ms) but drastically reduces the amount of communication
     [Diagram] Worker: "Give me work to do". Manager replies over time: 100 ms [Batch of 12], 200 ms [Batch of 26], 300 ms [Batch of 32], 340 ms [Batch of 100], 360 ms [Batch of 100], 460 ms [Batch of 42], 560 ms [Batch of 23]
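
     A minimal sketch of the "batch of 100 or every 100 ms, whichever occurs first" buffering described above. The Assignment type and the flush callback are stand-ins for the real dispatcher's types and for the write onto the worker's gRPC stream.

```go
package main

import (
	"fmt"
	"time"
)

// Assignment is a hypothetical stand-in for a change the manager streams to a worker.
type Assignment struct{ ID string }

const (
	maxBatch = 100
	maxDelay = 100 * time.Millisecond
)

// batcher drains changes from in and hands them to flush in batches of at most
// maxBatch items, waiting at most maxDelay after the first buffered change
// before flushing a partial batch.
func batcher(in <-chan Assignment, flush func([]Assignment)) {
	var buf []Assignment
	timer := time.NewTimer(maxDelay)
	defer timer.Stop()

	for {
		select {
		case a, ok := <-in:
			if !ok {
				if len(buf) > 0 {
					flush(buf)
				}
				return
			}
			if len(buf) == 0 {
				// First change of a new batch: restart the delay window.
				if !timer.Stop() {
					select {
					case <-timer.C:
					default:
					}
				}
				timer.Reset(maxDelay)
			}
			buf = append(buf, a)
			if len(buf) >= maxBatch {
				flush(buf)
				buf = nil
			}
		case <-timer.C:
			if len(buf) > 0 {
				flush(buf)
				buf = nil
			}
		}
	}
}

func main() {
	in := make(chan Assignment)
	go func() {
		for i := 0; i < 250; i++ {
			in <- Assignment{ID: fmt.Sprintf("task-%d", i)}
		}
		close(in)
	}()
	batcher(in, func(b []Assignment) {
		fmt.Printf("sent batch of %d assignments\n", len(b))
	})
}
```
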
  6. Replication
     • Worker can connect to any Manager
     • Followers will forward traffic to the Leader
     [Diagram] Three Managers (one Leader, two Followers) with a connected Worker
  7. Replication
     • Followers multiplex all Workers to the Leader using a single connection
     • Backed by gRPC channels (HTTP/2 streams)
     • Reduces Leader networking load by spreading the connections evenly
     Example: on a cluster with 10,000 workers and 5 managers, each manager only has to handle about 2,000 connections. Each follower forwards its 2,000 workers to the leader over a single socket (the shared-connection idea is sketched below).
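
     A minimal sketch of that shared-connection idea, with assumed names and a hypothetical leader address; it is not SwarmKit's actual raft proxy. It only shows a follower keeping one lazily dialed *grpc.ClientConn that every forwarded worker session would reuse, with gRPC multiplexing the concurrent calls as HTTP/2 streams over that single connection.

```go
package main

import (
	"log"
	"sync"

	"google.golang.org/grpc"
)

var (
	leaderOnce sync.Once
	leaderConn *grpc.ClientConn
)

// leader returns the single shared connection to the current leader,
// dialing it lazily the first time it is needed.
func leader(addr string) *grpc.ClientConn {
	leaderOnce.Do(func() {
		var err error
		// grpc.WithInsecure() keeps the sketch short; a real deployment
		// would use TLS credentials here instead.
		leaderConn, err = grpc.Dial(addr, grpc.WithInsecure())
		if err != nil {
			log.Fatalf("dial leader: %v", err)
		}
	})
	return leaderConn
}

func main() {
	// Every forwarded worker session would call leader("...") and open its
	// own stream on the returned connection; no extra sockets are created.
	conn := leader("10.0.0.1:4242") // hypothetical leader address
	defer conn.Close()
	log.Printf("connection state: %v", conn.GetState())
}
```
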
  8. Replication
     • Upon Leader failure, a new one is elected
     • All managers start redirecting worker traffic to the new Leader
     • Transparent to Workers
     [Diagram] Three Managers (Leader and two Followers) with two connected Workers
  9. Replication
     • Upon Leader failure, a new one is elected
     • All managers start redirecting worker traffic to the new Leader
     • Transparent to Workers
     [Diagram] The same cluster after the election, with a different Manager now acting as Leader
  10. Replication
      • Manager sends the list of all managers' addresses to Workers
      • When a new manager joins, all workers are notified
      • Upon manager failure, workers will reconnect to a different manager
      [Diagram] Three Managers (one Leader, two Followers); the Worker holds the list: Manager 1 Addr, Manager 2 Addr, Manager 3 Addr
  11. Replication
      • Manager sends the list of all managers' addresses to Workers
      • When a new manager joins, all workers are notified
      • Upon manager failure, workers will reconnect to a different manager
      [Diagram] The same cluster of three Managers and a Worker
  12. Replication
      • Manager sends the list of all managers' addresses to Workers
      • When a new manager joins, all workers are notified
      • Upon manager failure, workers will reconnect to a different manager (sketched below)
      [Diagram] The Worker reconnects to a random manager from its list
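
      A minimal sketch of that worker-side behavior, with assumed type and function names (remotes, update, pick) rather than SwarmKit's actual remotes package: the worker keeps the last manager list it was sent and, when its current manager fails, picks a random remaining address to reconnect to.

```go
package main

import (
	"fmt"
	"math/rand"
)

type remotes struct {
	addrs []string // last manager list received from a manager
}

// update replaces the known manager list whenever a manager pushes a new one
// (for example, after a manager joins or leaves the cluster).
func (r *remotes) update(addrs []string) {
	r.addrs = append([]string(nil), addrs...)
}

// pick returns a random manager address that is not the one that just failed.
func (r *remotes) pick(failed string) (string, bool) {
	candidates := make([]string, 0, len(r.addrs))
	for _, a := range r.addrs {
		if a != failed {
			candidates = append(candidates, a)
		}
	}
	if len(candidates) == 0 {
		return "", false
	}
	return candidates[rand.Intn(len(candidates))], true
}

func main() {
	r := &remotes{}
	r.update([]string{"manager-1:4242", "manager-2:4242", "manager-3:4242"})
	if next, ok := r.pick("manager-1:4242"); ok {
		fmt.Println("reconnecting to", next)
	}
}
```
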
  13. Replication
      • gRPC handles connection management
        − Exponential backoff, reconnection jitter, …
        − Avoids flooding managers on failover
        − Connections evenly spread across Managers
      • Manager Weights (sketched below)
        − Allows Manager prioritization / de-prioritization
        − Gracefully remove a Manager from rotation
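
      A minimal sketch of weighted manager selection, with assumed names and weight values; it is not the actual SwarmKit implementation, but it shows the idea: higher-weight managers are chosen more often, and a weight of zero takes a manager out of the rotation gracefully.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickWeighted chooses a manager address with probability proportional to its weight.
// Managers with a weight of zero (or less) are never selected.
func pickWeighted(weights map[string]int) (string, bool) {
	total := 0
	for _, w := range weights {
		if w > 0 {
			total += w
		}
	}
	if total == 0 {
		return "", false
	}
	n := rand.Intn(total)
	for addr, w := range weights {
		if w <= 0 {
			continue
		}
		if n < w {
			return addr, true
		}
		n -= w
	}
	return "", false // unreachable
}

func main() {
	weights := map[string]int{
		"manager-1:4242": 5, // normal priority
		"manager-2:4242": 5,
		"manager-3:4242": 0, // being drained out of rotation
	}
	if addr, ok := pickWeighted(weights); ok {
		fmt.Println("selected", addr)
	}
}
```
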
  14. Presence
      • Leader commits Worker state (Up vs Down) into Raft
        − Propagates to all managers
        − Recoverable in case of leader re-election
      • Heartbeat TTLs are kept in Leader memory
        − Too expensive to store "last ping time" in Raft: every ping would result in a quorum write
        − Leader keeps worker <-> TTL timers in a heap (time.AfterFunc)
        − Upon leader failover, workers are given a grace period to reconnect:
          • Workers are considered Unknown until they reconnect
          • If they do, they move back to Up
          • If they don't, they move to Down
      A sketch of the in-memory TTL bookkeeping follows below.
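
      A minimal sketch of the per-worker TTL timers, with assumed names (presence, heartbeat); it only models the time.AfterFunc expiration described above, with a print standing in for the Raft quorum write of the Down transition, and it does not model the failover grace period.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type presence struct {
	mu     sync.Mutex
	timers map[string]*time.Timer
}

// heartbeat records a ping from a worker and re-arms its TTL timer.
// If the timer fires before the next ping arrives, the worker is reported Down.
func (p *presence) heartbeat(worker string, ttl time.Duration) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if t, ok := p.timers[worker]; ok {
		t.Stop()
	}
	p.timers[worker] = time.AfterFunc(ttl, func() {
		// Stand-in for committing the Up -> Down state change into Raft.
		fmt.Printf("%s missed its heartbeat: marking Down\n", worker)
	})
}

func main() {
	p := &presence{timers: make(map[string]*time.Timer)}
	p.heartbeat("worker-1", 150*time.Millisecond)
	p.heartbeat("worker-2", 150*time.Millisecond)

	// worker-1 keeps pinging, worker-2 goes silent and expires.
	time.Sleep(100 * time.Millisecond)
	p.heartbeat("worker-1", 150*time.Millisecond)
	time.Sleep(200 * time.Millisecond)
}
```
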