Heart of the SwarmKit: Topology Management

Presented at the Docker Distributed Systems Summit, Berlin

Deep dive into the SwarmKit Topology Management internals.

Andrea Luzzardi

October 07, 2016

Transcript

  1. Heart of the SwarmKit: Topology Management
     Docker Distributed Systems Summit, 10.07.2016
     Andrea Luzzardi / al@docker.com / @aluzzardi, Docker Inc.
  2. Push vs Pull
     [Diagram: Push model: the Worker registers with ZooKeeper (1. Register), the Manager discovers Workers (2. Discover) and pushes the payload (3. Payload). Pull model: the Worker connects to the Manager directly, and Registration & Payload travel over that single connection.]
  3. Push vs Pull
     Push
     • Pros: provides better control over the communication rate
       − Managers decide when to contact Workers
     • Cons: requires a discovery mechanism
       − More failure scenarios
       − Harder to troubleshoot
     Pull
     • Pros: simpler to operate
       − Workers connect to Managers and don't need to bind
       − Can easily traverse networks
       − Easier to secure
       − Fewer moving parts
     • Cons: Workers must maintain a connection to Managers at all times
  4. Push vs Pull
     • SwarmKit adopted the Pull model
     • Favored operational simplicity
     • Engineered solutions to provide rate control in pull mode
  5. Rate Control: Heartbeats
     • Manager dictates the heartbeat rate to Workers
     • Rate is configurable
     • Managers agree on the same rate by consensus (Raft)
     • Managers add jitter so pings are spread over time (avoids bursts); see the sketch below
     [Diagram: Worker: "Ping?" / Manager: "Pong! Ping me back in 5.2 seconds"]
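
The heartbeat exchange above can be sketched in a few lines of Go. The names (heartbeatPeriod, nextHeartbeat) are made up for illustration and this is not SwarmKit's actual dispatcher code: the manager answers each ping with the delay the worker should wait before the next one, derived from the Raft-agreed rate plus random jitter.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// heartbeatPeriod stands in for the cluster-wide rate the managers agree on via Raft.
const heartbeatPeriod = 5 * time.Second

// nextHeartbeat returns how long a worker should wait before pinging again:
// the agreed period plus up to 25% random jitter, so pings from many workers
// spread out instead of arriving in bursts.
func nextHeartbeat() time.Duration {
	jitter := time.Duration(rand.Int63n(int64(heartbeatPeriod) / 4))
	return heartbeatPeriod + jitter
}

func main() {
	// A few sampled replies; every worker ends up with a slightly different delay.
	for i := 0; i < 3; i++ {
		fmt.Printf("Pong! Ping me back in %.1f seconds\n", nextHeartbeat().Seconds())
	}
}
```
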
  6. Rate Control: Workloads
     • Worker opens a gRPC stream to receive workloads
     • Manager can send data whenever it wants to
     • Manager will send data in batches
     • Changes are buffered and sent in batches of 100 or every 100 ms, whichever occurs first (sketched below)
     • Adds little delay (at most 100 ms) but drastically reduces the amount of communication
     [Diagram: Worker: "Give me work to do" / Manager: 100 ms [Batch of 12], 200 ms [Batch of 26], 300 ms [Batch of 32], 340 ms [Batch of 100], 360 ms [Batch of 100], 460 ms [Batch of 42], 560 ms [Batch of 23]]
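
The batching rule (100 changes or 100 ms, whichever comes first) is a classic buffer-and-flush loop. A minimal, self-contained Go sketch, not SwarmKit's actual dispatcher code:

```go
package main

import (
	"fmt"
	"time"
)

const (
	maxBatchSize  = 100                    // flush when this many changes are buffered...
	maxBatchDelay = 100 * time.Millisecond // ...or when this much time has passed
)

// batcher buffers changes and hands them to send in batches of at most
// maxBatchSize, flushing at least every maxBatchDelay.
func batcher(changes <-chan string, send func(batch []string)) {
	var batch []string
	timer := time.NewTimer(maxBatchDelay)
	defer timer.Stop()

	flush := func() {
		if len(batch) > 0 {
			send(batch)
			batch = nil
		}
	}

	for {
		select {
		case c, ok := <-changes:
			if !ok {
				flush() // source closed: deliver whatever is left
				return
			}
			batch = append(batch, c)
			if len(batch) >= maxBatchSize {
				flush()
				if !timer.Stop() {
					<-timer.C // drain if it fired while we were filling the batch
				}
				timer.Reset(maxBatchDelay)
			}
		case <-timer.C:
			flush()
			timer.Reset(maxBatchDelay)
		}
	}
}

func main() {
	changes := make(chan string)
	go func() {
		for i := 0; i < 250; i++ {
			changes <- fmt.Sprintf("task-%d", i)
		}
		close(changes)
	}()
	batcher(changes, func(batch []string) {
		fmt.Printf("[Batch of %d]\n", len(batch))
	})
}
```

With a fast producer this prints batches capped at 100; with a slow one the 100 ms timer drives the flushes instead, matching the timeline in the diagram above.
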
  7. Replication
     [Diagram: three Managers (Leader, Follower, Follower) and a Worker]
     • Worker can connect to any Manager
     • Followers will forward traffic to the Leader
  8. Replication
     [Diagram: Workers connected to Followers, Followers forwarding to the Leader]
     • Followers multiplex all workers to the Leader using a single connection (sketched below)
     • Backed by gRPC channels (HTTP/2 streams)
     • Reduces Leader networking load by spreading the connections evenly
     Example: on a cluster with 10,000 workers and 5 managers, each manager only has to handle about 2,000 connections. Each follower forwards its 2,000 workers to the leader using a single socket.
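
gRPC multiplexes every stream opened on a single *grpc.ClientConn as HTTP/2 streams over one connection, which is what lets a follower forward thousands of worker sessions through a single socket. A hedged sketch of that idea; the address and the "/Dispatcher/Session" method path are placeholders, not SwarmKit's real forwarding code:

```go
package main

import (
	"context"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// One connection per follower->leader pair; "leader.example:4242" is a placeholder.
	conn, err := grpc.Dial("leader.example:4242",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Every forwarded worker opens its own stream on the SAME ClientConn; gRPC
	// multiplexes them all as HTTP/2 streams over one socket to the leader.
	for i := 0; i < 2000; i++ {
		go func(workerID int) {
			desc := &grpc.StreamDesc{ClientStreams: true, ServerStreams: true}
			stream, err := conn.NewStream(context.Background(), desc, "/Dispatcher/Session")
			if err != nil {
				log.Printf("worker %d: %v", workerID, err)
				return
			}
			_ = stream // the follower would copy worker<->leader messages here
		}(i)
	}
	select {} // keep the sketch running
}
```
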
  9. Replication
     [Diagram: three Managers (Leader, Follower, Follower) with Workers attached]
     • Upon Leader failure, a new one is elected
     • All managers start redirecting worker traffic to the new one
     • Transparent to workers
  10. Replication
      (same bullets as the previous slide; the diagram now shows a different Manager as the Leader)
  11. Replication
      [Diagram: three Managers (one Leader, two Followers); the Worker holds the list: Manager 1 Addr, Manager 2 Addr, Manager 3 Addr]
      • Manager sends the list of all managers' addresses to Workers
      • When a new manager joins, all workers are notified
      • Upon manager failure, workers will reconnect to a different manager
  12. Replication
      (same bullets as the previous slide, with an updated diagram)
  13. Replication
      (same bullets as the previous slide)
      [Diagram: the Worker reconnects to a random manager; a sketch of that selection follows below]
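
Worker-side reconnection can be pictured as picking a random entry from the advertised manager list, skipping the address that just failed. The names and addresses below are hypothetical, not SwarmKit's agent code:

```go
package main

import (
	"fmt"
	"math/rand"
)

// managers is the address list the managers pushed to this worker; the
// addresses are placeholders for "Manager 1 Addr", "Manager 2 Addr", "Manager 3 Addr".
var managers = []string{"10.0.0.1:2377", "10.0.0.2:2377", "10.0.0.3:2377"}

// pickManager returns a random manager address, skipping the one that just failed.
func pickManager(failed string) string {
	candidates := make([]string, 0, len(managers))
	for _, m := range managers {
		if m != failed {
			candidates = append(candidates, m)
		}
	}
	if len(candidates) == 0 {
		return failed // nowhere else to go; retry the same address
	}
	return candidates[rand.Intn(len(candidates))]
}

func main() {
	// Manager 1 just went away: reconnect somewhere else at random.
	fmt.Println("reconnecting to", pickManager("10.0.0.1:2377"))
}
```
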
  14. Replication
      • gRPC handles connection management (sketched below)
        − Exponential backoff, reconnection jitter, …
        − Avoids flooding managers on failover
        − Connections evenly spread across Managers
      • Manager Weights
        − Allows Manager prioritization / de-prioritization
        − Gracefully remove a Manager from rotation
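
For the connection-management bullet, grpc-go exposes exponential backoff with jitter as dial options. The sketch below uses gRPC's default backoff parameters; note that the WithConnectParams option postdates this talk, so this is only an approximation of the behavior described, not SwarmKit's actual setup:

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/backoff"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	params := grpc.ConnectParams{
		Backoff: backoff.Config{
			BaseDelay:  1 * time.Second,   // first retry after about a second
			Multiplier: 1.6,               // grow the delay exponentially
			Jitter:     0.2,               // randomize so reconnects spread out
			MaxDelay:   120 * time.Second, // cap the delay
		},
		MinConnectTimeout: 20 * time.Second,
	}

	// "manager.example:2377" is a placeholder address.
	conn, err := grpc.Dial("manager.example:2377",
		grpc.WithConnectParams(params),
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```
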
  15. Presence
      • Leader commits Worker state (Up vs Down) into Raft
        − Propagates to all managers
        − Recoverable in case of leader re-election
      • Heartbeat TTLs kept in Leader memory (sketched below)
        − Too expensive to store "last ping time" in Raft: every ping would result in a quorum write
        − Leader keeps worker<->TTL in a heap (time.AfterFunc)
        − Upon leader failover, workers are given a grace period to reconnect
          • Workers are considered Unknown until they reconnect
          • If they do, they move back to Up
          • If they don't, they move to Down
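
The TTL bookkeeping can be sketched with time.AfterFunc, as the slide suggests: one timer per worker, reset on every heartbeat, firing only when the worker goes silent. Illustrative code, not the actual SwarmKit dispatcher (the real one also handles the Unknown grace period after failover):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// presence tracks worker liveness in memory on the leader; nothing here is
// written to Raft except the Up/Down transition reported through onDown.
type presence struct {
	mu     sync.Mutex
	timers map[string]*time.Timer
	ttl    time.Duration
	onDown func(workerID string)
}

func newPresence(ttl time.Duration, onDown func(string)) *presence {
	return &presence{timers: map[string]*time.Timer{}, ttl: ttl, onDown: onDown}
}

// Heartbeat records a ping from workerID and pushes its expiration out by ttl.
func (p *presence) Heartbeat(workerID string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if t, ok := p.timers[workerID]; ok {
		t.Stop() // the worker is alive: disarm the pending expiration
	}
	p.timers[workerID] = time.AfterFunc(p.ttl, func() {
		p.onDown(workerID) // only this state change would become a quorum write
	})
}

func main() {
	p := newPresence(150*time.Millisecond, func(id string) {
		fmt.Println(id, "is Down")
	})
	p.Heartbeat("worker-1")
	p.Heartbeat("worker-2")
	time.Sleep(100 * time.Millisecond)
	p.Heartbeat("worker-1")            // worker-1 pings again, its TTL is extended
	time.Sleep(400 * time.Millisecond) // no more pings: both eventually expire
}
```

In this run worker-2 expires first; worker-1's extra heartbeat pushes its expiration out by one more TTL before it too is reported Down.
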