Slide 1

Heart of the SwarmKit: Topology Management
Docker Distributed Systems Summit, 10.07.2016
Andrea Luzzardi / al@docker.com / @aluzzardi
Docker Inc.

Slide 2

Push vs Pull Model

Slide 3

Push vs Pull

Push (diagram): the Worker registers in ZooKeeper (1 - Register), the Manager discovers the Worker through ZooKeeper (2 - Discover), then the Manager pushes the payload to the Worker (3 - Payload).

Pull (diagram): the Worker connects directly to the Manager; registration and payload travel over the same connection.

Slide 4

Push vs Pull

Push
• Pros: provides better control over the communication rate
− Managers decide when to contact Workers
• Cons: requires a discovery mechanism
− More failure scenarios
− Harder to troubleshoot

Pull
• Pros: simpler to operate
− Workers connect to Managers and don't need to bind
− Can easily traverse networks
− Easier to secure
− Fewer moving parts
• Cons: Workers must maintain a connection to Managers at all times

Slide 5

Push vs Pull

• SwarmKit adopted the Pull model
• Favored operational simplicity
• Engineered solutions to provide rate control in pull mode

Slide 6

Rate Control
Controlling communication rate in a Pull model

Slide 7

Rate Control: Heartbeats

• Manager dictates the heartbeat rate to Workers
• Rate is configurable
• Managers agree on the same rate by consensus (Raft)
• Managers add jitter so pings are spread over time (avoiding bursts)

Diagram: Worker → Manager: "Ping?" / Manager → Worker: "Pong! Ping me back in 5.2 seconds"
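To make the mechanism concrete, here is a minimal Go sketch of a manager computing the delay it hands back in a heartbeat response. The constants, jitter fraction, and function names are illustrative only; in SwarmKit the period itself is agreed on by the managers through Raft.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// heartbeatPeriod stands in for the cluster-wide rate the managers agree on
// through Raft; here it is simply a constant.
const heartbeatPeriod = 5 * time.Second

// jitterFraction spreads the pings of many workers over time to avoid bursts.
const jitterFraction = 0.25

// nextHeartbeat computes the delay a manager would return in its heartbeat
// response: the base period plus a small random jitter.
func nextHeartbeat() time.Duration {
	maxJitter := int64(float64(heartbeatPeriod) * jitterFraction)
	return heartbeatPeriod + time.Duration(rand.Int63n(maxJitter))
}

func main() {
	// The worker is expected to ping back after this delay
	// ("Ping me back in 5.2 seconds" in the slide's diagram).
	fmt.Println("next heartbeat in", nextHeartbeat())
}
```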

Slide 8

Rate Control: Workloads

• Worker opens a gRPC stream to receive workloads
• Manager can send data whenever it wants to
• Manager sends data in batches
• Changes are buffered and sent in batches of 100 or every 100 ms, whichever occurs first
• Adds little delay (at most 100 ms) but drastically reduces the amount of communication

Diagram: Worker → Manager: "Give me work to do"; Manager → Worker:
100 ms - [Batch of 12]
200 ms - [Batch of 26]
300 ms - [Batch of 32]
340 ms - [Batch of 100]
360 ms - [Batch of 100]
460 ms - [Batch of 42]
560 ms - [Batch of 23]
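A minimal Go sketch of the batching rule described above: changes are buffered and flushed either when 100 of them have accumulated or when 100 ms have passed since the first buffered change, whichever comes first. The channel-based shape and names are illustrative, not SwarmKit's actual dispatcher code.

```go
package main

import "time"

const (
	maxBatchSize  = 100
	maxBatchDelay = 100 * time.Millisecond
)

// batchChanges reads individual changes from in and forwards them on out in
// batches: a batch is flushed when it reaches maxBatchSize or when
// maxBatchDelay has elapsed since its first change, whichever happens first.
func batchChanges(in <-chan interface{}, out chan<- []interface{}) {
	var (
		batch []interface{}
		timer <-chan time.Time // nil while no batch is pending
	)
	flush := func() {
		if len(batch) > 0 {
			out <- batch
			batch = nil
		}
		timer = nil
	}
	for {
		select {
		case change, ok := <-in:
			if !ok { // input closed: flush what is left and stop
				flush()
				close(out)
				return
			}
			if batch == nil {
				// Start the 100 ms clock when the first change arrives.
				timer = time.After(maxBatchDelay)
			}
			batch = append(batch, change)
			if len(batch) >= maxBatchSize {
				flush()
			}
		case <-timer:
			flush()
		}
	}
}
```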

Slide 9

Replication
Running multiple managers for high availability

Slide 10

Replication

Diagram: a Worker connected to one of three Managers (one Leader, two Followers).

• Workers can connect to any Manager
• Followers forward traffic to the Leader

Slide 11

Replication

Diagram: three Workers connected to three Managers (one Leader, two Followers).

• Followers multiplex all of their workers to the Leader over a single connection
• Backed by gRPC channels (HTTP/2 streams)
• Reduces the Leader's networking load by spreading the connections evenly

Example: on a cluster with 10,000 workers and 5 managers, each manager only has to handle about 2,000 connections, and each follower forwards its 2,000 workers to the leader over a single socket.
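The follower-to-leader multiplexing relies on the fact that a single gRPC client connection carries many concurrent RPCs as HTTP/2 streams. A rough Go sketch of the idea follows; the type and method names are mine, and the insecure dial is only to keep the sketch short (SwarmKit secures manager-to-manager traffic with mutual TLS).

```go
package main

import (
	"sync"

	"google.golang.org/grpc"
)

// followerProxy holds the single client connection a follower keeps to the
// current leader. gRPC multiplexes every forwarded RPC as an HTTP/2 stream
// over this one TCP connection, regardless of how many workers are attached.
type followerProxy struct {
	mu         sync.Mutex
	leaderConn *grpc.ClientConn
}

// connForLeader lazily dials the leader once and reuses that connection for
// every worker request the follower forwards.
func (p *followerProxy) connForLeader(addr string) (*grpc.ClientConn, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.leaderConn != nil {
		return p.leaderConn, nil
	}
	conn, err := grpc.Dial(addr, grpc.WithInsecure())
	if err != nil {
		return nil, err
	}
	p.leaderConn = conn
	return conn, nil
}
```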

Slide 12

Replication

Diagram: three Workers connected to three Managers (one Leader, two Followers).

• Upon Leader failure, a new one is elected
• All managers start redirecting worker traffic to the new Leader
• Transparent to workers

Slide 13

Replication

Diagram: the same cluster after a Leader election; a different Manager is now the Leader.

• Upon Leader failure, a new one is elected
• All managers start redirecting worker traffic to the new Leader
• Transparent to workers

Slide 14

Replication

Diagram: a Worker connected to one of three Managers (Leader plus two Followers); the Worker receives the address list: Manager 1 Addr, Manager 2 Addr, Manager 3 Addr.

• Managers send the list of all managers' addresses to Workers
• When a new manager joins, all workers are notified
• Upon manager failure, workers reconnect to a different manager

Slide 15

Replication

Diagram: the same cluster of three Managers (Leader plus two Followers) and a Worker.

• Managers send the list of all managers' addresses to Workers
• When a new manager joins, all workers are notified
• Upon manager failure, workers reconnect to a different manager

Slide 16

Replication

Diagram: the Worker's manager has failed, so the Worker reconnects to a random manager from its list (see the sketch below).

• Managers send the list of all managers' addresses to Workers
• When a new manager joins, all workers are notified
• Upon manager failure, workers reconnect to a different manager
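A small Go sketch of the worker-side bookkeeping these slides imply: the worker keeps the advertised manager addresses up to date and, when its current manager fails, picks a new one at random so connections stay spread out. The type and method names are illustrative only.

```go
package main

import (
	"math/rand"
	"sync"
)

// managerList is the worker's view of the cluster: every manager address
// the managers have advertised so far.
type managerList struct {
	mu    sync.Mutex
	addrs []string
}

// update replaces the advertised manager addresses (sent by the manager
// alongside regular traffic, so newly joined managers become visible).
func (m *managerList) update(addrs []string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.addrs = addrs
}

// pick returns a random manager to (re)connect to after a failure, which
// keeps worker connections evenly spread across the surviving managers.
func (m *managerList) pick() string {
	m.mu.Lock()
	defer m.mu.Unlock()
	if len(m.addrs) == 0 {
		return ""
	}
	return m.addrs[rand.Intn(len(m.addrs))]
}
```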

Slide 17

Replication

• gRPC handles connection management
− Exponential backoff, reconnection jitter, …
− Avoids flooding managers on failover
− Connections evenly spread across Managers
• Manager Weights
− Allow Manager prioritization / de-prioritization
− Gracefully remove a Manager from rotation
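Manager weights can be layered on top of the random pick shown earlier: choosing proportionally to weight lets operators prioritize or de-prioritize a manager, and setting a weight to zero takes it out of rotation gracefully, since no new workers will select it. A hedged Go sketch of weighted selection, not SwarmKit's actual code:

```go
package main

import "math/rand"

// weightedManager pairs a manager address with a selection weight.
// A weight of zero effectively removes the manager from rotation:
// no new worker connections will land on it.
type weightedManager struct {
	addr   string
	weight int
}

// pickWeighted selects a manager address at random, with probability
// proportional to its weight.
func pickWeighted(managers []weightedManager) string {
	total := 0
	for _, m := range managers {
		total += m.weight
	}
	if total <= 0 {
		return ""
	}
	n := rand.Intn(total)
	for _, m := range managers {
		if n < m.weight {
			return m.addr
		}
		n -= m.weight
	}
	return "" // unreachable while total > 0
}
```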

Slide 18

Presence
Scalable presence in a distributed environment

Slide 19

Presence

• Leader commits Worker state (Up vs Down) into Raft
− Propagates to all managers
− Recoverable in case of leader re-election
• Heartbeat TTLs are kept in the Leader's memory
− Too expensive to store the "last ping time" in Raft: every ping would result in a quorum write
− Leader keeps a worker <-> TTL mapping in a timer heap (time.AfterFunc)
− Upon leader failover, workers are given a grace period to reconnect
• Workers are considered Unknown until they reconnect
• If they reconnect in time, they move back to Up
• If they don't, they move to Down
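A condensed Go sketch of the in-memory TTL tracking described above: each worker gets a timer (via time.AfterFunc) that is re-armed on every ping, and only the resulting Up/Down transitions would be committed to Raft. The structure and names are illustrative, not SwarmKit's actual heartbeat code.

```go
package main

import (
	"sync"
	"time"
)

// presence tracks worker liveness in the leader's memory only: every ping
// re-arms the worker's timer, and a timer that fires marks the worker Down.
type presence struct {
	mu     sync.Mutex
	ttl    time.Duration
	timers map[string]*time.Timer
	onDown func(workerID string) // called when a worker's TTL expires
}

func newPresence(ttl time.Duration, onDown func(string)) *presence {
	return &presence{ttl: ttl, timers: make(map[string]*time.Timer), onDown: onDown}
}

// heartbeat records a ping from a worker, extending its TTL.
func (p *presence) heartbeat(workerID string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if t, ok := p.timers[workerID]; ok {
		t.Reset(p.ttl) // worker is still up: push its expiration out
		return
	}
	// First ping (or first ping after a failover grace period): arm a timer.
	p.timers[workerID] = time.AfterFunc(p.ttl, func() {
		p.mu.Lock()
		delete(p.timers, workerID)
		p.mu.Unlock()
		p.onDown(workerID)
	})
}
```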

Slide 20

Andrea Luzzardi al@docker.com / @aluzzardi