Heart of the SwarmKit: Topology Management

Presented at the Docker Distributed Systems Summit, Berlin

Deep dive into the SwarmKit Topology Management internals.

Andrea Luzzardi

October 07, 2016

Transcript

  1. Heart of the SwarmKit: Topology Management
    Docker Distributed Systems Summit
    10.07.2016
    Andrea Luzzardi / [email protected] / @aluzzardi
    Docker Inc.

  2. Push vs Pull Model

  3. Push vs Pull
    [Diagram]
    Push: Manager, Worker, ZooKeeper
    1 - Register
    2 - Discover
    3 - Payload
    Pull: Manager, Worker
    Registration & Payload

  4. Push vs Pull
    Push
    • Pros: Provides better control over communication rate
    − Managers decide when to contact Workers
    • Cons: Requires a discovery mechanism
    − More failure scenarios
    − Harder to troubleshoot
    Pull
    • Pros: Simpler to operate
    − Workers connect to Managers and don’t need to bind
    − Can easily traverse networks
    − Easier to secure
    − Fewer moving parts
    • Cons: Workers must maintain a connection to Managers at all times

  5. Push vs Pull
    • SwarmKit adopted the Pull model
    • Favored operational simplicity
    • Engineered solutions to provide rate control in pull mode

  6. Rate Control
    Controlling communication rate in a Pull model

  7. Rate Control: Heartbeats
    • Manager dictates heartbeat rate to Workers
    • Rate is configurable
    • Managers agree on same rate by consensus (Raft)
    • Managers add jitter so pings are spread over time (avoid bursts)
    [Diagram] Worker: "Ping?" Manager: "Pong! Ping me back in 5.2 seconds"

  8. Rate Control: Workloads
    • Worker opens a gRPC stream to receive workloads
    • Manager can send data whenever it wants to
    • Manager will send data in batches
    • Changes are buffered and sent in batches of 100 or every 100 ms, whichever occurs first
    • Adds little delay (at most 100 ms) but drastically reduces amount of communication
    [Diagram] Worker: "Give me work to do" Manager:
    100ms - [Batch of 12 ]
    200ms - [Batch of 26 ]
    300ms - [Batch of 32 ]
    340ms - [Batch of 100]
    360ms - [Batch of 100]
    460ms - [Batch of 42 ]
    560ms - [Batch of 23 ]

  9. Replication
    Running multiple managers for high availability

  10. Replication
    • Worker can connect to any Manager
    • Followers will forward traffic to the Leader
    [Diagram] Three Managers (Leader, Follower, Follower) and a Worker

  11. Replication
    • Followers multiplex all workers to the Leader using a single connection
    • Backed by gRPC channels (HTTP/2 streams)
    • Reduces Leader networking load by spreading the connections evenly
    [Diagram] Three Managers (Leader, Follower, Follower) and three Workers
    Example: On a cluster with 10,000 workers and 5 managers, each manager will only have to handle about 2,000 connections. Each follower will forward its 2,000 workers to the leader using a single socket.
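
The arithmetic in the example can be checked with a few lines of Go. The helper `leaderSockets` is a hypothetical name, and the sketch assumes workers are spread perfectly evenly across managers.

```go
package main

import "fmt"

// leaderSockets estimates how many sockets the leader holds when workers are
// spread evenly over all managers and each follower forwards its workers
// over a single multiplexed connection.
func leaderSockets(workers, managers int) (direct, forwarded int) {
	direct = workers / managers // workers attached to the leader itself
	forwarded = managers - 1    // one socket per follower
	return direct, forwarded
}

func main() {
	direct, forwarded := leaderSockets(10000, 5)
	// prints "leader: 2000 worker connections + 4 follower connections"
	fmt.Printf("leader: %d worker connections + %d follower connections\n", direct, forwarded)
}
```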

  12. Replication
    • Upon Leader failure, a new one is elected
    • All managers start redirecting worker traffic to the new one
    • Transparent to workers
    [Diagram] Three Managers (Leader, Follower, Follower) and three Workers

  13. Replication
    [Diagram] Same cluster after the election: the Managers are now Follower, Follower, Leader

  14. Replication
    • Manager sends list of all managers’ addresses to Workers
    • When a new manager joins, all workers are notified
    • Upon manager failure, workers will reconnect to a different manager
    [Diagram] Managers 1, 2 and 3 (one Leader, two Followers) and a Worker holding the list:
    - Manager 1 Addr
    - Manager 2 Addr
    - Manager 3 Addr

  16. Replication
    [Diagram] Same cluster, with the label "Reconnect to random manager"

  17. Replication
    • gRPC handles connection management
    − Exponential backoff, reconnection jitter, …
    − Avoids flooding managers on failover
    − Connections evenly spread across Managers
    • Manager Weights
    − Allows Manager prioritization / de-prioritization
    − Gracefully remove Manager from rotation

  18. Presence
    Scalable presence in a distributed environment

  19. Presence
    • Leader commits Worker state (Up vs Down) into Raft
    − Propagates to all managers
    − Recoverable in case of leader re-election
    • Heartbeat TTLs kept in Leader memory
    − Too expensive to store “last ping time” in Raft
      • Every ping would result in a quorum write
    − Leader keeps worker<->TTL in a heap (time.AfterFunc)
    − Upon leader failover, workers are given a grace period to reconnect
      • Workers considered Unknown until they reconnect
      • If they do, they move back to Up
      • If they don’t, they move to Down

  20. Andrea Luzzardi
    [email protected] / @aluzzardi
