
Orchestrating Linux Containers While Tolerating Failures

Drew Erny
October 06, 2016

A high-level overview of the concepts of failure tolerance when orchestrating containers with Docker's Swarmkit

Transcript

  1. About Me
    Drew Erny. B.S. from University of Alabama, May 2016. Working @ Docker since May on Swarmkit. @dperny on literally every social network ever.
  2. Idea of this Talk
    Introduce, at a high level, the concepts that Docker uses to tolerate failures.
  3. Contents of this talk
    • What is failure?
    • How do we narrow this problem?
    • How can an orchestrator manage failures?
    • How does orchestration work?
  4. [Diagram: services (Ruby on Rectangles, MangoDS, TriangleJS, IDK Maybe Redis) spread across individual hardware boxes in two availability zones.]
  5. Clustering
    [Diagram: the same services (Ruby on Rectangles, MangoDS, TriangleDB, IDK Maybe Redis) grouped into a single cluster.]
  6. Clustering
    Lets us treat many discrete units as one big virtual computer. Gives us a layer of abstraction that can handle failures for us.
  7. Desired State Reconciliation
    Declare what you WANT your application state to be and let the cluster do the heavy lifting. If a failure occurs, the cluster will compensate.
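In code, reconciliation reduces to a control loop: compare observed state with declared state and schedule corrective work. A minimal sketch in Go, with invented types (this is not Swarmkit's actual API):

```go
package main

import "fmt"

// Service declares desired state: which image to run and how many
// replicas of it should exist. Illustrative only, not Swarmkit's API.
type Service struct {
	Name     string
	Image    string
	Replicas int
}

// reconcile compares observed running tasks against the desired count
// and returns how many new tasks to schedule (positive) or how many
// to shut down (negative).
func reconcile(svc Service, running int) int {
	return svc.Replicas - running
}

func main() {
	svc := Service{Name: "rectangles", Image: "ruby-on-rectangles", Replicas: 2}

	// One instance has crashed, so only one is observed running.
	if delta := reconcile(svc, 1); delta > 0 {
		fmt.Printf("schedule %d new task(s) for %s\n", delta, svc.Name)
	}
}
```

Running this prints `schedule 1 new task(s) for rectangles`, which mirrors the walkthrough on the next slides.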
  8. What does this look like?
    Desired state: 2 instances of Ruby on Rectangles, 2 instances of MangoDS. [Diagram: three nodes, each running a Docker Engine, hosting all four containers.]
  9. What does this look like?
    Now Ruby on Rectangles crashes. [Diagram: one Ruby on Rectangles container fails.]
  10. What does this look like?
    Only one instance of Ruby on Rectangles is left running. [Diagram: the failed container is gone.]
  11. What does this look like?
    And a new one is spawned. [Diagram: a replacement Ruby on Rectangles container is scheduled.]
  12. What does this look like?
    The replacement is running; the desired state holds again. [Diagram: two instances of each service.]
  13. What does this look like?
    Now a node failure occurs. [Diagram: a node hosting one Ruby on Rectangles and one MangoDS container goes down.]
  14. What does this look like?
    Two containers down. [Diagram: one instance of each service remains.]
  15. What does this look like?
    Schedule 2 new ones. [Diagram: replacement containers are placed on the surviving nodes.]
  16. What does this look like?
    Problem solved! [Diagram: both services are back at two instances each.]
  17. What does this look like?
    Node comes back up. Nothing changes! We already have the desired state. [Diagram: the recovered node rejoins empty.]
  18. Some Vocabulary
    Node - an individual unit of available computing resources. One node is generally one Docker Engine.
    Task - an individual atomic scheduling unit, belonging to a service. One task is generally one container.
    Service - an individual unit of desired state. Defines what application to run and how many replicas.
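These three concepts map naturally onto simple data types. A hypothetical sketch in Go (all field names here are illustrative, not Swarmkit's real definitions):

```go
package swarm // hypothetical package name for this sketch

// Node is an individual unit of available computing resources;
// one node is generally one Docker Engine.
type Node struct {
	ID       string
	Hostname string
}

// Service is an individual unit of desired state: which application
// to run and how many replicas of it.
type Service struct {
	ID       string
	Image    string
	Replicas int
}

// Task is an individual atomic scheduling unit belonging to a service;
// one task is generally one container, placed on one node.
type Task struct {
	ID        string
	ServiceID string
	NodeID    string
	State     string // e.g. "running" or "failed"
}
```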
  19. [Image-only slide.]
  20. Managers
    Make the decisions about what, where, and how a service will run. Watch the workers for failures and adjust accordingly.
  21. [Image-only slide.]
  22. The Raft Algorithm
    • One manager is elected Leader; all others are Followers.
    • The Leader is the ultimate endpoint for all requests.
    • The Leader informs all Followers about log changes, and waits for acknowledgement from a quorum (more than half) of Followers before committing.
    • Followers proxy requests to the Leader (all managers are valid endpoints).
    • If the Leader dies or goes missing from the quorum, a new Leader is elected.
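The commit rule at the heart of this is just a majority count. A toy Go sketch of only that quorum check (a real Raft implementation also tracks terms, log indices, and elections):

```go
package main

import "fmt"

// quorum is the smallest number of managers that is strictly more
// than half of the total.
func quorum(total int) int {
	return total/2 + 1
}

// canCommit reports whether a log entry acknowledged by acks managers
// (the leader counts itself) has reached a quorum and may be committed.
func canCommit(acks, total int) bool {
	return acks >= quorum(total)
}

func main() {
	const managers = 5
	for acks := 1; acks <= managers; acks++ {
		fmt.Printf("%d of %d acks: commit=%v\n", acks, managers, canCommit(acks, managers))
	}
}
```

With 5 managers this prints commit=false for 1 and 2 acks, and commit=true from 3 acks onward.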
  23. The Raft Algorithm
    We can guarantee the correctness of state replication with Raft. Raft lets us proceed as long as more than half of the manager nodes are available.
  24. How many managers?

    Managers | Quorum (strictly greater than half) | Failures Tolerated
    ---------|-------------------------------------|-------------------
           1 | 1 (> 0.5)                           | 0
           2 | 2 (> 1)                             | 0
           3 | 2 (> 1.5)                           | 1
           4 | 3 (> 2)                             | 1
           5 | 3 (> 2.5)                           | 2
           6 | 4 (> 3)                             | 2
           7 | 4 (> 3.5)                           | 3
           n | FLOOR(n/2) + 1                      | CEILING(n/2) - 1
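The whole table follows from two formulas: quorum = FLOOR(n/2) + 1, and failures tolerated = n - quorum = CEILING(n/2) - 1. A quick arithmetic check in Go (not Swarmkit code):

```go
package main

import "fmt"

func main() {
	fmt.Println("Managers  Quorum  Failures tolerated")
	for n := 1; n <= 7; n++ {
		q := n/2 + 1 // smallest strict majority: FLOOR(n/2) + 1
		f := n - q   // how many managers can fail with a quorum left
		fmt.Printf("%8d  %6d  %18d\n", n, q, f)
	}
}
```

Because CEILING(n/2) - 1 is the same for n = 2k and n = 2k - 1, adding a manager to reach an even total buys no extra fault tolerance, which is why odd manager counts are recommended.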
  25. Workers
    • Connect to the Docker Engine
    • Actually spawn the containers
    • Report back container status
    • Do not participate in the decision-making process
    • Route requests internally among themselves
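A worker's job can be pictured as a simple loop: receive task assignments from the managers, start containers through the local Engine, and report status back. A hypothetical sketch in Go (the Engine interface and all types here are invented for illustration):

```go
package worker // hypothetical package; types invented for this sketch

// Engine abstracts the local Docker Engine's container API.
type Engine interface {
	StartContainer(image string) (id string, err error)
}

// Assignment is a task a manager has placed on this worker.
type Assignment struct {
	TaskID string
	Image  string
}

// Status is what the worker reports back to the managers.
type Status struct {
	TaskID string
	State  string // "running" or "failed"
}

// Run starts each assigned task on the local Engine and reports the
// resulting status upstream; it makes no scheduling decisions itself.
func Run(engine Engine, assignments <-chan Assignment, report chan<- Status) {
	for a := range assignments {
		state := "running"
		if _, err := engine.StartContainer(a.Image); err != nil {
			state = "failed"
		}
		report <- Status{TaskID: a.TaskID, State: state}
	}
}
```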
  26. Networking
    • Tasks need to find each other
    • Tasks need to know when other tasks fail
    • External requests need to be routed correctly
  27. Gossip Protocol
    Workers share information about which nodes are running which services. Every node must maintain a record of the services on every other node. This happens outside of Raft, so it is eventually consistent, not guaranteed consistent.
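The essential gossip mechanic is that each node periodically merges a peer's view of "who runs what" into its own, so records spread without a central coordinator. A minimal sketch in Go, assuming a made-up View type (real gossip protocols also handle versioning, deletions, and failure detection):

```go
package main

import "fmt"

// View maps a node ID to the set of services running on that node.
type View map[string]map[string]bool

// merge folds a peer's view into ours. Repeated pairwise exchanges
// spread every node's records to every other node, but only
// eventually; readers may see stale data in the meantime.
func merge(mine, peers View) {
	for node, services := range peers {
		if mine[node] == nil {
			mine[node] = map[string]bool{}
		}
		for svc := range services {
			mine[node][svc] = true
		}
	}
}

func main() {
	a := View{"node-1": {"rectangles": true}}
	b := View{"node-2": {"mangods": true}}

	merge(a, b) // node-1 learns which services node-2 runs
	fmt.Println(a)
}
```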
  28. Recap
    • Failures can occur at many different levels.
    • If we focus on container failures, we narrow the problem.
    • Orchestration does the heavy lifting for us.
    • Swarmkit uses lots of tricks to make orchestration itself failure-tolerant.