Building Highly Available Mesos Frameworks (MesosCon 2016)

Production-quality Mesos frameworks must be able to continue managing tasks despite unreliable networks and faulty computers. Mesos provides tools to help developers implement fault-tolerant task management, but putting those tools together effectively remains something of a black art. This talk offers practical guidance to current and prospective framework developers: how Mesos deals with failures, and how to use the tools it provides to build fault-tolerant frameworks. Mesos operators will also benefit from a discussion of exactly how Mesos behaves during network partitions and other failure scenarios.

Neil Conway

June 01, 2016

Transcript

  1. Building Highly Available Mesos Frameworks
     MesosCon 2016 - Neil Conway

  2. 1. How does Mesos behave when failures occur?
     2. Mesos provides features to enable highly available frameworks.
        How can you use those features to write frameworks that tolerate
        failures?

  3. High Availability
     Continuing to provide a service despite failures in one or more
     components.

  4. Components

  5. Failure Model
     1. Network failures
        ● Dropped messages
        ● Network partitions
     2. Process failures
        ● e.g., segfault
     3. System failures
        ● e.g., power loss
        ● Host might not restart with its persistent storage intact

  6. Messaging Semantics
     Messages between Mesos components are ordered but unreliable
     ("at most once").
     ● Messages might be dropped
     ● Messages won't be delivered out-of-order

  7. Outline
     1. Tolerating Master Failures
     2. Tolerating Scheduler Failures
     3. Tolerating Partitioned Agents

  8. Tolerating Master Failures

  9. Highly Available Mesos Masters
     1. Run multiple instances of the master
        ● Only one instance is active at any time (the "leading master")
        ● Other instances are backups
     2. Detect when the current leader has failed
     3. Elect a new leader
     4. Restore consistency of cluster state

  10. Leader Election
      A classical problem in distributed computing. Requirements:
      1. At most one node decides it is the leader
      2. Everyone agrees on who the leader is
      Solution: Apache ZooKeeper

  11. Revised Cluster Architecture

  12. ZooKeeper Recommendations
      1. Typically, run one ZooKeeper instance on each Mesos master node.
      2. Run the mesos-master process under a process supervisor, because
         the leading master aborts ("fail-fast") if it detects that it is
         no longer the leader.

  13. Restoring Consistency of Cluster State
      Key Design Decision: the Mesos masters share very little replicated
      state. Most cluster state (e.g., active tasks and frameworks) is
      stored only in memory at the leading master.
      Consequences:
      1. Easy to scale the master to large clusters.
      2. The source of truth for task state is the agent, not the master.
      3. After failover, the new leading master knows almost nothing about
         the state of the cluster.

  14. Recovering Master State After Failover
      On leadership change:
      1. Agents reregister with the new leading master
         ● "I am agent X, running tasks A, B, and C."
      2. Frameworks reregister with the new leading master
         ● "I am framework Y."
      Consequence: there is a brief period after failover before the state
      of the new master has quiesced.
      Reregistration is handled automatically by the SchedulerDriver, which
      invokes the disconnected() and reregistered() callbacks (see the
      sketch below).
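
      A minimal sketch of these callbacks using the Mesos Java bindings
      (the class name is illustrative; kicking off reconciliation from
      reregistered() is one common pattern, not the only one):

        import java.util.Collections;
        import org.apache.mesos.Protos.MasterInfo;
        import org.apache.mesos.Protos.TaskStatus;
        import org.apache.mesos.Scheduler;
        import org.apache.mesos.SchedulerDriver;

        // Only the failover-related callbacks are shown.
        public abstract class FailoverAwareScheduler implements Scheduler {
          @Override
          public void disconnected(SchedulerDriver driver) {
            // Lost the connection to the leading master (e.g., it crashed).
            // The driver detects the new leader via ZooKeeper and
            // reregisters on our behalf; offers stop arriving until then.
            System.out.println("Disconnected; waiting for a new leader");
          }

          @Override
          public void reregistered(SchedulerDriver driver, MasterInfo master) {
            // Registered with the new leading master. Its state may not
            // have quiesced yet, so start task state reconciliation
            // (an empty collection means "implicit": all of our tasks).
            driver.reconcileTasks(Collections.<TaskStatus>emptyList());
          }
        }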

  15. Consistency of Framework State
      Problem: the framework tries to launch a task, and then the master
      fails. How can the framework tell whether the new master knows about
      the task launch?
      Variant: the framework tries to launch a task but doesn't receive any
      status updates. Was the "launch task" message dropped, or is the task
      merely slow to launch?
      Solution: task state reconciliation. Ask the master for the current
      state of the task.

  16. Example of Master Failover
      Participants: Scheduler, Master X, Master Y, Agent Z.
      ● Scheduler ➝ Master X: launch task 1 on Agent Z. Master X fails, and
        Master Y is elected as the new leading master.
      ● The scheduler driver sees disconnected(), reregisters Framework F
        with Master Y, and then sees reregistered().
      ● Scheduler ➝ Master Y: reconcile task 1 on Agent Z.
        Master Y ➝ Scheduler: status update: task 1 is in state TASK_LOST.
      ● Scheduler ➝ Master Y: launch task 2 on Agent Z. Then: reconcile
        task 2 on Agent Z. No response! (Agent Z has not yet reregistered.)
      ● Agent Z reregisters with Master Y.
      ● Scheduler ➝ Master Y: reconcile task 2 on Agent Z.
        Master Y ➝ Scheduler: status update: task 2 is in state TASK_LOST.
      ● Scheduler ➝ Master Y: launch task 3 on Agent Z. Master Y ➝ Agent Z:
        launch task 3. Task 3 starts, and a status update flows back
        through Master Y: task 3 is in state TASK_RUNNING.
      NB: This relies on the message-ordering guarantee: a reconcile
      request cannot overtake the launch request it follows.
      Tip: Avoid reusing task IDs.

  17. Task State Reconciliation
      Ask the master for current task state (see the sketch below):
      a. Specific tasks ("explicit")
      b. All tasks known to the master ("implicit")
      Responses (if any) are returned via statusUpdate() with
      REASON_RECONCILIATION set.
      Notable behavior:
      ● Unknown task ➝ TASK_LOST
      ● Master has not quiesced after failover ➝ no response
        (i.e., the master failed over, the agent has not reregistered, and
        agent_reregister_timeout has not yet passed)
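
      A sketch of both modes using the Java driver API (the task and agent
      IDs are illustrative):

        import java.util.Arrays;
        import java.util.Collections;
        import org.apache.mesos.Protos.SlaveID;
        import org.apache.mesos.Protos.TaskID;
        import org.apache.mesos.Protos.TaskState;
        import org.apache.mesos.Protos.TaskStatus;
        import org.apache.mesos.SchedulerDriver;

        public final class Reconciler {
          // Explicit: ask about specific tasks. We send our last known
          // state; the master answers with its own view via statusUpdate(),
          // with REASON_RECONCILIATION set.
          static void reconcileExplicitly(SchedulerDriver driver) {
            TaskStatus status = TaskStatus.newBuilder()
                .setTaskId(TaskID.newBuilder().setValue("task-456"))
                .setSlaveId(SlaveID.newBuilder().setValue("agent-Y"))
                .setState(TaskState.TASK_RUNNING)
                .build();
            driver.reconcileTasks(Arrays.asList(status));
          }

          // Implicit: an empty collection asks the master for the latest
          // state of every task it knows about for this framework.
          static void reconcileImplicitly(SchedulerDriver driver) {
            driver.reconcileTasks(Collections.<TaskStatus>emptyList());
          }
        }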

  18. Reconciliation Recommendations
      ● Reconcile all tasks periodically (e.g., every 15 minutes)
        ○ Required to detect lost updates
        ○ Also helps to catch bugs
      ● Wait and retry if no information is returned
        ○ With exponential backoff (see the sketch below)
      ● Optimization: reconcile more promptly when you suspect missing
        information
        ○ E.g., a task launch with no subsequent status update
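
      A minimal sketch of periodic reconciliation plus retry with
      exponential backoff (PeriodicReconciler and the pending-list
      bookkeeping are illustrative; a real framework would remove tasks
      from the list as REASON_RECONCILIATION updates arrive in
      statusUpdate()):

        import java.util.Collections;
        import java.util.List;
        import java.util.concurrent.Executors;
        import java.util.concurrent.ScheduledExecutorService;
        import java.util.concurrent.TimeUnit;
        import org.apache.mesos.Protos.TaskStatus;
        import org.apache.mesos.SchedulerDriver;

        public final class PeriodicReconciler {
          private final ScheduledExecutorService timer =
              Executors.newSingleThreadScheduledExecutor();

          // Reconcile all tasks implicitly every 15 minutes.
          public void start(final SchedulerDriver driver) {
            timer.scheduleAtFixedRate(new Runnable() {
              @Override public void run() {
                driver.reconcileTasks(Collections.<TaskStatus>emptyList());
              }
            }, 0, 15, TimeUnit.MINUTES);
          }

          // Retry an explicit reconciliation until every task in `pending`
          // has been answered, doubling the delay up to a 15-minute cap.
          public void retryExplicit(final SchedulerDriver driver,
                                    final List<TaskStatus> pending,
                                    final long delaySecs) {
            timer.schedule(new Runnable() {
              @Override public void run() {
                if (pending.isEmpty()) {
                  return;  // all tasks answered via statusUpdate()
                }
                driver.reconcileTasks(pending);
                retryExplicit(driver, pending, Math.min(delaySecs * 2, 900));
              }
            }, delaySecs, TimeUnit.SECONDS);
          }
        }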

  19. Tolerating Scheduler Failures

  20. Goals
      Continue scheduling tasks despite:
      1. Crashes of the scheduler process
      2. Failures of machines that are running a scheduler
      3. Network partitions between a scheduler and the master

  21. Highly Available Mesos Schedulers
      1. Run multiple instances of your scheduler
         ● The "leader" and "backup" pattern is typical
      2. Detect when the current leader has failed
      3. Elect a new leader
         ● Upon election, register with the Mesos master
      4. Restore consistency of framework state after failover

  22. Scheduler Leader Election
      ● Similar to leader election among the Mesos masters
      ● Schedulers often use the same solution: ZooKeeper (see the sketch
        below)
      ● etcd, Consul, etc. would also work
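
      For example, a minimal sketch of scheduler leader election using the
      Apache Curator LeaderLatch recipe on top of ZooKeeper (the connection
      string and znode path are illustrative):

        import org.apache.curator.framework.CuratorFramework;
        import org.apache.curator.framework.CuratorFrameworkFactory;
        import org.apache.curator.framework.recipes.leader.LeaderLatch;
        import org.apache.curator.retry.ExponentialBackoffRetry;

        public final class SchedulerElection {
          public static void main(String[] args) throws Exception {
            CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",
                new ExponentialBackoffRetry(1000, 3));
            zk.start();

            LeaderLatch latch = new LeaderLatch(zk, "/my-framework/leader");
            latch.start();
            latch.await();  // blocks until this instance is elected leader

            // We are now the leading scheduler: register with the Mesos
            // master (start the MesosSchedulerDriver) and begin scheduling.
          }
        }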

  23. Scheduler Reregistration
      After a new leading scheduler has been elected, it should register
      with the Mesos master. How do we associate the new scheduler instance
      with the previous scheduler's tasks?
      ● The scheduler must use the same FrameworkID on registration
      ● This requires persistent state!
      ● Any other scheduler with the same ID will be disconnected
      Gotcha: what happens to a scheduler's tasks when it disconnects from
      the master?
      ● They live for failover_timeout
      ● Then they are killed
      ● The default failover_timeout is 0!
      Recommendation: use a generous failover_timeout (e.g., 1 week) in
      production (see the sketch below).
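
      A sketch of registration that reuses a persisted FrameworkID and sets
      a generous failover_timeout (loadSavedFrameworkId() is a hypothetical
      helper that reads the ID from the framework's persistent store):

        import org.apache.mesos.MesosSchedulerDriver;
        import org.apache.mesos.Protos.FrameworkID;
        import org.apache.mesos.Protos.FrameworkInfo;
        import org.apache.mesos.Scheduler;

        public final class Registration {
          static MesosSchedulerDriver buildDriver(Scheduler scheduler,
                                                  String master) {
            FrameworkInfo.Builder framework = FrameworkInfo.newBuilder()
                .setUser("")  // empty: let Mesos fill in the current user
                .setName("my-framework")
                .setFailoverTimeout(7 * 24 * 3600);  // one week, in seconds
            String savedId = loadSavedFrameworkId();  // hypothetical helper
            if (savedId != null) {
              // Reuse the previous leader's ID so the master associates us
              // with its running tasks.
              framework.setId(FrameworkID.newBuilder().setValue(savedId));
            }
            return new MesosSchedulerDriver(scheduler, framework.build(),
                                            master);
          }

          static String loadSavedFrameworkId() {
            return null;  // placeholder: fetch from ZooKeeper in practice
          }
        }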

  24. Consistency of Framework State
      Problem: when a new leading scheduler is elected, how do we ensure
      the new leader has a consistent view of cluster state?
      Solutions:
      1. Use a strongly consistent distributed DB (e.g., ZooKeeper)
      2. Probe the current state of the cluster

  25. Durable Framework State ("Write-Ahead Logging")
      On framework state change:
      1. Write to the distributed DB
         ● Ensure the write is durable!
         ● e.g., wait for quorum commit
      2. Take an action that depends on the DB state (see the sketch below)
      On failover, the new leading scheduler will know about a superset of
      the possible in-progress actions. It can then use reconciliation to
      determine the status of those actions.
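
      A sketch of the write-ahead pattern for task launches, using Apache
      Curator for the ZooKeeper write (the znode path and serialization are
      illustrative; a ZooKeeper create() is quorum-committed before it
      returns):

        import java.util.Arrays;
        import org.apache.curator.framework.CuratorFramework;
        import org.apache.mesos.Protos.OfferID;
        import org.apache.mesos.Protos.TaskInfo;
        import org.apache.mesos.SchedulerDriver;

        public final class WriteAheadLauncher {
          private final CuratorFramework zk;

          WriteAheadLauncher(CuratorFramework zk) { this.zk = zk; }

          void launch(SchedulerDriver driver, OfferID offer, TaskInfo task)
              throws Exception {
            // 1. Durably record the intent. If we crash after this point,
            //    the next leader sees the task in the store and reconciles
            //    it with the master.
            zk.create().creatingParentsIfNeeded().forPath(
                "/my-framework/tasks/" + task.getTaskId().getValue(),
                task.toByteArray());
            // 2. Only then take the action that depends on the DB state.
            driver.launchTasks(Arrays.asList(offer), Arrays.asList(task));
          }
        }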

  26. Example of Framework Failover
      Participants: Master, Leader Elector (ZooKeeper), State Store
      (ZooKeeper), Scheduler Instances A and B, Framework Client.
      ● Instances A and B both try to become the leader; Scheduler A is
        elected leader.
      ● A ➝ Master: register framework. Master ➝ A: registered with
        framework ID 123. A ➝ State Store: store framework ID 123.
      ● Client ➝ A: create job X. A ➝ State Store: store job X.
        State Store ➝ A: committed. A ➝ Client: job X created.
      ● Master ➝ A: resource offer. A ➝ State Store: new task 456 on
        agent Y for job X. State Store ➝ A: committed. A ➝ Master: launch
        task 456 on agent Y.
      ● Scheduler A fails; Scheduler B is elected leader.
      ● B ➝ State Store: fetch current state: jobs, tasks, framework ID.
      ● B ➝ Master: register framework with ID 123. Master ➝ B: registered.
      ● B ➝ Master: reconcile task 456 on agent Y.

  27. Tolerating Partitioned Agents

  28. Agents can be partitioned away from the leading master. When this
      happens:
      1. What does the master do?
      2. What should frameworks do?

  29. Master Behavior for Partitioned Agents
      1. The master pings agents to determine whether they are reachable
         ("health checks")
         ● Default: 5 pings, 15 secs each = 75 secs
         ● Tradeoff: speed of failure detection vs. cost of recovery.
           This is a policy decision!
      2. Unreachable agents are marked for removal from the cluster
         ● TASK_LOST is sent for all tasks on the agent
         ● The slaveLost() callback is invoked
      3. If the agent reconnects, the master will shut it down
         ● All of its tasks will be killed
         ● If the agent is restarted, it will register with a new agent ID

  30. Exception: Master Failover
      Problem: what if the master fails over after an agent has been marked
      for removal? The new master won't know that the agent has failed
      health checks, so the agent will be allowed to reregister.
      Solution:
      ● Mark the agent for removal in a replicated DB
        ○ The "Mesos replicated log"
      ● After failover, the new master loads the set of registered agents
        from the log
      ● This behavior is not enabled by default ("non-strict registry")
        ○ So LOST ➝ RUNNING is possible

  31. Handling Partitioned Tasks
      ● Launch a replacement copy of the task
      ● TASK_LOST does not mean the task is dead
        ○ Consider ZooKeeper if you need at most one instance of the task
      ● By default, LOST tasks can be resurrected ("non-strict registry")
        ○ Typical response: kill one of the copies (see the sketch below)
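
      One possible shape for that policy inside a scheduler's
      statusUpdate() handler (relaunch() and isDuplicate() are hypothetical
      hooks into the framework's own task bookkeeping):

        import org.apache.mesos.Protos.TaskID;
        import org.apache.mesos.Protos.TaskStatus;
        import org.apache.mesos.SchedulerDriver;

        public abstract class PartitionPolicy {
          abstract void relaunch(TaskID taskId);
          abstract boolean isDuplicate(TaskID taskId);

          void handleUpdate(SchedulerDriver driver, TaskStatus status) {
            switch (status.getState()) {
              case TASK_LOST:
                // TASK_LOST does not mean dead: the task may still be
                // running on a partitioned agent. If you need at most one
                // instance, gate the relaunch on ZooKeeper first.
                relaunch(status.getTaskId());
                break;
              case TASK_RUNNING:
                if (isDuplicate(status.getTaskId())) {
                  // A "lost" task was resurrected after we launched its
                  // replacement (non-strict registry): kill one copy.
                  driver.killTask(status.getTaskId());
                }
                break;
              default:
                break;
            }
          }
        }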

  32. Upcoming Changes
      1. Allow frameworks to control the behavior of partitioned tasks
         (MESOS-4049)
      2. Clarify the semantics of TASK_LOST (MESOS-5345)
         ● Allow frameworks to determine when a task is definitely no
           longer running
      3. Enable the strict registry by default (MESOS-1315)

  33. Conclusions
      1. Primary-backup is a simple approach to HA
      2. The Mesos master is HA but stores little state
      3. Mesos provides primitives to enable HA frameworks, but leaves the
         details up to you
      4. Current policy: partitioned tasks are eventually killed
         ● Soon: frameworks will be able to control this!

  34. Thank You!