Building Highly Available Mesos Frameworks, 2.0

Production-quality Mesos frameworks must be able to continue managing tasks despite unreliable networks and faulty computers. Mesos provides tools to help developers build fault-tolerant task management, but putting these tools together effectively remains something of a black art. This talk will offer practical guidance to framework developers to help them understand how Mesos deals with failures and the tools it provides to enable fault-tolerant frameworks. The talk will also cover new Mesos features that allow framework developers to control how partitioned tasks should be handled. Mesos operators will also benefit from a discussion of exactly how Mesos behaves during network partitions and other failure scenarios.

Neil Conway

August 31, 2016

Transcript

  1. 1.

    Building Highly Available Mesos Frameworks, 2.0

    Neil Conway, MesosCon EU 2016. © 2016 Mesosphere, Inc. All Rights Reserved.
  2. 2.

    1. How does Mesos behave when failures occur?
    2. Mesos provides features to enable highly available frameworks. How can you use those features to write frameworks that tolerate failures?
  3. 3.

    High Availability ≈ Continuing to provide a service despite failures in one or more components
  4. 5.

    Failure Model

    1. Network failures
       • Dropped messages
       • Network partitions
    2. Process failures
       • e.g., segfault
    3. System failures
       • e.g., power loss
       • Host might not restart with persistent storage
  5. 6.

    Messaging Semantics

    Messages between Mesos components are ordered but unreliable (“at most once”).
    • Messages might be dropped
    • Messages won’t be delivered out-of-order
  6. 7.

    Outline

    1. Tolerating Master Failures
    2. Tolerating Scheduler Failures
    3. Tolerating Partitioned Agents
  7. 9.

    Highly Available Mesos Masters: “Primary-Backup”

    1. Run multiple instances of the master
       • Only one instance is active at any time (“leading master”)
       • Other instances are backups
    2. Detect when current leader has failed
    3. Elect a new leader
    4. Restore consistency of cluster state
  8. 10.

    Leader Election

    Classical problem in distributed computing.

    Requirements:
    1. At most one node decides it is the leader
    2. Everyone agrees on who the leader is

    Solution: Apache ZooKeeper
  9. 12.

    ZooKeeper Recommendations

    1. Typically, run one ZooKeeper instance on each Mesos master node.
    2. Run the mesos-master process using a process supervisor
       • Leading master aborts (“fail-fast”) if it detects it is no longer the leader
  10. 13.

    Mesos Master State

    Key Design Decision: The Mesos masters share very little replicated state. Most cluster state (e.g., active tasks and frameworks) is only stored in memory at the leading master.

    Consequences:
    1. Easy to scale the master to large clusters.
    2. Source-of-truth for task state is the agent, not the master.
    3. After failover, new leading master knows almost nothing about the state of the cluster.
  11. 14.

    Recovering Master State After Failover

    On leadership change:
    1. Agents reregister with new leading master
       • “I am agent X, running tasks A, B, and C.”
    2. Frameworks reregister with new leading master
       • “I am framework Y.”
       • Handled by the SchedulerDriver automatically. Callbacks: disconnected(), reregistered()

    Consequence: Brief period after failover before state of new master has quiesced.
  12. 15.

    Consistency of Framework State

    Problem: Framework tries to launch task; master fails. How can framework tell if the new master knows about the task launch?

    Variant: Framework tries to launch task but doesn’t receive any status updates. Was the “launch task” message dropped or is the task merely slow to launch?

    Task State Reconciliation: Ask the master for the current state of the task.
  13. 16.

    Example of Master Failover

    [Sequence diagram. Participants: Scheduler, Master X, Master Y, Agent Z]

    1. Launch task 1 on Agent Z.
    2. Reconcile: task 1 on Agent Z.
    3. Status update: task 1 is in state TASK_LOST.
    4. Launch task 2 on Agent Z.
    5. Master Y elected as new leading master.
    6. disconnected()
    7. reregistered() [Reregister Framework F]
    8. Reconcile: task 2 on Agent Z. No response!
    9. Reregister Agent Z.
    10. Reconcile: task 2 on Agent Z.
    11. Status update: task 2 is in state TASK_LOST.
    12. Launch task 3 on Agent Z. Master Y forwards “Launch task 3” to Agent Z, which starts Task 3.
    13. Status update: task 3 is in state TASK_RUNNING (sent by Agent Z, then forwarded to the Scheduler).

    NB: This relies on message ordering guarantee.
    Tip: Avoid using duplicate task IDs.
  14. 17.

    Task State Reconciliation

    Ask master for current task state
    a. Specific task (“explicit”)
    b. All tasks known to master (“implicit”)

    Responses (if any) returned via statusUpdate() with REASON_RECONCILIATION set.

    Notable Behavior:
    • Unknown task ➝ TASK_LOST
    • Master has not quiesced after failover ➝ no response
      (Master failed over, agent has not reregistered, and agent_reregister_timeout has not yet passed.)
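The two notable behaviors above can be illustrated with a toy model. This is only a sketch: `ToyMaster`, its `quiesced` flag, and its `reconcile` method are hypothetical names invented for illustration, not the Mesos API, and a real master decides per agent rather than with a single flag.

```python
from typing import Dict, List, Optional

TASK_LOST = "TASK_LOST"


class ToyMaster:
    """Hypothetical in-memory stand-in for a Mesos master (not the real API)."""

    def __init__(self, tasks: Dict[str, str], quiesced: bool = True) -> None:
        self.tasks = tasks        # task ID -> last known state
        self.quiesced = quiesced  # False while agents may still reregister

    def reconcile(self, task_ids: Optional[List[str]] = None) -> Dict[str, str]:
        """Return the status updates a scheduler would receive.

        None or an empty list models implicit reconciliation (all known tasks);
        a non-empty list models explicit reconciliation.
        """
        if not task_ids:
            return dict(self.tasks)
        updates: Dict[str, str] = {}
        for task_id in task_ids:
            if task_id in self.tasks:
                updates[task_id] = self.tasks[task_id]
            elif self.quiesced:
                # Unknown task and nothing left to wait for: report it lost.
                updates[task_id] = TASK_LOST
            # Otherwise stay silent: the agent may still reregister.
        return updates
```

A scheduler using this model must treat an empty result as "ask again later", not as "the task is gone".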
  15. 18.

    Reconciliation Recommendations

    • Reconcile all tasks periodically (e.g., 15 mins)
       ◦ Required to detect lost updates
       ◦ Also helps to detect bugs
    • Wait-and-retry if no information returned
       ◦ Use exponential backoff
    • Reconcile more promptly if you suspect missing information
       ◦ E.g., task launch but no subsequent status update
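The wait-and-retry recommendation might look roughly like the following sketch. The helper names (`backoff_delays`, `reconcile_with_retry`) and the `ask` and `sleep` callables are assumptions made for illustration; `ask` stands in for whatever issues an explicit reconciliation request and returns the updates received so far.

```python
from typing import Callable, Dict, List, Optional


def backoff_delays(base: float, factor: float, cap: float, attempts: int) -> List[float]:
    """Exponentially growing delays in seconds, capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def reconcile_with_retry(
    ask: Callable[[List[str]], Dict[str, str]],
    task_id: str,
    delays: List[float],
    sleep: Callable[[float], None],
) -> Optional[str]:
    """Ask for one task's state, waiting longer after each empty response.

    Returns the task state, or None if every attempt got no answer.
    """
    for delay in delays:
        updates = ask([task_id])
        if task_id in updates:
            return updates[task_id]
        sleep(delay)  # no information yet: wait, then retry
    return None
```

Passing `sleep` as a parameter keeps the sketch testable; a real scheduler would more likely schedule a timer than block a thread.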
  16. 20.

    Goals

    Continue scheduling tasks, despite:
    1. Crashes of the scheduler process
    2. Failures of machines that are running a scheduler
    3. Network partitions between a scheduler and the master
  17. 21.

    Highly Available Mesos Schedulers

    1. Run multiple instances of your scheduler
       • “Leader” and “backup” pattern is typical
    2. Detect when current leader has failed
    3. Elect new leader
       • Upon election, register with Mesos master
    4. Restore consistency of framework state after failover
  18. 22.

    Scheduler Leader Election

    • Similar to leader election among Mesos masters
    • Schedulers often use the same solution: ZooKeeper
    • etcd, Consul, etc. would also work
  19. 23.

    Scheduler Reregistration

    After new leading scheduler has been elected, it should reregister with the Mesos master.

    How do we associate the new scheduler instance with the previous scheduler’s tasks?
    • Scheduler must use same FrameworkID on registration
    • This requires distributed state!
    • Any other scheduler with the same ID will be disconnected

    Gotcha: What happens to a scheduler’s tasks when it disconnects from the master?
    • They live for failover_timeout
    • Then they are killed
    • Default failover_timeout is 0!

    Recommendation: Use a generous failover_timeout (e.g., 1 week) in production.
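As a rough illustration of these two points, the sketch below builds the registration data as a plain dictionary. The keys mirror the `failover_timeout` and `id` fields discussed above; `build_framework_info` and `stored_framework_id` are hypothetical names, and a real framework would populate the FrameworkInfo message of its Mesos bindings instead.

```python
from datetime import timedelta
from typing import Any, Dict, Optional

# One week, expressed in seconds.
ONE_WEEK_SECS = timedelta(weeks=1).total_seconds()


def build_framework_info(name: str, stored_framework_id: Optional[str]) -> Dict[str, Any]:
    """Build registration data for a new leading scheduler (illustrative only).

    `stored_framework_id` is whatever ID a previous leader saved in the
    shared state store, or None on the very first registration.
    """
    info: Dict[str, Any] = {"name": name, "failover_timeout": ONE_WEEK_SECS}
    if stored_framework_id is not None:
        # Reusing the ID is what ties the new instance to the existing tasks.
        info["id"] = {"value": stored_framework_id}
    return info
```

On first registration the ID is absent and the master assigns one, which the scheduler then has to store durably for its successors.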
  20. 24.

    Consistency of Framework State

    Problem: When a new leading scheduler is elected, how do we ensure the new leader has a consistent view of cluster state?

    Solutions:
    1. Use strongly consistent distributed DB (e.g., Zk)
    2. Probe cluster to find current state
  21. 25.

    Durable Framework State

    On Framework State Change (“Write-Ahead Logging”):
    1. Write change to distributed DB
       • Ensure the write is durable!
       • e.g., wait for quorum commit
    2. Take an action that depends on updated state

    On failover, the new leading scheduler will know about a superset of the possible in-progress actions. Use reconciliation to determine status of in-progress operations.
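The write-then-act ordering can be sketched as follows. `DurableStore` is a hypothetical in-memory stand-in for a quorum-committed store such as ZooKeeper, and `launch` stands in for whatever sends the launch request; neither is a real client API.

```python
from typing import Callable, Dict, List


class DurableStore:
    """Hypothetical stand-in for a durable, quorum-committed state store."""

    def __init__(self) -> None:
        self.tasks: Dict[str, str] = {}  # task ID -> agent ID

    def commit(self, task_id: str, agent_id: str) -> None:
        # A real store would return only after the write is durable.
        self.tasks[task_id] = agent_id


def launch_with_wal(
    store: DurableStore,
    launch: Callable[[str, str], None],
    task_id: str,
    agent_id: str,
) -> None:
    store.commit(task_id, agent_id)  # 1. record the intent durably
    launch(task_id, agent_id)        # 2. only then act on it


def tasks_to_reconcile(store: DurableStore) -> List[str]:
    """Everything a new leader must ask about: a superset of launched tasks."""
    return sorted(store.tasks)
```

If the scheduler crashes between the two steps, the store lists a task that was never launched, which is why the new leader reconciles every recorded task rather than trusting the store.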
  22. 26.

    Example of Framework Failover

    [Sequence diagram. Participants: Master, Leader Elector (Zk), State Store (Zk), Scheduler Instance A, Scheduler Instance B, Framework Client]

    1. Try to become the leader.
    2. Scheduler A is elected leader.
    3. Register framework.
    4. Registered with framework ID 123.
    5. Store framework ID 123.
    6. Create job X.
    7. Store job X. Committed.
    8. Job X created.
    9. Resource offer.
    10. New task 456 on agent Y for job X. Committed.
    11. Launch task 456 on agent Y.
    12. Scheduler B is elected leader.
    13. Fetch current state: jobs, tasks, framework ID.
    14. Register framework with ID 123. Registered.
    15. Reconcile task 456 on agent Y.
  23. 27.

    Alternative: Use Marathon

    • Don’t have “backup” scheduler instances running all the time
       ◦ Instead, use Marathon to launch scheduler instances, detect when they have failed, and launch replacements
    • Similar to the approach described before
       ◦ Still achieve HA via redundancy
       ◦ Still at most one leading scheduler at any given time
    • Tradeoffs:
       ◦ Simpler to write
       ◦ Depend on Marathon behavior for determining when to launch replacement schedulers, how to handle network partitions, etc.
    • Recommended: still use Zk for leader election and to ensure mutual exclusion on updates to scheduler state
  24. 29.

    Agents can be partitioned away from the leading master. When this happens,
    1. What does the master do?
    2. What should frameworks do?
  25. 30.

    Master Behavior for Partitioned Agents

    1. Master pings agents to determine if they are reachable (“health checks”)
       • Default: 5 pings, 15 secs each = 75 secs
    2. Frameworks are told that master has lost contact with agent
       • TASK_LOST sent for all tasks on agent
       • slaveLost() callback invoked

    Tradeoff: speed of failure detection vs. cost of recovery

    What happens if/when the agent reconnects to the cluster?
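The default detection time is simply the product of the two settings, which makes the tradeoff easy to explore. In this small sketch the function and parameter names are illustrative assumptions, not Mesos flag names.

```python
def agent_unreachable_after(ping_timeout_secs: float = 15.0, max_ping_timeouts: int = 5) -> float:
    """Seconds of silence before the master declares an agent unreachable."""
    return ping_timeout_secs * max_ping_timeouts


# Defaults from the slide: 5 pings of 15 seconds each.
default_detection = agent_unreachable_after()

# A more aggressive setting detects failures sooner, at the cost of
# declaring agents lost during short network blips.
aggressive_detection = agent_unreachable_after(ping_timeout_secs=5.0, max_ping_timeouts=3)
```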
  26. 31.

    Partitioned Agents in Mesos 1.0

    Design Goal: When a partitioned agent reconnects to the cluster, it is shut down by the master.
    • All tasks on the agent are terminated

    Problems:
    • Partitioned agents will not be shut down if master fails over while agent is partitioned (bug!)
       ◦ “LOST” tasks might go back to RUNNING!
    • Single policy for handling partitioned tasks
    • Frameworks might want custom policies -- e.g., wait longer for stateful tasks
  27. 32.

    Partitioned Agents in Mesos 1.1

    Design Goal: Allow frameworks to decide how partitioned tasks should be handled.

    Implementation:
    • Opt-in via new framework capability: PARTITION_AWARE
    • Partitioned agents will be allowed to reregister
    • Handling partitioned tasks is up to the framework
    • Time when task was partitioned: unreachable_time in TaskStatus
  28. 33.

    Handling Partitioned Tasks

    • When a task is partitioned (TASK_LOST):
       ◦ Decide if/when you want to replace it
          • After timeout, launch replacement
    • TASK_LOST tasks may come back
       ◦ Typically: kill one of the copies
       ◦ Rate-limit killing of tasks!
    • TASK_LOST does not mean task is dead
       ◦ Consider Zk if you need “≤ 1 instance”

    Tradeoff: cost of replacement vs. cost of unavailability
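One possible shape for such a policy is sketched below, under stated assumptions: times are plain seconds (the real unreachable_time would need converting), and `should_replace`, `KillRateLimiter`, and their parameters are hypothetical names, not part of Mesos.

```python
from collections import deque
from typing import Deque


def should_replace(unreachable_time: float, now: float, grace_secs: float) -> bool:
    """Launch a replacement only after the task has been partitioned long enough."""
    return now - unreachable_time >= grace_secs


class KillRateLimiter:
    """Allow at most `max_kills` kills within any sliding window of `window_secs`."""

    def __init__(self, max_kills: int, window_secs: float) -> None:
        self.max_kills = max_kills
        self.window_secs = window_secs
        self._kills: Deque[float] = deque()  # timestamps of recent kills

    def allow(self, now: float) -> bool:
        # Forget kills that have aged out of the window.
        while self._kills and now - self._kills[0] >= self.window_secs:
            self._kills.popleft()
        if len(self._kills) < self.max_kills:
            self._kills.append(now)
            return True
        return False
```

A longer `grace_secs` suits stateful tasks that are costly to replace; the rate limiter keeps a large healed partition from triggering a burst of kills for duplicate copies.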
  29. 34.

    Other Changes in Mesos 1.1

    Also enabled for PARTITION_AWARE frameworks:
    1. Operators/scripts can tell Mesos that a partitioned agent is permanently gone
    2. Frameworks can learn when a partitioned task is gone “forever”
    3. Fine-grained task states, replacing TASK_LOST:
       • TASK_UNREACHABLE
       • TASK_DROPPED
       • TASK_GONE
       • TASK_GONE_BY_OPERATOR
  30. 35.

    Conclusions

    1. Primary-backup is a simple approach to HA
    2. Mesos master is HA but stores little state
    3. Mesos provides features to enable HA frameworks, but leaves the details up to you
    4. Mesos 1.1 will allow frameworks to define how to handle partitioned tasks