Slide 1

Building Highly Available Mesos Frameworks
MesosCon 2016 - Neil Conway
© 2016 Mesosphere, Inc. All Rights Reserved.

Slide 2

1. How does Mesos behave when failures occur?
2. Mesos provides features to enable highly available frameworks. How can you use those features to write frameworks that tolerate failures?

Slide 3

High Availability ≈ Continuing to provide a service despite failures in one or more components

Slide 4

Components

Slide 5

Failure Model
1. Network failures
   ● Dropped messages
   ● Network partitions
2. Process failures
   ● e.g., segfault
3. System failures
   ● e.g., power loss
   ● Host might not restart with persistent storage

Slide 6

Messaging Semantics
Messages between Mesos components are ordered but unreliable (“at most once”).
● Messages might be dropped
● Messages won’t be delivered out-of-order

Slide 7

Outline
1. Tolerating Master Failures
2. Tolerating Scheduler Failures
3. Tolerating Partitioned Agents

Slide 8

Tolerating Master Failures

Slide 9

Highly Available Mesos Masters
1. Run multiple instances of the master
   ● Only one instance is active at any time (the “leading master”)
   ● Other instances are backups
2. Detect when the current leader has failed
3. Elect a new leader
4. Restore consistency of cluster state

Slide 10

Leader Election
Classical problem in distributed computing. Requirements:
1. At most one node decides it is the leader
2. Everyone agrees on who the leader is
Solution: Apache ZooKeeper

Slide 11

Revised Cluster Architecture

Slide 12

ZooKeeper Recommendations
1. Typically, run one ZooKeeper instance on each Mesos master node.
2. Run the mesos-master process using a process supervisor
   ● The leading master aborts (“fail-fast”) if it detects it is no longer the leader

Slide 13

Restoring Consistency of Cluster State
Key Design Decision: the Mesos masters share very little replicated state. Most cluster state (e.g., active tasks and frameworks) is stored only in memory at the leading master.
Consequences:
1. Easy to scale the master to large clusters.
2. The source of truth for task state is the agent, not the master.
3. After failover, the new leading master knows almost nothing about the state of the cluster.

Slide 14

Recovering Master State After Failover
On leadership change:
1. Agents reregister with the new leading master
   ● “I am agent X, running tasks A, B, and C.”
2. Frameworks reregister with the new leading master
   ● “I am framework Y.”
Consequence: there is a brief period after failover before the state of the new master has quiesced.
Note: framework reregistration is handled by the SchedulerDriver automatically. Callbacks: disconnected(), reregistered(). A sketch of these callbacks follows below.
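As a hedged illustration of how these callbacks surface to a framework, here is a minimal sketch using the classic Python scheduler bindings (mesos.interface); MyScheduler and the logging setup are illustrative, not part of the talk.

import logging

from mesos.interface import Scheduler

log = logging.getLogger("ha-framework")


class MyScheduler(Scheduler):
    def registered(self, driver, frameworkId, masterInfo):
        # First registration with a leading master.
        log.info("Registered with framework ID %s", frameworkId.value)

    def disconnected(self, driver):
        # Lost the connection to the leading master (crash, partition, or a
        # new leader was elected). The driver keeps looking for the new leader.
        log.warning("Disconnected from the leading master")

    def reregistered(self, driver, masterInfo):
        # Connected to the new leading master. Its in-memory state is still
        # quiescing, so trigger task reconciliation (covered on later slides).
        log.info("Reregistered with master %s", masterInfo.hostname)
        driver.reconcileTasks([])  # implicit reconciliation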

Slide 15

Consistency of Framework State
Problem: the framework tries to launch a task and the master fails. How can the framework tell whether the new master knows about the task launch?
Variant: the framework tries to launch a task but doesn’t receive any status updates. Was the “launch task” message dropped, or is the task merely slow to launch?
Solution: task state reconciliation. Ask the master for the current state of the task.

Slide 16

Example of Master Failover
(Sequence diagram: Scheduler, Master X, Master Y, Agent Z.)
● The scheduler asks Master X to launch task 1 on Agent Z; after reconciling task 1, it receives a status update that task 1 is in state TASK_LOST.
● The scheduler asks Master X to launch task 2 on Agent Z. Master X fails and Master Y is elected as the new leading master; the scheduler sees disconnected(), reregisters Framework F, and then sees reregistered().
● The scheduler reconciles task 2 with Master Y: no response, since Agent Z has not yet reregistered.
● Agent Z reregisters with Master Y; reconciling task 2 again yields a status update that task 2 is in state TASK_LOST.
● The scheduler asks Master Y to launch task 3 on Agent Z; the agent launches task 3 and a status update reports task 3 in state TASK_RUNNING.
NB: This relies on the message ordering guarantee.
Tip: Avoid using duplicate task IDs.

Slide 17

Task State Reconciliation
Ask the master for current task state:
a. Specific tasks (“explicit”)
b. All tasks known to the master (“implicit”)
Responses (if any) are returned via statusUpdate() with REASON_RECONCILIATION set.
Notable behavior:
● Unknown task ➝ TASK_LOST
● Master has not quiesced after failover ➝ no response (the master failed over, the agent has not reregistered, and agent_reregister_timeout has not yet passed)
A sketch of explicit and implicit reconciliation calls follows below.
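For concreteness, a minimal sketch of both styles using the classic Python bindings (mesos.interface / mesos_pb2); the task IDs are assumed to come from the framework's own bookkeeping.

from mesos.interface import mesos_pb2


def reconcile_explicit(driver, task_ids):
    # Explicit reconciliation: ask about specific tasks. Answers arrive (if at
    # all) via statusUpdate() with REASON_RECONCILIATION set.
    statuses = []
    for task_id in task_ids:
        status = mesos_pb2.TaskStatus()
        status.task_id.value = task_id
        # The protobuf requires a state, but the master only looks at the
        # task ID (and optional agent ID) in a reconciliation request.
        status.state = mesos_pb2.TASK_LOST
        statuses.append(status)
    driver.reconcileTasks(statuses)


def reconcile_implicit(driver):
    # Implicit reconciliation: an empty list means "tell me about every task
    # the master currently knows for this framework".
    driver.reconcileTasks([])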

Slide 18

Reconciliation Recommendations
● Reconcile all tasks periodically (e.g., every 15 mins)
   ○ Required to detect lost updates
   ○ Also helps to catch bugs
● Wait and retry if no information is returned
   ○ With exponential backoff
● Optimization: reconcile more promptly when you suspect missing information
   ○ e.g., a task launch with no subsequent status update
A sketch of a periodic reconciliation loop with backoff follows below.
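A minimal sketch of such a loop (illustrative, not from the talk; `driver` is a running scheduler driver and `pending` is an assumed set of task IDs still awaiting a status update):

import random
import time

RECONCILE_INTERVAL = 15 * 60  # full reconciliation every 15 minutes
INITIAL_BACKOFF = 5           # seconds
MAX_BACKOFF = 5 * 60


def reconciliation_loop(driver, pending):
    backoff = INITIAL_BACKOFF
    while True:
        # Periodically reconcile everything we think we own (implicit).
        driver.reconcileTasks([])

        if pending:
            # Some tasks still have no authoritative status: retry sooner,
            # backing off exponentially and adding jitter so that retries
            # from many schedulers don't synchronize.
            sleep_for = min(backoff, MAX_BACKOFF) * random.uniform(0.5, 1.5)
            backoff = min(backoff * 2, MAX_BACKOFF)
        else:
            backoff = INITIAL_BACKOFF
            sleep_for = RECONCILE_INTERVAL

        time.sleep(sleep_for)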

Slide 19

Tolerating Scheduler Failures

Slide 20

Goals
Continue scheduling tasks, despite:
1. Crashes of the scheduler process
2. Failures of machines that are running a scheduler
3. Network partitions between a scheduler and the master

Slide 21

Highly Available Mesos Schedulers
1. Run multiple instances of your scheduler
   ● A “leader” and “backup” pattern is typical
2. Detect when the current leader has failed
3. Elect a new leader
   ● Upon election, register with the Mesos master
4. Restore consistency of framework state after failover

Slide 22

Scheduler Leader Election
● Similar to leader election among the Mesos masters
● Schedulers often use the same solution: ZooKeeper
● etcd, Consul, etc. would also work
A sketch of ZooKeeper-based leader election for a scheduler follows below.
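One possible implementation, sketched with the kazoo ZooKeeper client; the library choice, the ZooKeeper addresses, the election path, and run_scheduler() are all assumptions, not something the talk prescribes.

import socket

from kazoo.client import KazooClient


def run_scheduler():
    # Placeholder: start the MesosSchedulerDriver and block until this
    # instance loses leadership or shuts down.
    pass


def main():
    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()

    # Every scheduler instance contends for leadership; only the winner runs
    # the real scheduler, everyone else blocks as a hot standby.
    election = zk.Election("/my-framework/leader", identifier=socket.gethostname())
    election.run(run_scheduler)  # blocks until elected, then calls run_scheduler()


if __name__ == "__main__":
    main()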

Slide 23

Scheduler Reregistration
After a new leading scheduler has been elected, it should register with the Mesos master. How do we associate the new scheduler instance with the previous scheduler’s tasks?
● The scheduler must use the same FrameworkID on registration
● This requires persistent state!
● Any other scheduler with the same ID will be disconnected
Gotcha: What happens to a scheduler’s tasks when it disconnects from the master?
● They live for failover_timeout
● Then they are killed
● The default failover_timeout is 0!
Recommendation: Use a generous failover_timeout (e.g., 1 week) in production. A sketch of registering with a saved FrameworkID and failover_timeout follows below.
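A minimal sketch of registering with a saved FrameworkID and a generous failover_timeout, using the classic Python bindings (mesos.interface / mesos.native); the scheduler object and the saved ID are assumed to come from your own code and persistent store.

from mesos.interface import mesos_pb2
from mesos.native import MesosSchedulerDriver

ONE_WEEK = 7 * 24 * 60 * 60.0


def start_driver(scheduler, saved_framework_id, master="zk://zk1:2181/mesos"):
    framework = mesos_pb2.FrameworkInfo()
    framework.user = ""                    # let Mesos fill in the current user
    framework.name = "my-ha-framework"
    framework.failover_timeout = ONE_WEEK  # keep tasks alive while we fail over

    if saved_framework_id is not None:
        # Reuse the previous FrameworkID so the master associates this
        # scheduler instance with the existing tasks. Any other scheduler
        # registering with the same ID will be disconnected.
        framework.id.value = saved_framework_id

    driver = MesosSchedulerDriver(scheduler, framework, master)
    driver.start()
    return driver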

Slide 24

Consistency of Framework State
Problem: when a new leading scheduler is elected, how do we ensure the new leader has a consistent view of cluster state?
Solutions:
1. Use a strongly consistent distributed DB (e.g., Zk)
2. Probe the current state of the cluster

Slide 25

Durable Framework State
On framework state change (“write-ahead logging”):
1. Write to the distributed DB
   ● Ensure the write is durable!
   ● e.g., wait for quorum commit
2. Take an action that depends on the DB state
On failover, the new leading scheduler will know about a superset of the possible in-progress actions. It can then use reconciliation to determine the status of those actions. A sketch of the write-then-act pattern follows below.
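A minimal sketch of the write-then-act pattern, here using ZooKeeper (via kazoo) as the distributed DB; the znode layout, the command, and the resource sizes are illustrative assumptions.

import json

from mesos.interface import mesos_pb2


def launch_task(zk, driver, offer, task_id, job_id):
    intent = json.dumps({
        "task_id": task_id,
        "job_id": job_id,
        "agent_id": offer.slave_id.value,
    }).encode("utf-8")

    # Step 1: durably record the intent. kazoo's create() returns only after
    # the ZooKeeper ensemble has committed the write (quorum commit).
    zk.create("/my-framework/tasks/%s" % task_id, intent, makepath=True)

    # Step 2: take the action that depends on that state. If we crash between
    # steps 1 and 2, the next leader sees the intent in ZooKeeper and uses
    # reconciliation to find out whether the task was actually launched.
    driver.launchTasks(offer.id, [build_task_info(offer, task_id)])


def build_task_info(offer, task_id):
    # Illustrative TaskInfo: run a shell command with minimal resources.
    task = mesos_pb2.TaskInfo()
    task.task_id.value = task_id
    task.slave_id.value = offer.slave_id.value
    task.name = "task-%s" % task_id
    task.command.value = "sleep 60"

    cpus = task.resources.add()
    cpus.name = "cpus"
    cpus.type = mesos_pb2.Value.SCALAR
    cpus.scalar.value = 0.1

    mem = task.resources.add()
    mem.name = "mem"
    mem.type = mesos_pb2.Value.SCALAR
    mem.scalar.value = 32
    return task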

Slide 26

Example of Framework Failover
(Sequence diagram: Master, Leader Elector (Zk), State Store (Zk), Scheduler Instance A, Scheduler Instance B, Framework Client.)
● Instances A and B both try to become the leader; Scheduler A is elected leader.
● A registers the framework with the master, is registered with framework ID 123, and stores framework ID 123 in the state store.
● The framework client asks to create job X; A stores job X in the state store (committed) and replies “Job X created”.
● On a resource offer, A stores “new task 456 on agent Y for job X” (committed), then launches task 456 on agent Y.
● Scheduler A fails and Scheduler B is elected leader. B fetches the current state (jobs, tasks, framework ID) from the state store, registers the framework with ID 123, and reconciles task 456 on agent Y.

Slide 27

Tolerating Partitioned Agents

Slide 28

Agents can be partitioned away from the leading master. When this happens:
1. What does the master do?
2. What should frameworks do?

Slide 29

Master Behavior for Partitioned Agents
1. The master pings agents to determine whether they are reachable (“health checks”)
   ● Default: 5 pings, 15 secs each = 75 secs
   ● Tradeoff: speed of failure detection vs. cost of recovery (a policy decision!)
2. Unreachable agents are marked for removal from the cluster
   ● TASK_LOST is sent for all tasks on the agent
   ● slaveLost() callback
3. If the agent reconnects, the master will shut it down
   ● All tasks will be killed
   ● If the agent is restarted, it will register with a new Agent ID
A sketch of observing this from a scheduler follows below.
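As a hedged illustration of how a scheduler observes agent removal, a sketch of the slaveLost() callback with the classic Python bindings; tasks_by_agent is an assumed in-memory index kept by the framework.

from mesos.interface import Scheduler


class MyScheduler(Scheduler):
    def __init__(self):
        # Task IDs we launched, indexed by the agent they run on (assumed bookkeeping).
        self.tasks_by_agent = {}

    def slaveLost(self, driver, slaveId):
        # The master declared this agent unreachable after failed health checks.
        # A TASK_LOST status update will also arrive for each task on it; here
        # we just record which of our tasks are affected.
        affected = self.tasks_by_agent.pop(slaveId.value, [])
        print("Agent %s removed by the master; affected tasks: %s"
              % (slaveId.value, affected))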

Slide 30

Exception: Master Failover
Problem: what if the master fails over after an agent has been marked for removal? The new master won’t know the agent has failed health checks, so it will be allowed to reregister.
Solution:
● Mark the agent for removal in a replicated DB (the “Mesos replicated log”)
● After failover, the new master loads registered agents from the log
● This behavior is not enabled by default (“non-strict registry”)
● LOST ➝ RUNNING is possible

Slide 31

Handling Partitioned Tasks
● Launch a replacement copy of the task
● TASK_LOST does not mean the task is dead
   ○ Consider ZooKeeper if you need “at most 1” instance of the task
● By default, LOST tasks can be resurrected (“non-strict registry”)
   ○ Typical response: kill one of the copies
A sketch of this handling follows below.
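A minimal sketch of that policy (illustrative; self.lost and self.relaunch() are assumptions standing in for your framework's own state and scheduling logic).

from mesos.interface import Scheduler, mesos_pb2


class MyScheduler(Scheduler):
    def statusUpdate(self, driver, status):
        task_id = status.task_id.value

        if status.state == mesos_pb2.TASK_LOST:
            # The task may still be running on a partitioned agent, but we
            # launch a replacement (under a new task ID) anyway.
            self.lost.add(task_id)
            self.relaunch(task_id)

        elif status.state == mesos_pb2.TASK_RUNNING and task_id in self.lost:
            # A task we already replaced has been resurrected after its agent
            # reregistered (possible with the non-strict registry): kill one copy.
            driver.killTask(status.task_id)
            self.lost.discard(task_id)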

Slide 32

Upcoming Changes
1. Allow frameworks to control the behavior of partitioned tasks (MESOS-4049)
2. Clarify the semantics of TASK_LOST (MESOS-5345)
   ● Allow frameworks to determine when a task is definitely no longer running
3. Strict registry enabled by default (MESOS-1315)

Slide 33

Conclusions
1. Primary-backup is a simple approach to HA
2. The Mesos master is HA but stores little state
3. Mesos provides primitives to enable HA frameworks, but leaves the details up to you
4. Current policy: partitioned tasks are eventually killed
   ● Soon: frameworks will be able to control this!

Slide 34

THANK YOU!