
Building Highly Available Mesos Frameworks, 2.0


Production-quality Mesos frameworks must be able to continue managing tasks despite unreliable networks and faulty computers. Mesos provides tools to help developers implement fault-tolerant task management, but putting these tools together effectively remains something of a black art. This talk will offer practical guidance to framework developers to help them understand how Mesos deals with failures and the tools it provides to enable fault-tolerant frameworks. The talk will also cover new Mesos features that allow framework developers to control how partitioned tasks should be handled. Mesos operators will also benefit from a discussion of exactly how Mesos behaves during network partitions and other failure scenarios.

Neil Conway

August 31, 2016

Transcript

  1.
    MesosCon EU 2016 - Neil Conway
    Building Highly
    Available Mesos
    Frameworks, 2.0

  2.
    1. How does Mesos behave when failures occur?
    2. Mesos provides features to enable highly
    available frameworks. How can you use those
    features to write frameworks that tolerate
    failures?

  3.
    High Availability

    Continuing to provide a service
    despite failures in one or more
    components

  4.
    Components

  5.
    1. Network failures
    ● Dropped messages
    ● Network partitions
    2. Process failures
    ● e.g., segfault
    3. System failures
    ● e.g., power loss
    ● Host might not restart with
    persistent storage
    Failure Model

  6.
    Messages between Mesos
    components are ordered but
    unreliable (“at most once”).
    ● Messages might be
    dropped
    ● Messages won’t be
    delivered out-of-order
    Messaging Semantics

  7.
    1. Tolerating
    Master Failures
    2. Tolerating
    Scheduler Failures
    3. Tolerating
    Partitioned Agents
    Outline

  8.
    Tolerating
    Master
    Failures

  9.
    1. Run multiple instances of the master
    ● Only one instance is active at any time
    (“leading master”)
    ● Other instances are backups
    2. Detect when current leader has failed
    3. Elect a new leader
    4. Restore consistency of cluster state
    Highly Available Mesos Masters: “Primary-Backup”

  10.
    Classical problem in
    distributed computing.
    Requirements:
    1. At most one node
    decides it is the leader
    2. Everyone agrees on who
    the leader is
    Leader Election
    Solution: Apache ZooKeeper

  11.
    Revised
    Cluster
    Architecture

  12.
    1. Typically, run one ZooKeeper instance
    on each Mesos master node.
    2. Run the mesos-master process using
    a process supervisor
    ● Leading master aborts (“fail-fast”) if
    it detects it is no longer the leader
    ZooKeeper Recommendations

  13.
    Key Design Decision:
    The Mesos masters share
    very little replicated state.
    Most cluster state (e.g., active
    tasks and frameworks) is only
    stored in memory at the
    leading master.
    Consequences:
    1. Easy to scale the master to large
    clusters.
    2. Source-of-truth for task state is
    the agent, not the master.
    3. After failover, new leading
    master knows almost nothing
    about the state of the cluster.
    Mesos Master State

  14.
    On leadership change:
    1. Agents reregister with new leading master
    ● “I am agent X, running tasks A, B, and C.”
    2. Frameworks reregister with new leading master
    ● “I am framework Y.”
    Recovering Master State After Failover
    Handled by the
    SchedulerDriver
    automatically.
    Callbacks:
    ● disconnected()
    ● reregistered()
    Consequence: Brief period after failover before
    state of new master has quiesced.
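
One way to see these callbacks in context: a minimal sketch of a scheduler that just logs master failover events, assuming the legacy mesos.interface Python bindings. The class name and logging are illustrative, not part of Mesos.

```python
# Sketch only: assumes the legacy Mesos Python bindings (mesos.interface).
import logging

from mesos.interface import Scheduler


class LoggingScheduler(Scheduler):
    """Hypothetical scheduler that only logs (re)registration events."""

    def registered(self, driver, framework_id, master_info):
        # First registration with a leading master; remember the framework ID
        # so a future scheduler instance can reregister with the same identity.
        logging.info("Registered with framework ID %s", framework_id.value)

    def disconnected(self, driver):
        # The leading master failed or became unreachable. The SchedulerDriver
        # watches ZooKeeper for the new leader and reregisters automatically.
        logging.warning("Disconnected from the leading master")

    def reregistered(self, driver, master_info):
        # Now talking to the newly elected leading master. Its in-memory state
        # may not have quiesced yet, so reconciliation can briefly return nothing.
        logging.info("Reregistered with new leading master %s", master_info.hostname)
```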

  15.
    Problem:
    Framework tries to launch task; master
    fails. How can framework tell if the new
    master knows about the task launch?
    Consistency of Framework State
    Variant:
    Framework tries to launch task but
    doesn’t receive any status updates.
    Was the “launch task” message
    dropped or is the task merely slow to
    launch?
    Task State Reconciliation
    Ask the master for the
    current state of the task.

  16.
    Example of Master Failover
    [Sequence diagram: Scheduler, Master X, Master Y, Agent Z. The scheduler
    launches tasks on Agent Z through Master X; Master Y is then elected as the
    new leading master, and the scheduler receives disconnected() and
    reregistered() callbacks as framework F reregisters. Reconciling a task
    launched around the failover gets no response until Agent Z reregisters
    with Master Y; after that, reconciliation reports the task as TASK_LOST,
    and the scheduler launches a replacement task, which reaches TASK_RUNNING.]
    NB: This relies on the message ordering guarantee.
    Tip: Avoid reusing task IDs.

  17.
    Ask master for current task state
    a. Specific task (“explicit”)
    b. All tasks known to master
    (“implicit”)
    Responses (if any) returned via
    statusUpdate() with
    REASON_RECONCILIATION set.
    Task State Reconciliation
    Notable Behavior:
    ● Unknown task ➝ TASK_LOST
    ● Master has not quiesced after
    failover ➝ no response
    Master failed over,
    agent has not reregistered, and
    agent_reregister_timeout
    has not yet passed.
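
As a concrete illustration of the two modes, a sketch using the legacy Python bindings; reconcileTasks() takes a list of TaskStatus messages (explicit) or an empty list (implicit), and answers come back through statusUpdate() with REASON_RECONCILIATION.

```python
# Sketch only: assumes the legacy Mesos Python bindings (mesos.interface).
from mesos.interface import mesos_pb2


def reconcile_explicit(driver, task_ids):
    """Ask the master about specific tasks by ID."""
    statuses = []
    for task_id in task_ids:
        status = mesos_pb2.TaskStatus()
        status.task_id.value = task_id
        # 'state' is required by the protobuf; for reconciliation the master
        # only looks at the task ID (and, optionally, the agent ID).
        status.state = mesos_pb2.TASK_STAGING
        statuses.append(status)
    driver.reconcileTasks(statuses)


def reconcile_implicit(driver):
    """An empty list asks about every task the master currently knows."""
    driver.reconcileTasks([])
```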

  18.
    ● Reconcile all tasks periodically (e.g., 15 mins)
    ○ Required to detect lost updates
    ○ Also helps to detect bugs
    ● Wait-and-retry if no information returned
    ○ Use exponential backoff
    ● Reconcile more promptly if you suspect missing
    information
    ○ E.g., task launch but no subsequent status update
    Reconciliation Recommendations
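
A sketch of how those recommendations might look in scheduler code. Only driver.reconcileTasks() is Mesos API; the timers, intervals, and bookkeeping are hypothetical.

```python
# Sketch only: the scheduling loop and its parameters are hypothetical.
import threading

RECONCILE_INTERVAL_SECS = 15 * 60  # reconcile everything every ~15 minutes


def start_periodic_reconciliation(driver):
    """Periodic implicit reconciliation, to detect lost updates (and bugs)."""
    def tick():
        driver.reconcileTasks([])
        threading.Timer(RECONCILE_INTERVAL_SECS, tick).start()
    tick()


def reconcile_with_backoff(driver, statuses, delay_secs=10.0, max_delay_secs=300.0):
    """Explicitly reconcile tasks we are unsure about (e.g., launched but no
    status update yet). If no answer arrives, retry with exponential backoff;
    the caller cancels the returned timer once REASON_RECONCILIATION updates
    are received."""
    driver.reconcileTasks(statuses)
    timer = threading.Timer(
        delay_secs, reconcile_with_backoff,
        args=(driver, statuses, min(delay_secs * 2, max_delay_secs), max_delay_secs))
    timer.daemon = True
    timer.start()
    return timer
```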

  19.
    Tolerating
    Scheduler
    Failures

  20.
    Continue scheduling tasks, despite:
    1. Crashes of the scheduler process
    2. Failures of machines that are
    running a scheduler
    3. Network partitions between a
    scheduler and the master
    Goals

  21.
    1. Run multiple instances of your scheduler
    ● “Leader” and “backup” pattern is typical
    2. Detect when current leader has failed
    3. Elect new leader
    ● Upon election, register with Mesos master
    4. Restore consistency of framework state after
    failover
    Highly Available Mesos Schedulers

  22.
    ● Similar to leader election
    among Mesos masters
    ● Schedulers often use the
    same solution: ZooKeeper
    ● etcd, Consul, etc. would
    also work
    Scheduler Leader Election
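
For example, a sketch of scheduler leader election with ZooKeeper using the kazoo client's Election recipe; the znode path and identifier are made up for the example.

```python
# Sketch only: leader election via kazoo's Election recipe.
import socket

from kazoo.client import KazooClient


def run_when_elected(zk_hosts, become_leader):
    """Block until this instance wins the election, then run become_leader().
    If become_leader() returns or the ZooKeeper session is lost, leadership is
    given up and another scheduler instance can take over."""
    zk = KazooClient(hosts=zk_hosts)
    zk.start()

    # All scheduler instances contend on the same (hypothetical) election znode.
    election = zk.Election("/my-framework/leader", identifier=socket.gethostname())
    election.run(become_leader)


# Usage (illustrative):
#   run_when_elected("zk1:2181,zk2:2181,zk3:2181", start_scheduler_driver)
```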

  23.
    After new leading scheduler has been elected,
    it should reregister with the Mesos master.
    How do we associate the new scheduler
    instance with the previous scheduler’s tasks?
    ● Scheduler must use same
    FrameworkID on registration
    ● This requires distributed state!
    ● Any other scheduler with the same ID
    will be disconnected
    Scheduler Reregistration
    Gotcha
    What happens to a scheduler’s tasks
    when it disconnects from the master?
    ● They live for failover_timeout
    ● Then they are killed
    ● Default failover_timeout is 0!
    Recommendation: Use a generous
    failover_timeout (e.g., 1 week) in
    production.
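
A sketch of how a newly elected scheduler instance might register with the previous instance's FrameworkID and a generous failover_timeout, assuming the legacy Python bindings; where the stored ID comes from (e.g., ZooKeeper) is left abstract.

```python
# Sketch only: assumes the legacy Python bindings (mesos.interface / mesos.native).
from mesos.interface import mesos_pb2
from mesos.native import MesosSchedulerDriver

ONE_WEEK_SECS = 7 * 24 * 60 * 60


def make_driver(scheduler, master, stored_framework_id):
    framework = mesos_pb2.FrameworkInfo()
    framework.user = ""               # let Mesos fill in the current user
    framework.name = "my-framework"

    # Keep our tasks alive for up to a week while no scheduler is registered;
    # the default failover_timeout of 0 would tear them down immediately.
    framework.failover_timeout = ONE_WEEK_SECS

    # Reuse the framework ID stored by the previous leading scheduler so the
    # master associates this instance with the existing tasks.
    if stored_framework_id is not None:
        framework.id.value = stored_framework_id

    return MesosSchedulerDriver(scheduler, framework, master)
```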

  24.
    Problem:
    When a new leading
    scheduler is elected, how
    do we ensure the new
    leader has a consistent
    view of cluster state?
    Consistency of Framework State
    Solutions:
    1. Use strongly consistent
    distributed DB (e.g., Zk)
    2. Probe cluster to find
    current state

  25.
    On Framework State Change:
    1. Write change to distributed DB
    ● Ensure the write is durable!
    ● e.g., wait for quorum commit
    2. Take an action that depends on
    updated state
    “Write-Ahead Logging”
    Durable Framework State
    On failover, the new leading
    scheduler will know about a
    superset of the possible
    in-progress actions.
    Use reconciliation to
    determine status of
    in-progress operations.
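
A sketch of the write-ahead pattern around a single task launch: the intent is committed to a replicated store (here ZooKeeper via kazoo, whose writes are acknowledged only after a quorum commit) before the action is taken. The znode layout and record format are hypothetical.

```python
# Sketch only: durably record "launching task X" before actually launching it.
import json


def launch_task_with_wal(zk, driver, offer, task):
    """zk is a started kazoo KazooClient; task is a mesos_pb2.TaskInfo."""
    record = json.dumps({
        "task_id": task.task_id.value,
        "agent_id": offer.slave_id.value,
        "state": "LAUNCHING",
    }).encode("utf-8")

    # 1. Write-ahead: commit the intended action to the replicated store.
    zk.create("/my-framework/tasks/" + task.task_id.value, record, makepath=True)

    # 2. Act on it. If the scheduler crashes between steps 1 and 2, the next
    #    leader sees a LAUNCHING record and uses reconciliation to find out
    #    whether the launch actually happened.
    driver.launchTasks(offer.id, [task])
```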

  26.
    Example of Framework Failover
    [Sequence diagram: Scheduler Instances A and B, Leader Elector (Zk), State
    Store (Zk), Master, and a Framework Client. Scheduler A is elected leader,
    registers with the master, and stores the assigned framework ID (123) in
    the state store. The client creates job X; the scheduler durably stores the
    job, accepts a resource offer, durably records the intended task (456 on
    agent Y), and then launches it. When Scheduler A fails, Scheduler B is
    elected leader, fetches the current state (jobs, tasks, framework ID) from
    the state store, reregisters with the master using framework ID 123, and
    reconciles task 456 on agent Y.]

  27.
    ● Don’t have “backup” scheduler instances running all the time
    ○ Instead, use Marathon to launch scheduler instances, detect when they
    have failed, and launch replacements
    ● Similar to the approach described before
    ○ Still achieve HA via redundancy
    ○ Still at most one leading scheduler at any given time
    ● Tradeoffs:
    ○ Simpler to write
    ○ Depend on Marathon behavior for determining when to launch
    replacement schedulers, how to handle network partitions, etc.
    ● Recommended: still use Zk for leader election and to ensure mutual exclusion
    on updates to scheduler state
    Alternative: Use Marathon

  28.
    Tolerating
    Partitioned
    Agents

  29.
    Agents can be partitioned away
    from the leading master.
    When this happens,
    1. What does the master do?
    2. What should frameworks do?

  30.
    1. Master pings agents to determine if they are
    reachable (“health checks”)
    ● Default: 5 pings, 15 secs each = 75 secs
    2. Frameworks are told that master has lost
    contact with agent
    ● TASK_LOST sent for all tasks on agent
    ● slaveLost() callback invoked
    Master Behavior for Partitioned Agents
    Tradeoff: speed of
    failure detection
    vs. cost of recovery
    What happens if/when the agent reconnects
    to the cluster?

  31.
    Design Goal:
    When a partitioned agent reconnects to the cluster, it is
    shut down by the master.
    ● All tasks on the agent are terminated
    Problems:
    ● Partitioned agents will not be shut down if the master fails over
    while the agent is partitioned (bug!)
    ● Single policy for handling partitioned tasks
    ● Frameworks might want custom policies -- e.g., wait
    longer for stateful tasks
    Partitioned Agents in Mesos 1.0
    “LOST” tasks
    might go back
    to RUNNING!

  32.
    Design Goal:
    Allow frameworks to decide how partitioned tasks
    should be handled.
    Implementation:
    ● Opt-in via new framework capability:
    PARTITION_AWARE
    ● Partitioned agents will be allowed to reregister
    ● Handling partitioned tasks is up to the framework
    ● Time when task was partitioned:
    unreachable_time in TaskStatus
    Partitioned Agents in Mesos 1.1
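
Opting in might look like the following sketch, assuming Python bindings generated from Mesos 1.1 protobufs (where the PARTITION_AWARE capability is defined).

```python
# Sketch only: requires protobufs from Mesos 1.1+, where PARTITION_AWARE exists.
from mesos.interface import mesos_pb2

framework = mesos_pb2.FrameworkInfo()
framework.user = ""
framework.name = "my-partition-aware-framework"

# Opt in: partitioned agents may reregister and their tasks keep running;
# deciding what to do with those tasks becomes the framework's job.
capability = framework.capabilities.add()
capability.type = mesos_pb2.FrameworkInfo.Capability.PARTITION_AWARE
```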

  33.
    ● When a task is partitioned (TASK_LOST):
    ○ Decide if/when you want to replace it
    ● After timeout, launch replacement
    ● TASK_LOST tasks may come back
    ○ Typically: kill one of the copies
    ○ Rate-limit killing of tasks!
    ● TASK_LOST does not mean task is dead
    ○ Consider Zk if you need “≤ 1 instance”
    Handling Partitioned Tasks
    Cost of replacement vs.
    cost of unavailability
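
One possible policy, sketched below: wait out a grace period before replacing an unreachable task, and kill one of the copies if the original reappears. The bookkeeping and helper method are hypothetical; only the task states and driver calls come from Mesos.

```python
# Sketch only: policy code is hypothetical; assumes Mesos 1.1 task states.
import time

from mesos.interface import Scheduler, mesos_pb2


class PartitionAwareScheduler(Scheduler):
    def __init__(self, replacement_grace_secs=600):
        self.replacement_grace_secs = replacement_grace_secs
        self.unreachable_since = {}   # task_id -> timestamp
        self.replacement_of = {}      # replacement task_id -> original task_id

    def statusUpdate(self, driver, status):
        task_id = status.task_id.value

        if status.state == mesos_pb2.TASK_UNREACHABLE:
            # Partitioned, not necessarily dead: note when it became
            # unreachable; a periodic loop replaces it after the grace period.
            self.unreachable_since.setdefault(task_id, time.time())

        elif status.state == mesos_pb2.TASK_RUNNING:
            # An unreachable task can come back once its agent reregisters.
            # If a replacement was already launched, kill one of the copies
            # (a real framework should rate-limit these kills).
            self.unreachable_since.pop(task_id, None)
            if task_id in self.replacement_of.values():
                kill = mesos_pb2.TaskID()
                kill.value = task_id
                driver.killTask(kill)

    def tasks_needing_replacement(self):
        """Called by the (hypothetical) scheduling loop to pick replacements."""
        now = time.time()
        return [t for t, since in self.unreachable_since.items()
                if now - since > self.replacement_grace_secs]
```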

  34.
    Also enabled for PARTITION_AWARE frameworks:
    1. Operators/scripts can tell Mesos that a
    partitioned agent is permanently gone
    2. Frameworks can learn when a partitioned task
    is gone “forever”
    3. Fine-grained task states, replacing TASK_LOST:
    ● TASK_UNREACHABLE
    ● TASK_DROPPED
    ● TASK_GONE
    ● TASK_GONE_BY_OPERATOR
    Other Changes in Mesos 1.1

  35.
    1. Primary-backup is a simple approach to HA
    2. Mesos master is HA but stores little state
    3. Mesos provides features to enable HA frameworks, but
    leaves the details up to you
    4. Mesos 1.1 will allow frameworks to define how to handle
    partitioned tasks
    Conclusions

  36.
    THANK YOU!
    [email protected]
    @neil_conway
