How to Run from a Zombie: CloudStack Distributed Process Management

An exploration of CloudStack's distributed process management requirements and the challenges they present in the context of the CAP theorem. The talk addresses these challenges with a distributed process model that emphasizes efficiency, fault tolerance, and operational transparency.

John Burwell

June 24, 2013

Transcript

  1. HOW TO RUN FROM A ZOMBIE: CLOUDSTACK DISTRIBUTED PROCESS MANAGEMENT

    John Burwell (jburwell@apache.org | jburwell@basho.com @john_burwell) Tuesday, June 25, 13
  2. I Am Not A Zombie
    • Apache CloudStack PMC Member
    • Consulting Engineer @ Basho Technologies
    • Ran operations and designed automated provisioning for hybrid analytic/virtualization clouds
    • Led architectural design and server-side development of a SaaS physical security platform
  3. Current Process Management
    • No consistent system-wide model
    • Fail slowly, fail quietly
    • Resource overcommitment issues
    • Lack of instrumentation
  4. CP Resource?
    • Ordered/Serialized operations
    • Prevent overcommitment
    • Execution location independent
    • Lock free
  5. Orchestration Coordination
    1. Build a list of commands to be executed against a resource
    2. Enqueue the list of commands to the resource management layer for execution
    3. A process applies the commands to the resource
    4. Aggregate the results from the reply
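
The four steps above can be sketched roughly as follows. This is an illustration only, not CloudStack code: `ResourceManager`, `submit`, `orchestrate`, and the command callables are hypothetical names, and a single worker thread stands in for the process that applies commands to the resource.

```python
import queue
import threading


class ResourceManager:
    """Hypothetical resource management layer: one queue and one worker for one resource."""

    def __init__(self):
        self._work = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, commands):
        """Step 2: enqueue the list of commands; the returned queue carries the reply."""
        reply = queue.Queue(maxsize=1)
        self._work.put((list(commands), reply))
        return reply

    def _run(self):
        """Step 3: one process applies the commands, in order, to the resource."""
        while True:
            commands, reply = self._work.get()
            # Failure handling is deliberately omitted in this sketch.
            reply.put([command() for command in commands])


def orchestrate(manager):
    # Step 1: build the list of commands (stand-ins for real operations).
    commands = [lambda: "volume created", lambda: "vm started"]
    # Step 2: hand the list to the resource management layer.
    reply = manager.submit(commands)
    # Step 4: aggregate the results from the reply.
    return reply.get(timeout=5)
```

Because a single worker drains the queue, operations against the resource are serialized without the orchestration side holding a lock.
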
  6. Unit Of Work (UoW)
    • Definition: An ordered list of commands executed against one and only one resource
    • Created in the Orchestration layer
    • Executed by processes in the resource management layer
    • Failure of a command halts UoW execution
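
A minimal sketch of the UoW definition above, assuming commands are plain callables. The class name, its fields, and the `execute` method are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class UnitOfWork:
    """An ordered list of commands bound to exactly one resource (hypothetical shape)."""

    resource_id: str
    commands: List[Callable[[], str]]
    results: List[str] = field(default_factory=list)
    error: Optional[Exception] = None

    def execute(self) -> bool:
        """Run the commands in order; the first failure halts the whole UoW."""
        for command in self.commands:
            try:
                self.results.append(command())
            except Exception as exc:
                self.error = exc
                return False  # remaining commands are never run
        return True
```

After a failed run, `results` holds the output of the commands that completed and `error` holds the failure that stopped execution.
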
  7. Instrumentation
    • Collect and report statistics on a per resource basis
    • Inspect and remove pending UoWs for a resource
    • Kill a running process
    • View a history of UoWs completed by a resource
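
One possible shape for the per-resource bookkeeping listed above is sketched below. `InstrumentedResource` and its methods are hypothetical, UoWs are represented by plain identifiers, and killing a running process is not covered.

```python
from collections import deque


class InstrumentedResource:
    """Hypothetical per-resource bookkeeping; all names are illustrative only."""

    def __init__(self, resource_id):
        self.resource_id = resource_id
        self.pending = deque()  # ids of UoWs waiting to run
        self.history = []       # (uow_id, succeeded) for completed UoWs

    def enqueue(self, uow_id):
        self.pending.append(uow_id)

    def remove_pending(self, uow_id):
        """Operator action: drop a UoW before it starts. Returns True if it was found."""
        try:
            self.pending.remove(uow_id)
            return True
        except ValueError:
            return False

    def complete_next(self, succeeded):
        """Record the outcome of the UoW at the head of the queue."""
        uow_id = self.pending.popleft()
        self.history.append((uow_id, succeeded))
        return uow_id

    def stats(self):
        """Report simple counters for this resource."""
        failed = sum(1 for _, ok in self.history if not ok)
        return {
            "pending": len(self.pending),
            "completed": len(self.history),
            "failed": failed,
        }
```
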
  8. When Gravity Fails
    • Process execution fails
    • Resources become unavailable
    • Slow consumers
  9. Fail Fast; Fail Loudly
    • If the resource can be returned to a consistent state, reply with the process failure
    • If the resource cannot be returned to a consistent state, transition the resource to a failure state, drain the queue of pending UoWs, and reply with the process failure for each UoW
    • The orchestration layer will determine the appropriate recovery strategy (e.g. retry request on another resource)
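
The two failure paths above might look like the following sketch. The `Resource` class, its state strings, and the `restore_consistent_state` hook are assumptions made for illustration; the slides do not specify how consistency is checked or restored.

```python
from collections import deque


class Resource:
    """Hypothetical resource with a state flag and a queue of pending UoWs."""

    def __init__(self, resource_id):
        self.resource_id = resource_id
        self.state = "AVAILABLE"
        self.pending = deque()

    def restore_consistent_state(self):
        """Hypothetical hook: return True if the resource could be rolled back."""
        return True


def handle_failure(resource, failed_uow, error):
    """Return the (uow, error) replies owed to the orchestration layer."""
    replies = [(failed_uow, error)]
    if not resource.restore_consistent_state():
        # Unrecoverable: mark the resource failed and fail every pending UoW too.
        resource.state = "FAILED"
        while resource.pending:
            replies.append((resource.pending.popleft(), error))
    return replies
```

The function only reports failures. Choosing a recovery strategy, such as retrying on another resource, stays with the caller, as the slide describes.
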
  10. Preventing A Logjam
    • Bounded Queues
    • Request and Message Timeouts
    • A failure to enqueue a request or a request timeout triggers the resource’s circuit breaker
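
A rough sketch of how a bounded queue, a timeout, and a circuit breaker could fit together. `GuardedQueue`, `CircuitOpenError`, and the `failure_threshold` parameter are hypothetical, and resetting the breaker (for example through a half-open state) is left out.

```python
import queue


class CircuitOpenError(Exception):
    """Raised when requests are rejected because the breaker is open."""


class GuardedQueue:
    """Hypothetical bounded queue for one resource, guarded by a circuit breaker."""

    def __init__(self, capacity, failure_threshold=1):
        self._queue = queue.Queue(maxsize=capacity)  # bounded queue
        self._failures = 0
        self._threshold = failure_threshold
        self.open = False

    def enqueue(self, uow, timeout=0.1):
        if self.open:
            # Fail fast instead of piling more work onto a struggling resource.
            raise CircuitOpenError("resource circuit breaker is open")
        try:
            self._queue.put(uow, timeout=timeout)
        except queue.Full:
            self._record_failure()
            raise

    def record_timeout(self):
        """Called by the requester when a reply did not arrive in time."""
        self._record_failure()

    def _record_failure(self):
        self._failures += 1
        if self._failures >= self._threshold:
            self.open = True
```

With the default threshold of one, a single failed enqueue or a single request timeout opens the breaker, which matches the slide; a higher threshold would tolerate occasional slowness.
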
  11. Lightweight Threads
    A thread that is not scheduled by the operating system -- avoiding context switch overhead.
  12. Actor Model
    • An actor represents state and behavior
    • Communicate by message passing
    • Each actor is allocated a lightweight thread and mailbox
    • Location independent
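
To make the actor idea concrete, here is a small sketch that uses Python's `asyncio` tasks as a stand-in for lightweight threads, since they are scheduled by an event loop rather than by the operating system. The deck does not name an implementation; `ResourceActor`, `ask`, and `demo` are hypothetical, and location independence is not demonstrated.

```python
import asyncio


class ResourceActor:
    """Hypothetical actor: private state plus a mailbox, driven by one task."""

    def __init__(self, resource_id):
        self.resource_id = resource_id
        self.applied = []               # state touched only by this actor's task
        self.mailbox = asyncio.Queue()

    async def run(self):
        """Behavior: process one message at a time, in arrival order."""
        while True:
            message, reply = await self.mailbox.get()
            self.applied.append(message)
            reply.set_result(f"{self.resource_id}: {message}")

    async def ask(self, message):
        """Send a message to the mailbox and wait for the actor's reply."""
        reply = asyncio.get_running_loop().create_future()
        await self.mailbox.put((message, reply))
        return await reply


async def demo(count=1000):
    # One actor, one mailbox, and one lightweight task per resource.
    actors = [ResourceActor(f"host-{i}") for i in range(count)]
    tasks = [asyncio.create_task(actor.run()) for actor in actors]
    replies = await asyncio.gather(*(actor.ask("start vm") for actor in actors))
    for task in tasks:
        task.cancel()
    return replies
```

Running `asyncio.run(demo())` creates a thousand actors on a single operating system thread, which hints at why a queue and process per resource can be affordable with this model.
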
  13. Summary
    • Orchestration and Resource Management must be properly divided to satisfy CAP
    • To provide resource serialization guarantees, assign a queue and a process to each resource
    • Fail fast, fail loudly
    • An Actor Model based on lightweight threads may provide the scalability required to dedicate a queue and process per resource