How to Run from a Zombie: CloudStack Distributed Process Management

An exploration of CloudStack's distributed process management requirements and the challenges they present in the context of the CAP theorem. These challenges are addressed through a distributed process model that emphasizes efficiency, fault tolerance, and operational transparency.

John Burwell

June 24, 2013

Transcript

  1. I Am Not A Zombie

    • Apache CloudStack PMC Member
    • Consulting Engineer @ Basho Technologies
    • Ran operations and designed automated provisioning for hybrid analytic/virtualization clouds
    • Led architectural design and server-side development of a SaaS physical security platform

    Tuesday, June 25, 13
  2. Current Process Management

    • No consistent system-wide model
    • Fail slowly, fail quietly
    • Resource overcommitment issues
    • Lack of instrumentation
  3. CP Resource?

    • Ordered/Serialized operations
    • Prevent overcommitment
    • Execution location independent
    • Lock free
  4. Orchestration Coordination

    1. Build a list of commands to be executed against a resource
    2. Enqueue the list of commands to the resource management layer for execution
    3. A process applies the commands to the resource
    4. Aggregate the results from the reply
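The four coordination steps can be sketched roughly as follows. This is only an illustration, not CloudStack code: `ResourceProcess`, `build_start_vm_commands`, and `orchestrate` are hypothetical names, commands are modeled as plain callables, and an OS thread stands in for the per-resource process.

```python
import queue
import threading
from concurrent.futures import Future


class ResourceProcess:
    """Hypothetical resource management layer: one queue and one worker per resource."""

    def __init__(self, name):
        self.name = name
        self.pending = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def enqueue(self, commands):
        # Step 2: hand the whole command list to the resource's queue.
        reply = Future()
        self.pending.put((list(commands), reply))
        return reply

    def _run(self):
        while True:
            commands, reply = self.pending.get()
            try:
                # Step 3: apply the commands to the resource, in order.
                reply.set_result([command(self.name) for command in commands])
            except Exception as exc:
                reply.set_exception(exc)


def build_start_vm_commands(vm):
    # Step 1: build the list of commands (assumed shape: callables taking the resource name).
    return [
        lambda resource: f"allocate {vm} on {resource}",
        lambda resource: f"start {vm} on {resource}",
    ]


def orchestrate(resource, commands, timeout=5.0):
    reply = resource.enqueue(commands)
    # Step 4: aggregate the results carried by the reply.
    return {"resource": resource.name, "results": reply.result(timeout=timeout)}
```

Because each resource has exactly one worker draining its queue, command lists for the same resource are applied strictly in the order they were enqueued.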
  5. Unit Of Work (UoW)

    • Definition: An ordered list of commands executed against one and only one resource
    • Created in the Orchestration layer
    • Executed by processes in the resource management layer
    • Failure of a command halts UoW execution
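A minimal sketch of a UoW as a data structure, assuming commands are named callables that raise on failure. `UnitOfWork` and `CommandResult` are illustrative names, not types from CloudStack.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CommandResult:
    name: str
    ok: bool
    detail: str = ""


@dataclass
class UnitOfWork:
    """An ordered list of commands bound to exactly one resource."""
    resource_id: str
    commands: List[Tuple[str, Callable[[str], object]]]

    def execute(self) -> List[CommandResult]:
        results = []
        for name, action in self.commands:
            try:
                results.append(CommandResult(name, True, str(action(self.resource_id))))
            except Exception as exc:
                # Failure of a command halts UoW execution: later commands never run.
                results.append(CommandResult(name, False, str(exc)))
                break
        return results
```

The returned list always ends at the first failed command, so the orchestration layer can see exactly how far the UoW progressed.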
  6. Instrumentation

    • Collect and report statistics on a per resource basis
    • Inspect and remove pending UoWs for a resource
    • Kill a running process
    • View a history of UoWs completed by a resource
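Most of these capabilities fall out naturally once each resource owns its queue. The sketch below is a hypothetical `InstrumentedResource` that tracks pending UoW identifiers, a bounded history, and simple counters. Killing a running process depends on the execution model and is left out.

```python
from collections import deque


class InstrumentedResource:
    """Illustrative per-resource bookkeeping; names and fields are assumptions."""

    def __init__(self, name, history_size=100):
        self.name = name
        self.pending = deque()
        self.history = deque(maxlen=history_size)  # most recent completed UoWs
        self.succeeded = 0
        self.failed = 0

    def submit(self, uow_id):
        self.pending.append(uow_id)

    def list_pending(self):
        # Inspect pending UoWs without disturbing the queue.
        return list(self.pending)

    def remove_pending(self, uow_id):
        # Remove a pending UoW; returns False if it was not queued.
        try:
            self.pending.remove(uow_id)
            return True
        except ValueError:
            return False

    def complete_next(self, succeeded):
        # Record the outcome of the UoW at the head of the queue.
        uow_id = self.pending.popleft()
        if succeeded:
            self.succeeded += 1
        else:
            self.failed += 1
        self.history.append((uow_id, succeeded))
        return uow_id

    def stats(self):
        return {
            "resource": self.name,
            "pending": len(self.pending),
            "succeeded": self.succeeded,
            "failed": self.failed,
        }
```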
  7. When Gravity Fails

    • Process execution fails
    • Resources become unavailable
    • Slow consumers
  8. Fail Fast; Fail Loudly

    • If the resource can be returned to a consistent state, reply with the process failure
    • If the resource cannot be returned to a consistent state, transition the resource to a failure state, drain the queue of pending UoWs, and reply with the process failure for each UoW
    • The orchestration layer will determine the appropriate recovery strategy (e.g. retry the request on another resource)
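The two failure paths might look like the sketch below. It assumes each UoW comes with a `recover` callable that returns `True` when the resource was restored to a consistent state. `Resource`, `State`, and the reply tuples are hypothetical.

```python
from collections import deque
from enum import Enum


class State(Enum):
    AVAILABLE = "available"
    FAILED = "failed"


class Resource:
    def __init__(self, name):
        self.name = name
        self.state = State.AVAILABLE
        self.pending = deque()

    def submit(self, uow_id, work, recover):
        self.pending.append((uow_id, work, recover))

    def run_next(self):
        """Run the next UoW and return the replies as (uow_id, outcome) pairs."""
        uow_id, work, recover = self.pending.popleft()
        try:
            work()
            return [(uow_id, "ok")]
        except Exception as exc:
            failure = f"failed: {exc}"
            if recover():
                # Consistent state restored: reply with the process failure only.
                return [(uow_id, failure)]
            # Not recoverable: enter the failure state and drain the queue,
            # replying with the failure for every pending UoW.
            self.state = State.FAILED
            replies = [(uow_id, failure)]
            while self.pending:
                drained_id, _, _ = self.pending.popleft()
                replies.append((drained_id, failure))
            return replies
```

Every caller gets an immediate, explicit failure reply, which leaves the choice of recovery strategy to the orchestration layer.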
  9. Preventing A Logjam

    • Bounded Queues
    • Request and Message Timeouts
    • A failure to enqueue a request, or a request timeout, triggers the resource’s circuit breaker
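One way to combine a bounded queue with a circuit breaker is sketched below. The class name, the `failure_threshold` parameter, and the manual `reset` are assumptions made for illustration. The slides do not specify how the breaker trips or closes again.

```python
import queue


class CircuitOpenError(Exception):
    """Raised when requests are rejected because the breaker is open."""


class GuardedResourceQueue:
    def __init__(self, capacity, failure_threshold=1):
        self.requests = queue.Queue(maxsize=capacity)  # bounded queue
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.is_open = False

    def _record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.is_open = True

    def enqueue(self, request, timeout=0.1):
        if self.is_open:
            # Reject immediately instead of letting callers pile up.
            raise CircuitOpenError("resource circuit breaker is open")
        try:
            self.requests.put(request, timeout=timeout)
        except queue.Full:
            # A failure to enqueue counts against the breaker.
            self._record_failure()
            raise

    def request_timed_out(self):
        # Called by whoever was waiting on a reply that never arrived in time.
        self._record_failure()

    def reset(self):
        self.failures = 0
        self.is_open = False
```

With the breaker open, a slow consumer causes fast rejections at the producer instead of an ever-growing backlog.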
  10. Lightweight Threads

    A thread that is not scheduled by the operating system, avoiding context switch overhead.
  11. Actor Model

    • An actor represents state and behavior
    • Communicate by message passing
    • Each actor is allocated a lightweight thread and mailbox
    • Location independent
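As a rough illustration of an actor with a mailbox, the sketch below uses Python's `asyncio`. Coroutine tasks stand in for lightweight threads here because they are scheduled by the event loop rather than by the operating system. This is an analogy, not the runtime the talk has in mind, and `ResourceActor` and `demo` are hypothetical names.

```python
import asyncio


class ResourceActor:
    """State plus behavior, reachable only through messages in its mailbox."""

    def __init__(self, name):
        self.name = name
        self.applied = []               # state owned exclusively by this actor
        self.mailbox = asyncio.Queue()
        self._task = None

    def start(self):
        # One task (standing in for a lightweight thread) per actor.
        self._task = asyncio.create_task(self._run())

    async def ask(self, message):
        reply = asyncio.get_running_loop().create_future()
        await self.mailbox.put((message, reply))
        return await reply

    async def _run(self):
        while True:
            message, reply = await self.mailbox.get()
            # Messages are handled one at a time, so no locks are needed.
            self.applied.append(message)
            reply.set_result(f"{self.name} applied {message}")

    async def stop(self):
        self._task.cancel()
        try:
            await self._task
        except asyncio.CancelledError:
            pass


async def demo(actor_count=1000):
    actors = [ResourceActor(f"host-{i}") for i in range(actor_count)]
    for actor in actors:
        actor.start()
    replies = await asyncio.gather(*(actor.ask("start-vm") for actor in actors))
    for actor in actors:
        await actor.stop()
    return replies
```

Running `asyncio.run(demo())` starts a thousand actors in a single OS thread, which hints at why a queue and process per resource becomes affordable with lightweight threads.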
  12. Summary

    • Orchestration and Resource Management must be properly divided to satisfy CAP
    • To provide resource serialization guarantees, assign a queue and a process to each resource
    • Fail fast, fail loudly
    • An Actor Model based on lightweight threads may provide the scalability required to dedicate a queue and process per resource