How to Run from a Zombie: CloudStack Distributed Process Management

John Burwell ([email protected] | [email protected] @john_burwell) Tuesday, June 25, 13

I Am Not A Zombie
• Apache CloudStack PMC Member
• Consulting Engineer @ Basho Technologies
• Ran operations and designed automated provisioning for hybrid analytic/virtualization clouds
• Led architectural design and server-side development of a SaaS physical security platform

Current Process Management
• No consistent system-wide model
• Fail slowly, fail quietly
• Resource overcommitment issues
• Lack of instrumentation

What is a cloud?


Hopefully not ...

Hosts, Virtual Routers, Virtual Machines, Primary Storage, Networks, Secondary Storage, Load

Resource Process State A

At its core, CloudStack ... Integrates infrastructure components Manages resources

Consistency Availability Partition

CloudStack provides zones, clusters, and pods to partition resources.

Orchestration operations are eventually consistent

... but resource operations must be consistent & serialized.

A system cannot be simultaneously consistent and available during a network partition.

Orchestration

CP Resource?
• Ordered/Serialized operations
• Prevent overcommitment
• Execution location independent
• Lock free

Orchestration Coordination
1. Build a list of commands to be executed against a resource
2. Enqueue the list of commands to the resource management layer for execution
3. A process applies the commands to the resource
4. Aggregate the results from the reply
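The four coordination steps above can be sketched in plain Java. This is only an illustration under stated assumptions: `CoordinationSketch`, `execute`, and the use of a `StringBuilder` as the "resource" are hypothetical names and stand-ins, not CloudStack APIs, and a single-threaded executor plays the role of one resource's queue and process.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

// Hypothetical sketch: these names are illustrative and are not CloudStack APIs.
final class CoordinationSketch {

    static List<String> execute(ExecutorService resourceProcess,
                                StringBuilder resource,
                                List<Function<StringBuilder, String>> commands) {
        // Step 2: enqueue the whole list as a single task for the resource's process.
        Future<List<String>> reply = resourceProcess.submit(() -> {
            List<String> results = new ArrayList<>();
            // Step 3: the process applies the commands to the resource, in order.
            for (Function<StringBuilder, String> command : commands) {
                results.add(command.apply(resource));
            }
            return results;
        });
        try {
            // Step 4: collect the results from the reply, with a bounded wait.
            return reply.get(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the reply", e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("commands did not complete", e);
        }
    }

    public static void main(String[] args) {
        // A single-threaded executor stands in for one resource's queue and process.
        ExecutorService resourceProcess = Executors.newSingleThreadExecutor();
        StringBuilder host = new StringBuilder();
        // Step 1: build the list of commands for one resource.
        List<Function<StringBuilder, String>> commands = new ArrayList<>();
        commands.add(r -> { r.append("[volume]"); return "volume attached"; });
        commands.add(r -> { r.append("[vm]"); return "vm started"; });
        System.out.println(execute(resourceProcess, host, commands));
        resourceProcess.shutdown();
    }
}
```

Because the executor has exactly one thread, tasks submitted for the same resource run one at a time in submission order, which is the serialization property the slides ask for without explicit locks in the caller.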

Resource Process State Queue 1 1 Unit

Unit Of Work (UoW)
• Definition: An ordered list of commands executed against one and only one resource
• Created in the orchestration layer
• Executed by processes in the resource management layer
• Failure of a command halts UoW execution
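A Unit of Work as defined above might look like the following minimal sketch. The class, the `Command` interface, and the `Result` type are hypothetical names chosen for illustration; the slides do not prescribe a concrete API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: these names are illustrative and are not CloudStack APIs.
final class UnitOfWorkSketch<R> {

    /** One step applied to the resource; it signals failure by throwing. */
    interface Command<T> {
        void execute(T resource) throws Exception;
    }

    /** How many commands completed, and the failure that halted execution (null if none). */
    static final class Result {
        final int completed;
        final Exception failure;

        Result(int completed, Exception failure) {
            this.completed = completed;
            this.failure = failure;
        }

        boolean succeeded() {
            return failure == null;
        }
    }

    private final R resource;                 // the one and only one target resource
    private final List<Command<R>> commands;  // executed in list order

    UnitOfWorkSketch(R resource, List<Command<R>> commands) {
        this.resource = resource;
        this.commands = new ArrayList<>(commands); // defensive copy keeps the order fixed
    }

    Result run() {
        int completed = 0;
        for (Command<R> command : commands) {
            try {
                command.execute(resource);
            } catch (Exception e) {
                // Failure of a command halts the UoW; later commands never run.
                return new Result(completed, e);
            }
            completed++;
        }
        return new Result(completed, null);
    }

    public static void main(String[] args) {
        List<Command<StringBuilder>> commands = new ArrayList<>();
        commands.add(vm -> vm.append("allocated;"));
        commands.add(vm -> { throw new IllegalStateException("start failed"); });
        commands.add(vm -> vm.append("never runs;"));
        Result result = new UnitOfWorkSketch<>(new StringBuilder(), commands).run();
        System.out.println(result.completed + " completed, succeeded=" + result.succeeded());
    }
}
```

Returning the count of completed commands alongside the failure gives the orchestration layer enough information to decide whether to retry or compensate.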

Instrumentation
• Collect and report statistics on a per resource basis
• Inspect and remove pending UoWs for a resource
• Kill a running process
• View a history of UoWs completed by a resource
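As a rough illustration of the instrumentation points above, a per-resource bookkeeping object could expose pending UoWs, a completion history, and simple counters. Everything here is an assumption made for the example (`ResourceInstrumentationSketch`, string UoW identifiers, the `stats()` format); killing a running process is left out, since it depends on how processes are actually run.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: these names are illustrative and are not CloudStack APIs.
final class ResourceInstrumentationSketch {
    private final Deque<String> pending = new ArrayDeque<>(); // UoW ids waiting to run
    private final List<String> history = new ArrayList<>();   // finished UoW ids, oldest first
    private long succeeded;
    private long failed;

    synchronized void enqueued(String uowId) {
        pending.addLast(uowId);
    }

    /** Inspect pending UoWs without disturbing the queue. */
    synchronized List<String> pending() {
        return new ArrayList<>(pending);
    }

    /** Remove a pending UoW; returns false if it was not queued. */
    synchronized boolean removePending(String uowId) {
        return pending.remove(uowId);
    }

    /** Record a UoW as finished and update the per-resource counters. */
    synchronized void completed(String uowId, boolean success) {
        pending.remove(uowId);
        history.add(uowId);
        if (success) {
            succeeded++;
        } else {
            failed++;
        }
    }

    /** History of UoWs completed by this resource. */
    synchronized List<String> history() {
        return new ArrayList<>(history);
    }

    /** Per-resource statistics in a simple, assumed text format. */
    synchronized String stats() {
        return "pending=" + pending.size() + " succeeded=" + succeeded + " failed=" + failed;
    }
}
```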

When Gravity Fails
• Process execution fails
• Resources become unavailable
• Slow consumers

Fail Fast; Fail Loudly
• If the resource can be returned to a consistent state, reply with the process failure
• If the resource cannot be returned to a consistent state, transition the resource to a failure state, drain the queue of pending UoWs, and reply with the process failure for each UoW
• The orchestration layer will determine the appropriate recovery strategy (e.g., retry the request on another resource)
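The two failure branches above can be sketched as a small state transition. The names (`FailFastSketch`, `ResourceProcess`, `Reply`, the two-value `State` enum) are hypothetical, and the decision about whether the resource was restored to a consistent state is passed in as a flag rather than computed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: these names are illustrative and are not CloudStack APIs.
final class FailFastSketch {

    enum State { AVAILABLE, FAILED }

    /** A failure reply sent back to the orchestration layer for one UoW. */
    static final class Reply {
        final String uowId;
        final String reason;

        Reply(String uowId, String reason) {
            this.uowId = uowId;
            this.reason = reason;
        }
    }

    static final class ResourceProcess {
        State state = State.AVAILABLE;
        final Deque<String> pending = new ArrayDeque<>(); // ids of queued UoWs

        /** Handle a failed UoW and return the failure replies to send. */
        List<Reply> onFailure(String failedUowId, boolean restoredToConsistentState) {
            List<Reply> replies = new ArrayList<>();
            replies.add(new Reply(failedUowId, "process failure"));
            if (!restoredToConsistentState) {
                // Resource is unusable: mark it failed and fail every queued UoW now,
                // instead of letting each one time out later.
                state = State.FAILED;
                String uowId;
                while ((uowId = pending.poll()) != null) {
                    replies.add(new Reply(uowId, "process failure"));
                }
            }
            return replies;
        }
    }

    public static void main(String[] args) {
        ResourceProcess host = new ResourceProcess();
        host.pending.add("uow-2");
        host.pending.add("uow-3");
        List<Reply> replies = host.onFailure("uow-1", false);
        System.out.println(replies.size() + " failure replies, state=" + host.state);
    }
}
```

The recovery decision itself (for example, retrying on another resource) stays outside this class, in line with the slide's statement that the orchestration layer chooses the strategy.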

Preventing A Logjam
• Bounded Queues
• Request and Message Timeouts
• A failure to enqueue a request or a request timeout triggers the resource’s circuit breaker
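One way to combine a bounded queue, an enqueue timeout, and a circuit breaker is sketched below using `java.util.concurrent.ArrayBlockingQueue`. The wrapper class, its method names, and the simple open/closed breaker (with a manual `reset`) are assumptions for illustration; a production breaker would usually add a half-open state and automatic recovery.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: these names are illustrative and are not CloudStack APIs.
final class BoundedQueueSketch {
    private final BlockingQueue<String> queue;  // bounded: capacity fixed at construction
    private volatile boolean circuitOpen = false;

    BoundedQueueSketch(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Try to enqueue a UoW id; a full queue after the timeout trips the breaker. */
    boolean enqueue(String uowId, long timeoutMillis) {
        if (circuitOpen) {
            return false; // fail fast: do not even wait while the breaker is open
        }
        try {
            if (queue.offer(uowId, timeoutMillis, TimeUnit.MILLISECONDS)) {
                return true;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        circuitOpen = true; // enqueue failed: treat the resource as a slow consumer
        return false;
    }

    /** A request that timed out after being enqueued also trips the breaker. */
    void onRequestTimeout() {
        circuitOpen = true;
    }

    boolean isCircuitOpen() {
        return circuitOpen;
    }

    /** Close the breaker again once the resource is known to be healthy. */
    void reset() {
        circuitOpen = false;
    }

    public static void main(String[] args) {
        BoundedQueueSketch host = new BoundedQueueSketch(1);
        System.out.println(host.enqueue("uow-1", 10)); // queue has room
        System.out.println(host.enqueue("uow-2", 10)); // queue is full, breaker trips
        System.out.println(host.isCircuitOpen());
    }
}
```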

How could we implement this model?

Lightweight Threads
A thread that is not scheduled by the operating system, avoiding context switch overhead.

Actor Model
• An actor represents state and behavior
• Communicate by message passing
• Each actor is allocated a lightweight thread and mailbox
• Location independent
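The actor idea above can be shown without any framework. In this sketch a plain `java.lang.Thread` stands in for the lightweight thread that a library such as Akka or Quasar would provide, and `ResourceActorSketch`, `tell`, and `stopAndDrain` are hypothetical names; it is meant only to show the mailbox and the single consumer that owns the state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: a plain thread stands in for a framework's lightweight thread.
final class ResourceActorSketch {
    private static final Object STOP = new Object(); // sentinel message that ends the loop

    private final BlockingQueue<Object> mailbox = new LinkedBlockingQueue<>();
    private final List<String> applied = new ArrayList<>(); // state touched only by the actor thread
    private final Thread thread = new Thread(this::processMailbox);

    private ResourceActorSketch() {
    }

    static ResourceActorSketch start() {
        ResourceActorSketch actor = new ResourceActorSketch();
        actor.thread.start();
        return actor;
    }

    /** Senders never touch actor state; they only drop a message in the mailbox. */
    void tell(String message) {
        mailbox.add(message);
    }

    /** The actor handles one message at a time, in arrival order. */
    private void processMailbox() {
        try {
            for (Object message = mailbox.take(); message != STOP; message = mailbox.take()) {
                applied.add((String) message);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Ask the actor to stop after the messages already queued, then return what it applied. */
    List<String> stopAndDrain() {
        mailbox.add(STOP);
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while stopping the actor", e);
        }
        return applied;
    }

    public static void main(String[] args) {
        ResourceActorSketch host = ResourceActorSketch.start();
        host.tell("attach volume");
        host.tell("start vm");
        System.out.println(host.stopAndDrain());
    }
}
```

With one operating-system thread per actor this approach would not scale to one actor per resource, which is the motivation the slides give for lightweight threads.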

Mailbox Resource Actor FSM Orchestration Unit

Java Actor Frameworks
• Akka (http://akka.io)
• Quasar (https://github.com/puniverse/quasar)

Summary
• Orchestration and Resource Management must be properly divided to satisfy CAP
• To provide resource serialization guarantees, assign a queue and a process to each resource
• Fail fast, fail loudly
• An Actor Model based on lightweight threads may provide the scalability required to dedicate a queue and process per resource

Thoughts? Questions?

Thank you! Slides available @ http://speakerdeck.com/jburwell
