Slide 1

Slide 1 text

HOW TO RUN FROM A ZOMBIE: CLOUDSTACK DISTRIBUTED PROCESS MANAGEMENT John Burwell ([email protected] | [email protected] @john_burwell) Tuesday, June 25, 13

Slide 2

Slide 2 text

I Am Not A Zombie • Apache CloudStack PMC Member • Consulting Engineer @ Basho Technologies • Ran operations and designed automated provisioning for hybrid analytic/virtualization clouds • Led architectural design and server-side development of a SaaS physical security platform

Slide 3

Slide 3 text

Current Process Management • No consistent system-wide model • Fail slowly, fail quietly • Resource overcommitment issues • Lack of instrumentation

Slide 4

Slide 4 text

What is a cloud?

Slide 5

Slide 5 text


Slide 6

Slide 6 text

Hopefully not ...

Slide 7

Slide 7 text


Slide 8

Slide 8 text


Slide 9

Slide 9 text


Slide 10

Slide 10 text

Hosts Virtual Routers Virtual Machines Primary Storage Networks Secondary Storage Load

Slide 11

Slide 11 text

Resource Process State A

Slide 12

Slide 12 text

At its core, CloudStack ... • Integrates infrastructure components • Manages resources

Slide 13

Slide 13 text


Slide 14

Slide 14 text

Consistency Availability Partition

Slide 15

Slide 15 text

CloudStack provides zones, clusters, and pods to partition resources.

Slide 16

Slide 16 text

Orchestration operations are eventually consistent

Slide 17

Slide 17 text


Slide 18

Slide 18 text

... but resource operations must be consistent & serialized.

Slide 19

Slide 19 text


Slide 20

Slide 20 text

During a network partition, a system cannot be simultaneously consistent and available.

Slide 21

Slide 21 text

Orchestration

Slide 22

Slide 22 text

CP Resource? • Ordered/Serialized operations • Prevent overcommitment • Execution location independent • Lock free

Slide 23

Slide 23 text

Orchestration Coordination 1. Build a list of commands to be executed against a resource 2. Enqueue the list of commands to the resource management layer for execution 3. A process applies the commands to the resource 4. Aggregate the results from the reply
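The four coordination steps above can be sketched in Java. This is a minimal illustration under stated assumptions, not CloudStack code: `ResourceManager`, `enqueue`, and the string-valued commands are hypothetical names, and a single-threaded executor stands in for the per-resource queue and process.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hypothetical resource management layer for ONE resource. A single-threaded
// executor gives the resource one queue and one process, so command lists run
// in submission order without explicit locks.
class ResourceManager {
    private final ExecutorService process = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "resource-process");
        t.setDaemon(true); // do not keep the JVM alive in this sketch
        return t;
    });

    // Step 2: the orchestration layer enqueues the command list.
    // Step 3: the resource's process applies each command in order.
    // Step 4: the returned future carries the aggregated results back.
    CompletableFuture<List<String>> enqueue(List<Supplier<String>> commands) {
        return CompletableFuture.supplyAsync(() -> {
            List<String> results = new ArrayList<>();
            for (Supplier<String> command : commands) {
                results.add(command.get());
            }
            return results;
        }, process);
    }
}
```

Step 1 happens on the caller's side: the orchestration layer builds the `List<Supplier<String>>` and then waits on, or composes, the returned future.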

Slide 24

Slide 24 text

Resource Process State Queue 1 1 Unit

Slide 25

Slide 25 text

Unit Of Work (UoW) • Definition: An ordered list of commands executed against one and only one resource. • Created in the orchestration layer • Executed by processes in the resource management layer • Failure of a command halts UoW execution
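The UoW definition above can be sketched as a small Java class. The names `UnitOfWork`, `Result`, and `execute` are hypothetical and chosen for illustration only; the sketch shows the two properties the slide states: a UoW is bound to exactly one resource, and the first failing command halts execution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical Unit of Work: an ordered command list bound to a single resource.
final class UnitOfWork {
    // Outcome of a run: outputs of the commands that completed, plus the
    // failure that halted execution (null when every command succeeded).
    static final class Result {
        final List<String> outputs;
        final Exception failure;

        Result(List<String> outputs, Exception failure) {
            this.outputs = outputs;
            this.failure = failure;
        }

        boolean succeeded() {
            return failure == null;
        }
    }

    private final String resourceId;
    private final List<Callable<String>> commands;

    UnitOfWork(String resourceId, List<Callable<String>> commands) {
        this.resourceId = resourceId;
        this.commands = new ArrayList<>(commands); // defensive copy keeps the order fixed
    }

    String resourceId() {
        return resourceId;
    }

    // Runs commands in order; the first failure stops the remaining commands.
    Result execute() {
        List<String> outputs = new ArrayList<>();
        for (Callable<String> command : commands) {
            try {
                outputs.add(command.call());
            } catch (Exception e) {
                return new Result(outputs, e);
            }
        }
        return new Result(outputs, null);
    }
}
```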

Slide 26

Slide 26 text

Instrumentation • Collect and report statistics on a per resource basis • Inspect and remove pending UoWs for a resource • Kill a running process • View a history of UoWs completed by a resource

Slide 27

Slide 27 text

When Gravity Fails • Process execution fails • Resources become unavailable • Slow consumers

Slide 28

Slide 28 text

Fail Fast; Fail Loudly • If the resource can be returned to a consistent state, reply with the process failure • If the resource cannot be returned to a consistent state, transition the resource to a failure state, drain the queue of pending UoWs, and reply with the process failure for each UoW • The orchestration layer will determine the appropriate recovery strategy (e.g. retry the request on another resource)
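The two failure paths above can be sketched as follows. `FailFastResource`, its `State` enum, and the reply strings are hypothetical; UoWs are reduced to string ids to keep the example short. Rejecting new work once the resource has failed is an added assumption, not something the slide states.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical resource that fails fast and loudly. Not thread safe: in the
// model described here only the resource's own process would touch it.
class FailFastResource {
    enum State { AVAILABLE, FAILED }

    private State state = State.AVAILABLE;
    private final Queue<String> pending = new ArrayDeque<>(); // queued UoW ids
    private final List<String> replies = new ArrayList<>();   // replies sent to orchestration

    void enqueue(String uowId) {
        if (state == State.FAILED) {
            // Assumption: a failed resource rejects new work immediately.
            replies.add(uowId + ": process failure");
            return;
        }
        pending.add(uowId);
    }

    // Called when the UoW currently executing has failed.
    void reportFailure(String uowId, boolean consistentStateRestored) {
        replies.add(uowId + ": process failure");
        if (consistentStateRestored) {
            return; // resource stays available; pending UoWs remain queued
        }
        state = State.FAILED;
        String next;
        while ((next = pending.poll()) != null) {
            replies.add(next + ": process failure"); // drain and fail every pending UoW
        }
    }

    State state() { return state; }
    int pendingCount() { return pending.size(); }
    List<String> replies() { return replies; }
}
```

The orchestration layer reads the replies and chooses the recovery strategy; the resource itself never retries.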

Slide 29

Slide 29 text

Preventing A Logjam • Bounded Queues • Request and Message Timeouts • A failure to enqueue a request or a request timeout triggers the resource’s circuit breaker
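A bounded queue guarded by a circuit breaker might look like the sketch below. `GuardedQueue` and its methods are hypothetical names, the breaker is reduced to a single open/closed flag, and timeout detection is left to the caller, which reports it through `requestTimedOut()`. The only library behavior relied on is that `ArrayBlockingQueue.offer` returns `false` instead of blocking when the queue is full.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical bounded request queue with a very small circuit breaker.
class GuardedQueue {
    private final BlockingQueue<String> queue;
    private volatile boolean open; // true = breaker tripped, requests are rejected

    GuardedQueue(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Returns false when the request is rejected, either because the breaker
    // is already open or because the queue is full (which trips the breaker).
    boolean enqueue(String uowId) {
        if (open) {
            return false;
        }
        if (!queue.offer(uowId)) { // non-blocking; false when at capacity
            open = true;
            return false;
        }
        return true;
    }

    // The caller reports a request timeout; this also trips the breaker.
    void requestTimedOut() {
        open = true;
    }

    // Assumption: something outside this class decides when to close the breaker.
    void reset() {
        open = false;
    }

    boolean isOpen() { return open; }
    int pending() { return queue.size(); }
}
```

Rejecting immediately keeps a slow consumer from turning into an unbounded backlog: the caller learns about the problem at enqueue time rather than after a long wait.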

Slide 30

Slide 30 text

How could we implement this model?

Slide 31

Slide 31 text

Lightweight Threads A thread that is not scheduled by the operating system -- avoiding context switch overhead.

Slide 32

Slide 32 text

Actor Model • An actor represents state and behavior • Communicate by message passing • Each actor is allocated a lightweight thread and mailbox • Location independent
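To make the actor idea concrete, here is a hand-rolled sketch in plain Java. It does not use Akka or Quasar, and `HostActor`, `Message`, and `send` are hypothetical names. It also runs the actor on an ordinary operating system thread, so it shows the mailbox and message-passing structure only, not the lightweight scheduling an actor framework would add.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical actor for one host. State (free memory) is touched only by the
// actor's own thread, so operations on it are serialized without locks.
class HostActor {
    // A message carries a request plus a future through which the actor replies.
    static final class Message {
        final String type; // "reserve" or "query"
        final int amountMb;
        final CompletableFuture<Integer> reply = new CompletableFuture<>();

        Message(String type, int amountMb) {
            this.type = type;
            this.amountMb = amountMb;
        }
    }

    private final BlockingQueue<Message> mailbox = new LinkedBlockingQueue<>();
    private int freeMemoryMb;

    HostActor(int freeMemoryMb) {
        this.freeMemoryMb = freeMemoryMb;
        Thread thread = new Thread(this::loop, "host-actor");
        thread.setDaemon(true); // do not keep the JVM alive in this sketch
        thread.start();
    }

    // Message passing: callers never touch the state, they only post to the mailbox.
    CompletableFuture<Integer> send(String type, int amountMb) {
        Message message = new Message(type, amountMb);
        mailbox.add(message);
        return message.reply;
    }

    // Behavior: handle one message at a time, in arrival order.
    private void loop() {
        try {
            while (true) {
                Message message = mailbox.take();
                if (message.type.equals("reserve")) {
                    if (message.amountMb > freeMemoryMb) {
                        // Refuse requests that would overcommit the host.
                        message.reply.completeExceptionally(new IllegalStateException("overcommit"));
                        continue;
                    }
                    freeMemoryMb -= message.amountMb;
                }
                message.reply.complete(freeMemoryMb); // reply with remaining free memory
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because every reservation passes through the same mailbox, two concurrent requests cannot both claim the last of the memory, which is the overcommitment guarantee the resource layer is after.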

Slide 33

Slide 33 text

Mailbox Resource Actor FSM Orchestration Unit

Slide 34

Slide 34 text

Java Actor Frameworks • Akka (http://akka.io) • Quasar (https://github.com/puniverse/quasar)

Slide 35

Slide 35 text

Summary • Orchestration and Resource Management must be properly divided to satisfy CAP • To provide resource serialization guarantees, assign a queue and a process to each resource • Fail fast, fail loudly • An Actor Model based on lightweight threads may provide the scalability required to dedicate a queue and process per resource

Slide 36

Slide 36 text

Thoughts? Questions?

Slide 37

Slide 37 text

Thank you! Slides available @ http://speakerdeck.com/jburwell