Intro to Amazon SWF and simpleflow

This presentation provides an overview of Amazon Simple Workflow (aka SWF) and some details about simpleflow, the high-level Python library developed by Botify. If you're interested in simpleflow or Amazon SWF, don't hesitate to send me a message or open an issue in our GitHub tracker!

Jean-Baptiste Barth

December 02, 2015

Transcript

  1. What is Amazon SWF?

    • Amazon Simple Workflow (SWF) “makes it easy to build applications that coordinate work across distributed components” (doc)
    • the doc is really good, check it out!
    • it provides a centralized coordination service for executing a distributed workflow by organizing the work of stateless deciders and activity workers
    • SWF centralizes:
      ◦ administrative information: you declare the names of your workflow types and activity types (with default timeouts, etc.)
      ◦ the workflow history for each workflow execution (took this decision, ran this activity, etc.)
    • it also provides a web console to manage things (not very good :-/)
  2. An example: build a pirate business*

    [Flow diagram with the steps: find or steal a lot of money, build a boat, steal a boat, find a crew, find a parrot]
    * untested, don’t try this at home
    Domain: caribbean-sea
    Workflow Type: build-pirate-business
    Activity Types: raise-money, build-boat, steal-boat, find-crew, find-parrot
  3. Deciders and activity workers roles

    • a decider (captain):
      ◦ needs to know how to make decisions depending on the workflow type and the workflow execution history
      ◦ he listens on a task list (potentially shared between multiple workflow executions)
      ◦ he doesn’t need to know how to perform tasks
      ◦ he doesn’t communicate directly with activity workers
      ◦ ideally, he’s stateless; state is stored on SWF
    • each activity worker (boat maker, crew, parrot vendor, …):
      ◦ needs to know how to perform his tasks depending on the activity type and an input
      ◦ he listens on a task list (the only routing mechanism)
      ◦ he doesn’t communicate directly with the decider
      ◦ he’s stateless; input comes in, result comes out (maybe indirectly, e.g. through an S3 backend)
  4. Workflow Execution lifecycle (1): start

    REQUEST (captain, acting as workflow starter): StartWorkflowExecution
      { "domain": "caribbean-sea", "workflowId": "black-pearl-1", "workflowType": "build-pirate-business", "startToCloseTimeout": "3 months", "taskList": "captain-task-list", "input": "536$" }
    RESPONSE:
      { "runId": "22DbtQ+dcw2CZ4ECxTFdBboI5A=" }
    WORKFLOW EXECUTION HISTORY (just after):
      [ { "workflowExecutionStarted": { "workflowType": "build-pirate-business", "input": "536$", … } },
        { "decisionTaskScheduled": { "taskList": "captain-task-list", … } } ]
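The request on this slide is simplified for readability. As a rough sketch, this is closer to the shape the real SWF API expects: the workflow type is a name plus a version, the task list is an object, and timeouts are seconds passed as strings. The helper name `build_start_request`, the version `"1.0"`, and the reading of "3 months" as 90 days are assumptions made for illustration; nothing is sent to AWS.

```python
import json

SECONDS_PER_DAY = 24 * 60 * 60


def build_start_request(domain, workflow_id, type_name, type_version,
                        task_list, workflow_input, timeout_days):
    """Build a StartWorkflowExecution payload (hypothetical helper)."""
    return {
        "domain": domain,
        "workflowId": workflow_id,
        # Workflow types are identified by a name AND a version.
        "workflowType": {"name": type_name, "version": type_version},
        # Task lists are objects with a single "name" field.
        "taskList": {"name": task_list},
        # The input is an opaque string as far as SWF is concerned.
        "input": workflow_input,
        # Timeouts are durations in seconds, passed as strings.
        "executionStartToCloseTimeout": str(timeout_days * SECONDS_PER_DAY),
    }


request = build_start_request(
    domain="caribbean-sea",
    workflow_id="black-pearl-1",
    type_name="build-pirate-business",
    type_version="1.0",  # assumed: the slide does not give a version
    task_list="captain-task-list",
    workflow_input="536$",
    timeout_days=90,  # assumed: "3 months" approximated as 90 days
)
print(json.dumps(request, indent=2))
```

A client library would take these values as arguments and handle signing and transport; the sketch only shows the payload.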
  5. Workflow Execution lifecycle (2): poll for decision

    REQUEST (captain, acting as decider): PollForDecisionTask
      { "domain": "caribbean-sea", "taskList": "captain-task-list", "identity": "captain" }
    RESPONSE:
      { "previousStartedEventId": 0, "taskToken": "2bb41a56fcec", "workflowType": "build-pirate-business",
        "events": [ { "workflowExecutionStarted": … }, { "decisionTaskScheduled": … } ] }
    WORKFLOW EXECUTION HISTORY (just after):
      [ { "workflowExecutionStarted": … },
        { "decisionTaskScheduled": … },
        { "decisionTaskStarted": { "identity": "captain", … } } ]
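PollForDecisionTask is a long poll: SWF holds the request open for up to 60 seconds and answers with an empty task token when there is nothing to decide, so a decider normally runs in a loop. The sketch below shows that loop with the transport injected as plain callables; `decider_loop`, `poll`, `decide` and `respond` are hypothetical names, and the fake responses stand in for real API calls.

```python
def decider_loop(poll, decide, respond, max_polls):
    """Poll for decision tasks and answer them; returns how many were handled.

    poll, decide and respond are injected callables (hypothetical names) so
    the loop can be exercised without AWS credentials.
    """
    handled = 0
    for _ in range(max_polls):
        task = poll()
        if not task.get("taskToken"):
            # The long poll expired without work: just poll again.
            continue
        decisions = decide(task["events"])
        respond(task["taskToken"], decisions)
        handled += 1
    return handled


# Fake transport: one empty poll, then the task shown on the slide.
_responses = iter([
    {},
    {"taskToken": "2bb41a56fcec",
     "events": ["workflowExecutionStarted", "decisionTaskScheduled"]},
])
sent = []
count = decider_loop(
    poll=lambda: next(_responses),
    decide=lambda events: ["scheduleActivityTask"],
    respond=lambda token, decisions: sent.append((token, decisions)),
    max_polls=2,
)
```

An activity worker follows the same pattern with PollForActivityTask and RespondActivityTaskCompleted.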
  6. Workflow Execution lifecycle (3): take decision

    REQUEST (captain, acting as decider): RespondDecisionTaskCompleted
      { "taskToken": "2bb41a56fcec",
        "decisions": [ { "scheduleActivityTask": { "activityId": "build-the-black-pearl-1", "activityType": "build-boat", "input": "guns: 74, length: 58", "taskList": "boat-maker", "startToCloseTimeout": "60 days", … } } ] }
    RESPONSE: <empty>
    WORKFLOW EXECUTION HISTORY (just after):
      [ { "workflowExecutionStarted": … },
        { "decisionTaskScheduled": … },
        { "decisionTaskStarted": … },
        { "decisionTaskCompleted": { "identity": "captain", … } },
        { "activityTaskScheduled": { "taskList": "boat-maker", "input": … } } ]
  7. Workflow Execution lifecycle (4): start activity

    REQUEST (Mr Woodburry, acting as activity worker): PollForActivityTask
      { "domain": "caribbean-sea", "taskList": "boat-maker", "identity": "john woodburry" }
    RESPONSE:
      { "activityId": "build-the-black-pearl-1", "activityType": "build-boat", "input": "guns: 74, length: 58", "taskToken": "5ab64ec437d9" }
    WORKFLOW EXECUTION HISTORY (just after):
      [ { "workflowExecutionStarted": … },
        { "decisionTaskScheduled": … },
        { "decisionTaskStarted": … },
        { "decisionTaskCompleted": … },
        { "activityTaskScheduled": … },
        { "activityTaskStarted": … } ]
  8. Workflow Execution lifecycle (5): complete activity

    REQUEST (Mr Woodburry, acting as activity worker): RespondActivityTaskCompleted
      { "taskToken": "5ab64ec437d9", "result": "finished! parked at Kingston" }
    RESPONSE: <empty>
    WORKFLOW EXECUTION HISTORY (just after):
      [ { "workflowExecutionStarted": … },
        { "decisionTaskScheduled": … },
        { "decisionTaskStarted": … },
        { "decisionTaskCompleted": … },
        { "activityTaskScheduled": … },
        { "activityTaskStarted": … },
        { "activityTaskCompleted": … },
        { "decisionTaskScheduled": … } ]
  9. Workflow Execution lifecycle (6): complete workflow

    REQUEST (long after; captain, acting as decider): RespondDecisionTaskCompleted
      { "taskToken": "5ab6dzdbe349", "decisions": [ { "completeWorkflowExecution": { … } } ] }
    RESPONSE: <empty>
    WORKFLOW EXECUTION HISTORY (just after):
      [ { "workflowExecutionStarted": … },
        { "decisionTaskScheduled": … },
        { "decisionTaskStarted": … },
        { "decisionTaskCompleted": … },
        { "activityTaskScheduled": … },
        { "activityTaskStarted": … },
        { "activityTaskCompleted": … },
        …,
        { "decisionTaskCompleted": … },
        { "workflowExecutionCompleted": … } ]
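The six lifecycle steps can be replayed locally with a toy, in-memory stand-in for SWF. This is only a sketch to show how the history grows and how a stateless decider derives its decisions from that history alone. `FakeSWF`, `decide` and `build_boat` are hypothetical names, and the decision dictionaries are much simpler than the real SWF structures.

```python
def decide(events):
    """Stateless decider: derive the next decisions from the history alone."""
    types = [event["eventType"] for event in events]
    if "ActivityTaskCompleted" in types:
        return [{"decisionType": "CompleteWorkflowExecution"}]
    if "ActivityTaskScheduled" not in types:
        return [{
            "decisionType": "ScheduleActivityTask",
            "activityType": "build-boat",
            "activityId": "build-the-black-pearl-1",
            "taskList": "boat-maker",
            "input": "guns: 74, length: 58",
        }]
    return []  # the activity is still running: nothing to decide yet


class FakeSWF:
    """Toy in-memory stand-in for SWF; it only records history events."""

    def __init__(self):
        self.history = []

    def _add(self, event_type, **attributes):
        self.history.append(dict(eventType=event_type, **attributes))

    def start_workflow(self, workflow_input):
        self._add("WorkflowExecutionStarted", input=workflow_input)
        self._add("DecisionTaskScheduled")

    def run_decision_task(self):
        self._add("DecisionTaskStarted")
        decisions = decide(list(self.history))
        self._add("DecisionTaskCompleted")
        for decision in decisions:
            if decision["decisionType"] == "ScheduleActivityTask":
                self._add("ActivityTaskScheduled",
                          taskList=decision["taskList"],
                          input=decision["input"])
            elif decision["decisionType"] == "CompleteWorkflowExecution":
                self._add("WorkflowExecutionCompleted")

    def run_activity_task(self, worker):
        scheduled = next(event for event in reversed(self.history)
                         if event["eventType"] == "ActivityTaskScheduled")
        self._add("ActivityTaskStarted")
        result = worker(scheduled["input"])
        self._add("ActivityTaskCompleted", result=result)
        # A finished activity triggers a new decision task.
        self._add("DecisionTaskScheduled")


def build_boat(activity_input):
    """Activity worker: input comes in, result comes out."""
    return "finished! parked at Kingston"


def run_example():
    swf = FakeSWF()
    swf.start_workflow("536$")         # lifecycle (1)
    swf.run_decision_task()            # lifecycle (2) and (3)
    swf.run_activity_task(build_boat)  # lifecycle (4) and (5)
    swf.run_decision_task()            # lifecycle (6)
    return [event["eventType"] for event in swf.history]
```

Calling `run_example()` returns the event types in the order they were recorded, which can be compared with the histories on the slides.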
  10. A few remarks about this workflow

    • long polling everywhere
    • timeouts and heartbeating are optional (but some are highly recommended)
    • the “domain”, “workflow type” and “activity types” should be defined in advance (hint: simpleflow simplifies this)
    • the decider decides what happens (for instance, a failed activity doesn’t automatically fail the workflow)
    • there’s no retry mechanism (hint: provided by simpleflow)
    • input & results are totally opaque to SWF (with simpleflow: json; with (Java|Ruby)Flow: depends on Data Converters)
    • beware: it’s possible to “block” a workflow by making no decision (could be useful, but we usually avoid this)
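Since SWF has no built-in retry and a failed activity does not fail the workflow by itself, the decider has to choose what happens after a failure. A minimal sketch of one possible policy follows; it is not simpleflow's implementation. The function name, the flat event dictionaries (real failure events point back to the scheduled event rather than carrying an activity id) and the `attempt` field are all assumptions made for illustration.

```python
def decide_on_failure(events, activity_id, max_retries=2):
    """Return a retry decision, a fail decision, or None if nothing failed."""
    failures = sum(
        1 for event in events
        if event["eventType"] == "ActivityTaskFailed"
        and event.get("activityId") == activity_id
    )
    if failures == 0:
        return None
    if failures <= max_retries:
        # Schedule the same activity again; "attempt" is a toy field.
        return {
            "decisionType": "ScheduleActivityTask",
            "activityId": activity_id,
            "attempt": failures + 1,
        }
    # Too many failures: the decider explicitly fails the workflow.
    return {
        "decisionType": "FailWorkflowExecution",
        "reason": "%s failed %d times" % (activity_id, failures),
    }


failed_once = [{"eventType": "ActivityTaskFailed",
                "activityId": "build-the-black-pearl-1"}]
retry = decide_on_failure(failed_once, "build-the-black-pearl-1")
give_up = decide_on_failure(failed_once * 3, "build-the-black-pearl-1")
```

The retry count lives in the history, so the decider itself stays stateless.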
  11. SWF guarantees:

    • only one workflow execution for a given “workflow id”
    • history is consistent (?) and only progresses across “replays”
    • a decision task is given to only one decider
    • an activity task is given to only one activity worker
    • only one decision task scheduled/started at a time for a given workflow execution (so no conflicting/redundant decision)
  12. SWF advanced concepts

    • Versioning (but task lists are still the only routing mechanism)
    • Tags (for finding workflow executions based on key=value)
    • ChildWorkflow (for grouping / isolating specific workflows)
    • Continuation Workflow (for permanent/long tasks)
    • Task Priorities (escapes the default FIFO)
    • Lambda tasks (== the activity worker is AWS Lambda)
    • Markers (e.g. insert progress, retry/loop count, or consolidated history, ...)
    • Signals (e.g. pause/resume, change input/conditions, …)
  13. Python libraries (1): boto, simple-workflow

    • boto.swf:
      ◦ Layer1: maps to SWF endpoints
      ◦ Layer1.decisions: helpers that ease generating decisions
      ◦ Layer2: object-oriented access above Layer1 (never used myself)
    • simple-workflow:
      ◦ built above boto.swf.Layer1 (doesn’t use Layer1.decisions or Layer2)
      ◦ class-based abstractions for SWF objects (models, django-like querysets)
      ◦ basic actors for modelling a decider, worker, heartbeater
      ◦ object-oriented history/events
      ◦ object-oriented decision making
  14. Python libraries (2): simpleflow

    • from simpleflow README:
      ◦ Future abstraction to define dependencies between tasks
      ◦ Define asynchronous tasks from callables
      ◦ Handle workflows with Amazon SWF
      ◦ Implement replay behavior like the Amazon Flow framework
      ◦ Handle retry of tasks that failed
      ◦ Automatically register decorated tasks
      ◦ Handle the completion of a decision with more than 100 tasks
      ◦ Provides a local executor to check a workflow without Amazon SWF (simpleflow --local)
      ◦ Provides decider and activity worker process for execution with Amazon SWF
      ◦ Ships with the simpleflow command
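The combination of futures and replay can be hard to picture, so here is a toy sketch of the idea. It is not simpleflow's code. The workflow function is re-run from the start on every decision; tasks already finished return a resolved future, and reading the result of an unfinished one stops the replay. `ToyFuture`, `ToyExecutor`, the task id format and the `run` workflow are all hypothetical.

```python
class ExecutionBlocked(Exception):
    """Raised when a result is needed but the task has not finished yet."""


class ToyFuture:
    def __init__(self, finished=False, result=None):
        self.finished = finished
        self._result = result

    @property
    def result(self):
        if not self.finished:
            raise ExecutionBlocked()
        return self._result


class ToyExecutor:
    def __init__(self, completed):
        # completed maps a task id to its result, as read from the history.
        self.completed = completed
        self.to_schedule = []

    def submit(self, name, *args):
        task_id = "-".join([name] + [str(arg) for arg in args])
        if task_id in self.completed:
            return ToyFuture(finished=True, result=self.completed[task_id])
        self.to_schedule.append(task_id)
        return ToyFuture()

    def replay(self, run):
        """Re-run the workflow; report its result or what must be scheduled."""
        self.to_schedule = []
        try:
            return "completed", run(self)
        except ExecutionBlocked:
            return "blocked", list(self.to_schedule)


def run(executor):
    # find-crew depends on build-boat through the future's result.
    boat = executor.submit("build-boat", 74)
    crew = executor.submit("find-crew", boat.result)
    return crew.result


first = ToyExecutor({}).replay(run)
second = ToyExecutor({"build-boat-74": "black-pearl"}).replay(run)
third = ToyExecutor({"build-boat-74": "black-pearl",
                     "find-crew-black-pearl": 12}).replay(run)
```

Each replay gets a little further as more results appear in the history, which is why the workflow code has to be deterministic.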
  15. Python libraries (3): simpleflow*

    • I’d add:
      ◦ Smart task naming for workflows with variable geometry (idempotent tasks)
      ◦ Protections against various limits (like max open activities == 1000); needs some work though
      ◦ (soon!) smart routing via task list override
      ◦ (soon!) full support for child workflows (something exists already, not sure if it works...)
      ◦ (soon!) task chaining and grouping
      ◦ (soon!) flexible task definition (callable OR class OR class instance)
      ◦ (soon!) smart data converters like Amazon Flow (json, S3, …)
      ◦ (soon!) soft limits for the number of running activity tasks (< # workers)
      ◦ and maybe: EC2 helpers? checkpoints? metrics? execution history comparisons?
    • the goal is to make simpleflow complete enough that you only have to write your business logic on top of it
    • other goals = simplify usage, better tests
  16. Simpleflow: decider part

    • it knows how to “replay” a workflow, described by a subclass of Workflow that implements a “run” method; this is the job of the “Executor”
    • it reads the history and maps all activities to Futures (an object holding state, result, errors)
      ◦ for idempotent tasks, we hash the arguments to define the ActivityId (foo.bar-8afc74207d711cdc)
      ◦ for non-idempotent tasks, we can’t; we use a registry (unfortunate name!!) to remember the current index for a given task
        { "foo.bar": 1, "foo.baz": 52 }
        => the next “foo.baz” task to be scheduled will get the name “foo.baz-53”
        => a stable order across replays is CRITICAL in this case (ok?)
    • during the replay, Futures will raise an “ExecutionBlocked” exception so the replay stops and the calling process schedules the task if possible
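The two naming strategies can be sketched in a few lines. The hash function, the serialization of arguments and the class name `TaskCounter` are assumptions chosen for illustration; simpleflow's actual scheme may differ.

```python
import hashlib
import json


def idempotent_activity_id(task_name, args=(), kwargs=None):
    """Derive a stable id from the task name and its arguments."""
    payload = json.dumps(
        {"args": list(args), "kwargs": kwargs or {}},
        sort_keys=True,  # same arguments always serialize the same way
    )
    digest = hashlib.sha1(payload.encode("utf-8")).hexdigest()[:16]
    return "%s-%s" % (task_name, digest)


class TaskCounter:
    """Per-task index for non-idempotent tasks (the "registry" of the slide)."""

    def __init__(self, initial=None):
        self._indexes = dict(initial or {})

    def next_activity_id(self, task_name):
        index = self._indexes.get(task_name, 0) + 1
        self._indexes[task_name] = index
        return "%s-%d" % (task_name, index)


same_a = idempotent_activity_id("foo.bar", kwargs={"guns": 74, "length": 58})
same_b = idempotent_activity_id("foo.bar", kwargs={"length": 58, "guns": 74})

counter = TaskCounter({"foo.bar": 1, "foo.baz": 52})
next_id = counter.next_activity_id("foo.baz")
```

With hashed ids the order of submission does not matter; with counters, the id depends on how many tasks of the same name were submitted earlier in the replay, hence the need for a stable order.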
  17. Simpleflow: worker part

    • has to know how to map an ActivityTask “input” to a python function + args
    • => “dispatchers” for finding a python callable
      ◦ ModuleDispatcher: foo.bar.baz will “from foo.bar import baz” then run “baz()”
      ◦ RegistryDispatcher: allows having different registries for different workflows (?); originally split by task list, but changed (now “None”); not sure if still interesting…
      ◦ => WARNING: simpleflow “executor” also has a “TaskRegistry”, but it has nothing to do with that (bad naming…)
    • then the callable is called with args/kwargs encoded in the json input
    • it knows PollForActivityTask, RespondActivityTask(Completed|Failed)
    • it also knows how to heartbeat (RecordActivityTaskHeartbeat)
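A module-based dispatcher fits in a few lines of standard-library Python. The sketch below illustrates the idea and is not simpleflow's code: `dispatch` and `run_activity` are hypothetical names, and the `"args"` / `"kwargs"` keys in the JSON input are an assumed layout.

```python
import importlib
import json


def dispatch(name):
    """Resolve a dotted name such as "foo.bar.baz" to the callable baz."""
    module_name, _, attribute = name.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


def run_activity(name, json_input):
    """Decode the JSON input, call the task, and encode its result."""
    payload = json.loads(json_input)
    func = dispatch(name)
    result = func(*payload.get("args", []), **payload.get("kwargs", {}))
    return json.dumps(result)


# Example with a standard-library function standing in for a real task.
output = run_activity("math.pow", '{"args": [2, 10]}')
```

A real worker would wrap `run_activity` in a polling loop and report the outcome to SWF, sending heartbeats while a long task is running.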
  18. SWF Limits & shortcomings

    • various limits, not always well defined or easy to follow
    • no object removal (annoying!)
    • no easy progress mechanism
    • some error cases and edge cases are not documented
      ◦ e.g. validations, what happens if you submit 10 decisions and the last one is invalid, etc.
    • python / boto / simpleflow
      ◦ boto doesn’t check types (but moto.swf does)
      ◦ boto implementation is very basic and not a 1st-class client (like Java Flow or Ruby Flow)
      ◦ too many layers: boto.swf.Layer1 => (Layer2|simple-workflow) => simpleflow => <your code>
      ◦ namespace/name collisions between simple-workflow & simpleflow
      ◦ (too) many concepts to grok
      ◦ process replacement / cancellation not well supported
      ◦ low test coverage (=> hard to refactor / introduce new things; dead code) => we need to improve that!