
Choosing a Hadoop Workflow Engine

Once a Hadoop deployment moves past its very early stages, the flow of data through the system becomes complex. Data arrives in the cluster from many different sources and is processed through many analytics and data science pipelines. In the beginning these can be managed with cron and shell scripts, but that approach is not robust and does not scale to larger teams. To handle complex data pipelines at scale, a workflow engine is necessary. A workflow engine lets users define and configure their data pipelines and then handles scheduling them; it also provides mechanisms for monitoring pipeline progress and recovering from failure. Once you reach the point where a workflow engine is needed, how do you choose one from the many available?

This talk will cover the major workflow engines for Hadoop: Oozie, Airflow, Luigi, and Azkaban. It will compare their key features and major differences, and highlight use cases where each engine does particularly well. Attendees will leave with a better understanding of the landscape of workflow engines in the Hadoop ecosystem and guidance on how to select one for their needs.


Andrew Johnson

April 06, 2017

Transcript

  1. Choosing a Hadoop Workflow Engine • Andrew Johnson, Stripe • @theajsquared

  2. Background • Software engineer on Data Platform at Stripe • Worked with Hadoop for 5 years • Run several workflow engines in production

  3. What is a workflow engine?

  4. Tool for managing data pipelines

  5. What is a workflow engine? • Schedules jobs in data pipelines • Ensures jobs are ordered by dependencies • Tracks individual jobs and overall pipeline state • Retries failed and downstream jobs

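     In miniature, this is a hedged sketch of the ordering-and-retry part (not any engine's actual API; the task names and the three-attempt retry policy are invented for illustration):

     # Minimal sketch of dependency ordering plus retries (Python 3.9+ for graphlib).
     from graphlib import TopologicalSorter

     def ingest():    print("ingest data")
     def transform(): print("apply a transformation")
     def load():      print("load output into a database")

     tasks = {"ingest": ingest, "transform": transform, "load": load}
     deps = {"transform": {"ingest"}, "load": {"transform"}}  # task -> its upstream tasks

     for name in TopologicalSorter(deps).static_order():   # run in dependency order
         for attempt in range(3):                          # crude retry policy
             try:
                 tasks[name]()
                 break
             except Exception as err:
                 print("retrying %s after failure: %s" % (name, err))
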
  6. Why use a workflow engine?

  7. Starting out - cron • Common to start out with cron • midnight - ingest data • 1am - apply a transformation • 2am - load output into a database • Very brittle!
  8. Advanced cron and beyond • Add data dependencies • No other features - tracking, reruns, etc.

  9. Workflow engine landscape

  10. Workflow engine landscape • Lots of available workflow engines • If it runs arbitrary commands, it probably works • Common choices in the Hadoop ecosystem: Oozie, Azkaban, Luigi, Airflow

  11. Oozie

  12. Architecture (diagram): Oozie Client, Oozie Server, Database, Hadoop Cluster

  13. Azkaban

  14. Architecture (diagram): Azkaban Client, Web Server, Executor Server, Database, Hadoop Cluster

  15. Luigi

  16. Architecture (diagram): Client, Scheduler, Luigi Workers, HDFS/Hadoop; workers check task state and submit jobs

  17. Airflow

  18. Architecture (diagram): Webserver, Scheduler, Database, Airflow Workers, Hadoop

  19. Choosing a workflow engine

  20. Don’t Use One • Additional infrastructure = more operational overhead • Think carefully about future needs
  21. Considerations • How are workflows defined? • What job types are supported? How extensible is this? • How do you track the state of a workflow? • How are failures handled? • How stable is the engine itself?

  22. How are workflows defined?

  23. Configuration-based • Markup language for definition • Oozie - XML • Azkaban - Properties file

  24. Examples • Oozie example is too long! • Azkaban:

     # foo.job
     type=command
     command=echo "Hello World"

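     (Not shown on the slide: in Azkaban's job file format a second job is chained after this one by adding a dependencies=foo line to its own .job file, which is how individual job files become a flow.)
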
  25. Code-based • Scripting language for definition • Luigi - Python • Airflow - Python

  26. Luigi Example

     import time
     import luigi

     class Foo(luigi.WrapperTask):
         task_namespace = 'examples'

         def run(self):
             print("Running Foo")

         def requires(self):
             for i in range(10):
                 yield Bar(i)

     class Bar(luigi.Task):
         task_namespace = 'examples'
         num = luigi.IntParameter()

         def run(self):
             time.sleep(1)
             self.output().open('w').close()

         def output(self):
             return luigi.LocalTarget('/tmp/bar/%d' % self.num)

  27. Airflow Example

     # default_args is defined elsewhere (not shown on the slide)
     dag = DAG('tutorial', default_args=default_args)

     t1 = BashOperator(task_id='print_date', bash_command='date', dag=dag)
     t2 = BashOperator(task_id='sleep', bash_command='sleep 5', retries=3, dag=dag)

     t1.set_upstream(t2)

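     (Likewise not on the slide: once this file is on Airflow's DAGs path the scheduler picks it up, and a single task can be exercised from the CLI with something like airflow test tutorial print_date 2017-04-06, the date being arbitrary.)
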
  28. Configuration vs Code • Configuration • Autogenerate definitions • Statically verify workflows • Code • More compact • Supports dynamic pipelines

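     To make the dynamic pipelines point concrete, here is a rough sketch in the style of the Airflow example above; the DAG id, owner, and table list are invented for illustration. The tasks are generated in a loop from data rather than written out one by one, which a static XML or properties file cannot easily express.

     from datetime import datetime
     from airflow import DAG
     from airflow.operators.bash_operator import BashOperator  # Airflow 1.x-era import path

     default_args = {'owner': 'data-platform', 'start_date': datetime(2017, 4, 6)}
     dag = DAG('nightly_exports', default_args=default_args)

     done = BashOperator(task_id='all_exports_done', bash_command='echo done', dag=dag)

     # One export task per table, generated from data at parse time.
     for table in ['charges', 'refunds', 'transfers']:
         export = BashOperator(task_id='export_%s' % table,
                               bash_command='echo exporting %s' % table,
                               dag=dag)
         done.set_upstream(export)
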
  29. What kind of jobs are supported?

  30. Hadoop ecosystem support • Strong point for Oozie • Azkaban also good • Luigi, Airflow have less out of the box
  31. Custom job support • Code-based engines shine here • Oozie, Azkaban - job type plugins

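     As a sketch of what custom job support looks like in a code-based engine: wrapping an arbitrary command is just a small Python class. Everything here (class name, command, marker path) is made up for illustration and follows the same Luigi API as the earlier example.

     import subprocess
     import luigi

     class NightlyBackup(luigi.Task):
         """Hypothetical custom job: shell out to an arbitrary command."""
         date = luigi.DateParameter()

         def output(self):
             # The marker file doubles as the task's completion record.
             return luigi.LocalTarget('/tmp/backup_%s.done' % self.date)

         def run(self):
             subprocess.check_call(['echo', 'backing up %s' % self.date])
             with self.output().open('w') as marker:
                 marker.write('ok')
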
  32. How do you track the state of a workflow?

  33. User Interface • Visualize workflow DAG and state • All roughly equivalent • Oozie is dated • Airflow has strong design
  34. API Integrations • Expose same info as UI • Alternative UI • Oozie has several • Notification bots (e.g. Slack)

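     A hedged sketch of the notification bot idea: poll whatever HTTP status endpoint your engine exposes and forward failures to a Slack incoming webhook. The endpoint URL, webhook URL, and JSON shape below are placeholders, not any particular engine's real API.

     import requests  # third-party HTTP client

     STATUS_URL = 'http://workflow-engine.internal/api/jobs?status=FAILED'  # placeholder
     SLACK_WEBHOOK = 'https://hooks.slack.com/services/XXX/YYY/ZZZ'         # placeholder

     def report_failures():
         failed = requests.get(STATUS_URL).json()  # assume a JSON list of failed jobs
         for job in failed:
             requests.post(SLACK_WEBHOOK,
                           json={'text': 'Workflow %s failed' % job.get('name', 'unknown')})

     if __name__ == '__main__':
         report_failures()
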
  35. How are failures handled?

  36. Identifying Failures • See failed tasks in the UI • Automatic notifications • Not provided by Oozie • Custom notifications through API
  37. Rerunning jobs • Configurable retries • Oozie and Azkaban: CLI/UI action • Luigi and Airflow: task idempotency • Task state in filesystem/database • Clear state to run again

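     A small sketch of the idempotency point for the code-based engines: a Luigi-style task whose output target already exists is considered complete and is skipped, so rerunning it means clearing that recorded state and scheduling the pipeline again. The task and path below are illustrative.

     import os
     import luigi

     class LoadToDatabase(luigi.Task):
         """Illustrative task: completion is recorded by the output target."""
         def output(self):
             return luigi.LocalTarget('/tmp/load_to_db.done')

         def run(self):
             # ... do the actual load here, then record success ...
             with self.output().open('w') as marker:
                 marker.write('done')

     # To force a rerun, clear the recorded state and schedule again.
     if os.path.exists('/tmp/load_to_db.done'):
         os.remove('/tmp/load_to_db.done')
     luigi.build([LoadToDatabase()], local_scheduler=True)
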
  38. How stable is the engine itself?

  39. Stability • Oozie is pretty good • Airflow is newer, some rough edges • Previous experience is valuable
  40. Community Support • Oozie - Apache, Hadoop distros • Azkaban/Luigi - Github/mailing list • Airflow - Apache incubating

  41. Wrapping up

  42. Questions?