
Choosing a Hadoop Workflow Engine

Once a Hadoop deployment moves beyond its earliest stages, the flow of data through the system becomes complex. Data arrives in the cluster from many sources and is processed through numerous analytics and data science pipelines. In the beginning these can be managed with cron and shell scripts, but that approach is not robust and does not scale to larger teams. Handling complex data pipelines at scale requires a workflow engine: a system that lets users define and configure their data pipelines and then handles scheduling them. Workflow engines also provide mechanisms for monitoring pipeline progress and for recovering from failures. Once you reach the point where a workflow engine is needed, how do you choose one from the many available?

This talk will cover the major workflow engines for Hadoop: Oozie, Airflow, Luigi, and Azkaban. It will cover the key features of each workflow engine and the major differences between them. Use cases where each engine does particularly well will be highlighted. Attendees will leave this talk with a better understanding of the landscape of workflow engines in the Hadoop ecosystem and guidance on how to select a workflow engine for their needs.

Andrew Johnson

April 06, 2017

Transcript

1. Background
• Software engineer on Data Platform at Stripe
• Worked with Hadoop for 5 years
• Run several workflow engines in production

2. What is a workflow engine?
• Schedules jobs in data pipelines
• Ensures jobs are ordered by dependencies
• Tracks individual jobs and overall pipeline state
• Retries failed and downstream jobs

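The dependency-ordering behavior described above boils down to a topological sort of the job graph. A minimal sketch, using Python's standard-library `graphlib` and a hypothetical three-job pipeline (the job names are illustrative, not from any particular engine):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical pipeline: ingest -> transform -> load.
# Each key maps a job to the set of jobs it depends on.
deps = {
    "transform": {"ingest"},
    "load": {"transform"},
}

# static_order() yields jobs so that every job appears
# after all of its dependencies.
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['ingest', 'transform', 'load']
```

A real engine layers scheduling, state tracking, and retries on top of exactly this ordering.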
3. Starting out - cron
• Common to start out with cron
• midnight - ingest data
• 1am - apply a transformation
• 2am - load output into a database
• Very brittle!

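The schedule above might look like this as a crontab (script paths are hypothetical). Note that each entry fires at its fixed time whether or not the previous step finished or succeeded, which is what makes this arrangement brittle:

```cron
0 0 * * * /opt/pipelines/ingest_data.sh
0 1 * * * /opt/pipelines/apply_transformation.sh
0 2 * * * /opt/pipelines/load_into_database.sh
```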
4. Advanced cron and beyond
• Add data dependencies
• No other features - tracking, reruns, etc.

5. Workflow engine landscape
• Lots of available workflow engines
• If it runs arbitrary commands, it probably works
• Common choices in the Hadoop ecosystem
• Oozie
• Azkaban
• Luigi
• Airflow

6. Considerations
• How are workflows defined?
• What job types are supported? How extensible is this?
• How do you track the state of a workflow?
• How are failures handled?
• How stable is the engine itself?

7. Examples
• Oozie example is too long!
• Azkaban:

```properties
# foo.job
type=command
command=echo "Hello World"
```

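Azkaban builds flows by chaining job files like the one above through a `dependencies` key. A hypothetical second job that runs only after `foo` succeeds might look like:

```properties
# bar.job - runs after foo completes successfully
type=command
dependencies=foo
command=echo "Goodbye World"
```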
8. Luigi Example

```python
import time

import luigi


class Foo(luigi.WrapperTask):
    task_namespace = 'examples'

    def run(self):
        print("Running Foo")

    def requires(self):
        for i in range(10):
            yield Bar(i)


class Bar(luigi.Task):
    task_namespace = 'examples'
    num = luigi.IntParameter()

    def run(self):
        time.sleep(1)
        # Touching the output target marks this task as complete.
        self.output().open('w').close()

    def output(self):
        return luigi.LocalTarget('/tmp/bar/%d' % self.num)
```

9. Airflow Example

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Minimal default_args; the slide references this without defining it.
default_args = {'owner': 'airflow', 'start_date': datetime(2017, 4, 1)}

dag = DAG('tutorial', default_args=default_args)

t1 = BashOperator(task_id='print_date', bash_command='date', dag=dag)
t2 = BashOperator(task_id='sleep', bash_command='sleep 5', retries=3, dag=dag)

t1.set_upstream(t2)
```

10. Configuration vs Code
• Configuration
• Autogenerate definitions
• Statically verify workflows
• Code
• More compact
• Supports dynamic pipelines

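The "dynamic pipelines" point is easiest to see in code: a loop can emit one task per input partition, which a static configuration file cannot express. This sketch uses plain dicts rather than any particular engine's API, and the task names and dates are illustrative:

```python
# Generate one task definition per daily partition -- the kind of
# loop a code-based engine (Luigi, Airflow) allows and a static
# configuration file does not.
dates = ["2017-04-01", "2017-04-02", "2017-04-03"]

tasks = [
    {
        "task_id": "transform_%s" % d,
        "command": "transform --date %s" % d,
        "depends_on": "ingest_%s" % d,
    }
    for d in dates
]

print(len(tasks))           # one task per partition
print(tasks[0]["task_id"])  # transform_2017-04-01
```

Adding a partition means appending to `dates`; a configuration-driven engine would instead need its definitions regenerated by an external script.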
11. Hadoop ecosystem support
• Strong point for Oozie
• Azkaban also good
• Luigi, Airflow have less out of the box

12. User Interface
• Visualize workflow DAG and state
• All roughly equivalent
• Oozie is dated
• Airflow has strong design

13. API Integrations
• Expose same info as UI
• Alternative UI
• Oozie has several
• Notification bots (e.g. Slack)

14. Identifying Failures
• See failed tasks in the UI
• Automatic notifications
• Not provided by Oozie
• Custom notifications through API

15. Rerunning jobs
• Configurable retries
• Oozie and Azkaban: CLI/UI action
• Luigi and Airflow: task idempotency
• Task state in filesystem/database
• Clear state to run again

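The Luigi/Airflow rerun model above can be sketched in a few lines: a task is "complete" when its output target exists, so rerunning means clearing that state and running again. This standalone sketch mimics the pattern without either engine; `run_if_needed` is an illustrative helper, not a real API:

```python
import os
import tempfile


def run_if_needed(target_path, work):
    """Run `work` only if the output target does not already exist
    (task state lives in the filesystem, as in Luigi)."""
    if os.path.exists(target_path):
        return "skipped"
    work()
    # Writing the target marks the task complete.
    with open(target_path, "w") as f:
        f.write("done")
    return "ran"


target = os.path.join(tempfile.mkdtemp(), "bar_1")
first = run_if_needed(target, lambda: None)   # no target yet -> runs
second = run_if_needed(target, lambda: None)  # target exists -> skipped
os.remove(target)                             # "clear state to run again"
third = run_if_needed(target, lambda: None)   # runs again
print(first, second, third)  # ran skipped ran
```

This is why idempotent tasks matter in these engines: a rerun must be safe to repeat from scratch.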
16. Stability
• Oozie is pretty good
• Airflow is newer, some rough edges
• Previous experience is valuable

17. Community Support
• Oozie - Apache, Hadoop distros
• Azkaban/Luigi - GitHub/mailing list
• Airflow - Apache incubating