Slide 1

Choosing a Hadoop Workflow Engine
Andrew Johnson, Stripe, @theajsquared

Slide 2

Background
• Software engineer on Data Platform at Stripe
• Worked with Hadoop for 5 years
• Run several workflow engines in production

Slide 3

What is a workflow engine?

Slide 4

Tool for managing data pipelines

Slide 5

What is a workflow engine?
• Schedules jobs in data pipelines
• Ensures jobs run in dependency order
• Tracks individual jobs and overall pipeline state
• Retries failed jobs and their downstream dependents

Slide 6

Why use a workflow engine?

Slide 7

Starting out - cron
• Common to start out with cron
• midnight - ingest data
• 1am - apply a transformation
• 2am - load output into a database
• Very brittle!

Slide 8

Advanced cron and beyond
• Add data dependencies (e.g. check that input exists before running)
• Still no other features - tracking, reruns, etc.

Slide 9

Workflow engine landscape

Slide 10

Workflow engine landscape
• Lots of available workflow engines
• If it runs arbitrary commands, it probably works
• Common choices in the Hadoop ecosystem:
  • Oozie
  • Azkaban
  • Luigi
  • Airflow

Slide 11

Oozie

Slide 12

Architecture (diagram): Oozie Client → Oozie Server → Hadoop Cluster, with a Database backing the Oozie Server

Slide 13

Azkaban

Slide 14

Architecture (diagram): Azkaban Client → Web Server → Executor Server → Hadoop Cluster, with a shared Database

Slide 15

Luigi

Slide 16

Architecture (diagram): Client → Scheduler → Luigi Workers; workers check task state in HDFS and submit jobs to Hadoop

Slide 17

Airflow

Slide 18

Architecture (diagram): Webserver and Scheduler share a Database; Airflow Workers submit jobs to Hadoop

Slide 19

Choosing a workflow engine

Slide 20

Don’t Use One
• Additional infrastructure = more operational overhead
• Think carefully about future needs

Slide 21

Considerations
• How are workflows defined?
• What job types are supported? How extensible is this?
• How do you track the state of a workflow?
• How are failures handled?
• How stable is the engine itself?

Slide 22

How are workflows defined?

Slide 23

Configuration-based
• Workflows are defined in a markup language
• Oozie - XML
• Azkaban - properties files

Slide 24

Examples
• Oozie example is too long!
• Azkaban:

  # foo.job
  type=command
  command=echo "Hello World"
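A second Azkaban job can then declare that it runs after `foo` via the `dependencies` property; that chain of properties files is the flow definition (file name and command are illustrative):

```
# bar.job
type=command
dependencies=foo
command=echo "runs after foo"
```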

Slide 25

Code-based
• Workflows are defined in a general-purpose scripting language
• Luigi - Python
• Airflow - Python

Slide 26

Luigi Example

  class Foo(luigi.WrapperTask):
      task_namespace = 'examples'

      def run(self):
          print("Running Foo")

      def requires(self):
          for i in range(10):
              yield Bar(i)

  class Bar(luigi.Task):
      task_namespace = 'examples'
      num = luigi.IntParameter()

      def run(self):
          time.sleep(1)
          self.output().open('w').close()

      def output(self):
          return luigi.LocalTarget('/tmp/bar/%d' % self.num)

Slide 27

Airflow Example

  dag = DAG('tutorial', default_args=default_args)

  t1 = BashOperator(task_id='print_date',
                    bash_command='date',
                    dag=dag)

  t2 = BashOperator(task_id='sleep',
                    bash_command='sleep 5',
                    retries=3,
                    dag=dag)

  t2.set_upstream(t1)

Slide 28

Configuration vs Code
• Configuration
  • Autogenerate definitions
  • Statically verify workflows
• Code
  • More compact
  • Supports dynamic pipelines

Slide 29

What kind of jobs are supported?

Slide 30

Hadoop ecosystem support
• A strong point for Oozie
• Azkaban is also good
• Luigi and Airflow offer less out of the box

Slide 31

Custom job support
• Code-based engines shine here
• Oozie and Azkaban - job type plugins

Slide 32

How do you track the state of a workflow?

Slide 33

User Interface
• Visualize the workflow DAG and its state
• All four are roughly equivalent
• Oozie's is dated
• Airflow's has a strong design

Slide 34

API Integrations
• Expose the same info as the UI
• Enable alternative UIs
  • Oozie has several
• Notification bots (e.g. Slack)

Slide 35

How are failures handled?

Slide 36

Identifying Failures
• See failed tasks in the UI
• Automatic notifications
  • Not provided by Oozie
• Custom notifications through the API

Slide 37

Rerunning jobs
• Configurable retries
• Oozie and Azkaban: rerun via a CLI/UI action
• Luigi and Airflow: rely on task idempotency
  • Task state lives in the filesystem/database
  • Clear the state to run a task again

Slide 38

How stable is the engine itself?

Slide 39

Stability
• Oozie is pretty good
• Airflow is newer, with some rough edges
• Previous operational experience is valuable

Slide 40

Community Support
• Oozie - Apache, included in Hadoop distros
• Azkaban and Luigi - GitHub/mailing list
• Airflow - Apache (incubating)

Slide 41

Wrapping up

Slide 42

Questions?