
Choosing a Hadoop Workflow Engine

Once a Hadoop deployment moves past its very early stages, the flow of data through the system becomes complex. Data arrives in the cluster from many different sources and is processed through many analytics and data science pipelines. In the beginning these can be managed with cron and shell scripts, but that approach is not robust and does not scale to larger teams. To handle complex data pipelines at scale, a workflow engine is necessary. A workflow engine lets users define and configure their data pipelines and then handles scheduling them; it also provides mechanisms for monitoring pipeline progress and recovering from failure. Once you reach the point where a workflow engine is needed, how do you choose one from the many available?

This talk will cover the major workflow engines for Hadoop: Oozie, Airflow, Luigi, and Azkaban. It will compare their key features and major differences, and highlight use cases where each engine does particularly well. Attendees will leave with a better understanding of the landscape of workflow engines in the Hadoop ecosystem and guidance on how to select one for their needs.


Andrew Johnson

April 06, 2017

Transcript

  1. Choosing a Hadoop Workflow Engine • Andrew Johnson, Stripe • @theajsquared

  2. Background • Software engineer on Data Platform at Stripe • Worked with Hadoop for 5 years • Run several workflow engines in production

  3. What is a workflow engine?

  4. Tool for managing data pipelines

  5. What is a workflow engine? • Schedules jobs in data pipelines • Ensures jobs are ordered by dependencies • Tracks individual jobs and overall pipeline state • Retries failed and downstream jobs

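     In miniature, this is a hedged sketch of the ordering-and-retry part (not any engine's actual API; the task names and the three-attempt retry policy are invented for illustration):

     # Minimal sketch of dependency ordering plus retries (Python 3.9+ for graphlib).
     from graphlib import TopologicalSorter

     def ingest():    print("ingest data")
     def transform(): print("apply a transformation")
     def load():      print("load output into a database")

     tasks = {"ingest": ingest, "transform": transform, "load": load}
     deps = {"transform": {"ingest"}, "load": {"transform"}}  # task -> its upstream tasks

     for name in TopologicalSorter(deps).static_order():   # run in dependency order
         for attempt in range(3):                          # crude retry policy
             try:
                 tasks[name]()
                 break
             except Exception as err:
                 print("retrying %s after failure: %s" % (name, err))
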
  6. Why use a workflow engine?

  7. Starting out - cron • Common to start out with cron • midnight - ingest data • 1am - apply a transformation • 2am - load output into a database • Very brittle!
  8. Advanced cron and beyond • Add data dependencies • No other features - tracking, reruns, etc.

  9. Workflow engine landscape

  10. Workflow engine landscape • Lots of available workflow engines • If it runs arbitrary commands, it probably works • Common choices in the Hadoop ecosystem: Oozie, Azkaban, Luigi, Airflow

  11. Oozie

  12. Architecture (diagram): Oozie Client, Oozie Server, Database, Hadoop Cluster

  13. Azkaban

  14. Architecture (diagram): Azkaban Client, Web Server, Executor Server, Database, Hadoop Cluster

  15. Luigi

  16. Architecture (diagram): Client, Scheduler, Luigi Workers, HDFS/Hadoop; workers check task state and submit jobs

  17. Airflow

  18. Architecture (diagram): Webserver, Scheduler, Database, Airflow Workers, Hadoop

  19. Choosing a workflow engine

  20. Don’t Use One • Additional infrastructure = more operational overhead • Think carefully about future needs
  21. Considerations • How are workflows defined? • What job types are supported? How extensible is this? • How do you track the state of a workflow? • How are failures handled? • How stable is the engine itself?

  22. How are workflows defined?

  23. Configuration-based • Markup language for definition • Oozie - XML • Azkaban - Properties file

  24. Examples • Oozie example is too long! • Azkaban:

     # foo.job
     type=command
     command=echo "Hello World"

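     (Not shown on the slide: in Azkaban's job file format a second job is chained after this one by adding a dependencies=foo line to its own .job file, which is how individual job files become a flow.)
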
  25. Code-based • Scripting language for definition • Luigi - Python • Airflow - Python

  26. Luigi Example

     import time
     import luigi

     class Foo(luigi.WrapperTask):
         task_namespace = 'examples'

         def run(self):
             print("Running Foo")

         def requires(self):
             for i in range(10):
                 yield Bar(i)

     class Bar(luigi.Task):
         task_namespace = 'examples'
         num = luigi.IntParameter()

         def run(self):
             time.sleep(1)
             self.output().open('w').close()

         def output(self):
             return luigi.LocalTarget('/tmp/bar/%d' % self.num)

  27. Airflow Example

     # default_args is defined elsewhere (not shown on the slide)
     dag = DAG('tutorial', default_args=default_args)

     t1 = BashOperator(task_id='print_date', bash_command='date', dag=dag)
     t2 = BashOperator(task_id='sleep', bash_command='sleep 5', retries=3, dag=dag)

     t1.set_upstream(t2)

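     (Likewise not on the slide: once this file is on Airflow's DAGs path the scheduler picks it up, and a single task can be exercised from the CLI with something like airflow test tutorial print_date 2017-04-06, the date being arbitrary.)
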
  28. Configuration vs Code • Configuration • Autogenerate definitions • Statically verify workflows • Code • More compact • Supports dynamic pipelines

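     To make the dynamic pipelines point concrete, here is a rough sketch in the style of the Airflow example above; the DAG id, owner, and table list are invented for illustration. The tasks are generated in a loop from data rather than written out one by one, which a static XML or properties file cannot easily express.

     from datetime import datetime
     from airflow import DAG
     from airflow.operators.bash_operator import BashOperator  # Airflow 1.x-era import path

     default_args = {'owner': 'data-platform', 'start_date': datetime(2017, 4, 6)}
     dag = DAG('nightly_exports', default_args=default_args)

     done = BashOperator(task_id='all_exports_done', bash_command='echo done', dag=dag)

     # One export task per table, generated from data at parse time.
     for table in ['charges', 'refunds', 'transfers']:
         export = BashOperator(task_id='export_%s' % table,
                               bash_command='echo exporting %s' % table,
                               dag=dag)
         done.set_upstream(export)
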
  29. What kind of jobs are supported?

  30. Hadoop ecosystem support • Strong point for Oozie • Azkaban also good • Luigi, Airflow have less out of the box
  31. Custom job support • Code-based engines shine here • Oozie, Azkaban - job type plugins

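     As a sketch of what custom job support looks like in a code-based engine: wrapping an arbitrary command is just a small Python class. Everything here (class name, command, marker path) is made up for illustration and follows the same Luigi API as the earlier example.

     import subprocess
     import luigi

     class NightlyBackup(luigi.Task):
         """Hypothetical custom job: shell out to an arbitrary command."""
         date = luigi.DateParameter()

         def output(self):
             # The marker file doubles as the task's completion record.
             return luigi.LocalTarget('/tmp/backup_%s.done' % self.date)

         def run(self):
             subprocess.check_call(['echo', 'backing up %s' % self.date])
             with self.output().open('w') as marker:
                 marker.write('ok')
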
  32. How do you track the state of a workflow?

  33. User Interface • Visualize workflow DAG and state • All roughly equivalent • Oozie is dated • Airflow has strong design
  34. API Integrations • Expose same info as UI • Alternative UI • Oozie has several • Notification bots (e.g. Slack)

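     A hedged sketch of the notification bot idea: poll whatever HTTP status endpoint your engine exposes and forward failures to a Slack incoming webhook. The endpoint URL, webhook URL, and JSON shape below are placeholders, not any particular engine's real API.

     import requests  # third-party HTTP client

     STATUS_URL = 'http://workflow-engine.internal/api/jobs?status=FAILED'  # placeholder
     SLACK_WEBHOOK = 'https://hooks.slack.com/services/XXX/YYY/ZZZ'         # placeholder

     def report_failures():
         failed = requests.get(STATUS_URL).json()  # assume a JSON list of failed jobs
         for job in failed:
             requests.post(SLACK_WEBHOOK,
                           json={'text': 'Workflow %s failed' % job.get('name', 'unknown')})

     if __name__ == '__main__':
         report_failures()
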
  35. How are failures handled?

  36. Identifying Failures • See failed tasks in the UI • Automatic notifications • Not provided by Oozie • Custom notifications through API
  37. Rerunning jobs • Configurable retries • Oozie and Azkaban: CLI/UI action • Luigi and Airflow: task idempotency • Task state in filesystem/database • Clear state to run again

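     A small sketch of the idempotency point for the code-based engines: a Luigi-style task whose output target already exists is considered complete and is skipped, so rerunning it means clearing that recorded state and scheduling the pipeline again. The task and path below are illustrative.

     import os
     import luigi

     class LoadToDatabase(luigi.Task):
         """Illustrative task: completion is recorded by the output target."""
         def output(self):
             return luigi.LocalTarget('/tmp/load_to_db.done')

         def run(self):
             # ... do the actual load here, then record success ...
             with self.output().open('w') as marker:
                 marker.write('done')

     # To force a rerun, clear the recorded state and schedule again.
     if os.path.exists('/tmp/load_to_db.done'):
         os.remove('/tmp/load_to_db.done')
     luigi.build([LoadToDatabase()], local_scheduler=True)
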
  38. How stable is the engine itself?

  39. Stability • Oozie is pretty good • Airflow is newer, some rough edges • Previous experience is valuable
  40. Community Support • Oozie - Apache, Hadoop distros • Azkaban/Luigi - Github/mailing list • Airflow - Apache incubating

  41. Wrapping up

  42. Questions?