Slide 1

Slide 1 text

What Is Apache Airflow?
● A workflow management platform
● Uses Python-based workflows
● Schedules by time or event
● Open source (Apache 2.0 license)
● Written in Python
● Monitor workflows in a web UI
● Has a wide range of integration options
● Originally developed at Airbnb

Slide 2

Slide 2 text

What Is Apache Airflow?
● Uses SQLite as the default back-end DB but can use
– MySQL, Postgres, JDBC, etc.
● Install extra packages using the pip command
– A wide variety is available, including
– Many databases and cloud services
– The Hadoop ecosystem
– Security, web services, queues
– Many more
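For example, moving the metadata database off the default SQLite onto Postgres is a pip extra plus one configuration change. A minimal sketch (the connection string is a placeholder; on recent Airflow 2.x releases the setting lives in a [database] section instead of [core]):

```
# Shell: install the Postgres extra first
#   pip install "apache-airflow[postgres]"

# airflow.cfg — point the metadata DB at Postgres instead of SQLite
[core]
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow
```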

Slide 3

Slide 3 text

Airflow Pipelines
● These are Python-based workflows
● They are directed acyclic graphs (DAGs)
● Pipelines use Jinja templating
● Pipelines contain user-defined tasks
● Tasks can run on different workers at different times
● Jinja scripts can be embedded in tasks
● Comments can be added to tasks in varying formats
● Inter-task dependencies can be defined

Slide 4

Slide 4 text

Airflow Pipelines

Slide 5

Slide 5 text

Airflow Tasks
● Tasks have a lifecycle
● Tasks use operators to execute; the operator used depends on the task type
– For instance, MySqlOperator
● Hooks are used to access external systems, e.g. databases
● Worker-specific queues can be used for tasks
● XCom allows tasks to exchange messages
● Pipelines (DAGs) also allow
– Branching
– Sub-DAGs
– Service level agreements (SLAs)
– Trigger rules
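The XCom idea can be sketched without Airflow at all: tasks push small values under their task id and a key, and downstream tasks pull them by naming the upstream task. A toy model, not the real API:

```python
# Toy model of XCom: a shared store keyed by (task_id, key).
xcom_store = {}

def xcom_push(task_id, key, value):
    """Publish a small value from a task."""
    xcom_store[(task_id, key)] = value

def xcom_pull(task_id, key):
    """Retrieve a value published by another task (None if absent)."""
    return xcom_store.get((task_id, key))

# Upstream task publishes a result...
xcom_push("extract", "row_count", 42)

# ...and a downstream task retrieves it by naming the upstream task.
row_count = xcom_pull("extract", "row_count")
```

In real Airflow the store is the metadata database and values are also scoped by DAG id and execution date, but the push/pull shape is the same.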

Slide 6

Slide 6 text

Airflow Task Stages ● Tasks have lifecycle stages

Slide 7

Slide 7 text

Airflow Task Life Cycle
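The life cycle can be sketched as a small state machine. This is a simplified subset of Airflow's real task instance states, kept to the common path:

```python
# Simplified subset of Airflow task states and allowed transitions.
TRANSITIONS = {
    "none": {"scheduled"},
    "scheduled": {"queued"},
    "queued": {"running"},
    "running": {"success", "failed", "up_for_retry"},
    "up_for_retry": {"scheduled"},  # a retried task is rescheduled
}

def advance(state, new_state):
    """Move a task to new_state, refusing illegal jumps."""
    if new_state not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state

# A successful run walks the happy path:
state = "none"
for nxt in ("scheduled", "queued", "running", "success"):
    state = advance(state, nxt)
```

The UI's tree and graph views colour each task instance by exactly these kinds of states.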

Slide 8

Slide 8 text

Airflow UI
● The Airflow UI provides views
– DAG, Tree, Graph, Variables, Gantt chart
– Task duration, Code view
● Select a task instance in any view to manage it
● Monitor and troubleshoot pipelines in the views
● Monitor DAGs by owner, schedule, run time, etc.
● Use the views to find pipeline problem areas
● Use the views to find bottlenecks

Slide 9

Slide 9 text

Airflow UI

Slide 10

Slide 10 text

Airflow Integration
● Airflow integrates with
– Azure: Microsoft Azure
– AWS: Amazon Web Services
– Databricks
– GCP: Google Cloud Platform
– Cloud Speech Translate operators
– Qubole
● Kubernetes
– Run tasks as pods

Slide 11

Slide 11 text

Airflow Metrics
● Airflow can send metrics to StatsD
– A network daemon that runs on Node.js
– Listens for statistics such as counters, gauges, and timers
– Statistics are sent over UDP or TCP
● Install metrics support using the pip command
● Specify which stats to record, e.g.
– scheduler, executor, dagrun
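Enabling StatsD is a pip extra plus a few settings. A sketch using the documented defaults (on recent Airflow 2.x the section is [metrics]; older releases keep these keys under [scheduler]):

```
# Shell: install the StatsD extra
#   pip install "apache-airflow[statsd]"

# airflow.cfg — emit scheduler/executor/dagrun stats to a StatsD daemon
[metrics]
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow
```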

Slide 12

Slide 12 text

Available Books
● See “Big Data Made Easy”
– Apress, Jan 2015
● See “Mastering Apache Spark”
– Packt, Oct 2015
● See “Complete Guide to Open Source Big Data Stack”
– Apress, Jan 2018
● Find the author on Amazon
– www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
● Connect on LinkedIn
– www.linkedin.com/in/mike-frampton-38563020

Slide 13

Slide 13 text

Connect
● Feel free to connect on LinkedIn
– www.linkedin.com/in/mike-frampton-38563020
● See my open source blog at
– open-source-systems.blogspot.com/
● I am always interested in
– New technology
– Opportunities
– Technology-based issues
– Big data integration