Introduction to Apache Airflow at PyData Delhi meetup #25

Related blog post: https://blog.socialcops.com/engineering/apache-airflow-disease-outbreaks-india/

The presentation begins with a general introduction to Apache Airflow, then shows how the audience can develop their own ETL workflows using the framework, with the help of an example use case: tracking disease outbreaks in India. It also briefly covers how SocialCops has used Apache Airflow to power the DISHA dashboard (https://blog.socialcops.com/inside-sc/announcements/disha-dashboard-good-governance/) and to move data across their internal systems.

The full code is available here: https://github.com/socialcopsdev/airflow_blog

Vinayak Mehta

June 16, 2018

Transcript

  1. Introduction to Apache Airflow
    A workflow to track disease outbreaks in India
    Vinayak Mehta, Data Engineer at SocialCops

  2. What is Data Engineering?

  3. Extract, Transform, Load (ETL) workflows

  4. Scheduling
    Cron, anyone?
    An example workflow
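
    A naive cron-based version of such a workflow might look like this
    (illustrative paths; each step is simply scheduled an hour apart, in the
    hope that the previous one has finished in time):

    # crontab entries for a three-step pipeline
    0 1 * * * /usr/local/bin/extract.sh
    0 2 * * * /usr/local/bin/transform.sh
    0 3 * * * /usr/local/bin/load.sh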

  5. Problems
    ● Workflow management is a nightmare
    ● Can’t scale as team sizes grow
    ● Can’t handle dependencies between tasks
    ● “I want to debug my workflow, where are the logs?!”

  6. A wild Apache Airflow appears!
    "Airflow is a platform to programmatically author, schedule
    and monitor workflows." - Airflow docs
    ● created by Maxime Beauchemin at Airbnb in 2014, open source from the
    very first commit
    ● used by Airbnb, Lyft, Spotify, Quora, Reddit, Stripe, Yahoo! ...

  7. Why Airflow?

  8. Workflows as DAGs
    Workflows are defined as Directed Acyclic Graphs (DAGs),
    which means dependency management is sorted.
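
    A minimal sketch of a DAG definition, using the Airflow 1.x API that was
    current at the time of this talk (names illustrative):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    dag = DAG(
        "example_dag",
        start_date=datetime(2018, 6, 1),
        schedule_interval="@daily",
    )

    # Placeholder tasks; >> declares "runs before", so Airflow always
    # knows the dependency order and never has to guess.
    extract = DummyOperator(task_id="extract", dag=dag)
    transform = DummyOperator(task_id="transform", dag=dag)
    load = DummyOperator(task_id="load", dag=dag)

    extract >> transform >> load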

  9. Extensible
    ● Each node/task in the DAG is an “operator”
    ● A lot of operators come right out of the box, letting you
    connect directly to MySQL, Postgres, Hive ...
    ● PythonOperator, which means you can write what your
    task will do in the language that we all love!
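
    For example, a minimal PythonOperator sketch (Airflow 1.x API; names
    illustrative):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def say_hello():
        # any Python callable can become a task
        print("Hello from Airflow!")

    dag = DAG("hello_dag", start_date=datetime(2018, 6, 1),
              schedule_interval="@daily")

    hello = PythonOperator(task_id="say_hello",
                           python_callable=say_hello, dag=dag)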

  10. Scalable
    ● Tasks are executed by “executors”
    ● Comes with CeleryExecutor, which lets you horizontally
    scale your workflow to multiple worker nodes
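
    Switching executors is a configuration change; a sketch of the relevant
    airflow.cfg entries for CeleryExecutor (an assumption: key names vary
    slightly across 1.x versions, and the broker/backend URLs are
    illustrative):

    [core]
    executor = CeleryExecutor

    [celery]
    broker_url = redis://localhost:6379/0
    result_backend = db+postgresql://airflow:airflow@localhost/airflow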

  11. Open source
    ● Under incubation at the Apache Software Foundation
    ● New operators being added regularly!
    ● Actively maintained

  12. Architecture

  13. The Holy Trinity
    ● Webserver
    ● Scheduler
    ● Worker
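
    In Airflow 1.x, each of these runs as a separate process started from the
    CLI:

    airflow webserver -p 8080   # UI for browsing DAGs, task states and logs
    airflow scheduler           # queues task instances once dependencies are met
    airflow worker              # (CeleryExecutor) picks up and runs queued tasks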

  14. A DAG!

  15. What is this?

  16. An orchestra!

  17. Enough about Airflow

  18. The data source
    ● The Ministry of Health and Family Welfare runs the Integrated
    Disease Surveillance Programme (IDSP) to track disease outbreaks
    ● Tracking done at sub-district and village level across
    India
    ● Outbreak data released as a PDF document on a weekly
    basis

  19. The workflow
    ● (E) Extract the PDFs from IDSP website
    ● (T) Transform them into CSVs
    ● (L) Load them into a data store

  20. Extracting PDFs from IDSP website

  21. (code slide)

  22. (code slide)
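
    Slides 21–22 walk through the extraction code; a minimal sketch of such a
    scraper, assuming requests and BeautifulSoup (URL and file handling
    illustrative; see the linked repository for the actual code):

    import os
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    # page listing the weekly outbreak PDFs (illustrative URL)
    IDSP_URL = "https://idsp.nic.in/"

    def extract_pdfs(download_dir="pdfs"):
        """Find links to weekly outbreak PDFs on the IDSP page and download them."""
        os.makedirs(download_dir, exist_ok=True)
        soup = BeautifulSoup(requests.get(IDSP_URL).text, "html.parser")
        for link in soup.find_all("a", href=True):
            href = link["href"]
            if href.lower().endswith(".pdf"):
                pdf_url = urljoin(IDSP_URL, href)  # resolve relative links
                filename = os.path.join(download_dir, os.path.basename(href))
                with open(filename, "wb") as f:
                    f.write(requests.get(pdf_url).content)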

  23. Transforming PDFs into CSVs
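
    A sketch of this step using the camelot-py table-extraction library (an
    assumption; the deck's actual code may use a different PDF parser):

    import camelot

    def transform_to_csv(pdf_path, csv_path):
        """Extract the tables from an outbreak PDF and dump them as CSV."""
        tables = camelot.read_pdf(pdf_path, pages="all")
        # export() writes one CSV per detected table, named after csv_path
        tables.export(csv_path, f="csv")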

  24. Some minor cleaning
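
    A sketch of what "minor cleaning" could look like with pandas (the
    specific steps here are assumptions):

    import pandas as pd

    def clean_csv(csv_path):
        """Tidy up a raw CSV extracted from an IDSP PDF table."""
        df = pd.read_csv(csv_path)
        df = df.dropna(how="all")  # drop rows left fully empty by the PDF layout
        # normalize column names, e.g. "No. of Cases " -> "no._of_cases"
        df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
        for col in df.select_dtypes(include="object").columns:
            df[col] = df[col].str.strip()  # trim stray whitespace inside cells
        df.to_csv(csv_path, index=False)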

  25. Tying it all together into a DAG
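
    A sketch of the final DAG, wiring the three steps together with
    PythonOperators (Airflow 1.x API; idsp_tasks is a hypothetical module
    collecting the functions sketched on the earlier slides):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    # hypothetical module with the extract/transform/load callables
    from idsp_tasks import extract_pdfs, transform_to_csv, load_csvs

    default_args = {
        "owner": "airflow",
        "start_date": datetime(2018, 6, 1),
    }

    # IDSP publishes outbreak PDFs weekly, so the DAG runs on a weekly schedule
    dag = DAG("idsp_etl", default_args=default_args, schedule_interval="@weekly")

    extract = PythonOperator(task_id="extract_pdfs",
                             python_callable=extract_pdfs, dag=dag)
    transform = PythonOperator(task_id="transform_to_csv",
                               python_callable=transform_to_csv, dag=dag)
    load = PythonOperator(task_id="load_csvs",
                          python_callable=load_csvs, dag=dag)

    # extract must finish before transform, transform before load
    extract >> transform >> load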

  26. Where to go from here?
    ● To the webserver!
    ○ View task states
    ○ View task logs
    ● But what do we do with the code
    and data we have?
    ○ Extend the code to extract from other
    sources.
    ○ Use the data along with other sources
    to generate your own insights.
    ○ *cough* predictive analytics *cough*

  27. Where else do we use Airflow?

  28. Collect to Visualize workflows
    ● Extract data from Collect, our data collection app
    ● Transform using R/Python
    ● Load data into Visualize, our visualization platform, to derive better
    insights from the data

  29. DISHA
    ● Data from 20 different ministries breaking silos and coming together
    in one place
    ● Bringing accountability to a budget of over Rs. 2 lakh crore spent
    annually on 41 welfare schemes
    “DISHA is a crucial step towards good governance through which we will be
    able to monitor everything centrally. It will enable us to effectively monitor
    every village of the country.”
    – Narendra Modi, Prime Minister of India

  30. Questions?

  31. Thanks!
    SocialCops
    https://twitter.com/social_cops
    Email: [email protected]
    Blog: blog.socialcops.com
    Website: www.socialcops.com
