The presentation begins with a general introduction to Apache Airflow and then shows how the audience can develop their own ETL workflows using the framework, with the help of an example use case: tracking disease outbreaks in India. It also briefly covers how SocialCops has used Apache Airflow to power the DISHA dashboard (https://blog.socialcops.com/inside-sc/announcements/disha-dashboard-good-governance/) and to move data across its internal systems.
The full code is present here: https://github.com/socialcopsdev/airflow_blog
Introduction to Apache Airflow
A workflow to track disease outbreaks in India
Vinayak Mehta, Data Engineer at SocialCops
What is Data Engineering?
Extract, Transform, Load (ETL) workflows
An example workflow
● Workflow management is a nightmare
● Can’t scale as team sizes grow
● Can’t handle dependencies between tasks
● “I want to debug my workflow, where are the logs?!”
A wild Apache Airflow appears!
"Airflow is a platform to programmatically author, schedule
and monitor workflows." - Airflow docs
● Created by Maxime Beauchemin at Airbnb in 2014, open source from the
very first commit
● used by Airbnb, Lyft, Spotify, Quora, Reddit, Stripe, Yahoo! ...
Workflows are defined as Directed Acyclic Graphs (DAGs), which means
dependency management is sorted
Workflows as DAGs
● Each node/task in the DAG is an “operator”
● Many operators available right out of the box that let you
connect directly to MySQL, Postgres, Hive ...
● PythonOperator, which means you can write what your
task will do in the language that we all love!
● Tasks are executed by “executors”
● Comes with CeleryExecutor, which lets you horizontally
scale your workflow to multiple worker nodes
● Under incubation at the Apache Software Foundation
● New operators being added regularly!
● Actively maintained
The Holy Trinity
What is this?
Enough about Airflow
The data source
● The Ministry of Health and Family Welfare has the IDSP
scheme in place to track disease outbreaks
● Tracking done at the sub-district and village level across India
● Outbreak data released as a PDF document on a weekly basis
● (E) Extract the PDFs from IDSP website
● (T) Transform them into CSVs
● (L) Load them into a data store
Extracting PDFs from IDSP website
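The extract step can be sketched in a few lines of Python. This is a minimal sketch, not the talk's exact code: the IDSP listing URL and the page structure are assumptions, and the real site may paginate or need session handling.

```python
import os
import re
import urllib.parse
import urllib.request

# Hypothetical listing-page URL; the real IDSP site layout may differ.
IDSP_URL = "https://idsp.nic.in"

def find_pdf_links(html):
    """Return every href ending in .pdf found in the page source."""
    return re.findall(r'href="([^"]+\.pdf)"', html, flags=re.IGNORECASE)

def extract_pdfs(base_url=IDSP_URL, out_dir="pdfs"):
    """Download every weekly outbreak PDF linked from the listing page."""
    os.makedirs(out_dir, exist_ok=True)
    html = urllib.request.urlopen(base_url).read().decode("utf-8", "replace")
    for link in find_pdf_links(html):
        name = os.path.basename(link)
        urllib.request.urlretrieve(urllib.parse.urljoin(base_url, link),
                                   os.path.join(out_dir, name))
```

Wrapping this in a function (rather than a script) is what later lets it plug into a PythonOperator as a `python_callable`.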
Transforming PDFs into CSVs
Some minor cleaning
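What "minor cleaning" means varies from PDF to PDF; here is a small sketch of the typical fix, collapsing the line breaks and stray whitespace that PDF table extraction leaves inside cells (the function names are illustrative):

```python
import re

def clean_cell(cell):
    """Collapse line breaks and runs of whitespace left by PDF extraction."""
    return re.sub(r"\s+", " ", cell).strip()

def clean_row(row):
    """Apply cell cleaning across one extracted table row."""
    return [clean_cell(cell) for cell in row]
```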
Tying it all together into a DAG
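A sketch of what the final DAG file could look like, using the classic Airflow 1.x API; the task callables below are trivial stubs standing in for the extract/transform/load code from the previous slides, and the DAG id, schedule and default arguments are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Stubs standing in for the real extract/transform/load functions.
def extract_pdfs():
    """Download the weekly outbreak PDFs from the IDSP website."""

def transform_to_csv():
    """Turn the downloaded PDFs into clean CSVs."""

def load_to_store():
    """Push the CSVs into the data store."""

default_args = {"owner": "airflow", "start_date": datetime(2018, 1, 1)}

dag = DAG(
    dag_id="idsp_disease_outbreaks",
    default_args=default_args,
    schedule_interval="@weekly",  # IDSP releases data weekly
)

extract = PythonOperator(task_id="extract_pdfs",
                         python_callable=extract_pdfs, dag=dag)
transform = PythonOperator(task_id="transform_to_csv",
                           python_callable=transform_to_csv, dag=dag)
load = PythonOperator(task_id="load_to_store",
                      python_callable=load_to_store, dag=dag)

# Dependencies: extract must run before transform, transform before load.
extract >> transform >> load
```

Dropping this file into Airflow's `dags/` folder is enough for the scheduler to pick it up and run it weekly.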
Where to go from here?
● To the webserver!
○ View task states
○ View task logs
● But what do we do with the code
and data we have?
○ Extend the code to extract from other sources
○ Use the data along with other sources
to generate your own insights.
○ *cough* predictive analytics *cough*
Where else do we use Airflow?
Collect to Visualize workflows
● Extract data from Collect, our data collection app
● Transform using R/Python
● Load data into Visualize, our visualization platform to derive better insights
out of the data
● Data from 20 different ministries breaking silos to come together in one
place
● Bringing in accountability to a budget of over Rs. 2 lakh crore spent on 41
welfare schemes annually
“DISHA is a crucial step towards good governance through which we will be
able to monitor everything centrally. It will enable us to effectively monitor
every village of the country.”
– Narendra Modi, Prime Minister of India
Email: [email protected]