Introduction to Apache Airflow at PyData Delhi meetup #25


Related blog post: https://blog.socialcops.com/engineering/apache-airflow-disease-outbreaks-india/

The presentation begins with a general introduction to Apache Airflow and then shows how the audience can develop their own ETL workflows using the framework, with the help of an example use case: tracking disease outbreaks in India. It also briefly covers how SocialCops has used Apache Airflow to power the DISHA dashboard (https://blog.socialcops.com/inside-sc/announcements/disha-dashboard-good-governance/) and to move data across their internal systems.

The full code is present here: https://github.com/socialcopsdev/airflow_blog


Vinayak Mehta

June 16, 2018

Transcript

  1. 1.

    Introduction to Apache Airflow: A workflow to track disease outbreaks in India. Vinayak Mehta, Data Engineer at SocialCops
  2. 5.

    Problems • Workflow management is a nightmare • Can’t scale as team sizes grow • Can’t handle dependencies between tasks • “I want to debug my workflow, where are the logs?!”
  3. 6.

    A wild Apache Airflow appears! "Airflow is a platform to programmatically author, schedule and monitor workflows." - Airflow docs • Created by Maxime Beauchemin at Airbnb in 2014, open source from the very first commit • Used by Airbnb, Lyft, Spotify, Quora, Reddit, Stripe, Yahoo! ...
  4. 9.

    Extensible • Each node/task in the DAG is an “operator” • A lot of operators right out of the box which let you connect directly to MySQL, Postgres, Hive ... • PythonOperator, which means you can write what your task will do in the language that we all love!
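    The PythonOperator point can be sketched with a minimal DAG; the dag_id, task name, and callable below are illustrative assumptions, not from the talk (Airflow 1.x-style imports, matching the talk's era):

    ```python
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator


    def say_hello():
        # Any Python code can go here: scraping, parsing, loading, etc.
        print("Hello from a PythonOperator task!")


    dag = DAG(
        dag_id="hello_airflow",
        start_date=datetime(2018, 6, 1),
        schedule_interval="@weekly",  # IDSP releases data weekly
    )

    hello_task = PythonOperator(
        task_id="say_hello",
        python_callable=say_hello,
        dag=dag,
    )
    ```

    The scheduler picks up this file from the DAGs folder and runs `say_hello` on the given schedule.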
  5. 10.

    Scalable • Tasks are executed by “executors” • Comes with CeleryExecutor, which lets you horizontally scale your workflow to multiple worker nodes
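    Switching executors is a configuration change. A minimal airflow.cfg sketch (the Redis broker and Postgres result backend below are placeholder assumptions, not from the talk):

    ```ini
    [core]
    executor = CeleryExecutor

    [celery]
    broker_url = redis://localhost:6379/0
    result_backend = db+postgresql://localhost/airflow
    ```

    Each worker machine then runs `airflow worker` to start pulling tasks from the shared queue.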
  6. 11.

    Open source • Under incubation at the Apache Software Foundation • New operators being added regularly! • Actively maintained
  8. 18.

    The data source • The Ministry of Health and Family Welfare has the IDSP scheme in place to track disease outbreaks • Tracking done at sub-district and village level across India • Outbreak data released as a PDF document on a weekly basis
  9. 19.

    The workflow • (E) Extract the PDFs from the IDSP website • (T) Transform them into CSVs • (L) Load them into a data store
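    The three steps above can be sketched as plain Python callables, the kind you would hand to PythonOperators. The record fields and in-memory SQLite store below are illustrative stand-ins for the real IDSP PDF scraping and parsing:

    ```python
    import sqlite3


    def extract():
        # (E) Stand-in for downloading and parsing the weekly outbreak
        # PDFs from the IDSP website; dummy records for illustration.
        return [
            {"state": "Delhi", "disease": "Dengue", "cases": 12},
            {"state": "Kerala", "disease": "Cholera", "cases": 4},
        ]


    def transform(records):
        # (T) Flatten parsed records into CSV-style rows.
        return [(r["state"], r["disease"], r["cases"]) for r in records]


    def load(rows):
        # (L) Load rows into a data store (in-memory SQLite here).
        conn = sqlite3.connect(":memory:")
        conn.execute(
            "CREATE TABLE outbreaks (state TEXT, disease TEXT, cases INTEGER)"
        )
        conn.executemany("INSERT INTO outbreaks VALUES (?, ?, ?)", rows)
        conn.commit()
        return conn.execute("SELECT COUNT(*) FROM outbreaks").fetchone()[0]


    print(load(transform(extract())))  # prints 2
    ```

    In the actual DAG, each function becomes its own task, so a failed transform can be retried without re-downloading the PDFs.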
  12. 26.

    Where to go from here? • To the webserver! ◦ View task states ◦ View task logs • But what do we do with the code and data we have? ◦ Extend the code to extract from other sources. ◦ Use the data along with other sources to generate your own insights. ◦ *cough* predictive analytics *cough*
  13. 28.

    Collect to Visualize workflows • Extract data from Collect, our data collection app • Transform using R/Python • Load data into Visualize, our visualization platform to derive better insights out of the data
  14. 29.

    DISHA • Data from 20 different ministries breaking silos to come together in one place • Bringing accountability to a budget of over Rs. 2 lakh crore spent on 41 welfare schemes annually. “DISHA is a crucial step towards good governance through which we will be able to monitor everything centrally. It will enable us to effectively monitor every village of the country.” – Narendra Modi, Prime Minister of India