Introduction to Apache Airflow at PyData Delhi meetup #25

Related blog post: https://blog.socialcops.com/engineering/apache-airflow-disease-outbreaks-india/

The presentation begins with a general introduction to Apache Airflow, then shows how the audience can develop their own ETL workflows using the framework, with the help of an example use case: tracking disease outbreaks in India. It also briefly covers how SocialCops has used Apache Airflow to power the DISHA dashboard (https://blog.socialcops.com/inside-sc/announcements/disha-dashboard-good-governance/) and to move data across their internal systems.

The full code is available here: https://github.com/socialcopsdev/airflow_blog

Vinayak Mehta

June 16, 2018

Transcript

  1. Introduction to Apache Airflow
    A workflow to track disease outbreaks in India
    Vinayak Mehta, Data Engineer at SocialCops

  2. What is Data Engineering?

  3. Extract, Transform, Load (ETL) workflows

  4. Scheduling
    Cron, anyone?
    An example workflow
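
    A naive cron-based version of such a workflow might look like this
    (illustrative paths; each step is simply scheduled an hour apart, in the
    hope that the previous one has finished in time):

    # crontab entries for a three-step pipeline
    0 1 * * * /usr/local/bin/extract.sh
    0 2 * * * /usr/local/bin/transform.sh
    0 3 * * * /usr/local/bin/load.sh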

  5. Problems
    ● Workflow management is a nightmare
    ● Can’t scale as team sizes grow
    ● Can’t handle dependencies between tasks
    ● “I want to debug my workflow, where are the logs?!”

  6. A wild Apache Airflow appears!
    "Airflow is a platform to programmatically author, schedule
    and monitor workflows." - Airflow docs
    ● created by Maxime Beauchemin at Airbnb in 2014, open source from the
    very first commit
    ● used by Airbnb, Lyft, Spotify, Quora, Reddit, Stripe, Yahoo! ...

  7. Why Airflow?

  8. Workflows as DAGs
    Workflows are defined as Directed Acyclic Graphs (DAGs),
    which means dependency management is sorted.
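
    A minimal sketch of a DAG definition, using the Airflow 1.x API that was
    current at the time of this talk (names illustrative):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    dag = DAG(
        "example_dag",
        start_date=datetime(2018, 6, 1),
        schedule_interval="@daily",
    )

    # Placeholder tasks; >> declares "runs before", so Airflow always
    # knows the dependency order and never has to guess.
    extract = DummyOperator(task_id="extract", dag=dag)
    transform = DummyOperator(task_id="transform", dag=dag)
    load = DummyOperator(task_id="load", dag=dag)

    extract >> transform >> load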

  9. Extensible
    ● Each node/task in the DAG is an “operator”
    ● A lot of operators come right out of the box, letting you
    connect directly to MySQL, Postgres, Hive ...
    ● PythonOperator, which means you can write what your
    task will do in the language that we all love!
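
    For example, a minimal PythonOperator sketch (Airflow 1.x API; names
    illustrative):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def say_hello():
        # any Python callable can become a task
        print("Hello from Airflow!")

    dag = DAG("hello_dag", start_date=datetime(2018, 6, 1),
              schedule_interval="@daily")

    hello = PythonOperator(task_id="say_hello",
                           python_callable=say_hello, dag=dag)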

  10. Scalable
    ● Tasks are executed by “executors”
    ● Comes with CeleryExecutor, which lets you horizontally
    scale your workflow to multiple worker nodes
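
    Switching executors is a configuration change; a sketch of the relevant
    airflow.cfg entries for CeleryExecutor (an assumption: key names vary
    slightly across 1.x versions, and the broker/backend URLs are
    illustrative):

    [core]
    executor = CeleryExecutor

    [celery]
    broker_url = redis://localhost:6379/0
    result_backend = db+postgresql://airflow:airflow@localhost/airflow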

  11. Open source
    ● Under incubation at the Apache Software Foundation
    ● New operators being added regularly!
    ● Actively maintained

  12. Architecture

  13. The Holy Trinity
    ● Webserver
    ● Scheduler
    ● Worker
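
    In Airflow 1.x, each of these runs as a separate process started from the
    CLI:

    airflow webserver -p 8080   # UI for browsing DAGs, task states and logs
    airflow scheduler           # queues task instances once dependencies are met
    airflow worker              # (CeleryExecutor) picks up and runs queued tasks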

  14. A DAG!

  15. What is this?

  16. An orchestra!

  17. Enough about Airflow

  18. The data source
    ● The Ministry of Health and Family Welfare runs the Integrated
    Disease Surveillance Programme (IDSP) to track disease outbreaks
    ● Tracking done at sub-district and village level across
    India
    ● Outbreak data released as a PDF document on a weekly
    basis

  19. The workflow
    ● (E) Extract the PDFs from IDSP website
    ● (T) Transform them into CSVs
    ● (L) Load them into a data store

  20. Extracting PDFs from IDSP website

  21. (code slide)

  22. (code slide)
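
    Slides 21–22 walk through the extraction code; a minimal sketch of such a
    scraper, assuming requests and BeautifulSoup (URL and file handling
    illustrative; see the linked repository for the actual code):

    import os
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    # page listing the weekly outbreak PDFs (illustrative URL)
    IDSP_URL = "https://idsp.nic.in/"

    def extract_pdfs(download_dir="pdfs"):
        """Find links to weekly outbreak PDFs on the IDSP page and download them."""
        os.makedirs(download_dir, exist_ok=True)
        soup = BeautifulSoup(requests.get(IDSP_URL).text, "html.parser")
        for link in soup.find_all("a", href=True):
            href = link["href"]
            if href.lower().endswith(".pdf"):
                pdf_url = urljoin(IDSP_URL, href)  # resolve relative links
                filename = os.path.join(download_dir, os.path.basename(href))
                with open(filename, "wb") as f:
                    f.write(requests.get(pdf_url).content)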

  23. Transforming PDFs into CSVs
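
    A sketch of this step using the camelot-py table-extraction library (an
    assumption; the deck's actual code may use a different PDF parser):

    import camelot

    def transform_to_csv(pdf_path, csv_path):
        """Extract the tables from an outbreak PDF and dump them as CSV."""
        tables = camelot.read_pdf(pdf_path, pages="all")
        # export() writes one CSV per detected table, named after csv_path
        tables.export(csv_path, f="csv")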

  24. Some minor cleaning
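
    A sketch of what "minor cleaning" could look like with pandas (the
    specific steps here are assumptions):

    import pandas as pd

    def clean_csv(csv_path):
        """Tidy up a raw CSV extracted from an IDSP PDF table."""
        df = pd.read_csv(csv_path)
        df = df.dropna(how="all")  # drop rows left fully empty by the PDF layout
        # normalize column names, e.g. "No. of Cases " -> "no._of_cases"
        df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
        for col in df.select_dtypes(include="object").columns:
            df[col] = df[col].str.strip()  # trim stray whitespace inside cells
        df.to_csv(csv_path, index=False)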

  25. Tying it all together into a DAG
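
    A sketch of the final DAG, wiring the three steps together with
    PythonOperators (Airflow 1.x API; idsp_tasks is a hypothetical module
    collecting the functions sketched on the earlier slides):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    # hypothetical module with the extract/transform/load callables
    from idsp_tasks import extract_pdfs, transform_to_csv, load_csvs

    default_args = {
        "owner": "airflow",
        "start_date": datetime(2018, 6, 1),
    }

    # IDSP publishes outbreak PDFs weekly, so the DAG runs on a weekly schedule
    dag = DAG("idsp_etl", default_args=default_args, schedule_interval="@weekly")

    extract = PythonOperator(task_id="extract_pdfs",
                             python_callable=extract_pdfs, dag=dag)
    transform = PythonOperator(task_id="transform_to_csv",
                               python_callable=transform_to_csv, dag=dag)
    load = PythonOperator(task_id="load_csvs",
                          python_callable=load_csvs, dag=dag)

    # extract must finish before transform, transform before load
    extract >> transform >> load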

  26. Where to go from here?
    ● To the webserver!
    ○ View task states
    ○ View task logs
    ● But what do we do with the code
    and data we have?
    ○ Extend the code to extract from other
    sources.
    ○ Use the data along with other sources
    to generate your own insights.
    ○ *cough* predictive analytics *cough*

  27. Where else do we use Airflow?

  28. Collect to Visualize workflows
    ● Extract data from Collect, our data collection app
    ● Transform using R/Python
    ● Load data into Visualize, our visualization platform, to derive better
    insights from the data

  29. DISHA
    ● Data from 20 different ministries breaking silos and coming together
    in one place
    ● Bringing accountability to a budget of over Rs. 2 lakh crore spent
    annually on 41 welfare schemes
    “DISHA is a crucial step towards good governance through which we will be
    able to monitor everything centrally. It will enable us to effectively monitor
    every village of the country.”
    – Narendra Modi, Prime Minister of India

  30. Questions?

  31. Thanks!
    SocialCops
    https://twitter.com/social_cops
    Email: [email protected]
    Blog: blog.socialcops.com
    Website: www.socialcops.com
