Apache Airflow @ Umuzi.org by Sheena O'Connell

Pycon ZA
October 10, 2019

People don't want data - what they really want is insight. Or even better, actionable insight. Now the road from data to insights can be a bit of a beast. Take Airbnb as an example - it started as a scrappy social hack and grew into a large, data-driven company. When they were small, so was their data, but as the company and its technical architecture grew in scale and complexity, leveraging that data became a challenge. It became more and more necessary to combine multiple messy data sources in novel ways, in the right order and on a strict schedule... using distributed computing... with proper logging and error recovery... gosh. Batch jobs, cron, sticky tape and bits of string soon proved insufficient.

Enter Airflow.

Airflow is an Apache top-level project that was open-sourced by Airbnb. It's a seriously powerful tool that's all about defining, scheduling, running, monitoring and distributing complicated workflows.

In this talk I'll give you a bit of a tour of Airflow's moving parts. I'll also talk a little bit about how we are leveraging Airflow at Umuzi.

Transcript

  1. UMUZI
     Key metric: # people placed in high-value careers
     online test -> human interview -> bootcamp -> learnership -> $$
  2. UMUZI
     Key metric: # people placed in high-value careers
     online test -> human interview -> bootcamp -> learnership -> $$
     growing very fast
  3. UMUZI
     Key metric: # people placed in high-value careers
     online test -> human interview -> bootcamp -> learnership -> $$
     growing very fast
     Based in Jeppestown
  4. UMUZI
     Key metric: # people placed in high-value careers
     online test -> human interview -> bootcamp -> learnership -> $$
     growing very fast
     Based in Jeppestown
     We're hiring <- shameless plug
  5. THE PROBLEM
     AS DATA-DRIVEN ORGANIZATIONS GROW...
     Data pipeline requirements grow
  6. THE PROBLEM
     AS DATA-DRIVEN ORGANIZATIONS GROW...
     Data pipeline requirements grow
     Multiple data sources
  7. THE PROBLEM
     AS DATA-DRIVEN ORGANIZATIONS GROW...
     Data pipeline requirements grow
     Multiple data sources
     complex networks of processes
  8. THE PROBLEM
     AS DATA-DRIVEN ORGANIZATIONS GROW...
     Data pipeline requirements grow
     Multiple data sources
     complex networks of processes
     intricate dependencies
  9. THE PROBLEM
     AS DATA-DRIVEN ORGANIZATIONS GROW...
     Data pipeline requirements grow
     Multiple data sources
     complex networks of processes
     intricate dependencies
     specific schedules
  10. THE PROBLEM
     AS DATA-DRIVEN ORGANIZATIONS GROW...
     Data pipeline requirements grow
     Multiple data sources
     complex networks of processes
     intricate dependencies
     specific schedules
     stakeholder requirements grow and shift
  11. ENTER APACHE AIRFLOW
     "Airflow is a platform to programmatically author, schedule and monitor workflows"
  12. EXAMPLE DAG: CONFIGURATION AS PYTHON CODE
     with DAG(
         "do_nice_things",
         default_args=default_args,
         schedule_interval="@daily",
     ) as dag:
         task_fetch_data = BashOperator(
             task_id="fetch_the_data",
             bash_command="wget https://example.foo/blah.csv",
         )
         task_clean_data = PythonOperator(
             task_id="clean_the_data",
             python_callable=clean_it_yo,
         )
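     As context for the slide above: to actually run this DAG you also need the imports, a default_args dict and a dependency between the two tasks. A minimal runnable sketch, assuming Airflow 1.10-era import paths and a placeholder clean_it_yo function (neither the placeholder nor the default_args values are from the deck):

     from datetime import datetime, timedelta

     from airflow import DAG
     from airflow.operators.bash_operator import BashOperator
     from airflow.operators.python_operator import PythonOperator

     def clean_it_yo():
         # placeholder for the real cleaning logic
         pass

     default_args = {
         "owner": "airflow",
         "start_date": datetime(2019, 10, 1),
         "retries": 1,
         "retry_delay": timedelta(minutes=5),
     }

     with DAG(
         "do_nice_things",
         default_args=default_args,
         schedule_interval="@daily",
     ) as dag:
         task_fetch_data = BashOperator(
             task_id="fetch_the_data",
             bash_command="wget https://example.foo/blah.csv",
         )
         task_clean_data = PythonOperator(
             task_id="clean_the_data",
             python_callable=clean_it_yo,
         )

         # only clean the data once the download has succeeded
         task_fetch_data >> task_clean_data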
  13. WHY IS THIS COOL?
     can do simple "declarative" things like above
  14. WHY IS THIS COOL?
     can do simple "declarative" things like above
     can generate DAG tasks dynamically (eg create a bunch of tasks within a loop)
  15. WHY IS THIS COOL?
     can do simple "declarative" things like above
     can generate DAG tasks dynamically (eg create a bunch of tasks within a loop)
     Parameterize entire DAGs
  16. OPERATORS
     PythonOperator - executes arbitrary Python code
     BashOperator - executes arbitrary bash commands, leveraging Jinja templating
  17. OPERATORS
     PythonOperator - executes arbitrary Python code
     BashOperator - executes arbitrary bash commands, leveraging Jinja templating
     BranchPythonOperator
  18. OPERATORS
     PythonOperator - executes arbitrary Python code
     BashOperator - executes arbitrary bash commands, leveraging Jinja templating
     BranchPythonOperator
     TriggerDagRunOperator
  19. OPERATORS
     PythonOperator - executes arbitrary Python code
     BashOperator - executes arbitrary bash commands, leveraging Jinja templating
     BranchPythonOperator
     TriggerDagRunOperator
     EmailOperator / SlackOperator
  20. OPERATORS
     PythonOperator - executes arbitrary Python code
     BashOperator - executes arbitrary bash commands, leveraging Jinja templating
     BranchPythonOperator
     TriggerDagRunOperator
     EmailOperator / SlackOperator
     ...more than 40 in total
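     As a small illustration of the Jinja templating mentioned alongside BashOperator: Airflow renders macros such as {{ ds }} (the execution date) into the command at runtime, so each scheduled run can fetch its own day's data. The URL below is made up for the example, and dag is again the object from the earlier slide:

     from airflow.operators.bash_operator import BashOperator

     # {{ ds }} is rendered by Airflow as the execution date (YYYY-MM-DD),
     # so the daily run for 2019-10-09 downloads 2019-10-09.csv
     task_fetch_daily = BashOperator(
         task_id="fetch_daily_dump",
         bash_command="wget https://example.foo/dumps/{{ ds }}.csv -O /tmp/{{ ds }}.csv",
         dag=dag,
     )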
  21. DEPLOYMENT
     # Based on https://github.com/puckel/docker-airflow
     version: "3.3"
     services:
       nginx:
         image: nginx:1.17.1
         ports:
           - "80:80"
           - "443:443"
         volumes:
           - ./nginx:/etc/nginx/conf.d
           - ./gitignore/nginxlog:/var/log/nginx
           - $SECRET_DIR:/etc/apache2
         restart: always
       postgres:
         image: gcr.io/cloudsql-docker/gce-proxy:1.12
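     The transcript cuts off mid-file here. For orientation only: a compose file based on puckel/docker-airflow would typically also declare the Airflow webserver service (and, depending on the executor, scheduler and worker services). A hedged sketch of what such a service block commonly looks like; the image tag, volumes and environment values are assumptions, not the author's actual configuration:

       webserver:
         image: puckel/docker-airflow:1.10.4   # tag is an assumption
         restart: always
         depends_on:
           - postgres
         environment:
           - LOAD_EX=n       # puckel entrypoint flag: skip the example DAGs
           - EXECUTOR=Local
         volumes:
           - ./dags:/usr/local/airflow/dags
         ports:
           - "8080:8080"
         command: webserver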