Apache Airflow @ Umuzi.org by Sheena O'Connell

Pycon ZA
October 10, 2019

People don't want data - what they really want is insight. Or even better, actionable insight. The road from data to insight can be a bit of a beast. Take Airbnb as an example: it started as a scrappy social hack and grew into a large, data-driven company. When the company was small, so was its data, but as the company and its technical architecture grew in scale and complexity, leveraging that data became a challenge. It became more and more necessary to combine multiple messy data sources in novel ways, in the right order and on a strict schedule... using distributed computing... with proper logging and error recovery... gosh. Batch jobs, cron, sticky tape and bits of string soon proved insufficient.

Enter Airflow.

Airflow is an Apache top-level project that was open-sourced by Airbnb. It's a seriously powerful tool that's all about defining, scheduling, running, monitoring and distributing complicated workflows.

In this talk I'll give you a bit of a tour of Airflow's moving parts. I'll also talk a little bit about how we are leveraging Airflow at Umuzi.


Transcript

  1. None
  2. APACHE AIRFLOW @UMUZI.ORG

  3. UMUZI

  4. UMUZI
    Key metric: # people placed in high value careers
  5. UMUZI
    Key metric: # people placed in high value careers; online test -> human interview -> bootcamp -> learnership -> $$
  6. UMUZI
    Key metric: # people placed in high value careers; online test -> human interview -> bootcamp -> learnership -> $$; growing very fast
  7. UMUZI
    Key metric: # people placed in high value careers; online test -> human interview -> bootcamp -> learnership -> $$; growing very fast; Based in Jeppestown
  8. UMUZI
    Key metric: # people placed in high value careers; online test -> human interview -> bootcamp -> learnership -> $$; growing very fast; Based in Jeppestown; We're hiring <- shameless plug
  9. ME :)
    Umuzi CTO; Coding for more than 10 years; All about that Python life; sheena.oconnell@umuzi.org
  10. THE PROBLEM: AS DATA-DRIVEN ORGANIZATIONS GROW...
  11. THE PROBLEM: AS DATA-DRIVEN ORGANIZATIONS GROW...
    Data pipeline requirements grow
  12. THE PROBLEM: AS DATA-DRIVEN ORGANIZATIONS GROW...
    Data pipeline requirements grow; Multiple data sources
  13. THE PROBLEM: AS DATA-DRIVEN ORGANIZATIONS GROW...
    Data pipeline requirements grow; Multiple data sources; complex networks of processes
  14. THE PROBLEM: AS DATA-DRIVEN ORGANIZATIONS GROW...
    Data pipeline requirements grow; Multiple data sources; complex networks of processes; intricate dependencies
  15. THE PROBLEM: AS DATA-DRIVEN ORGANIZATIONS GROW...
    Data pipeline requirements grow; Multiple data sources; complex networks of processes; intricate dependencies; specific schedules
  16. THE PROBLEM: AS DATA-DRIVEN ORGANIZATIONS GROW...
    Data pipeline requirements grow; Multiple data sources; complex networks of processes; intricate dependencies; specific schedules; stakeholder requirements grow and shift
  17. OVERARCHING TECHNICAL NEEDS GROW...
  18. OVERARCHING TECHNICAL NEEDS GROW...
    monitoring
  19. OVERARCHING TECHNICAL NEEDS GROW...
    monitoring; retries
  20. OVERARCHING TECHNICAL NEEDS GROW...
    monitoring; retries; maintainable code
  21. OVERARCHING TECHNICAL NEEDS GROW...
    monitoring; retries; maintainable code; scale
  22. OVERARCHING TECHNICAL NEEDS GROW...
    monitoring; retries; maintainable code; scale; troubleshoot
  23. OVERARCHING TECHNICAL NEEDS GROW...
    monitoring; retries; maintainable code; scale; troubleshoot; authorization
  24. OVERARCHING TECHNICAL NEEDS GROW...
    monitoring; retries; maintainable code; scale; troubleshoot; authorization; SLA
  25. ENTER APACHE AIRFLOW
  26. ENTER APACHE AIRFLOW
    “Airflow is a platform to programmatically author, schedule and monitor workflows”
  27. SOME TERMINOLOGY

  28. EXAMPLE DAG: CONFIGURATION AS PYTHON CODE

        with DAG(
            "do_nice_things",
            default_args=default_args,
            schedule_interval="@daily",
        ) as dag:
            task_fetch_data = BashOperator(
                task_id="fetch_the_data",
                bash_command="wget https://example.foo/blah.csv"
            )
            task_clean_data = PythonOperator(
                task_id="clean_the_data",
                python_callable=clean_it_yo
            )

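Slide 28 refers to `default_args` and `clean_it_yo` without showing them. Here is a minimal sketch of what those two names might hold, using only the standard library so it runs without Airflow installed. `owner`, `start_date`, `retries` and `retry_delay` are standard Airflow task arguments, but every value here, and the whole body of `clean_it_yo`, is an assumption made for illustration and not taken from the talk.

```python
from datetime import datetime, timedelta

# Assumed values for illustration; the slides do not show them.
default_args = {
    "owner": "umuzi",                     # hypothetical owner label
    "start_date": datetime(2019, 10, 1),  # earliest date the schedule covers
    "retries": 2,                         # re-run a failed task up to twice
    "retry_delay": timedelta(minutes=5),  # pause between attempts
}


def clean_it_yo(lines=(" a,1 ", "", "b,2")):
    """Hypothetical cleaning step: trim whitespace and drop blank lines."""
    return [line.strip() for line in lines if line.strip()]


print(clean_it_yo())
```

In a real DAG file the two tasks would usually be ordered as well, for example with Airflow's `task_fetch_data >> task_clean_data`.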
  29. WHY IS THIS COOL?
  30. WHY IS THIS COOL?
    can do simple "declarative" things like above
  31. WHY IS THIS COOL?
    can do simple "declarative" things like above; can generate DAG tasks dynamically (e.g. create a bunch of tasks within a loop)
  32. WHY IS THIS COOL?
    can do simple "declarative" things like above; can generate DAG tasks dynamically (e.g. create a bunch of tasks within a loop); Parameterize entire DAGs
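The point about generating tasks in a loop can be sketched without Airflow installed. `Task` below is a toy stand-in written for this example, not an Airflow class; it only mimics the `>>` convention Airflow uses to say that one task runs before another. The source names are invented.

```python
class Task:
    """Toy stand-in for an operator instance (hypothetical, not Airflow's class)."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # Mimic `a >> b`: b runs after a. Return b so chains keep working.
        self.downstream.append(other)
        return other


def build_tasks(sources):
    """Create one fetch/clean pair per source, all hanging off a start task."""
    start = Task("start")
    tasks = [start]
    for name in sources:
        fetch = Task(f"fetch_{name}")
        clean = Task(f"clean_{name}")
        start >> fetch >> clean
        tasks += [fetch, clean]
    return tasks


tasks = build_tasks(["students", "applications"])  # invented source names
print([t.task_id for t in tasks])
```

Because the DAG file is ordinary Python, adding a third source is a one-word change to the list rather than another copy-pasted block of task definitions.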
  33. OPERATORS
  34. OPERATORS
    PythonOperator - executes arbitrary Python code
  35. OPERATORS
    PythonOperator - executes arbitrary Python code; BashOperator - executes arbitrary bash commands, leveraging Jinja templating
  36. OPERATORS
    PythonOperator - executes arbitrary Python code; BashOperator - executes arbitrary bash commands, leveraging Jinja templating; BranchPythonOperator
  37. OPERATORS
    PythonOperator - executes arbitrary Python code; BashOperator - executes arbitrary bash commands, leveraging Jinja templating; BranchPythonOperator; TriggerDagRunOperator
  38. OPERATORS
    PythonOperator - executes arbitrary Python code; BashOperator - executes arbitrary bash commands, leveraging Jinja templating; BranchPythonOperator; TriggerDagRunOperator; EmailOperator / SlackOperator
  39. OPERATORS
    PythonOperator - executes arbitrary Python code; BashOperator - executes arbitrary bash commands, leveraging Jinja templating; BranchPythonOperator; TriggerDagRunOperator; EmailOperator / SlackOperator; ...more than 40 in total
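Of the operators listed, `BranchPythonOperator` is probably the least self-explanatory: its callable returns the `task_id` of the downstream task to follow, and the branches that were not chosen are skipped. The sketch below shows that contract in plain Python. The task ids, the row-count rule and the `run_branch` helper are all invented for illustration and are not Airflow code.

```python
def choose_branch(row_count):
    """Return the task_id to follow, as a branch callable would."""
    if row_count == 0:
        return "notify_empty_file"
    return "clean_the_data"


def run_branch(row_count, candidates):
    """Toy scheduler step: mark the chosen task 'run' and the rest 'skipped'."""
    chosen = choose_branch(row_count)
    return {
        task_id: ("run" if task_id == chosen else "skipped")
        for task_id in candidates
    }


print(run_branch(0, ["clean_the_data", "notify_empty_file"]))
```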
  40. INTERACTION
  41. INTERACTION
    Command line utility
  42. INTERACTION
    Command line utility; Experimental REST API
  43. INTERACTION
    Python ...; Command line utility; Experimental REST API
  44. INTERACTION
    Python ...; Web UI; Command line utility; Experimental REST API
  45. WEB UI TOUR

  46. ARCHITECTURE

  47. DEPLOYMENT

        # Based on https://github.com/puckel/docker-airflow
        version: "3.3"
        services:
          nginx:
            image: nginx:1.17.1
            ports:
              - "80:80"
              - "443:443"
            volumes:
              - ./nginx:/etc/nginx/conf.d
              - ./gitignore/nginxlog:/var/log/nginx
              - $SECRET_DIR:/etc/apache2
            restart: always
          postgres:
            image: gcr.io/cloudsql-docker/gce-proxy:1.12

  48. </PRESENTATION>
    https://sheenarbw.github.io/pres-airflow/