Defining data pipeline workflows using Apache Airflow

juanriaza · November 23, 2018
Madrid, Commit Conf 2018

Transcript

  1. Defining data pipeline workflows using Apache Airflow
     Juan Riaza · jriaza@idealista.com · @juanriaza
  2. Who am I? • Software Developer • OSS enthusiast • Pythonista & Djangonaut
     • Now trying to tame Gophers • Reverse engineering apps • Hobbies: cooking and reading
  3. [image: xkcd #2054, “Data Pipeline”]

  4. In the beginning there was cron…

  5. [timeline: Gather Data 22:00 → ETL 00:00 → Report 02:00;
     tightened to Gather Data 22:00 → ETL 23:30 → Report 01:00 ☺]
  6. [timeline: the same schedule when the ETL overruns; Gather Data,
     ETL, and Report overlap ⏱⁉]
  7. Cron “hustle” • Managing dependencies. Possible overlapping • Failure handling.
     Retries? • Error notifications, metrics. Visibility? • Unified logs
     • Distributed cron… • Do you maintain a calendar of batch jobs? • What happens if…?
  8. Every day is a new adventure…

  9. Meet Airflow

  10. Overview • “A platform to programmatically author, schedule, and monitor workflows”
      • The glue that binds your data ecosystem together • Open source, written in Python
      • Started in Oct 2014 by Max Beauchemin at Airbnb • Incubating at the Apache
      Software Foundation since 2016 • 550+ contributors, 5300+ commits, 9300+ stars
  11. [logos of the companies using Airflow]

  12. Use cases • ETL pipelines • Machine learning pipelines • Predictive data pipelines:
      fraud detection, scoring/ranking, classification, recommender systems, etc.
      • General job scheduling: DB back-ups • Anything… automate the garage door?
  13. Airflow uses Operators as the fundamental unit of abstraction to define tasks,
      and uses a DAG to define workflows using a set of operators.
  14. What is a DAG • Directed Acyclic Graph • Represents a workflow: set of tasks
      with a dependency structure • Each node represents some form of data processing
  15. What does it look like?
  16. err…

  17. How it’s made

  18. import datetime

      from airflow import DAG
      from airflow.operators.bash_operator import BashOperator
      from airflow.operators.python_operator import PythonOperator


      def print_commit():
          print(' commit conf')


      with DAG('commit_dag',
               start_date=datetime.datetime(2018, 1, 1),  # tasks need a start_date
               schedule_interval='@weekly') as dag:
          print_bash = BashOperator(
              task_id='print_bash',
              bash_command='echo " commit conf"')
          sleep = BashOperator(
              task_id='sleep',
              bash_command='sleep 5')
          print_python = PythonOperator(
              task_id='print_python',
              python_callable=print_commit)

          print_bash >> sleep >> print_python
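    The `>>` chaining on the last line is shorthand for declaring dependencies;
    the same graph can be set with explicit method calls:

        # Equivalent to: print_bash >> sleep >> print_python
        print_bash.set_downstream(sleep)
        sleep.set_downstream(print_python)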
  19. Airflow UI [screenshot]
  20. Airflow UI [screenshot]
  21. Airflow UI [screenshot]
  22. Airflow UI [screenshot]
  23. Operators • Action: perform an action locally or make a call to an external
      system to perform another action • Transfer: move data from one system to
      another • Sensor: wait for and detect some condition in a source system
  24. Action operators • Perform an action such as executing a Python function or
      submitting a Spark job • Built-in: BashOperator, PythonOperator, DockerOperator,
      EmailOperator, … • Community contributed: Databricks, AWS, GCP
  25. Transfer operators • Move data between systems, such as from Hive to MySQL or
      from S3 to Hive • Built-in: HiveToMySqlTransfer, S3ToHiveTransfer • Community
      contributed: Databricks, AWS, GCP (sketch below)
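    A minimal sketch of a transfer operator, assuming Airflow 1.10-era import
    paths and the default Hive/MySQL Connections; the table names are made up:

        from datetime import datetime
        from airflow import DAG
        from airflow.operators.hive_to_mysql import HiveToMySqlTransfer

        with DAG('transfer_example', start_date=datetime(2018, 1, 1),
                 schedule_interval='@daily') as dag:
            hive_to_mysql = HiveToMySqlTransfer(
                task_id='hive_to_mysql',
                sql='SELECT city, price FROM listings',  # runs against Hive
                mysql_table='listings_summary')          # destination MySQL table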
  26. Sensor operators • Trigger downstream tasks in the dependency graph when a
      certain criterion is met, for example checking that a certain file has become
      available on S3 before using it downstream • Built-in: HivePartitionSensor,
      HttpSensor, S3KeySensor, SqlSensor, FTPSensor, … • Community contributed:
      Databricks, AWS, GCP (sketch below)
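    A minimal sketch of a sensor, assuming 1.10-era import paths and the default
    'aws_default' Connection; bucket and key are made up:

        from datetime import datetime
        from airflow import DAG
        from airflow.sensors.s3_key_sensor import S3KeySensor

        with DAG('sensor_example', start_date=datetime(2018, 1, 1),
                 schedule_interval='@daily') as dag:
            wait_for_file = S3KeySensor(
                task_id='wait_for_file',
                bucket_name='my-ingest-bucket',
                bucket_key='ingest/{{ ds_nodash }}.json.gz',  # templated per run
                poke_interval=60,      # re-check every minute
                timeout=6 * 60 * 60)   # fail the task after six hours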
  27. Advanced example • Wait for a key (a file-like instance on S3) to be present
      in an S3 bucket (S3KeySensor) • Add a new AWS Athena partition
      (AwsAthenaQueryOperator) • Run an AWS Glue job and wait until it’s done
      (AwsGlueScheduleJobOperator & AWSGlueJobSensor) • Notify the #dataservices
      channel on Microsoft Teams (MSTeamsWebhookOperator)
  28. with DAG("commit_dag", default_args=default_args,
               schedule_interval="@weekly") as dag:
          s3_key_sensor = S3KeySensor(
              task_id="s3_key_sensor",
              bucket_name="dataservices-ingest",
              bucket_key="ingest/client_xyz/year={{ execution_date.year }}/"
                         "week={{ execution_date.isocalendar()[1] }}/{{ ds_nodash }}.json.gz")
          aws_athena_update_raw_partition = AwsAthenaQueryOperator(
              query="""
                  ALTER TABLE raw_client_xyz ADD IF NOT EXISTS
                  PARTITION (year='{{ execution_date.year }}',
                             week='{{ execution_date.isocalendar()[1] }}');""",
              task_id="athena_update_raw_partition")
          aws_glue_job_schedule = AwsGlueScheduleJobOperator(
              job_name="commit_job",
              job_args={
                  "--source_database": "dataservices_staging",
                  "--source_table_name": "raw_client_xyz",
                  "--partition_year": "{{ execution_date.year }}",
                  "--partition_week": "{{ execution_date.isocalendar()[1] }}"},
              task_id="aws_glue_job_schedule")
          aws_glue_job_sensor = AWSGlueJobSensor(
              job_name="commit_job",
              job_run_id="{{ task_instance.xcom_pull(task_ids='aws_glue_job_schedule', "
                         "key='aws_glue_job_run_id') }}",
              task_id="aws_glue_job_sensor")
          s3_processed_key_check = S3KeySensor(
              bucket_name="dataservices-processed",
              bucket_key="processed/client_xyz/year={{ execution_date.year }}/"
                         "week={{ execution_date.isocalendar()[1] }}/{{ ds_nodash }}.json.gz",
              task_id="s3_processed_key_check")
          ms_teams_notify = MSTeamsWebhookOperator(
              task_id="ms_teams_notify",
              message="Data from client ✨ xyz ✨ has been processed")

          s3_key_sensor >> aws_athena_update_raw_partition >> aws_glue_job_schedule >> \
              aws_glue_job_sensor >> s3_processed_key_check >> ms_teams_notify
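    The snippet above assumes a default_args dict defined elsewhere; a minimal
    sketch of what it might contain (values are illustrative):

        from datetime import datetime, timedelta

        # Hypothetical defaults; every task in the DAG inherits them.
        default_args = {
            'owner': 'dataservices',
            'start_date': datetime(2018, 1, 1),
            'retries': 3,
            'retry_delay': timedelta(minutes=5),
        }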
  29. What happens if… the ETL script has a bug …and has been running for weeks?
  30. We can travel in time and rerun *only* the related tasks downstream without
      side effects. This is called backfilling.
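    For example, after deploying the fix, the affected date range can be cleared
    and re-run from the CLI (DAG id and dates are illustrative):

        $ airflow clear commit_dag -s 2018-10-01 -e 2018-11-23
        $ airflow backfill commit_dag -s 2018-10-01 -e 2018-11-23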
  31. How everything fits together

  32. The big picture [diagram]: Airflow UI (webserver) · Airflow CLI ·
      Airflow REST API · Metadata DB · Scheduler · Worker 1, Worker 2, … Worker n
  33. Core components • Webserver: Airflow's UI • Scheduler: responsible for
      scheduling jobs • Executor: Local, Celery, Mesos, Kubernetes • Metadata database
  34. Interfaces • Airflow webserver UI • Airflow CLI • Airflow REST API server
      (experimental)
  35. Airflow CLI: $ airflow [-h] <command>
  36. Airflow CLI • Core services: scheduler -n $NUM, webserver -p 80
      • Meta-DB operations: initdb, resetdb, upgradedb • Operate on DAGs: pause,
      unpause, run, trigger_dag, backfill, dag_state, task_state, clear
      • Develop & test: list_dags, list_tasks, variables, render, test
  37. Airflow webserver UI • A quick look into DAG and task progress • Error logging
      • Browse metadata: XComs, Variables, SLAs • Historical stats
  38. Moving data: in theory, all data processing and storage should be done in
      external systems, with Airflow only containing workflow metadata.
  39. Airflow’s metadata storage • Variables: static values, config values, API keys
      • XComs: short for “cross-communication”; communication between tasks, such as
      a file name found by a sensor (sketch below) • Connections: JDBCs, auth, etc.
  40. Batteries included • Failure handling and monitoring: retry policies, SLAs
      • Smarter cron: lets you define more complex schedules, e.g. even/odd days…
      • Complex dependencies (trigger rules) • Backfills • Template system: Jinja
      (sketch below)
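    A minimal sketch of the Jinja template system: Airflow renders macros before
    each run, e.g. {{ ds }} expands to the execution date as YYYY-MM-DD:

        from datetime import datetime
        from airflow import DAG
        from airflow.operators.bash_operator import BashOperator

        with DAG('template_example', start_date=datetime(2018, 1, 1),
                 schedule_interval='@daily') as dag:
            report = BashOperator(
                task_id='daily_report',
                bash_command='echo "building report for {{ ds }}"')  # rendered per run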
  41. Best practices • Idempotency (we’ll talk about this later…) • Logs can be
      piped to remote storage (S3, GCS, …) • Backoff retries (sketch below)
      • Stage transformed data
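    A minimal sketch of backoff retries; these are standard BaseOperator
    arguments, the command itself is made up:

        from datetime import datetime, timedelta
        from airflow import DAG
        from airflow.operators.bash_operator import BashOperator

        with DAG('retry_example', start_date=datetime(2018, 1, 1),
                 schedule_interval='@daily') as dag:
            flaky_task = BashOperator(
                task_id='flaky_task',
                bash_command='curl --fail https://example.com/export',
                retries=5,                          # up to five extra attempts
                retry_delay=timedelta(minutes=2),   # base wait between attempts
                retry_exponential_backoff=True)     # double the wait each retry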
  42. Recipes

  43. Generating dynamic DAGs: Airflow checks every .py file in the DAGs folder…
      …and registers any available DAG defined as a global variable.
  44. for n in range(1, 10):
          dag_id = 'hello_world_{}'.format(str(n))
          default_args = {'owner': 'airflow',
                          'start_date': datetime(2018, 1, 1)}
          schedule = '@daily'
          dag_number = n
          globals()[dag_id] = create_dag(dag_id, schedule, dag_number, default_args)
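    The loop assumes a create_dag factory (as in the Astronomer guide this
    recipe is based on); a minimal sketch:

        from airflow import DAG
        from airflow.operators.python_operator import PythonOperator

        def create_dag(dag_id, schedule, dag_number, default_args):
            dag = DAG(dag_id, schedule_interval=schedule, default_args=default_args)
            with dag:
                PythonOperator(
                    task_id='hello_world',
                    python_callable=lambda: print('Hello from DAG {}'.format(dag_number)))
            return dag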
  45. Generating dynamic DAGs: it’s possible to create a DAG from • a Variable
      value • a static file (YAML, JSON, etc.) • any external source (based on
      Connections)
  46. Data Engineering

  47. Does anybody know about Maslow's hierarchy of needs?
  48. Data literacy, collection, and infrastructure
  49. The role • Design, build, and maintain data warehouses • A data warehouse is
      a place where raw data is transformed and stored in queryable forms • Enable
      higher-level analytics, be it business intelligence, online experimentation,
      or machine learning
  50. Key skills • SQL mastery: if English is the language of business, SQL is the
      language of data • Load data incrementally • Process historic data (backfilling)
      • Partition ingested data • Enforce the idempotency constraint (see the sketch
      below)
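    A minimal sketch of an idempotent, incremental load: the output path embeds
    the execution date, so re-running a day rewrites only that day's partition
    (script, bucket, and layout are made up):

        from datetime import datetime
        from airflow import DAG
        from airflow.operators.bash_operator import BashOperator

        with DAG('incremental_load', start_date=datetime(2018, 1, 1),
                 schedule_interval='@daily') as dag:
            load_partition = BashOperator(
                task_id='load_partition',
                bash_command=('spark-submit etl.py --date {{ ds }} '
                              '--output s3://my-warehouse/listings/ds={{ ds }}/'))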
  51. Can't see the forest for the trees

  52. Airflow empowers data engineers: functional data engineering
  53. Functional data engineering • Reproducible: deterministic and idempotent;
      re-running a task for the same date should always produce the same output
      • Future proof: backfilling, versioning • Data can be repaired by rerunning
      the new code, either by clearing tasks or doing backfills
  54. [image: from Vineet Goel’s “Why Robinhood uses Airflow?”]

  55. ETLs • Extract: where sensors wait for upstream data sources to land
      • Transform: where we apply business logic and perform actions such as
      filtering, grouping, and aggregation to translate raw data into analysis-ready
      datasets • Load: where we load the processed data and transport it to a final
      destination
  56. Airflow deployment: https://github.com/idealista/airflow-role

  57. Airflow deployment [Astronomer]
  58. Airflow deployment [Google Cloud Composer]

  59. Airflow at Idealista • Deployment via Ansible role • AWS Glue & AWS Athena
      plugins • Microsoft Teams plugin • AWS S3 to FTP sync operator
  60. Q&A

  61. Useful resources • Useful SQL queries for Apache Airflow • Astronomer guides
      • Airflow maintenance DAGs • A Beginner’s Guide to Data Engineering by Robert
      Chang • The Rise of the Data Engineer by Maxime Beauchemin
  62. Citations & Attributions
      • Slide 3: image from xkcd (https://xkcd.com/2054/)
      • Slide 4: picture by Andrew Seaman (https://unsplash.com/photos/EuDUHo5yyAg)
      • Slide 8: image from Pinterest (https://www.pinterest.com/pin/377739487476469866/)
      • Slide 9: Apache Airflow logo (https://airflow.apache.org/)
      • Slide 11: logos belong to the respective companies
      • Slide 14: image from Wikipedia (https://en.wikipedia.org/wiki/Directed_acyclic_graph)
      • Slide 31: picture by Garrett Mizunaka (https://unsplash.com/photos/xFjX9rYILo)
      • Slide 33: Flask logo (http://flask.pocoo.org/), Mesos logo (http://mesos.apache.org/), Kubernetes logo (https://kubernetes.io/)
      • Slide 42: picture by Dan Gold (https://unsplash.com/photos/5O1ddenSM4g)
      • Slides 43-45: https://www.astronomer.io/guides/dynamically-generating-dags/
      • Slide 48: Monica Rogati’s “The AI Hierarchy of Needs” (https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007)
      • Slide 49: Robert Chang’s “A Beginner’s Guide to Data Engineering” (https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-i-4227c5c457d7)
      • Slide 50: Maxime Beauchemin’s “The Rise of the Data Engineer” (https://medium.freecodecamp.org/the-rise-of-the-data-engineer-91be18f1e603)
      • Slide 51: picture by Jesse Gardner (https://unsplash.com/photos/mERlBKFGJiQ)
      • Slide 54: Vineet Goel’s “Why Robinhood uses Airflow?” (https://robinhood.engineering/why-robinhood-uses-airflow-aed13a9a90c8)
      • Slide 56: Ansible logo (https://www.ansible.com/)
      • Slide 57: Astronomer logo (https://www.astronomer.io)
      • Slide 58: Google Cloud Platform logo and Google Cloud Composer logo (https://cloud.google.com/composer/)
      • Slide 60: picture by Edwin Andrade (https://unsplash.com/photos/4V1dC_eoCwg)
      • Commit logo (https://2018.commit-conf.com)