
Defining data pipeline workflows using Apache Airflow

juanriaza
Madrid, Commit Conf 2018
November 23, 2018

Transcript

  1. Who am I?
     • Software Developer
     • OSS enthusiast
     • Pythonista & Djangonaut
     • Now trying to tame Gophers
     • Reverse engineering apps
     • Hobbies: cooking and reading
  2. [Timeline diagram: Gather Data, ETL and Report batch jobs scheduled by cron at 22:00, 00:00 and 02:00; when each job finishes in time (22:00, 23:30, 01:00) everything works ☺]
  3. [Timeline diagram: the same Gather Data, ETL and Report jobs, but a delayed or long-running job overlaps the next one ⁉ ⏱⁉]
  4. Cron “hustle”
     • Managing dependencies. Possible overlapping
     • Failure handling. Retries?
     • Error notifications, metrics. Visibility?
     • Unified logs
     • Distributed cron…
     • Do you maintain a calendar of batch jobs?
     • What happens if…?
  5. Overview
     • “A platform to programmatically author, schedule, and monitor workflows”
     • The glue that binds your data ecosystem together
     • Open source, written in Python
     • Started in Oct 2014 by Max Beauchemin at Airbnb
     • Incubating at the Apache Software Foundation since 2016
     • 550+ contributors, 5300+ commits, 9300+ stars
  6. Use cases
     • ETL pipelines
     • Machine learning pipelines
     • Predictive data pipelines: fraud detection, scoring/ranking, classification, recommender systems, etc.
     • General job scheduling: DB back-ups
     • Anything… automate the garage door?
  7. Airflow uses Operators as the fundamental unit of abstraction to define tasks, and uses a DAG to define workflows using a set of operators.
  8. What is a DAG
     • Directed Acyclic Graph
     • Represents a workflow: a set of tasks with a dependency structure
     • Each node represents some form of data processing
  9. import datetime

     from airflow import DAG
     from airflow.operators.bash_operator import BashOperator
     from airflow.operators.python_operator import PythonOperator

     def print_commit():
         print(' commit conf')

     with DAG('commit_dag',
              schedule_interval='@weekly',
              start_date=datetime.datetime(2018, 1, 1)) as dag:  # a start_date is required for scheduling
         print_bash = BashOperator(
             task_id='print_bash',
             bash_command='echo " commit conf"')
         sleep = BashOperator(
             task_id='sleep',
             bash_command='sleep 5')
         print_python = PythonOperator(
             task_id='print_python',
             python_callable=print_commit)

         print_bash >> sleep >> print_python
  10. Operators
      • Action: perform an action locally or make a call to an external system to perform another action
      • Transfer: move data from one system to another
      • Sensor: wait for and detect some condition in a source system
  11. Action operators
      • Perform an action such as executing a Python function or submitting a Spark job
      • Built-in: BashOperator, PythonOperator, DockerOperator, EmailOperator, …
      • Community contributed: Databricks, AWS, GCP
  12. Transfer operators
      • Move data between systems, such as from Hive to MySQL or from S3 to Hive
      • Built-in: HiveToMySqlTransfer, S3ToHiveTransfer
      • Community contributed: Databricks, AWS, GCP
  13. Sensor operators
      • Trigger downstream tasks in the dependency graph when a certain criterion is met, e.g. checking that a certain file has become available on S3 before using it downstream
      • Built-in: HivePartitionSensor, HttpSensor, S3KeySensor, SqlSensor, FTPSensor, …
      • Community contributed: Databricks, AWS, GCP
  14. Advanced example
      • Wait for a key (a file-like instance on S3) to be present in an S3 bucket (S3KeySensor)
      • Add a new AWS Athena partition (AwsAthenaQueryOperator)
      • Run an AWS Glue job and wait until it is done (AwsGlueScheduleJobOperator & AWSGlueJobSensor)
      • Notify the #dataservices channel on Microsoft Teams (MSTeamsWebhookOperator)
  15. with DAG("commit_dag", default_args=default_args, schedule_interval="@weekly") as dag: s3_key_sensor = S3KeySensor( task_id="s3_key_sensor",

    bucket_name="dataservices-ingest", bucket_key="ingest/client_xyz/year={{ execution_date.year }}/" \ "week={{ execution_date.isocalendar()[1] }}/{{ ds_nodash }}.json.gz") aws_athena_update_raw_partition = AwsAthenaQueryOperator( query=""" ALTER TABLE raw_client_xyz ADD IF NOT EXISTS PARTITION (year='{{ execution_date.year }}', week='{{ execution_date.isocalendar()[1] }}');""", task_id="athena_update_raw_partition") aws_glue_job_schedule = AwsGlueScheduleJobOperator( job_name="commit_job", job_args={ "--source_database": "dataservices_staging", "--source_table_name": "raw_client_xyz", "--partition_year": "{{ execution_date.year }}", "--partition_week": "{{ execution_date.isocalendar()[1] }}"}, task_id="aws_glue_job_schedule") aws_glue_job_sensor = AWSGlueJobSensor( job_name="commit_job", job_run_id="{{ task_instance.xcom_pull(task_ids='aws_glue_job_schedule', key='aws_glue_job_run_id') }}", task_id="aws_glue_job_sensor") s3_processed_key_check = S3KeySensor( bucket_name="dataservices-processed", bucket_key="processed/client_xyz/year={{ execution_date.year }}/" \ "week={{ execution_date.isocalendar()[1] }}/{{ ds_nodash }}.json.gz", task_id="s3_processed_key_check") ms_teams_notify = MSTeamsWebhookOperator( task_id="ms_teams_notify", message="Data from client ✨ xyz ✨ has been processed") s3_key_sensor >> aws_athena_update_raw_partition >> aws_glue_job_schedule >> \ aws_glue_job_sensor >> s3_processed_key_check >> ms_teams_notify
  16. We can travel in time and rerun *only* the related tasks downstream, without side effects. This is called backfilling.
  17. The big picture
      [Architecture diagram: Airflow UI (webserver), Airflow CLI, Airflow REST API, Metadata DB, Scheduler, Worker 1 … Worker n]
  18. Core components
      • Webserver: Airflow's UI
      • Scheduler: responsible for scheduling jobs
      • Executor: Local, Celery, Mesos, Kubernetes
      • Metadata Database
  19. Airflow CLI
      • Core services: scheduler -n $NUM, webserver -p 80
      • Meta-DB operations: initdb, resetdb, upgradedb
      • Operate on DAGs: pause, unpause, run, trigger_dag, backfill, dag_state, task_state, clear
      • Develop & test: list_dags, list_tasks, variables, render, test
  20. Airflow webserver UI
      • A quick look into DAG and task progress
      • Error logging
      • Browse metadata: XComs, Variables, SLAs
      • Historical stats
  21. Moving data
      In theory, all data processing and storage should be done in external systems, with Airflow containing only workflow metadata.
  22. Airflow's metadata storage
      • Variables: static values, config values, API keys
      • XComs: short for "cross-communication". Communication between tasks, such as a file name found by a sensor (see the sketch below)
      • Connections: JDBC URLs, auth credentials, etc.
  23. Batteries included
      • Failure handling and monitoring: retry policies, SLAs (see the sketch below)
      • Smarter cron: define more complex schedules, e.g. even/odd days…
      • Complex dependencies (trigger rules)
      • Backfills
      • Template system: Jinja
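A small sketch of what some of these batteries look like in DAG code (a retry policy, an SLA and a Jinja-templated command); the DAG name, owner and bash command are illustrative assumptions:

     from datetime import datetime, timedelta
     from airflow import DAG
     from airflow.operators.bash_operator import BashOperator

     default_args = {
         'owner': 'dataservices',
         'start_date': datetime(2018, 1, 1),
         'retries': 3,                          # retry policy
         'retry_delay': timedelta(minutes=5),
         'sla': timedelta(hours=2),             # alert when the task misses its SLA
     }

     with DAG('batteries_demo', default_args=default_args,
              schedule_interval='@daily') as dag:
         # {{ ds }} is the execution date, rendered by the Jinja template system
         report = BashOperator(
             task_id='daily_report',
             bash_command='generate_report --date {{ ds }}')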
  24. Best practices
      • Idempotency (we'll talk about this later…)
      • Logs can be piped to remote storage (S3, GCS, …)
      • Backoff retries (see the sketch below)
      • Stage transformed data
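For the backoff-retries bullet, a hedged sketch of how a single task can be told to wait longer after each failed attempt; the task and the URL are hypothetical:

     from datetime import timedelta
     from airflow.operators.bash_operator import BashOperator

     flaky_export = BashOperator(
         task_id='flaky_export',
         bash_command='curl --fail https://example.org/export',
         retries=5,
         retry_delay=timedelta(minutes=1),
         retry_exponential_backoff=True,        # back off between retries
         max_retry_delay=timedelta(minutes=30),
         dag=dag)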
  25. Generating dynamic DAGs
      Airflow checks every .py file in the DAGs folder… …and registers any available DAG defined as a global variable.
  26. for n in range(1, 10):
          dag_id = 'hello_world_{}'.format(str(n))
          default_args = {'owner': 'airflow', 'start_date': datetime(2018, 1, 1)}
          schedule = '@daily'
          dag_number = n
          globals()[dag_id] = create_dag(dag_id, schedule, dag_number, default_args)
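The loop above assumes a create_dag factory. A minimal sketch of what it might look like, following the dynamically-generated-DAGs pattern from the Astronomer guide cited at the end; the hello_world task is illustrative:

     from datetime import datetime
     from airflow import DAG
     from airflow.operators.python_operator import PythonOperator

     def create_dag(dag_id, schedule, dag_number, default_args):
         # build the DAG object that the loop registers as a global variable
         dag = DAG(dag_id, schedule_interval=schedule, default_args=default_args)

         def hello_world():
             print('Hello from DAG number {}'.format(dag_number))

         with dag:
             PythonOperator(task_id='hello_world', python_callable=hello_world)

         return dag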
  27. Generating dynamic DAGs
      It's possible to create a DAG from:
      • A Variable value
      • A static file (YAML, JSON, etc.)
      • Any external source (based on Connections)
  28. The role
      • Design, build, and maintain data warehouses
      • A data warehouse is a place where raw data is transformed and stored in query-able forms
      • Enable higher-level analytics, be it business intelligence, online experimentation, or machine learning
  29. Key skills
      • SQL mastery: if English is the language of business, SQL is the language of data
      • Load data incrementally
      • Process historic data (backfilling)
      • Partition ingested data
      • Enforce the idempotency constraint
  30. Functional data engineering
      • Reproducible: deterministic and idempotent
      • Re-running a task for the same date should always produce the same output (see the sketch below)
      • Future proof: backfilling, versioning
      • Data can be repaired by rerunning the new code, either by clearing tasks or doing backfills
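A minimal sketch of an idempotent, date-partitioned load: the output location depends only on the execution date, so re-running or backfilling the same date overwrites the same partition instead of appending. The bucket, key layout and payload are hypothetical:

     import json
     import boto3

     def load_partition(**context):
         ds = context['ds']  # execution date, e.g. '2018-11-23'
         # deterministic output key: the same date always maps to the same key,
         # so re-runs and backfills overwrite the partition rather than append
         key = 'processed/client_xyz/ds={}/data.json'.format(ds)
         rows = [{'ds': ds, 'value': 42}]  # stand-in for the real transformation
         boto3.client('s3').put_object(
             Bucket='dataservices-processed',
             Key=key,
             Body=json.dumps(rows).encode('utf-8'))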
  31. ETLs
      • Extract: sensors wait for upstream data sources to land
      • Transform: apply business logic and perform actions such as filtering, grouping, and aggregation to translate raw data into analysis-ready datasets
      • Load: load the processed data and transport it to a final destination
  32. Airflow at Idealista
      • Deployment via Ansible role
      • AWS Glue & AWS Athena plugins
      • Microsoft Teams plugin
      • AWS S3 to FTP sync operator
  33. Q&A

  34. Useful resources
      • Useful SQL queries for Apache Airflow
      • Astronomer guides
      • Airflow maintenance DAGs
      • A Beginner's Guide to Data Engineering by Robert Chang
      • The Rise of the Data Engineer by Maxime Beauchemin
  35. Citations & Attributions
      • Slide 3: image from XKCD (https://xkcd.com/2054/)
      • Slide 4: picture by Andrew Seaman (https://unsplash.com/photos/EuDUHo5yyAg)
      • Slide 8: image from Pinterest (https://www.pinterest.com/pin/377739487476469866/)
      • Slide 9: Apache Airflow logo (https://airflow.apache.org/)
      • Slide 11: logos belong to the respective companies
      • Slide 14: image from Wikipedia (https://en.wikipedia.org/wiki/Directed_acyclic_graph)
      • Slide 31: picture by Garett Mizunaka (https://unsplash.com/photos/xFjX9rYILo)
      • Slide 33: Flask logo (http://flask.pocoo.org/), Mesos logo (http://mesos.apache.org/), Kubernetes logo (https://kubernetes.io/)
      • Slide 42: picture by Dan Gold (https://unsplash.com/photos/5O1ddenSM4g)
      • Slides 43-45: https://www.astronomer.io/guides/dynamically-generating-dags/
      • Slide 48: Monica Rogati's "The AI Hierarchy of Needs" (https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007)
      • Slide 49: Robert Chang's "A Beginner's Guide to Data Engineering" (https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-i-4227c5c457d7)
      • Slide 50: Maxime Beauchemin's "The Rise of the Data Engineer" (https://medium.freecodecamp.org/the-rise-of-the-data-engineer-91be18f1e603)
      • Slide 51: picture by Jesse Gardner (https://unsplash.com/photos/mERlBKFGJiQ)
      • Slide 54: Vineet Goel's "Why Robinhood uses Airflow?" (https://robinhood.engineering/why-robinhood-uses-airflow-aed13a9a90c8)
      • Slide 56: Ansible logo (https://www.ansible.com/)
      • Slide 57: Astronomer logo (https://www.astronomer.io)
      • Slide 58: Google Cloud Platform logo and Google Cloud Composer logo (https://cloud.google.com/composer/)
      • Slide 60: picture by Edwin Andrade (https://unsplash.com/photos/4V1dC_eoCwg)
      • Commit logo (https://2018.commit-conf.com)