
Defining data pipeline workflows using Apache Airflow

juanriaza
Madrid, Commit Conf 2018
November 23, 2018

Transcript

  1. Who am I?
     • Software Developer
     • OSS enthusiast
     • Pythonista & Djangonaut
     • Now trying to tame Gophers
     • Reverse engineering apps
     • Hobbies: cooking and reading
  2. [Timeline diagram: Gather Data, ETL and Report batch jobs scheduled by cron at 22:00, 00:00 and 02:00; when each job finishes in time (22:00, 23:30, 01:00) everything works ☺]
  3. [Timeline diagram: the same Gather Data, ETL and Report jobs, but a delayed or long-running job overlaps the next one ⁉ ⏱⁉]
  4. Cron “hustle”
     • Managing dependencies. Possible overlapping
     • Failure handling. Retries?
     • Error notifications, metrics. Visibility?
     • Unified logs
     • Distributed cron…
     • Do you maintain a calendar of batch jobs?
     • What happens if…?
  5. Overview
     • “A platform to programmatically author, schedule, and monitor workflows”
     • The glue that binds your data ecosystem together
     • Open source, written in Python
     • Started in Oct 2014 by Max Beauchemin at Airbnb
     • Incubating at the Apache Software Foundation since 2016
     • 550+ contributors, 5300+ commits, 9300+ stars
  6. Use cases
     • ETL pipelines
     • Machine learning pipelines
     • Predictive data pipelines: fraud detection, scoring/ranking, classification, recommender systems, etc.
     • General job scheduling: DB back-ups
     • Anything… automate the garage door?
  7. Airflow uses Operators as the fundamental unit of abstraction to define tasks, and uses a DAG to define workflows using a set of operators.
  8. What is a DAG
     • Directed Acyclic Graph
     • Represents a workflow: a set of tasks with a dependency structure
     • Each node represents some form of data processing
  9. import datetime

     from airflow import DAG
     from airflow.operators.bash_operator import BashOperator
     from airflow.operators.python_operator import PythonOperator

     def print_commit():
         print(' commit conf')

     with DAG('commit_dag',
              schedule_interval='@weekly',
              start_date=datetime.datetime(2018, 1, 1)) as dag:  # a start_date is required for scheduling
         print_bash = BashOperator(
             task_id='print_bash',
             bash_command='echo " commit conf"')
         sleep = BashOperator(
             task_id='sleep',
             bash_command='sleep 5')
         print_python = PythonOperator(
             task_id='print_python',
             python_callable=print_commit)

         print_bash >> sleep >> print_python
  10. Operators
      • Action: perform an action locally or make a call to an external system to perform another action
      • Transfer: move data from one system to another
      • Sensor: wait for and detect some condition in a source system
  11. Action operators
      • Perform an action such as executing a Python function or submitting a Spark job
      • Built-in: BashOperator, PythonOperator, DockerOperator, EmailOperator, …
      • Community contributed: Databricks, AWS, GCP
  12. Transfer operators
      • Move data between systems, such as from Hive to MySQL or from S3 to Hive
      • Built-in: HiveToMySqlTransfer, S3ToHiveTransfer
      • Community contributed: Databricks, AWS, GCP
  13. Sensor operators
      • Trigger downstream tasks in the dependency graph when a certain criterion is met, e.g. checking that a certain file has become available on S3 before using it downstream
      • Built-in: HivePartitionSensor, HttpSensor, S3KeySensor, SqlSensor, FTPSensor, …
      • Community contributed: Databricks, AWS, GCP
  14. Advanced example
      • Wait for a key (a file-like instance on S3) to be present in an S3 bucket (S3KeySensor)
      • Add a new AWS Athena partition (AwsAthenaQueryOperator)
      • Run an AWS Glue job and wait until it is done (AwsGlueScheduleJobOperator & AWSGlueJobSensor)
      • Notify the #dataservices channel on Microsoft Teams (MSTeamsWebhookOperator)
  15. with DAG("commit_dag", default_args=default_args, schedule_interval="@weekly") as dag: s3_key_sensor = S3KeySensor( task_id="s3_key_sensor",

    bucket_name="dataservices-ingest", bucket_key="ingest/client_xyz/year={{ execution_date.year }}/" \ "week={{ execution_date.isocalendar()[1] }}/{{ ds_nodash }}.json.gz") aws_athena_update_raw_partition = AwsAthenaQueryOperator( query=""" ALTER TABLE raw_client_xyz ADD IF NOT EXISTS PARTITION (year='{{ execution_date.year }}', week='{{ execution_date.isocalendar()[1] }}');""", task_id="athena_update_raw_partition") aws_glue_job_schedule = AwsGlueScheduleJobOperator( job_name="commit_job", job_args={ "--source_database": "dataservices_staging", "--source_table_name": "raw_client_xyz", "--partition_year": "{{ execution_date.year }}", "--partition_week": "{{ execution_date.isocalendar()[1] }}"}, task_id="aws_glue_job_schedule") aws_glue_job_sensor = AWSGlueJobSensor( job_name="commit_job", job_run_id="{{ task_instance.xcom_pull(task_ids='aws_glue_job_schedule', key='aws_glue_job_run_id') }}", task_id="aws_glue_job_sensor") s3_processed_key_check = S3KeySensor( bucket_name="dataservices-processed", bucket_key="processed/client_xyz/year={{ execution_date.year }}/" \ "week={{ execution_date.isocalendar()[1] }}/{{ ds_nodash }}.json.gz", task_id="s3_processed_key_check") ms_teams_notify = MSTeamsWebhookOperator( task_id="ms_teams_notify", message="Data from client ✨ xyz ✨ has been processed") s3_key_sensor >> aws_athena_update_raw_partition >> aws_glue_job_schedule >> \ aws_glue_job_sensor >> s3_processed_key_check >> ms_teams_notify
  16. We can travel in time and rerun *only* the related tasks downstream, without side effects. This is called backfilling.
  17. The big picture
      [Architecture diagram: Airflow UI (webserver), Airflow CLI, Airflow REST API, Metadata DB, Scheduler, Worker 1 … Worker n]
  18. Core components
      • Webserver: Airflow's UI
      • Scheduler: responsible for scheduling jobs
      • Executor: Local, Celery, Mesos, Kubernetes
      • Metadata Database
  19. Airflow CLI
      • Core services: scheduler -n $NUM, webserver -p 80
      • Meta-DB operations: initdb, resetdb, upgradedb
      • Operate on DAGs: pause, unpause, run, trigger_dag, backfill, dag_state, task_state, clear
      • Develop & test: list_dags, list_tasks, variables, render, test
  20. Airflow webserver UI
      • A quick look into DAG and task progress
      • Error logging
      • Browse metadata: XComs, Variables, SLAs
      • Historical stats
  21. Moving data
      In theory, all data processing and storage should be done in external systems, with Airflow containing only workflow metadata.
  22. Airflow's metadata storage
      • Variables: static values, config values, API keys
      • XComs: short for "cross-communication". Communication between tasks, such as a file name found by a sensor (see the sketch below)
      • Connections: JDBC URLs, auth credentials, etc.
  23. Batteries included
      • Failure handling and monitoring: retry policies, SLAs (see the sketch below)
      • Smarter cron: define more complex schedules, e.g. even/odd days…
      • Complex dependencies (trigger rules)
      • Backfills
      • Template system: Jinja
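A small sketch of what some of these batteries look like in DAG code (a retry policy, an SLA and a Jinja-templated command); the DAG name, owner and bash command are illustrative assumptions:

     from datetime import datetime, timedelta
     from airflow import DAG
     from airflow.operators.bash_operator import BashOperator

     default_args = {
         'owner': 'dataservices',
         'start_date': datetime(2018, 1, 1),
         'retries': 3,                          # retry policy
         'retry_delay': timedelta(minutes=5),
         'sla': timedelta(hours=2),             # alert when the task misses its SLA
     }

     with DAG('batteries_demo', default_args=default_args,
              schedule_interval='@daily') as dag:
         # {{ ds }} is the execution date, rendered by the Jinja template system
         report = BashOperator(
             task_id='daily_report',
             bash_command='generate_report --date {{ ds }}')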
  24. Best practices
      • Idempotency (we'll talk about this later…)
      • Logs can be piped to remote storage (S3, GCS, …)
      • Backoff retries (see the sketch below)
      • Stage transformed data
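For the backoff-retries bullet, a hedged sketch of how a single task can be told to wait longer after each failed attempt; the task and the URL are hypothetical:

     from datetime import timedelta
     from airflow.operators.bash_operator import BashOperator

     flaky_export = BashOperator(
         task_id='flaky_export',
         bash_command='curl --fail https://example.org/export',
         retries=5,
         retry_delay=timedelta(minutes=1),
         retry_exponential_backoff=True,        # back off between retries
         max_retry_delay=timedelta(minutes=30),
         dag=dag)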
  25. Generating dynamic DAGs
      Airflow checks every .py file in the DAGs folder… …and registers any available DAG defined as a global variable.
  26. for n in range(1, 10):
          dag_id = 'hello_world_{}'.format(str(n))
          default_args = {'owner': 'airflow', 'start_date': datetime(2018, 1, 1)}
          schedule = '@daily'
          dag_number = n
          globals()[dag_id] = create_dag(dag_id, schedule, dag_number, default_args)
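The loop above assumes a create_dag factory. A minimal sketch of what it might look like, following the dynamically-generated-DAGs pattern from the Astronomer guide cited at the end; the hello_world task is illustrative:

     from datetime import datetime
     from airflow import DAG
     from airflow.operators.python_operator import PythonOperator

     def create_dag(dag_id, schedule, dag_number, default_args):
         # build the DAG object that the loop registers as a global variable
         dag = DAG(dag_id, schedule_interval=schedule, default_args=default_args)

         def hello_world():
             print('Hello from DAG number {}'.format(dag_number))

         with dag:
             PythonOperator(task_id='hello_world', python_callable=hello_world)

         return dag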
  27. Generating dynamic DAGs
      It's possible to create a DAG from:
      • A Variable value
      • A static file (YAML, JSON, etc.)
      • Any external source (based on Connections)
  28. The role
      • Design, build, and maintain data warehouses
      • A data warehouse is a place where raw data is transformed and stored in query-able forms
      • Enable higher-level analytics, be it business intelligence, online experimentation, or machine learning
  29. Key skills
      • SQL mastery: if English is the language of business, SQL is the language of data
      • Load data incrementally
      • Process historic data (backfilling)
      • Partition ingested data
      • Enforce the idempotency constraint
  30. Functional data engineering
      • Reproducible: deterministic and idempotent
      • Re-running a task for the same date should always produce the same output (see the sketch below)
      • Future proof: backfilling, versioning
      • Data can be repaired by rerunning the new code, either by clearing tasks or doing backfills
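A minimal sketch of an idempotent, date-partitioned load: the output location depends only on the execution date, so re-running or backfilling the same date overwrites the same partition instead of appending. The bucket, key layout and payload are hypothetical:

     import json
     import boto3

     def load_partition(**context):
         ds = context['ds']  # execution date, e.g. '2018-11-23'
         # deterministic output key: the same date always maps to the same key,
         # so re-runs and backfills overwrite the partition rather than append
         key = 'processed/client_xyz/ds={}/data.json'.format(ds)
         rows = [{'ds': ds, 'value': 42}]  # stand-in for the real transformation
         boto3.client('s3').put_object(
             Bucket='dataservices-processed',
             Key=key,
             Body=json.dumps(rows).encode('utf-8'))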
  31. ETLs
      • Extract: sensors wait for upstream data sources to land
      • Transform: apply business logic and perform actions such as filtering, grouping, and aggregation to translate raw data into analysis-ready datasets
      • Load: load the processed data and transport it to a final destination
  32. Airflow at Idealista
      • Deployment via Ansible role
      • AWS Glue & AWS Athena plugins
      • Microsoft Teams plugin
      • AWS S3 to FTP sync operator
  33. Q&A

  34. Useful resources
      • Useful SQL queries for Apache Airflow
      • Astronomer guides
      • Airflow maintenance DAGs
      • A Beginner's Guide to Data Engineering by Robert Chang
      • The Rise of the Data Engineer by Maxime Beauchemin
  35. Citations & Attributions
      • Slide 3: image from XKCD (https://xkcd.com/2054/)
      • Slide 4: picture by Andrew Seaman (https://unsplash.com/photos/EuDUHo5yyAg)
      • Slide 8: image from Pinterest (https://www.pinterest.com/pin/377739487476469866/)
      • Slide 9: Apache Airflow logo (https://airflow.apache.org/)
      • Slide 11: logos belong to the respective companies
      • Slide 14: image from Wikipedia (https://en.wikipedia.org/wiki/Directed_acyclic_graph)
      • Slide 31: picture by Garett Mizunaka (https://unsplash.com/photos/xFjX9rYILo)
      • Slide 33: Flask logo (http://flask.pocoo.org/), Mesos logo (http://mesos.apache.org/), Kubernetes logo (https://kubernetes.io/)
      • Slide 42: picture by Dan Gold (https://unsplash.com/photos/5O1ddenSM4g)
      • Slides 43-45: https://www.astronomer.io/guides/dynamically-generating-dags/
      • Slide 48: Monica Rogati's "The AI Hierarchy of Needs" (https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007)
      • Slide 49: Robert Chang's "A Beginner's Guide to Data Engineering" (https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-i-4227c5c457d7)
      • Slide 50: Maxime Beauchemin's "The Rise of the Data Engineer" (https://medium.freecodecamp.org/the-rise-of-the-data-engineer-91be18f1e603)
      • Slide 51: picture by Jesse Gardner (https://unsplash.com/photos/mERlBKFGJiQ)
      • Slide 54: Vineet Goel's "Why Robinhood uses Airflow?" (https://robinhood.engineering/why-robinhood-uses-airflow-aed13a9a90c8)
      • Slide 56: Ansible logo (https://www.ansible.com/)
      • Slide 57: Astronomer logo (https://www.astronomer.io)
      • Slide 58: Google Cloud Platform logo and Google Cloud Composer logo (https://cloud.google.com/composer/)
      • Slide 60: picture by Edwin Andrade (https://unsplash.com/photos/4V1dC_eoCwg)
      • Commit logo (https://2018.commit-conf.com)