Slide 1

Defining data pipeline workflows using Apache Airflow
MAD · NOV 23-24 · 2018
Juan Riaza · [email protected] · @juanriaza

Slide 2

Who am I?
• Software Developer
• OSS enthusiast
• Pythonista & Djangonaut
• Now trying to tame Gophers
• Reverse engineering apps
• Hobbies: cooking and reading

Slide 3

[Image: xkcd #2054, "Data Pipeline" (https://xkcd.com/2054/)]

Slide 4

In the beginning there was cron…

Slide 5

[Diagram: a cron timeline. Gather Data runs at 22:00, ETL at 00:00, and Report at 02:00; when Gather Data finishes by 23:30 and ETL by 01:00, every job has its input ready on time ☺]

Slide 6

[Diagram: the same timeline when jobs overrun. Gather Data runs past 23:30, but ETL and Report still start at their fixed times and overlap the late upstream jobs ⁉⏱]

Slide 7

Cron "hustle"
• Managing dependencies: possible overlapping
• Failure handling: retries?
• Error notifications, metrics: visibility?
• Unified logs
• Distributed cron…
• Do you maintain a calendar of batch jobs?
• What happens if…?

Slide 8

Every day is a new adventure…

Slide 9

Meet Airflow

Slide 10

Overview
• "A platform to programmatically author, schedule, and monitor workflows"
• The glue that binds your data ecosystem together
• Open source, written in Python
• Started in October 2014 by Maxime Beauchemin at Airbnb
• Incubating at the Apache Software Foundation since 2016
• 550+ contributors, 5300+ commits, 9300+ stars

Slide 11

[Image: company logos]

Slide 12

Use cases
• ETL pipelines
• Machine learning pipelines
• Predictive data pipelines: fraud detection, scoring/ranking, classification, recommender systems, etc.
• General job scheduling: DB back-ups
• Anything… automate the garage door?

Slide 13

Airflow uses Operators as the fundamental unit of abstraction to define tasks, and uses a DAG to define workflows using a set of operators.

Slide 14

What is a DAG?
• Directed Acyclic Graph
• Represents a workflow: a set of tasks with a dependency structure (see the sketch below)
• Each node represents some form of data processing
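
Because a DAG's dependencies form a graph rather than a simple chain, tasks can fan out and fan in. A minimal sketch of a diamond-shaped workflow; the DAG and task ids are illustrative, not from the talk:

import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

with DAG('diamond_dag',
         start_date=datetime.datetime(2018, 1, 1),
         schedule_interval='@daily') as dag:
    extract = DummyOperator(task_id='extract')
    clean = DummyOperator(task_id='clean')    # runs in parallel with enrich
    enrich = DummyOperator(task_id='enrich')
    report = DummyOperator(task_id='report')  # waits for both branches

    # extract fans out to two independent tasks, which both feed report
    extract >> clean >> report
    extract >> enrich >> report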

Slide 15

What does it look like?

Slide 16

err…

Slide 17

How it’s made

Slide 18

import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator


def print_commit():
    print('commit conf')


with DAG('commit_dag',
         # a start_date is required for the scheduler to plan runs
         start_date=datetime.datetime(2018, 1, 1),
         schedule_interval='@weekly') as dag:

    print_bash = BashOperator(
        task_id='print_bash',
        bash_command='echo "commit conf"')

    sleep = BashOperator(
        task_id='sleep',
        bash_command='sleep 5')

    print_python = PythonOperator(
        task_id='print_python',
        python_callable=print_commit)

    print_bash >> sleep >> print_python

Slide 19

Airflow UI [screenshot]

Slide 20

Airflow UI [screenshot]

Slide 21

Airflow UI [screenshot]

Slide 22

Airflow UI [screenshot]

Slide 23

Operators
• Action: perform an action locally or make a call to an external system to perform another action
• Transfer: move data from one system to another
• Sensor: wait for and detect some condition in a source system

Slide 24

Action operators
• Perform an action such as executing a Python function or submitting a Spark job
• Built-in: BashOperator, PythonOperator, DockerOperator, EmailOperator, …
• Community contributed: Databricks, AWS, GCP

Slide 25

Transfer operators
• Move data between systems, such as from Hive to MySQL or from S3 to Hive
• Built-in: HiveToMySqlTransfer, S3ToHiveTransfer
• Community contributed: Databricks, AWS, GCP

Slide 26

Sensor operators
• Trigger downstream tasks in the dependency graph when a certain criterion is met, for example checking that a certain file has become available on S3 before using it downstream
• Built-in: HivePartitionSensor, HttpSensor, S3KeySensor, SqlSensor, FTPSensor, …
• Community contributed: Databricks, AWS, GCP

Slide 27

Advanced example
• S3KeySensor: wait for a key (a file-like object on S3) to be present in an S3 bucket
• AwsAthenaQueryOperator: add a new AWS Athena partition
• AwsGlueScheduleJobOperator & AWSGlueJobSensor: run an AWS Glue job and wait until it's done
• MSTeamsWebhookOperator: notify the #dataservices channel on Microsoft Teams

Slide 28

# default_args and the AwsAthena*/AwsGlue*/MSTeams* operators come from the
# speaker's custom plugins (see slide 59); S3KeySensor is built in.
with DAG("commit_dag",
         default_args=default_args,
         schedule_interval="@weekly") as dag:

    s3_key_sensor = S3KeySensor(
        task_id="s3_key_sensor",
        bucket_name="dataservices-ingest",
        bucket_key="ingest/client_xyz/year={{ execution_date.year }}/"
                   "week={{ execution_date.isocalendar()[1] }}/{{ ds_nodash }}.json.gz")

    aws_athena_update_raw_partition = AwsAthenaQueryOperator(
        query="""
            ALTER TABLE raw_client_xyz ADD IF NOT EXISTS
            PARTITION (year='{{ execution_date.year }}',
                       week='{{ execution_date.isocalendar()[1] }}');""",
        task_id="athena_update_raw_partition")

    aws_glue_job_schedule = AwsGlueScheduleJobOperator(
        job_name="commit_job",
        job_args={
            "--source_database": "dataservices_staging",
            "--source_table_name": "raw_client_xyz",
            "--partition_year": "{{ execution_date.year }}",
            "--partition_week": "{{ execution_date.isocalendar()[1] }}"},
        task_id="aws_glue_job_schedule")

    aws_glue_job_sensor = AWSGlueJobSensor(
        job_name="commit_job",
        job_run_id="{{ task_instance.xcom_pull(task_ids='aws_glue_job_schedule', "
                   "key='aws_glue_job_run_id') }}",
        task_id="aws_glue_job_sensor")

    s3_processed_key_check = S3KeySensor(
        bucket_name="dataservices-processed",
        bucket_key="processed/client_xyz/year={{ execution_date.year }}/"
                   "week={{ execution_date.isocalendar()[1] }}/{{ ds_nodash }}.json.gz",
        task_id="s3_processed_key_check")

    ms_teams_notify = MSTeamsWebhookOperator(
        task_id="ms_teams_notify",
        message="Data from client ✨ xyz ✨ has been processed")

    s3_key_sensor >> aws_athena_update_raw_partition >> aws_glue_job_schedule >> \
        aws_glue_job_sensor >> s3_processed_key_check >> ms_teams_notify

Slide 29

What happens if… the ETL script has a bug… and has been running for weeks?

Slide 30

We can travel in time and rerun *only* the related tasks downstream, without side effects. This is called backfilling.
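
In practice this is driven from the Airflow 1.x CLI: clear the affected task instances (the scheduler re-runs anything cleared) or backfill a date range from scratch. A sketch against the commit_dag from the earlier example; the dates here are hypothetical:

$ # clear the buggy Glue tasks, and everything downstream of them, for past runs
$ airflow clear -t 'aws_glue_.*' -d -s 2018-10-01 -e 2018-11-01 commit_dag
$ # or explicitly re-run the whole DAG for a past date range
$ airflow backfill -s 2018-10-01 -e 2018-11-01 commit_dag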

Slide 31

How everything fits together

Slide 32

The big picture
[Diagram: the Airflow UI (webserver), the Airflow CLI, and the Airflow REST API sit on top of the Metadata DB; the Scheduler dispatches work to Worker 1, Worker 2, … Worker n]

Slide 33

Core components
• Webserver: Airflow's UI
• Scheduler: responsible for scheduling jobs
• Executor: Local, Celery, Mesos, Kubernetes
• Metadata database

Slide 34

Interfaces
• Airflow webserver UI
• Airflow CLI
• Airflow REST API server (experimental)

Slide 35

Airflow CLI
$ airflow [-h]

Slide 36

Airflow CLI
• Core services: webserver -p 80, scheduler -n $NUM
• Meta-DB operations: initdb, resetdb, upgradedb
• Operate on DAGs: pause, unpause, run, trigger_dag, backfill, dag_state, task_state, clear
• Develop & test: list_dags, list_tasks, variables, render, test (example below)
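
A typical develop-and-test loop with those commands, using the DAG and task from the earlier example; the date is illustrative:

$ airflow list_dags                               # confirm the DAG file parses and registers
$ airflow list_tasks commit_dag --tree            # inspect the dependency tree
$ airflow test commit_dag print_bash 2018-11-23   # run one task, no scheduler, no state recorded
$ airflow trigger_dag commit_dag                  # queue a manual run outside the schedule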

Slide 37

Airflow webserver UI
• A quick look into DAG and task progress
• Error logging
• Browse metadata: XComs, Variables, SLAs
• Historical stats

Slide 38

Moving data
In theory, all data processing and storage should happen in external systems, with Airflow containing only workflow metadata.

Slide 39

Airflow's metadata storage
• Variables: static values, config values, API keys
• XComs: short for "cross-communication"; communication between tasks, such as a file name found by a sensor (see the sketch below)
• Connections: JDBC strings, auth credentials, etc.
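
A hedged sketch of how tasks touch this metadata from Python in Airflow 1.x; the variable name, keys, and values are hypothetical:

from airflow.models import Variable


def extract(**context):
    # Variables: static config read from the metadata DB
    api_key = Variable.get('client_xyz_api_key')
    print('calling the client API with key', api_key)
    # XComs: push a small value for downstream tasks
    context['ti'].xcom_push(key='file_name', value='20181123.json.gz')


def load(**context):
    # pull the value the upstream task pushed
    file_name = context['ti'].xcom_pull(task_ids='extract', key='file_name')
    print('loading', file_name)

# wire these up with PythonOperator(..., provide_context=True) in Airflow 1.x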

Slide 40

Batteries included
• Failure handling and monitoring: retry policies, SLAs (sketch below)
• Smarter cron: allows defining more complex schedules, such as even/odd days…
• Complex dependencies (trigger rules)
• Backfills
• Template system: Jinja
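
As referenced above, a hedged sketch of how these batteries are switched on; the DAG name, schedule, and values are illustrative:

import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'airflow',
    'retries': 3,                                  # retry policy
    'retry_delay': datetime.timedelta(minutes=5),
    'sla': datetime.timedelta(hours=2),            # alert when a task misses its deadline
}

with DAG('batteries_dag',
         default_args=default_args,
         start_date=datetime.datetime(2018, 1, 1),
         schedule_interval='0 4 * * 1,3,5') as dag:  # "smarter cron": Mon/Wed/Fri at 04:00
    work = DummyOperator(task_id='work')
    # trigger rule: run even if an upstream task failed
    cleanup = DummyOperator(task_id='cleanup', trigger_rule='all_done')

    work >> cleanup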

Slide 41

Best practices
• Idempotency (we'll talk about this later…)
• Logs can be piped to remote storage (S3, GCS, …) (config sketch below)
• Backoff retries
• Stage transformed data
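
A hedged sketch for the logging and retry bullets, using Airflow 1.10 settings; the bucket name and connection id are hypothetical:

# airflow.cfg: ship task logs to S3 instead of the local disk
[core]
remote_logging = True
remote_base_log_folder = s3://dataservices-logs/airflow
remote_log_conn_id = aws_default

Backoff retries are per-task BaseOperator arguments (BashOperator and datetime imported as in the earlier examples):

flaky = BashOperator(
    task_id='flaky_call',
    bash_command='curl -sf https://example.com/api',
    retries=5,
    retry_delay=datetime.timedelta(seconds=30),
    retry_exponential_backoff=True,                 # grow the delay on each retry
    max_retry_delay=datetime.timedelta(minutes=10))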

Slide 42

Recipes

Slide 43

Generating dynamic DAGs
Airflow checks every .py file in the DAGs folder…
…and registers any available DAG defined as a global variable

Slide 44

for n in range(1, 10):
    dag_id = 'hello_world_{}'.format(str(n))
    default_args = {'owner': 'airflow',
                    'start_date': datetime(2018, 1, 1)}
    schedule = '@daily'
    dag_number = n
    # expose each DAG as a module-level global so the scheduler picks it up
    globals()[dag_id] = create_dag(dag_id, schedule, dag_number, default_args)
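
The loop depends on a create_dag factory that the slide omits; a minimal sketch in the spirit of the Astronomer guide cited at the end (the task body is illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def create_dag(dag_id, schedule, dag_number, default_args):
    # build and return a DAG object; the caller registers it in globals()
    dag = DAG(dag_id, schedule_interval=schedule, default_args=default_args)
    with dag:
        PythonOperator(
            task_id='hello_world',
            python_callable=lambda: print('Hello from DAG {}'.format(dag_number)))
    return dag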

Slide 45

Generating dynamic DAGs
It's possible to create a DAG from:
• a Variable value
• a static file (YAML, JSON, etc.)
• any external source (based on Connections)

Slide 46

Data Engineering

Slide 47

Does anybody know about Maslow's hierarchy of needs?

Slide 48

Data literacy, collection, and infrastructure

Slide 49

The role
• Design, build, and maintain data warehouses
• A data warehouse is a place where raw data is transformed and stored in queryable form
• Enable higher-level analytics, be it business intelligence, online experimentation, or machine learning

Slide 50

Key skills
• SQL mastery: if English is the language of business, SQL is the language of data
• Load data incrementally
• Process historic data (backfilling)
• Partition ingested data
• Enforce the idempotency constraint

Slide 51

Can't see the forest for the trees

Slide 52

Functional data engineering
Airflow empowers data engineers

Slide 53

Functional data engineering
• Reproducible: deterministic and idempotent
• Re-running a task for the same date should always produce the same output (sketch below)
• Future-proof: backfilling, versioning
• Data can be repaired by rerunning the new code, either by clearing tasks or doing backfills
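
As an illustration of that idempotency, a hedged sketch: the task below overwrites exactly one date partition keyed on the templated execution date, so rerunning it for the same date replaces rather than duplicates its output. The table, columns, and connection id are hypothetical:

from airflow.operators.postgres_operator import PostgresOperator

# delete-then-insert scoped to a single execution date: safe to rerun
# (declared inside a DAG context, as in the earlier examples)
rebuild_daily_summary = PostgresOperator(
    task_id='rebuild_daily_summary',
    postgres_conn_id='analytics_db',
    sql="""
        DELETE FROM daily_summary WHERE day = '{{ ds }}';
        INSERT INTO daily_summary
        SELECT '{{ ds }}', client_id, count(*)
        FROM events
        WHERE created_at::date = '{{ ds }}'
        GROUP BY client_id;
    """)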

Slide 54

[Image: from Vineet Goel's "Why Robinhood uses Airflow?"]

Slide 55

ETLs
• Extract: sensors wait for upstream data sources to land
• Transform: apply business logic and perform actions such as filtering, grouping, and aggregation to translate raw data into analysis-ready datasets
• Load: move the processed data to its final destination (skeleton below)
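
Mapped onto Airflow, that pattern is typically a sensor followed by transform and load tasks. A minimal hedged skeleton using Airflow 1.x import paths; the bucket, key, and callables are illustrative:

import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.sensors import S3KeySensor


def transform():
    # business logic: filter, group, aggregate into analysis-ready form
    pass


def load():
    # move the processed data to its final destination
    pass


with DAG('etl_skeleton',
         start_date=datetime.datetime(2018, 1, 1),
         schedule_interval='@daily') as dag:
    # Extract: wait for the upstream data to land
    extract = S3KeySensor(
        task_id='extract',
        bucket_name='dataservices-ingest',
        bucket_key='ingest/{{ ds_nodash }}.json.gz')

    transform_task = PythonOperator(task_id='transform', python_callable=transform)
    load_task = PythonOperator(task_id='load', python_callable=load)

    extract >> transform_task >> load_task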

Slide 56

Airflow deployment
Ansible role: https://github.com/idealista/airflow-role

Slide 57

Airflow deployment
[Astronomer logo]

Slide 58

Airflow deployment
[Google Cloud Platform and Google Cloud Composer logos]

Slide 59

Airflow at Idealista
• Deployment via an Ansible role
• AWS Glue & AWS Athena plugins
• Microsoft Teams plugin
• AWS S3 to FTP sync operator

Slide 60

Q&A

Slide 61

Useful resources
• Useful SQL queries for Apache Airflow
• Astronomer guides
• Airflow maintenance DAGs
• "A Beginner's Guide to Data Engineering" by Robert Chang
• "The Rise of the Data Engineer" by Maxime Beauchemin

Slide 62

Citations & Attributions
• Slide 3: image from xkcd (https://xkcd.com/2054/)
• Slide 4: picture by Andrew Seaman (https://unsplash.com/photos/EuDUHo5yyAg)
• Slide 8: image from Pinterest (https://www.pinterest.com/pin/377739487476469866/)
• Slide 9: Apache Airflow logo (https://airflow.apache.org/)
• Slide 11: logos belong to the respective companies
• Slide 14: image from Wikipedia (https://en.wikipedia.org/wiki/Directed_acyclic_graph)
• Slide 31: picture by Garrett Mizunaka (https://unsplash.com/photos/xFjX9rYILo)
• Slide 33: Flask logo (http://flask.pocoo.org/), Mesos logo (http://mesos.apache.org/), Kubernetes logo (https://kubernetes.io/)
• Slide 42: picture by Dan Gold (https://unsplash.com/photos/5O1ddenSM4g)
• Slides 43-45: https://www.astronomer.io/guides/dynamically-generating-dags/
• Slide 48: Monica Rogati's "The AI Hierarchy of Needs" (https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007)
• Slide 49: Robert Chang's "A Beginner's Guide to Data Engineering" (https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-i-4227c5c457d7)
• Slide 50: Maxime Beauchemin's "The Rise of the Data Engineer" (https://medium.freecodecamp.org/the-rise-of-the-data-engineer-91be18f1e603)
• Slide 51: picture by Jesse Gardner (https://unsplash.com/photos/mERlBKFGJiQ)
• Slide 54: Vineet Goel's "Why Robinhood uses Airflow?" (https://robinhood.engineering/why-robinhood-uses-airflow-aed13a9a90c8)
• Slide 56: Ansible logo (https://www.ansible.com/)
• Slide 57: Astronomer logo (https://www.astronomer.io)
• Slide 58: Google Cloud Platform logo and Google Cloud Composer logo (https://cloud.google.com/composer/)
• Slide 60: picture by Edwin Andrade (https://unsplash.com/photos/4V1dC_eoCwg)
• Commit logo (https://2018.commit-conf.com)