Slide 1

Feeding data to Redshift with Airflow
Federico Marani - EuroPython 2017

Slide 2

Where do I come from? But then “data pipelines” arrived... Apache Airflow, Amazon S3

Slide 3


How do we query and cross-reference multiple external data sources?

Slide 4


Slide 5


Data pipelines

Slide 6


Many inputs

Slide 7


Processing in stages

Slide 8


Archive everything

Slide 9

Data pipeline architectures
(diagram: Process A → Process B → Process C over time, comparing batch processing with stream processing)

Slide 10

What is Airflow
- Batch processing framework
- 6000 stars on GitHub, 100+ companies using it, big community
- Python all the way down

Slide 11

Hey… why not use cron?
- No dependencies between jobs
- No retry mechanism
- No error reporting
- Does not scale out easily
- No X, no Y, no Z, no A, no B, etc...
SIMPLE IS GOOD, BUT THIS IS TOO SIMPLE

Slide 12

How do I start?

# airflow needs a home, ~/airflow is the default,
# but you can lay foundation somewhere else if you prefer (optional)
export AIRFLOW_HOME=~/airflow

# install from pypi using pip
pip install airflow

# initialize the database
airflow initdb

# start the web server, default port is 8080
airflow webserver -p 8080

Slide 13


Slide 14

DAGs, dependencies / Tasks
Next steps:
1. Put this in the airflow/dags folder
2. Profit!
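
The DAG file shown on the slide is not in the transcript. As a conceptual stand-in (plain Python, not the Airflow API), this sketch shows what a DAG of tasks with dependencies means: a task runs only once all of its upstream tasks are done. The task names are hypothetical.

```python
from collections import deque

# Conceptual sketch (plain Python, NOT the Airflow API): a DAG is just
# tasks plus upstream -> downstream dependencies; the scheduler runs a
# task only once all of its upstream tasks have completed.
dependencies = {                       # hypothetical task names
    "download": [],
    "compress": ["download"],
    "upload_s3": ["compress"],
    "load_redshift": ["upload_s3"],
}

def execution_order(deps):
    """Topologically sort tasks so each runs after its upstreams."""
    done, order = set(), []
    queue = deque(t for t, ups in deps.items() if not ups)
    while queue:
        task = queue.popleft()
        order.append(task)
        done.add(task)
        for t, ups in deps.items():
            if t not in done and t not in queue and all(u in done for u in ups):
                queue.append(t)
    return order

print(execution_order(dependencies))
# -> ['download', 'compress', 'upload_s3', 'load_redshift']
```

In real Airflow code the same ordering is declared with operators and the `>>` dependency syntax inside a `DAG` object.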

Slide 15

Different types of operators
- Python operator
- Bash operator
- Sensors
- Docker operator
Everything can be templated in Jinja

Slide 16

XCom
- By default, tasks are independent
- You can use XCom to pass dynamic data between tasks
- XCom is a key/value store integrated in Airflow
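
A sketch of the XCom pattern: two callables, as they might be wired to PythonOperators, exchanging a value through `xcom_push`/`xcom_pull` (real TaskInstance methods). The `FakeTI` stub stands in for Airflow's task instance so the snippet runs standalone; the key and task names are hypothetical.

```python
class FakeTI:
    """Minimal stand-in for Airflow's TaskInstance XCom methods,
    so this sketch runs without Airflow installed."""
    store = {}

    def xcom_push(self, key, value):
        self.store[key] = value

    def xcom_pull(self, key, task_ids=None):
        return self.store[key]

def produce_batch(ti):
    batch_id = 42                      # e.g. generated once per run
    ti.xcom_push(key="batch_id", value=batch_id)

def consume_batch(ti):
    batch_id = ti.xcom_pull(key="batch_id", task_ids="produce_batch")
    return f"loading batch {batch_id}"

ti = FakeTI()
produce_batch(ti)
print(consume_batch(ti))   # -> loading batch 42
```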

Slide 17

DagRuns and TaskInstances
(diagram: relationships between DAG, Dependency, Task and Operator)

Slide 18

How are Tasks run?
- Scale up using the prefork model: each child handles one task at a time
- Scale out using the CeleryExecutor: uses queues to push tasks to workers

Slide 19

Retries and state machine
- TaskInstance completed
- TaskInstance running
- Despite retrying, no luck
- Previous instance says I am not needed
- I failed, but I am retrying in a bit
- Celery is doing its thing
- I don’t know what I am

Slide 20

How Airflow deals with time
- Time is a universal variable
- {{ execution_date }} is available in templates
- Default behaviour is to backfill runs that were supposed to happen but did not
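
One reason `execution_date` matters: keying outputs on it makes backfills safe, because each (re-)run writes to its own date-partitioned location. A minimal sketch, with a hypothetical bucket and prefix:

```python
from datetime import datetime

def s3_key_for_run(execution_date):
    """Date-partitioned S3 key: every run (including backfilled ones)
    gets its own deterministic location."""
    return execution_date.strftime("s3://my-bucket/raw/%Y/%m/%d/data.json.gz")

print(s3_key_for_run(datetime(2017, 7, 11)))
# -> s3://my-bucket/raw/2017/07/11/data.json.gz
```

In an actual task this would typically be rendered via the `{{ execution_date }}` Jinja template variable rather than computed by hand.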

Slide 21

When date is not a possible filter
- The LatestOnlyOperator skips all past runs
- Very useful for running scrapers

Slide 22

What steps do we normally have?
1. Generate a batch id
2. Create the initial folder structure
3. Run downloaders/extractors
4. Compress
5. Upload to S3
6. Load into Redshift
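
The six steps above, sketched as plain functions chained in sequence; in Airflow each would be its own task, with dependencies declaring the order. All names, paths and the bucket are hypothetical.

```python
def generate_batch_id():
    return 1                                     # e.g. next id from a sequence

def create_folder(batch_id):
    return f"/tmp/batch-{batch_id}"              # initial folder structure

def run_extractors(folder):
    return [f"{folder}/users.json"]              # downloaders/extractors output

def compress(files):
    return [f + ".gz" for f in files]            # gzip each file

def upload_to_s3(files):
    return [f"s3://bucket/{f.rsplit('/', 1)[-1]}" for f in files]

def load_into_redshift(keys):
    return f"COPY users FROM '{keys[0]}'"        # statement issued to Redshift

keys = upload_to_s3(compress(run_extractors(create_folder(generate_batch_id()))))
print(load_into_redshift(keys))
# -> COPY users FROM 's3://bucket/users.json.gz'
```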

Slide 23

IDing all the batch runs
- Doing this, the load can be re-run
- Airflow may re-run steps, so this needs to be taken into account
- Idempotent operations

batch_id  id  name  surname
1         1   John  Smith
1         2   Mark  Shuttleworth
1         3   Luke  Walker
1         4   Beth  Simmons
1         5   Liz   Taylor
2         1   John  Smith
2         2   Mark  Shuttleworth
2         3   Luke  Walker
2         4   Beth  Simmons
2         5   Liz   Taylor
3         1   John  Smith
3         2   Mark  Shuttleworth
3         3   Luke  Walker
3         4   Beth  Simmons
3         5   Liz   Taylor
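
One way to get the idempotence the slide asks for: key every row on `batch_id` and delete that batch before re-loading it, so an Airflow retry converges to the same final state. A minimal SQL-building sketch; the table name is hypothetical.

```python
def idempotent_load_sql(table, batch_id):
    """Delete-then-load inside one transaction: re-running the step
    for the same batch_id leaves the table in the same state."""
    return (f"BEGIN;\n"
            f"DELETE FROM {table} WHERE batch_id = {int(batch_id)};\n"
            f"-- COPY/INSERT the batch's rows here\n"
            f"COMMIT;")

print(idempotent_load_sql("people", 2))
```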

Slide 24

Data timestamping
- When was it downloaded?
- When was it written to S3?
- When was it ….?

Slide 25

Row-based archive formats for Redshift

CSV (+gzip):
- Very simple
- Everyone knows what it is
- Order dependent
- Hard to extend
- Flat
- No types

JsonLines (+gzip):
- Very simple
- Everyone knows what it is
- Easy to extend
- Verbose
- Poor types

Apache Avro:
- Not so common
- Easy to extend
- Space efficient
- Smaller ecosystem
- Rich types
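
For illustration, writing and reading JsonLines (+gzip), one of the formats compared above, takes only the standard library: one JSON object per line, gzip-compressed.

```python
import gzip
import io
import json

records = [{"id": 1, "name": "John"}, {"id": 2, "name": "Mark"}]

# Write: one JSON document per line, gzip-compressed (in memory here;
# in the pipeline this would be a file destined for S3).
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    for rec in records:
        gz.write((json.dumps(rec) + "\n").encode("utf-8"))

# Read it back line by line.
buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode="rb") as gz:
    restored = [json.loads(line) for line in gz]

print(restored == records)   # -> True
```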

Slide 26


Redshift by Amazon

Slide 27

A fork of PostgreSQL for “OLAP” workloads
- Made for huge volumes of data (compression)
- Columnar DB

                               Column-oriented  Row-oriented
Aggregate on a single column   Fast             Slow
SELECT on many columns         Fast             Slow
INSERT/UPDATE on single rows   Slow             Fast
SELECT a single row            Slow             Fast

Slide 28

Rows vs columns: Redshift has a column-oriented disk layout, whereas Postgres has a row-oriented layout
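
A toy illustration of the two layouts: the same table held row-wise (as Postgres stores it) and column-wise (as Redshift stores it). An aggregate over one column only needs to scan that column's contiguous data in the columnar layout.

```python
# Row-oriented: each record stored together.
rows = [(1, "John", "Smith"), (2, "Mark", "Shuttleworth")]

# Column-oriented: one contiguous sequence per column.
columns = {
    "id":      [r[0] for r in rows],
    "name":    [r[1] for r in rows],
    "surname": [r[2] for r in rows],
}

# Aggregating a single column touches only that column's data --
# the fast path in a columnar store.
print(sum(columns["id"]))   # -> 3
```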

Slide 29

Massively Parallel Processing architecture
- Locality on distribution keys
- Data transfers for unoptimized JOINs are expensive

Slide 30

Implementation of a loading mechanism: JSON(s) on S3 → decompression → JSONPath flattening → table loading
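
The table-loading step boils down to a Redshift COPY statement that pulls gzipped JSON from S3 and uses a jsonpaths file to map nested fields onto flat columns. A sketch that builds such a statement; the bucket, table and role ARN are placeholders.

```python
def build_copy_sql(table, data_key, jsonpaths_key, iam_role):
    """Assemble a Redshift COPY for gzipped JSON on S3, using a
    jsonpaths file to flatten nested fields into columns."""
    return (f"COPY {table}\n"
            f"FROM '{data_key}'\n"
            f"IAM_ROLE '{iam_role}'\n"
            f"JSON '{jsonpaths_key}'\n"
            f"GZIP;")

sql = build_copy_sql("people",
                     "s3://my-bucket/batches/1/data.json.gz",
                     "s3://my-bucket/jsonpaths/people.json",
                     "arn:aws:iam::123456789012:role/redshift-copy")
print(sql)
```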

Slide 31

What is a JSONPath? It maps a nested structure to a flat structure
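
What a jsonpaths file expresses: each path picks one value out of a nested JSON document, and together the paths produce one flat row. A minimal evaluator for simple `$.a.b` paths (no arrays or filters); the field names are made up.

```python
def extract(doc, path):
    """Follow a simple '$.a.b' JSONPath through nested dicts."""
    value = doc
    for part in path.lstrip("$.").split("."):
        value = value[part]
    return value

nested = {"user": {"name": "John", "address": {"city": "London"}}}
paths = ["$.user.name", "$.user.address.city"]

flat_row = [extract(nested, p) for p in paths]
print(flat_row)   # -> ['John', 'London']
```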

Slide 32

Redshift schema management
- The schema is another asset to put under Git
- Typical ORM operations are not the primary use case of Redshift

Slide 33

Schema migrations with Alembic
- Multiple environments
- Autogeneration only works some of the time
- When altering columns, use a temporary table

alembic revision
alembic upgrade

Slide 34

Annoyances
- Varchar length is expressed in bytes (not chars)
- Cannot alter column types
- Primary/foreign keys are not enforced
- Columns are NULLable by default
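
The first annoyance in practice: because VARCHAR(n) counts UTF-8 bytes, strings with accented characters need wider columns than their character count suggests.

```python
# Multi-byte UTF-8 characters consume more of a Redshift VARCHAR
# than their character count suggests.
name = "Zoë Müller"
print(len(name))                   # 10 characters
print(len(name.encode("utf-8")))   # 12 bytes, which is what Redshift counts
```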

Slide 35

Is this “future-proof”?
- Redshift Spectrum is very interesting
- Remove the loading step? PrestoDB/AWS Athena
- Streaming to S3? Compression?
- Citus extensions to PostgreSQL?

Slide 36

Thanks! Any questions? You can find me at:
http://federicomarani.me
@flagZ
“Federico Marani” on Google (looking for work)