Feeding data to AWS Redshift with Airflow


Talk I gave at EuroPython 2017


Federico Marani

July 13, 2017

Transcript

  1. Feeding data to Redshift with Airflow. Federico Marani, EuroPython 2017
  2. Where do I come from? But then "data pipelines" arrived... Apache Airflow, Amazon S3
  3. How do we query and cross-reference multiple external data sources?

  4. +

  5. Data pipelines

  6. Many inputs

  7. Processing in stages

  8. Archive everything

  9. Data pipeline architectures: batch processing vs. stream processing (diagram of Process A, Process B, Process C along a time axis)
  10. What is Airflow - Batch processing framework - 6000 stars on GitHub, 100+ companies using it, big community - Python all the way down
  11. Hey… why not use cron? - No dependencies between jobs - No retry mechanism - No error reporting - Does not scale out easily - No X - No Y - No Z - No A - No B - etc... SIMPLE IS GOOD, BUT THIS IS TOO SIMPLE
  12. How do I start?

      # airflow needs a home, ~/airflow is the default,
      # but you can lay foundation somewhere else if you prefer
      # (optional)
      export AIRFLOW_HOME=~/airflow

      # install from pypi using pip
      pip install airflow

      # initialize the database
      airflow initdb

      # start the web server, default port is 8080
      airflow webserver -p 8080
  13. None
  14. DAGs, dependencies / Tasks. Next steps: put this in the airflow/dags folder. Profit! (a minimal sketch follows below)
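
     A minimal sketch of such a DAG file, assuming Airflow 1.x as used in the talk; the DAG id, schedule and task bodies are made-up examples:

        from datetime import datetime, timedelta

        from airflow import DAG
        from airflow.operators.python_operator import PythonOperator

        default_args = {
            'owner': 'airflow',
            'start_date': datetime(2017, 7, 1),
            'retries': 1,
            'retry_delay': timedelta(minutes=5),
        }

        # one DagRun is created per day; each task in it becomes a TaskInstance
        dag = DAG('example_pipeline', default_args=default_args,
                  schedule_interval='@daily')

        def extract():
            print('downloading source data...')

        def transform():
            print('processing downloaded data...')

        t_extract = PythonOperator(task_id='extract', python_callable=extract, dag=dag)
        t_transform = PythonOperator(task_id='transform', python_callable=transform, dag=dag)

        t_extract >> t_transform  # transform runs only after extract succeeds
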
  15. Different types of operators - Python operator - Bash operator - Sensors - Docker operator. Everything can be templated in Jinja (example below).
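
     For instance, a BashOperator command can use Jinja templates. Continuing the sketch above (the download.py script and paths are hypothetical):

        from airflow.operators.bash_operator import BashOperator

        # {{ ds }} is rendered by Airflow to the run's execution date (YYYY-MM-DD)
        download = BashOperator(
            task_id='download_for_date',
            bash_command='python download.py --date {{ ds }} --out /data/{{ ds }}/',
            dag=dag,
        )
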
  16. XCom - By default, tasks are independent - You can use XCom to pass dynamic data between tasks - XCom is a key/value store integrated in Airflow
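
     A sketch of passing a value between two tasks via XCom, in the Airflow 1.x style where provide_context=True exposes the task instance; the task names and value are made up:

        def push_batch_id(**context):
            # store a value under a key for downstream tasks to read
            context['ti'].xcom_push(key='batch_id', value=42)

        def pull_batch_id(**context):
            batch_id = context['ti'].xcom_pull(task_ids='push_task', key='batch_id')
            print('loading batch %s' % batch_id)

        push_task = PythonOperator(task_id='push_task', python_callable=push_batch_id,
                                   provide_context=True, dag=dag)
        pull_task = PythonOperator(task_id='pull_task', python_callable=pull_batch_id,
                                   provide_context=True, dag=dag)
        push_task >> pull_task
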
  17. DagRuns and TaskInstances (diagram relating DAG, Dependency, Task and Operator)

  18. How are Tasks run? Scale up using the prefork model: each child handles one task at a time. Scale out using the CeleryExecutor, which uses queues to push tasks to workers.
  19. Retries and state machine (diagram of TaskInstance states, captioned: "TaskInstance completed", "TaskInstance running", "Despite retrying, no luck", "Previous instance says I am not needed", "I failed, but I am retrying in a bit", "Celery is doing its thing", "I don't know what I am")
  20. How Airflow deals with time - Time is a universal variable - {{ execution_date }} available in templates - Default behaviour is to backfill runs that were supposed to run but did not
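
     The same date is available inside PythonOperator callables. A sketch, assuming the Airflow 1.x provide_context mechanism shown above:

        def process_day(**context):
            # execution_date is the logical date of the run, not wall-clock time,
            # so backfilled runs still process the right day's slice of data
            day = context['execution_date'].strftime('%Y-%m-%d')
            print('processing data for %s' % day)

        process = PythonOperator(task_id='process_day', python_callable=process_day,
                                 provide_context=True, dag=dag)
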
  21. When date is not a possible filter - LatestOnlyOperator, skips all past runs - Very useful for running scrapers (sketch below)
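
     A sketch of guarding a scraper with it (import path as in Airflow 1.8; scraper.py is hypothetical):

        from airflow.operators.latest_only_operator import LatestOnlyOperator

        # every DagRun except the most recent one skips its downstream tasks,
        # so backfills will not re-run the scraper for past dates
        latest_only = LatestOnlyOperator(task_id='latest_only', dag=dag)
        scrape = BashOperator(task_id='scrape', bash_command='python scraper.py', dag=dag)
        latest_only >> scrape
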
  22. What steps do we normally have? 1. Generate a batch id 2. Create initial folder structure 3. Run downloaders/extractors 4. Compress 5. Upload to S3 6. Load into Redshift (one possible wiring is sketched below)
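
     One way to wire those six stages into a DAG; this is a sketch with placeholder tasks, where the real pipeline would use Python or Bash operators:

        from airflow.operators.dummy_operator import DummyOperator

        stages = ['generate_batch_id', 'create_folders', 'run_extractors',
                  'compress', 'upload_to_s3', 'load_redshift']
        tasks = [DummyOperator(task_id=stage, dag=dag) for stage in stages]

        # chain the stages so each one runs only after the previous succeeds
        for upstream, downstream in zip(tasks, tasks[1:]):
            upstream >> downstream
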
  23. IDing all the batch runs - Doing this, the load can be re-run - Airflow may re-run steps, so this needs to be taken into account - Idempotent operations

      batch_id  id  name  surname
      1         1   John  Smith
      1         2   Mark  Shuttleworth
      1         3   Luke  Walker
      1         4   Beth  Simmons
      1         5   Liz   Taylor
      2         1   John  Smith
      2         2   Mark  Shuttleworth
      2         3   Luke  Walker
      2         4   Beth  Simmons
      2         5   Liz   Taylor
      3         1   John  Smith
      3         2   Mark  Shuttleworth
      3         3   Luke  Walker
      3         4   Beth  Simmons
      3         5   Liz   Taylor
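
     A sketch of what an idempotent load step can look like (the people table and its columns are assumptions): deleting the batch's rows before re-inserting them means Airflow can safely re-run the task.

        def load_batch(cursor, batch_id, copy_sql):
            # wipe anything a previous attempt loaded for this batch...
            cursor.execute("DELETE FROM people WHERE batch_id = %s", (batch_id,))
            # ...then load the batch's files from S3 again
            cursor.execute(copy_sql)
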
  24. Data timestamping - When was it downloaded? - When was it written to S3? - When was it ….?
  25. Row-based archive formats for Redshift
      CSV (+gzip): Very simple - Everyone knows what it is - Order dependent - Hard to extend - Flat - No types
      JsonLines (+gzip): Very simple - Everyone knows what it is - Easy to extend - Verbose - Poor types
      Apache Avro: Not so common - Easy to extend - Space efficient - Smaller ecosystem - Rich types
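
     For example, writing a batch as gzipped JsonLines in Python 3 (the records and path are made up):

        import gzip
        import json

        records = [
            {'batch_id': 1, 'id': 1, 'name': 'John', 'surname': 'Smith'},
            {'batch_id': 1, 'id': 2, 'name': 'Mark', 'surname': 'Shuttleworth'},
        ]

        # one JSON object per line; gzip keeps the archive small
        with gzip.open('/data/1/people.jsonl.gz', 'wt') as fh:
            for record in records:
                fh.write(json.dumps(record) + '\n')
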
  26. Redshift by Amazon

  27. A fork of PostgreSQL for "OLAP" workloads - Made for huge volumes of data (compression) - Columnar DB

      Operation                       Column-oriented  Row-oriented
      Aggregate on a single column    Fast             Slow
      SELECT on many columns of table Fast             Slow
      INSERT/UPDATE on single rows    Slow             Fast
      SELECT single row               Slow             Fast
  28. Redshift has a columnar disk layout vs. the Postgres row layout (Rows / Columns diagram)
  29. Massively Parallel Processing architecture - Locality on distribution keys - Data transfers for unoptimized JOINs are expensive
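
     Distribution and sort keys are declared in the table DDL. A sketch (the table and column names are assumptions):

        CREATE_EVENTS_SQL = """
        CREATE TABLE events (
            user_id     BIGINT,
            occurred_at TIMESTAMP,
            payload     VARCHAR(4096)
        )
        DISTKEY (user_id)       -- rows with the same user_id live on the same slice,
                                -- so JOINs on user_id need no data transfer
        SORTKEY (occurred_at);  -- speeds up range-restricted scans over time
        """
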
  30. Implementation of a loading mechanism: JSON(s) on S3 → Decompression → Jsonpath flattening → Table loading
  31. What is a JsonPath? (diagram: nested structure → flat structure)
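
     A sketch of the whole load in one COPY statement: a jsonpaths file tells Redshift which nested JSON fields map to which flat table columns, and GZIP handles decompression. The bucket, table, credentials and IAM role names here are made up:

        import psycopg2

        # contents of s3://my-bucket/people.jsonpaths (maps nested fields to columns):
        #   {"jsonpaths": ["$.batch_id", "$.id", "$.name", "$.surname"]}

        COPY_SQL = """
        COPY people (batch_id, id, name, surname)
        FROM 's3://my-bucket/batches/1/people.jsonl.gz'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'
        GZIP
        FORMAT AS JSON 's3://my-bucket/people.jsonpaths';
        """

        conn = psycopg2.connect(host='example.redshift.amazonaws.com', port=5439,
                                dbname='analytics', user='loader', password='...')
        with conn.cursor() as cur:
            cur.execute(COPY_SQL)
        conn.commit()
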

  32. Redshift schema management - Schema is another asset to put under Git - Typical ORM operations are not the primary use case of Redshift
  33. Schema migrations with Alembic - Multiple environments - Autogeneration only works some of the time - When altering columns, use a temporary table. alembic revision / alembic upgrade
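
     Since Redshift cannot alter a column's type in place, a migration can add a new column, copy the data across, and swap the names. A sketch of such an Alembic migration (revision ids, table and column names are placeholders):

        """widen people.name"""
        from alembic import op
        import sqlalchemy as sa

        revision = '000000000001'
        down_revision = None

        def upgrade():
            op.add_column('people', sa.Column('name_wide', sa.String(512)))
            op.execute('UPDATE people SET name_wide = name')
            op.drop_column('people', 'name')
            # Redshift does support renaming columns, so finish with a rename
            op.alter_column('people', 'name_wide', new_column_name='name')

        def downgrade():
            raise NotImplementedError('this migration is one-way')
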
  34. Annoyances - Varchar length expressed in bytes (not chars) - Cannot alter column types - Primary/Foreign keys not enforced - Columns are NULLable by default
  35. Is this "future-proof"? - Redshift Spectrum very interesting - Remove the loading step? PrestoDB / AWS Athena - Streaming to S3? Compression? - Citus extensions to PostgreSQL?
  36. Thanks! Any questions, you can find me at: http://federicomarani.me @flagZ "Federico Marani" on Google (looking for work)