Feeding data to AWS Redshift with Airflow

Talk I gave at EuroPython 2017

Federico Marani

July 13, 2017

Transcript

  1. (title slide)

  2. Data pipeline architectures (diagram): processes A, B and C laid out
     over time, contrasting batch processing with stream processing

  3. What is Airflow?
     - Batch processing framework
     - 6,000+ stars on GitHub, 100+ companies using it, big community
     - Python all the way down

  4. Hey… why not use cron?
     - No dependencies between jobs
     - No retry mechanism
     - No error reporting
     - Does not scale out easily
     - No X, no Y, no Z, no A, no B, etc...
     SIMPLE IS GOOD, BUT THIS IS TOO SIMPLE

  5. How do I start?

        # airflow needs a home, ~/airflow is the default,
        # but you can lay foundation somewhere else if you prefer (optional)
        export AIRFLOW_HOME=~/airflow

        # install from pypi using pip
        pip install airflow

        # initialize the database
        airflow initdb

        # start the web server, default port is 8080
        airflow webserver -p 8080

  6. Different types of operators
     - Python operator
     - Bash operator
     - Sensors
     - Docker operator
     Everything can be templated in Jinja (see the sketch below)

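     A minimal sketch of a DAG mixing these operators (the DAG id, schedule
     and callables are made up for illustration):

        from datetime import datetime

        from airflow import DAG
        from airflow.operators.bash_operator import BashOperator
        from airflow.operators.python_operator import PythonOperator

        def say_hello():
            print("hello from a PythonOperator")

        dag = DAG("example_dag", start_date=datetime(2017, 1, 1),
                  schedule_interval="@daily")

        hello = PythonOperator(task_id="hello", python_callable=say_hello,
                               dag=dag)

        # bash_command is a Jinja template: {{ ds }} renders as the
        # execution date in YYYY-MM-DD form
        echo_date = BashOperator(task_id="echo_date",
                                 bash_command="echo 'run date: {{ ds }}'",
                                 dag=dag)

        hello >> echo_date
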
  7. XCom
     - By default, tasks are independent
     - You can use XCom to pass dynamic data between tasks
     - XCom is a key/value store integrated in Airflow

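     A sketch of the push/pull pattern (task ids and payload are invented;
     returning a value from a PythonOperator pushes it to XCom):

        from datetime import datetime

        from airflow import DAG
        from airflow.operators.python_operator import PythonOperator

        def produce(**context):
            # the return value is pushed to XCom under "return_value"
            return {"batch_id": 42}

        def consume(**context):
            payload = context["ti"].xcom_pull(task_ids="produce")
            print("received", payload)

        dag = DAG("xcom_example", start_date=datetime(2017, 1, 1))

        producer = PythonOperator(task_id="produce", python_callable=produce,
                                  provide_context=True, dag=dag)
        consumer = PythonOperator(task_id="consume", python_callable=consume,
                                  provide_context=True, dag=dag)
        producer >> consumer
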
  8. How are Tasks run?
     - Scale up using the prefork model: each child handles one task at a
       time
     - Scale out using the CeleryExecutor: uses queues to push tasks to
       workers

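     Switching executors is a configuration change; roughly, in airflow.cfg
     (the broker and result backend URLs here are placeholders, and exact
     key names vary between Airflow versions):

        [core]
        executor = CeleryExecutor

        [celery]
        broker_url = redis://localhost:6379/0
        celery_result_backend = db+postgresql://airflow@localhost/airflow
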
  9. Retries and state machine (diagram of task states):
     - TaskInstance completed
     - TaskInstance running
     - Despite retrying, no luck
     - Previous instance says I am not needed
     - I failed, but I am retrying in a bit
     - Celery is doing its thing
     - I don’t know what I am

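     Retry behaviour is configured per task; a small sketch (the task,
     script and values are made up, and dag is assumed defined as in the
     earlier examples):

        from datetime import timedelta

        from airflow.operators.bash_operator import BashOperator

        # after three failed attempts the task settles into the failed state
        flaky = BashOperator(task_id="flaky_step",
                             bash_command="./sometimes_fails.sh",
                             retries=3,
                             retry_delay=timedelta(minutes=5),
                             dag=dag)
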
  10. How Airflow deals with time
     - Time is a universal variable
     - {{ execution_date }} is available in templates
     - Default behaviour is to backfill runs that were supposed to happen
       but did not

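     Backfills can also be triggered explicitly from the CLI; a sketch with
     a made-up DAG id and date range:

        # re-run (or catch up on) every scheduled run in the interval
        airflow backfill example_dag -s 2017-01-01 -e 2017-01-07
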
  11. When date is not a possible filter
     - LatestOnlyOperator skips all past runs (see the sketch below)
     - Very useful for running scrapers

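     A minimal sketch (DAG id and scraper script invented for illustration):

        from datetime import datetime

        from airflow import DAG
        from airflow.operators.bash_operator import BashOperator
        from airflow.operators.latest_only_operator import LatestOnlyOperator

        dag = DAG("scraper", start_date=datetime(2017, 1, 1),
                  schedule_interval="@daily")

        latest_only = LatestOnlyOperator(task_id="latest_only", dag=dag)
        scrape = BashOperator(task_id="scrape",
                              bash_command="./run_scraper.sh",
                              dag=dag)

        # downstream of LatestOnlyOperator, scrape is skipped on any run
        # that is not the most recent scheduled one
        latest_only >> scrape
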
  12. What steps do we normally have?
     1. Generate a batch id
     2. Create initial folder structure
     3. Run downloaders/extractors
     4. Compress
     5. Upload to S3
     6. Load into Redshift (sketched below)

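     A rough sketch of that last step, COPYing a gzipped JSON Lines batch
     from S3 into Redshift (bucket, table, IAM role and connection details
     are all placeholders):

        import psycopg2

        COPY_SQL = """
            COPY events
            FROM 's3://my-bucket/batches/{batch_id}/events.jsonl.gz'
            CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/loader'
            FORMAT AS JSON 'auto' GZIP;
        """

        conn = psycopg2.connect(host="example.redshift.amazonaws.com",
                                port=5439, dbname="analytics",
                                user="loader", password="secret")
        with conn:
            with conn.cursor() as cur:
                cur.execute(COPY_SQL.format(batch_id=42))
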
  13. IDing all the batch runs
     - Doing this, the load can be re-run
     - Airflow may re-run steps, so this needs to be taken into account
     - Idempotent operations (see the sketch below)

     batch_id  id  name  surname
     1         1   John  Smith
     1         2   Mark  Shuttleworth
     1         3   Luke  Walker
     1         4   Beth  Simmons
     1         5   Liz   Taylor
     2         1   John  Smith
     2         2   Mark  Shuttleworth
     2         3   Luke  Walker
     2         4   Beth  Simmons
     2         5   Liz   Taylor
     3         1   John  Smith
     3         2   Mark  Shuttleworth
     3         3   Luke  Walker
     3         4   Beth  Simmons
     3         5   Liz   Taylor

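     One way to make the load idempotent, matching the table above (the
     table name and cursor handling are assumptions): delete whatever a
     previous attempt at the same batch left behind before inserting, so a
     re-run cannot duplicate rows:

        def load_batch(cur, batch_id, rows):
            # remove any rows a previous attempt at this batch wrote
            cur.execute("DELETE FROM people WHERE batch_id = %s", (batch_id,))
            for row in rows:
                cur.execute(
                    "INSERT INTO people (batch_id, id, name, surname) "
                    "VALUES (%s, %s, %s, %s)",
                    (batch_id, row["id"], row["name"], row["surname"]))

     In practice the insert would be the bulk COPY shown earlier; the
     delete-then-load shape is what makes the re-run safe.
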
  14. Data timestamping
     - When was it downloaded?
     - When was it written to S3?
     - When was it …?

  15. Row-based archive formats for Redshift
     - CSV (+gzip): very simple; everyone knows what it is; order dependent;
       hard to extend; flat; no types
     - JsonLines (+gzip): very simple; everyone knows what it is; easy to
       extend; verbose; poor types
     - Apache Avro: not so common; easy to extend; space efficient; smaller
       ecosystem; rich types

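     For the JsonLines route, producing a gzipped batch file needs only the
     standard library (file name and rows are made up):

        import gzip
        import json

        rows = [{"id": 1, "name": "John", "surname": "Smith"}]

        # Redshift's COPY can ingest this directly with FORMAT AS JSON + GZIP
        with gzip.open("batch_0001.jsonl.gz", "wt") as fh:
            for row in rows:
                fh.write(json.dumps(row) + "\n")
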
  16. A fork of PostgreSQL for “OLAP” workloads
     - Made for huge volumes of data (compression)
     - Columnar DB

     Operation on a table           Column-oriented  Row-oriented
     Aggregate on a single column   Fast             Slow
     SELECT on many columns         Fast             Slow
     INSERT/UPDATE on single rows   Slow             Fast
     SELECT single row              Slow             Fast

  17. Massively Parallel Processing architecture
     - Locality on distribution keys
     - Data transfers for unoptimized JOINs are expensive

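     Distribution is chosen per table at creation time; a hedged DDL sketch
     (table and column names invented for illustration):

        # rows sharing a user_id land on the same node, so joins on
        # user_id avoid cross-node data transfer
        CREATE_EVENTS_SQL = """
            CREATE TABLE events (
                user_id     BIGINT,
                event_type  VARCHAR(64),
                occurred_at TIMESTAMP
            )
            DISTKEY (user_id)
            SORTKEY (occurred_at);
        """
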
  18. Redshift schema management
     - Schema is another asset to put under Git
     - Typical ORM operations are not the primary use case of Redshift

  19. Schema migrations with Alembic
     - Multiple environments
     - Autogeneration only works some of the time
     - When altering columns, use a temporary table (sketched below)

        alembic revision
        alembic upgrade

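     A hedged sketch of that temporary-table trick inside a migration
     (revision ids, table and column definitions are all invented), since
     Redshift cannot alter a column's type in place:

        from alembic import op

        revision = "abc123"
        down_revision = None

        def upgrade():
            # recreate the table with the new column definition...
            op.execute("""
                CREATE TABLE people_tmp (
                    id      BIGINT,
                    name    VARCHAR(256),
                    surname VARCHAR(512)
                )
            """)
            # ...copy the data across, then swap the tables
            op.execute("INSERT INTO people_tmp SELECT id, name, surname "
                       "FROM people")
            op.execute("DROP TABLE people")
            op.execute("ALTER TABLE people_tmp RENAME TO people")

        def downgrade():
            raise NotImplementedError("left out of this sketch")
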
  20. Annoyances
     - Varchar length expressed in bytes (not chars)
     - Cannot alter column types
     - Primary/foreign keys not enforced
     - Columns are NULLable by default

  21. Is this “future-proof”?
     - Redshift Spectrum very interesting
     - Remove the loading step? PrestoDB/AWS Athena
     - Streaming to S3? Compression?
     - Citus extensions to PostgreSQL?

  22. Thanks! Any questions, you can find me at:
      http://federicomarani.me
      @flagZ
      “Federico Marani” on Google (looking for work)