Feeding data to AWS Redshift with Airflow


Talk I gave at EuroPython 2017


Federico Marani

July 13, 2017

Transcript

  1. Feeding data to Redshift with Airflow. Federico Marani, EuroPython 2017
  2. Where do I come from? But then "data pipelines" arrived... Apache Airflow, Amazon S3
  3. How do we query and cross-reference multiple external data sources?

  4. +

  5. Data pipelines

  6. Many inputs

  7. Processing in stages

  8. Archive everything

  9. Data pipeline architectures: batch processing vs. stream processing (diagram of Process A, Process B, Process C along a time axis)
  10. What is Airflow - Batch processing framework - 6000 stars on GitHub, 100+ companies using it, big community - Python all the way down
  11. Hey… why not use cron? - No dependencies between jobs - No retry mechanism - No error reporting - Does not scale out easily - No X - No Y - No Z - No A - No B - etc... SIMPLE IS GOOD, BUT THIS IS TOO SIMPLE
  12. How do I start?

      # airflow needs a home, ~/airflow is the default,
      # but you can lay foundation somewhere else if you prefer
      # (optional)
      export AIRFLOW_HOME=~/airflow

      # install from pypi using pip
      pip install airflow

      # initialize the database
      airflow initdb

      # start the web server, default port is 8080
      airflow webserver -p 8080
  13. None
  14. DAGs, dependencies / Tasks. Next steps: put this in the airflow/dags folder. Profit! (a minimal sketch follows below)
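
     A minimal sketch of such a DAG file, assuming Airflow 1.x as used in the talk; the DAG id, schedule and task bodies are made-up examples:

        from datetime import datetime, timedelta

        from airflow import DAG
        from airflow.operators.python_operator import PythonOperator

        default_args = {
            'owner': 'airflow',
            'start_date': datetime(2017, 7, 1),
            'retries': 1,
            'retry_delay': timedelta(minutes=5),
        }

        # one DagRun is created per day; each task in it becomes a TaskInstance
        dag = DAG('example_pipeline', default_args=default_args,
                  schedule_interval='@daily')

        def extract():
            print('downloading source data...')

        def transform():
            print('processing downloaded data...')

        t_extract = PythonOperator(task_id='extract', python_callable=extract, dag=dag)
        t_transform = PythonOperator(task_id='transform', python_callable=transform, dag=dag)

        t_extract >> t_transform  # transform runs only after extract succeeds
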
  15. Different types of operators - Python operator - Bash operator - Sensors - Docker operator. Everything can be templated in Jinja (example below).
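
     For instance, a BashOperator command can use Jinja templates. Continuing the sketch above (the download.py script and paths are hypothetical):

        from airflow.operators.bash_operator import BashOperator

        # {{ ds }} is rendered by Airflow to the run's execution date (YYYY-MM-DD)
        download = BashOperator(
            task_id='download_for_date',
            bash_command='python download.py --date {{ ds }} --out /data/{{ ds }}/',
            dag=dag,
        )
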
  16. XCom - By default, tasks are independent - You can use XCom to pass dynamic data between tasks - XCom is a key/value store integrated in Airflow
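
     A sketch of passing a value between two tasks via XCom, in the Airflow 1.x style where provide_context=True exposes the task instance; the task names and value are made up:

        def push_batch_id(**context):
            # store a value under a key for downstream tasks to read
            context['ti'].xcom_push(key='batch_id', value=42)

        def pull_batch_id(**context):
            batch_id = context['ti'].xcom_pull(task_ids='push_task', key='batch_id')
            print('loading batch %s' % batch_id)

        push_task = PythonOperator(task_id='push_task', python_callable=push_batch_id,
                                   provide_context=True, dag=dag)
        pull_task = PythonOperator(task_id='pull_task', python_callable=pull_batch_id,
                                   provide_context=True, dag=dag)
        push_task >> pull_task
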
  17. DagRuns and TaskInstances (diagram relating DAG, Dependency, Task and Operator)

  18. How are Tasks run? Scale up using the prefork model: each child handles one task at a time. Scale out using the CeleryExecutor, which uses queues to push tasks to workers.
  19. Retries and state machine (diagram of TaskInstance states, captioned: "TaskInstance completed", "TaskInstance running", "Despite retrying, no luck", "Previous instance says I am not needed", "I failed, but I am retrying in a bit", "Celery is doing its thing", "I don't know what I am")
  20. How Airflow deals with time - Time is a universal variable - {{ execution_date }} available in templates - Default behaviour is to backfill runs that were supposed to run but did not
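
     The same date is available inside PythonOperator callables. A sketch, assuming the Airflow 1.x provide_context mechanism shown above:

        def process_day(**context):
            # execution_date is the logical date of the run, not wall-clock time,
            # so backfilled runs still process the right day's slice of data
            day = context['execution_date'].strftime('%Y-%m-%d')
            print('processing data for %s' % day)

        process = PythonOperator(task_id='process_day', python_callable=process_day,
                                 provide_context=True, dag=dag)
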
  21. When date is not a possible filter - LatestOnlyOperator, skips all past runs - Very useful for running scrapers (sketch below)
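
     A sketch of guarding a scraper with it (import path as in Airflow 1.8; scraper.py is hypothetical):

        from airflow.operators.latest_only_operator import LatestOnlyOperator

        # every DagRun except the most recent one skips its downstream tasks,
        # so backfills will not re-run the scraper for past dates
        latest_only = LatestOnlyOperator(task_id='latest_only', dag=dag)
        scrape = BashOperator(task_id='scrape', bash_command='python scraper.py', dag=dag)
        latest_only >> scrape
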
  22. What steps do we normally have? 1. Generate a batch id 2. Create initial folder structure 3. Run downloaders/extractors 4. Compress 5. Upload to S3 6. Load into Redshift (one possible wiring is sketched below)
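
     One way to wire those six stages into a DAG; this is a sketch with placeholder tasks, where the real pipeline would use Python or Bash operators:

        from airflow.operators.dummy_operator import DummyOperator

        stages = ['generate_batch_id', 'create_folders', 'run_extractors',
                  'compress', 'upload_to_s3', 'load_redshift']
        tasks = [DummyOperator(task_id=stage, dag=dag) for stage in stages]

        # chain the stages so each one runs only after the previous succeeds
        for upstream, downstream in zip(tasks, tasks[1:]):
            upstream >> downstream
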
  23. IDing all the batch runs - Doing this, the load can be re-run - Airflow may re-run steps, so this needs to be taken into account - Idempotent operations

      batch_id  id  name  surname
      1         1   John  Smith
      1         2   Mark  Shuttleworth
      1         3   Luke  Walker
      1         4   Beth  Simmons
      1         5   Liz   Taylor
      2         1   John  Smith
      2         2   Mark  Shuttleworth
      2         3   Luke  Walker
      2         4   Beth  Simmons
      2         5   Liz   Taylor
      3         1   John  Smith
      3         2   Mark  Shuttleworth
      3         3   Luke  Walker
      3         4   Beth  Simmons
      3         5   Liz   Taylor
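
     A sketch of what an idempotent load step can look like (the people table and its columns are assumptions): deleting the batch's rows before re-inserting them means Airflow can safely re-run the task.

        def load_batch(cursor, batch_id, copy_sql):
            # wipe anything a previous attempt loaded for this batch...
            cursor.execute("DELETE FROM people WHERE batch_id = %s", (batch_id,))
            # ...then load the batch's files from S3 again
            cursor.execute(copy_sql)
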
  24. Data timestamping - When was it downloaded? - When was it written to S3? - When was it ….?
  25. Row-based archive formats for Redshift
      CSV (+gzip): Very simple - Everyone knows what it is - Order dependent - Hard to extend - Flat - No types
      JsonLines (+gzip): Very simple - Everyone knows what it is - Easy to extend - Verbose - Poor types
      Apache Avro: Not so common - Easy to extend - Space efficient - Smaller ecosystem - Rich types
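
     For example, writing a batch as gzipped JsonLines in Python 3 (the records and path are made up):

        import gzip
        import json

        records = [
            {'batch_id': 1, 'id': 1, 'name': 'John', 'surname': 'Smith'},
            {'batch_id': 1, 'id': 2, 'name': 'Mark', 'surname': 'Shuttleworth'},
        ]

        # one JSON object per line; gzip keeps the archive small
        with gzip.open('/data/1/people.jsonl.gz', 'wt') as fh:
            for record in records:
                fh.write(json.dumps(record) + '\n')
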
  26. Redshift by Amazon

  27. A fork of PostgreSQL for "OLAP" workloads - Made for huge volumes of data (compression) - Columnar DB

      Operation                       Column-oriented  Row-oriented
      Aggregate on a single column    Fast             Slow
      SELECT on many columns of table Fast             Slow
      INSERT/UPDATE on single rows    Slow             Fast
      SELECT single row               Slow             Fast
  28. Redshift has a columnar disk layout vs. the Postgres row layout (Rows / Columns diagram)
  29. Massively Parallel Processing architecture - Locality on distribution keys - Data transfers for unoptimized JOINs are expensive
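
     Distribution and sort keys are declared in the table DDL. A sketch (the table and column names are assumptions):

        CREATE_EVENTS_SQL = """
        CREATE TABLE events (
            user_id     BIGINT,
            occurred_at TIMESTAMP,
            payload     VARCHAR(4096)
        )
        DISTKEY (user_id)       -- rows with the same user_id live on the same slice,
                                -- so JOINs on user_id need no data transfer
        SORTKEY (occurred_at);  -- speeds up range-restricted scans over time
        """
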
  30. Implementation of a loading mechanism: JSON(s) on S3 → Decompression → Jsonpath flattening → Table loading
  31. What is a JsonPath? (diagram: nested structure → flat structure)
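
     A sketch of the whole load in one COPY statement: a jsonpaths file tells Redshift which nested JSON fields map to which flat table columns, and GZIP handles decompression. The bucket, table, credentials and IAM role names here are made up:

        import psycopg2

        # contents of s3://my-bucket/people.jsonpaths (maps nested fields to columns):
        #   {"jsonpaths": ["$.batch_id", "$.id", "$.name", "$.surname"]}

        COPY_SQL = """
        COPY people (batch_id, id, name, surname)
        FROM 's3://my-bucket/batches/1/people.jsonl.gz'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'
        GZIP
        FORMAT AS JSON 's3://my-bucket/people.jsonpaths';
        """

        conn = psycopg2.connect(host='example.redshift.amazonaws.com', port=5439,
                                dbname='analytics', user='loader', password='...')
        with conn.cursor() as cur:
            cur.execute(COPY_SQL)
        conn.commit()
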

  32. Redshift schema management - Schema is another asset to put under Git - Typical ORM operations are not the primary use case of Redshift
  33. Schema migrations with Alembic - Multiple environments - Autogeneration only works some of the time - When altering columns, use a temporary table. alembic revision / alembic upgrade
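
     Since Redshift cannot alter a column's type in place, a migration can add a new column, copy the data across, and swap the names. A sketch of such an Alembic migration (revision ids, table and column names are placeholders):

        """widen people.name"""
        from alembic import op
        import sqlalchemy as sa

        revision = '000000000001'
        down_revision = None

        def upgrade():
            op.add_column('people', sa.Column('name_wide', sa.String(512)))
            op.execute('UPDATE people SET name_wide = name')
            op.drop_column('people', 'name')
            # Redshift does support renaming columns, so finish with a rename
            op.alter_column('people', 'name_wide', new_column_name='name')

        def downgrade():
            raise NotImplementedError('this migration is one-way')
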
  34. Annoyances - Varchar length expressed in bytes (not chars) - Cannot alter column types - Primary/Foreign keys not enforced - Columns are NULLable by default
  35. Is this "future-proof"? - Redshift Spectrum very interesting - Remove the loading step? PrestoDB / AWS Athena - Streaming to S3? Compression? - Citus extensions to PostgreSQL?
  36. Thanks! Any questions, you can find me at: http://federicomarani.me @flagZ "Federico Marani" on Google (looking for work)