
Modularization in ETL

Todd Perry
February 21, 2021

Some tips on how to modularize ETL flows, specifically focusing on Airflow

Transcript

  1. Copyright(C) Nowcast, Inc. All rights reserved. Agenda: 1. Self Introduction 2. What is ETL? 3. Modularization Patterns 4. Case Study 5. Problems 6. Airflow 7. ETL at Nowcast 8. Summary
  2. { 'Name': 'Todd Perry', 'From': 'Southampton, UK', 'Company': 'Nowcast', 'Role': 'Data Engineer/Scientist', 'Hobbies': ['Cycling', 'Languages'], 'Favorite Food': 'Mabo-dofu don' } Self Introduction
  3. Extract / Transform / Load. Large-scale aggregation and processing of big datasets is not feasible in the environments favored by data scientists, such as Jupyter or RStudio. It is often the responsibility of data engineers to support data scientists by building ETL pipelines to process this data. What is ETL?
  4. There are a few approaches that can be taken to modularize ETL flows. To illustrate these patterns, let's consider a simple example. Modularization Patterns
  5. Splitting Tasks. This has a few merits, such as: • Testability, due to reduced complexity • Scalability, due to separation of concerns Modularization Patterns
  6. Intermediate Data. This also has a few advantages: • Historical reruns become easier • More logs mean the system is easier to debug Modularization Patterns
  7. Debugging. Incorrect data being inserted into the data warehouse is really the worst problem a data engineer can face. Being able to debug issues in the ETL flow is very important. Modularization Patterns
  8. Debugging. The log files can be binary searched: here we look at the middle log and check for anomalies. Modularization Patterns
  9. Debugging. If the anomaly persists, we check the file before it. In this case the data looks good! So we know the problem exists in task B, and we can rerun from this log! Modularization Patterns
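The bisection described on these two slides can be sketched in Python. The `first_bad_output` helper and its `is_valid` check are hypothetical, assuming each task leaves one ordered intermediate output:

```python
def first_bad_output(outputs, is_valid):
    """Binary search over ordered intermediate outputs.

    Assumes every output produced before the faulty task passes
    `is_valid`, and every output from the faulty task onwards fails it.
    Returns the index of the first bad output (the point to rerun from),
    or len(outputs) if everything looks healthy.
    """
    lo, hi = 0, len(outputs)
    while lo < hi:
        mid = (lo + hi) // 2          # inspect the middle log/file
        if is_valid(outputs[mid]):
            lo = mid + 1              # anomaly must be later
        else:
            hi = mid                  # anomaly is here or earlier
    return lo
```

With n intermediate files this needs only O(log n) checks instead of inspecting every file in order.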
  10. This is a simplified case study of an ETL flow at my previous company. When I joined, one monolithic ETL pipeline handled many stages of data processing in real time: • API (data ingestion) • Preprocessing • Mapping • Updating the DB with unseen data • Export to the data warehouse Case Study
  11. After Modularization... At first glance it actually looks far more complicated, as there are more tasks, but it has many advantages. Case Study
  12. Controlling the flow: as the number of tasks increases, we need to control the flow. How does task B know when task A has finished? How does it know what data to process? There are a number of different ways to accomplish this; we will look at one in particular. Problems
  13. Airflow is a tool made by Apache for scheduling and managing workflows. Airflow consists of an environment for running workflows and a Python framework for building workflows (called DAGs). Below is a simple example of an Airflow DAG. Airflow
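The example DAG itself is only shown as an image on the slide; a minimal sketch in the same spirit might look like the following (Airflow 2.x style; the `dag_id`, task ids, and commands are illustrative assumptions, not the deck's actual code):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A hypothetical three-task DAG: extract -> transform -> load,
# scheduled to run once per day.
with DAG(
    dag_id="simple_etl_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform = BashOperator(task_id="transform", bash_command="echo transforming")
    load = BashOperator(task_id="load", bash_command="echo loading")

    # `>>` declares downstream dependencies between tasks
    extract >> transform >> load
```

Dropping this file into Airflow's `dags/` folder is enough for the scheduler to pick it up.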
  14. At Nowcast we take these modular approaches to ETL design. We have many different consumer transaction datasets to process... Airflow manages many ETL pipelines that perform tasks including the preprocessing of our data and delivery of our data to our clients. Stats: • 12 DAGs (and growing) • Longest uptime without failure: 287 days • Largest DAG: ~130 tasks running daily ETL at Nowcast
  15. One project at Nowcast processes millions of rows of consumer transaction data on a daily basis. The pipeline contains many different steps, including: • UUID tagging • joining transactions to security codes • inserting newly seen data into an internal DB • anonymization • validation of data • loading mapped transaction data into an internal data warehouse ETL at Nowcast
  16. Below is the Airflow pipeline used to handle this data. Many of these steps can be run in parallel, and we run each task on a different ECS/Batch task, taking advantage of AWS's serverless environment. It is very easy to add/remove tasks in the Airflow code, which is written in Python. ETL at Nowcast
  17. How do you manage so many tasks? How can we manage the dependencies? Let's look at the code... ETL at Nowcast
  18. ETL at Nowcast. Everything is organized into builder functions; here, two tasks are defined.
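The slide's code is not reproduced in the transcript, but the builder-function pattern can be sketched with a tiny stand-in for Airflow's operator chaining. The `Task` class and all names below are illustrative assumptions, not Nowcast's code:

```python
class Task:
    """Minimal stand-in for an Airflow operator that supports `>>` chaining."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # `a >> b` makes b run after a, mirroring Airflow's syntax
        self.downstream.append(other)
        return other


def build_preprocess_tasks(version):
    """Builder function: defines two related tasks, wires them up,
    and hands them back so callers can attach them to a larger flow."""
    clean = Task(f"clean_{version}")
    validate = Task(f"validate_{version}")
    clean >> validate
    return clean, validate


# Pipelines for many data versions fall out of the same builder:
heads = [build_preprocess_tasks(v)[0] for v in ("v1", "v2", "v3")]
```

Because each builder returns its entry and exit tasks, the top-level DAG file only has to connect builders to builders rather than wiring individual tasks.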
  19. ETL at Nowcast. Pipelines for many versions can be built easily using these builder functions.
  20. ETL at Nowcast. The ETL pipeline's structure becomes obvious just by looking at which tasks are passed into which functions.
  21. ETL at Nowcast. If we don't do this, we would need to manually manage all of these dependencies.
  22. ETL at Nowcast. With over 130 tasks, it doesn't take much imagination to realize how unreadable this would be.
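For contrast, a hand-wired flow of even five tasks already shows how the flat edge list sprawls. This again uses a hypothetical `Task` stub in place of real Airflow operators, with made-up task names:

```python
class Task:
    """Minimal stand-in for an Airflow operator that supports `>>` chaining."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        self.downstream.append(other)
        return other


# Every dependency spelled out by hand -- with ~130 tasks this flat
# edge list becomes unreadable and easy to get wrong.
ingest = Task("ingest")
clean = Task("clean")
validate = Task("validate")
anonymize = Task("anonymize")
export = Task("export")

ingest >> clean
clean >> validate
clean >> anonymize
validate >> export
anonymize >> export
```

Builder functions hide exactly this boilerplate: the wiring lives next to the tasks it concerns, instead of in one long list at the bottom of the DAG file.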
  23. Interested in building ETL pipelines to process financial data? We are hiring. Thanks for Listening