(Transform) Load (Write) Large-scale aggregation and processing of big datasets is not feasible in the environments favored by data scientists, such as Jupyter or RStudio. It is often the responsibility of data engineers to support data scientists by building ETL pipelines to process this data. What is ETL?
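The three stages can be sketched as a minimal pipeline. The file formats, field names, and transform logic below are illustrative assumptions, not code from the talk:

```python
import csv
import json

def extract(path):
    # Extract: read raw rows from a CSV source
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: clean and enrich each row
    return [
        {"user": r["user"].strip().lower(), "amount": float(r["amount"])}
        for r in rows
        if r.get("amount")  # drop rows with no amount
    ]

def load(rows, path):
    # Load: write the processed rows to the destination
    with open(path, "w") as f:
        json.dump(rows, f)
```

In a real pipeline the destination is a data warehouse rather than a JSON file, but the extract/transform/load separation is the same.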
being inserted into the data warehouse is one of the worst problems a data engineer can face. Being able to debug issues in the ETL flow is very important. Modularization Patterns
anomaly persists, we check the file before it. In this case the data looks good! So we know the problem exists in task B, and we can rerun from that point. Modularization Patterns
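One way to implement this pattern (a sketch under my own assumptions, not the speaker's actual code) is to have every task write its output to a named intermediate file, so each task's result can be inspected after the fact and the pipeline can be restarted from the first bad step:

```python
import json
import os

def run_pipeline(tasks, workdir, data, start_at=0):
    """Run tasks in order; each task's output is saved to an
    intermediate file so it can be inspected or used as a restart point."""
    if start_at > 0:
        # Resuming: load the saved output of the last successful task
        with open(os.path.join(workdir, f"task_{start_at - 1}.json")) as f:
            data = json.load(f)
    for i in range(start_at, len(tasks)):
        data = tasks[i](data)
        with open(os.path.join(workdir, f"task_{i}.json"), "w") as f:
            json.dump(data, f)
    return data
```

If task A's output file looks good but task B's looks wrong, we fix task B and rerun with `start_at=1` instead of recomputing everything.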
simplified case study of an ETL flow at my previous company. When I joined, one monolithic ETL pipeline handled many stages of data processing in real time.
• API (data ingestion)
• Preprocessing
• Mapping
• Updating DB with unseen data
• Export to data warehouse
Case Study
As the number of tasks increases, we need to control the flow. How does task B know when task A has finished? How does it know what data to process? There are a number of different ways to accomplish this – we will look at one in particular. Problems
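One common low-tech answer, shown here as an illustrative sketch rather than the approach the talk settles on, is a sentinel file: task A writes a success marker next to its output only after the data is complete, and task B polls for the marker before reading. (The `_SUCCESS` file name follows the convention used by Hadoop and Spark; the data contents are placeholders.)

```python
import os
import time

def task_a(outdir):
    # Write the data first, then the marker, so the marker's
    # existence guarantees the data is complete.
    with open(os.path.join(outdir, "data.txt"), "w") as f:
        f.write("rows to process")
    open(os.path.join(outdir, "_SUCCESS"), "w").close()

def task_b(outdir, timeout=5.0, poll=0.1):
    # Wait for task A's marker, then read its output.
    deadline = time.monotonic() + timeout
    while not os.path.exists(os.path.join(outdir, "_SUCCESS")):
        if time.monotonic() > deadline:
            raise TimeoutError("task A did not finish in time")
        time.sleep(poll)
    with open(os.path.join(outdir, "data.txt")) as f:
        return f.read()
```

This works, but hand-rolling markers and polling for every pair of tasks does not scale, which is where a workflow scheduler comes in.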
made by Apache for scheduling and managing workflows. Airflow consists of an environment for running workflows and a Python framework for building workflows (called DAGs). Below is a simple example of an Airflow DAG. Airflow
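A minimal DAG looks something like the following. This is a generic Airflow 2.x sketch; the DAG id, schedule, and bash commands are placeholders, not the slide's actual example:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="simple_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    # ">>" declares ordering: extract runs before transform, then load.
    # Airflow's scheduler handles the "how does B know A finished?" problem.
    extract >> transform >> load
```

The dependencies are declared in plain Python, so adding a task is a one-line change plus an edit to the ordering expression.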
take these modular approaches to ETL design. We have many different consumer transaction datasets to process... Airflow manages many ETL pipelines that perform tasks including the preprocessing of our data and the delivery of our data to our clients.
Stats:
- 12 DAGs (and growing)
- Longest uptime without failure: 287 days
- Largest DAG: ~130 tasks running daily
ETL at Nowcast
Nowcast processes millions of rows of consumer transaction data on a daily basis. The pipeline contains many different steps, including:
- UUID tagging
- joining transactions to security codes
- inserting newly seen data into an internal DB
- anonymization
- validation of data
- loading mapped transaction data into the internal Data Warehouse
ETL at Nowcast
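Two of these steps, UUID tagging and anonymization, can be sketched in a few lines. The field names and salting scheme below are my own illustrative assumptions, not Nowcast's implementation:

```python
import hashlib
import uuid

def tag_with_uuid(rows):
    # UUID tagging: give every transaction a unique identifier
    # so it can be tracked through the rest of the pipeline
    return [{**r, "txn_id": str(uuid.uuid4())} for r in rows]

def anonymize(rows, salt):
    # Anonymization: replace the raw user identifier with a salted
    # hash, so records can still be grouped per user without
    # exposing the original value
    return [
        {**r, "user": hashlib.sha256((salt + r["user"]).encode()).hexdigest()}
        for r in rows
    ]
```

Steps like these are pure row-to-row transformations, which is what makes them easy to split into independent Airflow tasks.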
Airflow pipeline used to handle this data. Many of these steps can be run in parallel, and we run each task as a separate ECS/Batch task, taking advantage of AWS's serverless environment. It is very easy to add/remove tasks in the Airflow code, which is written in Python. ETL at Nowcast