Slide 1

Slide 1 text

Modularization in ETL
January 19, 2021
Nowcast Inc.
Todd Perry
Copyright(C) Nowcast, Inc. All rights reserved.

Slide 2

Slide 2 text

Summary
1. Self Introduction
2. What is ETL?
3. Modularization Patterns
4. Case Study
5. Problems
6. Airflow
7. ETL at Nowcast

Slide 3

Slide 3 text

Self Introduction
{
  'Name': 'Todd Perry',
  'From': 'Southampton, UK',
  'Company': 'Nowcast',
  'Role': 'Data Engineer/Scientist',
  'Hobbies': ['Cycling', 'Languages'],
  'Favorite Food': 'Mabo-dofu don',
}

Slide 4

Slide 4 text

What is ETL?
Extract (read), Transform (convert), Load (write)
Large-scale aggregation and processing of big datasets is not feasible in the environments favored by data scientists, such as Jupyter or RStudio. It is often the responsibility of data engineers to support data scientists by building ETL pipelines to process this data.

Slide 5

Slide 5 text

Modularization Patterns
There are a few approaches that can be taken to modularize ETL flows. To illustrate these patterns, let's consider a simple example.

Slide 6

Slide 6 text

Modularization Patterns
Splitting Tasks
This has a few merits, such as:
• Testability due to reduced complexity
• Scalability due to separation of concerns
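As a rough illustration of splitting tasks, the sketch below breaks one small flow into separate extract, transform, and load functions so each can be tested on its own. The data, the function bodies, and the dict standing in for a warehouse are all hypothetical and not taken from the slides.

```python
from typing import Iterable

# Hypothetical raw records; a real pipeline would read these from an API or file.
RAW_ROWS = ["apple,3", "banana,5", "apple,2"]


def extract(rows: Iterable[str]) -> list[tuple[str, int]]:
    """Parse raw 'name,count' strings into typed tuples."""
    parsed = []
    for row in rows:
        name, count = row.split(",")
        parsed.append((name, int(count)))
    return parsed


def transform(records: list[tuple[str, int]]) -> dict[str, int]:
    """Aggregate counts per name."""
    totals: dict[str, int] = {}
    for name, count in records:
        totals[name] = totals.get(name, 0) + count
    return totals


def load(totals: dict[str, int], warehouse: dict[str, int]) -> None:
    """Write aggregated totals into a stand-in 'warehouse' dict."""
    warehouse.update(totals)


warehouse: dict[str, int] = {}
load(transform(extract(RAW_ROWS)), warehouse)
print(warehouse)  # {'apple': 5, 'banana': 5}
```

Because each step is a plain function with explicit inputs and outputs, a unit test can exercise transform alone without touching the source or the warehouse.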

Slide 7

Slide 7 text

Modularization Patterns
Intermediate Data
This also has a few advantages:
• Historic reruns become easier
• More logs mean the system is easier to debug
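One way to picture intermediate data is a small runner that makes every task read its input from a file and write its output to a file. This is only a sketch: the run_task helper, the task bodies, the file naming scheme, and the sample rows are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def run_task(name, func, input_path, out_dir, run_date):
    """Run one task: read its input file, apply func, write the result to disk."""
    data = json.loads(Path(input_path).read_text())
    result = func(data)
    out_path = Path(out_dir) / f"{run_date}_{name}.json"
    out_path.write_text(json.dumps(result))
    return out_path


def task_a(rows):
    # Drop rows with missing amounts.
    return [r for r in rows if r["amount"] is not None]


def task_b(rows):
    # Sum amounts per shop.
    totals = {}
    for r in rows:
        totals[r["shop"]] = totals.get(r["shop"], 0) + r["amount"]
    return totals


work_dir = Path(tempfile.mkdtemp())
raw_path = work_dir / "2021-01-19_raw.json"
raw_path.write_text(json.dumps([
    {"shop": "x", "amount": 10},
    {"shop": "x", "amount": None},
    {"shop": "y", "amount": 4},
]))

a_out = run_task("task_a", task_a, raw_path, work_dir, "2021-01-19")
b_out = run_task("task_b", task_b, a_out, work_dir, "2021-01-19")

# Historic rerun of task B alone: task A's output is still on disk,
# so nothing upstream has to be repeated.
b_rerun = run_task("task_b", task_b, a_out, work_dir, "2021-01-19")
print(json.loads(b_rerun.read_text()))  # {'x': 10, 'y': 4}
```

The stored files double as logs: each one records exactly what a task produced for a given run date.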

Slide 8

Slide 8 text

Modularization Patterns
Debugging
Incorrect data being inserted into the data warehouse is arguably the worst problem a data engineer can face, so being able to debug issues in the ETL flow is very important.

Slide 9

Slide 9 text

Modularization Patterns
Debugging
The log files can be binary searched: we start at the middle log and check it for anomalies.

Slide 10

Slide 10 text

Modularization Patterns
Debugging
If the anomaly is present in the middle log, we check the file before it. In this case that data looks good, so we know the problem was introduced in task B, and we can rerun from this log.
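The search described above can be sketched as a standard binary search over the logged outputs of each task. The sketch assumes that once bad data appears it persists in every later log, and that the final output is already known to be bad. The function name, the sample logs, and the looks_ok check are hypothetical.

```python
def first_bad_stage(stage_outputs, looks_ok):
    """Return the index of the first stage whose output fails looks_ok.

    Assumes the fault persists downstream once introduced, and that the
    last stage's output is known to be bad.
    """
    lo, hi = 0, len(stage_outputs) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if looks_ok(stage_outputs[mid]):
            lo = mid + 1  # fault was introduced after mid
        else:
            hi = mid      # fault is at mid or earlier
    return lo


def looks_ok(values):
    # Hypothetical anomaly check: amounts should never be negative.
    return all(v >= 0 for v in values)


# Hypothetical logged outputs of tasks A, B, C, D.
logs = [
    [10, 20, 30],   # task A: good
    [10, -20, 30],  # task B: anomaly introduced here
    [10, -20, 30],  # task C: anomaly persists
    [10, -20, 30],  # task D: anomaly persists
]

bad = first_bad_stage(logs, looks_ok)
print("ABCD"[bad])  # B
```

With n logged stages this inspects about log2(n) files instead of all of them, and the last good log is the point to rerun from.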

Slide 11

Slide 11 text

Case Study
This is a simplified case study of an ETL flow at my previous company. When I joined, one monolithic ETL pipeline handled many stages of data processing in real time:
• API (data ingestion)
• Preprocessing
• Mapping
• Updating DB with unseen data
• Export to data warehouse

Slide 12

Slide 12 text

Case Study
After Modularization...
At first glance it actually looks far more complicated, as there are more tasks, but it has many advantages.

Slide 13

Slide 13 text

Case Study
After Modularization...
Move to batch processing

Slide 14

Slide 14 text

Case Study
After Modularization...
Logging for debuggability/reruns

Slide 15

Slide 15 text

Case Study
After Modularization...
Parallelism built into the workflow

Slide 16

Slide 16 text

Problems
Controlling the flow:
As the number of tasks increases, we need to control the flow. How does task B know when task A has finished? How does it know what data to process?
There are a number of different ways to accomplish this; we will look at one in particular.
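To make the control problem concrete, here is a minimal sketch of dependency-driven ordering using Python's standard-library graphlib module (Python 3.9+). It is not the tool the talk goes on to discuss, only an illustration of the idea that a task becomes runnable once everything it depends on has finished. The task names and the dependency map are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each task lists the tasks it must wait for.
dependencies = {
    "task_a": set(),
    "task_b": {"task_a"},
    "task_c": {"task_a"},
    "task_d": {"task_b", "task_c"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

batches = []
while sorter.is_active():
    # Tasks whose upstream tasks have all been marked done.
    ready = sorted(sorter.get_ready())
    batches.append(ready)
    for task in ready:
        # A real scheduler would run the task here, possibly in parallel.
        sorter.done(task)

print(batches)  # [['task_a'], ['task_b', 'task_c'], ['task_d']]
```

Tasks that land in the same batch do not depend on each other, which is where parallelism comes from.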

Slide 17

Slide 17 text

Airflow
Airflow is an Apache project for scheduling and managing workflows. It consists of an environment for running workflows and a Python framework for building workflows (called DAGs). Below is a simple example of an Airflow DAG.

Slide 18

Slide 18 text

ETL at Nowcast
At Nowcast we take these modular approaches to ETL design. We have many different consumer transaction datasets to process...
Airflow manages many ETL pipelines that perform tasks including the preprocessing of our data and the delivery of our data to our clients.
Stats:
- 12 DAGs (and growing)
- Longest uptime without failure: 287 days
- Largest DAG: ~130 tasks running daily

Slide 19

Slide 19 text

ETL at Nowcast
One project at Nowcast processes millions of rows of consumer transaction data on a daily basis. The pipeline contains many different steps, including:
- UUID tagging
- joining transactions to security codes
- inserting newly seen data into an internal DB
- anonymization
- validation of data
- loading mapped transaction data into the internal data warehouse

Slide 20

Slide 20 text

ETL at Nowcast
Below is the Airflow pipeline used to handle this data.
Many of these steps can be run in parallel, and we run each task on a different ECS/Batch task, taking advantage of AWS's serverless environment. It is very easy to add or remove tasks in the Airflow code, which is written in Python.

Slide 21

Slide 21 text

ETL at Nowcast
How do you manage so many tasks? How can we manage the dependencies? Let's look at the code...

Slide 22

Slide 22 text

ETL at Nowcast
Everything is organized in builder functions. Here, two tasks are defined.

Slide 23

Slide 23 text

ETL at Nowcast
We pass in the 'trigger' task that the group depends on.

Slide 24

Slide 24 text

ETL at Nowcast
And return the final task in the group.

Slide 25

Slide 25 text

ETL at Nowcast

Slide 26

Slide 26 text

ETL at Nowcast
Pipelines for many versions can be built easily using these builder functions.

Slide 27

Slide 27 text

ETL at Nowcast
The ETL pipeline's structure becomes obvious just by looking at which tasks are passed into which functions.
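The builder-function pattern described on the preceding slides can be sketched roughly as follows. This is not Nowcast's code and does not use Airflow itself: the Task class is a hypothetical minimal stand-in for an Airflow operator, and the builder names, task names, and version labels are made up for illustration. Only the `>>` convention (the right-hand task runs after the left-hand one, and the expression returns the right-hand task) mirrors Airflow.

```python
class Task:
    """Minimal stand-in for an Airflow operator (hypothetical, not Airflow's API)."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = []

    def __rshift__(self, other):
        # `a >> b` means b runs after a; returning b allows chaining.
        other.upstream.append(self)
        return other


def build_preprocess(trigger, version):
    """Define two tasks, attach them after `trigger`, return the last one."""
    tag = Task(f"uuid_tagging_{version}")
    join = Task(f"join_security_codes_{version}")
    trigger >> tag >> join
    return join


def build_load(trigger, version):
    """Same shape: the group starts at `trigger` and returns its final task."""
    validate = Task(f"validate_{version}")
    load = Task(f"load_warehouse_{version}")
    trigger >> validate >> load
    return load


def chain(task):
    """Walk upstream from a task; every task in this sketch has one parent at most."""
    ids = [task.task_id]
    while task.upstream:
        task = task.upstream[0]
        ids.append(task.task_id)
    return list(reversed(ids))


start = Task("start")
final_tasks = []
for version in ("v1", "v2"):
    # The pipeline structure is visible from what is passed into what.
    preprocessed = build_preprocess(start, version)
    final_tasks.append(build_load(preprocessed, version))

print(chain(final_tasks[0]))
# ['start', 'uuid_tagging_v1', 'join_security_codes_v1', 'validate_v1', 'load_warehouse_v1']
```

Each builder hides the wiring inside its group, so the top level only has to connect groups, and adding another version is one more loop iteration.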

Slide 28

Slide 28 text

ETL at Nowcast
If we don't do this, we would need to manually manage all of these dependencies.

Slide 29

Slide 29 text

ETL at Nowcast
With over 130 tasks, it doesn't take much imagination to realize how unreadable this would be.

Slide 30

Slide 30 text

ETL at Nowcast

Slide 31

Slide 31 text

Thanks for Listening
Interested in building ETL pipelines to process financial data? We are hiring.
