Building Robust ETL with Talend Open Studio and Pentaho Data Integration

Slide 1

Slide 1 text

IYKRA Data Fellowship Program Batch 6 Building Robust ETL with Talend Open Studio and Pentaho Data Integration by Aditya Satrya 2 Agustus 2021 Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Slide 2

Slide 2 text

Trainer Proﬁle Aditya Satrya Experiences: • Data Engineering Manager, Accenture AI (May 2021 – present) • Data Engineering Manager, Mekari (Jan 2020 – May 2021) • Head of Engineering, Jabar Digital Service (Oct 2018 – Des 2019) • CTO/Co-founder, Qiwii–Mobile Queueing App (Nov 2017 – Des 2018) • CTO/Co-founder, Nuesto Technology (2014 – 2018) Connect: • Linkedin: https://www.linkedin.com/in/asatrya/ • Telegram: @asatrya Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Slide 3

Slide 3 text

Table of Content ETL Introduction Robust ETL Principles Lab: Talend Open Studio Lab: Pentaho Data Integration Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Slide 4

Slide 4 text

ETL Introduction Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 ETL Introduction Robust ETL Principles Lab: Talend Open Studio Lab: Pentaho Data Integration

Slide 5

Slide 5 text

ETL INTRODUCTION OLTP vs OLAP Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 Image source:https://www.marklogic.com/blog/relational-databases-and-mixed-workloads/

Slide 6

Slide 6 text

ETL INTRODUCTION What is ETL? Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 Source: https://www.matillion.com/what-is-etl-the-ultimate-guide

Slide 7

Slide 7 text

ETL INTRODUCTION Why do we need ETL? Why not directly query the database? Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 ● Slow down the source system ● Transactional system is not optimized for reporting and analysis

Slide 8

Slide 8 text

ETL - EXTRACT Source systems are usually heterogeneous Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 ● Transactional applications ○ ERP, operational applications, marketing tools, etc ● APIs ● Sensor data ● SaaS ● Databases ● Files ● others

Slide 9

Slide 9 text

ETL - EXTRACT Know and understand your data source Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 ● Data type ● Update frequency ● Data quality (completeness, validity, accuracy, etc) ● Data owner ● Policy to access the data ● Data volume ● Table indexing ● Availability of timestamp field ● Schema changes frequency

Slide 10

Slide 10 text

ETL - EXTRACT Data extraction methods Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 ● Delta/incremental ○ SELECT query (require timestamp field) ○ CDC / change logs ● Full/snapshot ○ Source systems cannot identify which data has been changed

Slide 11

Slide 11 text

ETL - TRANSFORM Typical transformation job Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 ● Column mapping ● Assigning data type ● Select attributes ● Translation/standardization of value ● Data validation ● Splitting columns ● Join/lookup from multiple source ● Calculation using existing fields ● etc

Slide 12

Slide 12 text

ETL - TRANSFORM Typical problems related to data quality Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 ● Potential Problems with Data from a Single Source ○ Uniqueness ○ Redundancy/Duplicates ○ Outside domain range ○ Data entry errors ○ Referential integrity ● Potential Problems with Data from Multiple Sources ○ Naming conflicts at the schema level — using the same name for different things or using a different name for the same things ○ Naming conflicts at the data level ○ Inconsistent aggregating ○ Inconsistent timing

Slide 13

Slide 13 text

● ETL job/code maintainability ○ Change versioning ○ Modularity ○ Knowledge transfer ○ Changes in business requirement/logic ● ETL job/code resiliency to failure ○ Rapid increase in data volume ○ Circuit break when failure ○ Backfilling ○ Data loss ETL - TRANSFORM Typical problems related to data processing Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Slide 14

Slide 14 text

ETL - LOAD Typical target systems Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 ● Data lake ○ Common underlying storage: HDFS, Object storage ● Data warehouse ○ Traditional data warehouse ○ Cloud data warehouse (Massive Parallel Processing): BigQuery, Snowflake, Redshift

Slide 15

Slide 15 text

Robust ETL Principles Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 ETL Introduction Robust ETL Principles Lab: Talend Open Studio Lab: Pentaho Data Integration

Slide 16

Slide 16 text

ETL - PRINCIPLES Robust ETL Principles Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 #1 Extract and store data in raw format #2 Create idempotent and deterministic processes #3 Rest data between tasks #4 Validate data after each step

Slide 17

Slide 17 text

ETL - PRINCIPLES 1. Extract and store data in raw format Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 Why? ● Prevent losing data because of transformation failure ● Enable to reprocess data when there’s changes in business rules How? ● Don’t do any transformation before storing it for the first time ● Use format that have minimum metadata definition (i.e: CSV, JSON) ● Don’t use format like Parquet for ingestion

Slide 18

Slide 18 text

ETL - PRINCIPLES 2. Create idempotent & deterministic processes Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 Definition? ● Idempotent: Run multiple times without changing the result ● Deterministic: The results of a transformations only depends on determined input parameters ○ i.e: the result should NOT depend on the timing of it runs Why? ● Reproducible ○ When something breaks, just fix the underlying issue, then reruns the process without worrying about data consistency ○ When schema or business logic changes, just rerun the process ● Easier to debug ● Enable to create dev/staging environment that mirrors production

Slide 19

Slide 19 text

ETL - PRINCIPLES 2. Create idempotent & deterministic processes (cont’d) Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 How? ● Don’t append, overwrite the partition instead ● Don’t alter data, write new one instead ● Don’t produce side effect, treat your process as a function ● Define all factors that influence the result as input parameters ○ i.e.: Don't use date.today(), use parameter instead

Slide 20

Slide 20 text

ETL - PRINCIPLES 3. Rest data between tasks Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 Why? ● Enable data pipeline/ETL to run in cluster, means this is a good foundation to scaling How? ● Write result of a task to the storage ● Next task will read from the storage

Slide 21

Slide 21 text

ETL - PRINCIPLES 3. Rest data between tasks (cont’d) Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Slide 22

Slide 22 text

ETL - PRINCIPLES 4. Validate data after each step Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 Why? ● Prevent silent data error ● Never publish wrong data

Slide 23

Slide 23 text

Lab: Talend Open Studio Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 ETL Introduction Robust ETL Principles Lab: Talend Open Studio Lab: Pentaho Data Integration

Slide 24

Slide 24 text

TALEND OS - OVERVIEW Talend Components Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Slide 25

Slide 25 text

TALEND OS - OVERVIEW Talend for Data Management Reference Architecture Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Slide 26

Slide 26 text

TALEND OS - LAB PREPARATION Required Tools and Environment Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 ● Java 11 ● Talend Open Studio 7.3.1 ● MySQL Database ○ Installed locally ○ Database: northwind, dwh ○ Schema & data: https://github.com/jpwhite3/northwind-MySQL ● S3 ○ Bucket: data-fellowship-6 ○ AWS Key: ○ AWS Secret:

Slide 27

Slide 27 text

TALEND OS - LAB #1 - HELLO WORLD! Create a “Hello World!” dialog Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Slide 28

Slide 28 text

TALEND OS - LAB #2 - EXTRACTION Data extraction for delimited ﬁles (CSV) from S3 + Metadata Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Slide 29

Slide 29 text

TALEND OS - LAB #3 - EXTRACTION Data extraction for MySQL and store in S3 as CSV Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Slide 30

Slide 30 text

TALEND OS - LAB #4 - ETL CASE STUDY 4.1 Source Tables Schema Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Slide 31

Slide 31 text

TALEND OS - LAB #4 - ETL CASE STUDY 4.2 Business Questions Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 When do we have the most and least closed order last month? order_date count(order.id) status_name == closed order_date How many unique products are sold per day? count() count(product_id) status_name == closed order_date status_name == closed How much is our omzet per day? sum() quantity unit_price order_date

Slide 32

Slide 32 text

TALEND OS - LAB #4 - ETL CASE STUDY 4.3 Data Warehouse Table Schema (Dimensional Model) Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Slide 33

Slide 33 text

TALEND OS - LAB #4 - ETL CASE STUDY 4.4 ETL Job + Context Variable + Debugging Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Slide 34

Slide 34 text

TALEND OS - LAB #5 - BUILD & AUTOMATE JOB Build job + Run in command line + Override context vars Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 ● Case: ○ We will use external orchestrator (ie: Airflow) to run Talend jobs ○ The job will be executed every month, immediately after a month period ends ● Expected to run in command line: ./etl_mysql_2_run.sh \ --context_param start_period={{execution_date}} \ --context_param end_period=2006-02-31

Slide 35

Slide 35 text

TALEND OS - LAB #6 - ELT CASE STUDY 6.1 Extract & load jobs Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Slide 36

Slide 36 text

TALEND OS - LAB #6 - ELT CASE STUDY 6.2 Transform Job Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Slide 37

Slide 37 text

Lab: Pentaho Data Integration Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 ETL Introduction Robust ETL Principles Lab: Talend Open Studio Lab: Pentaho Data Integration

Slide 38

Slide 38 text

PENTAHO DI - LAB PREPARATION Required Tools and Environment Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 ● Java 8 ● Pentaho Data Integration 9.1 ● MySQL Database ○ Installed locally ○ Database: northwind, dwh ○ Schema & data: https://github.com/jpwhite3/northwind-MySQL ● S3 ○ Bucket: data-fellowship-6 ○ AWS Key: ○ AWS Secret:

Slide 39

Slide 39 text

PENTAHO DI - LAB #1 - HELLO WORLD! Write PDI version and “Hello World” to log Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Slide 40

Slide 40 text

PENTAHO DI - LAB #2 - EXTRACTION Data extraction for delimited ﬁles (CSV) from S3 Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Slide 41

Slide 41 text

PENTAHO DI - LAB #3 - EXTRACTION Data extraction from MySQL and store in S3 as CSV Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Slide 42

Slide 42 text

PENTAHO DI - LAB #4 - ETL CASE STUDY 4.1 Source Tables Schema Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Slide 43

Slide 43 text

PENTAHO DI - LAB #4 - ETL CASE STUDY 4.2 Business Questions Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 When do we have the most and least closed order last month? order_date count(order.id) status_name == closed order_date How many unique products are sold per day? order_date count(product_id) status_name == closed order_date status_name == closed How much is our omzet per day? sum() quantity unit_price order_date

Slide 44

Slide 44 text

PENTAHO DI - LAB #4 - ETL CASE STUDY 4.3 Data Warehouse Table Schema (Dimensional Model) Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Slide 45

Slide 45 text

PENTAHO DI - LAB #4 - ETL CASE STUDY 4.4 ETL Transformation + Parameters + Debugging Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Slide 46

Slide 46 text

PENTAHO DI - LAB #5 - BUILD & AUTOMATE JOB Build job + Run in command line + Override parameters Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 ● Case: ○ We will use external orchestrator (ie: Airflow) to run Pentaho jobs ○ The job will be executed every month, immediately after a month period ends ● Expected to run in command line: ./pan.sh \ -file /lab_04_etl.ktr \ -param:period_start=2006-02-01 \ -param:period_end=2006-02-28

Slide 47

Slide 47 text

PENTAHO DI - LAB #6 - ELT CASE STUDY ELT Job Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Slide 48

Slide 48 text

Practice Case Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Slide 49

Slide 49 text

PRACTICE CASE ETL Best Practices Implementation Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 ● Objective ○ Have a robust understanding on implementing good ETL principles ○ Able to implement ETL using a visual tool ● Instructions & Scoring ○ Implement ETL jobs/transformation in Lab #4 using Talend or Pentaho [score: 40] ○ Add several improvements: ■ Extract and store data in raw format (CSV) to S3 [score: 20] ■ Implement data quality check [score: 20] ■ Keep the ETL process idempotent [score: 10] ■ Keep the monthly batch period [score: 10] ○ You may make assumptions, but you need to state them

Slide 50

Slide 50 text

Thank you! IYKRA Data Fellowship Program Batch 6