Building Robust ETL with Talend Open Studio and Pentaho Data Integration

IYKRA Data Fellowship Program Batch 6 Building Robust ETL with
Talend Open Studio and Pentaho Data Integration by Aditya Satrya 2 Agustus 2021 Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Trainer Proﬁle Aditya Satrya Experiences: • Data Engineering Manager, Accenture
AI (May 2021 – present) • Data Engineering Manager, Mekari (Jan 2020 – May 2021) • Head of Engineering, Jabar Digital Service (Oct 2018 – Des 2019) • CTO/Co-founder, Qiwii–Mobile Queueing App (Nov 2017 – Des 2018) • CTO/Co-founder, Nuesto Technology (2014 – 2018) Connect: • Linkedin: https://www.linkedin.com/in/asatrya/ • Telegram: @asatrya Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Table of Content ETL Introduction Robust ETL Principles Lab: Talend
Open Studio Lab: Pentaho Data Integration Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

ETL Introduction Proprietary of IYKRA - Data Fellowship Batch 6
- 2021 ETL Introduction Robust ETL Principles Lab: Talend Open Studio Lab: Pentaho Data Integration

ETL INTRODUCTION OLTP vs OLAP Proprietary of IYKRA - Data
Fellowship Batch 6 - 2021 Image source:https://www.marklogic.com/blog/relational-databases-and-mixed-workloads/

ETL INTRODUCTION What is ETL? Proprietary of IYKRA - Data
Fellowship Batch 6 - 2021 Source: https://www.matillion.com/what-is-etl-the-ultimate-guide

ETL INTRODUCTION Why do we need ETL? Why not directly
query the database? Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 • Slow down the source system • Transactional system is not optimized for reporting and analysis

ETL - EXTRACT Source systems are usually heterogeneous Proprietary of
IYKRA - Data Fellowship Batch 6 - 2021 • Transactional applications ◦ ERP, operational applications, marketing tools, etc • APIs • Sensor data • SaaS • Databases • Files • others

ETL - EXTRACT Know and understand your data source Proprietary
of IYKRA - Data Fellowship Batch 6 - 2021 • Data type • Update frequency • Data quality (completeness, validity, accuracy, etc) • Data owner • Policy to access the data • Data volume • Table indexing • Availability of timestamp field • Schema changes frequency

ETL - EXTRACT Data extraction methods Proprietary of IYKRA -
Data Fellowship Batch 6 - 2021 • Delta/incremental ◦ SELECT query (require timestamp field) ◦ CDC / change logs • Full/snapshot ◦ Source systems cannot identify which data has been changed

ETL - TRANSFORM Typical transformation job Proprietary of IYKRA -
Data Fellowship Batch 6 - 2021 • Column mapping • Assigning data type • Select attributes • Translation/standardization of value • Data validation • Splitting columns • Join/lookup from multiple source • Calculation using existing fields • etc

ETL - TRANSFORM Typical problems related to data quality Proprietary
of IYKRA - Data Fellowship Batch 6 - 2021 • Potential Problems with Data from a Single Source ◦ Uniqueness ◦ Redundancy/Duplicates ◦ Outside domain range ◦ Data entry errors ◦ Referential integrity • Potential Problems with Data from Multiple Sources ◦ Naming conflicts at the schema level — using the same name for different things or using a different name for the same things ◦ Naming conflicts at the data level ◦ Inconsistent aggregating ◦ Inconsistent timing

• ETL job/code maintainability ◦ Change versioning ◦ Modularity ◦
Knowledge transfer ◦ Changes in business requirement/logic • ETL job/code resiliency to failure ◦ Rapid increase in data volume ◦ Circuit break when failure ◦ Backfilling ◦ Data loss ETL - TRANSFORM Typical problems related to data processing Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

ETL - LOAD Typical target systems Proprietary of IYKRA -
Data Fellowship Batch 6 - 2021 • Data lake ◦ Common underlying storage: HDFS, Object storage • Data warehouse ◦ Traditional data warehouse ◦ Cloud data warehouse (Massive Parallel Processing): BigQuery, Snowflake, Redshift

Robust ETL Principles Proprietary of IYKRA - Data Fellowship Batch
6 - 2021 ETL Introduction Robust ETL Principles Lab: Talend Open Studio Lab: Pentaho Data Integration

ETL - PRINCIPLES Robust ETL Principles Proprietary of IYKRA -
Data Fellowship Batch 6 - 2021 #1 Extract and store data in raw format #2 Create idempotent and deterministic processes #3 Rest data between tasks #4 Validate data after each step

ETL - PRINCIPLES 1. Extract and store data in raw
format Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 Why? • Prevent losing data because of transformation failure • Enable to reprocess data when there’s changes in business rules How? • Don’t do any transformation before storing it for the first time • Use format that have minimum metadata definition (i.e: CSV, JSON) • Don’t use format like Parquet for ingestion

ETL - PRINCIPLES 2. Create idempotent & deterministic processes Proprietary
of IYKRA - Data Fellowship Batch 6 - 2021 Definition? • Idempotent: Run multiple times without changing the result • Deterministic: The results of a transformations only depends on determined input parameters ◦ i.e: the result should NOT depend on the timing of it runs Why? • Reproducible ◦ When something breaks, just fix the underlying issue, then reruns the process without worrying about data consistency ◦ When schema or business logic changes, just rerun the process • Easier to debug • Enable to create dev/staging environment that mirrors production

ETL - PRINCIPLES 2. Create idempotent & deterministic processes (cont’d)
Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 How? • Don’t append, overwrite the partition instead • Don’t alter data, write new one instead • Don’t produce side effect, treat your process as a function • Define all factors that influence the result as input parameters ◦ i.e.: Don't use date.today(), use parameter instead

ETL - PRINCIPLES 3. Rest data between tasks Proprietary of
IYKRA - Data Fellowship Batch 6 - 2021 Why? • Enable data pipeline/ETL to run in cluster, means this is a good foundation to scaling How? • Write result of a task to the storage • Next task will read from the storage

ETL - PRINCIPLES 3. Rest data between tasks (cont’d) Proprietary
of IYKRA - Data Fellowship Batch 6 - 2021

ETL - PRINCIPLES 4. Validate data after each step Proprietary
of IYKRA - Data Fellowship Batch 6 - 2021 Why? • Prevent silent data error • Never publish wrong data

Lab: Talend Open Studio Proprietary of IYKRA - Data Fellowship
Batch 6 - 2021 ETL Introduction Robust ETL Principles Lab: Talend Open Studio Lab: Pentaho Data Integration

TALEND OS - OVERVIEW Talend Components Proprietary of IYKRA -
Data Fellowship Batch 6 - 2021

TALEND OS - OVERVIEW Talend for Data Management Reference Architecture
Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

TALEND OS - LAB PREPARATION Required Tools and Environment Proprietary
of IYKRA - Data Fellowship Batch 6 - 2021 • Java 11 • Talend Open Studio 7.3.1 • MySQL Database ◦ Installed locally ◦ Database: northwind, dwh ◦ Schema & data: https://github.com/jpwhite3/northwind-MySQL • S3 ◦ Bucket: data-fellowship-6 ◦ AWS Key: <ask your instructor> ◦ AWS Secret: <ask your instructor>

TALEND OS - LAB #1 - HELLO WORLD! Create a
“Hello World!” dialog Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

TALEND OS - LAB #2 - EXTRACTION Data extraction for
delimited ﬁles (CSV) from S3 + Metadata Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

TALEND OS - LAB #3 - EXTRACTION Data extraction for
MySQL and store in S3 as CSV Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

TALEND OS - LAB #4 - ETL CASE STUDY 4.1
Source Tables Schema Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Business Questions Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 When do we have the most and least closed order last month? order_date count(order.id) status_name == closed order_date How many unique products are sold per day? count() count(product_id) status_name == closed order_date status_name == closed How much is our omzet per day? sum() quantity unit_price order_date

Data Warehouse Table Schema (Dimensional Model) Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

ETL Job + Context Variable + Debugging Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

TALEND OS - LAB #5 - BUILD & AUTOMATE JOB
Build job + Run in command line + Override context vars Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 • Case: ◦ We will use external orchestrator (ie: Airflow) to run Talend jobs ◦ The job will be executed every month, immediately after a month period ends • Expected to run in command line: ./etl_mysql_2_run.sh \ --context_param start_period={{execution_date}} \ --context_param end_period=2006-02-31

TALEND OS - LAB #6 - ELT CASE STUDY 6.1
Extract & load jobs Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

TALEND OS - LAB #6 - ELT CASE STUDY 6.2
Transform Job Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Lab: Pentaho Data Integration Proprietary of IYKRA - Data Fellowship
Batch 6 - 2021 ETL Introduction Robust ETL Principles Lab: Talend Open Studio Lab: Pentaho Data Integration

PENTAHO DI - LAB PREPARATION Required Tools and Environment Proprietary
of IYKRA - Data Fellowship Batch 6 - 2021 • Java 8 • Pentaho Data Integration 9.1 • MySQL Database ◦ Installed locally ◦ Database: northwind, dwh ◦ Schema & data: https://github.com/jpwhite3/northwind-MySQL • S3 ◦ Bucket: data-fellowship-6 ◦ AWS Key: <ask your instructor> ◦ AWS Secret: <ask your instructor>

PENTAHO DI - LAB #1 - HELLO WORLD! Write PDI
version and “Hello World” to log Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

PENTAHO DI - LAB #2 - EXTRACTION Data extraction for
delimited ﬁles (CSV) from S3 Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

PENTAHO DI - LAB #3 - EXTRACTION Data extraction from
MySQL and store in S3 as CSV Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

PENTAHO DI - LAB #4 - ETL CASE STUDY 4.1
Source Tables Schema Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Business Questions Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 When do we have the most and least closed order last month? order_date count(order.id) status_name == closed order_date How many unique products are sold per day? order_date count(product_id) status_name == closed order_date status_name == closed How much is our omzet per day? sum() quantity unit_price order_date

Data Warehouse Table Schema (Dimensional Model) Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

ETL Transformation + Parameters + Debugging Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

PENTAHO DI - LAB #5 - BUILD & AUTOMATE JOB
Build job + Run in command line + Override parameters Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 • Case: ◦ We will use external orchestrator (ie: Airflow) to run Pentaho jobs ◦ The job will be executed every month, immediately after a month period ends • Expected to run in command line: ./pan.sh \ -file <path-to-file>/lab_04_etl.ktr \ -param:period_start=2006-02-01 \ -param:period_end=2006-02-28

PENTAHO DI - LAB #6 - ELT CASE STUDY ELT
Job Proprietary of IYKRA - Data Fellowship Batch 6 - 2021

Practice Case Proprietary of IYKRA - Data Fellowship Batch 6
- 2021

PRACTICE CASE ETL Best Practices Implementation Proprietary of IYKRA -
Data Fellowship Batch 6 - 2021 • Objective ◦ Have a robust understanding on implementing good ETL principles ◦ Able to implement ETL using a visual tool • Instructions & Scoring ◦ Implement ETL jobs/transformation in Lab #4 using Talend or Pentaho [score: 40] ◦ Add several improvements: ▪ Extract and store data in raw format (CSV) to S3 [score: 20] ▪ Implement data quality check [score: 20] ▪ Keep the ETL process idempotent [score: 10] ▪ Keep the monthly batch period [score: 10] ◦ You may make assumptions, but you need to state them

Thank you! IYKRA Data Fellowship Program Batch 6

Building Robust ETL with Talend Open Studio and...

Building Robust ETL with Talend Open Studio and Pentaho Data Integration

More Decks by Aditya Satrya

Other Decks in Technology

Featured

Transcript