Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Building Robust ETL with Talend Open Studio and...

Building Robust ETL with Talend Open Studio and Pentaho Data Integration

My deck for delivering one of the topics at Data Fellowship, a full scholarship data training program by IYKRA.

Avatar for Aditya Satrya

Aditya Satrya

August 02, 2021
Tweet

More Decks by Aditya Satrya

Other Decks in Technology

Transcript

  1. IYKRA Data Fellowship Program Batch 6 Building Robust ETL with

    Talend Open Studio and Pentaho Data Integration by Aditya Satrya 2 Agustus 2021 Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  2. Trainer Profile Aditya Satrya Experiences: • Data Engineering Manager, Accenture

    AI (May 2021 – present) • Data Engineering Manager, Mekari (Jan 2020 – May 2021) • Head of Engineering, Jabar Digital Service (Oct 2018 – Des 2019) • CTO/Co-founder, Qiwii–Mobile Queueing App (Nov 2017 – Des 2018) • CTO/Co-founder, Nuesto Technology (2014 – 2018) Connect: • Linkedin: https://www.linkedin.com/in/asatrya/ • Telegram: @asatrya Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  3. Table of Content ETL Introduction Robust ETL Principles Lab: Talend

    Open Studio Lab: Pentaho Data Integration Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  4. ETL Introduction Proprietary of IYKRA - Data Fellowship Batch 6

    - 2021 ETL Introduction Robust ETL Principles Lab: Talend Open Studio Lab: Pentaho Data Integration
  5. ETL INTRODUCTION OLTP vs OLAP Proprietary of IYKRA - Data

    Fellowship Batch 6 - 2021 Image source:https://www.marklogic.com/blog/relational-databases-and-mixed-workloads/
  6. ETL INTRODUCTION What is ETL? Proprietary of IYKRA - Data

    Fellowship Batch 6 - 2021 Source: https://www.matillion.com/what-is-etl-the-ultimate-guide
  7. ETL INTRODUCTION Why do we need ETL? Why not directly

    query the database? Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 • Slow down the source system • Transactional system is not optimized for reporting and analysis
  8. ETL - EXTRACT Source systems are usually heterogeneous Proprietary of

    IYKRA - Data Fellowship Batch 6 - 2021 • Transactional applications ◦ ERP, operational applications, marketing tools, etc • APIs • Sensor data • SaaS • Databases • Files • others
  9. ETL - EXTRACT Know and understand your data source Proprietary

    of IYKRA - Data Fellowship Batch 6 - 2021 • Data type • Update frequency • Data quality (completeness, validity, accuracy, etc) • Data owner • Policy to access the data • Data volume • Table indexing • Availability of timestamp field • Schema changes frequency
  10. ETL - EXTRACT Data extraction methods Proprietary of IYKRA -

    Data Fellowship Batch 6 - 2021 • Delta/incremental ◦ SELECT query (require timestamp field) ◦ CDC / change logs • Full/snapshot ◦ Source systems cannot identify which data has been changed
  11. ETL - TRANSFORM Typical transformation job Proprietary of IYKRA -

    Data Fellowship Batch 6 - 2021 • Column mapping • Assigning data type • Select attributes • Translation/standardization of value • Data validation • Splitting columns • Join/lookup from multiple source • Calculation using existing fields • etc
  12. ETL - TRANSFORM Typical problems related to data quality Proprietary

    of IYKRA - Data Fellowship Batch 6 - 2021 • Potential Problems with Data from a Single Source ◦ Uniqueness ◦ Redundancy/Duplicates ◦ Outside domain range ◦ Data entry errors ◦ Referential integrity • Potential Problems with Data from Multiple Sources ◦ Naming conflicts at the schema level — using the same name for different things or using a different name for the same things ◦ Naming conflicts at the data level ◦ Inconsistent aggregating ◦ Inconsistent timing
  13. • ETL job/code maintainability ◦ Change versioning ◦ Modularity ◦

    Knowledge transfer ◦ Changes in business requirement/logic • ETL job/code resiliency to failure ◦ Rapid increase in data volume ◦ Circuit break when failure ◦ Backfilling ◦ Data loss ETL - TRANSFORM Typical problems related to data processing Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  14. ETL - LOAD Typical target systems Proprietary of IYKRA -

    Data Fellowship Batch 6 - 2021 • Data lake ◦ Common underlying storage: HDFS, Object storage • Data warehouse ◦ Traditional data warehouse ◦ Cloud data warehouse (Massive Parallel Processing): BigQuery, Snowflake, Redshift
  15. Robust ETL Principles Proprietary of IYKRA - Data Fellowship Batch

    6 - 2021 ETL Introduction Robust ETL Principles Lab: Talend Open Studio Lab: Pentaho Data Integration
  16. ETL - PRINCIPLES Robust ETL Principles Proprietary of IYKRA -

    Data Fellowship Batch 6 - 2021 #1 Extract and store data in raw format #2 Create idempotent and deterministic processes #3 Rest data between tasks #4 Validate data after each step
  17. ETL - PRINCIPLES 1. Extract and store data in raw

    format Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 Why? • Prevent losing data because of transformation failure • Enable to reprocess data when there’s changes in business rules How? • Don’t do any transformation before storing it for the first time • Use format that have minimum metadata definition (i.e: CSV, JSON) • Don’t use format like Parquet for ingestion
  18. ETL - PRINCIPLES 2. Create idempotent & deterministic processes Proprietary

    of IYKRA - Data Fellowship Batch 6 - 2021 Definition? • Idempotent: Run multiple times without changing the result • Deterministic: The results of a transformations only depends on determined input parameters ◦ i.e: the result should NOT depend on the timing of it runs Why? • Reproducible ◦ When something breaks, just fix the underlying issue, then reruns the process without worrying about data consistency ◦ When schema or business logic changes, just rerun the process • Easier to debug • Enable to create dev/staging environment that mirrors production
  19. ETL - PRINCIPLES 2. Create idempotent & deterministic processes (cont’d)

    Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 How? • Don’t append, overwrite the partition instead • Don’t alter data, write new one instead • Don’t produce side effect, treat your process as a function • Define all factors that influence the result as input parameters ◦ i.e.: Don't use date.today(), use parameter instead
  20. ETL - PRINCIPLES 3. Rest data between tasks Proprietary of

    IYKRA - Data Fellowship Batch 6 - 2021 Why? • Enable data pipeline/ETL to run in cluster, means this is a good foundation to scaling How? • Write result of a task to the storage • Next task will read from the storage
  21. ETL - PRINCIPLES 3. Rest data between tasks (cont’d) Proprietary

    of IYKRA - Data Fellowship Batch 6 - 2021
  22. ETL - PRINCIPLES 4. Validate data after each step Proprietary

    of IYKRA - Data Fellowship Batch 6 - 2021 Why? • Prevent silent data error • Never publish wrong data
  23. Lab: Talend Open Studio Proprietary of IYKRA - Data Fellowship

    Batch 6 - 2021 ETL Introduction Robust ETL Principles Lab: Talend Open Studio Lab: Pentaho Data Integration
  24. TALEND OS - OVERVIEW Talend for Data Management Reference Architecture

    Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  25. TALEND OS - LAB PREPARATION Required Tools and Environment Proprietary

    of IYKRA - Data Fellowship Batch 6 - 2021 • Java 11 • Talend Open Studio 7.3.1 • MySQL Database ◦ Installed locally ◦ Database: northwind, dwh ◦ Schema & data: https://github.com/jpwhite3/northwind-MySQL • S3 ◦ Bucket: data-fellowship-6 ◦ AWS Key: <ask your instructor> ◦ AWS Secret: <ask your instructor>
  26. TALEND OS - LAB #1 - HELLO WORLD! Create a

    “Hello World!” dialog Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  27. TALEND OS - LAB #2 - EXTRACTION Data extraction for

    delimited files (CSV) from S3 + Metadata Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  28. TALEND OS - LAB #3 - EXTRACTION Data extraction for

    MySQL and store in S3 as CSV Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  29. TALEND OS - LAB #4 - ETL CASE STUDY 4.1

    Source Tables Schema Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  30. TALEND OS - LAB #4 - ETL CASE STUDY 4.2

    Business Questions Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 When do we have the most and least closed order last month? order_date count(order.id) status_name == closed order_date How many unique products are sold per day? count() count(product_id) status_name == closed order_date status_name == closed How much is our omzet per day? sum() quantity unit_price order_date
  31. TALEND OS - LAB #4 - ETL CASE STUDY 4.3

    Data Warehouse Table Schema (Dimensional Model) Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  32. TALEND OS - LAB #4 - ETL CASE STUDY 4.4

    ETL Job + Context Variable + Debugging Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  33. TALEND OS - LAB #5 - BUILD & AUTOMATE JOB

    Build job + Run in command line + Override context vars Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 • Case: ◦ We will use external orchestrator (ie: Airflow) to run Talend jobs ◦ The job will be executed every month, immediately after a month period ends • Expected to run in command line: ./etl_mysql_2_run.sh \ --context_param start_period={{execution_date}} \ --context_param end_period=2006-02-31
  34. TALEND OS - LAB #6 - ELT CASE STUDY 6.1

    Extract & load jobs Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  35. TALEND OS - LAB #6 - ELT CASE STUDY 6.2

    Transform Job Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  36. Lab: Pentaho Data Integration Proprietary of IYKRA - Data Fellowship

    Batch 6 - 2021 ETL Introduction Robust ETL Principles Lab: Talend Open Studio Lab: Pentaho Data Integration
  37. PENTAHO DI - LAB PREPARATION Required Tools and Environment Proprietary

    of IYKRA - Data Fellowship Batch 6 - 2021 • Java 8 • Pentaho Data Integration 9.1 • MySQL Database ◦ Installed locally ◦ Database: northwind, dwh ◦ Schema & data: https://github.com/jpwhite3/northwind-MySQL • S3 ◦ Bucket: data-fellowship-6 ◦ AWS Key: <ask your instructor> ◦ AWS Secret: <ask your instructor>
  38. PENTAHO DI - LAB #1 - HELLO WORLD! Write PDI

    version and “Hello World” to log Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  39. PENTAHO DI - LAB #2 - EXTRACTION Data extraction for

    delimited files (CSV) from S3 Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  40. PENTAHO DI - LAB #3 - EXTRACTION Data extraction from

    MySQL and store in S3 as CSV Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  41. PENTAHO DI - LAB #4 - ETL CASE STUDY 4.1

    Source Tables Schema Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  42. PENTAHO DI - LAB #4 - ETL CASE STUDY 4.2

    Business Questions Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 When do we have the most and least closed order last month? order_date count(order.id) status_name == closed order_date How many unique products are sold per day? order_date count(product_id) status_name == closed order_date status_name == closed How much is our omzet per day? sum() quantity unit_price order_date
  43. PENTAHO DI - LAB #4 - ETL CASE STUDY 4.3

    Data Warehouse Table Schema (Dimensional Model) Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  44. PENTAHO DI - LAB #4 - ETL CASE STUDY 4.4

    ETL Transformation + Parameters + Debugging Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  45. PENTAHO DI - LAB #5 - BUILD & AUTOMATE JOB

    Build job + Run in command line + Override parameters Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 • Case: ◦ We will use external orchestrator (ie: Airflow) to run Pentaho jobs ◦ The job will be executed every month, immediately after a month period ends • Expected to run in command line: ./pan.sh \ -file <path-to-file>/lab_04_etl.ktr \ -param:period_start=2006-02-01 \ -param:period_end=2006-02-28
  46. PENTAHO DI - LAB #6 - ELT CASE STUDY ELT

    Job Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  47. PRACTICE CASE ETL Best Practices Implementation Proprietary of IYKRA -

    Data Fellowship Batch 6 - 2021 • Objective ◦ Have a robust understanding on implementing good ETL principles ◦ Able to implement ETL using a visual tool • Instructions & Scoring ◦ Implement ETL jobs/transformation in Lab #4 using Talend or Pentaho [score: 40] ◦ Add several improvements: ▪ Extract and store data in raw format (CSV) to S3 [score: 20] ▪ Implement data quality check [score: 20] ▪ Keep the ETL process idempotent [score: 10] ▪ Keep the monthly batch period [score: 10] ◦ You may make assumptions, but you need to state them