Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Building Robust ETL with Talend Open Studio and...

Building Robust ETL with Talend Open Studio and Pentaho Data Integration

My deck for delivering one of the topics at Data Fellowship, a full scholarship data training program by IYKRA.

Aditya Satrya

August 02, 2021
Tweet

More Decks by Aditya Satrya

Other Decks in Technology

Transcript

  1. IYKRA Data Fellowship Program Batch 6 Building Robust ETL with

    Talend Open Studio and Pentaho Data Integration by Aditya Satrya 2 Agustus 2021 Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  2. Trainer Profile Aditya Satrya Experiences: • Data Engineering Manager, Accenture

    AI (May 2021 – present) • Data Engineering Manager, Mekari (Jan 2020 – May 2021) • Head of Engineering, Jabar Digital Service (Oct 2018 – Des 2019) • CTO/Co-founder, Qiwii–Mobile Queueing App (Nov 2017 – Des 2018) • CTO/Co-founder, Nuesto Technology (2014 – 2018) Connect: • Linkedin: https://www.linkedin.com/in/asatrya/ • Telegram: @asatrya Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  3. Table of Content ETL Introduction Robust ETL Principles Lab: Talend

    Open Studio Lab: Pentaho Data Integration Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  4. ETL Introduction Proprietary of IYKRA - Data Fellowship Batch 6

    - 2021 ETL Introduction Robust ETL Principles Lab: Talend Open Studio Lab: Pentaho Data Integration
  5. ETL INTRODUCTION OLTP vs OLAP Proprietary of IYKRA - Data

    Fellowship Batch 6 - 2021 Image source:https://www.marklogic.com/blog/relational-databases-and-mixed-workloads/
  6. ETL INTRODUCTION What is ETL? Proprietary of IYKRA - Data

    Fellowship Batch 6 - 2021 Source: https://www.matillion.com/what-is-etl-the-ultimate-guide
  7. ETL INTRODUCTION Why do we need ETL? Why not directly

    query the database? Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 • Slow down the source system • Transactional system is not optimized for reporting and analysis
  8. ETL - EXTRACT Source systems are usually heterogeneous Proprietary of

    IYKRA - Data Fellowship Batch 6 - 2021 • Transactional applications ◦ ERP, operational applications, marketing tools, etc • APIs • Sensor data • SaaS • Databases • Files • others
  9. ETL - EXTRACT Know and understand your data source Proprietary

    of IYKRA - Data Fellowship Batch 6 - 2021 • Data type • Update frequency • Data quality (completeness, validity, accuracy, etc) • Data owner • Policy to access the data • Data volume • Table indexing • Availability of timestamp field • Schema changes frequency
  10. ETL - EXTRACT Data extraction methods Proprietary of IYKRA -

    Data Fellowship Batch 6 - 2021 • Delta/incremental ◦ SELECT query (require timestamp field) ◦ CDC / change logs • Full/snapshot ◦ Source systems cannot identify which data has been changed
  11. ETL - TRANSFORM Typical transformation job Proprietary of IYKRA -

    Data Fellowship Batch 6 - 2021 • Column mapping • Assigning data type • Select attributes • Translation/standardization of value • Data validation • Splitting columns • Join/lookup from multiple source • Calculation using existing fields • etc
  12. ETL - TRANSFORM Typical problems related to data quality Proprietary

    of IYKRA - Data Fellowship Batch 6 - 2021 • Potential Problems with Data from a Single Source ◦ Uniqueness ◦ Redundancy/Duplicates ◦ Outside domain range ◦ Data entry errors ◦ Referential integrity • Potential Problems with Data from Multiple Sources ◦ Naming conflicts at the schema level — using the same name for different things or using a different name for the same things ◦ Naming conflicts at the data level ◦ Inconsistent aggregating ◦ Inconsistent timing
  13. • ETL job/code maintainability ◦ Change versioning ◦ Modularity ◦

    Knowledge transfer ◦ Changes in business requirement/logic • ETL job/code resiliency to failure ◦ Rapid increase in data volume ◦ Circuit break when failure ◦ Backfilling ◦ Data loss ETL - TRANSFORM Typical problems related to data processing Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  14. ETL - LOAD Typical target systems Proprietary of IYKRA -

    Data Fellowship Batch 6 - 2021 • Data lake ◦ Common underlying storage: HDFS, Object storage • Data warehouse ◦ Traditional data warehouse ◦ Cloud data warehouse (Massive Parallel Processing): BigQuery, Snowflake, Redshift
  15. Robust ETL Principles Proprietary of IYKRA - Data Fellowship Batch

    6 - 2021 ETL Introduction Robust ETL Principles Lab: Talend Open Studio Lab: Pentaho Data Integration
  16. ETL - PRINCIPLES Robust ETL Principles Proprietary of IYKRA -

    Data Fellowship Batch 6 - 2021 #1 Extract and store data in raw format #2 Create idempotent and deterministic processes #3 Rest data between tasks #4 Validate data after each step
  17. ETL - PRINCIPLES 1. Extract and store data in raw

    format Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 Why? • Prevent losing data because of transformation failure • Enable to reprocess data when there’s changes in business rules How? • Don’t do any transformation before storing it for the first time • Use format that have minimum metadata definition (i.e: CSV, JSON) • Don’t use format like Parquet for ingestion
  18. ETL - PRINCIPLES 2. Create idempotent & deterministic processes Proprietary

    of IYKRA - Data Fellowship Batch 6 - 2021 Definition? • Idempotent: Run multiple times without changing the result • Deterministic: The results of a transformations only depends on determined input parameters ◦ i.e: the result should NOT depend on the timing of it runs Why? • Reproducible ◦ When something breaks, just fix the underlying issue, then reruns the process without worrying about data consistency ◦ When schema or business logic changes, just rerun the process • Easier to debug • Enable to create dev/staging environment that mirrors production
  19. ETL - PRINCIPLES 2. Create idempotent & deterministic processes (cont’d)

    Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 How? • Don’t append, overwrite the partition instead • Don’t alter data, write new one instead • Don’t produce side effect, treat your process as a function • Define all factors that influence the result as input parameters ◦ i.e.: Don't use date.today(), use parameter instead
  20. ETL - PRINCIPLES 3. Rest data between tasks Proprietary of

    IYKRA - Data Fellowship Batch 6 - 2021 Why? • Enable data pipeline/ETL to run in cluster, means this is a good foundation to scaling How? • Write result of a task to the storage • Next task will read from the storage
  21. ETL - PRINCIPLES 3. Rest data between tasks (cont’d) Proprietary

    of IYKRA - Data Fellowship Batch 6 - 2021
  22. ETL - PRINCIPLES 4. Validate data after each step Proprietary

    of IYKRA - Data Fellowship Batch 6 - 2021 Why? • Prevent silent data error • Never publish wrong data
  23. Lab: Talend Open Studio Proprietary of IYKRA - Data Fellowship

    Batch 6 - 2021 ETL Introduction Robust ETL Principles Lab: Talend Open Studio Lab: Pentaho Data Integration
  24. TALEND OS - OVERVIEW Talend for Data Management Reference Architecture

    Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  25. TALEND OS - LAB PREPARATION Required Tools and Environment Proprietary

    of IYKRA - Data Fellowship Batch 6 - 2021 • Java 11 • Talend Open Studio 7.3.1 • MySQL Database ◦ Installed locally ◦ Database: northwind, dwh ◦ Schema & data: https://github.com/jpwhite3/northwind-MySQL • S3 ◦ Bucket: data-fellowship-6 ◦ AWS Key: <ask your instructor> ◦ AWS Secret: <ask your instructor>
  26. TALEND OS - LAB #1 - HELLO WORLD! Create a

    “Hello World!” dialog Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  27. TALEND OS - LAB #2 - EXTRACTION Data extraction for

    delimited files (CSV) from S3 + Metadata Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  28. TALEND OS - LAB #3 - EXTRACTION Data extraction for

    MySQL and store in S3 as CSV Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  29. TALEND OS - LAB #4 - ETL CASE STUDY 4.1

    Source Tables Schema Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  30. TALEND OS - LAB #4 - ETL CASE STUDY 4.2

    Business Questions Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 When do we have the most and least closed order last month? order_date count(order.id) status_name == closed order_date How many unique products are sold per day? count() count(product_id) status_name == closed order_date status_name == closed How much is our omzet per day? sum() quantity unit_price order_date
  31. TALEND OS - LAB #4 - ETL CASE STUDY 4.3

    Data Warehouse Table Schema (Dimensional Model) Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  32. TALEND OS - LAB #4 - ETL CASE STUDY 4.4

    ETL Job + Context Variable + Debugging Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  33. TALEND OS - LAB #5 - BUILD & AUTOMATE JOB

    Build job + Run in command line + Override context vars Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 • Case: ◦ We will use external orchestrator (ie: Airflow) to run Talend jobs ◦ The job will be executed every month, immediately after a month period ends • Expected to run in command line: ./etl_mysql_2_run.sh \ --context_param start_period={{execution_date}} \ --context_param end_period=2006-02-31
  34. TALEND OS - LAB #6 - ELT CASE STUDY 6.1

    Extract & load jobs Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  35. TALEND OS - LAB #6 - ELT CASE STUDY 6.2

    Transform Job Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  36. Lab: Pentaho Data Integration Proprietary of IYKRA - Data Fellowship

    Batch 6 - 2021 ETL Introduction Robust ETL Principles Lab: Talend Open Studio Lab: Pentaho Data Integration
  37. PENTAHO DI - LAB PREPARATION Required Tools and Environment Proprietary

    of IYKRA - Data Fellowship Batch 6 - 2021 • Java 8 • Pentaho Data Integration 9.1 • MySQL Database ◦ Installed locally ◦ Database: northwind, dwh ◦ Schema & data: https://github.com/jpwhite3/northwind-MySQL • S3 ◦ Bucket: data-fellowship-6 ◦ AWS Key: <ask your instructor> ◦ AWS Secret: <ask your instructor>
  38. PENTAHO DI - LAB #1 - HELLO WORLD! Write PDI

    version and “Hello World” to log Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  39. PENTAHO DI - LAB #2 - EXTRACTION Data extraction for

    delimited files (CSV) from S3 Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  40. PENTAHO DI - LAB #3 - EXTRACTION Data extraction from

    MySQL and store in S3 as CSV Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  41. PENTAHO DI - LAB #4 - ETL CASE STUDY 4.1

    Source Tables Schema Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  42. PENTAHO DI - LAB #4 - ETL CASE STUDY 4.2

    Business Questions Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 When do we have the most and least closed order last month? order_date count(order.id) status_name == closed order_date How many unique products are sold per day? order_date count(product_id) status_name == closed order_date status_name == closed How much is our omzet per day? sum() quantity unit_price order_date
  43. PENTAHO DI - LAB #4 - ETL CASE STUDY 4.3

    Data Warehouse Table Schema (Dimensional Model) Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  44. PENTAHO DI - LAB #4 - ETL CASE STUDY 4.4

    ETL Transformation + Parameters + Debugging Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  45. PENTAHO DI - LAB #5 - BUILD & AUTOMATE JOB

    Build job + Run in command line + Override parameters Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 • Case: ◦ We will use external orchestrator (ie: Airflow) to run Pentaho jobs ◦ The job will be executed every month, immediately after a month period ends • Expected to run in command line: ./pan.sh \ -file <path-to-file>/lab_04_etl.ktr \ -param:period_start=2006-02-01 \ -param:period_end=2006-02-28
  46. PENTAHO DI - LAB #6 - ELT CASE STUDY ELT

    Job Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
  47. PRACTICE CASE ETL Best Practices Implementation Proprietary of IYKRA -

    Data Fellowship Batch 6 - 2021 • Objective ◦ Have a robust understanding on implementing good ETL principles ◦ Able to implement ETL using a visual tool • Instructions & Scoring ◦ Implement ETL jobs/transformation in Lab #4 using Talend or Pentaho [score: 40] ◦ Add several improvements: ▪ Extract and store data in raw format (CSV) to S3 [score: 20] ▪ Implement data quality check [score: 20] ▪ Keep the ETL process idempotent [score: 10] ▪ Keep the monthly batch period [score: 10] ◦ You may make assumptions, but you need to state them