query the database? Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 • Slow down the source system • Transactional system is not optimized for reporting and analysis
of IYKRA - Data Fellowship Batch 6 - 2021 • Data type • Update frequency • Data quality (completeness, validity, accuracy, etc) • Data owner • Policy to access the data • Data volume • Table indexing • Availability of timestamp field • Schema changes frequency
Data Fellowship Batch 6 - 2021 • Column mapping • Assigning data type • Select attributes • Translation/standardization of value • Data validation • Splitting columns • Join/lookup from multiple source • Calculation using existing fields • etc
of IYKRA - Data Fellowship Batch 6 - 2021 • Potential Problems with Data from a Single Source ◦ Uniqueness ◦ Redundancy/Duplicates ◦ Outside domain range ◦ Data entry errors ◦ Referential integrity • Potential Problems with Data from Multiple Sources ◦ Naming conflicts at the schema level — using the same name for different things or using a different name for the same things ◦ Naming conflicts at the data level ◦ Inconsistent aggregating ◦ Inconsistent timing
Knowledge transfer ◦ Changes in business requirement/logic • ETL job/code resiliency to failure ◦ Rapid increase in data volume ◦ Circuit break when failure ◦ Backfilling ◦ Data loss ETL - TRANSFORM Typical problems related to data processing Proprietary of IYKRA - Data Fellowship Batch 6 - 2021
Data Fellowship Batch 6 - 2021 • Data lake ◦ Common underlying storage: HDFS, Object storage • Data warehouse ◦ Traditional data warehouse ◦ Cloud data warehouse (Massive Parallel Processing): BigQuery, Snowflake, Redshift
Data Fellowship Batch 6 - 2021 #1 Extract and store data in raw format #2 Create idempotent and deterministic processes #3 Rest data between tasks #4 Validate data after each step
format Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 Why? • Prevent losing data because of transformation failure • Enable to reprocess data when there’s changes in business rules How? • Don’t do any transformation before storing it for the first time • Use format that have minimum metadata definition (i.e: CSV, JSON) • Don’t use format like Parquet for ingestion
of IYKRA - Data Fellowship Batch 6 - 2021 Definition? • Idempotent: Run multiple times without changing the result • Deterministic: The results of a transformations only depends on determined input parameters ◦ i.e: the result should NOT depend on the timing of it runs Why? • Reproducible ◦ When something breaks, just fix the underlying issue, then reruns the process without worrying about data consistency ◦ When schema or business logic changes, just rerun the process • Easier to debug • Enable to create dev/staging environment that mirrors production
Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 How? • Don’t append, overwrite the partition instead • Don’t alter data, write new one instead • Don’t produce side effect, treat your process as a function • Define all factors that influence the result as input parameters ◦ i.e.: Don't use date.today(), use parameter instead
IYKRA - Data Fellowship Batch 6 - 2021 Why? • Enable data pipeline/ETL to run in cluster, means this is a good foundation to scaling How? • Write result of a task to the storage • Next task will read from the storage
Business Questions Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 When do we have the most and least closed order last month? order_date count(order.id) status_name == closed order_date How many unique products are sold per day? count() count(product_id) status_name == closed order_date status_name == closed How much is our omzet per day? sum() quantity unit_price order_date
Build job + Run in command line + Override context vars Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 • Case: ◦ We will use external orchestrator (ie: Airflow) to run Talend jobs ◦ The job will be executed every month, immediately after a month period ends • Expected to run in command line: ./etl_mysql_2_run.sh \ --context_param start_period={{execution_date}} \ --context_param end_period=2006-02-31
Business Questions Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 When do we have the most and least closed order last month? order_date count(order.id) status_name == closed order_date How many unique products are sold per day? order_date count(product_id) status_name == closed order_date status_name == closed How much is our omzet per day? sum() quantity unit_price order_date
Build job + Run in command line + Override parameters Proprietary of IYKRA - Data Fellowship Batch 6 - 2021 • Case: ◦ We will use external orchestrator (ie: Airflow) to run Pentaho jobs ◦ The job will be executed every month, immediately after a month period ends • Expected to run in command line: ./pan.sh \ -file <path-to-file>/lab_04_etl.ktr \ -param:period_start=2006-02-01 \ -param:period_end=2006-02-28
Data Fellowship Batch 6 - 2021 • Objective ◦ Have a robust understanding on implementing good ETL principles ◦ Able to implement ETL using a visual tool • Instructions & Scoring ◦ Implement ETL jobs/transformation in Lab #4 using Talend or Pentaho [score: 40] ◦ Add several improvements: ▪ Extract and store data in raw format (CSV) to S3 [score: 20] ▪ Implement data quality check [score: 20] ▪ Keep the ETL process idempotent [score: 10] ▪ Keep the monthly batch period [score: 10] ◦ You may make assumptions, but you need to state them