Upgrade to Pro — share decks privately, control downloads, hide ads and more …

FOSSASIA Summit 2023 - Modeling data fast, slow...

FOSSASIA Summit 2023 - Modeling data fast, slow and back-in-time: Change-aware dimensional data modeling in the modern data stack

Modeling data fast, slow and back-in-time: Change-aware dimensional data modeling in the modern data stack
Event: FOSSASIA Summit 2023
Date: 15 April 2023
Location: Lifelong Learning Institute, Singapore

Modeling data with versioning enables data teams to track and reproduce history for fast- and slow-changing data over time. In the era of read-optimized cloud data warehouses and ELT (Extract-Load-Transform) workflows, there is still value in integrating dimensional data modeling practices with modern functional approaches in the modern data stack to design efficient and reproducible data models for analytics reporting use cases that require accurate reporting of point-in-time historical dimensions. In this talk, I will be exploring how we can design history-preserving data models and reproducible data transformation workflows by adopting "change-aware" dimensional modeling practices in the modern data stack.

Ong Chin Hwee

April 15, 2023
Tweet

More Decks by Ong Chin Hwee

Other Decks in Programming

Transcript

  1. Modeling Data Fast, Slow and Back-In-Time Change-aware Dimensional Data Modeling

    in the Modern Data Stack By: Chin Hwee Ong (@ongchinhwee) 15 April 2023 FOSSASIA Summit 2023, Singapore
  2. About me Ong Chin Hwee 王敬惠 • Senior Data Engineer

    @ Grab • Speaker and (occasional) writer on data engineering topics • 90% self-taught (+10% from my engineering degrees) @ongchinhwee
  3. Before we begin, some disclaimers: @ongchinhwee • The insights in

    this talk are based on my own technical expertise and does not in any way represent the views of my team and employer. • Scope is primarily focused on structured tabular data ◦ Similar concept with different implementation for unstructured data, but that’s not the main focus of this talk
  4. Imagine this scenario in your data team: @ongchinhwee • Finance

    requests a regeneration of a business-critical report dated 11 months ago for Client A • Client A has been upgraded from Tier 2 to Tier 1 3 months ago, which increases the rebate earn rate • However, you need to generate the report based on the Tier 2 earn rate (as Client A was a Tier 2 customer 11 months ago)
  5. Time is an important dimension in your data @ongchinhwee •

    Data is usually not static ◦ State transitions during a business process ◦ Attributes can change over time e.g. age, income, status
  6. What is data versioning? @ongchinhwee • Concept of capturing state

    changes while keeping track of successive versions of data over time • Creates a unique reference for a collection of data when changes occur while retaining previous versions • Data versioning approaches: ◦ Change data capture (CDC) for structured data ◦ Data version control for unstructured data
  7. What is data versioning? @ongchinhwee • Types of data versioning:

    ◦ Change data capture (CDC) for structured data ◦ Data version control for unstructured data
  8. Why does data versioning matter? @ongchinhwee • Reproducibility for data

    governance and audit purposes • Build data history with backward and forward compatibility ◦ What if there is a change in transformation logic that only applies to a specific time range?
  9. Why does data versioning matter? @ongchinhwee • Reproducibility for data

    governance and audit purposes • Build data history with backward and forward compatibility ◦ What if there is a change in transformation logic that only applies to a specific time range? • When you need to use point-in-time values to track business metrics over time
  10. What is change data capture? @ongchinhwee • “Data versioning for

    databases” • Design patterns for capturing and tracking changes in data from upstream source systems over time • Changes are captured in data warehouses / data lakes
  11. Design Patterns for Change Data Capture @ongchinhwee • Data versioning

    based on a combination of: ◦ Data version identifiers ◦ Timestamps ◦ Status indicators • Log trigger / tuple versioning ◦ Also known as Type 2 Slowly Changing Dimensions (SCDs) • Transaction logs
  12. What do we mean by the “Modern Data Stack”? @ongchinhwee

    • Cloud-based • Built around the cloud data warehouse / lake • Modular and customizable - “choose the best tool for a specific job”
  13. Data Warehousing in the Modern Data Stack @ongchinhwee • Cloud-based

    compute and storage = more scalable • From ETL to ELT - transformation within the data warehouse
  14. Data Warehousing in the Modern Data Stack @ongchinhwee • Cloud-based

    compute and storage = more scalable • From ETL to ELT - transformation within the data warehouse • Possible to store “snapshots” of data in a cloud data warehouse to capture historical changes
  15. Change data capture in the Modern Data Stack • Some

    implementations of change data capture ◦ Traditional (Kimball’s) Dimensional Modelling techniques ▪ Slowly Changing Dimensions (SCDs) @ongchinhwee
  16. Change data capture in the Modern Data Stack • Some

    implementations of change data capture ◦ Traditional (Kimball’s) Dimensional Modelling techniques ▪ Slowly Changing Dimensions (SCDs) ◦ Modern approaches ▪ Data snapshots ▪ Incremental models @ongchinhwee
  17. Traditional (Kimball-style) Dimensional Data Modelling • Developed by Kimball in

    1996; updated in 2013 during the emergence of cloud data warehouses • Introduced the concept of facts vs dimensions (star schema) • Designed for storage and compute efficiency @ongchinhwee
  18. Traditional (Kimball-style) Dimensional Data Modelling • Fundamental Concepts: Facts vs

    Dimensions ◦ Fact Tables ▪ Contains metrics and facts about a business process e.g. process time, transaction amount ▪ Are typically fast-evolving during a business process event ▪ Eventually reaches a final state at a point in time upon completion @ongchinhwee
  19. Traditional (Kimball-style) Dimensional Data Modelling • Fundamental Concepts: Facts vs

    Dimensions ◦ Dimension Tables ▪ Describes the attributes of a business process e.g. customer details ▪ Are typically slow-changing and updated over a longer period of time ▪ Does not have a “final state” @ongchinhwee
  20. Traditional (Kimball-style) Dimensional Data Modelling • Fundamental Concepts: Facts vs

    Dimensions ◦ Dimension Tables ▪ Describes the attributes of a business process e.g. customer details ▪ Are typically slow-changing and updated over a longer period of time ▪ Does not have a “final state” → problem of Slowly Changing Dimensions (SCDs) @ongchinhwee
  21. What are Slowly Changing Dimensions (SCDs)? • Change tracking techniques

    to handle state changes in dimensions @ongchinhwee SCD Type How state changes are handled 0 Ignore changes 1 Overwrite 2 Row versioning (w/ “validity period”) 3 Previous value column 4 History table (“mini-dimension”)
  22. What are Slowly Changing Dimensions (SCDs)? • Type 0 Slowly

    Changing Dimensions (Type 0 SCD) ◦ Fixed dimension e.g. account opening date ◦ Ignores any changes ◦ Assumes that the attribute will not change forever @ongchinhwee
  23. What are Slowly Changing Dimensions (SCDs)? • Type 1 Slowly

    Changing Dimensions (Type 1 SCD) ◦ Reflects the latest version of dimension attributes ◦ Previous version of the value in the dimension row is overwritten with new value ◦ Destroys history - not possible to rewind back to previous versions of the data @ongchinhwee
  24. What is Slowly Changing Dimensions (SCDs)? • Type 2 Slowly

    Changing Dimensions (Type 2 SCD) ◦ Implements row versioning for each dimension attribute ◦ Concept of “validity period” for each version of the data ▪ Row effective date / timestamp ▪ Row expiration date / timestamp ▪ Current row indicator @ongchinhwee
  25. What is Slowly Changing Dimensions (SCDs)? • Type 2 Slowly

    Changing Dimensions (Type 2 SCD) ◦ When a change is detected in a data record: ▪ A new dimension row is added with updated attribute values for the data record @ongchinhwee
  26. What is Slowly Changing Dimensions (SCDs)? • Type 2 Slowly

    Changing Dimensions (Type 2 SCD) ◦ When a change is detected in a data record: ▪ A new dimension row is added with updated attribute values for the data record ▪ New primary surrogate key is assigned to new dimension row @ongchinhwee
  27. What is Slowly Changing Dimensions (SCDs)? • Type 2 Slowly

    Changing Dimensions (Type 2 SCD) ◦ When a change is detected in a data record: ▪ A new dimension row is added with updated attribute values for the data record ▪ New primary surrogate key is assigned to new dimension row ▪ Previous version of the attribute is updated with row expiration timestamp @ongchinhwee
  28. Type 2 SCD: What is tuple versioning? • Change data

    capture mechanism that records changes to a mutable upstream table over time • Implements Type-2 Slowly Changing Dimensions on mutable table sources • Detects changes based on an `updated_at` timestamp @ongchinhwee
  29. Components of tuple versioning • Target schema • Column for

    timestamp-based tracking • Primary key / unique identifier(s) for record • Whether to invalidate records no longer in source @ongchinhwee
  30. What is Slowly Changing Dimensions (SCDs)? • Slowly Changing Dimensions

    (SCDs) ◦ Type 3, Type 4 etc. ▪ Less commonly used due to additional complexity without significant performance benefits in cloud data warehouses • Type 3: add new column to store previous attribute value • Type 4: add history table to keep record of state changes More information on Slowly Changing Dimension Techniques: Chapter 5 of The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition by Ralph Kimball and Margy Ross @ongchinhwee
  31. Data Snapshots • Read-only (immutable) copies of the state of

    a data source at a particular point in time • Stored at the staging area of a data warehouse @ongchinhwee
  32. Incremental models • Limiting the data transformation to a specified

    subset of source data ◦ Usually on rows in source data that have been created or updated since the last scheduled run • Significantly optimises runtime of transformations on large data + reduces compute cost @ongchinhwee
  33. @ongchinhwee Anatomy of incremental materialization on dbt Adapted from: dbt-dummy

    project (https://github.com/gmyrianthous/dbt-dummy/) Condition for incremental load Where to get the load from? Where to insert incremental load? How to insert/update incremental load?
  34. @ongchinhwee Anatomy of incremental materialization with Type-2 SCD model as

    source Condition for incremental load Where to get the load from? Where to insert incremental load? How to insert/update incremental load?
  35. Anatomy of incremental models • Does destination table already exist

    in data warehouse / lake? • Full-refresh or incremental run? @ongchinhwee
  36. Anatomy of incremental models • Does destination table already exist

    in data warehouse / lake? • Full-refresh or incremental run? • Incremental strategy ◦ Data warehouse or data lake? @ongchinhwee
  37. Anatomy of incremental models • Does destination table already exist

    in data warehouse / lake? • Full-refresh or incremental run? • Incremental strategy ◦ Data warehouse or data lake? • Columns to track changes @ongchinhwee
  38. Anatomy of incremental models • Does destination table already exist

    in data warehouse / lake? • Full-refresh or incremental run? • Incremental strategy ◦ Data warehouse or data lake? • Columns to track changes • Query filter for incremental run @ongchinhwee
  39. Data Warehousing in the Modern Data Stack @ongchinhwee “Since storage

    and compute are dirt cheap, engineering time is expensive, why not snapshot all your data (and append new partitions for each ETL schedule)?” • Does not apply for very large dimensions • Does not preclude the importance of dimensional data modelling Related reading: Functional Data Enginering - a modern paradigm for batch data processing (and related talks) by Maxime Beauchemin, creator of Airflow and Superset
  40. Is Kimball’s Dimensional Data Modelling still relevant? • Yes, in

    some ways. • Dimensional data modelling is still needed for these use cases: ◦ Aggregation of facts ◦ Metrics drill-down based on dimensions ▪ e.g. how many customers in Singapore spent > US$400 per month ◦ Financial reporting and audit @ongchinhwee
  41. Snapshot all your upstream source data! • What if assumptions

    about your data change? ◦ Source data update mechanism (append-only vs overwrite) ◦ Source table schema (columns added/altered/dropped) ◦ Business logic • Store data snapshots in staging area ◦ Build SCDs from data snapshots @ongchinhwee
  42. Source-target schema reconciliation Detect schema diffs between upstream source system

    vs data warehouse • Useful for detecting schema changes @ongchinhwee
  43. Use Type 2 SCDs + incremental model (in most cases)

    • In most cases, Type 2 SCDs are sufficient for tracking and capturing data changes • Incremental models may be slightly complex, but pays off in efficiency + cost when data scales quickly and changes (quickly and slowly) over time @ongchinhwee
  44. Strategies on designing incremental models • Design with upstream data

    update mechanism in mind ◦ Affects how incremental load is loaded onto target model • Incremental strategy depends on data warehouse / lake ◦ Specify a unique key (for data warehouse) or partitioning key (for data lake) to update existing records • Performance tip: Filter rows early! @ongchinhwee
  45. Key Takeaways • Dimensional data modeling techniques are still relevant

    in the Modern Data Stack • Adopt a mixture of SCD-like approaches and incremental models when designing change-aware data models for non-retroactive business logic • Data snapshots are useful in cases of where re-modelling is required due to changes in business logic @ongchinhwee
  46. Reach out to me! : ongchinhwee : [email protected] : @ongchinhwee

    : hweecat : https://ongchinhwee.me And get those slides here: http://bit.ly/fossasia-change-aware-data @ongchinhwee