FOSSASIA Summit 2023 - Modeling data fast, slow and back-in-time: Change-aware dimensional data modeling in the modern data stack

Modeling data fast, slow and back-in-time: Change-aware dimensional data modeling in the modern data stack
Event: FOSSASIA Summit 2023
Date: 15 April 2023
Location: Lifelong Learning Institute, Singapore

Modeling data with versioning enables data teams to track and reproduce history for fast- and slow-changing data over time. Even in the era of read-optimized cloud data warehouses and ELT (Extract-Load-Transform) workflows, there is still value in integrating dimensional data modeling practices with modern functional approaches to design efficient, reproducible data models for analytics use cases that require accurate point-in-time reporting of historical dimensions. In this talk, I explore how we can design history-preserving data models and reproducible data transformation workflows by adopting "change-aware" dimensional modeling practices in the modern data stack.

Ong Chin Hwee

April 15, 2023
Transcript

  1. Modeling Data Fast, Slow and
    Back-In-Time
    Change-aware Dimensional Data Modeling in the
    Modern Data Stack
    By: Chin Hwee Ong (@ongchinhwee)
    15 April 2023
    FOSSASIA Summit 2023, Singapore

  2. About me
    Ong Chin Hwee 王敬惠
    ● Senior Data Engineer @ Grab
    ● Speaker and (occasional) writer on
    data engineering topics
    ● 90% self-taught (+10% from my
    engineering degrees)
    @ongchinhwee

  3. Before we begin, some disclaimers:
    @ongchinhwee
    ● The insights in this talk are based on my own technical
    expertise and do not in any way represent the views of my
    team or employer.
    ● Scope is primarily focused on structured tabular data
    ○ Similar concept with different implementation for unstructured data, but that’s not
    the main focus of this talk

  4. Imagine this scenario in your data team:
    @ongchinhwee
    ● Finance requests a regeneration of a business-critical report
    dated 11 months ago for Client A
    ● Client A was upgraded from Tier 2 to Tier 1 three months ago,
    which increased the rebate earn rate
    ● However, you need to generate the report based on the Tier 2
    earn rate (as Client A was a Tier 2 customer 11 months ago)
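The point-in-time lookup this scenario needs can be sketched in a few lines of Python. The client, dates, and tier history below are hypothetical, purely to illustrate the idea of a validity period:

```python
from datetime import date

# Hypothetical Type-2-style history for Client A: each version of the
# "tier" attribute carries a validity period.
tier_history = [
    {"client": "A", "tier": 2,
     "valid_from": date(2021, 1, 1), "valid_to": date(2023, 1, 15)},
    {"client": "A", "tier": 1,
     "valid_from": date(2023, 1, 15), "valid_to": date(9999, 12, 31)},
]

def tier_as_of(history, client, as_of):
    """Return the tier that was valid for `client` on date `as_of`."""
    for row in history:
        if row["client"] == client and row["valid_from"] <= as_of < row["valid_to"]:
            return row["tier"]
    return None

# A report dated 11 months ago falls inside the Tier 2 validity window,
# even though Client A is Tier 1 today.
print(tier_as_of(tier_history, "A", date(2022, 6, 1)))  # 2
print(tier_as_of(tier_history, "A", date(2023, 4, 1)))  # 1
```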

  5. Imagine this scenario in your data team:
    @ongchinhwee

  6. Time is an important dimension in your data
    @ongchinhwee
    ● Data is usually not static
    ○ State transitions during a business process
    ○ Attributes can change over time e.g. age, income, status

  7. What is data versioning?
    @ongchinhwee
    ● Concept of capturing state changes while keeping track of
    successive versions of data over time
    ● Creates a unique reference for a collection of data when
    changes occur while retaining previous versions
    ● Data versioning approaches:
    ○ Change data capture (CDC) for structured data
    ○ Data version control for unstructured data

  8. What is data versioning?
    @ongchinhwee
    ● Types of data versioning:
    ○ Change data capture (CDC) for structured data
    ○ Data version control for unstructured data

  9. Why does data versioning matter?
    @ongchinhwee
    ● Reproducibility for data governance and audit purposes

  10. Why does data versioning matter?
    @ongchinhwee
    ● Reproducibility for data governance and audit purposes
    ● Build data history with backward and forward compatibility
    ○ What if there is a change in transformation logic that only applies
    to a specific time range?

  11. Why does data versioning matter?
    @ongchinhwee
    ● Reproducibility for data governance and audit purposes
    ● Build data history with backward and forward compatibility
    ○ What if there is a change in transformation logic that only applies
    to a specific time range?
    ● When you need to use point-in-time values to track business
    metrics over time

  12. What is change data capture?
    @ongchinhwee
    ● “Data versioning for databases”
    ● Design patterns for capturing and tracking changes in data
    from upstream source systems over time
    ● Changes are captured in data warehouses / data lakes

  13. Design Patterns for Change Data Capture
    @ongchinhwee
    ● Data versioning based on a combination of:
    ○ Data version identifiers
    ○ Timestamps
    ○ Status indicators
    ● Log trigger / tuple versioning
    ○ Also known as Type 2 Slowly Changing Dimensions (SCDs)
    ● Transaction logs

  14. What do we mean by the “Modern Data Stack”?
    @ongchinhwee
    ● Cloud-based
    ● Built around the cloud data warehouse / lake
    ● Modular and customizable - “choose the best tool for a specific job”

  15. @ongchinhwee

  16. Data Warehousing in the Modern Data Stack
    @ongchinhwee
    ● Cloud-based compute and storage = more scalable

  17. Data Warehousing in the Modern Data Stack
    @ongchinhwee
    ● Cloud-based compute and storage = more scalable
    ● From ETL to ELT - transformation within the data warehouse

  18. Data Warehousing in the Modern Data Stack
    @ongchinhwee
    ● Cloud-based compute and storage = more scalable
    ● From ETL to ELT - transformation within the data warehouse
    ● Possible to store “snapshots” of data in a cloud data
    warehouse to capture historical changes

  19. Change data capture in the Modern Data Stack
    ● Some implementations of change data capture
    ○ Traditional (Kimball’s) Dimensional Modelling techniques
    ■ Slowly Changing Dimensions (SCDs)
    @ongchinhwee

  20. Change data capture in the Modern Data Stack
    ● Some implementations of change data capture
    ○ Traditional (Kimball’s) Dimensional Modelling techniques
    ■ Slowly Changing Dimensions (SCDs)
    ○ Modern approaches
    ■ Data snapshots
    ■ Incremental models
    @ongchinhwee

  21. Traditional (Kimball-style) Dimensional Data Modelling
    ● Developed by Kimball in 1996; updated in 2013 during the
    emergence of cloud data warehouses
    ● Introduced the concept of facts vs dimensions (star schema)
    ● Designed for storage and compute efficiency
    @ongchinhwee

  22. Traditional (Kimball-style) Dimensional Data Modelling
    ● Fundamental Concepts: Facts vs Dimensions
    ○ Fact Tables
    ■ Contain metrics and facts about a business process e.g. process
    time, transaction amount
    ■ Are typically fast-evolving during a business process event
    ■ Eventually reach a final state at a point in time upon completion
    @ongchinhwee

  23. Traditional (Kimball-style) Dimensional Data Modelling
    ● Fundamental Concepts: Facts vs Dimensions
    ○ Dimension Tables
    ■ Describe the attributes of a business process e.g. customer details
    ■ Are typically slow-changing and updated over a longer period of time
    ■ Do not have a “final state”
    @ongchinhwee

  24. Traditional (Kimball-style) Dimensional Data Modelling
    ● Fundamental Concepts: Facts vs Dimensions
    ○ Dimension Tables
    ■ Describe the attributes of a business process e.g. customer details
    ■ Are typically slow-changing and updated over a longer period of time
    ■ Do not have a “final state”
    → problem of Slowly Changing Dimensions (SCDs)
    @ongchinhwee

  25. What are Slowly Changing Dimensions (SCDs)?
    ● Change tracking techniques to handle state changes in
    dimensions
    @ongchinhwee
    SCD Type | How state changes are handled
    ---------|------------------------------
    0        | Ignore changes
    1        | Overwrite
    2        | Row versioning (w/ “validity period”)
    3        | Previous value column
    4        | History table (“mini-dimension”)

  26. What are Slowly Changing Dimensions (SCDs)?
    ● Type 0 Slowly Changing Dimensions (Type 0 SCD)
    ○ Fixed dimension e.g. account opening date
    ○ Ignores any changes
    ○ Assumes that the attribute will not change forever
    @ongchinhwee

  27. What are Slowly Changing Dimensions (SCDs)?
    ● Type 1 Slowly Changing Dimensions (Type 1 SCD)
    ○ Reflects the latest version of dimension attributes
    ○ Previous version of the value in the dimension row is overwritten
    with new value
    ○ Destroys history - not possible to rewind back to previous versions
    of the data
    @ongchinhwee

  28. What are Slowly Changing Dimensions (SCDs)?
    ● Type 2 Slowly Changing Dimensions (Type 2 SCD)
    ○ Implements row versioning for each dimension attribute
    ○ Concept of “validity period” for each version of the data
    ■ Row effective date / timestamp
    ■ Row expiration date / timestamp
    ■ Current row indicator
    @ongchinhwee

  29. What are Slowly Changing Dimensions (SCDs)?
    ● Type 2 Slowly Changing Dimensions (Type 2 SCD)
    ○ When a change is detected in a data record:
    ■ A new dimension row is added with updated attribute values for the
    data record
    @ongchinhwee

  30. What are Slowly Changing Dimensions (SCDs)?
    ● Type 2 Slowly Changing Dimensions (Type 2 SCD)
    ○ When a change is detected in a data record:
    ■ A new dimension row is added with updated attribute values for the
    data record
    ■ A new surrogate key is assigned to the new dimension row
    @ongchinhwee

  31. What are Slowly Changing Dimensions (SCDs)?
    ● Type 2 Slowly Changing Dimensions (Type 2 SCD)
    ○ When a change is detected in a data record:
    ■ A new dimension row is added with updated attribute values for the
    data record
    ■ A new surrogate key is assigned to the new dimension row
    ■ Previous version of the attribute is updated with row expiration
    timestamp
    @ongchinhwee
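A minimal sketch in Python of the three steps above, using an in-memory list of dicts as the dimension table (column names like `valid_from` and `is_current` are illustrative, not prescribed):

```python
def apply_scd2_change(dim_rows, natural_key, new_attrs, change_ts, new_surrogate_key):
    """Apply one detected change to a Type 2 SCD table:
    expire the current version, then append a new versioned row."""
    for row in dim_rows:
        if row["natural_key"] == natural_key and row["is_current"]:
            row["valid_to"] = change_ts      # row expiration timestamp
            row["is_current"] = False
    dim_rows.append({
        "surrogate_key": new_surrogate_key,  # new surrogate key per version
        "natural_key": natural_key,
        **new_attrs,
        "valid_from": change_ts,
        "valid_to": None,                    # open-ended until the next change
        "is_current": True,
    })

# Client A is upgraded from Tier 2 to Tier 1:
dim_client = [{"surrogate_key": 1, "natural_key": "A", "tier": 2,
               "valid_from": "2021-01-01", "valid_to": None, "is_current": True}]
apply_scd2_change(dim_client, "A", {"tier": 1}, "2023-01-15", new_surrogate_key=2)
# dim_client now holds both versions: the expired Tier 2 row and the
# current Tier 1 row.
```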

  32. Type 2 SCD: What is tuple versioning?
    ● Change data capture mechanism that records changes to a mutable
    upstream table over time
    ● Implements Type-2 Slowly Changing Dimensions on mutable table
    sources
    ● Detects changes based on an `updated_at` timestamp
    @ongchinhwee

  33. Components of tuple versioning
    ● Target schema
    ● Column for timestamp-based tracking
    ● Primary key / unique identifier(s) for record
    ● Whether to invalidate records no longer in source
    @ongchinhwee
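Given the components above, change detection in tuple versioning reduces to comparing each source row's `updated_at` against the latest version already captured. A rough Python sketch with dict-based rows (the column names are hypothetical):

```python
def detect_changed_records(source_rows, target_rows, key="id", ts_col="updated_at"):
    """Return source rows that are new, or whose `updated_at` is later
    than the latest version already captured in the target."""
    latest_seen = {}
    for row in target_rows:
        k = row[key]
        if k not in latest_seen or row[ts_col] > latest_seen[k]:
            latest_seen[k] = row[ts_col]
    return [r for r in source_rows
            if r[key] not in latest_seen or r[ts_col] > latest_seen[r[key]]]

source = [{"id": 1, "updated_at": "2023-02-01"},   # updated since last capture
          {"id": 2, "updated_at": "2023-01-01"}]   # brand-new record
target = [{"id": 1, "updated_at": "2023-01-01"}]
print(detect_changed_records(source, target))  # both rows are returned
```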

  34. How does tuple versioning work?
    @ongchinhwee

  35. How does tuple versioning work?
    @ongchinhwee

  36. What are Slowly Changing Dimensions (SCDs)?
    ● Slowly Changing Dimensions (SCDs)
    ○ Type 3, Type 4 etc.
    ■ Less commonly used due to additional complexity without
    significant performance benefits in cloud data warehouses
    ● Type 3: add new column to store previous attribute value
    ● Type 4: add history table to keep record of state changes
    More information on Slowly Changing Dimension Techniques: Chapter 5 of The Data Warehouse Toolkit:
    The Definitive Guide to Dimensional Modeling, 3rd Edition by Ralph Kimball and Margy Ross
    @ongchinhwee

  37. Data Snapshots
    ● Read-only (immutable) copies of the state of a data source at a
    particular point in time
    ● Stored in the staging area of a data warehouse
    @ongchinhwee

  38. @ongchinhwee

  39. Incremental models
    ● Limiting the data transformation to a specified subset of
    source data
    ○ Usually on rows in source data that have been created or updated
    since the last scheduled run
    ● Significantly optimises runtime of transformations on large data +
    reduces compute cost
    @ongchinhwee
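The filtering idea can be sketched in plain Python; in dbt, the same idea is typically expressed as a `where` filter guarded by the `is_incremental()` macro. Row shapes and timestamps below are hypothetical:

```python
def incremental_rows(source_rows, last_run_ts, ts_col="updated_at"):
    """Limit the transformation to rows created or updated since the
    last scheduled run -- the core idea of an incremental model."""
    return [r for r in source_rows if r[ts_col] > last_run_ts]

source = [{"id": 1, "updated_at": "2023-01-01"},
          {"id": 2, "updated_at": "2023-03-01"}]
# Only the row touched after the last run is re-processed.
print(incremental_rows(source, last_run_ts="2023-02-01"))  # row with id 2 only
```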

  40. Incremental models
    @ongchinhwee

  41. @ongchinhwee
    Anatomy of incremental
    materialization on dbt
    Adapted from: dbt-dummy project
    (https://github.com/gmyrianthous/dbt-dummy/)
    ● Condition for incremental load
    ● Where to get the load from?
    ● Where to insert incremental load?
    ● How to insert/update incremental load?

  42. @ongchinhwee
    Anatomy of incremental materialization
    with Type-2 SCD model as source
    ● Condition for incremental load
    ● Where to get the load from?
    ● Where to insert incremental load?
    ● How to insert/update incremental load?

  43. Anatomy of incremental models
    ● Does destination table already exist in data warehouse / lake?
    @ongchinhwee

  44. Anatomy of incremental models
    ● Does destination table already exist in data warehouse / lake?
    ● Full-refresh or incremental run?
    @ongchinhwee

  45. Anatomy of incremental models
    ● Does destination table already exist in data warehouse / lake?
    ● Full-refresh or incremental run?
    ● Incremental strategy
    ○ Data warehouse or data lake?
    @ongchinhwee

  46. Anatomy of incremental models
    ● Does destination table already exist in data warehouse / lake?
    ● Full-refresh or incremental run?
    ● Incremental strategy
    ○ Data warehouse or data lake?
    ● Columns to track changes
    @ongchinhwee

  47. Anatomy of incremental models
    ● Does destination table already exist in data warehouse / lake?
    ● Full-refresh or incremental run?
    ● Incremental strategy
    ○ Data warehouse or data lake?
    ● Columns to track changes
    ● Query filter for incremental run
    @ongchinhwee

  48. Is Kimball’s approach still relevant?
    (especially in the Modern Data Stack)
    @ongchinhwee

  49. Data Warehousing in the Modern Data Stack
    @ongchinhwee
    “Since storage and compute are dirt cheap, engineering time is
    expensive, why not snapshot all your data (and append new
    partitions for each ETL schedule)?”
    ● Does not apply to very large dimensions
    ● Does not obviate the need for dimensional data modelling
    Related reading: Functional Data Engineering - a modern paradigm for batch data processing (and related
    talks) by Maxime Beauchemin, creator of Airflow and Superset

  50. Is Kimball’s Dimensional Data Modelling still relevant?
    ● Yes, in some ways.
    ● Dimensional data modelling is still needed for these use cases:
    ○ Aggregation of facts
    ○ Metrics drill-down based on dimensions
    ■ e.g. how many customers in Singapore spent > US$400 per month
    ○ Financial reporting and audit
    @ongchinhwee

  51. Tips and tricks on change-aware data modelling
    @ongchinhwee

  52. Snapshot all your upstream source data!
    ● What if assumptions about your data change?
    ○ Source data update mechanism (append-only vs overwrite)
    ○ Source table schema (columns added/altered/dropped)
    ○ Business logic
    ● Store data snapshots in staging area
    ○ Build SCDs from data snapshots
    @ongchinhwee
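Building SCDs from snapshots can be sketched as a fold over successive snapshots: open a new version whenever a record's attributes differ from its latest captured version. A simplified Python sketch (keys and dates are hypothetical):

```python
def scd2_from_snapshots(snapshots, key="id"):
    """Fold (snapshot_date, rows) pairs into Type 2 history per record:
    a new version opens whenever attributes change between snapshots."""
    history = {}  # record key -> list of versions
    for snap_date, rows in snapshots:
        for row in rows:
            versions = history.setdefault(row[key], [])
            attrs = {k: v for k, v in row.items() if k != key}
            if not versions or versions[-1]["attrs"] != attrs:
                if versions:
                    versions[-1]["valid_to"] = snap_date  # expire old version
                versions.append({"attrs": attrs,
                                 "valid_from": snap_date, "valid_to": None})
    return history

snaps = [("2023-01-01", [{"id": 1, "tier": 2}]),
         ("2023-02-01", [{"id": 1, "tier": 2}]),   # unchanged: no new version
         ("2023-03-01", [{"id": 1, "tier": 1}])]   # changed: expire + append
history = scd2_from_snapshots(snaps)
# history[1] holds two versions: Tier 2 (expired) and Tier 1 (current).
```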

  53. Source-target schema reconciliation
    Detect schema diffs between the upstream
    source system and the data warehouse
    ● Useful for detecting
    schema changes
    @ongchinhwee

  54. Use Type 2 SCDs + incremental model (in most cases)
    ● In most cases, Type 2 SCDs are sufficient for tracking and
    capturing data changes
    ● Incremental models may be slightly more complex to build, but pay
    off in efficiency + cost when data scales quickly and changes (quickly
    and slowly) over time
    @ongchinhwee

  55. Strategies on designing incremental models
    ● Design with upstream data update mechanism in mind
    ○ Affects how the incremental load is applied to the target model
    ● Incremental strategy depends on data warehouse / lake
    ○ Specify a unique key (for data warehouse) or partitioning key
    (for data lake) to update existing records
    ● Performance tip: Filter rows early!
    @ongchinhwee

  56. Key Takeaways
    ● Dimensional data modeling techniques are still relevant in the
    Modern Data Stack
    ● Adopt a mixture of SCD-like approaches and incremental models
    when designing change-aware data models for non-retroactive
    business logic
    ● Data snapshots are useful in cases where re-modelling is
    required due to changes in business logic
    @ongchinhwee

  57. Reach out to me!
    : ongchinhwee
    : [email protected]
    : @ongchinhwee
    : hweecat
    : https://ongchinhwee.me
    And get those slides here:
    http://bit.ly/fossasia-change-aware-data
    @ongchinhwee