FOSSASIA Summit 2023 - Modeling data fast, slow and back-in-time: Change-aware dimensional data modeling in the modern data stack

Modeling Data Fast, Slow and Back-In-Time Change-aware Dimensional Data Modeling
in the Modern Data Stack By: Chin Hwee Ong (@ongchinhwee) 15 April 2023 FOSSASIA Summit 2023, Singapore

About me Ong Chin Hwee 王敬惠 • Senior Data Engineer
@ Grab • Speaker and (occasional) writer on data engineering topics • 90% self-taught (+10% from my engineering degrees) @ongchinhwee

Before we begin, some disclaimers: @ongchinhwee • The insights in
this talk are based on my own technical expertise and does not in any way represent the views of my team and employer. • Scope is primarily focused on structured tabular data ◦ Similar concept with different implementation for unstructured data, but that’s not the main focus of this talk

Imagine this scenario in your data team: @ongchinhwee • Finance
requests a regeneration of a business-critical report dated 11 months ago for Client A • Client A has been upgraded from Tier 2 to Tier 1 3 months ago, which increases the rebate earn rate • However, you need to generate the report based on the Tier 2 earn rate (as Client A was a Tier 2 customer 11 months ago)

Imagine this scenario in your data team: @ongchinhwee

Time is an important dimension in your data @ongchinhwee •
Data is usually not static ◦ State transitions during a business process ◦ Attributes can change over time e.g. age, income, status

What is data versioning? @ongchinhwee • Concept of capturing state
changes while keeping track of successive versions of data over time • Creates a unique reference for a collection of data when changes occur while retaining previous versions • Data versioning approaches: ◦ Change data capture (CDC) for structured data ◦ Data version control for unstructured data

What is data versioning? @ongchinhwee • Types of data versioning:
◦ Change data capture (CDC) for structured data ◦ Data version control for unstructured data

Why does data versioning matter? @ongchinhwee • Reproducibility for data
governance and audit purposes

governance and audit purposes • Build data history with backward and forward compatibility ◦ What if there is a change in transformation logic that only applies to a speciﬁc time range?

governance and audit purposes • Build data history with backward and forward compatibility ◦ What if there is a change in transformation logic that only applies to a speciﬁc time range? • When you need to use point-in-time values to track business metrics over time

What is change data capture? @ongchinhwee • “Data versioning for
databases” • Design patterns for capturing and tracking changes in data from upstream source systems over time • Changes are captured in data warehouses / data lakes

Design Patterns for Change Data Capture @ongchinhwee • Data versioning
based on a combination of: ◦ Data version identiﬁers ◦ Timestamps ◦ Status indicators • Log trigger / tuple versioning ◦ Also known as Type 2 Slowly Changing Dimensions (SCDs) • Transaction logs

What do we mean by the “Modern Data Stack”? @ongchinhwee
• Cloud-based • Built around the cloud data warehouse / lake • Modular and customizable - “choose the best tool for a speciﬁc job”

@ongchinhwee

Data Warehousing in the Modern Data Stack @ongchinhwee • Cloud-based
compute and storage = more scalable

compute and storage = more scalable • From ETL to ELT - transformation within the data warehouse

compute and storage = more scalable • From ETL to ELT - transformation within the data warehouse • Possible to store “snapshots” of data in a cloud data warehouse to capture historical changes

Change data capture in the Modern Data Stack • Some
implementations of change data capture ◦ Traditional (Kimball’s) Dimensional Modelling techniques ▪ Slowly Changing Dimensions (SCDs) @ongchinhwee

Change data capture in the Modern Data Stack • Some
implementations of change data capture ◦ Traditional (Kimball’s) Dimensional Modelling techniques ▪ Slowly Changing Dimensions (SCDs) ◦ Modern approaches ▪ Data snapshots ▪ Incremental models @ongchinhwee

Traditional (Kimball-style) Dimensional Data Modelling • Developed by Kimball in
1996; updated in 2013 during the emergence of cloud data warehouses • Introduced the concept of facts vs dimensions (star schema) • Designed for storage and compute eﬃciency @ongchinhwee

Traditional (Kimball-style) Dimensional Data Modelling • Fundamental Concepts: Facts vs
Dimensions ◦ Fact Tables ▪ Contains metrics and facts about a business process e.g. process time, transaction amount ▪ Are typically fast-evolving during a business process event ▪ Eventually reaches a ﬁnal state at a point in time upon completion @ongchinhwee

Dimensions ◦ Dimension Tables ▪ Describes the attributes of a business process e.g. customer details ▪ Are typically slow-changing and updated over a longer period of time ▪ Does not have a “ﬁnal state” @ongchinhwee

Dimensions ◦ Dimension Tables ▪ Describes the attributes of a business process e.g. customer details ▪ Are typically slow-changing and updated over a longer period of time ▪ Does not have a “ﬁnal state” → problem of Slowly Changing Dimensions (SCDs) @ongchinhwee

What are Slowly Changing Dimensions (SCDs)? • Change tracking techniques
to handle state changes in dimensions @ongchinhwee SCD Type How state changes are handled 0 Ignore changes 1 Overwrite 2 Row versioning (w/ “validity period”) 3 Previous value column 4 History table (“mini-dimension”)

What are Slowly Changing Dimensions (SCDs)? • Type 0 Slowly
Changing Dimensions (Type 0 SCD) ◦ Fixed dimension e.g. account opening date ◦ Ignores any changes ◦ Assumes that the attribute will not change forever @ongchinhwee

What are Slowly Changing Dimensions (SCDs)? • Type 1 Slowly
Changing Dimensions (Type 1 SCD) ◦ Reﬂects the latest version of dimension attributes ◦ Previous version of the value in the dimension row is overwritten with new value ◦ Destroys history - not possible to rewind back to previous versions of the data @ongchinhwee

What is Slowly Changing Dimensions (SCDs)? • Type 2 Slowly
Changing Dimensions (Type 2 SCD) ◦ Implements row versioning for each dimension attribute ◦ Concept of “validity period” for each version of the data ▪ Row effective date / timestamp ▪ Row expiration date / timestamp ▪ Current row indicator @ongchinhwee

Changing Dimensions (Type 2 SCD) ◦ When a change is detected in a data record: ▪ A new dimension row is added with updated attribute values for the data record @ongchinhwee

Changing Dimensions (Type 2 SCD) ◦ When a change is detected in a data record: ▪ A new dimension row is added with updated attribute values for the data record ▪ New primary surrogate key is assigned to new dimension row @ongchinhwee

Changing Dimensions (Type 2 SCD) ◦ When a change is detected in a data record: ▪ A new dimension row is added with updated attribute values for the data record ▪ New primary surrogate key is assigned to new dimension row ▪ Previous version of the attribute is updated with row expiration timestamp @ongchinhwee

Type 2 SCD: What is tuple versioning? • Change data
capture mechanism that records changes to a mutable upstream table over time • Implements Type-2 Slowly Changing Dimensions on mutable table sources • Detects changes based on an `updated_at` timestamp @ongchinhwee

Components of tuple versioning • Target schema • Column for
timestamp-based tracking • Primary key / unique identiﬁer(s) for record • Whether to invalidate records no longer in source @ongchinhwee

How does tuple versioning work? @ongchinhwee

What is Slowly Changing Dimensions (SCDs)? • Slowly Changing Dimensions
(SCDs) ◦ Type 3, Type 4 etc. ▪ Less commonly used due to additional complexity without significant performance benefits in cloud data warehouses • Type 3: add new column to store previous attribute value • Type 4: add history table to keep record of state changes More information on Slowly Changing Dimension Techniques: Chapter 5 of The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition by Ralph Kimball and Margy Ross @ongchinhwee

Data Snapshots • Read-only (immutable) copies of the state of
a data source at a particular point in time • Stored at the staging area of a data warehouse @ongchinhwee

@ongchinhwee

Incremental models • Limiting the data transformation to a speciﬁed
subset of source data ◦ Usually on rows in source data that have been created or updated since the last scheduled run • Signiﬁcantly optimises runtime of transformations on large data + reduces compute cost @ongchinhwee

Incremental models @ongchinhwee

@ongchinhwee Anatomy of incremental materialization on dbt Adapted from: dbt-dummy
project (https://github.com/gmyrianthous/dbt-dummy/) Condition for incremental load Where to get the load from? Where to insert incremental load? How to insert/update incremental load?

@ongchinhwee Anatomy of incremental materialization with Type-2 SCD model as
source Condition for incremental load Where to get the load from? Where to insert incremental load? How to insert/update incremental load?

Anatomy of incremental models • Does destination table already exist
in data warehouse / lake? @ongchinhwee

in data warehouse / lake? • Full-refresh or incremental run? @ongchinhwee

in data warehouse / lake? • Full-refresh or incremental run? • Incremental strategy ◦ Data warehouse or data lake? @ongchinhwee

in data warehouse / lake? • Full-refresh or incremental run? • Incremental strategy ◦ Data warehouse or data lake? • Columns to track changes @ongchinhwee

in data warehouse / lake? • Full-refresh or incremental run? • Incremental strategy ◦ Data warehouse or data lake? • Columns to track changes • Query ﬁlter for incremental run @ongchinhwee

Is Kimball’s approach still relevant? (especially in the Modern Data
Stack) @ongchinhwee

Data Warehousing in the Modern Data Stack @ongchinhwee “Since storage
and compute are dirt cheap, engineering time is expensive, why not snapshot all your data (and append new partitions for each ETL schedule)?” • Does not apply for very large dimensions • Does not preclude the importance of dimensional data modelling Related reading: Functional Data Enginering - a modern paradigm for batch data processing (and related talks) by Maxime Beauchemin, creator of Airﬂow and Superset

Is Kimball’s Dimensional Data Modelling still relevant? • Yes, in
some ways. • Dimensional data modelling is still needed for these use cases: ◦ Aggregation of facts ◦ Metrics drill-down based on dimensions ▪ e.g. how many customers in Singapore spent > US$400 per month ◦ Financial reporting and audit @ongchinhwee

Tips and tricks on change-aware data modelling @ongchinhwee

Snapshot all your upstream source data! • What if assumptions
about your data change? ◦ Source data update mechanism (append-only vs overwrite) ◦ Source table schema (columns added/altered/dropped) ◦ Business logic • Store data snapshots in staging area ◦ Build SCDs from data snapshots @ongchinhwee

Source-target schema reconciliation Detect schema diffs between upstream source system
vs data warehouse • Useful for detecting schema changes @ongchinhwee

Use Type 2 SCDs + incremental model (in most cases)
• In most cases, Type 2 SCDs are suﬃcient for tracking and capturing data changes • Incremental models may be slightly complex, but pays off in eﬃciency + cost when data scales quickly and changes (quickly and slowly) over time @ongchinhwee

Strategies on designing incremental models • Design with upstream data
update mechanism in mind ◦ Affects how incremental load is loaded onto target model • Incremental strategy depends on data warehouse / lake ◦ Specify a unique key (for data warehouse) or partitioning key (for data lake) to update existing records • Performance tip: Filter rows early! @ongchinhwee

Key Takeaways • Dimensional data modeling techniques are still relevant
in the Modern Data Stack • Adopt a mixture of SCD-like approaches and incremental models when designing change-aware data models for non-retroactive business logic • Data snapshots are useful in cases of where re-modelling is required due to changes in business logic @ongchinhwee

Reach out to me! : ongchinhwee : [email protected] : @ongchinhwee
: hweecat : https://ongchinhwee.me And get those slides here: http://bit.ly/fossasia-change-aware-data @ongchinhwee

FOSSASIA Summit 2023 - Modeling data fast, slow...

FOSSASIA Summit 2023 - Modeling data fast, slow and back-in-time: Change-aware dimensional data modeling in the modern data stack

More Decks by Ong Chin Hwee

Other Decks in Programming

Featured

Transcript