Slide 1

Slide 1 text

Modeling Data Fast, Slow and Back-In-Time: Change-aware Dimensional Data Modeling in the Modern Data Stack By: Chin Hwee Ong (@ongchinhwee) 15 April 2023 FOSSASIA Summit 2023, Singapore

Slide 2

Slide 2 text

About me Ong Chin Hwee 王敬惠 ● Senior Data Engineer @ Grab ● Speaker and (occasional) writer on data engineering topics ● 90% self-taught (+10% from my engineering degrees) @ongchinhwee

Slide 3

Slide 3 text

Before we begin, some disclaimers: @ongchinhwee ● The insights in this talk are based on my own technical expertise and do not in any way represent the views of my team or employer. ● Scope is primarily focused on structured tabular data ○ Similar concepts apply, with different implementations, to unstructured data, but that’s not the main focus of this talk

Slide 4

Slide 4 text

Imagine this scenario in your data team: @ongchinhwee ● Finance requests a regeneration of a business-critical report dated 11 months ago for Client A ● Client A was upgraded from Tier 2 to Tier 1 three months ago, which increased the rebate earn rate ● However, you need to generate the report based on the Tier 2 earn rate (as Client A was a Tier 2 customer 11 months ago)

Slide 5

Slide 5 text

Imagine this scenario in your data team: @ongchinhwee

Slide 6

Slide 6 text

Time is an important dimension in your data @ongchinhwee ● Data is usually not static ○ State transitions during a business process ○ Attributes can change over time e.g. age, income, status

Slide 7

Slide 7 text

What is data versioning? @ongchinhwee ● Concept of capturing state changes while keeping track of successive versions of data over time ● Creates a unique reference for a collection of data when changes occur while retaining previous versions ● Data versioning approaches: ○ Change data capture (CDC) for structured data ○ Data version control for unstructured data

Slide 8

Slide 8 text

What is data versioning? @ongchinhwee ● Types of data versioning: ○ Change data capture (CDC) for structured data ○ Data version control for unstructured data

Slide 9

Slide 9 text

Why does data versioning matter? @ongchinhwee ● Reproducibility for data governance and audit purposes

Slide 10

Slide 10 text

Why does data versioning matter? @ongchinhwee ● Reproducibility for data governance and audit purposes ● Build data history with backward and forward compatibility ○ What if there is a change in transformation logic that only applies to a specific time range?

Slide 11

Slide 11 text

Why does data versioning matter? @ongchinhwee ● Reproducibility for data governance and audit purposes ● Build data history with backward and forward compatibility ○ What if there is a change in transformation logic that only applies to a specific time range? ● When you need to use point-in-time values to track business metrics over time

Slide 12

Slide 12 text

What is change data capture? @ongchinhwee ● “Data versioning for databases” ● Design patterns for capturing and tracking changes in data from upstream source systems over time ● Changes are captured in data warehouses / data lakes

Slide 13

Slide 13 text

Design Patterns for Change Data Capture @ongchinhwee ● Data versioning based on a combination of: ○ Data version identifiers ○ Timestamps ○ Status indicators ● Log trigger / tuple versioning ○ Also known as Type 2 Slowly Changing Dimensions (SCDs) ● Transaction logs

Slide 14

Slide 14 text

What do we mean by the “Modern Data Stack”? @ongchinhwee ● Cloud-based ● Built around the cloud data warehouse / lake ● Modular and customizable - “choose the best tool for a specific job”

Slide 15

Slide 15 text

@ongchinhwee

Slide 16

Slide 16 text

Data Warehousing in the Modern Data Stack @ongchinhwee ● Cloud-based compute and storage = more scalable

Slide 17

Slide 17 text

Data Warehousing in the Modern Data Stack @ongchinhwee ● Cloud-based compute and storage = more scalable ● From ETL to ELT - transformation within the data warehouse

Slide 18

Slide 18 text

Data Warehousing in the Modern Data Stack @ongchinhwee ● Cloud-based compute and storage = more scalable ● From ETL to ELT - transformation within the data warehouse ● Possible to store “snapshots” of data in a cloud data warehouse to capture historical changes

Slide 19

Slide 19 text

Change data capture in the Modern Data Stack ● Some implementations of change data capture ○ Traditional (Kimball’s) Dimensional Modelling techniques ■ Slowly Changing Dimensions (SCDs) @ongchinhwee

Slide 20

Slide 20 text

Change data capture in the Modern Data Stack ● Some implementations of change data capture ○ Traditional (Kimball’s) Dimensional Modelling techniques ■ Slowly Changing Dimensions (SCDs) ○ Modern approaches ■ Data snapshots ■ Incremental models @ongchinhwee

Slide 21

Slide 21 text

Traditional (Kimball-style) Dimensional Data Modelling ● Developed by Kimball in 1996; updated in 2013 during the emergence of cloud data warehouses ● Introduced the concept of facts vs dimensions (star schema) ● Designed for storage and compute efficiency @ongchinhwee

Slide 22

Slide 22 text

Traditional (Kimball-style) Dimensional Data Modelling ● Fundamental Concepts: Facts vs Dimensions ○ Fact Tables ■ Contain metrics and facts about a business process e.g. process time, transaction amount ■ Are typically fast-evolving during a business process event ■ Eventually reach a final state at a point in time upon completion @ongchinhwee

Slide 23

Slide 23 text

Traditional (Kimball-style) Dimensional Data Modelling ● Fundamental Concepts: Facts vs Dimensions ○ Dimension Tables ■ Describe the attributes of a business process e.g. customer details ■ Are typically slow-changing and updated over a longer period of time ■ Do not have a “final state” @ongchinhwee

Slide 24

Slide 24 text

Traditional (Kimball-style) Dimensional Data Modelling ● Fundamental Concepts: Facts vs Dimensions ○ Dimension Tables ■ Describe the attributes of a business process e.g. customer details ■ Are typically slow-changing and updated over a longer period of time ■ Do not have a “final state” → the problem of Slowly Changing Dimensions (SCDs) @ongchinhwee

Slide 25

Slide 25 text

What are Slowly Changing Dimensions (SCDs)? ● Change tracking techniques to handle state changes in dimensions @ongchinhwee

SCD Type | How state changes are handled
0 | Ignore changes
1 | Overwrite
2 | Row versioning (with a “validity period”)
3 | Previous-value column
4 | History table (“mini-dimension”)

Slide 26

Slide 26 text

What are Slowly Changing Dimensions (SCDs)? ● Type 0 Slowly Changing Dimensions (Type 0 SCD) ○ Fixed dimension e.g. account opening date ○ Ignores any changes ○ Assumes that the attribute will never change @ongchinhwee

Slide 27

Slide 27 text

What are Slowly Changing Dimensions (SCDs)? ● Type 1 Slowly Changing Dimensions (Type 1 SCD) ○ Reflects the latest version of dimension attributes ○ Previous version of the value in the dimension row is overwritten with new value ○ Destroys history - not possible to rewind back to previous versions of the data @ongchinhwee
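The overwrite behaviour can be sketched in a few lines; this is an illustrative example using SQLite, and the table, column and key names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, tier TEXT)")
conn.execute("INSERT INTO dim_customer VALUES (1, 'Tier 2')")

# Type 1 SCD: overwrite the attribute in place -- the previous value is lost
conn.execute("UPDATE dim_customer SET tier = 'Tier 1' WHERE customer_id = 1")

current_tier = conn.execute(
    "SELECT tier FROM dim_customer WHERE customer_id = 1").fetchone()[0]
print(current_tier)  # Tier 1 -- 'Tier 2' is no longer recoverable
```

After the update, the table holds a single row with the latest value only; a report that needs last year's tier cannot be regenerated from a Type 1 dimension.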

Slide 28

Slide 28 text

What are Slowly Changing Dimensions (SCDs)? ● Type 2 Slowly Changing Dimensions (Type 2 SCD) ○ Implements row versioning for each dimension attribute ○ Concept of “validity period” for each version of the data ■ Row effective date / timestamp ■ Row expiration date / timestamp ■ Current row indicator @ongchinhwee

Slide 29

Slide 29 text

What are Slowly Changing Dimensions (SCDs)? ● Type 2 Slowly Changing Dimensions (Type 2 SCD) ○ When a change is detected in a data record: ■ A new dimension row is added with updated attribute values for the data record @ongchinhwee

Slide 30

Slide 30 text

What are Slowly Changing Dimensions (SCDs)? ● Type 2 Slowly Changing Dimensions (Type 2 SCD) ○ When a change is detected in a data record: ■ A new dimension row is added with updated attribute values for the data record ■ New primary surrogate key is assigned to new dimension row @ongchinhwee

Slide 31

Slide 31 text

What are Slowly Changing Dimensions (SCDs)? ● Type 2 Slowly Changing Dimensions (Type 2 SCD) ○ When a change is detected in a data record: ■ A new dimension row is added with updated attribute values for the data record ■ New primary surrogate key is assigned to new dimension row ■ Previous version of the row is updated with a row expiration timestamp @ongchinhwee
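Putting those steps together, here is a minimal Type 2 SCD sketch using SQLite for illustration; all table, column and key names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dim_customer (
    customer_sk INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    customer_id INTEGER,                            -- natural key
    tier        TEXT,
    valid_from  TEXT,
    valid_to    TEXT,                               -- NULL while current
    is_current  INTEGER
)""")
conn.execute(
    "INSERT INTO dim_customer (customer_id, tier, valid_from, valid_to, is_current) "
    "VALUES (1, 'Tier 2', '2022-05-01', NULL, 1)")

# A change is detected: the customer moves from Tier 2 to Tier 1.
change_date = "2023-01-15"
# Step 1: expire the previous version of the row
conn.execute(
    "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
    "WHERE customer_id = 1 AND is_current = 1", (change_date,))
# Step 2: add a new row (with a new surrogate key) carrying the updated value
conn.execute(
    "INSERT INTO dim_customer (customer_id, tier, valid_from, valid_to, is_current) "
    "VALUES (1, 'Tier 1', ?, NULL, 1)", (change_date,))

# Point-in-time lookup: which tier applied on 2022-06-01?
row = conn.execute(
    "SELECT tier FROM dim_customer WHERE customer_id = 1 "
    "AND valid_from <= '2022-06-01' "
    "AND (valid_to IS NULL OR valid_to > '2022-06-01')").fetchone()
print(row[0])  # Tier 2
```

The point-in-time lookup is what the report-regeneration scenario at the start of the talk needs: the historical Tier 2 rate remains queryable after the upgrade to Tier 1.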

Slide 32

Slide 32 text

Type 2 SCD: What is tuple versioning? ● Change data capture mechanism that records changes to a mutable upstream table over time ● Implements Type-2 Slowly Changing Dimensions on mutable table sources ● Detects changes based on an `updated_at` timestamp @ongchinhwee

Slide 33

Slide 33 text

Components of tuple versioning ● Target schema ● Column for timestamp-based tracking ● Primary key / unique identifier(s) for record ● Whether to invalidate records no longer in source @ongchinhwee

Slide 34

Slide 34 text

How does tuple versioning work? @ongchinhwee

Slide 35

Slide 35 text

How does tuple versioning work? @ongchinhwee
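The change-detection step of tuple versioning can be sketched as a high-watermark comparison against the `updated_at` column; the source table and names below are hypothetical, using SQLite for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical mutable source table that carries an updated_at column
conn.execute(
    "CREATE TABLE src_customer (customer_id INTEGER PRIMARY KEY, tier TEXT, updated_at TEXT)")
conn.executemany("INSERT INTO src_customer VALUES (?, ?, ?)",
                 [(1, 'Tier 1', '2023-01-15'), (2, 'Tier 2', '2022-08-01')])

last_run = '2022-12-31'  # high-watermark recorded by the previous capture run

# Only rows touched since the last run are picked up for Type 2 processing
changed = conn.execute(
    "SELECT customer_id, tier, updated_at FROM src_customer WHERE updated_at > ?",
    (last_run,)).fetchall()
print(changed)  # [(1, 'Tier 1', '2023-01-15')]
```

Each changed row is then applied to the dimension table using the Type 2 expire-and-insert logic shown earlier.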

Slide 36

Slide 36 text

What are Slowly Changing Dimensions (SCDs)? ● Slowly Changing Dimensions (SCDs) ○ Type 3, Type 4 etc. ■ Less commonly used due to additional complexity without significant performance benefits in cloud data warehouses ● Type 3: add new column to store previous attribute value ● Type 4: add history table to keep record of state changes More information on Slowly Changing Dimension Techniques: Chapter 5 of The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition by Ralph Kimball and Margy Ross @ongchinhwee

Slide 37

Slide 37 text

Data Snapshots ● Read-only (immutable) copies of the state of a data source at a particular point in time ● Stored at the staging area of a data warehouse @ongchinhwee
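Append-only snapshotting into a staging table can be sketched as follows; this is an illustrative SQLite example and all names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_customer (customer_id INTEGER, tier TEXT)")
conn.execute("INSERT INTO src_customer VALUES (1, 'Tier 2')")
# Staging-area snapshot table: one immutable, dated copy per run
conn.execute(
    "CREATE TABLE stg_customer_snapshot (snapshot_date TEXT, customer_id INTEGER, tier TEXT)")

def take_snapshot(run_date):
    # Append-only: each run adds a new dated copy; old copies are never rewritten
    conn.execute("INSERT INTO stg_customer_snapshot "
                 "SELECT ?, customer_id, tier FROM src_customer", (run_date,))

take_snapshot('2022-05-01')
conn.execute("UPDATE src_customer SET tier = 'Tier 1' WHERE customer_id = 1")  # source mutates
take_snapshot('2023-01-15')

history = conn.execute(
    "SELECT snapshot_date, tier FROM stg_customer_snapshot ORDER BY snapshot_date").fetchall()
print(history)  # [('2022-05-01', 'Tier 2'), ('2023-01-15', 'Tier 1')]
```

Even though the source table is overwritten in place, every past state survives in the snapshot table and can be replayed or re-modelled later.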

Slide 38

Slide 38 text

@ongchinhwee

Slide 39

Slide 39 text

Incremental models ● Limiting the data transformation to a specified subset of source data ○ Usually on rows in source data that have been created or updated since the last scheduled run ● Significantly optimises runtime of transformations on large data + reduces compute cost @ongchinhwee

Slide 40

Slide 40 text

Incremental models @ongchinhwee

Slide 41

Slide 41 text

@ongchinhwee Anatomy of incremental materialization on dbt Adapted from: dbt-dummy project (https://github.com/gmyrianthous/dbt-dummy/) ● Condition for incremental load ● Where to get the load from? ● Where to insert incremental load? ● How to insert/update incremental load?
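The four callouts above can be approximated in plain code; this is an illustrative Python/SQLite sketch of the logic an incremental materialization performs, not dbt's actual Jinja template, and all table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "Where to get the load from?" -- the upstream source table
conn.execute(
    "CREATE TABLE src_orders (order_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")
conn.executemany("INSERT INTO src_orders VALUES (?, ?, ?)",
                 [(1, 10.0, '2023-01-01'), (2, 20.0, '2023-02-01')])
# "Where to insert incremental load?" -- the destination (target) table
conn.execute(
    "CREATE TABLE fct_orders (order_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")
conn.execute("INSERT INTO fct_orders VALUES (1, 10.0, '2023-01-01')")

# "Condition for incremental load" -- only rows newer than the target's high-watermark
max_loaded = conn.execute("SELECT MAX(updated_at) FROM fct_orders").fetchone()[0]
new_rows = conn.execute(
    "SELECT * FROM src_orders WHERE updated_at > ?", (max_loaded,)).fetchall()

# "How to insert/update incremental load?" -- upsert on the unique key
conn.executemany(
    "INSERT INTO fct_orders VALUES (?, ?, ?) "
    "ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount, "
    "updated_at = excluded.updated_at", new_rows)

print(conn.execute("SELECT COUNT(*) FROM fct_orders").fetchone()[0])  # 2
```

In dbt the same four decisions are expressed declaratively: the `is_incremental()` condition, the source `ref()`/`source()`, the target model, and the `incremental_strategy` with its `unique_key`.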

Slide 42

Slide 42 text

@ongchinhwee Anatomy of incremental materialization with Type-2 SCD model as source ● Condition for incremental load ● Where to get the load from? ● Where to insert incremental load? ● How to insert/update incremental load?

Slide 43

Slide 43 text

Anatomy of incremental models ● Does destination table already exist in data warehouse / lake? @ongchinhwee

Slide 44

Slide 44 text

Anatomy of incremental models ● Does destination table already exist in data warehouse / lake? ● Full-refresh or incremental run? @ongchinhwee

Slide 45

Slide 45 text

Anatomy of incremental models ● Does destination table already exist in data warehouse / lake? ● Full-refresh or incremental run? ● Incremental strategy ○ Data warehouse or data lake? @ongchinhwee

Slide 46

Slide 46 text

Anatomy of incremental models ● Does destination table already exist in data warehouse / lake? ● Full-refresh or incremental run? ● Incremental strategy ○ Data warehouse or data lake? ● Columns to track changes @ongchinhwee

Slide 47

Slide 47 text

Anatomy of incremental models ● Does destination table already exist in data warehouse / lake? ● Full-refresh or incremental run? ● Incremental strategy ○ Data warehouse or data lake? ● Columns to track changes ● Query filter for incremental run @ongchinhwee

Slide 48

Slide 48 text

Is Kimball’s approach still relevant? (especially in the Modern Data Stack) @ongchinhwee

Slide 49

Slide 49 text

Data Warehousing in the Modern Data Stack @ongchinhwee “Since storage and compute are dirt cheap and engineering time is expensive, why not snapshot all your data (and append new partitions for each ETL schedule)?” ● Does not apply for very large dimensions ● Does not remove the need for dimensional data modelling Related reading: Functional Data Engineering - a modern paradigm for batch data processing (and related talks) by Maxime Beauchemin, creator of Airflow and Superset

Slide 50

Slide 50 text

Is Kimball’s Dimensional Data Modelling still relevant? ● Yes, in some ways. ● Dimensional data modelling is still needed for these use cases: ○ Aggregation of facts ○ Metrics drill-down based on dimensions ■ e.g. how many customers in Singapore spent > US$400 per month ○ Financial reporting and audit @ongchinhwee

Slide 51

Slide 51 text

Tips and tricks on change-aware data modelling @ongchinhwee

Slide 52

Slide 52 text

Snapshot all your upstream source data! ● What if assumptions about your data change? ○ Source data update mechanism (append-only vs overwrite) ○ Source table schema (columns added/altered/dropped) ○ Business logic ● Store data snapshots in staging area ○ Build SCDs from data snapshots @ongchinhwee

Slide 53

Slide 53 text

Source-target schema reconciliation Detect schema diffs between upstream source system vs data warehouse ● Useful for detecting schema changes @ongchinhwee
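One way to sketch a source-target schema diff: compare the column sets of the two tables. Here SQLite's `PRAGMA table_info` stands in for the warehouse's information schema, and all names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Upstream added a 'segment' column that the warehouse copy does not have yet
conn.execute("CREATE TABLE src_customer (customer_id INTEGER, tier TEXT, segment TEXT)")
conn.execute("CREATE TABLE wh_customer (customer_id INTEGER, tier TEXT)")

def columns(table):
    # PRAGMA table_info returns one row per column; index 1 is the column name
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}

added = columns("src_customer") - columns("wh_customer")
dropped = columns("wh_customer") - columns("src_customer")
print(added, dropped)  # {'segment'} set()
```

Running this check before each load surfaces added, altered or dropped columns before they silently break downstream models.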

Slide 54

Slide 54 text

Use Type 2 SCDs + incremental model (in most cases) ● In most cases, Type 2 SCDs are sufficient for tracking and capturing data changes ● Incremental models may be slightly complex, but pay off in efficiency + cost when data scales quickly and changes (quickly and slowly) over time @ongchinhwee

Slide 55

Slide 55 text

Strategies on designing incremental models ● Design with upstream data update mechanism in mind ○ Affects how incremental load is loaded onto target model ● Incremental strategy depends on data warehouse / lake ○ Specify a unique key (for data warehouse) or partitioning key (for data lake) to update existing records ● Performance tip: Filter rows early! @ongchinhwee
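The "filter rows early" tip can be sketched as pushing the incremental predicate into the innermost subquery so downstream joins only see new rows; an illustrative SQLite example with hypothetical names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_orders (order_id INTEGER, customer_id INTEGER, updated_at TEXT)")
conn.execute("CREATE TABLE dim_customer (customer_id INTEGER, tier TEXT)")
conn.executemany("INSERT INTO src_orders VALUES (?, ?, ?)",
                 [(1, 1, '2023-01-01'), (2, 1, '2023-03-01')])
conn.execute("INSERT INTO dim_customer VALUES (1, 'Tier 1')")

# Push the incremental predicate into the innermost subquery so the join
# only ever sees the new rows, rather than joining everything and
# filtering at the end.
rows = conn.execute("""
    SELECT o.order_id, c.tier
    FROM (SELECT * FROM src_orders WHERE updated_at > '2023-02-01') AS o
    JOIN dim_customer AS c USING (customer_id)
""").fetchall()
print(rows)  # [(2, 'Tier 1')]
```

On a columnar cloud warehouse the same pattern also lets the engine prune partitions before the join, which is where most of the cost saving comes from.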

Slide 56

Slide 56 text

Key Takeaways ● Dimensional data modeling techniques are still relevant in the Modern Data Stack ● Adopt a mixture of SCD-like approaches and incremental models when designing change-aware data models for non-retroactive business logic ● Data snapshots are useful in cases where re-modelling is required due to changes in business logic @ongchinhwee

Slide 57

Slide 57 text

Reach out to me! ● ongchinhwee ● [email protected] ● @ongchinhwee ● hweecat ● https://ongchinhwee.me And get those slides here: http://bit.ly/fossasia-change-aware-data @ongchinhwee