Slide 1

Modeling Data Fast, Slow and Back-In-Time: Change-aware Dimensional Data Modeling in the Modern Data Stack
By: Chin Hwee Ong (@ongchinhwee)
9 September 2024, Postgres Ibiza 2024, Ibiza, Spain

Slide 2

About me: Ong Chin Hwee 王敬惠
● Senior Data Engineer @ Grab
● Speaker and (occasional) writer on data engineering topics
● 90% self-taught (+10% from my engineering degrees)

Slide 3

And also, here are the slides: https://bit.ly/pgibz-change-aware-data

Slide 4

Before we begin, some disclaimers:
● The insights in this talk are based on my own technical expertise and do not in any way represent the views of my team and employer.
● Scope is primarily focused on structured tabular data
○ Similar concepts apply (with different implementations) for unstructured data, but that is not the main focus of this talk

Slide 5

Imagine this scenario in your data team:
● Finance requests a regeneration of a business-critical report dated 11 months ago for Client A
● Client A was upgraded from Tier 2 to Tier 1 three months ago, which increased the rebate earn rate
● However, you need to generate the report based on the Tier 2 earn rate (as Client A was a Tier 2 customer 11 months ago)

Slide 7

Time is an important dimension in your data
● Data is usually not static
○ State transitions during a business process
○ Attributes can change over time, e.g. age, income, status

Slide 8

What is data versioning?
● Concept of capturing state changes while keeping track of successive versions of data over time
● Creates a unique reference for a collection of data when changes occur, while retaining previous versions

Slide 9

What is data versioning?
● Types of data versioning:
○ Change data capture (CDC) for structured data
○ Data version control for unstructured data

Slide 12

Why does data versioning matter?
● Reproducibility for data governance and audit purposes
● Build data history with backward and forward compatibility
○ What if there is a change in transformation logic that only applies to a specific time range?
● When you need to use point-in-time values to track business metrics over time

Slide 13

What is change data capture?
● “Data versioning for databases”
● Design patterns for capturing and tracking changes in data from upstream source systems over time
● Changes are captured in data warehouses / data lakes

Slide 14

Design Patterns for Change Data Capture
● Data versioning based on a combination of:
○ Data version identifiers
○ Timestamps
○ Status indicators
● Log trigger / tuple versioning
○ Also known as Type 2 Slowly Changing Dimensions (SCDs)
● Transaction logs

Slide 15

What do we mean by the “Modern Data Stack”?
● Cloud-based
● Built around the cloud data warehouse / lake
● Modular and customizable - “choose the best tool for a specific job”

Slide 19

Data Warehousing in the Modern Data Stack
● Cloud-based compute and storage = more scalable
● From ETL to ELT - transformation happens within the data warehouse
● Possible to store “snapshots” of data in a cloud data warehouse to capture historical changes

Slide 21

Change data capture in the Modern Data Stack
● Some implementations of change data capture:
○ Traditional (Kimball’s) Dimensional Modelling techniques
■ Slowly Changing Dimensions (SCDs)
○ Modern approaches
■ Data snapshots
■ Incremental models

Slide 22

Traditional (Kimball-style) Dimensional Data Modelling
● Developed by Kimball in 1996; updated in 2013 during the emergence of cloud data warehouses
● Introduced the concept of facts vs dimensions (star schema)
● Designed for storage and compute efficiency

Slide 23

Traditional (Kimball-style) Dimensional Data Modelling
● Fundamental Concepts: Facts vs Dimensions
○ Fact Tables
■ Contain metrics and facts about a business process, e.g. process time, transaction amount
■ Are typically fast-evolving during a business process event
■ Eventually reach a final state at a point in time upon completion

Slide 25

Traditional (Kimball-style) Dimensional Data Modelling
● Fundamental Concepts: Facts vs Dimensions
○ Dimension Tables
■ Describe the attributes of a business process, e.g. customer details
■ Are typically slow-changing and updated over a longer period of time
■ Do not have a “final state” → the problem of Slowly Changing Dimensions (SCDs)

Slide 26

What are Slowly Changing Dimensions (SCDs)?
● Change tracking techniques to handle state changes in dimensions

    SCD Type | How state changes are handled
    0        | Ignore changes
    1        | Overwrite
    2        | Row versioning (w/ “validity period”)
    3        | Previous value column
    4        | History table (“mini-dimension”)

Slide 27

What are Slowly Changing Dimensions (SCDs)?
● Type 0 Slowly Changing Dimensions (Type 0 SCD)
○ Fixed dimension, e.g. account opening date
○ Ignores any changes
○ Assumes that the attribute will never change

Slide 28

What are Slowly Changing Dimensions (SCDs)?
● Type 1 Slowly Changing Dimensions (Type 1 SCD)
○ Reflects the latest version of dimension attributes
○ Previous version of the value in the dimension row is overwritten with the new value
○ Destroys history - not possible to rewind back to previous versions of the data
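
As a rough illustration of the overwrite, in SQL (table and values are hypothetical):

    -- Type 1: update in place; the old value is lost
    UPDATE dim_client
    SET tier = 'Tier 1'
    WHERE client_id = 42;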

Slide 29

What are Slowly Changing Dimensions (SCDs)?
● Type 2 Slowly Changing Dimensions (Type 2 SCD)
○ Implements row versioning for each dimension attribute
○ Concept of a “validity period” for each version of the data
■ Row effective date / timestamp
■ Row expiration date / timestamp
■ Current row indicator

Slide 32

What are Slowly Changing Dimensions (SCDs)?
● Type 2 Slowly Changing Dimensions (Type 2 SCD)
○ When a change is detected in a data record:
■ A new dimension row is added with updated attribute values for the data record
■ A new primary surrogate key is assigned to the new dimension row
■ The previous version of the attribute is updated with a row expiration timestamp
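
A minimal SQL sketch of this flow, assuming a hypothetical dim_client table that carries the validity-period columns from the previous slide:

    -- One row per version; the surrogate key distinguishes versions of the same client
    CREATE TABLE dim_client (
        client_sk      BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        client_id      BIGINT NOT NULL,                         -- natural/business key
        tier           TEXT NOT NULL,                           -- tracked attribute
        row_effective  TIMESTAMPTZ NOT NULL,                    -- validity period start
        row_expiration TIMESTAMPTZ NOT NULL DEFAULT 'infinity', -- validity period end
        is_current     BOOLEAN NOT NULL DEFAULT TRUE            -- current row indicator
    );

    -- Change detected: Client A (client_id 42) moves from Tier 2 to Tier 1
    UPDATE dim_client
    SET row_expiration = now(), is_current = FALSE
    WHERE client_id = 42 AND is_current;

    INSERT INTO dim_client (client_id, tier, row_effective)
    VALUES (42, 'Tier 1', now());  -- new surrogate key is assigned automatically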

Slide 33

Type 2 SCD: What is tuple versioning?
● Change data capture mechanism that records changes to a mutable upstream table over time
● Implements Type 2 Slowly Changing Dimensions on mutable table sources
● Detects changes based on an `updated_at` timestamp

Slide 34

Components of tuple versioning
● Target schema
● Column for timestamp-based tracking
● Primary key / unique identifier(s) for record
● Whether to invalidate records no longer in source
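
In dbt (which the incremental-materialization slides later in this deck are based on), these four components map directly onto the snapshot config. A minimal sketch, with illustrative source and column names:

    {% snapshot clients_snapshot %}
    {{
        config(
          target_schema='snapshots',        -- target schema
          strategy='timestamp',             -- timestamp-based tracking
          updated_at='updated_at',          -- column used to detect changes
          unique_key='client_id',           -- primary key / unique identifier
          invalidate_hard_deletes=True      -- invalidate records no longer in source
        )
    }}
    select * from {{ source('crm', 'clients') }}
    {% endsnapshot %}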

Slide 35

How does tuple versioning work?
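
As an illustration (values are hypothetical): when the source row for client 42 is updated in place, the tuple-versioned table appends a new version instead of overwriting:

    Source table (mutable):

        client_id | tier   | updated_at
        42        | Tier 1 | 2024-06-01

    Tuple-versioned table (append-only):

        client_id | tier   | valid_from | valid_to   | is_current
        42        | Tier 2 | 2023-07-10 | 2024-06-01 | false
        42        | Tier 1 | 2024-06-01 | (null)     | true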

Slide 37

What are Slowly Changing Dimensions (SCDs)?
● Slowly Changing Dimensions (SCDs)
○ Type 3, Type 4, etc.
■ Less commonly used due to additional complexity without significant performance benefits in cloud data warehouses
● Type 3: add a new column to store the previous attribute value
● Type 4: add a history table to keep a record of state changes

More information on Slowly Changing Dimension techniques: Chapter 5 of The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition by Ralph Kimball and Margy Ross

Slide 38

Data Snapshots
● Read-only (immutable) copies of the state of a data source at a particular point in time
● Stored at the staging area of a data warehouse

Slide 40

Incremental models
● Limiting the data transformation to a specified subset of source data
○ Usually rows in the source data that have been created or updated since the last scheduled run
● Significantly optimises runtime of transformations on large data + reduces compute cost

Slide 42

Anatomy of incremental materialization on dbt
Adapted from: dbt-dummy project (https://github.com/gmyrianthous/dbt-dummy/)
● Condition for incremental load
● Where to get the load from?
● Where to insert incremental load?
● How to insert/update incremental load?
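
A minimal sketch of how those four questions are answered in a dbt incremental model (source and column names are illustrative; see the dbt-dummy project above for the original):

    {{ config(
        materialized='incremental',   -- where to insert: this model's own target table
        unique_key='order_id'         -- how to insert/update: merge on this key
    ) }}

    select order_id, client_id, amount, updated_at
    from {{ source('payments', 'orders') }}   -- where to get the load from
    {% if is_incremental() %}
    -- condition for incremental load: only rows changed since the last run
    where updated_at > (select max(updated_at) from {{ this }})
    {% endif %}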

Slide 43

Anatomy of incremental materialization with a Type 2 SCD model as source
● Condition for incremental load vs backfilling
● Where to get the load from?
● Where to insert incremental load?
● How to insert/update incremental load?
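
A hypothetical variation with a Type 2 SCD snapshot as the source, where a backfill flag switches between incremental loading and reprocessing the full history (names are illustrative; dbt snapshots emit dbt_valid_from / dbt_valid_to):

    {{ config(materialized='incremental', unique_key='client_version_key') }}

    select client_version_key, client_id, tier, dbt_valid_from, dbt_valid_to
    from {{ ref('clients_snapshot') }}   -- Type 2 SCD model as source
    {% if is_incremental() and not var('backfill', false) %}
    -- incremental: only versions created since the last run;
    -- a backfill run (--vars '{backfill: true}') reprocesses the full history
    where dbt_valid_from > (select max(dbt_valid_from) from {{ this }})
    {% endif %}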

Slide 48

Anatomy of incremental models
● Does the destination table already exist in the data warehouse / lake?
● Full-refresh or incremental run?
● Incremental strategy
○ Data warehouse or data lake?
● Columns to track changes
● Query filter for incremental run

Slide 49

What if we need to rewind history on the state of the data warehouse / data lake?

Slide 50

Imagine this scenario in your data team:
● On 1 August 2024, the Audit team requests a review of trade valuations for 30 June 2024 as of 1 July 2024.
● Due to late-arriving market information, there was a retroactive update on 15 July 2024 to one of the trade valuation records for 30 June 2024.
● However, you need to provide the trade valuations based on what was known on 1 July 2024.

Slide 52

Bitemporal models for system versioning
● Fundamental Concepts: System Time vs Valid Time
○ Valid Time
■ Also known as actual time
■ Time when the event occurred in the real world
■ Mutable, depending on the effective period determined by changes in event state

Slide 53

Bitemporal models for system versioning
● Fundamental Concepts: System Time vs Valid Time
○ System Time
■ Also known as record time (or “transaction time” in databases)
■ Time when the data for the event is recorded in the system
■ Immutable by design, based on the current state of data
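
A minimal Postgres sketch of a table carrying both time dimensions as range types (table and column names are hypothetical):

    CREATE TABLE trade_valuation (
        trade_id    BIGINT  NOT NULL,
        valuation   NUMERIC NOT NULL,
        -- valid time: when the valuation applies in the real world (mutable)
        valid_time  TSTZRANGE NOT NULL,
        -- system time: when this row was the system's current belief (immutable)
        system_time TSTZRANGE NOT NULL DEFAULT tstzrange(now(), 'infinity')
    );

    -- The audit scenario earlier: what did we know on 1 July 2024
    -- about the valuation dated 30 June 2024?
    SELECT trade_id, valuation
    FROM trade_valuation
    WHERE valid_time  @> '2024-06-30 00:00+00'::timestamptz
      AND system_time @> '2024-07-01 00:00+00'::timestamptz;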

Slide 54

Bitemporal models for system versioning
● Why bitemporal models?
○ Retroactive changes to valid-time tuple versioning models
■ Usually require a full refresh → overwrites event history!
○ Preserve history over time with time travel for data audit and governance

Slide 57

Bitemporal models for system versioning
● Some implementations of bitemporal models:
○ Data snapshots of tuple versioning models
○ History table for tracking state changes w/ system time partitioning
○ Change data capture (CDC) streams on event-aware data
■ Capture DML changes on data records with “valid time ranges”

Slide 58

Bitemporal models in other databases / data warehouses
● MariaDB
○ Support for both system-versioned tables and application-time periods
○ Includes the following SQL extensions:
■ PERIOD FOR syntax supports both system and event time periods
■ Immutable system time (vs. mutable time periods)
○ Design pattern: time ranges for system time and event time
Ref: Temporal Tables - MariaDB Knowledge Base
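
A minimal MariaDB sketch of this pattern, with a hypothetical client_tier table:

    -- System time is maintained automatically; the application-time period is mutable
    CREATE TABLE client_tier (
        client_id  INT,
        tier       VARCHAR(10),
        start_date DATE NOT NULL,
        end_date   DATE NOT NULL,
        PERIOD FOR application_time (start_date, end_date)
    ) WITH SYSTEM VERSIONING;

    -- Time travel on the immutable system time
    SELECT * FROM client_tier
    FOR SYSTEM_TIME AS OF TIMESTAMP '2024-07-01 00:00:00';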

Slide 59

Bitemporal models in other databases / data warehouses
● CockroachDB
○ Support for time travel queries
○ Includes the following SQL extensions on top of Postgres:
■ Access historical data based on timestamp (“AS OF SYSTEM TIME”)
○ Design pattern: tuple versioning (SCD-2) model based on valid time + time-travel data snapshots based on system time
Ref: AS OF SYSTEM TIME - CockroachDB Documentation
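
For example, reading a table as it stood at a given system timestamp (table name hypothetical):

    -- Historical read against the database state as of 1 July 2024
    SELECT client_id, tier
    FROM client_tier AS OF SYSTEM TIME '2024-07-01 00:00:00';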

Slide 60

Bitemporal models in other databases / data warehouses
● Snowflake Time Travel
○ Supports time travel retention up to 90 days prior (for Enterprise Edition)
○ Includes the following SQL extensions:
■ Access historical data based on timestamp
■ UNDROP command for tables, schemas and databases
○ Design pattern: tuple versioning (SCD-2) model based on valid time + time-travel data snapshots based on system time
Ref: Understanding & Using Time Travel - Snowflake Documentation
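
For example (table name hypothetical):

    -- Query the table as of a point in time within the retention window
    SELECT * FROM client_tier
      AT (TIMESTAMP => '2024-07-01 00:00:00'::TIMESTAMP_LTZ);

    -- Recover an accidentally dropped table within the retention window
    UNDROP TABLE client_tier;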

Slide 61

Bitemporal models in Postgres? (Hint: it’s been brewing)

Slide 62

Native features and extensions in Postgres releases
● pg_bitemporal
○ Postgres extension by Henrietta (Hettie) Dombrovskaya
○ Leverages GiST indexes with exclusion constraints for bitemporal models
● (Future) Temporal keys
○ Ongoing work by Peter Eisentraut and team to implement temporal keys with exclusion constraints as a native feature in Postgres
○ Current problem: empty ranges and multiranges
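
A minimal sketch of the underlying building block - an exclusion constraint that forbids overlapping validity periods for the same key (pg_bitemporal's actual schema differs; names here are illustrative):

    CREATE EXTENSION IF NOT EXISTS btree_gist;  -- enables scalar equality in GiST indexes

    CREATE TABLE client_tier_history (
        client_id  BIGINT NOT NULL,
        tier       TEXT NOT NULL,
        valid_time TSTZRANGE NOT NULL,
        -- at most one version of a client may be valid at any instant
        EXCLUDE USING gist (client_id WITH =, valid_time WITH &&)
    );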

Slide 63

A re-look at the original design of Postgres
● The original design was intended to support time travel
○ The original Postgres storage system design is “no-overwrite”
■ Copy of data record w/ copy of write in secondary store
■ “Vacuuming” of historical records + archiving data permanently into an immutable store
○ However, this design was deprecated due to technological constraints at that time.
Ref: The Design of Postgres by Michael Stonebraker and Lawrence A. Rowe

Slide 64

Is Kimball’s approach still relevant? (especially in the Modern Data Stack)

Slide 65

Data Warehousing in the Modern Data Stack
“Since storage and compute are dirt cheap, and engineering time is expensive, why not snapshot all your data (and append new partitions for each ETL schedule)?”
● Does not apply for very large dimensions
● Does not obviate the need for dimensional data modelling
Related reading: Functional Data Engineering - a modern paradigm for batch data processing (and related talks) by Maxime Beauchemin, creator of Airflow and Superset

Slide 66

Is Kimball’s Dimensional Data Modelling still relevant?
● Yes, in some ways.
● Dimensional data modelling is still needed for these use cases:
○ Aggregation of facts
○ Metrics drill-down based on dimensions
■ e.g. how many customers in Singapore spent > US$400 per month
○ Financial reporting and audit

Slide 67

Tips and tricks on change-aware data modelling

Slide 68

Snapshot all your upstream source data!
● What if assumptions about your data change?
○ Source data update mechanism (append-only vs overwrite)
○ Source table schema (columns added/altered/dropped)
○ Business logic
● Store data snapshots of the upstream source in the staging area
○ Build SCDs from data snapshots

Slide 69

Snapshot the state of your data warehouse data!
● What if we need to “time travel” the data back to the past?
○ Restore data object changes (e.g. accidental table deletion)
○ Periodic data backups
○ Audit log of retroactive changes to data records
● Store database object snapshots in an immutable store
○ Append-only snapshot record for each database transaction
○ Build SCDs from database object snapshots

Slide 70

Source-target schema reconciliation
● Detect schema diffs between the upstream source system and the data warehouse
● Useful for detecting schema changes
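
As a rough sketch, when both catalogs are queryable from one place (e.g. via a foreign data wrapper), a column-level diff can be as simple as this (table names hypothetical):

    -- Columns present in the source table but missing (or typed differently) in the target
    SELECT column_name, data_type
    FROM information_schema.columns
    WHERE table_name = 'clients_source'
    EXCEPT
    SELECT column_name, data_type
    FROM information_schema.columns
    WHERE table_name = 'clients_target';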

Slide 71

Use Type 2 SCDs + incremental models (in most cases)
● In most cases, Type 2 SCDs are sufficient for tracking and capturing data changes
● Incremental models may be slightly complex, but pay off in efficiency + cost when data scales quickly and changes (quickly and slowly) over time

Slide 72

Strategies on designing incremental models
● Design with the upstream data update mechanism in mind
○ Affects how the incremental load is loaded onto the target model
● Incremental strategy depends on the data warehouse / lake
○ Specify a unique key (for a data warehouse) or partitioning key (for a data lake) to update existing records
● Performance tip: filter rows early!

Slide 73

Key Takeaways
● Dimensional data modeling techniques are still relevant in the Modern Data Stack
● Adopt a mixture of SCD-like approaches and incremental models when designing change-aware data models for non-retroactive business logic
● Data snapshots are useful in cases where re-modelling is required due to retroactive changes in the data

Slide 74

Reach out to me!
: ongchinhwee
: [email protected]
: @ongchinhwee
: hweecat
: https://ongchinhwee.me

And get these slides here: https://bit.ly/pgibz-change-aware-data