Slide 1

Your Data is Late Again! Handling Late-Arriving Data in the Modern Data Stack
By: Chin Hwee Ong (@ongchinhwee)
9 May 2025
MDS Fest 3.0, Online

Slide 2

About me
Ong Chin Hwee 王敬惠
● Senior Data Engineer @ Grab
● Speaker and (occasional) writer on data engineering topics
● Aerospace + business + computation
@ongchinhwee

Slide 3

Before we begin, let’s set some expectations:
● The insights in this talk are based on my own technical expertise / “war stories” and do not in any way represent the views of my team and employer.
○ Any resemblance to actual events is purely coincidental.
@ongchinhwee

Slide 4

“Latest” data in the data warehouse / lake ⇒ which timestamp is it? @ongchinhwee

Slide 5

“Latest” data in the data warehouse / lake ⇒ which timestamp is it?
● When the data was last updated in the source?
● When the data reaches the data warehouse / lake?
● When the data event last happened?
@ongchinhwee

Slide 6

Imagine this scenario in your data team:
● You have a data pipeline that ingests transactions and customer data from the database systems every hour.
● Customer data needs to be integrated with payments data for transactions monitoring throughout the day.
● An outage causes the updated customer data to arrive 3 hours later than scheduled.
@ongchinhwee

Slide 7

Your data team’s objectives
● Ensure timely and accurate reporting for business operations
● Include late-arriving customer data in your monitoring reports
● Maintain an audit log of data changes for data governance
@ongchinhwee

Slide 8

Time is an important dimension in your data
● Data is usually not static
○ State transitions during a business process
○ Attributes can change over time, e.g. age, income, status
@ongchinhwee

Slide 9

Concept of time in data
● Fundamental Concepts: System Time vs Valid Time
○ Valid Time
■ Also known as actual time or event time
■ Time when the event occurred in the real world
■ Mutable depending on the effective period determined by changes in event state
@ongchinhwee

Slide 10

Concept of time in data
● Fundamental Concepts: System Time vs Valid Time
○ System Time
■ Also known as record time (or “transaction time” in databases)
■ Time when the data for the event is recorded in the system
■ Immutable by design based on the current state of the data
@ongchinhwee

Slide 11

@ongchinhwee

Slide 12

Concept of time in data - Modern Data Stack
● Fundamental Concepts: System Time vs Valid Time
○ System Time
■ For data platforms, multiple systems’ timestamps to consider:
● Source system time (database, object storage etc.)
● Ingestion system time
● Data warehouse / lake processing system time
@ongchinhwee

Slide 13

Data Latency in the Modern Data Stack
● Time lag between event time and system time for data events
● Multiple causes:
○ Latency in upstream systems
○ Network delays during ingestion to the data warehouse / lake
○ Out-of-order event generation
⇒ Change data capture for timely change tracking
@ongchinhwee
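
As a rough sketch of measuring this lag (not from the slides: the table payments_raw and columns event_time, source_updated_at, ingested_at, processed_at are assumed names, using BigQuery-flavoured SQL):

```sql
-- Sketch: per-record lag between event time and each system time.
-- Table/column names and the SQL dialect are illustrative assumptions.
SELECT
    payment_id,
    TIMESTAMP_DIFF(source_updated_at, event_time, MINUTE) AS source_lag_minutes,
    TIMESTAMP_DIFF(ingested_at, event_time, MINUTE)        AS ingestion_lag_minutes,
    TIMESTAMP_DIFF(processed_at, event_time, MINUTE)       AS warehouse_lag_minutes
FROM payments_raw
WHERE event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
```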

Slide 14

What is change data capture?
● “Data versioning for databases”
● Design patterns for capturing and tracking changes in data from upstream source systems over time
● Changes are captured in data warehouses / data lakes
@ongchinhwee

Slide 15

Change data capture in the Modern Data Stack
● Some implementations of change data capture
○ Modern approaches
■ Data snapshots
■ Incremental models
○ Traditional (Kimball’s) Dimensional Modelling techniques
■ Slowly Changing Dimensions (SCDs)
@ongchinhwee
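
A minimal sketch of the “data snapshots” approach, assuming a dbt project with a crm.customers source (all names are illustrative, not from the slides):

```sql
-- Sketch of a dbt snapshot: captures changes to the customers table
-- over time as Type 2 SCD rows. Source and column names are assumptions.
{% snapshot customers_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

SELECT * FROM {{ source('crm', 'customers') }}

{% endsnapshot %}
```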

Slide 16

@ongchinhwee

Slide 17

Incremental models @ongchinhwee

Slide 18

Anatomy of incremental materialization
Adapted from: dbt-dummy project (https://github.com/gmyrianthous/dbt-dummy/)
1. Table exists? Full-refresh or incremental
2. Condition for incremental load
○ Where to get the load from?
○ Where to insert the incremental load?
○ How to insert/update the incremental load?
@ongchinhwee
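
A minimal dbt incremental model following this anatomy, as a sketch (the payments.transactions source, payment_id key and column names are assumptions; the merge strategy depends on the warehouse adapter):

```sql
-- Sketch of a dbt incremental model. Names are illustrative assumptions.
{{
    config(
        materialized='incremental',
        unique_key='payment_id',
        incremental_strategy='merge'
    )
}}

SELECT
    payment_id,
    customer_id,
    amount,
    event_time,
    updated_at
FROM {{ source('payments', 'transactions') }}

{% if is_incremental() %}
-- Condition for incremental load: only rows newer than what is already
-- in the destination table ({{ this }}).
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
```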

Slide 19

What are Slowly Changing Dimensions (SCDs)?
● Type 1 Slowly Changing Dimensions (Type 1 SCD)
○ Reflects the latest version of dimension attributes
○ The previous version of the value in the dimension row is overwritten with the new value
@ongchinhwee
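
A hedged sketch of a Type 1 overwrite with a MERGE statement (dim_customer, stg_customer and the columns are assumed names; MERGE is supported by warehouses such as BigQuery, Snowflake and Databricks):

```sql
-- Sketch: Type 1 SCD — overwrite the dimension row in place.
MERGE INTO dim_customer AS target
USING stg_customer AS source
    ON target.customer_id = source.customer_id
WHEN MATCHED THEN
    UPDATE SET email = source.email,        -- previous values are lost
               status = source.status
WHEN NOT MATCHED THEN
    INSERT (customer_id, email, status)
    VALUES (source.customer_id, source.email, source.status);
```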

Slide 20

What are Slowly Changing Dimensions (SCDs)?
● Type 2 Slowly Changing Dimensions (Type 2 SCD)
○ Implements row versioning for each dimension attribute
○ Concept of a “validity period” for each version of the data
@ongchinhwee
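
For illustration, assuming Type 2 rows produced by a dbt snapshot (dbt_valid_from / dbt_valid_to are the snapshot’s validity-period columns; the table name and filter values are made up), the version valid at a given event time can be selected like this:

```sql
-- Sketch: pick the dimension version whose validity period covers the event time.
SELECT *
FROM customers_snapshot
WHERE customer_id = 42
  AND dbt_valid_from <= TIMESTAMP '2025-05-09 00:00:00'
  AND (dbt_valid_to > TIMESTAMP '2025-05-09 00:00:00' OR dbt_valid_to IS NULL)
```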

Slide 21

“Your data is late again! Do something before your users complain!” @ongchinhwee

Slide 22

First Line of Defense: Preventing Downstream Impact @ongchinhwee

Slide 23

First Line of Defense: Preventing Downstream Impact
● Objective: Stop incomplete / inconsistent data from flowing into destination tables
● Preventive approaches we can adopt:
○ Data reconciliation
○ Data integrity checkpoints
@ongchinhwee

Slide 24

Data Reconciliation
● Comparing data from different sources to ensure:
○ Accuracy: values remain correct
○ Consistency: data between sources matches
○ Completeness: all records have been transferred
@ongchinhwee
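
A rough completeness-reconciliation sketch, assuming a source_extract_log table and a warehouse.payments table keyed by ingestion_date (all names illustrative): flag the days where the counts do not match.

```sql
-- Sketch: compare daily row counts between source and warehouse.
SELECT
    s.ingestion_date,
    s.row_count AS source_rows,
    w.row_count AS warehouse_rows,
    s.row_count - COALESCE(w.row_count, 0) AS missing_rows
FROM (
    SELECT ingestion_date, COUNT(*) AS row_count
    FROM source_extract_log
    GROUP BY ingestion_date
) AS s
LEFT JOIN (
    SELECT ingestion_date, COUNT(*) AS row_count
    FROM warehouse.payments
    GROUP BY ingestion_date
) AS w USING (ingestion_date)
WHERE s.row_count <> COALESCE(w.row_count, 0)
```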

Slide 25

@ongchinhwee

Slide 26

Data Reconciliation
● Handling late-arriving dimensions in an incremental load / stream
○ Streaming Reconciliation Pattern
■ Identification
■ Reconciliation
■ Retryable stream
(Ref: “How to Handle Late Arriving Dimensions with a Streaming Reconciliation Pattern” in the Databricks blog)
@ongchinhwee

Slide 27

@ongchinhwee (Ref: “How to Handle Late Arriving Dimensions with a Streaming Reconciliation Pattern” in Databricks blog)

Slide 28

Data Integrity Checkpoints
● Data validation checks to detect data quality issues from late data:
○ Referential integrity across different data sources
○ Accepted values to detect deviation from expected values
○ Missing or null values for late-arriving / missing dimensions after table joins
○ Data freshness checks for tables in the staging area
@ongchinhwee
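
One way to express such a checkpoint is a dbt singular data test. This sketch (fct_payments / dim_customers are assumed model names) returns transactions whose customer dimension has not arrived yet, so any returned rows fail the test:

```sql
-- Sketch of a dbt singular test for referential integrity:
-- rows returned here represent fact records with a missing dimension.
SELECT
    f.payment_id,
    f.customer_id
FROM {{ ref('fct_payments') }} AS f
LEFT JOIN {{ ref('dim_customers') }} AS d
    ON f.customer_id = d.customer_id
WHERE d.customer_id IS NULL
```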

Slide 29

Data Integrity Checkpoints
● How to implement data integrity checkpoints?
○ Use data testing frameworks / tools to define validation criteria
■ Examples: dbt (data build tool) data tests, Great Expectations
○ Automate validation checks with data pipeline orchestration
■ Schedule validation checks after ingestion + transformation
■ Examples: Airflow, Dagster
@ongchinhwee

Slide 30

@ongchinhwee

Slide 31

The Limits of Incrementality (a.k.a. “how far back in history can we look?”) @ongchinhwee

Slide 32

Source: dbt discourse @ongchinhwee

Slide 33

Incremental Strategy for Late-arriving Data
● Is the table a fact or a dimension?
● When can we expect late data to arrive (in most cases)?
● Lookback “window” approach to handle late-arriving data
○ Set the maximum delay allowed to include late data in the incremental data stream
○ Performance vs Accuracy trade-off
@ongchinhwee
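
A minimal sketch of the lookback “window” in a dbt incremental model, assuming an updated_at column and the 3-day maximum accepted delay used in the example later in these slides (BigQuery-flavoured SQL, names illustrative):

```sql
-- Sketch: incremental load with a 3-day lookback window so rows that
-- arrive up to 3 days late are still picked up and merged.
{{ config(materialized='incremental', unique_key='payment_id') }}

SELECT payment_id, customer_id, amount, status, event_time, updated_at
FROM {{ source('payments', 'transactions') }}

{% if is_incremental() %}
WHERE updated_at >= (
    SELECT TIMESTAMP_SUB(MAX(updated_at), INTERVAL 3 DAY) FROM {{ this }}
)
{% endif %}
```

Widening the window scans more partitions (accuracy), narrowing it scans fewer (performance) but risks dropping anything later than the threshold.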

Slide 34

Case: Incremental Strategy for Late-arriving Facts
● Example:
○ Payment transactions stuck in an intermediate state for more than a day
○ Average traffic: ~100 billion transactions / day
○ Event resolution time: ~3 days, with outliers
@ongchinhwee

Slide 35

Case: Incremental Strategy for Late-arriving Facts
● Design Question to Consider
○ Accuracy vs Performance
■ Choose accuracy?
● Scan a wider date range / more date partitions on the source table to update the data in the destination table
■ Choose performance?
● Small possibility of losing data from late arrivals outside the lookback “window”
@ongchinhwee

Slide 36

Lookback “window” w/ maximum accepted delay of 3 days @ongchinhwee

Slide 37

“What if the dimension data is late?” @ongchinhwee

Slide 38

Case: Incremental Strategy for Late-arriving Dimensions
● Example:
○ Customer information for new customers arrives > 1 day later than transactions
○ 5000 customer records (out of 3 million customers) impacted
○ We need to join both sets of data for transactions monitoring
@ongchinhwee

Slide 39

Case: Incremental Strategy for Late-arriving Dimensions
● Design Questions to Consider
○ Slowly Changing Dimensions - which timestamp to use?
■ Use valid time (event timestamp)?
● Dimension data may arrive much later (~1 year?) in the system ⇒ longer lookback when scanning the dimension table
■ Use system time (system updated_at timestamp)?
● System time != real-world event time, e.g. late customer onboarding
@ongchinhwee

Slide 40

Case: Incremental Strategy for Late-arriving Dimensions
● Design Questions to Consider
○ How to handle late-arriving dimensions in data streams?
■ Custom incremental logic
● Design objective: performant and “close enough” data
● Event lookback “window” on early-arriving facts to handle incremental updates on dimensions
@ongchinhwee
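
A hedged sketch of such custom incremental logic, assuming stg_payments / dim_customers models and a 3-day event lookback (names and the threshold are illustrative): facts inside the window are re-processed on every run, so a dimension that arrives late back-fills the previously NULL attributes through the unique_key merge.

```sql
-- Sketch: re-process early-arriving facts within the event lookback
-- window so late-arriving dimension attributes get picked up later.
{{ config(materialized='incremental', unique_key='payment_id') }}

SELECT
    f.payment_id,
    f.event_time,
    f.amount,
    d.customer_segment          -- stays NULL until the dimension arrives
FROM {{ ref('stg_payments') }} AS f
LEFT JOIN {{ ref('dim_customers') }} AS d
    ON f.customer_id = d.customer_id

{% if is_incremental() %}
WHERE f.event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 3 DAY)
{% endif %}
```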

Slide 41

Lookback “window” w/ maximum accepted delay of 3 days for facts and late-arriving dimensions @ongchinhwee

Slide 42

What if we want our data to be eventually accurate in data warehouse / data lake? @ongchinhwee

Slide 43

What if we want our data to be eventually accurate in the data warehouse / data lake?
Do a periodic batch refresh to rebuild your tables!
@ongchinhwee

Slide 44

Key Takeaways
● Design proactive systems and modeling strategies for late-arriving data to ensure data quality in data platforms
● Manage accuracy vs performance trade-offs with lookback “window” thresholds and periodic batch refreshes
@ongchinhwee

Slide 45

Reach out to me!
ongchinhwee
ongchinhwee.me
@ongchinhwee
hweecat
https://ongchinhwee.me
And get these slides here: https://bit.ly/mdsfest-late-arriving-data
@ongchinhwee