Functional Data Engineering
- A Blueprint for adopting functional principles in data pipelines
Ananth Packkildurai
Slide 2
Slack: Data Engineer
Zendesk: Principal Data Engineer
Creator: Schemata - Data Contract Platform
Author: Data Engineering Weekly
Slide 3
Key Principles of Functional Data Engineering
1. Reproducibility
2. Re-Computability
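The two principles can be illustrated with a minimal Python sketch. It assumes a hypothetical in-memory "warehouse" keyed by partition (`ds`); the names `build_partition` and `run_task` are illustrative and not from the talk. The transformation is a pure function of its inputs (reproducibility), and the task overwrites its whole partition so it can safely be rerun (re-computability).

```python
from typing import Dict, List

# Hypothetical in-memory "warehouse": partition key (ds) -> rows.
Warehouse = Dict[str, List[dict]]


def build_partition(raw_events: List[dict], ds: str) -> List[dict]:
    """Pure function: the output depends only on the arguments, never on wall-clock time."""
    return sorted(
        ({"user_id": e["user_id"], "ds": ds} for e in raw_events if e["ds"] == ds),
        key=lambda row: row["user_id"],
    )


def run_task(warehouse: Warehouse, raw_events: List[dict], ds: str) -> None:
    """Overwrite the whole partition, so a rerun replaces rows instead of appending."""
    warehouse[ds] = build_partition(raw_events, ds)


raw = [
    {"user_id": 2, "ds": "2022-12-20"},
    {"user_id": 1, "ds": "2022-12-20"},
    {"user_id": 3, "ds": "2022-12-21"},
]
warehouse: Warehouse = {}
run_task(warehouse, raw, "2022-12-20")
first_run = list(warehouse["2022-12-20"])
run_task(warehouse, raw, "2022-12-20")  # re-compute the same partition
```

Running the task twice for the same `ds` leaves the partition identical to the first run, which is the property that makes backfills safe.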
Slide 4
The Modern Data Cloud = LakeHouse & Warehouse
State of the Data 2023
Separation of storage and compute
Unlimited scale data repository
ACID transaction and mutation support
Slide 5
Schema Classification
Slide 6
DateTime Partition Table Design
Warehouse
CREATE TABLE dw.user (
  user_id BIGINT, user_name STRING, created_at DATE
) PARTITIONED BY (ds STRING)
-- ds = date stamp of the snapshot
LakeHouse
s3://dw/user/2022-12-20/
s3://dw/user/2022-12-21/
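As a rough sketch of how a pipeline might address these partitions, the Python below maps a date to its storage prefix. The base path mirrors the example paths on the slide; the helper names `partition_path` and `partition_paths` are hypothetical.

```python
from datetime import date, timedelta
from typing import List

# Base prefix taken from the example paths on the slide.
BASE_PATH = "s3://dw/user"


def partition_path(ds: date, base_path: str = BASE_PATH) -> str:
    """Return the storage prefix that holds the snapshot for one ds."""
    return f"{base_path}/{ds.isoformat()}/"


def partition_paths(start: date, end: date, base_path: str = BASE_PATH) -> List[str]:
    """Return one prefix per day from start to end, both inclusive."""
    days = (end - start).days
    return [partition_path(start + timedelta(days=i), base_path) for i in range(days + 1)]
```

Because every run targets exactly one prefix, a backfill is just a loop over `partition_paths(start, end)`.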
Slide 7
Entity Modeling
1. Incremental Snapshot
2. Full Snapshot
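The difference between the two snapshot styles can be sketched in Python as follows. This is an illustration under simple assumptions: entity state is a dict keyed by `user_id`, and the function names are hypothetical. Deletions are deliberately ignored here; a real incremental design would need tombstones or a similar mechanism.

```python
from typing import Dict, List

Row = Dict[str, str]
Snapshot = Dict[int, Row]  # user_id -> row


def full_snapshot(current_state: Snapshot) -> Snapshot:
    """Full snapshot: every partition stores a complete copy of the entity."""
    return {key: dict(row) for key, row in current_state.items()}


def incremental_snapshot(previous: Snapshot, current: Snapshot) -> Snapshot:
    """Incremental snapshot: store only rows that are new or changed since `previous`."""
    return {key: dict(row) for key, row in current.items() if previous.get(key) != row}


def rebuild_state(increments_in_order: List[Snapshot]) -> Snapshot:
    """Replay incremental partitions, oldest first, to recover the latest state."""
    state: Snapshot = {}
    for increment in increments_in_order:
        state.update(increment)
    return state


day1 = {1: {"user_name": "ada"}, 2: {"user_name": "bob"}}
day2 = {1: {"user_name": "ada"}, 2: {"user_name": "robert"}, 3: {"user_name": "cy"}}

inc1 = incremental_snapshot({}, day1)
inc2 = incremental_snapshot(day1, day2)
latest = rebuild_state([inc1, inc2])
```

A full snapshot can be read from a single partition, while an incremental snapshot trades smaller partitions for a replay step at read time.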
Slide 8
Entity Modeling
CREATE OR REPLACE VIEW dw.user_latest AS
SELECT
  user_id,
  user_name,
  created_at,
  ds
FROM dw.user
WHERE ds = <current DateTime partition>;
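One way to fill the `<current DateTime partition>` placeholder is to render the view definition from a small Python helper each time a new partition lands. This is a sketch, not the speaker's implementation: `latest_view_sql` is a hypothetical name, and how the statement gets executed depends on the warehouse client in use.

```python
from datetime import date

VIEW_TEMPLATE = """CREATE OR REPLACE VIEW dw.user_latest AS
SELECT user_id, user_name, created_at, ds
FROM dw.user
WHERE ds = '{ds}';"""


def latest_view_sql(ds: str) -> str:
    """Render the view DDL for one partition; raises ValueError if ds is not a date."""
    # Parsing and re-formatting ensures only a normalized YYYY-MM-DD value is interpolated.
    normalized = date.fromisoformat(ds).isoformat()
    return VIEW_TEMPLATE.format(ds=normalized)
```

Consumers query `dw.user_latest` and never need to know which `ds` is current; repointing the view is the only step that changes.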
Slide 9
Event Modeling
Slide 10
Key Challenges
1. Late Arriving Data
2. Data Deletion
Slide 11
Apply Window Functions
[Figure: Tumbling Window and Sliding Window panels, showing Hour T1, Hour T2, and Hour T3 data feeding the Hour T1, Hour T2, and Hour T3 pipelines]
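The two window types named on this slide can be sketched in Python as half-open time ranges. The function names and the choice of a one-hour slide are assumptions for illustration; the slide itself does not specify window sizes beyond hourly data.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Window = Tuple[datetime, datetime]  # half-open range [start, end)


def tumbling_windows(start: datetime, end: datetime, size: timedelta) -> List[Window]:
    """Fixed-size, non-overlapping windows: each hour belongs to exactly one window."""
    windows = []
    current = start
    while current + size <= end:
        windows.append((current, current + size))
        current += size
    return windows


def sliding_windows(
    start: datetime, end: datetime, size: timedelta, slide: timedelta
) -> List[Window]:
    """Fixed-size windows that advance by `slide`; they overlap when slide < size."""
    windows = []
    current = start
    while current + size <= end:
        windows.append((current, current + size))
        current += slide
    return windows


t0 = datetime(2022, 12, 20, 0)
t3 = t0 + timedelta(hours=3)
tumbling = tumbling_windows(t0, t3, timedelta(hours=1))
sliding = sliding_windows(t0, t3, timedelta(hours=2), timedelta(hours=1))
```

With a sliding window, a pipeline for a given hour also re-reads the preceding hour, which gives late records a second chance to be picked up.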
Slide 12
Apply Watermark
[Figure: Hour T1 data, the window time, and the point at which the Hour T1 pipeline starts]
Adopt Reconciliation
[Figure: Hour T1, Hour T2, and Hour T3 pipelines, followed by a reconciliation pipeline]
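One possible reading of these two techniques, sketched in Python: the hourly pipeline waits for a watermark delay after the window closes before it starts, and a later reconciliation pipeline recomputes the same windows once stragglers have arrived. The 15-minute delay, the event fields, and all function names are assumptions made for this illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List

HOUR = timedelta(hours=1)
WATERMARK_DELAY = timedelta(minutes=15)  # hypothetical allowed lateness


def hour_of(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def pipeline_start_time(window_start: datetime) -> datetime:
    """The hourly pipeline starts only after the window closes plus the watermark delay."""
    return window_start + HOUR + WATERMARK_DELAY


def run_hourly(events: List[dict], window_start: datetime, now: datetime) -> int:
    """Count events in one hourly window, using only records that had arrived by `now`."""
    return sum(
        1
        for e in events
        if hour_of(e["event_time"]) == window_start and e["arrival_time"] <= now
    )


def reconcile(events: List[dict], window_starts: List[datetime], now: datetime) -> Dict[datetime, int]:
    """Recompute every window later, overwriting earlier results with complete data."""
    return {w: run_hourly(events, w, now) for w in window_starts}


t1 = datetime(2022, 12, 20, 1)
events = [
    {"event_time": t1 + timedelta(minutes=10), "arrival_time": t1 + timedelta(minutes=11)},
    {"event_time": t1 + timedelta(minutes=59), "arrival_time": t1 + timedelta(minutes=65)},
    {"event_time": t1 + timedelta(minutes=30), "arrival_time": t1 + timedelta(hours=3)},
]
hourly_count = run_hourly(events, t1, pipeline_start_time(t1))
reconciled = reconcile(events, [t1], t1 + timedelta(days=1))
```

The watermark absorbs records that are a few minutes late, while reconciliation catches the ones that arrive hours later; because each run overwrites its window, the reconciled result simply replaces the earlier one.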