Slide 1

Slide 1 text

Functional Data Engineering - A Blueprint for adopting functional principles in data pipelines
Ananth Packkildurai

Slide 2

Slide 2 text

Slack: Data Engineer
Zendesk: Principal Data Engineer
Creator: Schemata - Data Contract Platform
Author: Data Engineering Weekly

Slide 3

Slide 3 text

Key Principles of Functional Data Engineering
1. Reproducibility
2. Re-Computability
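One way to read the two principles above is that a pipeline task should behave like a pure function: the same input and the same partition date always give the same output, so any partition can be recomputed later. The sketch below is only an illustration under that assumption; the function name `daily_signup_counts` and the `country` field are hypothetical and do not come from the slides.

```python
from datetime import date

def daily_signup_counts(rows, ds):
    """Pure transform: the result depends only on (rows, ds), never on the wall clock."""
    counts = {}
    for row in rows:
        if row["created_at"] == ds:
            counts[row["country"]] = counts.get(row["country"], 0) + 1
    return counts

# Hypothetical input rows; "country" is an illustrative field only.
rows = [
    {"user_id": 1, "country": "US", "created_at": date(2022, 12, 20)},
    {"user_id": 2, "country": "IN", "created_at": date(2022, 12, 20)},
    {"user_id": 3, "country": "US", "created_at": date(2022, 12, 21)},
]

# Running the task twice for the same ds yields identical results.
first_run = daily_signup_counts(rows, date(2022, 12, 20))
second_run = daily_signup_counts(rows, date(2022, 12, 20))
```

Because the partition date is passed in explicitly rather than read from the system clock, a rerun months later recomputes the same partition.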

Slide 4

Slide 4 text

The Modern Data Cloud = LakeHouse & Warehouse
State of the Data 2023
- Separation of storage and compute
- Unlimited scale data repository
- ACID transaction and mutation support

Slide 5

Slide 5 text

Schema Classification

Slide 6

Slide 6 text

DateTime Partition Table Design

Warehouse

    CREATE TABLE dw.user (
        user_id BIGINT,
        user_name STRING,
        created_at DATE
    )
    PARTITIONED BY (ds STRING);  -- ds = date timestamp of the snapshot

LakeHouse

    s3://dw/user/2022-12-20/
    s3://dw/user/2022-12-21/
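The date-partitioned layout pairs naturally with writes that replace a whole partition instead of appending to it, which keeps reruns idempotent. The following is a minimal sketch of that idea, assuming an in-memory dict stands in for object storage; `partition_path` and `overwrite_partition` are hypothetical helpers, not part of any warehouse API.

```python
from datetime import date

def partition_path(table, ds, root="s3://dw"):
    """Build the partition location for a table and snapshot date."""
    return f"{root}/{table}/{ds.isoformat()}/"

def overwrite_partition(store, table, ds, rows):
    """Replace the whole partition; never append, so a rerun cannot duplicate rows."""
    store[partition_path(table, ds)] = list(rows)
    return store

store = {}  # stand-in for object storage, keyed by partition path
overwrite_partition(store, "user", date(2022, 12, 20), [{"user_id": 1}])
overwrite_partition(store, "user", date(2022, 12, 20), [{"user_id": 1}])  # rerun of the same day
overwrite_partition(store, "user", date(2022, 12, 21), [{"user_id": 1}, {"user_id": 2}])
```

After the rerun, the 2022-12-20 partition still holds a single copy of its rows.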

Slide 7

Slide 7 text

Entity Modeling
1. Incremental Snapshot
2. Full Snapshot
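The two snapshot styles can be related to each other: a full snapshot for a given day can be derived from the previous full snapshot plus that day's incremental rows. The sketch below assumes the latest row per key wins and that `user_id` is the entity key; `next_full_snapshot` is a hypothetical name used only for illustration.

```python
def next_full_snapshot(previous_snapshot, incremental_rows, key="user_id"):
    """Merge incremental rows over the previous full snapshot; the newest row per key wins."""
    merged = {row[key]: row for row in previous_snapshot}
    for row in incremental_rows:
        merged[row[key]] = row
    return sorted(merged.values(), key=lambda row: row[key])

full_2022_12_20 = [
    {"user_id": 1, "user_name": "ada"},
    {"user_id": 2, "user_name": "bob"},
]
incremental_2022_12_21 = [
    {"user_id": 2, "user_name": "robert"},  # changed entity
    {"user_id": 3, "user_name": "cy"},      # new entity
]

full_2022_12_21 = next_full_snapshot(full_2022_12_20, incremental_2022_12_21)
```

The function returns a new list and leaves both inputs untouched, so the earlier snapshot remains available for recomputation.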

Slide 8

Slide 8 text

Entity Modeling

    CREATE OR REPLACE VIEW dw.user_latest AS
    SELECT user_id, user_name, created_at, ds
    FROM dw.user
    WHERE ds = <current DateTime partition>;
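The view leaves open how the current DateTime partition is chosen. One possible approach, sketched here as an assumption rather than the author's method, is to pick the greatest `ds` value, which works when `ds` is an ISO-formatted date string because such strings sort chronologically. `latest_partition` and `user_latest` are hypothetical helpers.

```python
def latest_partition(partitions):
    """ISO-8601 date strings sort chronologically, so max() is the newest partition."""
    return max(partitions)

def user_latest(table_partitions):
    """Return the rows of the newest partition, tagged with its ds value."""
    ds = latest_partition(table_partitions.keys())
    return [dict(row, ds=ds) for row in table_partitions[ds]]

# Stand-in for dw.user: partition value -> rows in that partition.
dw_user = {
    "2022-12-20": [{"user_id": 1, "user_name": "ada"}],
    "2022-12-21": [{"user_id": 1, "user_name": "ada"}, {"user_id": 2, "user_name": "bob"}],
}

latest_rows = user_latest(dw_user)
```

Consumers read through the view, so publishing a new partition changes what they see without any update to existing data.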

Slide 9

Slide 9 text

Event Modeling

Slide 10

Slide 10 text

Key Challenges
1. Late Arriving Data
2. Data Deletion

Slide 11

Slide 11 text

Apply Window Functions
[Diagram: Hour T1, T2, and T3 data feeding the Hour T1, T2, and T3 pipelines, shown for two window types]
- Tumbling Window
- Sliding Window
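The difference between the two window types can be sketched as window assignment: a tumbling window places each event in exactly one window, while a sliding window places it in every overlapping window. The functions below are an illustrative sketch that aligns windows to the Unix epoch; the names and the alignment choice are assumptions, not taken from the slides.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def tumbling_window(event_time, size):
    """Return the single [start, end) window that contains event_time."""
    start = EPOCH + ((event_time - EPOCH) // size) * size
    return (start, start + size)

def sliding_windows(event_time, size, slide):
    """Return every [start, end) window of length size, stepping by slide, that contains event_time."""
    start = EPOCH + ((event_time - EPOCH) // slide) * slide
    windows = []
    while start + size > event_time:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)

event = datetime(2022, 12, 20, 10, 30)
one_hour = timedelta(hours=1)

tumbling = tumbling_window(event, one_hour)
sliding = sliding_windows(event, size=timedelta(hours=2), slide=one_hour)
```

With a two-hour window sliding every hour, the 10:30 event belongs to both the 09:00 and the 10:00 windows, so late data for one hour can still be picked up by the next pipeline run.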

Slide 12

Slide 12 text

Apply Watermark
[Diagram: Hour T1 Data, Window Time, Hour T1 pipeline starts]

Adopt Reconciliation
[Diagram: Hour T1 pipeline, Hour T2 pipeline, Hour T3 pipeline, Reconciliation pipeline]
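Both techniques can be sketched in a few lines. The watermark check below delays the hourly pipeline until event time has advanced past the window end by an allowed lateness, and the reconciliation step lists the partitions whose processed row counts no longer match the source. This is a simplified sketch under those assumptions; the function names, the 15-minute lateness, and the count-based comparison are all hypothetical.

```python
from datetime import datetime, timedelta

def window_is_ready(window_end, max_event_time_seen, allowed_lateness):
    """Watermark = newest event time seen minus allowed lateness; start once it passes window_end."""
    watermark = max_event_time_seen - allowed_lateness
    return watermark >= window_end

def partitions_to_reconcile(processed_counts, source_counts):
    """Return partitions whose processed row count differs from the source, or is missing."""
    return sorted(ds for ds, n in source_counts.items() if processed_counts.get(ds) != n)

window_end = datetime(2022, 12, 20, 11, 0)
lateness = timedelta(minutes=15)
too_early = window_is_ready(window_end, datetime(2022, 12, 20, 11, 10), lateness)
ready = window_is_ready(window_end, datetime(2022, 12, 20, 11, 20), lateness)

# Hypothetical hourly partitions and row counts.
processed = {"2022-12-20T10": 100, "2022-12-20T11": 80}
source = {"2022-12-20T10": 100, "2022-12-20T11": 95, "2022-12-20T12": 10}
stale = partitions_to_reconcile(processed, source)
```

The reconciliation pipeline would then recompute only the partitions in `stale`, which is safe when each hourly task overwrites its partition.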

Slide 13

Slide 13 text

Choose your Confidence Window of Correctness

Slide 14

Slide 14 text

Data Deletion
1. Reprocessing
2. Deletion Audit Log
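The two options can work together: deletions are recorded in an audit log, and every partition is rebuilt with the logged keys filtered out. The sketch below assumes the log stores one entry per deleted `user_id`; `reprocess_partition` and `reprocess_all` are hypothetical helpers and not a description of any specific platform.

```python
def reprocess_partition(rows, deletion_log, key="user_id"):
    """Rebuild one partition without the rows whose key appears in the deletion audit log."""
    deleted = {entry[key] for entry in deletion_log}
    return [row for row in rows if row[key] not in deleted]

def reprocess_all(partitions, deletion_log, key="user_id"):
    """Return new partitions; the inputs are left unchanged so the step stays re-runnable."""
    return {ds: reprocess_partition(rows, deletion_log, key) for ds, rows in partitions.items()}

partitions = {
    "2022-12-20": [{"user_id": 1}, {"user_id": 2}],
    "2022-12-21": [{"user_id": 1}, {"user_id": 2}, {"user_id": 3}],
}
# Hypothetical audit log entry for one deletion request.
deletion_log = [{"user_id": 2, "requested_at": "2022-12-22"}]

cleaned = reprocess_all(partitions, deletion_log)
```

Applying the same log again produces the same result, so the deletion step can be rerun safely after a failure.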

Slide 15

Slide 15 text

https://schemata.app
https://www.linkedin.com/in/ananthdurai
[email protected]