Functional Data Engineering - A Blueprint for adopting functional principles in data pipeline

Maxime Beauchemin wrote an influential article, "Functional Data Engineering — a modern paradigm for batch data processing." It is a significant step toward bringing software engineering concepts into data engineering. The paradigm builds on the advances made since Hadoop.

Cloud object storage such as S3 makes storage a commodity.

Storage and compute are separated, so each can scale independently. Yes, human life is too short for scaling storage and compute simultaneously.

Functional data engineering follows two key principles.

Reproducibility - Every task in the data pipeline should be deterministic and idempotent.

Re-Computability - Business logic changes over time, and bugs happen. The data pipeline should be able to recompute the desired state.
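The two principles above can be illustrated with a minimal sketch. It assumes a local directory stands in for the warehouse and that each run overwrites one whole `ds` partition; the names `transform`, `run_task`, and the `ds=<date>` directory layout are hypothetical and chosen only for illustration.

```python
import json
import shutil
import tempfile
from pathlib import Path


def transform(rows):
    """Pure function: the output depends only on the input rows."""
    cleaned = (
        {"user_id": r["user_id"], "user_name": r["user_name"].strip().lower()}
        for r in rows
    )
    # Sorting makes the output deterministic regardless of input order.
    return sorted(cleaned, key=lambda r: r["user_id"])


def run_task(rows, warehouse: Path, ds: str) -> Path:
    """Overwrite the whole `ds` partition, so a rerun never appends or duplicates."""
    partition = warehouse / f"ds={ds}"
    staging = warehouse / f".staging-ds={ds}"
    if staging.exists():
        shutil.rmtree(staging)
    staging.mkdir(parents=True)
    (staging / "part-0.json").write_text(json.dumps(transform(rows)))
    # Swap the freshly computed partition in place of the old one.
    if partition.exists():
        shutil.rmtree(partition)
    staging.rename(partition)
    return partition


def demo():
    rows = [
        {"user_id": 2, "user_name": " Bob "},
        {"user_id": 1, "user_name": "Alice"},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        warehouse = Path(tmp)
        first = (run_task(rows, warehouse, "2022-12-20") / "part-0.json").read_text()
        second = (run_task(rows, warehouse, "2022-12-20") / "part-0.json").read_text()
        partitions = sorted(p.name for p in warehouse.iterdir())
    return first, second, partitions
```

Running the task twice for the same `ds` leaves exactly one partition with identical content, which is what makes a rerun safe. Recomputing after a logic change or a bug fix is then the same operation: change `transform` and rerun the affected partitions.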

Ananth Packkildurai

January 22, 2023

Transcript

  1. Functional Data Engineering - A Blueprint for adopting functional principles in data pipeline
    Ananth Packkildurai
  2. Slack: Data Engineer; Zendesk: Principal Data Engineer; Creator: Schemata - Data Contract Platform; Author: Data Engineering Weekly
  3. Key Principles of Functional Data Engineering: (1) Reproducibility, (2) Re-Computability

  4. The Modern Data Cloud = LakeHouse & Warehouse
    State of the Data 2023: separation of storage and compute; unlimited-scale data repository; ACID transaction and mutation support
  5. Schema Classification

  6. DateTime Partition Table Design
    Warehouse:
        CREATE TABLE dw.user (
            user_id BIGINT,
            user_name STRING,
            created_at DATE
        )
        PARTITIONED BY (ds STRING);  -- ds = date timestamp of the snapshot
    LakeHouse:
        s3://dw/user/2022-12-20/<all users data at the time of snapshot>
        s3://dw/user/2022-12-21/<all users data at the time of snapshot>
  7. Entity Modeling: (1) Incremental Snapshot, (2) Full Snapshot

  8. Entity Modeling
        CREATE OR REPLACE VIEW dw.user_latest AS
        SELECT user_id, user_name, created_at, ds
        FROM dw.user
        WHERE ds = <current DateTime partition>;
  9. Event Modeling

  10. Key Challenges: (1) Late Arriving Data, (2) Data Deletion

  11. Apply Window Functions
    [Diagram: Tumbling Window and Sliding Window over Hour T1, Hour T2, and Hour T3 data, processed by the Hour T1, Hour T2, and Hour T3 pipelines]
  12. Apply Watermark; Adopt Reconciliation
    [Diagram: Hour T1 data, the window time, and the point where the Hour T1 pipeline starts; the Hour T1, Hour T2, and Hour T3 pipelines followed by a reconciliation pipeline]
  13. Choose your Confidence Window of Correctness

  14. Data Deletion: (1) Reprocessing, (2) Deletion Audit Log

  15. https://schemata.app https://www.linkedin.com/in/ananthdurai [email protected]
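
The late-arriving-data slides (10 to 13) name tumbling windows, a watermark, and a reconciliation pipeline without showing code. The sketch below is one possible reading, not the speaker's implementation: it assumes each event carries an `event_time` and an `arrival_time`, that the hourly pipeline starts a fixed watermark delay after its window closes, and that a later reconciliation run recomputes the same window. All function and field names are hypothetical.

```python
from datetime import datetime, timedelta

HOUR = timedelta(hours=1)


def window_start(ts: datetime) -> datetime:
    """Tumbling window: each event belongs to exactly one hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def count_window(events, window: datetime, as_of: datetime) -> int:
    """Count the events of `window` that had arrived by `as_of`."""
    return sum(
        1
        for e in events
        if window_start(e["event_time"]) == window and e["arrival_time"] <= as_of
    )


def run_hour(events, window: datetime, watermark: timedelta) -> int:
    """Hourly pipeline: starts `watermark` after the window closes."""
    return count_window(events, window, as_of=window + HOUR + watermark)


def reconcile(events, window: datetime, as_of: datetime) -> int:
    """Reconciliation pipeline: recompute the same window later to pick up stragglers."""
    return count_window(events, window, as_of=as_of)


def demo():
    t1 = datetime(2023, 1, 22, 10)
    events = [
        # Arrives on time.
        {"event_time": t1 + timedelta(minutes=5), "arrival_time": t1 + timedelta(minutes=6)},
        # Arrives after the hour closes but inside the watermark.
        {"event_time": t1 + timedelta(minutes=59), "arrival_time": t1 + timedelta(minutes=70)},
        # Arrives hours late; only reconciliation sees it.
        {"event_time": t1 + timedelta(minutes=30), "arrival_time": t1 + timedelta(hours=4)},
    ]
    hourly = run_hour(events, t1, watermark=timedelta(minutes=15))
    reconciled = reconcile(events, t1, as_of=t1 + timedelta(days=1))
    return hourly, reconciled
```

In this sketch the watermark length and the reconciliation delay together determine how long a window can remain incomplete, which is one way to read the "confidence window of correctness" on slide 13.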