Functional Data Engineering - A Blueprint for adopting functional principles in data pipeline

Maxime Beauchemin wrote an influential article, "Functional Data Engineering — a modern paradigm for batch data processing". It is a significant step toward bringing software engineering concepts into data engineering. The principles build on two advances made since Hadoop.

Cloud object storage such as S3 makes storage a commodity.

Storage and compute are separated, so each can scale independently. Yes, human life is too short for scaling storage and compute simultaneously.

Functional data engineering follows two key principles.

Reproducibility - Every task in the data pipeline should be deterministic and idempotent.

Re-Computability - Business logic changes over time, and bugs happen. The data pipeline should be able to recompute the desired state.
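The two principles can be illustrated with a minimal Python sketch. It assumes a hypothetical in-memory `store` dictionary standing in for partitioned storage, and hypothetical `transform` and `run_task` functions; none of these names come from the talk.

```python
from typing import Dict, List

# Hypothetical stand-in for partitioned storage: partition key (ds) -> rows.
Store = Dict[str, List[dict]]


def transform(rows: List[dict]) -> List[dict]:
    """Pure function: the output depends only on the input rows."""
    cleaned = (
        {"user_id": r["user_id"], "user_name": r["user_name"].strip().lower()}
        for r in rows
    )
    return sorted(cleaned, key=lambda r: r["user_id"])


def run_task(store: Store, ds: str, source_rows: List[dict]) -> None:
    """Idempotent task: overwrite the whole partition instead of appending."""
    store[ds] = transform(source_rows)


store: Store = {}
source = [
    {"user_id": 2, "user_name": " Bob "},
    {"user_id": 1, "user_name": "Alice"},
]

run_task(store, "2022-12-20", source)
first_run = list(store["2022-12-20"])
run_task(store, "2022-12-20", source)  # re-run after a bug fix or a retry
assert store["2022-12-20"] == first_run  # same input, same partition content
```

Because the task overwrites a partition rather than mutating it in place, recomputing a past state means re-running the same task for the affected partition keys.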

Ananth Packkildurai

January 22, 2023

Transcript

  1. Functional Data Engineering
    - A Blueprint for adopting functional principles in data pipeline
    Ananth Packkildurai

  2. Slack: Data Engineer
    Zendesk: Principal Data Engineer
    Creator: Schemata - Data Contract Platform
    Author: Data Engineering Weekly

  3. Key Principles of Functional Data Engineering
    1. Reproducibility
    2. Re-Computability

  4. The Modern Data Cloud = LakeHouse & Warehouse
    State of the Data 2023
    Separation of storage and compute
    Unlimited scale data repository
    ACID transaction and mutation support

  5. Schema Classification

  6. DateTime Partition Table Design
    Warehouse / LakeHouse
    CREATE TABLE dw.user (
      user_id BIGINT, user_name STRING, created_at DATE
    ) PARTITIONED BY (ds STRING)
    -- ds = date stamp of the snapshot
    s3://dw/user/2022-12-20/<snapshot>
    s3://dw/user/2022-12-21/<snapshot>
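A date-partitioned layout makes a backfill a matter of listing the partitions to overwrite. The sketch below is only illustrative: `partition_path` and `backfill_partitions` are hypothetical helpers, and the path layout is assumed from the slide's `s3://dw/user/<date>/` examples.

```python
from datetime import date, timedelta
from typing import List


def partition_path(table: str, ds: date) -> str:
    """Build the location of one daily snapshot (layout assumed from the slide)."""
    return f"s3://dw/{table}/{ds.isoformat()}/"


def backfill_partitions(table: str, start: date, end: date) -> List[str]:
    """List every daily partition in [start, end] that a backfill would overwrite."""
    days = (end - start).days
    return [partition_path(table, start + timedelta(days=i)) for i in range(days + 1)]


paths = backfill_partitions("user", date(2022, 12, 20), date(2022, 12, 21))
print(paths)
# ['s3://dw/user/2022-12-20/', 's3://dw/user/2022-12-21/']
```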

  7. Entity Modeling
    1. Incremental Snapshot
    2. Full Snapshot
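One way to relate the two snapshot styles is to derive each full snapshot from the previous full snapshot plus that day's incremental rows. The slide does not spell this out, so the sketch below is an assumption; `apply_increment` and the sample rows are hypothetical.

```python
from typing import Dict, List


def apply_increment(previous: List[dict], changes: List[dict]) -> List[dict]:
    """Build the next full snapshot from the previous one plus changed rows.

    Rows are keyed by user_id; a changed row replaces the older version.
    The inputs are not modified, so earlier snapshots stay immutable.
    """
    merged: Dict[int, dict] = {row["user_id"]: row for row in previous}
    for row in changes:
        merged[row["user_id"]] = row
    return [merged[key] for key in sorted(merged)]


full_2022_12_20 = [
    {"user_id": 1, "user_name": "alice"},
    {"user_id": 2, "user_name": "bob"},
]
incremental_2022_12_21 = [
    {"user_id": 2, "user_name": "robert"},  # updated row
    {"user_id": 3, "user_name": "carol"},   # new row
]
full_2022_12_21 = apply_increment(full_2022_12_20, incremental_2022_12_21)
```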

  8. Entity Modeling
    CREATE OR REPLACE VIEW dw.user_latest AS
    SELECT
      user_id,
      user_name,
      created_at,
      ds
    FROM dw.user
    WHERE ds = <current DateTime partition>;
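The "latest" view can be tried locally with Python's built-in `sqlite3` module. This is a sketch under two assumptions that are not in the slide: the table is named `dw_user` because SQLite has no `dw` schema here, and the current partition is taken to be the greatest `ds` present in the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dw_user (user_id INTEGER, user_name TEXT, created_at TEXT, ds TEXT)"
)
conn.executemany(
    "INSERT INTO dw_user VALUES (?, ?, ?, ?)",
    [
        (1, "alice", "2022-12-01", "2022-12-20"),  # snapshot of 2022-12-20
        (1, "alice", "2022-12-01", "2022-12-21"),  # snapshot of 2022-12-21
        (2, "bob", "2022-12-21", "2022-12-21"),
    ],
)
# Assumption: the current partition is the greatest ds in the table.
conn.execute(
    """
    CREATE VIEW dw_user_latest AS
    SELECT user_id, user_name, created_at, ds
    FROM dw_user
    WHERE ds = (SELECT MAX(ds) FROM dw_user)
    """
)
latest = conn.execute(
    "SELECT user_id, ds FROM dw_user_latest ORDER BY user_id"
).fetchall()
```

Consumers query the view and always see one consistent snapshot, while older partitions remain available for point-in-time queries.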

  9. Event Modeling

  10. Key Challenges
    1. Late Arriving Data
    2. Data Deletion

  11. Apply Window Functions
    [Diagram: hourly data (Hour T1, T2, T3) feeding hourly pipelines (Hour T1, T2, T3 Pipeline), shown for a Tumbling Window and a Sliding Window]
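The difference between the two window types can be sketched as window-assignment functions. The function names, the one-hour tumbling size, and the two-hour sliding size are assumptions chosen for illustration, not values from the talk.

```python
from typing import List, Tuple

HOUR = 3600  # seconds


def tumbling_window(ts: int, size: int = HOUR) -> Tuple[int, int]:
    """Return the single [start, end) window that contains ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(
    ts: int, size: int = 2 * HOUR, slide: int = HOUR
) -> List[Tuple[int, int]]:
    """Return every [start, end) window of length size, stepped by slide, containing ts."""
    last_start = ts - ts % slide
    # A window contains ts when start <= ts < start + size, i.e. start > ts - size.
    starts = range(last_start, ts - size, -slide)
    return [(s, s + size) for s in sorted(starts)]


event_ts = 5400  # 01:30 after the epoch
print(tumbling_window(event_ts))   # (3600, 7200)
print(sliding_windows(event_ts))   # [(0, 7200), (3600, 10800)]
```

With a tumbling window each event lands in exactly one hourly pipeline run; with a sliding window the same event contributes to every overlapping window.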

  12. Apply Watermark
    [Diagram: Hour T1 Data, Window Time, Hour T1 pipeline starts]
    Adopt Reconciliation
    [Diagram: Hour T1, Hour T2, and Hour T3 pipelines, followed by a Reconciliation pipeline]
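A rough sketch of how a watermark and a reconciliation pipeline could work together, assuming each event carries an `event_time` and an `arrival_time` in seconds. The function names and the ten-minute allowed lateness are hypothetical.

```python
from typing import List, Tuple

HOUR = 3600  # seconds


def hourly_run(
    events: List[dict], window_start: int, allowed_lateness: int
) -> Tuple[List[dict], List[dict]]:
    """Split one hour's events into those seen by the hourly run and those that arrive too late.

    The hourly pipeline is assumed to start at window end + allowed_lateness (the watermark).
    """
    window_end = window_start + HOUR
    cutoff = window_end + allowed_lateness
    in_window = [e for e in events if window_start <= e["event_time"] < window_end]
    on_time = [e for e in in_window if e["arrival_time"] <= cutoff]
    late = [e for e in in_window if e["arrival_time"] > cutoff]
    return on_time, late


def reconcile(events: List[dict], window_start: int) -> List[dict]:
    """Reconciliation run: recompute the window from every event received so far."""
    window_end = window_start + HOUR
    return [e for e in events if window_start <= e["event_time"] < window_end]


events = [
    {"event_time": 100, "arrival_time": 120},
    {"event_time": 3500, "arrival_time": 3900},  # late, but before the watermark
    {"event_time": 3599, "arrival_time": 9000},  # arrives after the hourly run
    {"event_time": 4000, "arrival_time": 4001},  # belongs to the next hour
]
on_time, late = hourly_run(events, window_start=0, allowed_lateness=600)
reconciled = reconcile(events, window_start=0)
```

The hourly run gives a fast, mostly complete answer; the reconciliation run later overwrites the same partition with the complete one.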

  13. Choose your Confidence Window of Correctness

  14. Data Deletion
    1. Reprocessing
    2. Deletion Audit Log
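The slide lists reprocessing and a deletion audit log as separate options. The sketch below combines them as one possible reading: partitions are recomputed while filtering out every `user_id` recorded in a hypothetical audit log. The data shapes and the name `reprocess_with_deletions` are assumptions.

```python
from typing import Dict, List, Set


def reprocess_with_deletions(
    partitions: Dict[str, List[dict]], audit_log: List[dict]
) -> Dict[str, List[dict]]:
    """Rebuild every partition without rows for users listed in the audit log.

    Returns new partitions instead of mutating the input, so the run is repeatable.
    """
    deleted: Set[int] = {entry["user_id"] for entry in audit_log}
    return {
        ds: [row for row in rows if row["user_id"] not in deleted]
        for ds, rows in partitions.items()
    }


partitions = {
    "2022-12-20": [{"user_id": 1}, {"user_id": 2}],
    "2022-12-21": [{"user_id": 1}, {"user_id": 2}, {"user_id": 3}],
}
audit_log = [{"user_id": 2, "requested_at": "2022-12-22"}]

cleaned = reprocess_with_deletions(partitions, audit_log)
# Running it again on its own output changes nothing: the step is idempotent.
assert reprocess_with_deletions(cleaned, audit_log) == cleaned
```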

  15. https://schemata.app
    https://www.linkedin.com/in/ananthdurai
    [email protected]
