Data Lakehouse Architecture Evolution & Future

hueiyuan su
December 09, 2023

In this talk, I start by explaining the differences between databases, data lakes, and data warehouses, along with their respective concepts, benefits, and drawbacks, so that attendees can quickly pick up the basics. Next, I introduce the architecture and importance of the lakehouse. I then present demo code (e.g. Python and Delta Lake) that shows the lakehouse operating principle, so that attendees can understand it in depth and picture how it works. Lastly, I describe a use case from my past production experience to show how it works in the real world.

Beyond lakehouse applications, this talk also introduces related concepts, including batch processing, streaming, and MLOps, so that attendees can learn these skills and understand the architectures and trends that currently matter in data engineering. A good data architecture and pipeline enables many interesting and innovative data applications and lightens the load on data team members such as data analysts, data scientists, and machine learning engineers, which helps establish a healthy data cycle.
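
The lakehouse operating principle mentioned above can be pictured with a small sketch. This is not the Delta Lake API and not the demo from the talk: it is a toy, in-memory table in plain Python, and every name in it (`ToyVersionedTable`, `commit`, `read`, `allow_new_columns`) is hypothetical. It assumes the core idea is an append-only commit log, which is enough to illustrate all-or-nothing commits, reading an older version (time travel), and adding a column later (schema evolution).

```python
import copy


class ToyVersionedTable:
    """Toy in-memory table backed by an append-only commit log (illustration only)."""

    def __init__(self):
        # Each log entry is one committed version: {"schema": [...], "rows": [...]}
        self._log = []

    @property
    def version(self):
        """Latest committed version number, or -1 if nothing has been committed."""
        return len(self._log) - 1

    def commit(self, rows, allow_new_columns=False):
        """Append all rows as one new version, or raise and leave the table unchanged."""
        schema = list(self._log[-1]["schema"]) if self._log else []
        for row in rows:
            for column in row:
                if column not in schema:
                    # After the first commit, new columns need an explicit opt-in.
                    if self._log and not allow_new_columns:
                        raise ValueError(f"unexpected column: {column}")
                    schema.append(column)
        # Validation passed for every row, so the whole batch becomes visible at once.
        self._log.append({"schema": schema, "rows": copy.deepcopy(rows)})
        return self.version

    def read(self, version=None):
        """Return the rows visible at `version` (default: latest), padded to its schema."""
        if version is None:
            version = self.version
        if not 0 <= version <= self.version:
            raise IndexError(f"no such version: {version}")
        schema = self._log[version]["schema"]
        visible = []
        for entry in self._log[: version + 1]:
            for row in entry["rows"]:
                visible.append({column: row.get(column) for column in schema})
        return visible


table = ToyVersionedTable()
table.commit([{"id": 1, "city": "Taipei"}])   # version 0
table.commit([{"id": 2, "city": "Tokyo"}])    # version 1
try:
    table.commit([{"id": 3, "country": "JP"}])  # rejected: schema change without opt-in
except ValueError:
    pass
table.commit([{"id": 3, "city": "Osaka", "country": "JP"}], allow_new_columns=True)  # version 2
```

Here `table.read(0)` still returns only the first row, while `table.read()` returns all three rows with the new `country` column filled with `None` for older records. A real lakehouse format keeps such a log next to data files in object storage; this sketch only mirrors the idea.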

Transcript

  1. TABLE OF CONTENTS

    Data & Datastore • About Lakehouse • Represent Service • Demo & UseCase • Future & Conclusion
  2. Data Categories

    • Structured: clearly defined columns; clearly defined value types; fixed columns per record; like a table
    • Semi-Structured: flexible columns per record; mixed value types; JSON, XML
    • Unstructured: ambiguous definition; image, audio, video
  3. Data Storage

    • Database: RDB; NoSQL (Key-Value, Timeseries, Column-family, Document, Graph)
    • Data Warehouse: ClickHouse, AWS Redshift, Snowflake
    • Data Lake: HDFS, AWS S3, GCP GCS, Azure Blob, Databricks FS
    • Lakehouse: Delta Lake, Apache Iceberg, Apache Hudi
  4. Lakehouse

    • Unified Storage: use one storage layer to break down data islands and reduce cost
    • ACID: transactions over data at scale to ensure consistency
    • Simple Data Model: easy to implement, with incremental ETL
    • Data Governance: control data and perform audits
    • Data Sharing & Lineage: maintain version history for datasets and track data provenance
    • Adapt to Changing Business: easy to modify the data schema or create tables over time
  5. Top 3 Lakehouse

    • ACID Transaction
    • Data Version (Time travel)
    • Schema Evolution
    • Near-real-time Streaming/Batch
  6. Lakehouse - 3 Zone Architecture

    • Apache Kafka, AWS Kinesis, GCP PubSub, Azure Eventhub, CSV, JSON… etc
    • Bronze: Raw Ingestion
    • Silver: Filtered, Cleaned, Augmented
    • Golden: Business-Level Aggregates
    • Streaming Analytics, AI & Report
    • Lakehouse (Delta, Iceberg, Hudi)
  7. Demo

    CREDITS: This presentation template was created by Slidesgo, including icons by Flaticon, and infographics & images by Freepik
  8. Real UseCase - 3 Zone Architecture

    • Delta Lake, AWS S3
    • Bronze: Raw Ingestion
    • Silver: Filtered, Cleaned, Augmented
    • Golden: Business-Level Aggregates
    • Structured Streaming
    • BU 1, BU 1, BU N
    • Schema Registry, Great Expectations
  9. Lakehouse Overview

    • STORAGE FORMATS: Structured, Semi-Structured, Unstructured
    • TABLE METASTORE
    • SECURITY & GOVERNANCE
    • DATA PREP, SQL & BI, AI / ML, TIMESERIES, GRAPH, STREAMING …
    • Analytics - SQL & BI
  10. Summary

    • Data Central + Mart
    • Lineage & Security
    • Schema version to fit application demand, like analytics or ML
    • ETL + ELT: hybrid data pipeline for flexibility and scalability
    • Centralized and decentralized
    • Manage overall data and control permissions for security
    • Iterative application
  11. THANKS
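
The three-zone flow on the architecture slides (raw ingestion, then filtered and cleaned data, then business-level aggregates) can be sketched in a few lines. This is an illustration in plain Python, not the production pipeline from the talk, which the slides describe as Structured Streaming on Delta Lake and AWS S3. The function names, the `user_id` and `amount` fields, and the sample records are all hypothetical.

```python
from collections import defaultdict


def to_bronze(raw_events):
    """Bronze zone: keep every raw record as received, only tagging its source."""
    return [{"source": "demo", "payload": dict(event)} for event in raw_events]


def to_silver(bronze_rows):
    """Silver zone: drop records that fail basic checks and normalise value types."""
    silver = []
    for row in bronze_rows:
        payload = row["payload"]
        if payload.get("user_id") is None or payload.get("amount") is None:
            continue  # filtered: a required field is missing
        try:
            amount = float(payload["amount"])
        except (TypeError, ValueError):
            continue  # filtered: amount is not numeric
        silver.append({"user_id": str(payload["user_id"]), "amount": amount})
    return silver


def to_gold(silver_rows):
    """Golden zone: business-level aggregate, here the total amount per user."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["user_id"]] += row["amount"]
    return dict(totals)


# Hypothetical sample input mixing clean and dirty records.
raw = [
    {"user_id": 1, "amount": "10.5"},
    {"user_id": 1, "amount": 4.5},
    {"user_id": None, "amount": 3},
    {"user_id": 2, "amount": "not-a-number"},
    {"user_id": 2, "amount": 7},
]
bronze = to_bronze(raw)
silver = to_silver(bronze)
gold = to_gold(silver)
```

The bronze list keeps all five records, the silver list keeps the three valid ones, and `gold` holds one total per user. Keeping the raw copy in the first zone means the later zones can be rebuilt when cleaning rules change, which is one common reason given for layering data this way.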