Introducing Data Lakehouse: Apache Iceberg

A talk given at DevFest Cloud Bangkok 2023

Burasakorn Sabyeying

December 21, 2023

Transcript

  1. Introducing Data Lakehouse: Apache Iceberg. Burasakorn Sabyeying (Mils), Data Engineer, CJ Express; Women Techmakers Ambassador, GDG Cloud Bangkok
  2. Three generations of analytics platforms (credit: Databricks): Data Warehouse
    • Database for analytics
    • ACID guarantees
    • Supports only structured data
  3. Three generations of analytics platforms (credit: Databricks): Data Lake
    • Stores CSV, JSON, images, video, txt
    • Stores data in open formats, e.g. Parquet, Avro
    • Lower cost
    Lacks ACID guarantees, metadata management, indexing and partitioning
  4. Three generations of analytics platforms (credit: Databricks): Data Lakehouse
    • ACID guarantees
    • Lower cost
    • Fewer copies mean lower storage costs
    • Undo mistakes by using snapshot isolation
    • Metadata management, indexing, partitioning
  5. “Apache Iceberg is an open table format for huge analytic datasets”: a way to organize a dataset’s files to present them as a single “table”.
  6. What can it do?
    1. Makes it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables at the same time, with a table format that works just like a SQL table
    2. Upsert data
    3. Schema evolution
    4. Partition evolution
    5. Time travel and rollback
  7. Checkpoints
    ✅ Making it possible for engines like Spark to work with a table just like a SQL table
    💜 Upsert data
    💜 Schema evolution
    💜 Partition evolution
    💜 Time travel and rollback
  8. Upsert Data (Insert + Update)
    • Current data (target): demo.nyc.taxis
    • New data (source): demo.nyc.new_data
    • Result: demo.nyc.taxis (now)
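
The upsert on slide 8 can be sketched as a toy, in-memory merge in plain Python. This is not Iceberg's API (in Spark SQL an upsert into an Iceberg table is usually written as a `MERGE INTO` statement); the `upsert` helper and the `trip_id` and `fare` columns are assumptions made up for illustration. Rows from the source update matching rows in the target and are inserted when no match exists.

```python
def upsert(target, source, key):
    """Toy upsert: merge `source` rows into `target` rows on `key`."""
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        # Matched key: update the existing row. Unmatched key: insert a new row.
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return list(merged.values())


# Hypothetical contents standing in for demo.nyc.taxis and demo.nyc.new_data.
taxis = [{"trip_id": 1, "fare": 10.0}, {"trip_id": 2, "fare": 7.5}]
new_data = [{"trip_id": 2, "fare": 9.0}, {"trip_id": 3, "fare": 12.0}]

taxis_now = upsert(taxis, new_data, "trip_id")
for row in taxis_now:
    print(row)
```

Trip 2 is updated with the new fare, trip 3 is inserted, and trip 1 is left unchanged.
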
  9. Schema Evolution
    • Change column type
    • Add new column
    • Drop existing column
    • Reorder column
    • Rename existing column
    • Add column comment
    No need to create and write to a new table: you can do it in place!
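
Why schema changes can happen in place can be sketched with a toy model. Iceberg tracks columns by unique field IDs rather than by name, so a rename or an added column only touches table metadata. The `ToyTable` class below is a simplified illustration under that assumption, not Iceberg's implementation, and the column names are hypothetical.

```python
class ToyTable:
    """Toy table whose stored rows are keyed by field ID, not column name."""

    def __init__(self, columns):
        self.schema = {}    # column name -> field ID (metadata only)
        self.rows = []      # stored rows keyed by field ID, never rewritten
        self._next_id = 1
        for name in columns:
            self.add_column(name)

    def add_column(self, name):
        # A new column gets a fresh ID; old rows simply have no value for it.
        self.schema[name] = self._next_id
        self._next_id += 1

    def rename_column(self, old, new):
        # Only the name changes; the field ID and the stored rows stay the same.
        self.schema = {(new if n == old else n): fid for n, fid in self.schema.items()}

    def drop_column(self, name):
        del self.schema[name]

    def append(self, row):
        self.rows.append({self.schema[n]: v for n, v in row.items()})

    def scan(self):
        # Resolve the current column names against the stored field IDs.
        return [{n: r.get(fid) for n, fid in self.schema.items()} for r in self.rows]


table = ToyTable(["trip_id", "fare"])
table.append({"trip_id": 1, "fare": 10.0})
stored_before = [dict(r) for r in table.rows]

table.rename_column("fare", "fare_amount")
table.add_column("tip")
table.append({"trip_id": 2, "fare_amount": 7.5, "tip": 1.0})

result = table.scan()
print(result)
```

The first row is readable under the new column name and shows `None` for the added column, even though its stored form was never modified.
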
  10. Partition Evolution: lets you change the partitioning used for new data without rewriting existing data.
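
The idea behind partition evolution can be sketched with a toy model in which every data file remembers the partition spec it was written with, so changing the spec affects only later writes. The `ToyPartitionedTable` class, the two specs (by month, then by day) and the `pickup` column are assumptions made up for illustration, not Iceberg's API.

```python
from datetime import date

# Hypothetical partition specs: each maps a row to its partition value.
SPECS = {
    0: lambda row: row["pickup"].strftime("%Y-%m"),  # spec 0: partition by month
    1: lambda row: row["pickup"].isoformat(),        # spec 1: partition by day
}


class ToyPartitionedTable:
    def __init__(self):
        self.current_spec = 0
        self.files = []  # each entry: (spec_id, partition_value, rows)

    def evolve_partitioning(self, spec_id):
        # Metadata-only change: files already written are left untouched.
        self.current_spec = spec_id

    def append(self, rows):
        spec = SPECS[self.current_spec]
        groups = {}
        for row in rows:
            groups.setdefault(spec(row), []).append(row)
        for value, group in groups.items():
            self.files.append((self.current_spec, value, group))


table = ToyPartitionedTable()
table.append([{"trip_id": 1, "pickup": date(2023, 11, 30)}])
table.evolve_partitioning(1)
table.append([{"trip_id": 2, "pickup": date(2023, 12, 21)}])

layout = [(spec_id, value) for spec_id, value, _ in table.files]
print(layout)
```

The old file stays partitioned by month while the new file is partitioned by day, and both belong to the same table.
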
  11. Checkpoints
    ✅ Making it possible for engines like Spark to work with a table just like a SQL table
    ✅ Upsert data
    ✅ Schema evolution
    ✅ Partition evolution
    💜 Time travel and rollback
  12. Checkpoints
    ✅ Making it possible for engines like Spark to work with a table just like a SQL table
    ✅ Upsert data
    ✅ Schema evolution
    ✅ Partition evolution
    ✅ Time travel and rollback
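
Time travel and rollback both rest on snapshots: every write produces a new snapshot, and older ones remain readable. The toy model below sketches that idea only; `ToySnapshotTable` and its methods are hypothetical names, not Iceberg's API, and a real table stores snapshots as metadata pointing at data files rather than as full copies of the rows.

```python
class ToySnapshotTable:
    def __init__(self):
        self.snapshots = [[]]  # snapshot 0 is the empty table
        self.current = 0       # ID of the snapshot readers see by default

    def commit(self, rows):
        # Every write creates a new snapshot; earlier snapshots are kept.
        self.snapshots.append(self.snapshots[self.current] + list(rows))
        self.current = len(self.snapshots) - 1
        return self.current

    def read(self, snapshot_id=None):
        # Time travel: read the table as of any retained snapshot.
        sid = self.current if snapshot_id is None else snapshot_id
        return list(self.snapshots[sid])

    def rollback_to(self, snapshot_id):
        # Rollback: point the table back at an earlier snapshot.
        self.current = snapshot_id


table = ToySnapshotTable()
good = table.commit([{"trip_id": 1}])
bad = table.commit([{"trip_id": -999}])  # a mistaken write

as_of_good = table.read(good)  # time travel to the earlier snapshot
table.rollback_to(good)        # undo the mistake
after_rollback = table.read()
print(as_of_good, after_rollback)
```

After the rollback the default read no longer includes the mistaken row, while the snapshot that contained it still exists in the history.
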
  13. “Apache Iceberg is an open table format for huge analytic datasets”: a way to organize a dataset’s files to present them as a single “table”.