Slide 1

Introducing Data Lakehouse: Apache Iceberg
Burasakorn Sabyeying (Mils), Data Engineer, CJ Express; Women Techmakers Ambassador, GDG Cloud Bangkok

Slide 2

Goal: understand the concept of the Data Lakehouse (through Iceberg)

Slide 3

What is a Data Lakehouse?

Slide 4

3 Generations of analytic platforms (Cr. Databricks)
Data warehouse
● Database for analytics
● ACID guarantees
● Supports only structured data

Slide 5

3 Generations of analytic platforms (Cr. Databricks)
Data Lake
● Stores CSV, JSON, images, video, txt
● Stores data in open formats, e.g. Parquet, Avro
● Lower cost
Lacks ACID guarantees, metadata management, indexing, partitioning

Slide 6

3 Generations of analytic platforms (Cr. Databricks)
Data Lakehouse
● ACID guarantees
● Lower cost
● Fewer copies mean lower storage costs
● Undo mistakes using snapshot isolation
● Metadata management, indexing, partitioning

Slide 7

How to build a Data Lakehouse
● Paid services
● Open source

Slide 8

What is Apache Iceberg?

Slide 9

“Apache Iceberg is an open table format for huge analytic datasets”: a way to organize a dataset’s files to present them as a single “table”.

Slide 10

Iceberg table

Slide 11

Iceberg table layers: catalog, metadata (Avro), data (Parquet)

Slide 12

What can it do?
1. Makes it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables at the same time, and works just like a SQL table
2. Upsert data
3. Schema evolution
4. Partition evolution
5. Time travel and rollback

Slide 13

Create an Apache Iceberg table via Spark
1. PySpark DataFrame
2. Spark SQL

Slide 14

1. Via PySpark DataFrame

df.writeTo(tableName).create()

Slide 15

df.writeTo(tableName).overwrite() or df.writeTo(tableName).append()

Slide 16

2. Via Spark SQL
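The slide does not include the SQL itself; a minimal sketch of creating an Iceberg table with Spark SQL, assuming a Spark session with an Iceberg catalog named demo (the column list follows the Iceberg Spark quickstart that the deck links to):

```sql
-- Create an Iceberg table through a Spark catalog configured for Iceberg
CREATE TABLE demo.nyc.taxis (
  vendor_id bigint,
  trip_id bigint,
  trip_distance float,
  fare_amount double,
  store_and_fwd_flag string
)
USING iceberg
PARTITIONED BY (vendor_id);
```

The `USING iceberg` clause is what routes the table through Iceberg's table format instead of Spark's default.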

Slide 17

Checkpoints
✅ Making it possible for engines like Spark to work just like a SQL table
💜 Upsert data
💜 Schema evolution
💜 Partition evolution
💜 Time travel and rollback

Slide 18

Upsert Data (Insert + Update)
Current data (target): demo.nyc.taxis
New data (source): demo.nyc.new_data
Result: demo.nyc.taxis (now)
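A minimal sketch of this upsert with Spark SQL's MERGE INTO, using the target and source tables named on the slide; the join key trip_id is an assumption for illustration:

```sql
MERGE INTO demo.nyc.taxis AS t      -- target: current data
USING demo.nyc.new_data AS s        -- source: new data
ON t.trip_id = s.trip_id
WHEN MATCHED THEN UPDATE SET *      -- matching rows are updated
WHEN NOT MATCHED THEN INSERT *;     -- new rows are inserted
```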

Slide 19

Schema Evolution
● Change column type
● Add new column
● Drop existing column
● Reorder columns
● Rename existing column
● Add column comment
= No need to create and write to a new table. You can do it in place!
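Each of these in-place changes maps to an ALTER TABLE statement in Spark SQL. A sketch against the deck's demo.nyc.taxis table; column names not shown on the slides (trip_distance, fare_amount, fare) are assumptions:

```sql
ALTER TABLE demo.nyc.taxis ALTER COLUMN trip_distance TYPE double;              -- widen column type
ALTER TABLE demo.nyc.taxis ADD COLUMN fare_per_distance double;                 -- add a new column
ALTER TABLE demo.nyc.taxis DROP COLUMN store_and_fwd_flag;                      -- drop a column
ALTER TABLE demo.nyc.taxis ALTER COLUMN fare_per_distance AFTER trip_distance;  -- reorder columns
ALTER TABLE demo.nyc.taxis RENAME COLUMN fare_amount TO fare;                   -- rename a column
ALTER TABLE demo.nyc.taxis ALTER COLUMN fare COMMENT 'Fare in USD';             -- add a comment
```

None of these rewrite data files; Iceberg records the change in table metadata only.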

Slide 20

Rename Column

Slide 21

Add Column
Add ‘fare_per_distance’

spark.table("demo_ja.nyc.taxis").show()

Slide 22

Delete Column
Delete `store_and_fwd_flag`

spark.table("demo.nyc.taxis").show()

Slide 23

Partition Evolution
Allows you to update the partition scheme for new data without rewriting existing data.
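In Spark SQL, partition evolution is again an in-place ALTER TABLE; only data written after the change uses the new spec, while existing files keep their old layout. The partition fields below are assumptions for illustration:

```sql
-- Evolve the partition spec of an existing Iceberg table
ALTER TABLE demo.nyc.taxis ADD PARTITION FIELD bucket(16, trip_id);
ALTER TABLE demo.nyc.taxis DROP PARTITION FIELD vendor_id;
```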

Slide 24

Checkpoints
✅ Making it possible for engines like Spark to work just like a SQL table
✅ Upsert data
✅ Schema evolution
✅ Partition evolution
💜 Time travel and rollback

Slide 25

Iceberg table layers: catalog, metadata (Avro), data (Parquet)

Slide 26

Iceberg table

Slide 27

v2.metadata.json

Slide 28

See the snapshot id and manifest list
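One way to see snapshot ids and manifest lists is to query Iceberg's snapshots metadata table; the table name follows the deck's demo.nyc.taxis example:

```sql
-- Each row is one snapshot; manifest_list points at its Avro manifest-list file
SELECT committed_at, snapshot_id, operation, manifest_list
FROM demo.nyc.taxis.snapshots;
```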

Slide 29

Rollback to a specific snapshot (before / after)
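The rollback can be done with Iceberg's rollback_to_snapshot stored procedure in Spark SQL; the snapshot id below is illustrative (take a real one from the snapshots metadata table):

```sql
-- Roll demo.nyc.taxis back to an earlier snapshot by id
CALL demo.system.rollback_to_snapshot('nyc.taxis', 8744736658442914487);
```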

Slide 30

Checkpoints
✅ Making it possible for engines like Spark to work just like a SQL table
✅ Upsert data
✅ Schema evolution
✅ Partition evolution
✅ Time travel and rollback

Slide 31

“Apache Iceberg is an open table format for huge analytic datasets”: a way to organize a dataset’s files to present them as a single “table”.

Slide 32

Apache Iceberg in BigQuery

Slide 33

Where to learn more - https://iceberg.apache.org/spark-quickstart/ - https://www.dremio.com/resources/guides/apache-iceberg/

Slide 34

Thank you! https://bit.ly/burasakorn-mils