Introducing
Data Lakehouse:
Apache Iceberg
Cloud
Burasakorn Sabyeying (Mils)
Data Engineer, CJ Express.
Women Techmakers Ambassador, GDG Cloud Bangkok
Slide 2
Goal:
Understand the concept of
Data Lakehouse
(via Iceberg)
Slide 3
What is a Data Lakehouse?
Slide 4
3 Generations of analytics platforms
Cr. Databricks
Data warehouse
- Database for analytics
- ACID guarantees
- Supports only structured data
Slide 5
3 Generations of analytics platforms
Cr. Databricks
Data Lake
● Stores CSV, JSON, images, video, txt
● Stores data in open formats, e.g. Parquet, Avro
● Lower cost
Lacks ACID guarantees, metadata management,
indexing, and partitioning
Slide 6
3 Generations of analytics platforms
Cr. Databricks
Data Lakehouse
● ACID guarantees
● Lower cost
● Fewer copies mean lower storage costs
● Undo mistakes using snapshot isolation
● Metadata management, indexing, partitioning
Slide 7
How to build Data Lakehouse
Paid services vs. open source
Slide 8
What is Apache Iceberg?
Slide 9
“Apache Iceberg is an
open table format for
huge analytic datasets”
A way to organize a dataset’s files to
present them as a single “table”.
Slide 10
Iceberg table
Slide 11
Iceberg table (diagram): catalog → metadata (avro) → data (parquet)
Slide 12
What can Iceberg do?
1. Makes it possible for engines like Spark, Trino, Flink, Presto, Hive, and Impala to
safely work with the same tables at the same time, just like a SQL table.
2. Upsert data
3. Schema evolution
4. Partition evolution
5. Time travel and rollback
1. Via PySpark DataFrame
df.writeTo(tableName).create()
Slide 15
df.writeTo(tableName).overwrite()
or
df.writeTo(tableName).append()
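Put together, the DataFrame calls above can be sketched end to end. This is a minimal sketch, not the talk's exact demo: the table name `demo.nyc.taxis` and its columns are borrowed from the demo shown later, and a Spark session with an Iceberg catalog named `demo` is assumed to be configured already.

```python
from pyspark.sql import SparkSession

# Assumes spark.sql.catalog.demo is already configured as an Iceberg catalog.
spark = SparkSession.builder.appName("iceberg-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, 1.8), (2, 8.4)],
    ["vendor_id", "trip_distance"],
)

table_name = "demo.nyc.taxis"

df.writeTo(table_name).create()           # create the Iceberg table from the DataFrame
df.writeTo(table_name).append()           # add these rows on top of existing data
df.writeTo(table_name).createOrReplace()  # replace the table contents wholesale
```

Note that in the `DataFrameWriterV2` API, `overwrite()` takes a filter condition selecting which rows to replace; `overwritePartitions()` and `createOrReplace()` are the no-argument variants.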
Slide 16
2. Via Spark SQL
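A minimal sketch of the Spark SQL equivalent, again assuming an Iceberg catalog named `demo` and an illustrative `nyc.taxis` schema:

```sql
-- Create an Iceberg table and insert rows (schema is illustrative).
CREATE TABLE demo.nyc.taxis (
  vendor_id     BIGINT,
  trip_distance DOUBLE
) USING iceberg;

INSERT INTO demo.nyc.taxis VALUES (1, 1.8), (2, 8.4);

-- INSERT OVERWRITE replaces existing data, like overwrite() in the DataFrame API.
INSERT OVERWRITE demo.nyc.taxis VALUES (3, 2.6);
```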
Slide 17
Checkpoints
✅ Engines like Spark can work with Iceberg tables just like SQL tables.
💜 Upsert data
💜 Schema Evolution
💜 Partition evolution
💜 Time Travel and Rollback
Slide 18
Upsert Data (Insert + Update)
Current data (target): demo.nyc.taxis
New data (source): demo.nyc.new_data
Result: demo.nyc.taxis (now)
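The upsert on this slide maps to Spark SQL's `MERGE INTO`; a sketch using the slide's table names, with the join key (`vendor_id`) as an assumption:

```sql
MERGE INTO demo.nyc.taxis AS t     -- target: current data
USING demo.nyc.new_data AS s       -- source: new data
ON t.vendor_id = s.vendor_id
WHEN MATCHED THEN UPDATE SET *     -- update rows that already exist
WHEN NOT MATCHED THEN INSERT *;    -- insert rows that do not
```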
Slide 19
Schema Evolution
● Change column type
● Add new column
● Drop existing column
● Reorder columns
● Rename existing column
● Add column comment
= No need to create and write to a new table.
You can do it in place!
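Each of the bullets above corresponds to an in-place `ALTER TABLE` statement in Spark SQL with Iceberg; the column names here are illustrative assumptions:

```sql
ALTER TABLE demo.nyc.taxis ADD COLUMN fare DOUBLE;                   -- add new column
ALTER TABLE demo.nyc.taxis ALTER COLUMN vendor_id TYPE BIGINT;       -- widen column type
ALTER TABLE demo.nyc.taxis RENAME COLUMN trip_distance TO distance;  -- rename column
ALTER TABLE demo.nyc.taxis ALTER COLUMN distance
  COMMENT 'trip distance in miles';                                  -- add column comment
ALTER TABLE demo.nyc.taxis ALTER COLUMN fare AFTER distance;         -- reorder columns
ALTER TABLE demo.nyc.taxis DROP COLUMN fare;                         -- drop column
```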
Partition Evolution
Allows you to update your partitions for new data without rewriting existing data.
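In Spark SQL this is an `ALTER TABLE` on the partition spec; the timestamp column name is an assumption. Files already written keep the old layout, and only new writes use the new spec:

```sql
-- Start partitioning by day, then switch to hourly granularity
-- without rewriting data already in the table.
ALTER TABLE demo.nyc.taxis ADD PARTITION FIELD days(pickup_time);
ALTER TABLE demo.nyc.taxis DROP PARTITION FIELD days(pickup_time);
ALTER TABLE demo.nyc.taxis ADD PARTITION FIELD hours(pickup_time);
```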
Slide 24
Checkpoints
✅ Engines like Spark can work with Iceberg tables just like SQL tables.
✅ Upsert data
✅ Schema Evolution
✅ Partition evolution
💜 Time Travel and Rollback
Slide 25
Iceberg table (diagram): catalog → metadata (avro) → data (parquet)
Slide 26
Iceberg table
Slide 27
v2.metadata.json
Slide 28
See snapshot id and manifest list
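The snapshot id and manifest list shown here can be queried from Iceberg's metadata tables, e.g.:

```sql
-- Each row is one snapshot of the table and the manifest list it points to.
SELECT committed_at, snapshot_id, operation, manifest_list
FROM demo.nyc.taxis.snapshots;
```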
Slide 29
Rollback to a specific snapshot
Before
After
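A sketch of time travel and rollback in Spark SQL; the snapshot id is a placeholder for one taken from the `snapshots` metadata table:

```sql
-- Time travel: read the table as of an old snapshot.
SELECT * FROM demo.nyc.taxis VERSION AS OF 1234567890;

-- Rollback: make that snapshot the current state again.
CALL demo.system.rollback_to_snapshot('nyc.taxis', 1234567890);
```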
Slide 30
Checkpoints
✅ Engines like Spark can work with Iceberg tables just like SQL tables.
✅ Upsert data
✅ Schema Evolution
✅ Partition evolution
✅ Time Travel and Rollback
Slide 31
“Apache Iceberg is an
open table format for
huge analytic datasets”
A way to organize a dataset’s files to
present them as a single “table”.
Slide 32
Apache Iceberg in BigQuery
Slide 33
Where to learn more
- https://iceberg.apache.org/spark-quickstart/
- https://www.dremio.com/resources/guides/apache-iceberg/