hueiyuan su
September 07, 2025

How to integrate python tools with Apache Iceberg to build ETLT pipeline on Shift-Left Architecture

This session will begin by exploring why the industry is increasingly moving toward the lakehouse architecture, highlighting the core challenges it addresses in modern data processing. Using this as a foundation, we will introduce the ETLT pattern and compare two architectural approaches: the Shift-Left Architecture and the Medallion Architecture, outlining their key differences and applicable scenarios.

We will also cover the essential technologies used throughout the data pipeline, including Kafka, PySpark, Trino, and DBT for the Shift-Left Architecture. For Python developers, we’ll demonstrate how to efficiently read and interact with Iceberg-formatted data, showcasing code snippets and configuration examples to provide practical guidance.

The talk will conclude with real-world best practices to help data professionals evaluate whether this architecture fits their use cases and how it can be leveraged to improve existing data workflows.

Transcript

  1. How to integrate python tools with Apache Iceberg to build

    ETLT Pipeline on Shift-Left Architecture Mars Su
  2. About Me. Work: in the InfoSec industry with Data & AI. Interests: Data Engineering / AI / Team Coaching. Experience: 2025 Taipei DBT MeetUp, 2024 itHome SRE Conference, 2023 Sciwork Conference, 2022 PyCon APAC. PyCon TW 2025
  3. Agenda. 01 Introduction: what are the pain points in pipelines, and why Lakehouse? 02 Architecture Comparison: Shift-Left vs. Medallion. 03 Iceberg Foundation: OTF & Iceberg key features. 04 Technical Detail: the Python tools that achieve it. 05 Conclusion: Takeaways & Recap.
  4. Pain Points. Brittle Pipeline: ❖ pipeline drift ❖ long time to respond to feature requests. The Data Bottleneck: ❖ high dependency ❖ no focus on infra design & optimization. Cost, Error & Recovery: ❖ difficult to roll back erroneous data ❖ lack of version control. Lake/Warehouse Dilemma: ❖ Data Lake -> Data Swamp ❖ Data Warehouse -> fixed structure, cost & lack of flexibility.
  5. The Solution. Ref: The Great Shift Left: Embracing the Shift Left Data Architecture
  6. Lakehouse Important Features. ACID: provides transactional operations. Schema Evolution: dynamic & flexible schema changes. Data Versioning & Time Travel: can roll back to any historical data version. Separation of Storage & Computation: dynamically change the engine used to write or query data.
  7. Medallion will encounter… ❖ multi-hop & higher cost ❖ lack of agility & high latency ❖ dependency & bottleneck ❖ centralized bottleneck & rigid ownership. Ref: Medallion Architecture
  8. Shift-Left: Validation, Quality, Ownership. Ref: Shift left to write data once, read as tables or streams
  9. Open Table Format. Ref: The History and Evolution of Open Table Formats - Part II
  10. Layer Working Model: Ingest/Query Engine Layer -> OTF Layer -> Storage Layer. Write: write data with rollback support. Read: read consistent or older data.
  11. Apache Iceberg Storage Layer. Layout: s3://your-bucket/some_table/ contains metadata/ (v1.metadata.json … v3.metadata.json, snap…, …avro) and data/ (date=2025-09-05/ with qdqf.parquet, 10rk0.parquet, lmn1on.parquet). The catalog is used to find a table's latest metadata file, and also acts as the connector for databases & tables, e.g. db.click_event.
  12. Apache Iceberg Storage Layer (metadata/, same layout). Metadata files include the table schema, partition spec, and snapshots. All metadata files are chained like a linked list to track version history. Manifests (Avro files) record row counts and upper/lower-bound statistics, which assist predicate pushdown and partition pruning. Snapshots enable Time Travel, Rollback, ACID, and Lineage.
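The snapshot mechanics described on this slide can be sketched in plain Python. The dictionary below is a heavily simplified stand-in for a vN.metadata.json file (real metadata files carry many more fields, and the snapshot ids and timestamps here are illustrative), but the field names mirror the Iceberg spec and the two lookups show how the current pointer and time travel work:

```python
import json

# Simplified stand-in for an Iceberg vN.metadata.json file (illustrative values).
metadata = json.loads("""
{
  "current-snapshot-id": 3055729675574597004,
  "snapshots": [
    {"snapshot-id": 3051729675574597004, "timestamp-ms": 1755932110000},
    {"snapshot-id": 3055729675574597004, "timestamp-ms": 1755932170000}
  ]
}
""")

def current_snapshot(meta: dict) -> dict:
    """Resolve the snapshot the table currently points at."""
    sid = meta["current-snapshot-id"]
    return next(s for s in meta["snapshots"] if s["snapshot-id"] == sid)

def snapshot_as_of(meta: dict, ts_ms: int) -> dict:
    """Time travel: the latest snapshot at or before a timestamp."""
    eligible = [s for s in meta["snapshots"] if s["timestamp-ms"] <= ts_ms]
    return max(eligible, key=lambda s: s["timestamp-ms"])
```

Rollback is the same idea in reverse: moving `current-snapshot-id` back to an earlier entry in the chain, without touching the data files.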
  13. Apache Iceberg Storage Layer (data/, same layout). The data layer stores the real data, most commonly as Parquet; ORC and Avro are also possible.
  14. Iceberg – Schema Evolution:
    user_id | url                 | category
    101     | example.com/page-a  | null
    102     | example.com/page-b  | null
    101     | example.com/cart    | e-commerce
    103     | example.com/reading | reading
  15. Iceberg – Schema Evolution:
    user_id | url
    101     | example.com/page-a
    102     | example.com/page-b
    101     | example.com/cart
    103     | example.com/reading
    Dropping the column is a metadata-only operation: the data is not actually deleted from the data layer, it is just marked as deleted in the metadata file.
  16. Iceberg – Schema Evolution:
    user_id | url                 | category
    101     | example.com/page-a  | null
    102     | example.com/page-b  | null
    101     | example.com/cart    | e-commerce
    103     | example.com/reading | reading
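Schema evolution like the add/drop shown in these tables is expressed as plain DDL. A minimal sketch, assuming the Iceberg Spark SQL syntax and the db.click_event table name from the earlier slide; with a live SparkSession each string would be passed to spark.sql(...):

```python
# Hedged sketch: table and column names come from the slides; each statement
# is a metadata-only change in Iceberg, executed via spark.sql(ddl).
add_column = "ALTER TABLE db.click_event ADD COLUMN category string"
drop_column = "ALTER TABLE db.click_event DROP COLUMN category"

for ddl in (add_column, drop_column):
    print(ddl)  # spark.sql(ddl) in a real session
```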
  17. Iceberg – Time Travel:
    user_id | balance | updated_at
    101     | 1500.0  | 2025-08-23 08:55:10
    102     | 850.0   | 2025-08-23 08:55:10
    103     | 3200.0  | 2025-08-23 08:55:10
    user_id | balance | updated_at
    101     | 0.0     | 2025-08-23 08:55:10
    102     | 0.0     | 2025-08-23 08:55:10
    103     | 0.0     | 2025-08-23 08:55:10
  18. Iceberg – Time Travel:
    user_id | balance | updated_at
    101     | 1500.0  | 2025-08-23 08:55:10
    102     | 850.0   | 2025-08-23 08:55:10
    103     | 3200.0  | 2025-08-23 08:55:10
    user_id | balance | updated_at
    101     | 0.0     | 2025-08-23 08:55:10
    102     | 0.0     | 2025-08-23 08:55:10
    103     | 0.0     | 2025-08-23 08:55:10
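Recovering the pre-corruption balances above is ordinary SQL in Spark with Iceberg. A sketch, where the snapshot id, catalog, and table name are illustrative assumptions, not values from the slides:

```python
# Hedged sketch: time travel is plain Spark SQL; rollback is an Iceberg
# stored procedure. Snapshot id / names below are illustrative.
snapshot_id = 3051729675574597004

by_version = f"SELECT * FROM nessie.db.accounts VERSION AS OF {snapshot_id}"
by_time = "SELECT * FROM nessie.db.accounts TIMESTAMP AS OF '2025-08-23 08:00:00'"
rollback = f"CALL nessie.system.rollback_to_snapshot('db.accounts', {snapshot_id})"
```

The first two queries read an old version without changing the table; only the rollback procedure moves the table's current pointer back.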
  19. Iceberg – Partition Evolution: ❖ an unnecessary schema field ❖ not intuitive to query ❖ if you forget to filter on dt, there is a performance issue.
  20. Iceberg – Partition Evolution: ❖ simplified schema ❖ intuitive to query ❖ automatic partition pruning ❖ the partition rule can always be changed (day -> hour).
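The day-to-hour change mentioned above is again metadata-only DDL in Iceberg's Spark SQL extensions. A sketch, with the dt field from the previous slide and an assumed event_ts timestamp column:

```python
# Hedged sketch: partition evolution statements, run via spark.sql(ddl).
# Old data keeps its old partition layout; only new writes use the new spec.
drop_dt = "ALTER TABLE nessie.db.click_event DROP PARTITION FIELD dt"
add_hourly = "ALTER TABLE nessie.db.click_event ADD PARTITION FIELD hours(event_ts)"
day_to_hour = (
    "ALTER TABLE nessie.db.click_event "
    "REPLACE PARTITION FIELD days(event_ts) WITH hours(event_ts)"
)
```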
  21. Iceberg – Copy on Write: ❖ read-heavy workloads ❖ batch operations.
  22. Iceberg – Merge on Read: ❖ write-heavy workloads (update/delete) ❖ streaming ingestion ❖ requires additional compaction.
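The choice between the two strategies is made per operation with real Iceberg table properties; a table can even mix them. A sketch of setting a table to merge-on-read (table name assumed from earlier slides):

```python
# write.delete.mode / write.update.mode / write.merge.mode are real Iceberg
# table properties; each accepts "copy-on-write" or "merge-on-read".
cow = {
    "write.delete.mode": "copy-on-write",
    "write.update.mode": "copy-on-write",
    "write.merge.mode": "copy-on-write",
}
mor = {k: "merge-on-read" for k in cow}

props = ", ".join(f"'{k}'='{v}'" for k, v in mor.items())
ddl = f"ALTER TABLE nessie.db.click_event SET TBLPROPERTIES ({props})"
```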
  23. Shift-Left. Ref: Shift left to write data once, read as tables or streams
  24. Shift-Left. Ref: Shift left to write data once, read as tables or streams
  25. Spark Configuration: set up object storage (MinIO) for reading/writing Iceberg; set up the catalog (Nessie) to CRUD Iceberg databases and tables.
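The MinIO and Nessie setup above boils down to a handful of SparkSession settings. A sketch, where the config keys are the standard Iceberg/Nessie ones but the endpoints, bucket, and credentials are placeholders:

```python
# Hedged sketch: Spark config for Iceberg on MinIO with a Nessie catalog.
# Endpoint URLs, bucket, and catalog name are illustrative.
conf = {
    "spark.sql.extensions": (
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
        "org.projectnessie.spark.extensions.NessieSparkSessionExtensions"
    ),
    "spark.sql.catalog.nessie": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.nessie.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
    "spark.sql.catalog.nessie.uri": "http://nessie:19120/api/v2",
    "spark.sql.catalog.nessie.ref": "main",
    "spark.sql.catalog.nessie.warehouse": "s3a://your-bucket/warehouse",
    "spark.sql.catalog.nessie.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.nessie.s3.endpoint": "http://minio:9000",
}

# With pyspark installed:
# builder = SparkSession.builder.appName("etlt-pipeline")
# for k, v in conf.items():
#     builder = builder.config(k, v)
# spark = builder.getOrCreate()
```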
  26. Spark Configuration: set up the Kafka configuration (all options can be found in confluentinc/librdkafka). Spark Structured Streaming reads event data from Kafka, then processes it and writes to Iceberg in the process_batch function.
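The read-from-Kafka / foreachBatch shape described above can be sketched as follows; the broker address, topic name, and checkpoint path are placeholders, and process_batch is left as a stub since the transformation arrives on the next slides:

```python
# Hedged sketch: Structured Streaming options for the Kafka source.
kafka_options = {
    "kafka.bootstrap.servers": "broker:9092",  # kafka.* options pass through to the client
    "subscribe": "click_event",
    "startingOffsets": "latest",
}

def process_batch(batch_df, batch_id):
    # In the real pipeline batch_df is the micro-batch DataFrame:
    # transform, run quality checks, then write to Iceberg here.
    ...

# With a live SparkSession:
# (spark.readStream.format("kafka").options(**kafka_options).load()
#      .writeStream.foreachBatch(process_batch)
#      .option("checkpointLocation", "s3a://your-bucket/checkpoints/click_event")
#      .start())
```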
  27. Spark process & data quality: select the required fields from Kafka and apply your transformation logic.
  28. Spark process & data quality: define our data quality check function with Great Expectations, e.g. check that fields are not null, the device type is valid, etc. Apply the data quality function; records that pass continue to downstream processing.
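The checks described above can be illustrated with a plain-Python stand-in for Great Expectations (the real library wraps a DataFrame and declares expectations; the field names and legal device types here are assumptions for illustration):

```python
# Hedged sketch: not-null + allowed-value checks, splitting a micro-batch
# into valid rows (continue downstream) and invalid rows (quarantine).
LEGAL_DEVICE_TYPES = {"desktop", "mobile", "tablet"}
REQUIRED_FIELDS = ("user_id", "url", "device_type")

def check_quality(record: dict) -> bool:
    if any(record.get(f) is None for f in REQUIRED_FIELDS):
        return False
    return record["device_type"] in LEGAL_DEVICE_TYPES

events = [
    {"user_id": 101, "url": "example.com/page-a", "device_type": "mobile"},
    {"user_id": None, "url": "example.com/page-b", "device_type": "desktop"},
    {"user_id": 103, "url": "example.com/cart", "device_type": "fridge"},
]
valid = [e for e in events if check_quality(e)]
invalid = [e for e in events if not check_quality(e)]
```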
  29. Spark write to Iceberg: define the Iceberg database/table schema with the catalog (Nessie). You can write to different silver Iceberg tables in the same Spark Structured Streaming micro-batch. At this point the data should have a certain level of quality & consistency.
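The table definition and the per-batch write can be sketched like this; the silver table name, columns, and partitioning are illustrative assumptions, while `writeTo(...).append()` is Spark's DataFrameWriterV2 API:

```python
# Hedged sketch: create the silver table once, then append each micro-batch.
create_silver = """
CREATE TABLE IF NOT EXISTS nessie.silver.click_event (
    user_id BIGINT,
    url STRING,
    device_type STRING,
    event_ts TIMESTAMP
)
USING iceberg
PARTITIONED BY (hours(event_ts))
"""

# Inside process_batch, with a live session:
# spark.sql(create_silver)
# clean_df.writeTo("nessie.silver.click_event").append()
```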
  30. Trino connect to Iceberg. Ref: Trino Based Architecture. Trino is a high-performance, distributed SQL query engine for big data analytics. Its key feature is query federation, which allows you to run a single SQL query to access, join, and analyze data from multiple diverse sources, such as data lakes, databases, and streaming systems.
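Query federation can be made concrete with a single query joining an Iceberg table against an operational database; the catalog, schema, and table names are illustrative, and the commented client usage assumes the `trino` Python package:

```python
# Hedged sketch: one SQL statement spanning two Trino catalogs
# (iceberg lakehouse table joined with a PostgreSQL table).
federated_query = """
SELECT c.user_id, c.url, u.plan
FROM iceberg.silver.click_event AS c
JOIN postgresql.public.users AS u
  ON c.user_id = u.user_id
"""

# With the trino Python client installed (host/port/user are placeholders):
# import trino
# conn = trino.dbapi.connect(host="trino", port=8080, user="analyst")
# cur = conn.cursor()
# cur.execute(federated_query)
# rows = cur.fetchall()
```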
  31. DBT to Connect Trino. Ref: What is DBT? DBT (data build tool) is an open-source transformation tool that enables the "T" in the ELT (Extract, Load, Transform) process. It allows data teams to transform raw data already inside a cloud data warehouse into reliable, analysis-ready datasets using simple SQL select statements. By bringing software engineering best practices, like version control, testing, and documentation, to the analytics workflow, DBT helps build trusted and maintainable data models.
  32. DBT to Connect Trino: use a profile to connect to the corresponding environment's Trino cluster; define ELT models for analysis.
  33. DBT to Connect Trino: DBT generates data management documentation and a data lineage graph to trace data.
  34. Shift-Left. Ref: Shift left to write data once, read as tables or streams
  35. Shift-Left Extend. Ref: Shift left to write data once, read as tables or streams
  36. Shift-Left Extend. Ref: Shift left to write data once, read as tables or streams
  37. Takeaways. Architecture: Shift-Left vs. Medallion. Iceberg: OTF logic & key features. ETLT: use common tools to build an Iceberg-based Shift-Left pipeline.
  38. Next Steps. Data Orchestrator: automated scheduling, monitoring, error handling. Data Observability: moving from knowing what broke to knowing what's breaking. Data Governance: handling data policy, security, lineage and management. Next-Gen Python Data Tools: as trends evolve, a variety of new analytical tools will be developed.
  39. Resources: ❖ What is Shift Left? ❖ Shift left to write data once, read as tables or streams ❖ Shift Left: The Key to Faster, Smarter, and More Efficient Data Pipelines ❖ Apache Iceberg ❖ Introduction to the Iceberg Data Lakehouse ❖ Apache Iceberg and PySpark ❖ Introduction to Apache Iceberg In Trino ❖ First dbt-trino data pipeline
  40. Thanks for Listening! CREDITS: This presentation template was created by Slidesgo, including icons by Flaticon, and infographics & images by Freepik.