hueiyuan su
September 07, 2025

How to integrate python tools with Apache Iceberg to build ETLT pipeline on Shift-Left Architecture

This session will begin by exploring why the industry is increasingly moving toward the lakehouse architecture, highlighting the core challenges it addresses in modern data processing. Using this as a foundation, we will introduce the ETLT pattern and compare two architectural approaches: the Shift-Left Architecture and the Medallion Architecture, outlining their key differences and applicable scenarios.

We will also cover the essential technologies used throughout the data pipeline, including Kafka, PySpark, Trino, and DBT for the Shift-Left Architecture. For Python developers, we’ll demonstrate how to efficiently read and interact with Iceberg-formatted data, showcasing code snippets and configuration examples to provide practical guidance.

The talk will conclude with real-world best practices to help data professionals evaluate whether this architecture fits their use cases and how it can be leveraged to improve existing data workflows.

Transcript

  1. How to integrate python tools with Apache Iceberg to build

    ETLT Pipeline on Shift-Left Architecture Mars Su
  2. About Me. Work: in the InfoSec industry with Data & AI. Interests: Data Engineering / AI / Team Coaching. Experience: 2025 Taipei DBT MeetUp, 2024 itHome SRE Conference, 2023 Sciwork Conference, 2022 PyCon APAC. PyCon TW 2025
  3. Agenda. 01 Introduction: what are the pain points in pipelines, and why Lakehouse? 02 Architecture Comparison: Shift-Left vs. Medallion. 03 Iceberg Foundation: OTF & Iceberg key features. 04 Technical Detail: the Python tools that achieve it. 05 Conclusion: Takeaways & Recap.
  4. Pain Points. Brittle Pipeline: ❖ pipeline drift ❖ long time to respond to feature requests. The Data Bottleneck: ❖ high dependency ❖ no focus on infra design & optimization. Cost, Error & Recovery: ❖ difficult to roll back erroneous data ❖ lack of version control. Lake/Warehouse Dilemma: ❖ Data Lake -> Data Swamp ❖ Data Warehouse -> fixed structure, cost & lack of flexibility.
  5. The Solution. Ref: The Great Shift Left: Embracing the Shift Left Data Architecture
  6. Lakehouse Important Features. ACID: provides transactional operations. Schema Evolution: dynamic & flexible schema changes. Data Versioning & Time Travel: can roll back to any historical data version. Separation of Storage & Computation: dynamically change the engine used to write or query data.
  7. Medallion will encounter… ❖ multi-hop & higher cost ❖ lack of agility & high latency ❖ dependency & bottleneck ❖ centralized bottleneck & rigid ownership. Ref: Medallion Architecture
  8. Shift-Left: Validation, Quality, Ownership. Ref: Shift left to write data once, read as tables or streams
  9. Open Table Format. Ref: The History and Evolution of Open Table Formats - Part II
  10. Layer Working Model: Ingest/Query Engine Layer -> OTF Layer -> Storage Layer. Write: write data with rollback support. Read: read consistent or older data.
  11. Apache Iceberg Storage Layer. Layout: s3://your-bucket/some_table/ contains metadata/ (v1.metadata.json … v3.metadata.json, snap…, …avro) and data/ (date=2025-09-05/ with qdqf.parquet, 10rk0.parquet, lmn1on.parquet). The catalog is used to find a table's latest metadata file, and also acts as the connector for databases & tables, e.g. db.click_event.
  12. Apache Iceberg Storage Layer (metadata/, same layout). Metadata files include the table schema, partition spec, and snapshots. All metadata files are chained like a linked list to track version history. Manifests (Avro files) record row counts and upper/lower-bound statistics, which assist predicate pushdown and partition pruning. Snapshots enable Time Travel, Rollback, ACID, and Lineage.
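The snapshot mechanics described on this slide can be sketched in plain Python. The dictionary below is a heavily simplified stand-in for a vN.metadata.json file (real metadata files carry many more fields, and the snapshot ids and timestamps here are illustrative), but the field names mirror the Iceberg spec and the two lookups show how the current pointer and time travel work:

```python
import json

# Simplified stand-in for an Iceberg vN.metadata.json file (illustrative values).
metadata = json.loads("""
{
  "current-snapshot-id": 3055729675574597004,
  "snapshots": [
    {"snapshot-id": 3051729675574597004, "timestamp-ms": 1755932110000},
    {"snapshot-id": 3055729675574597004, "timestamp-ms": 1755932170000}
  ]
}
""")

def current_snapshot(meta: dict) -> dict:
    """Resolve the snapshot the table currently points at."""
    sid = meta["current-snapshot-id"]
    return next(s for s in meta["snapshots"] if s["snapshot-id"] == sid)

def snapshot_as_of(meta: dict, ts_ms: int) -> dict:
    """Time travel: the latest snapshot at or before a timestamp."""
    eligible = [s for s in meta["snapshots"] if s["timestamp-ms"] <= ts_ms]
    return max(eligible, key=lambda s: s["timestamp-ms"])
```

Rollback is the same idea in reverse: moving `current-snapshot-id` back to an earlier entry in the chain, without touching the data files.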
  13. Apache Iceberg Storage Layer (data/, same layout). The data layer stores the real data, most commonly as Parquet; ORC and Avro are also possible.
  14. Iceberg – Schema Evolution:
    user_id | url                 | category
    101     | example.com/page-a  | null
    102     | example.com/page-b  | null
    101     | example.com/cart    | e-commerce
    103     | example.com/reading | reading
  15. Iceberg – Schema Evolution:
    user_id | url
    101     | example.com/page-a
    102     | example.com/page-b
    101     | example.com/cart
    103     | example.com/reading
    Dropping the column is a metadata-only operation: the data is not actually deleted from the data layer, it is just marked as deleted in the metadata file.
  16. Iceberg – Schema Evolution:
    user_id | url                 | category
    101     | example.com/page-a  | null
    102     | example.com/page-b  | null
    101     | example.com/cart    | e-commerce
    103     | example.com/reading | reading
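Schema evolution like the add/drop shown in these tables is expressed as plain DDL. A minimal sketch, assuming the Iceberg Spark SQL syntax and the db.click_event table name from the earlier slide; with a live SparkSession each string would be passed to spark.sql(...):

```python
# Hedged sketch: table and column names come from the slides; each statement
# is a metadata-only change in Iceberg, executed via spark.sql(ddl).
add_column = "ALTER TABLE db.click_event ADD COLUMN category string"
drop_column = "ALTER TABLE db.click_event DROP COLUMN category"

for ddl in (add_column, drop_column):
    print(ddl)  # spark.sql(ddl) in a real session
```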
  17. Iceberg – Time Travel:
    user_id | balance | updated_at
    101     | 1500.0  | 2025-08-23 08:55:10
    102     | 850.0   | 2025-08-23 08:55:10
    103     | 3200.0  | 2025-08-23 08:55:10
    user_id | balance | updated_at
    101     | 0.0     | 2025-08-23 08:55:10
    102     | 0.0     | 2025-08-23 08:55:10
    103     | 0.0     | 2025-08-23 08:55:10
  18. Iceberg – Time Travel:
    user_id | balance | updated_at
    101     | 1500.0  | 2025-08-23 08:55:10
    102     | 850.0   | 2025-08-23 08:55:10
    103     | 3200.0  | 2025-08-23 08:55:10
    user_id | balance | updated_at
    101     | 0.0     | 2025-08-23 08:55:10
    102     | 0.0     | 2025-08-23 08:55:10
    103     | 0.0     | 2025-08-23 08:55:10
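Recovering the pre-corruption balances above is ordinary SQL in Spark with Iceberg. A sketch, where the snapshot id, catalog, and table name are illustrative assumptions, not values from the slides:

```python
# Hedged sketch: time travel is plain Spark SQL; rollback is an Iceberg
# stored procedure. Snapshot id / names below are illustrative.
snapshot_id = 3051729675574597004

by_version = f"SELECT * FROM nessie.db.accounts VERSION AS OF {snapshot_id}"
by_time = "SELECT * FROM nessie.db.accounts TIMESTAMP AS OF '2025-08-23 08:00:00'"
rollback = f"CALL nessie.system.rollback_to_snapshot('db.accounts', {snapshot_id})"
```

The first two queries read an old version without changing the table; only the rollback procedure moves the table's current pointer back.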
  19. Iceberg – Partition Evolution: ❖ an unnecessary schema field ❖ not intuitive to query ❖ if you forget to filter on dt, there is a performance issue.
  20. Iceberg – Partition Evolution: ❖ simplified schema ❖ intuitive to query ❖ automatic partition pruning ❖ the partition rule can always be changed (day -> hour).
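The day-to-hour change mentioned above is again metadata-only DDL in Iceberg's Spark SQL extensions. A sketch, with the dt field from the previous slide and an assumed event_ts timestamp column:

```python
# Hedged sketch: partition evolution statements, run via spark.sql(ddl).
# Old data keeps its old partition layout; only new writes use the new spec.
drop_dt = "ALTER TABLE nessie.db.click_event DROP PARTITION FIELD dt"
add_hourly = "ALTER TABLE nessie.db.click_event ADD PARTITION FIELD hours(event_ts)"
day_to_hour = (
    "ALTER TABLE nessie.db.click_event "
    "REPLACE PARTITION FIELD days(event_ts) WITH hours(event_ts)"
)
```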
  21. Iceberg – Copy on Write: ❖ read-heavy workloads ❖ batch operations.
  22. Iceberg – Merge on Read: ❖ write-heavy workloads (update/delete) ❖ streaming ingestion ❖ requires additional compaction.
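The choice between the two strategies is made per operation with real Iceberg table properties; a table can even mix them. A sketch of setting a table to merge-on-read (table name assumed from earlier slides):

```python
# write.delete.mode / write.update.mode / write.merge.mode are real Iceberg
# table properties; each accepts "copy-on-write" or "merge-on-read".
cow = {
    "write.delete.mode": "copy-on-write",
    "write.update.mode": "copy-on-write",
    "write.merge.mode": "copy-on-write",
}
mor = {k: "merge-on-read" for k in cow}

props = ", ".join(f"'{k}'='{v}'" for k, v in mor.items())
ddl = f"ALTER TABLE nessie.db.click_event SET TBLPROPERTIES ({props})"
```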
  23. Shift-Left. Ref: Shift left to write data once, read as tables or streams
  24. Shift-Left. Ref: Shift left to write data once, read as tables or streams
  25. Spark Configuration: set up object storage (MinIO) for reading/writing Iceberg; set up the catalog (Nessie) to CRUD Iceberg databases and tables.
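The MinIO and Nessie setup above boils down to a handful of SparkSession settings. A sketch, where the config keys are the standard Iceberg/Nessie ones but the endpoints, bucket, and credentials are placeholders:

```python
# Hedged sketch: Spark config for Iceberg on MinIO with a Nessie catalog.
# Endpoint URLs, bucket, and catalog name are illustrative.
conf = {
    "spark.sql.extensions": (
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
        "org.projectnessie.spark.extensions.NessieSparkSessionExtensions"
    ),
    "spark.sql.catalog.nessie": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.nessie.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
    "spark.sql.catalog.nessie.uri": "http://nessie:19120/api/v2",
    "spark.sql.catalog.nessie.ref": "main",
    "spark.sql.catalog.nessie.warehouse": "s3a://your-bucket/warehouse",
    "spark.sql.catalog.nessie.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.nessie.s3.endpoint": "http://minio:9000",
}

# With pyspark installed:
# builder = SparkSession.builder.appName("etlt-pipeline")
# for k, v in conf.items():
#     builder = builder.config(k, v)
# spark = builder.getOrCreate()
```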
  26. Spark Configuration: set up the Kafka configuration (all options can be found in confluentinc/librdkafka). Spark Structured Streaming reads event data from Kafka, then processes it and writes to Iceberg in the process_batch function.
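The read-from-Kafka / foreachBatch shape described above can be sketched as follows; the broker address, topic name, and checkpoint path are placeholders, and process_batch is left as a stub since the transformation arrives on the next slides:

```python
# Hedged sketch: Structured Streaming options for the Kafka source.
kafka_options = {
    "kafka.bootstrap.servers": "broker:9092",  # kafka.* options pass through to the client
    "subscribe": "click_event",
    "startingOffsets": "latest",
}

def process_batch(batch_df, batch_id):
    # In the real pipeline batch_df is the micro-batch DataFrame:
    # transform, run quality checks, then write to Iceberg here.
    ...

# With a live SparkSession:
# (spark.readStream.format("kafka").options(**kafka_options).load()
#      .writeStream.foreachBatch(process_batch)
#      .option("checkpointLocation", "s3a://your-bucket/checkpoints/click_event")
#      .start())
```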
  27. Spark process & data quality: select the required fields from Kafka and apply your transformation logic.
  28. Spark process & data quality: define our data quality check function with Great Expectations, e.g. check that fields are not null, the device type is valid, etc. Apply the data quality function; records that pass continue to downstream processing.
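The checks described above can be illustrated with a plain-Python stand-in for Great Expectations (the real library wraps a DataFrame and declares expectations; the field names and legal device types here are assumptions for illustration):

```python
# Hedged sketch: not-null + allowed-value checks, splitting a micro-batch
# into valid rows (continue downstream) and invalid rows (quarantine).
LEGAL_DEVICE_TYPES = {"desktop", "mobile", "tablet"}
REQUIRED_FIELDS = ("user_id", "url", "device_type")

def check_quality(record: dict) -> bool:
    if any(record.get(f) is None for f in REQUIRED_FIELDS):
        return False
    return record["device_type"] in LEGAL_DEVICE_TYPES

events = [
    {"user_id": 101, "url": "example.com/page-a", "device_type": "mobile"},
    {"user_id": None, "url": "example.com/page-b", "device_type": "desktop"},
    {"user_id": 103, "url": "example.com/cart", "device_type": "fridge"},
]
valid = [e for e in events if check_quality(e)]
invalid = [e for e in events if not check_quality(e)]
```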
  29. Spark write to Iceberg: define the Iceberg database/table schema with the catalog (Nessie). You can write to different silver Iceberg tables in the same Spark Structured Streaming micro-batch. At this point the data should have a certain level of quality & consistency.
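The table definition and the per-batch write can be sketched like this; the silver table name, columns, and partitioning are illustrative assumptions, while `writeTo(...).append()` is Spark's DataFrameWriterV2 API:

```python
# Hedged sketch: create the silver table once, then append each micro-batch.
create_silver = """
CREATE TABLE IF NOT EXISTS nessie.silver.click_event (
    user_id BIGINT,
    url STRING,
    device_type STRING,
    event_ts TIMESTAMP
)
USING iceberg
PARTITIONED BY (hours(event_ts))
"""

# Inside process_batch, with a live session:
# spark.sql(create_silver)
# clean_df.writeTo("nessie.silver.click_event").append()
```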
  30. Trino connect to Iceberg. Ref: Trino Based Architecture. Trino is a high-performance, distributed SQL query engine for big data analytics. Its key feature is query federation, which allows you to run a single SQL query to access, join, and analyze data from multiple diverse sources, such as data lakes, databases, and streaming systems.
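Query federation can be made concrete with a single query joining an Iceberg table against an operational database; the catalog, schema, and table names are illustrative, and the commented client usage assumes the `trino` Python package:

```python
# Hedged sketch: one SQL statement spanning two Trino catalogs
# (iceberg lakehouse table joined with a PostgreSQL table).
federated_query = """
SELECT c.user_id, c.url, u.plan
FROM iceberg.silver.click_event AS c
JOIN postgresql.public.users AS u
  ON c.user_id = u.user_id
"""

# With the trino Python client installed (host/port/user are placeholders):
# import trino
# conn = trino.dbapi.connect(host="trino", port=8080, user="analyst")
# cur = conn.cursor()
# cur.execute(federated_query)
# rows = cur.fetchall()
```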
  31. DBT to Connect Trino. Ref: What is DBT? DBT (data build tool) is an open-source transformation tool that enables the "T" in the ELT (Extract, Load, Transform) process. It allows data teams to transform raw data already inside a cloud data warehouse into reliable, analysis-ready datasets using simple SQL select statements. By bringing software engineering best practices, like version control, testing, and documentation, to the analytics workflow, DBT helps build trusted and maintainable data models.
  32. DBT to Connect Trino: use a profile to connect to the corresponding environment's Trino cluster; define ELT models for analysis.
  33. DBT to Connect Trino: DBT generates data management documentation and a data lineage graph to trace data.
  34. Shift-Left. Ref: Shift left to write data once, read as tables or streams
  35. Shift-Left Extend. Ref: Shift left to write data once, read as tables or streams
  36. Shift-Left Extend. Ref: Shift left to write data once, read as tables or streams
  37. Takeaways. Architecture: Shift-Left vs. Medallion. Iceberg: OTF logic & key features. ETLT: use common tools to build an Iceberg-based Shift-Left pipeline.
  38. Next Steps. Data Orchestrator: automated scheduling, monitoring, error handling. Data Observability: moving from knowing what broke to knowing what's breaking. Data Governance: handling data policy, security, lineage and management. Next-Gen Python Data Tools: as trends evolve, a variety of new analytical tools will be developed.
  39. Resources: ❖ What is Shift Left? ❖ Shift left to write data once, read as tables or streams ❖ Shift Left: The Key to Faster, Smarter, and More Efficient Data Pipelines ❖ Apache Iceberg ❖ Introduction to the Iceberg Data Lakehouse ❖ Apache Iceberg and PySpark ❖ Introduction to Apache Iceberg In Trino ❖ First dbt-trino data pipeline
  40. Thanks for Listening! CREDITS: This presentation template was created by Slidesgo, including icons by Flaticon, and infographics & images by Freepik.