Apache Paimon Demystified

Open Data Driven

March 26, 2025

Transcript

  1. Self Introduction
    • 肖志彦 / Zhiyan Xiao
    • 🏠 Chongqing, China: 8D Magic City, Hot Pot
    • 👨‍🎓 The Chinese University of Hong Kong: Math, Information Engineering
    • 👨‍💻 Software Engineer @ LY Corporation: Streaming Data Pipeline, Spark, Iceberg
    • ✨ Improving big data systems with Rust: Hadoop Client, Hive Client, SASL / Kerberos, InnoFile, InnoTable
  2. Afterwards, we can
    • Deep dive into the internals of Paimon
    • Consider how to utilize Paimon in real business
  3. Issues in directly handling data files
    • No explicit schema
    • Schema evolution is not well defined
    • Full data scanning is needed when
      • reading records with particular filters
      • deleting or updating certain records
    • Missing advanced features like ACID, Time Travel, etc.
  4. Benefits of Table Format
    • Define schema explicitly
    • Support schema evolution
    • Accelerate querying by skipping unnecessary data files
    • Support other advanced features like ACID, Time Travel, etc.
  5. Popular Table Formats
    • Hive
    • Lakehouse Table Format
      • Iceberg
      • Delta Lake
      • Hudi
    • Realtime Lakehouse (Streamhouse) Table Format
      • Paimon
  6. Hive Table Format
    • Skip unnecessary data files with partition and bucket
    • Partition
      • Mapping “partition columns” to “data directories”
      • /table/year=2025/month=3
    • Bucket
      • Mapping “bucket columns” to “data files with corresponding hash”
      • /table/year=2025/month=3/part-00000
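
The partition and bucket mapping on slide 6 can be sketched in a few lines of Python. This is a minimal illustration, not Hive's actual implementation: the function names are hypothetical, CRC32 stands in for Hive's own hash function, and the `part-NNNNN` file name simply follows the example path on the slide.

```python
import zlib


def partition_dir(table_root, partition_values):
    """Map partition column values to a Hive-style directory path."""
    parts = [f"{col}={val}" for col, val in partition_values.items()]
    return "/".join([table_root.rstrip("/")] + parts)


def bucket_id(bucket_key, num_buckets):
    """Map a bucket column value to a bucket number.

    Assumption: CRC32 is only a deterministic stand-in for Hive's real hash.
    """
    return zlib.crc32(str(bucket_key).encode("utf-8")) % num_buckets


def data_file_path(table_root, partition_values, bucket_key, num_buckets):
    """Combine the partition directory and the bucket file name."""
    directory = partition_dir(table_root, partition_values)
    return f"{directory}/part-{bucket_id(bucket_key, num_buckets):05d}"


if __name__ == "__main__":
    print(partition_dir("/table", {"year": 2025, "month": 3}))
    print(data_file_path("/table", {"year": 2025, "month": 3}, "user-42", 8))
```

A query that filters on `year` and `month` only needs to list one directory, and a lookup on the bucket column only needs to open the one file whose bucket number matches the hash of the value.
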
  7. Issues of Hive Table Format
    • Difficult for Hive Metastore to support a huge number of partitions
    • Query optimization is limited to partition and bucket level
    • Missing advanced features like ACID, full schema evolution, etc.
  8. Lakehouse Table Format (take Iceberg as an example)
    • Skip unnecessary data files with metadata
      • metadata.json
      • manifest-list.avro (contains stats to skip manifest files)
      • manifest.avro (contains stats to skip data files)
      • data-files.parquet
    • Support advanced features like ACID, full schema evolution, etc.
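
The two-level skipping described on slide 8 can be sketched as follows. This is a simplified model under stated assumptions, not Iceberg's reader: the `DataFile`, `Manifest`, and `plan_scan` names are hypothetical, and the stats are reduced to a min/max range on a single integer column.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataFile:
    path: str
    min_value: int  # smallest value of the filtered column in this file
    max_value: int  # largest value of the filtered column in this file


@dataclass
class Manifest:
    path: str
    files: List[DataFile]  # assumed non-empty in this sketch

    @property
    def min_value(self):
        return min(f.min_value for f in self.files)

    @property
    def max_value(self):
        return max(f.max_value for f in self.files)


def plan_scan(manifests, lo, hi):
    """Return paths of data files whose [min, max] range overlaps [lo, hi]."""
    selected = []
    for manifest in manifests:
        # Level 1: manifest-level stats let us skip a whole manifest file.
        if manifest.max_value < lo or manifest.min_value > hi:
            continue
        for data_file in manifest.files:
            # Level 2: per-file stats let us skip individual data files.
            if data_file.max_value < lo or data_file.min_value > hi:
                continue
            selected.append(data_file.path)
    return selected


if __name__ == "__main__":
    manifests = [
        Manifest("m1.avro", [DataFile("a.parquet", 0, 9), DataFile("b.parquet", 10, 19)]),
        Manifest("m2.avro", [DataFile("c.parquet", 20, 29), DataFile("d.parquet", 30, 39)]),
        Manifest("m3.avro", [DataFile("e.parquet", 100, 199)]),
    ]
    print(plan_scan(manifests, 15, 25))
```

Only files that might contain matching rows are read; the rest are pruned from metadata alone, without opening them.
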
  9. Issues of Lakehouse Table Format
    For high-throughput streaming write,
    • Iceberg
      • Commits are atomic and inefficient for frequent writes
    • Delta Lake, Hudi
      • Update/delete relies on Merge-on-Read with expensive compaction
  10. Paimon Table Format
    • Skip unnecessary data files with metadata (manifest, index file)
    • Support high-throughput streaming write and low latency read
      • with LSM Tree (log-structured merge-tree) storage model
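
To make the LSM tree idea on slide 10 concrete, here is a toy in-memory sketch. It is not Paimon's storage engine: the `TinyLSM` class and its parameters are hypothetical, sorted runs are plain Python lists rather than files, and compaction merges everything into a single run.

```python
import bisect


class TinyLSM:
    """Toy LSM tree: a write buffer plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}  # in-memory write buffer
        self.runs = []      # sorted runs of (key, value), oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes only touch the buffer, which keeps them cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Persist the buffer as a new sorted run; existing runs are never modified.
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Search runs from newest to oldest so the latest write wins.
        for run in reversed(self.runs):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        # Merge all runs into one, keeping only the newest value per key.
        merged = {}
        for run in self.runs:  # oldest first, so later runs overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []


if __name__ == "__main__":
    lsm = TinyLSM(memtable_limit=2)
    for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
        lsm.put(key, value)
    print(lsm.get("a"), len(lsm.runs))
    lsm.compact()
    print(lsm.runs)
```

Writes are append-only (buffer, then a new sorted run), which is what makes frequent small writes cheap; reads pay by checking several runs until compaction merges them.
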
  11. Paimon Table Types
    • Table with Primary Key
    • Table without Primary Key
    • View
    • Format Table
    • Object Table
    • Materialized Table
  12. Usage of Paimon
    • High-throughput streaming write
    • Streaming read from change log
    • Unified streaming-batch pipeline
    • (Realtime Lakehouse) Streamhouse Architecture
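
As a rough illustration of what "streaming read from change log" on slide 12 means for a downstream consumer, the sketch below applies a stream of change records to a keyed state. The row-kind markers (`+I`, `-U`, `+U`, `-D`) follow Flink's changelog convention; the `apply_changelog` function and the record layout are hypothetical and do not represent Paimon's reader API.

```python
def apply_changelog(state, changes):
    """Apply (kind, key, row) change records to a dict keyed by primary key."""
    for kind, key, row in changes:
        if kind in ("+I", "+U"):
            # Insert, or the new image of an update.
            state[key] = row
        elif kind in ("-U", "-D"):
            # Old image of an update, or a delete: retract the row.
            state.pop(key, None)
        else:
            raise ValueError(f"unknown row kind: {kind}")
    return state


if __name__ == "__main__":
    changes = [
        ("+I", 1, {"name": "a"}),
        ("-U", 1, {"name": "a"}),
        ("+U", 1, {"name": "b"}),
        ("+I", 2, {"name": "c"}),
        ("-D", 2, {"name": "c"}),
    ]
    print(apply_changelog({}, changes))
```

Because the consumer receives only the changes, it can keep its result up to date incrementally instead of rescanning the table after every commit.
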
  13. Potential Future Actions
    • Comprehensively compare behaviors of Iceberg and Paimon
    • Create PoC with Paimon to demonstrate its benefits
    • Smoothly upgrade from Lakehouse to Streamhouse
  14. Fin