
Flink-based Iceberg Real-Time Data Lake in SmartNews

Qingyu Ji
October 11, 2023

Transcript

  1. Flink-based Iceberg Real-Time Data Lake in SmartNews
     Qingyu Ji | Data Platform, Apache Iceberg Contributor
  2. 01 SmartNews Data Lake Introduction
     02 Iceberg v1 Solution
     03 Flink-based (Iceberg v2) Solution
     04 Small File Optimization
     05 Summary
  3. Advertisement Data Lake
     - Dimension data
     - Click/Conversion: real-time update, stored in Hive/Kafka
     - Advertiser/Statistics: real-time/hourly update, stored in MySQL/Hive
  4. Challenges Solved (Iceberg v1)
     - Spark job handles de-duplication by id and timestamp update (see the sketch below)
     - Iceberg supports concurrent write and read
     - Hourly update
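A minimal sketch of the hourly de-duplication idea from this slide, not the exact SmartNews job; the catalog, table, and column names (iceberg_catalog.ads.events, events_staging, id, event_ts, payload) are hypothetical.

```java
// Sketch only: keep the latest record per id (by timestamp), then overwrite the Iceberg table.
import org.apache.spark.sql.SparkSession;

public class HourlyDedupSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-v1-hourly-dedup")
        .getOrCreate();

    // Hypothetical table/column names; ROW_NUMBER keeps only the newest row per id.
    spark.sql(
        "INSERT OVERWRITE iceberg_catalog.ads.events "
      + "SELECT id, event_ts, payload FROM ("
      + "  SELECT id, event_ts, payload, "
      + "         ROW_NUMBER() OVER (PARTITION BY id ORDER BY event_ts DESC) AS rn "
      + "  FROM iceberg_catalog.ads.events_staging"
      + ") t WHERE rn = 1");

    spark.stop();
  }
}
```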
  5. Problems
     - Too much infra resource
     - Duplicated calculation: only 1% of rows need updating
     - Storage waste: full data overwrite on every batch update
     - Concurrent write contention (table lock in Iceberg)
  6. Solution Comparison
                            Spark + Iceberg v1                   Flink + Iceberg v2
     Write Mode             Overwrite                            Upsert
     Output File Number     Controlled by Spark configuration    Massive small files
     Calculation            Full calculation                     Merge on Read, incremental calculation
     Effectiveness          Hourly                               Minute
  7. Iceberg Sink - Upsert Mode
     - Every update generates two records: a delete and an insert
     - Storage waste
     - Heavy pressure on the Writer operator (constantly at 100% CPU)
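A minimal sketch of enabling upsert mode with Iceberg's Flink sink, as used in this solution; the table location and the equality column name ("id") are assumptions.

```java
// Sketch only: wire a RowData stream into an Iceberg table in upsert mode.
import java.util.Arrays;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

public class UpsertSinkSketch {
  public static void build(DataStream<RowData> input) {
    // Hypothetical table location.
    TableLoader tableLoader =
        TableLoader.fromHadoopTable("hdfs://namenode/warehouse/ads/events");

    FlinkSink.forRowData(input)
        .tableLoader(tableLoader)
        // Primary key used to pair the delete/insert records produced by every update.
        .equalityFieldColumns(Arrays.asList("id"))
        // Upsert mode: each incoming row is written as an equality delete plus an insert.
        .upsert(true)
        .append();
  }
}
```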
  8. EqualityFieldKeySelector
     - Shuffle by record primary key
     - Multiple writers handle data under the same partition
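EqualityFieldKeySelector is part of Iceberg's Flink sink; below is a simplified illustration of the idea rather than its actual implementation. The primary-key field position (idPos) is an assumption.

```java
// Simplified sketch: key (and thus shuffle) records by their primary-key column,
// so rows of the same table partition can spread across multiple writer subtasks.
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.table.data.RowData;

public class EqualityFieldKeySketch implements KeySelector<RowData, Integer> {
  private final int idPos;  // hypothetical position of the "id" column in the RowData

  public EqualityFieldKeySketch(int idPos) {
    this.idPos = idPos;
  }

  @Override
  public Integer getKey(RowData row) {
    // The hash of the primary-key value decides which writer subtask receives the record.
    return Long.hashCode(row.getLong(idPos));
  }
}
```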
  9. Flink checkpoint interval is 20 mins, 10 writers
     Partition          Record Volume   New Files Generated Every Hour
     ts=2022-10-01-23   xxx M           90 = 3 (checkpoints per hour) * 10 (writers) * 3 files (data file / equality delete / position delete)
     ts=2022-10-01-22   xx M            90
     …                  …               …
     ts=2022-09-27-00   x K             90
  10. PartitionKeySelector
     - Shuffle by record partition key
     - Only 1 writer under the same partition
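PartitionKeySelector is likewise part of Iceberg's Flink sink; this is a simplified illustration of the idea. The partition column position (tsPos) is an assumption.

```java
// Simplified sketch: key records by their partition value (e.g. the hourly ts bucket),
// so every record of a given partition lands on a single writer subtask.
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.table.data.RowData;

public class PartitionKeySketch implements KeySelector<RowData, String> {
  private final int tsPos;  // hypothetical position of the partition column in the RowData

  public PartitionKeySketch(int tsPos) {
    this.tsPos = tsPos;
  }

  @Override
  public String getKey(RowData row) {
    // All records with the same partition string are routed to the same writer.
    return row.getString(tsPos).toString();
  }
}
```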
  11. Flink checkpoint interval is 20 mins, 10 writers
     Partition          Record Volume   New Files Generated Every Hour
     ts=2022-10-01-23   xxx M           9 = 3 (checkpoints per hour) * 3 files (records with the same partition are shuffled to the same writer)   <- BackPressure
     ts=2022-10-01-22   xx M            9   <- BackPressure
     ts=2022-10-01-21   x M             9
     …                  …               …
     ts=2022-09-27-00   x K             9
  12. Dynamic Shuffle Operator
     Partition          Record Volume   Shuffle Strategy
     ts=2022-10-01-23   xxx M           EqualityFieldKeySelector
     ts=2022-10-01-22   xx M            EqualityFieldKeySelector
     ts=2022-10-01-21   x M             PartitionKeySelector
     …                  …               …
     ts=2022-09-27-00   x K             PartitionKeySelector
  13. DynamicShuffleKeySelector (see the sketch below)
     - Dynamically allocates the ShuffleStrategy for the current partition
     - Selects the ShuffleStrategy based on historical statistics
     - Ensures all subtasks of the same Flink operator use the same ShuffleStrategy
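A hedged sketch of the DynamicShuffleKeySelector idea: hot partitions (by historical record volume) are shuffled by primary key so several writers share the load, while cold partitions are shuffled by partition key so each produces files from a single writer. The statistics map, the threshold, and the field positions are assumptions, not the actual implementation.

```java
// Sketch only: choose between the two shuffle strategies per partition.
import java.util.Map;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.table.data.RowData;

public class DynamicShuffleKeySketch implements KeySelector<RowData, Integer> {
  private final int idPos;                    // hypothetical primary-key column position
  private final int tsPos;                    // hypothetical partition column position
  private final long hotPartitionThreshold;   // hypothetical record-volume cutoff
  // Historical records-per-partition statistics; every subtask must see the same
  // map so they all pick the same strategy for a given partition.
  private final Map<String, Long> partitionStats;

  public DynamicShuffleKeySketch(int idPos, int tsPos, long hotPartitionThreshold,
                                 Map<String, Long> partitionStats) {
    this.idPos = idPos;
    this.tsPos = tsPos;
    this.hotPartitionThreshold = hotPartitionThreshold;
    this.partitionStats = partitionStats;
  }

  @Override
  public Integer getKey(RowData row) {
    String partition = row.getString(tsPos).toString();
    long volume = partitionStats.getOrDefault(partition, 0L);
    if (volume >= hotPartitionThreshold) {
      // Hot partition: behave like EqualityFieldKeySelector (spread across writers).
      return Long.hashCode(row.getLong(idPos));
    }
    // Cold partition: behave like PartitionKeySelector (single writer per partition).
    return partition.hashCode();
  }
}
```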
  14. Experiment Result
     - Benchmark: number and average size of new files generated during the first 24 hours
     - Flink parallelism is 20
  15. New File Number Every Hour
     TS Offset (Hour)   No Shuffle   Dynamic Shuffle
     +0                 120          72
     +1                 152          33
     +2                 108          9
     +3                 51           9
     +4                 36           9
     +5                 34           9
  16. New File Average Size Every Hour
     TS Offset (Hour)   No Shuffle   Dynamic Shuffle
     +0                 34 MB        60 MB
     +1                 24 MB        40 MB
     +2                 1 MB         3 MB
     +3                 100 KB       600 KB
     +4                 60 KB        300 KB
     +5                 10 KB        50 KB
     +6                 20 KB        50 KB
  17. Summary
     - The Flink-based solution reduced total infra cost by 50%
     - Data effectiveness improved from hourly to minutes
     - The DynamicShuffleOperator can be further optimized by using writer throughput to allocate the ShuffleStrategy