
Flink-based Iceberg Real-Time Data Lake in SmartNews

Qingyu Ji
October 11, 2023

Transcript

  1. Flink-based Iceberg Real-Time Data Lake in SmartNews
     Qingyu Ji | Data Platform, Apache Iceberg Contributor
  2. 01 SmartNews Data Lake Introduction
     02 Iceberg v1 Solution
     03 Flink-based (Iceberg v2) Solution
     04 Small File Optimization
     05 Summary
  3. Advertisement Data Lake
     - Dimension data
     - Click/Conversion: real-time update, stored in Hive/Kafka
     - Advertiser/Statistics: real-time/hourly update, stored in MySQL/Hive
  4. Challenges Solved (Iceberg v1)
     - Spark job handles de-duplication by id and timestamp update (see the sketch below)
     - Iceberg supports concurrent write and read
     - Hourly update
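A minimal sketch of the hourly de-duplication idea from this slide, not the exact SmartNews job; the catalog, table, and column names (iceberg_catalog.ads.events, events_staging, id, event_ts, payload) are hypothetical.

```java
// Sketch only: keep the latest record per id (by timestamp), then overwrite the Iceberg table.
import org.apache.spark.sql.SparkSession;

public class HourlyDedupSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-v1-hourly-dedup")
        .getOrCreate();

    // Hypothetical table/column names; ROW_NUMBER keeps only the newest row per id.
    spark.sql(
        "INSERT OVERWRITE iceberg_catalog.ads.events "
      + "SELECT id, event_ts, payload FROM ("
      + "  SELECT id, event_ts, payload, "
      + "         ROW_NUMBER() OVER (PARTITION BY id ORDER BY event_ts DESC) AS rn "
      + "  FROM iceberg_catalog.ads.events_staging"
      + ") t WHERE rn = 1");

    spark.stop();
  }
}
```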
  5. Problems
     - Too much infra resource
     - Duplicated calculation: only 1% of rows need updating
     - Storage waste: full data overwrite on every batch update
     - Concurrent write contention (table lock in Iceberg)
  6. Solution Comparison
                            Spark + Iceberg v1                   Flink + Iceberg v2
     Write Mode             Overwrite                            Upsert
     Output File Number     Controlled by Spark configuration    Massive small files
     Calculation            Full calculation                     Merge on Read, incremental calculation
     Effectiveness          Hourly                               Minute
  7. Iceberg Sink - Upsert Mode
     - Every update generates two records: a delete and an insert
     - Storage waste
     - Heavy pressure on the Writer operator (constantly at 100% CPU)
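A minimal sketch of enabling upsert mode with Iceberg's Flink sink, as used in this solution; the table location and the equality column name ("id") are assumptions.

```java
// Sketch only: wire a RowData stream into an Iceberg table in upsert mode.
import java.util.Arrays;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

public class UpsertSinkSketch {
  public static void build(DataStream<RowData> input) {
    // Hypothetical table location.
    TableLoader tableLoader =
        TableLoader.fromHadoopTable("hdfs://namenode/warehouse/ads/events");

    FlinkSink.forRowData(input)
        .tableLoader(tableLoader)
        // Primary key used to pair the delete/insert records produced by every update.
        .equalityFieldColumns(Arrays.asList("id"))
        // Upsert mode: each incoming row is written as an equality delete plus an insert.
        .upsert(true)
        .append();
  }
}
```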
  8. EqualityFieldKeySelector
     - Shuffle by record primary key
     - Multiple writers handle data under the same partition
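EqualityFieldKeySelector is part of Iceberg's Flink sink; below is a simplified illustration of the idea rather than its actual implementation. The primary-key field position (idPos) is an assumption.

```java
// Simplified sketch: key (and thus shuffle) records by their primary-key column,
// so rows of the same table partition can spread across multiple writer subtasks.
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.table.data.RowData;

public class EqualityFieldKeySketch implements KeySelector<RowData, Integer> {
  private final int idPos;  // hypothetical position of the "id" column in the RowData

  public EqualityFieldKeySketch(int idPos) {
    this.idPos = idPos;
  }

  @Override
  public Integer getKey(RowData row) {
    // The hash of the primary-key value decides which writer subtask receives the record.
    return Long.hashCode(row.getLong(idPos));
  }
}
```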
  9. Flink checkpoint interval is 20 mins, 10 writers
     Partition          Record Volume   New Files Generated Every Hour
     ts=2022-10-01-23   xxx M           90 = 3 (checkpoints per hour) * 10 (writers) * 3 files (data file / equality delete / position delete)
     ts=2022-10-01-22   xx M            90
     …                  …               …
     ts=2022-09-27-00   x K             90
  10. PartitionKeySelector
     - Shuffle by record partition key
     - Only 1 writer under the same partition
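PartitionKeySelector is likewise part of Iceberg's Flink sink; this is a simplified illustration of the idea. The partition column position (tsPos) is an assumption.

```java
// Simplified sketch: key records by their partition value (e.g. the hourly ts bucket),
// so every record of a given partition lands on a single writer subtask.
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.table.data.RowData;

public class PartitionKeySketch implements KeySelector<RowData, String> {
  private final int tsPos;  // hypothetical position of the partition column in the RowData

  public PartitionKeySketch(int tsPos) {
    this.tsPos = tsPos;
  }

  @Override
  public String getKey(RowData row) {
    // All records with the same partition string are routed to the same writer.
    return row.getString(tsPos).toString();
  }
}
```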
  11. Flink checkpoint interval is 20 mins, 10 writers
     Partition          Record Volume   New Files Generated Every Hour
     ts=2022-10-01-23   xxx M           9 = 3 (checkpoints per hour) * 3 files (records with the same partition are shuffled to the same writer)   <- BackPressure
     ts=2022-10-01-22   xx M            9   <- BackPressure
     ts=2022-10-01-21   x M             9
     …                  …               …
     ts=2022-09-27-00   x K             9
  12. Dynamic Shuffle Operator
     Partition          Record Volume   Shuffle Strategy
     ts=2022-10-01-23   xxx M           EqualityFieldKeySelector
     ts=2022-10-01-22   xx M            EqualityFieldKeySelector
     ts=2022-10-01-21   x M             PartitionKeySelector
     …                  …               …
     ts=2022-09-27-00   x K             PartitionKeySelector
  13. DynamicShuffleKeySelector (see the sketch below)
     - Dynamically allocates the ShuffleStrategy for the current partition
     - Selects the ShuffleStrategy based on historical statistics
     - Ensures all subtasks of the same Flink operator use the same ShuffleStrategy
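A hedged sketch of the DynamicShuffleKeySelector idea: hot partitions (by historical record volume) are shuffled by primary key so several writers share the load, while cold partitions are shuffled by partition key so each produces files from a single writer. The statistics map, the threshold, and the field positions are assumptions, not the actual implementation.

```java
// Sketch only: choose between the two shuffle strategies per partition.
import java.util.Map;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.table.data.RowData;

public class DynamicShuffleKeySketch implements KeySelector<RowData, Integer> {
  private final int idPos;                    // hypothetical primary-key column position
  private final int tsPos;                    // hypothetical partition column position
  private final long hotPartitionThreshold;   // hypothetical record-volume cutoff
  // Historical records-per-partition statistics; every subtask must see the same
  // map so they all pick the same strategy for a given partition.
  private final Map<String, Long> partitionStats;

  public DynamicShuffleKeySketch(int idPos, int tsPos, long hotPartitionThreshold,
                                 Map<String, Long> partitionStats) {
    this.idPos = idPos;
    this.tsPos = tsPos;
    this.hotPartitionThreshold = hotPartitionThreshold;
    this.partitionStats = partitionStats;
  }

  @Override
  public Integer getKey(RowData row) {
    String partition = row.getString(tsPos).toString();
    long volume = partitionStats.getOrDefault(partition, 0L);
    if (volume >= hotPartitionThreshold) {
      // Hot partition: behave like EqualityFieldKeySelector (spread across writers).
      return Long.hashCode(row.getLong(idPos));
    }
    // Cold partition: behave like PartitionKeySelector (single writer per partition).
    return partition.hashCode();
  }
}
```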
  14. Experiment Result
     - Benchmark: number and average size of new files generated during the first 24 hours
     - Flink parallelism is 20
  15. New File Number Every Hour
     TS Offset (Hour)   No Shuffle   Dynamic Shuffle
     +0                 120          72
     +1                 152          33
     +2                 108          9
     +3                 51           9
     +4                 36           9
     +5                 34           9
  16. New File Average Size Every Hour
     TS Offset (Hour)   No Shuffle   Dynamic Shuffle
     +0                 34 MB        60 MB
     +1                 24 MB        40 MB
     +2                 1 MB         3 MB
     +3                 100 KB       600 KB
     +4                 60 KB        300 KB
     +5                 10 KB        50 KB
     +6                 20 KB        50 KB
  17. Summary
     - The Flink-based solution reduced total infra cost by 50%
     - Data effectiveness improved from hourly to minutes
     - The DynamicShuffleOperator can be further optimized by using writer throughput to allocate the ShuffleStrategy