Slide 1

Slide 1 text

Flink-based Iceberg Real-Time Data Lake in SmartNews Qingyu Ji | Data Platform, Apache Iceberg Contributor

Slide 2

Slide 2 text

01 SmartNews Data Lake Introduction 02 Iceberg v1 Solution 03 Flink-based (Iceberg v2) Solution 04 Small File Optimization 05 Summary

Slide 3

Slide 3 text

01 SmartNews Data Lake Introduction

Slide 4

Slide 4 text

- 2012: Started in Tokyo
- 2014: New York/San Francisco/Palo Alto offices
- 2019: Shanghai/Beijing offices

Slide 5

Slide 5 text

Advertisement Data Lake
- Click/Conversion data: real-time updates, stored in Hive/Kafka
- Dimension data (Advertiser/Statistics): real-time/hourly updates, stored in MySQL/Hive

Slide 6

Slide 6 text

Challenge
- Deduplicate by AD ID
- Update click/conversion timestamps
- Support downstream real-time reads

Slide 7

Slide 7 text

02 Iceberg v1 Solution

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Challenge Solved
- Spark job handles de-duplication by ID and timestamp updates
- Iceberg supports concurrent write and read
- Hourly updates
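The core de-duplication rule can be sketched in plain Java: keep one row per AD ID, preferring the latest click/conversion timestamp. This is a minimal illustration of the logic only; in the v1 solution it runs as a Spark batch job, and the class and method names here are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the v1 de-duplication rule: one row per AD ID,
 *  keeping the latest click/conversion timestamp.
 *  Illustrative only; the real pipeline runs this as a Spark batch job. */
public class DedupSketch {
    public record AdEvent(String adId, long eventTs) {}

    /** Collapse events to one per adId, keeping the newest timestamp. */
    public static Map<String, AdEvent> dedupe(Iterable<AdEvent> events) {
        Map<String, AdEvent> latest = new HashMap<>();
        for (AdEvent e : events) {
            latest.merge(e.adId(), e,
                (oldE, newE) -> newE.eventTs() >= oldE.eventTs() ? newE : oldE);
        }
        return latest;
    }

    public static void main(String[] args) {
        Map<String, AdEvent> out = dedupe(List.of(
            new AdEvent("ad-1", 100),
            new AdEvent("ad-1", 250),   // later conversion updates the timestamp
            new AdEvent("ad-2", 120)));
        System.out.println(out.get("ad-1").eventTs()); // 250
    }
}
```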

Slide 10

Slide 10 text

Problem
- Too much infra resource
- Duplicated calculation: only ~1% of rows need updating
- Storage waste: full data overwrite on every batch update
- Concurrent write contention (lock in Iceberg)

Slide 11

Slide 11 text

03 Flink-based Iceberg v2 Solution

Slide 12

Slide 12 text

Solution
- Iceberg v2 supports row-level updates
- Flink real-time ingestion with Merge On Read
- MySQL CDC for the dimension join
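The merge-on-read behavior can be sketched as a toy model: an upsert appends an equality delete for the key plus a new insert, and the reader merges the log at scan time. This is an assumption-laden illustration, not Iceberg's implementation; real Iceberg v2 tracks deletes and inserts in separate files with sequence numbers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of Iceberg v2 merge-on-read upserts. Illustrative only:
 *  real Iceberg stores deletes/inserts in delete and data files. */
public class MergeOnReadSketch {
    sealed interface Op permits Delete, Insert {}
    record Delete(String key) implements Op {}
    record Insert(String key, String value) implements Op {}

    private final List<Op> log = new ArrayList<>();

    /** Upsert = equality delete + insert; nothing is rewritten in place. */
    public void upsert(String key, String value) {
        log.add(new Delete(key));
        log.add(new Insert(key, value));
    }

    /** Merge-on-read: replay deletes/inserts in commit order. */
    public Map<String, String> read() {
        Map<String, String> view = new LinkedHashMap<>();
        for (Op op : log) {
            if (op instanceof Delete d) view.remove(d.key());
            else if (op instanceof Insert i) view.put(i.key(), i.value());
        }
        return view;
    }

    public static void main(String[] args) {
        MergeOnReadSketch table = new MergeOnReadSketch();
        table.upsert("ad-1", "click@100");
        table.upsert("ad-1", "conversion@250"); // row-level update, no full overwrite
        System.out.println(table.read().get("ad-1")); // conversion@250
    }
}
```

Contrast this with the v1 flow, where every batch rewrites the full table even though only ~1% of rows change.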

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

Solution Comparison

                     Spark + Iceberg v1                  Flink + Iceberg v2
Write Mode           Overwrite                           Upsert
Output File Number   Controlled by Spark configuration   Massive small files
Calculation          Full calculation                    Merge on Read, incremental calculation
Effectiveness        Hourly                              Minute

Slide 15

Slide 15 text

04 Small File Optimization

Slide 16

Slide 16 text

Iceberg Sink - Upsert Mode
- Every update generates two records: Delete + Insert
- Storage waste
- Heavy pressure on the Writer operator (constantly at 100% CPU)

Slide 17

Slide 17 text

Flink Generated RowData

Slide 18

Slide 18 text

Iceberg Flink Sink

Slide 19

Slide 19 text

EqualityFieldKeySelector
- Shuffle by the record's primary key
- Multiple writers handle data under the same partition

Slide 20

Slide 20 text

Flink checkpoint interval is 20 min, with 10 writers (3 checkpoints per hour).

Partition          Record Volume   New Files Generated Every Hour
ts=2022-10-01-23   xxx M           3 (checkpoints) x 10 (writers) x 3 files (data / equality delete / position delete) = 90
ts=2022-10-01-22   xx M            90
…                  …               …
ts=2022-09-27-00   x K             90

Slide 21

Slide 21 text

PartitionKeySelector
- Shuffle by the record's partition key
- Only 1 writer under the same partition
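The two shuffle strategies can be contrasted with a small sketch: equality-field routing hashes the primary key (so one partition's rows spread across many writers), while partition routing hashes the partition key (so each partition lands on exactly one writer). The class and method names are illustrative, not Iceberg's internals.

```java
/** Sketch contrasting the two shuffle strategies.
 *  Hashing by AD ID spreads a partition over many writers;
 *  hashing by partition key pins a partition to one writer. */
public class ShuffleStrategies {
    /** EqualityFieldKeySelector-style routing: by primary key (AD ID). */
    public static int byEqualityField(String adId, int numWriters) {
        return Math.floorMod(adId.hashCode(), numWriters);
    }

    /** PartitionKeySelector-style routing: by partition key. */
    public static int byPartition(String partition, int numWriters) {
        return Math.floorMod(partition.hashCode(), numWriters);
    }

    public static void main(String[] args) {
        int writers = 10;
        String partition = "ts=2022-10-01-23";
        // Same partition, different AD IDs: equality-field routing may split them
        // across writers, while partition routing always picks the same writer.
        System.out.println(byEqualityField("ad-1", writers));
        System.out.println(byEqualityField("ad-2", writers));
        System.out.println(byPartition(partition, writers) == byPartition(partition, writers)); // true
    }
}
```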

Slide 22

Slide 22 text

Flink checkpoint interval is 20 min, with 10 writers. Records with the same partition key are shuffled to the same writer.

Partition          Record Volume   New Files Generated Every Hour
ts=2022-10-01-23   xxx M           3 (checkpoints) x 3 files = 9   <- back pressure
ts=2022-10-01-22   xx M            9                               <- back pressure
ts=2022-10-01-21   x M             9
…                  …               …
ts=2022-09-27-00   x K             9

Hot partitions overload their single writer, causing back pressure.
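The file counts in the two tables follow from the same arithmetic: files per hour = checkpoints per hour x writers touching the partition x files per writer per checkpoint. A minimal sketch (hypothetical helper, not part of any Iceberg API):

```java
/** File-count arithmetic behind the two shuffle strategies:
 *  a 20-min checkpoint interval gives 3 checkpoints/hour, and each writer
 *  emits up to 3 files per checkpoint (data / equality delete / position delete). */
public class FileCount {
    public static int filesPerHour(int checkpointsPerHour,
                                   int writersTouchingPartition,
                                   int filesPerCommit) {
        return checkpointsPerHour * writersTouchingPartition * filesPerCommit;
    }

    public static void main(String[] args) {
        // EqualityFieldKeySelector: all 10 writers touch every partition.
        System.out.println(filesPerHour(3, 10, 3)); // 90
        // PartitionKeySelector: only 1 writer touches a given partition.
        System.out.println(filesPerHour(3, 1, 3));  // 9
    }
}
```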

Slide 23

Slide 23 text

Dynamic Shuffle Operator

Partition          Record Volume   Shuffle Strategy
ts=2022-10-01-23   xxx M           EqualityFieldKeySelector
ts=2022-10-01-22   xx M            EqualityFieldKeySelector
ts=2022-10-01-21   x M             PartitionKeySelector
…                  …               …
ts=2022-09-27-00   x K             PartitionKeySelector

Slide 24

Slide 24 text

Dynamic Shuffle Operator

Slide 25

Slide 25 text

DynamicShuffleKeySelector
- Dynamically allocates a ShuffleStrategy based on the current partition
- Selects the ShuffleStrategy from historical statistics
- Ensures all subtasks under the same Flink operator use the same ShuffleStrategy
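The selection rule above can be sketched as follows: hot partitions (high historical volume, typically the current hours) use equality-field routing across many writers, while cold partitions fall back to partition routing so they produce fewer files. The threshold, field names, and class name are assumptions for illustration, not SmartNews's actual implementation.

```java
import java.util.Map;

/** Sketch of the dynamic shuffle idea: pick a ShuffleStrategy per partition
 *  from historical volume statistics. All subtasks must see the same stats
 *  so they agree on the strategy for a given partition. */
public class DynamicShuffleSketch {
    enum Strategy { EQUALITY_FIELD, PARTITION }

    private final Map<String, Long> hourlyRecordCount; // historical stats per partition
    private final long hotThreshold;                   // illustrative cutoff

    public DynamicShuffleSketch(Map<String, Long> stats, long hotThreshold) {
        this.hourlyRecordCount = stats;
        this.hotThreshold = hotThreshold;
    }

    /** Hot partitions spread across writers; cold ones pin to one writer. */
    public Strategy strategyFor(String partition) {
        long volume = hourlyRecordCount.getOrDefault(partition, 0L);
        return volume >= hotThreshold ? Strategy.EQUALITY_FIELD : Strategy.PARTITION;
    }

    public static void main(String[] args) {
        DynamicShuffleSketch selector = new DynamicShuffleSketch(Map.of(
            "ts=2022-10-01-23", 300_000_000L,   // hot: current hour
            "ts=2022-09-27-00", 5_000L),        // cold: 4-day-old partition
            1_000_000L);
        System.out.println(selector.strategyFor("ts=2022-10-01-23")); // EQUALITY_FIELD
        System.out.println(selector.strategyFor("ts=2022-09-27-00")); // PARTITION
    }
}
```

Unknown partitions default to PartitionKeySelector here, which matches the intuition that unseen partitions are likely low-volume backfills.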

Slide 26

Slide 26 text

Experiment Result
- Benchmark: number and average size of new files generated during the first 24 hours
- Flink parallelism is 20

Slide 27

Slide 27 text

New File Number Every Hour

TS Offset (Hour)   No Shuffle   Dynamic Shuffle
+0                 120          72
+1                 152          33
+2                 108          9
+3                 51           9
+4                 36           9
+5                 34           9

Slide 28

Slide 28 text

New File Average Size Every Hour

TS Offset (Hour)   No Shuffle   Dynamic Shuffle
+0                 34 MB        60 MB
+1                 24 MB        40 MB
+2                 1 MB         3 MB
+3                 100 KB       600 KB
+4                 60 KB        300 KB
+5                 10 KB        50 KB
+6                 20 KB        50 KB

Slide 29

Slide 29 text

05 Summary

Slide 30

Slide 30 text

Summary
- The Flink-based solution reduced total infra cost by 50%
- Data effectiveness (freshness) improved from hourly to minute-level
- DynamicShuffleOperator can be further optimized by using writer throughput to allocate the ShuffleStrategy

Slide 31

Slide 31 text

THANK YOU