Slide 1

Slide 2

Speaker: Tomoyuki Saito
- Senior Software Engineer, Data Platform dept.
- Joined LINE as a new graduate in 2015
- Career
  - Bot-backend server development
  - Log ingestion pipeline development (2016~)
  - Data platform development

Slide 3

Agenda
- First part – Problems in query processing
  - LINE's Data Platform
  - Problems in query processing
  - A table format – Apache Iceberg
- Second part – Revamping log data pipeline with Apache Iceberg

Slide 4

LINE's Data Platform

Slide 5

LINE's self-serve data platform: the Data Platform provides services for data science, machine learning, governance, and more, to democratize data for business consumers.

Slide 6

Typical data flow: Data sources → Ingest → Storage / Metadata (+ Process) → Serve → Data consumers

Slide 7

LINE's data flow: the same stages (Data sources → Ingest → Storage / Metadata (+ Process) → Serve → Data consumers), each at large scale, running on on-premise machines & data centers.

Slide 8

Scale – Quantity
- Number of machines: 5,000+
- Stored data volume: 290 PB+
- Hive tables: 40,000+

Slide 9

Scale – Throughput / Activities
- Log ingestion rate: 17.5 M+ records / s
- Jobs executed: 150,000+ jobs / day
- Platform users: 700+

Slide 10

Challenges: throughput and quantity pressure across the data flow (Data sources → Ingest → Storage / Metadata (+ Process) → Serve → Data consumers).

Slide 11

Challenges: throughput and quantity pressure across the data flow (Data sources → Ingest → Storage / Metadata (+ Process) → Serve → Data consumers).

Slide 12

Problems in query processing

Slide 13

Data ETL with SQL: distributed SQL query engines (Spark, Hive, Trino, Flink) run SQL against Storage / Metadata.

Slide 14

Distributed SQL execution: SELECT name FROM employee → Parse → Analysis → Optimization → Planning → Execution plan

Slide 15

Distributed SQL execution: SELECT name FROM employee → Parse → Analysis → Optimization → Planning → Execution plan. Planning must answer: what data files are in the table? how to read / write data?

Slide 16

Distributed SQL execution: SELECT name FROM employee → Parse → Analysis → Optimization → Planning → Execution plan. What data files are in the table? How to read / write data? The table format determines how to answer those questions.

Slide 17

Table format: determines a way to organize data files to present them as a table.
- Data discovery: answers what data files are in a table; hides the complexity of finding relevant files.
- Data abstraction: answers how to read and write data; hides the complexity of underlying data structures and formats.

Slide 18

The de-facto standard – Hive table format
- Table metadata is manipulated and queried via Hive Metastore's Thrift interface (e.g. create_table, get_partitions).
- Table metadata (schema, format, serde, partitions, stats) is stored in the Metastore DB (an RDBMS).
- Data files are defined as the entire contents of directories, e.g. /table/date=2021-10-01/*
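
To make the Thrift-based metadata flow concrete, here is a minimal sketch (not from the talk) of listing a table's partitions through the Hive Metastore Java client; the database and table names are hypothetical and error handling is omitted.

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Partition;

import java.util.List;

public class ListPartitionsExample {
    public static void main(String[] args) throws Exception {
        // Connects to the Hive Metastore configured in hive-site.xml (hive.metastore.uris).
        HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
        try {
            // Each partition is backed by a row in the Metastore DB's PARTITIONS table,
            // which is why tables with millions of partitions put pressure on the metastore.
            List<Partition> partitions =
                client.listPartitions("default", "sample_table", (short) 100); // hypothetical table
            for (Partition p : partitions) {
                System.out.println(p.getValues() + " -> " + p.getSd().getLocation());
            }
        } finally {
            client.close();
        }
    }
}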

Slide 19

LINE's ETL infrastructure: SQL engines (Spark, Hive, Trino) run interactive and scheduled / batch jobs against storage (HDFS) and the Hive Metastore (Thrift API, backed by the Metastore DB).

Slide 20

LINE's ETL infrastructure: SQL engines (Spark, Hive, Trino) run interactive and scheduled / batch jobs against storage (HDFS) and the Hive Metastore (Thrift API, backed by the Metastore DB), all of which operate at large scale.

Slide 21

Metastore DB – Regular QPS over a week

Slide 22

Metastore DB – Regular CPU usage over a week

Slide 23

Metastore DB – Abnormal CPU usage: risk of outages with a big blast radius
Observations
- Slow queries at the Metastore DB
- Scans over 11M+ rows against the 'PARTITIONS' table in the Metastore DB
- Corresponding to a query on a Hive table with 11M+ partitions, which had been created unintentionally
- A resolution: reduce the number of partitions

Slide 24

Hive table format limitation
- The format relies heavily on the Hive Metastore to store metadata; partitions are stored as rows in the Metastore DB.
- Limitation: loading many partitions causes problems. A table with O(10K) partitions puts high load and memory pressure on the Hive Metastore (see HIVE-13884).
- A workaround is necessary, such as reducing the number of partitions.

Slide 25

Diagnostics – Hive table format
- Bottleneck: table metadata lookup performance is bound to the capacity of the central Hive Metastore and Metastore DB instances.
- Inefficient data access: coarse-grained partitioning leads to scans over more data, and more data to write on partition regeneration.
- Fewer opportunities for optimization: unable to enrich metadata further, e.g. with per-file statistics for query optimization.

Slide 26

Apache Iceberg An open table format for huge analytic datasets

Slide 27

Apache Iceberg: an open-source (OSS) table format for huge analytic datasets. It sits between SQL query engines (Flink, Spark, Hive, Trino) and storage: file formats (Parquet, ORC, Avro) on HDFS or S3.

Slide 28

What an Iceberg table looks like

# Spark SQL
create table sample (id int) using iceberg;
insert into sample values (100);
insert into sample values (200);
select * from sample;

# Files in HDFS
sample
├── data
│   ├── 00000-2-26bcfac0-91ba-4374-a879-b780cf0608c3-00001.parquet
│   └── 00000-3-4bfb85d8-3283-48f7-980d-28ea115aed80-00001.parquet
└── metadata
    ├── 00000-811eaf6e-b0f4-4bd7-8f87-a6df1d543b34.metadata.json
    ├── 00001-4041324f-1920-44f4-8ce6-6088ec663e0a.metadata.json
    ├── 00002-66aac2ec-8f9a-4de8-a679-428bb970b1ff.metadata.json
    ├── 2a67328f-8386-4d1a-873a-1034824e22f8-m0.avro
    ├── 91e78f4a-f1df-414f-835d-45488001bba9-m0.avro
    ├── snap-4758351318332926243-1-2a67328f-8386-4d1a-873a-1034824e22f8.avro
    └── snap-5465468679579016991-1-91e78f4a-f1df-414f-835d-45488001bba9.avro

Slide 29

Key concept – Snapshots: how an Iceberg table tracks data files
- A snapshot is the state of a table at some point in time (s0 at t0, s1 at t1 on the time axis).
- Each write & commit produces a new snapshot of the data.
- For its data files, the table tracks partition, schema, format, stats, and file location.
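
To illustrate the write & commit cycle, here is a minimal sketch using Iceberg's core Java API; the catalog location, table name, and data file are hypothetical, and in practice the file metrics come from the writer.

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;

public class AppendCommitExample {
    public static void main(String[] args) {
        // Hypothetical warehouse location; a HadoopCatalog keeps table metadata on HDFS.
        HadoopCatalog catalog = new HadoopCatalog(new Configuration(), "hdfs:///warehouse");
        Table table = catalog.loadTable(TableIdentifier.of("db", "sample"));

        // Describe an already-written data file: path, format, size, and record count.
        DataFile dataFile = DataFiles.builder(table.spec())
            .withPath("hdfs:///warehouse/db/sample/data/00000-0-example.parquet") // hypothetical file
            .withFormat(FileFormat.PARQUET)
            .withFileSizeInBytes(1024L)
            .withRecordCount(2L)
            .build();

        // Committing the append produces a new snapshot (s0 -> s1) that includes the file.
        table.newAppend()
            .appendFile(dataFile)
            .commit();
    }
}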

Slide 30

Metadata files to track data
- Table metadata file: tracks the table schema, partitioning config, and snapshots.
- Manifest list file: stores metadata about manifests, including partition stats.
- Manifest file: lists data files, along with each file's partition data tuple, stats, and tracking information.
(Diagram: snapshots s0/s1 → manifests m0/m1/m2 → data files d00/d01/d10/d20, with the Hive Metastore pointing at the table metadata.)

Slide 31

Finding files necessary for a query
1. Find the manifest list file from the current snapshot (e.g. manifest-list = ml1).
2. Filter manifest files using partition value ranges stored in the manifest list file (e.g. for manifest m2 and partition p, the range is [20, 29]).
3. Read each manifest to get data files (e.g. d20, file path = hdfs://...).

Slide 32

Further file filtering with per-file stats
The manifest file stores per-file, per-column stats calculated at the time of data write, e.g.:
- file_path (string): location URI with FS scheme
- lower_bounds (map): map of column id to lower bound
- upper_bounds (map): map of column id to upper bound
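
This metadata can be inspected without touching the Avro files directly: Iceberg exposes metadata tables through Spark. A minimal sketch, assuming a Spark 3 session with an Iceberg catalog named local and a hypothetical table db.sample:

import org.apache.spark.sql.SparkSession;

public class InspectMetadataTables {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("inspect-iceberg-metadata")
            .getOrCreate();

        // Snapshots produced by each write & commit.
        spark.sql("SELECT snapshot_id, committed_at, operation FROM local.db.sample.snapshots").show(false);

        // Per-file entries from the manifests, including the bounds used for filtering.
        spark.sql("SELECT file_path, record_count, lower_bounds, upper_bounds FROM local.db.sample.files").show(false);
    }
}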

Slide 33

Differences & Benefits (Hive vs. Apache Iceberg)
- Metadata stored at: Hive Metastore vs. files → scalability: limited vs. unblocked
- Partitioning granularity: finer with Iceberg → efficiency
- Stats support: per-partition vs. per-file → performance

Slide 34

More benefits
- Serializable isolation
- Row-level deletes
- Incremental consumption
- Time travel
- Schema evolution
- Hidden partitioning
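
Time travel and incremental consumption, for example, are exposed as read options in Iceberg's Spark integration. A minimal sketch, reusing the snapshot IDs from the earlier file listing and assuming a catalog table named local.db.sample:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TimeTravelExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("iceberg-time-travel").getOrCreate();

        // Time travel: read the table as of a past snapshot.
        Dataset<Row> asOfSnapshot = spark.read()
            .format("iceberg")
            .option("snapshot-id", 4758351318332926243L)
            .load("local.db.sample");

        // Incremental consumption: read only the data appended between two snapshots.
        Dataset<Row> incremental = spark.read()
            .format("iceberg")
            .option("start-snapshot-id", 4758351318332926243L)
            .option("end-snapshot-id", 5465468679579016991L)
            .load("local.db.sample");

        asOfSnapshot.show();
        incremental.show();
    }
}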

Slide 35

Summary
- The Hive table format, coupled with a central metadata store, poses scalability issues.
- An Apache Iceberg table tracks data with files and solves those scalability issues.
- It provides additional benefits for analytic goals.
- We are considering adopting it widely in the LINE Data Platform.

Slide 36

First part - End Thank you

Slide 37

Speaker: Takeshi ONO
- Joined LINE in March 2019
- Data Engineering 1 team
- Working on the development of the ingestion pipeline

Slide 38

Agenda
- Overview of the existing log pipeline
- Problems of the existing log pipeline
- Solutions by adopting Iceberg
- Details of Flink Iceberg application
- Future work

Slide 39

Overview of the existing log pipeline

Slide 40

Existing log pipeline (Kafka → Flink → RAW table)
- The end-to-end pipeline is supported with Protobuf/JSON format serialization.
- Flink writes Protobuf/JSON raw data into HDFS in SequenceFile format.
- Exactly-once delivery is supported.
- We developed Hive Protobuf Serde libraries for accessing the RAW table.

Slide 41

Existing log pipeline (Kafka → Flink → RAW table → Watcher → Hiveserver2 / Tez on YARN → ORC table)
- The Watcher process detects RAW table updates.
- The Hive Tez engine converts the SequenceFile-format table to ORC using INSERT OVERWRITE statements.
- For loose coupling with the data, all Hive tables are 'external' tables.

Slide 42

Problems of the existing log pipeline

Slide 43

End-to-end high latency
The current pipeline's data processing latency is almost 2 hours before data becomes accessible to end users, because:
- Flink's BucketingSink supports truncating files to recover from the last checkpoint, so it keeps writing a single SequenceFile over multiple checkpoints to avoid creating many small HDFS files.
- RAW table SequenceFiles are flushed every hour to create hourly partitioned HDFS files.
- The hourly ORC conversion covers over 1K partitions of the RAW tables.
User demand: reduce this latency and improve data freshness, to enable data-driven decisions and near-real-time data-powered products.

Slide 44

End-to-end high latency (cont'd)
We tried developing a Flink app (Kafka → Flink → ORC table) that writes ORC files directly:
- Use Flink's StreamingFileSink and OrcWriter
- Write ORC files directly every few minutes
However, small-file compaction is difficult to implement:
- Compaction can delete files just before users' jobs read them
- With too many partitions and various sending patterns, it is difficult to manage at low cost without missing compaction
This proposal was not adopted.

Slide 45

Issues related to robustness
Need to manage two types of Hive external tables (RAW table and ORC table)
- Create/Alter/Drop table, Add/Drop partition, sync schema
- Sometimes manual operations are required
Too much metadata for partitions/files
- Heavy partition scans against the Hive Metastore
- Heavy directory scans against the NameNode
Too many components and dependencies
- Flink, Watcher, HDFS, Hive Metastore, Hiveserver2, YARN

Slide 46

Limited schema evolution support
Position-based field mapping is used for ORC files
- For backward compatibility
- Set orc.force.positional.evolution=true
Users often request to
- Drop deprecated fields
- Insert new fields in the contextually appropriate position
But we cannot delete/insert/move fields
- Only adding fields to a table/struct is supported
- Renaming is possible, but not recommended for query engine compatibility

Slide 47

Solutions by adopting Iceberg

Slide 48

New log pipeline (Kafka → Flink → Iceberg table, with a Table Optimizer running Spark on YARN)
- Implementing with Iceberg 0.12, Flink 1.12, and Spark 3.0
- The Flink job writes Iceberg files (ORC/Parquet) directly
- Flush interval is 5 minutes
- Spark actions are used for table maintenance; the Table Optimizer process schedules Spark jobs to merge small files, expire snapshots, rewrite manifests, remove orphans, and delete expired records (see the sketch below)
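
A minimal sketch of what such maintenance jobs can look like with Iceberg's Spark actions API (recent releases expose it as SparkActions); the catalog location, table name, and retention values are hypothetical, not LINE's actual settings.

import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class TableMaintenanceJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("table-maintenance").getOrCreate();

        // Hypothetical catalog/table; in production the table would come from the platform's catalog.
        HadoopCatalog catalog = new HadoopCatalog(new Configuration(), "hdfs:///warehouse");
        Table table = catalog.loadTable(TableIdentifier.of("db", "logs"));

        long threeDaysAgo = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(3);

        // Merge small files written every few minutes into larger ones.
        SparkActions.get(spark).rewriteDataFiles(table).execute();

        // Expire old snapshots so the files they reference can be cleaned up.
        SparkActions.get(spark).expireSnapshots(table).expireOlderThan(threeDaysAgo).execute();

        // Consolidate manifest files for faster planning.
        SparkActions.get(spark).rewriteManifests(table).execute();

        // Remove files that no snapshot references anymore.
        SparkActions.get(spark).deleteOrphanFiles(table).olderThan(threeDaysAgo).execute();
    }
}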

Slide 49

End-to-end minimized latency
- The Flink job's checkpointing interval is 5 minutes
- Iceberg's snapshot isolation enables ACID small-file compaction
Example:
- snapshot-010 = [00, 01, 02, 03]
- snapshot-011 = [00, 01, 02, 03, 04]
- (start compaction)
- snapshot-012 = [05, 04]
- snapshot-013 = [05, 04, 06]
A table reader that starts reading at snapshot-011 will still be able to read 00, 01, 02, 03, 04 after compaction is complete and snapshot-012 is created.

Slide 50

Simple / Scalable architecture
Reduce the load on the NameNode and Hive Metastore
- To build query execution plans, there is no longer a need to scan the NameNode/Hive Metastore; only the per-table metadata files on HDFS are read
- No such scan is needed for the table-optimizer to check the status of tables/partitions
Ease partition management with Iceberg's hidden partitioning feature
- No need to add partitions
- Just run a DELETE ... WHERE statement to drop partitions (see the sketch below)
No dependence on Hiveserver2 or Tez on YARN
- Fewer sources of log pipeline trouble
- Delays in table-optimizer jobs are not critical
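
A minimal sketch of hidden partitioning and partition cleanup through Spark SQL; the table name, schema, and retention predicate are hypothetical.

import org.apache.spark.sql.SparkSession;

public class HiddenPartitioningExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("hidden-partitioning").getOrCreate();

        // Partition by days(ts): Iceberg derives the partition value from the timestamp column,
        // so there is no separate partition column and no ADD PARTITION step.
        spark.sql("CREATE TABLE IF NOT EXISTS local.db.logs (ts timestamp, body string) "
            + "USING iceberg PARTITIONED BY (days(ts))");

        // Dropping old partitions is just a predicate on the source column; when the predicate
        // aligns with partition boundaries, this becomes a metadata-only delete.
        spark.sql("DELETE FROM local.db.logs WHERE ts < TIMESTAMP '2021-10-01 00:00:00'");
    }
}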

Slide 51

Fully supported schema evolution
- Supports any changes in the Protobuf schema
- Based on field IDs

table.updateSchema()
    .addColumn("count", Types.LongType.get())
    .commit();

table.updateSchema()
    .renameColumn("host", "host_name")
    .commit();

table.updateSchema()
    .deleteColumn("event_2020")
    .commit();

table.updateSchema()
    .addColumn("name_2", Types.StringType.get())
    .moveAfter("name_2", "name")
    .commit();

"schema" : {
  "type" : "struct",
  "fields" : [ {
    "id" : 1,
    "name" : "timestamp",
    "required" : false,
    "type" : "timestamptz"
  }, {
    "id" : 2,
    "name" : "count",
    "required" : false,
    "type" : "long"
  }, {

Slide 52

Summary – comparing the approaches (RAW + ORC, ORC Writer, Iceberg) on the small files problem, end-to-end latency, schema evolution, and scalability.

Slide 53

Details of Flink Iceberg application

Slide 54

Message format conversion
AS-IS
- Convert Protobuf/JSON to Avro using the org.apache.avro.protobuf libraries
- Convert Avro to RowData using Iceberg's FlinkAvroReader
TO-BE: convert Protobuf/JSON directly to Flink RowData (see the sketch below)
- Skip the Avro conversion to improve performance and reduce complexity
- Develop a JSON converter by referring to the implementation of the flink-formats/flink-json package
- For Protobuf, utilize the flink-formats/flink-protobuf package (to be released in Flink 1.15)
- Write records into Iceberg tables with Flink's DataStream
- Support Iceberg-specific handling of Timestamp/Map/Enum types
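
A rough illustration of the TO-BE path (not LINE's actual converter): a minimal sketch that maps a JSON log line to Flink's RowData using Jackson, with a hypothetical two-field schema (timestamp, body).

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.table.data.GenericRowData;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.data.StringData;
import org.apache.flink.table.data.TimestampData;

// Maps a raw JSON string to RowData; real logs would need null checks and error handling.
public class JsonToRowData extends RichMapFunction<String, RowData> {

    private transient ObjectMapper mapper;

    @Override
    public void open(Configuration parameters) {
        mapper = new ObjectMapper(); // created per task, since the mapper is not serialized with the job
    }

    @Override
    public RowData map(String json) throws Exception {
        JsonNode node = mapper.readTree(json);
        GenericRowData row = new GenericRowData(2);
        // Iceberg's Flink writer expects TimestampData / StringData rather than raw Java types.
        row.setField(0, TimestampData.fromEpochMillis(node.get("timestamp").asLong()));
        row.setField(1, StringData.fromString(node.get("body").asText()));
        return row;
    }
}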

Slide 55

Multiple Writers and a single Committer
- IcebergStreamWriter writes files
- IcebergFilesCommitter commits snapshots
- Exactly-once delivery is supported
- FlinkSink is a wrapper around the Writer and the Committer
Pipeline: Consumer → Converter → Stream Writer (each with parallelism N) → Files Committer (parallelism 1)
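
A minimal sketch of wiring such a converter into Iceberg's FlinkSink (the wrapper mentioned above), assuming a recent Iceberg Flink integration; the source and table location are placeholders.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

public class IcebergSinkJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpoints drive Iceberg commits: one snapshot per successful checkpoint.
        env.enableCheckpointing(5 * 60 * 1000);

        // Placeholder source; the real pipeline consumes from Kafka.
        DataStream<String> rawLogs = env.socketTextStream("localhost", 9999);
        DataStream<RowData> rows = rawLogs.map(new JsonToRowData()); // converter sketched on the previous slide

        // FlinkSink chains IcebergStreamWriter (parallel) and IcebergFilesCommitter (parallelism 1).
        TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs:///warehouse/db/logs"); // hypothetical table
        FlinkSink.forRowData(rows)
            .tableLoader(tableLoader)
            .append();

        env.execute("iceberg-log-sink");
    }
}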

Slide 56

Dynamic DataStream Splitter
- We don't use FlinkSink, to avoid shuffling
- Used for tables with incoming data volume exceeding 1 GB/s
- Automatically adjusts the number of active writers according to the incoming rate
- The Splitter is implemented with Flink Side Outputs (see the sketch below)
Pipeline: Consumer → Splitter → multiple Converter + Stream Writer chains (parallelism N each) → a single Files Committer (parallelism 1)
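
A minimal sketch of routing records to writer branches with Flink side outputs; the two-way split and the routing rule are hypothetical simplifications of the dynamic splitter described above.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class SplitterExample {

    // Tag for an extra writer branch; the anonymous subclass keeps the element type information.
    static final OutputTag<String> WRITER_1 = new OutputTag<String>("writer-1") {};

    public static void wire(DataStream<String> input) {
        SingleOutputStreamOperator<String> main = input.process(new ProcessFunction<String, String>() {
            @Override
            public void processElement(String record, Context ctx, Collector<String> out) {
                // Hypothetical routing rule; the real splitter activates or deactivates
                // branches based on the observed incoming rate.
                if (record.hashCode() % 2 == 0) {
                    out.collect(record);          // main output -> writer branch 0
                } else {
                    ctx.output(WRITER_1, record); // side output -> writer branch 1
                }
            }
        });

        DataStream<String> branch1 = main.getSideOutput(WRITER_1);
        // Each branch then goes through its own Converter and IcebergStreamWriter,
        // all feeding the single IcebergFilesCommitter.
    }
}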

Slide 57

Future work

Slide 58

Future work
Wide adoption
- Revise centralized data pipelines to generate Apache Iceberg tables
Advanced applications
- Incremental data consumption for lower end-to-end latency
- CDC data pipelines and time travel access to manage state tables
Research and development
- Develop and operate data pipelines with controlled data consumers reliably at scale
- Make sure query engine integration works (Spark 3, Hive, Trino)
- Build table migration strategies

Slide 59

We are hiring!
- Data Platform Engineer: https://linecorp.com/ja/career/position/1750
- Site Reliability Engineer: https://linecorp.com/ja/career/position/1751
- Data Engineer: https://linecorp.com/ja/career/position/1749

Slide 60

Thank you