Adopting Apache Iceberg on LINE Data Platform

Speaker
Tomoyuki Saito - Senior Software Engineer, Data Platform dept.
- Joined LINE as a new graduate in 2015
- Career
  - Bot backend server development
  - Log ingestion pipeline development (2016~)
  - Data platform development

Agenda
- First part – Problems in query processing
  - LINE's Data Platform
  - Problems in query processing
  - A table format – Apache Iceberg
- Second part – Revamping log data pipeline with Apache Iceberg

LINE's Data Platform

LINE's self-serve data platform
Democratize data for business consumers.
[Diagram: Data Platform supporting Services, Data Science, Machine Learning, Governance, ...]

Typical data flow
[Diagram: Data sources, Ingest, Storage / Metadata, Process, Serve, Data consumers]

LINE's data flow
[Diagram: Data sources, Ingest, Storage / Metadata, Process, Serve, Data consumers, each stage labeled "Big"; everything runs on on-premise machines & data centers]

Scale: Quantity
- Number of machines: 5,000+
- Stored data volume: 290 PB+
- Hive tables: 40,000+

Scale: Throughput / Activities
- Log ingestion rate: 17.5 M+ records / s
- Jobs executed: 150,000+ jobs / day
- Platform users: 700+

Challenges
[Diagram: Data sources, Ingest, Storage / Metadata, Process, Serve, Data consumers, annotated with two challenges: Throughput and Quantity]

Problems in query processing

Data ETL with SQL
[Diagram: SQL is run by distributed SQL query engines (Spark, Hive, Trino, Flink) on top of Storage / Metadata]

Distributed SQL execution
SELECT name FROM employee
[Diagram: Parse, Analysis, Optimization, Planning, Execution Plan]

Distributed SQL execution
SELECT name FROM employee
[Diagram: Parse, Analysis, Optimization, Planning, Execution Plan]
- How to read / write data?
- What data files are in the table?
Table format determines how to answer those questions.

Table format
Determines a way to organize data files to present them as a table.
- Data discovery: Answers what data files are in a table. Hides the complexity of finding relevant files.
- Data abstraction: Answers how to read and write data. Hides the complexity of underlying data structures and formats.

The de-facto standard – Hive table format
- Table metadata is manipulated and queried via Hive Metastore's Thrift interface (e.g. create_table, get_partitions).
- Table metadata (Schema, Format, Serde, Partition, Stats) is stored in Metastore DB (RDBMS).
- Data files are defined as the entire contents of directories, e.g. /table/date=2021-10-01/*
[Diagram: Hive Metastore exposes a Thrift API and is backed by Metastore DB]
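
To make "data files are defined as the entire contents of directories" concrete, here is a minimal Python sketch of Hive-style data discovery on a local temporary directory. It is a toy model, not Hive code: the function names (partition_dir, discover_files, build_demo_table) are hypothetical, and the in-memory partition list merely stands in for what a get_partitions call would return.

```python
import os
import tempfile


def partition_dir(table_root, partition):
    """Build a Hive-style partition path such as <table>/date=2021-10-01."""
    parts = [f"{key}={value}" for key, value in partition.items()]
    return os.path.join(table_root, *parts)


def discover_files(table_root, partitions):
    """Data discovery: one directory listing per partition, every file counts."""
    files = []
    for partition in partitions:
        directory = partition_dir(table_root, partition)
        for name in sorted(os.listdir(directory)):
            files.append(os.path.join(directory, name))
    return files


def build_demo_table(root):
    """Create two date partitions with empty files and return the partition
    list (a stand-in for the rows a metastore would hold)."""
    partitions = [{"date": "2021-10-01"}, {"date": "2021-10-02"}]
    for index, partition in enumerate(partitions):
        directory = partition_dir(root, partition)
        os.makedirs(directory)
        for file_no in range(index + 1):
            open(os.path.join(directory, f"part-{file_no:05d}.orc"), "w").close()
    return partitions


with tempfile.TemporaryDirectory() as root:
    demo_partitions = build_demo_table(root)
    demo_files = [os.path.relpath(path, root)
                  for path in discover_files(root, demo_partitions)]
```

The cost of discovery in this model grows with the number of partitions: one metadata row and one directory listing each.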

LINE's ETL infrastructure
[Diagram: Interactive and Scheduled / Batch jobs run on SQL engines (Spark, Hive, Trino); the engines read Storage (HDFS) and call Hive Metastore via its Thrift API; Hive Metastore is backed by Metastore DB]

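LINE's ETL infrastructure
[Diagram: Interactive and Scheduled / Batch jobs run on SQL engines (Spark, Hive, Trino); the engines read Storage (HDFS) and call Hive Metastore via its Thrift API; Hive Metastore is backed by Metastore DB; three parts of the diagram are labeled "Big"]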

Metastore DB – Regular QPS over a week

Metastore DB – Regular CPU usage over a week

Metastore DB – Abnormal CPU usage
Risk of outages, with a big blast radius.
Observations
- Slow queries at Metastore DB
- Scans over 11M+ rows when querying the 'PARTITIONS' table in Metastore DB
- These corresponded to a query on a Hive table with 11M+ partitions, which had been created unintentionally
- A resolution: Reduce partitions

Hive table format limitation
- The format relies heavily on Hive Metastore to store metadata.
- Partitions are stored as rows in Metastore DB.
- Limitation: Loading many partitions causes problems. A table with O(10K) partitions leads to high load and memory pressure (HIVE-13884).
- A workaround is necessary, like reducing partitions.

Diagnostics – Hive table format
- Bottleneck: Table metadata lookup performance is bound to the capacity of the central Hive Metastore and Metastore DB instances.
- Inefficient data access: Coarse-grained partitioning leads to scans over more data, and more data to write on partition regeneration.
- Fewer opportunities for optimization: Metadata cannot be enriched further, for example with per-file statistics for query optimization.

Apache Iceberg
An open table format for huge analytic datasets

Apache Iceberg
An open table format for huge analytic datasets (OSS)
- SQL query engines: Flink, Spark, Hive, Trino
- Table format: Apache Iceberg
- File format: Parquet, ORC, Avro
- Storage: HDFS, S3

What an Iceberg table looks like
# Spark SQL
create table sample (id int) using iceberg;
insert into sample values (100);
insert into sample values (200);
select * from sample;
# Files in HDFS
sample
├── data
│   ├── 00000-2-26bcfac0-91ba-4374-a879-b780cf0608c3-00001.parquet
│   └── 00000-3-4bfb85d8-3283-48f7-980d-28ea115aed80-00001.parquet
└── metadata
    ├── 00000-811eaf6e-b0f4-4bd7-8f87-a6df1d543b34.metadata.json
    ├── 00001-4041324f-1920-44f4-8ce6-6088ec663e0a.metadata.json
    ├── 00002-66aac2ec-8f9a-4de8-a679-428bb970b1ff.metadata.json
    ├── 2a67328f-8386-4d1a-873a-1034824e22f8-m0.avro
    ├── 91e78f4a-f1df-414f-835d-45488001bba9-m0.avro
    ├── snap-4758351318332926243-1-2a67328f-8386-4d1a-873a-1034824e22f8.avro
    └── snap-5465468679579016991-1-91e78f4a-f1df-414f-835d-45488001bba9.avro

Key concept
How an Iceberg table tracks data files
- Snapshot: State of a table at some time
- Tracked metadata: Partition, Schema, Format, Stats, File location
[Diagram: timeline of Data and Snapshots; each Write & Commit (at t0, t1) produces a snapshot (s0, s1)]

Metadata files to track data
- Table metadata file: Tracks the table schema, partitioning config, and snapshots.
- Manifest list file: Stores metadata about manifests, including partition stats.
- Manifest file: Lists data files, along with each file's partition data tuple, stats, and tracking information.
[Diagram: Hive Metastore, snapshots s0 and s1, manifests m0, m1, m2, and data files d00, d01, d10, d20]

Finding files necessary for a query
1. Find the manifest list file from the current snapshot.
2. Filter manifest files using partition value ranges stored in the manifest list file.
3. Read each manifest to get data files.
Example: manifest-list = ml1; for manifest m2 and partition p, the range is [20, 29]; d20 file path = hdfs://...
[Diagram: snapshots s0 and s1, manifest list ml1, manifests m0, m1, m2, and data files d00, d01, d10, d20]
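
The three steps above can be sketched in a few lines of Python. This is a toy model rather than Iceberg code: ManifestEntry and plan_files are hypothetical names, the partition ranges for m0 and m1 and the assignment of manifests to snapshots are assumptions made for illustration, and only the [20, 29] range for m2 comes from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class ManifestEntry:
    """One row of a manifest list: a manifest plus the value range of partition p."""
    path: str
    lower: int
    upper: int
    data_files: list = field(default_factory=list)


def plan_files(snapshots, current_snapshot_id, wanted_value):
    # Step 1: find the manifest list of the current snapshot.
    manifest_list = snapshots[current_snapshot_id]
    # Step 2: keep only manifests whose partition range can contain the value.
    candidates = [m for m in manifest_list if m.lower <= wanted_value <= m.upper]
    # Step 3: read each surviving manifest to collect its data files.
    return [data_file for m in candidates for data_file in m.data_files]


# Assumed layout: only m2's range [20, 29] is taken from the slide.
m0 = ManifestEntry("m0", 0, 9, ["d00", "d01"])
m1 = ManifestEntry("m1", 10, 19, ["d10"])
m2 = ManifestEntry("m2", 20, 29, ["d20"])
snapshots = {"s0": [m0, m1], "s1": [m0, m1, m2]}

files_for_p_25 = plan_files(snapshots, "s1", 25)
```

Manifests m0 and m1 are skipped without being opened, so planning cost depends on the metadata that matches the query rather than on the total number of partitions.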

Further file filtering with per-file stats
Manifest file stores per-file per-column stats calculated at the time of data write.
Field         Type             Description
file_path     string           Location URI with FS scheme
lower_bounds  map<int,binary>  Map of column id to lower bound
upper_bounds  map<int,binary>  Map of column id to upper bound
[Diagram: snapshots s0 and s1, manifest list ml1, manifests m0, m1, m2, and data files d00, d01, d10, d20]
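
As an illustration of how lower_bounds and upper_bounds allow files to be skipped, here is a small Python sketch. It is a simplification under stated assumptions: bounds are plain integers instead of the binary values used in real manifests, and DataFile, may_contain, and prune are hypothetical names rather than Iceberg APIs.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    file_path: str
    lower_bounds: dict  # column id -> lower bound (plain ints in this sketch)
    upper_bounds: dict  # column id -> upper bound


def may_contain(data_file, column_id, value):
    """Return False only when the stats prove the value cannot be in the file."""
    lower = data_file.lower_bounds.get(column_id)
    upper = data_file.upper_bounds.get(column_id)
    if lower is None or upper is None:
        return True  # no stats for this column, so the file must be read
    return lower <= value <= upper


def prune(data_files, column_id, value):
    """Keep the paths of files that might hold rows where column == value."""
    return [f.file_path for f in data_files if may_contain(f, column_id, value)]


files = [
    DataFile("d00", {1: 100}, {1: 150}),
    DataFile("d01", {1: 151}, {1: 200}),
    DataFile("d10", {}, {}),  # written without stats for column 1
]

to_scan = prune(files, column_id=1, value=120)
```

Pruning is conservative: a file without stats is always kept, so missing stats cost performance but never correctness.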

Differences & Benefits
                                Hive            Apache Iceberg  Benefit
Metadata stored at              Hive Metastore  Files           Scalability
Finer partitioning granularity  Limited         Unblocked       Efficiency
Stats support                   Per-partition   Per-file        Performance

More benefits
- Serializable isolation
- Row-level deletes
- Incremental consumption
- Time travel
- Schema evolution
- Hidden partitioning

Summary
- The Hive table format, coupled with a central metadata store, poses scalability issues.
- An Apache Iceberg table tracks data with files and solves the scalability issues.
- It provides additional benefits for analytic goals.
- We are considering adopting it widely in LINE Data Platform.

First part - End
Thank you

Speaker
Takeshi ONO
- Joined LINE in March 2019
- Working on the development of the ingestion pipeline
- Data Engineering 1 team

Agenda
- Overview of the existing log pipeline
- Problems of the existing log pipeline
- Solutions by adopting Iceberg
- Details of Flink Iceberg application
- Future works

Overview of the existing log pipeline

Existing log pipeline
[Diagram: Kafka, Flink, RAW table]
- The end-to-end pipeline supports Protobuf/JSON format serialization.
- Flink writes Protobuf/JSON raw data into HDFS in SequenceFile format.
- Exactly-once delivery is supported.
- We developed Hive Protobuf Serde libraries for accessing the RAW table.

Existing log pipeline
[Diagram: Kafka, Flink, RAW table, ORC table, with Watcher, HiveServer2, and Tez on YARN]
- A Watcher process detects RAW table updates.
- The Hive Tez engine converts the SequenceFile format table to ORC using INSERT OVERWRITE statements.
- For loose coupling with the data, all Hive tables are 'external' tables.

Problems of the existing log pipeline

End-to-end high latency
With the current pipeline, it takes almost 2 hours before data becomes accessible to end users. Reasons:
- RAW table SequenceFiles are flushed every hour to create hourly partitioned HDFS files. This avoids creating many small HDFS files: Flink's BucketingSink supports truncating files to recover from the last checkpoint, so it can keep writing a single SequenceFile over multiple checkpoints.
- ORC conversion runs hourly for over 1k partitions of the above RAW tables.
User demand
- Reduce this latency and improve data freshness, for making data-driven decisions and building products powered by near-real-time data.

End-to-end high latency (cont'd)
[Diagram: Kafka, Flink, ORC table]
We tried developing a Flink app which writes ORC files directly.
- Use Flink's StreamingFileSink and OrcWriter
- Write ORC files directly every few minutes
However, small-file compaction is difficult to implement.
- Compaction can delete files just before users' jobs read them
- There are too many partitions and varied sending patterns, making it difficult to manage at low cost without missing compactions
This proposal was not adopted.

Issues related to robustness
Need to manage two types of Hive external tables
- RAW table and ORC table
- Create/Alter/Drop table, Add/Drop partition, Sync schema
- Sometimes manual operations are required
Too much metadata for partitions/files
- Heavy partition scans against Hive Metastore
- Heavy directory scans against NameNode
Too many components or dependencies
- Flink, Watcher, HDFS, Hive Metastore, HiveServer2, YARN

Limited schema evolution support
Use position-based field mapping for ORC files
- For backward compatibility
- Set orc.force.positional.evolution=true
Cannot delete/insert/move fields
- Only adding fields to a table/struct is supported
- Rename is possible, but not recommended for query engine compatibility
Users often request to
- Drop deprecated fields
- Insert new fields in the contextually appropriate position

Solutions by adopting Iceberg

New log pipeline
[Diagram: Kafka, Flink, Iceberg table; Table Optimizer submits jobs to Spark on YARN]
- Implemented with Iceberg 0.12, Flink 1.12 and Spark 3.0
- Flink job writes Iceberg files (ORC/Parquet) directly
- Flush interval is 5 minutes
- Use Spark actions for table maintenance
- Table Optimizer process schedules Spark jobs: Merge small files, Expire snapshots, Rewrite manifests, Remove orphans, Delete expired records

End-to-end minimized latency
- Flink job's checkpointing interval is 5 minutes
- Iceberg's snapshot isolation enables ACID small-file compaction
[Diagram: data files 00 to 06 on a timeline, with a "Start compaction" marker]
snapshot-010 = [00, 01, 02, 03]
snapshot-011 = [00, 01, 02, 03, 04]
snapshot-012 = [05, 04]
snapshot-013 = [05, 04, 06]
Here compaction rewrites files 00 to 03 into file 05. A table reader that starts reading at snapshot-011 will still be able to read 00, 01, 02, 03, 04 after compaction is complete and snapshot-012 is created.
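
The snapshot sequence above can be reproduced with a toy Python model in which every commit appends a new immutable file list. The Table class and its methods are hypothetical and only mimic the bookkeeping. Real Iceberg stores this state in metadata files, and the old data files stay readable only until their snapshots are expired.

```python
class Table:
    """Toy table: each commit appends a new immutable snapshot (tuple of file names)."""

    def __init__(self):
        self.snapshots = [()]

    def current(self):
        return self.snapshots[-1]

    def append(self, new_file):
        """Commit a snapshot that adds one data file."""
        self.snapshots.append(self.current() + (new_file,))

    def compact(self, old_files, merged_file):
        """Commit a snapshot that swaps old_files for one merged file."""
        kept = tuple(f for f in self.current() if f not in old_files)
        self.snapshots.append((merged_file,) + kept)


table = Table()
for name in ["00", "01", "02", "03"]:
    table.append(name)
snapshot_010 = table.current()

table.append("04")
snapshot_011 = table.current()  # a reader pins this snapshot here

table.compact({"00", "01", "02", "03"}, "05")
snapshot_012 = table.current()

table.append("06")
snapshot_013 = table.current()
```

Because a commit never mutates an earlier tuple, the reader holding snapshot_011 keeps seeing files 00 to 04 even after the compaction commit.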

Simple/Scalable architecture
Reduce the load on NameNode and Hive Metastore
- To build query execution plans, it is no longer necessary to scan NameNode/Hive Metastore; only the per-table metadata files on HDFS are read
- No such scan is needed for table-optimizer to check the status of a table/partition
Ease partition management tasks with Iceberg's hidden partitioning feature
- No need to add partitions
- Just run a DELETE WHERE statement to drop partitions
No dependence on HiveServer2 or Tez on YARN
- Reduces sources of log pipeline trouble
- Delay of table-optimizer jobs is not critical

Fully supported schema evolution
- Support any changes in Protobuf schema
- Based on field ID
table.updateSchema()
    .addColumn("count", Types.LongType.get())
    .commit();
table.updateSchema()
    .renameColumn("host", "host_name")
    .commit();
table.updateSchema()
    .deleteColumn("event_2020")
    .commit();
table.updateSchema()
    .addColumn("name_2", Types.StringType.get())
    .moveAfter("name_2", "name")
    .commit();
"schema" : {
  "type" : "struct",
  "fields" : [ {
    "id" : 1,
    "name" : "timestamp",
    "required" : false,
    "type" : "timestamptz"
  }, {
    "id" : 2,
    "name" : "count",
    "required" : false,
    "type" : "long"
  }, {
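
To show why resolving columns by field ID tolerates renames and deletes where position-based mapping does not, here is a minimal Python sketch. The schemas, IDs, and values are hypothetical examples (they are not the table from the slide), and read_by_position / read_by_id are illustrative helpers, not Iceberg or ORC APIs.

```python
def read_by_position(row, table_columns):
    """Position-based mapping: the i-th table column takes the i-th stored value."""
    return {name: (row[i] if i < len(row) else None)
            for i, name in enumerate(table_columns)}


def read_by_id(file_schema, row, table_schema):
    """ID-based mapping: each table column looks up its field ID in the file schema."""
    position_of = {field_id: i for i, (field_id, _name) in enumerate(file_schema)}
    result = {}
    for field_id, name in table_schema:
        index = position_of.get(field_id)
        result[name] = row[index] if index is not None else None
    return result


# Hypothetical old data file, written before the schema changed.
old_file_schema = [(1, "timestamp"), (2, "host"), (3, "event_2020")]
old_row = ("2021-10-01T00:00:00Z", "web01", "legacy")

# Table schema after: rename host -> host_name, delete event_2020, add count.
new_table_schema = [(1, "timestamp"), (2, "host_name"), (4, "count")]

by_id = read_by_id(old_file_schema, old_row, new_table_schema)
by_position = read_by_position(old_row, [name for _id, name in new_table_schema])
```

With ID-based resolution the renamed column still finds its data and the new column reads as null for old files. With position-based resolution the value of the deleted column leaks into the newly added one, which is why deletes, inserts, and moves are unsafe in that mode.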

Summary
[Table comparing three approaches (RAW + ORC, ORC Writer, Iceberg) on four criteria: Small files problem, End-to-end latency, Schema evolution, Scalability]

Details of Flink Iceberg application

Message format conversion
AS-IS
- Convert Protobuf/JSON to Avro using org.apache.avro.protobuf libraries
- Convert Avro to RowData using Iceberg FlinkAvroReader
TO-BE: Convert Protobuf/JSON to Flink RowData
- Skip Avro conversion to improve performance and reduce complexity
- Develop JSON converter by referring to the implementation of flink-formats/flink-json package
- And for Protobuf, utilize flink-formats/flink-protobuf package (to be released in Flink 1.15)
- Write records into Iceberg tables with Flink's DataStream<RowData>
- Support for Iceberg-specific handling of Timestamp/Map/Enum types

Multiple Writers and single Committer
- IcebergStreamWriter writes files
- IcebergFilesCommitter commits snapshot
- Support exactly-once delivery
- FlinkSink is a wrapper of the Writer and Committer
[Diagram: Consumer (Parallelism: N), Converter (Parallelism: N), Stream Writer (Parallelism: N), Files Committer (Parallelism: 1)]
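
One way a single committer can give exactly-once results is to commit once per checkpoint and ignore checkpoints it has already committed when they are replayed after a failure. The Python sketch below illustrates that idea only. The Committer class is hypothetical, and it is an assumption that the production job behaves this way in detail.

```python
class Committer:
    """Toy single committer: at most one snapshot per checkpoint ID."""

    def __init__(self):
        self.snapshots = []
        self.max_committed_checkpoint_id = -1

    def commit(self, checkpoint_id, files):
        """Commit the files the writers produced for one checkpoint.

        Returns False when the checkpoint was already committed, which is
        what happens when a checkpoint is replayed after recovery.
        """
        if checkpoint_id <= self.max_committed_checkpoint_id:
            return False
        self.snapshots.append(sorted(files))
        self.max_committed_checkpoint_id = checkpoint_id
        return True


committer = Committer()
first = committer.commit(1, ["writer-0-a", "writer-1-a"])
replayed = committer.commit(1, ["writer-0-a", "writer-1-a"])  # replay after failure
second = committer.commit(2, ["writer-0-b"])
```

Writers can run with any parallelism because they only produce files. Visibility changes in one place, the committer, which is why its parallelism is 1.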

Dynamic DataStream Splitter
- We don't use FlinkSink, to avoid shuffle
- Tables with incoming data volume exceeding 1GB/sec
- Automatically adjust the number of active writers according to the incoming rate
- Splitter is implemented with Flink Side Outputs
[Diagram: Consumer (Parallelism: N), Splitter, several parallel Converter + Stream Writer chains (Parallelism: N each), Files Committer (Parallelism: 1)]
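
The slide does not say how the number of active writers is derived from the incoming rate. The Python sketch below shows one plausible policy purely as an assumption: size the writer set by dividing the observed rate by a per-writer capacity, clamp the result, and spread records over the active outputs round-robin. The function names and the capacity figure are hypothetical.

```python
import math


def active_writer_count(incoming_bytes_per_sec, per_writer_bytes_per_sec, max_writers):
    """Number of writer chains to keep active for the observed incoming rate."""
    needed = math.ceil(incoming_bytes_per_sec / per_writer_bytes_per_sec)
    return max(1, min(max_writers, needed))


def route(record_index, active_writers):
    """Pick the side output (writer chain) for a record by round-robin."""
    return record_index % active_writers


# Assumed numbers: 1 GB/s incoming, 300 MB/s per writer chain, 8 chains at most.
writers_now = active_writer_count(1_000_000_000, 300_000_000, max_writers=8)
outputs = [route(i, writers_now) for i in range(6)]
```

Keeping fewer writers active at low rates means fewer, larger files per flush, which reduces the amount of small-file compaction needed later.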

Future works

Future works
Wide adoption
- Revise centralized data pipelines to generate Apache Iceberg tables
Advanced applications
- Incremental data consumption for lower end-to-end latency
- CDC data pipeline and time travel access to manage state table
Research and development
- Develop and operate data pipelines with controlled data consumers reliably at scale
- Make sure query engine integration works (Spark 3, Hive, Trino)
- Build table migration strategies

We are hiring!
- Data Platform Engineer: https://linecorp.com/ja/career/position/1750
- Site Reliability Engineer: https://linecorp.com/ja/career/position/1751
- Data Engineer: https://linecorp.com/ja/career/position/1749

Thank you
