
Adopting Apache Iceberg on LINE Data Platform

Tomoyuki Saito
LINE / IU Tech Forward team / Software Engineer
Takeshi Ono
LINE / Data Engineering1 / Software Engineer

https://linedevday.linecorp.com/2021/ja/sessions/64
https://linedevday.linecorp.com/2021/en/sessions/64
https://linedevday.linecorp.com/2021/ko/sessions/64

LINE DEVDAY 2021

November 10, 2021

Transcript

  1. None
  2. Speaker Tomoyuki Saito - Senior Software Engineer, Data Platform dept.

    - Joined LINE as a new graduate in 2015 - Career: bot backend server development, log ingestion pipeline development (2016~), data platform development
  3. Agenda - First part – Problems in query processing -

    LINE's Data Platform - Problems in query processing - A table format – Apache Iceberg - Second part – Revamping log data pipeline with Apache Iceberg
  4. LINE's Data Platform

  5. LINE's self-serve data platform Data Platform Services Data Science Machine

    Learning Governance ... Democratize data for business consumers
  6. Typical data flow Data sources Ingest Storage / Metadata Serve

    Data consumers Process
  7. LINE's data flow Data sources Ingest Storage / Metadata Serve

    Data consumers Process Big Big Big Big Big Big On-premise machines & data centers
  8. Scale: Quantity

    - Number of machines: 5,000+
    - Stored data volume: 290 PB+
    - Hive tables: 40,000+
  9. Scale: Throughput / Activities

    - Log ingestion rate: 17.5 M+ records / s
    - Jobs executed: 150,000+ jobs / day
    - Platform users: 700+
  10. Challenges Data sources Ingest Storage / Metadata Serve Data consumers

    Process Throughput Quantity
  11. Challenges Data sources Ingest Storage / Metadata Serve Data consumers

    Process Throughput Quantity
  12. Problems in query processing

  13. Data ETL with SQL Storage / Metadata Distributed SQL query

    engines SQL Spark Hive Trino Flink
  14. Distributed SQL execution SELECT name FROM employee Execution Plan Parse

    Analysis Optimization Planning
  15. Distributed SQL execution Execution Plan Parse Analysis Optimization Planning SELECT

    name FROM employee How to read / write data? What data files are in the table?
  16. Distributed SQL execution Execution Plan Parse Analysis Optimization Planning SELECT

    name FROM employee How to read / write data? What data files are in the table? Table format determines how to answer those questions.
  17. Table format

    Determines a way to organize data files to present them as a table.
    - Data discovery: answers what data files are in a table, hiding the complexity of finding relevant files.
    - Data abstraction: answers how to read and write data, hiding the complexity of underlying data structures and formats.
  18. The de-facto standard – Hive table format

    - Table metadata is stored in Metastore DB (RDBMS): stats, schema, format, serde, partition.
    - Table metadata is manipulated and queried via Hive Metastore's Thrift interface (for example create_table, get_partitions).
    - Data files are defined as the entire contents of directories, for example /table/date=2021-10-01/*
  19. LINE's ETL infrastructure Storage SQL engines Spark Hive Trino Interactive

    Jobs HDFS Hive Metastore Scheduled / Batch Thrift API Metastore DB
  20. LINE's ETL infrastructure Storage SQL engines Spark Hive Trino Interactive

    Jobs HDFS Hive Metastore Scheduled / Batch Thrift API Big Big Big Metastore DB
  21. Metastore DB – Regular QPS over a week

  22. Metastore DB – Regular CPU usage over a week

  23. Metastore DB – Abnormal CPU usage

    Risk of outages with a big blast radius.
    Observations:
    - Slow queries at Metastore DB
    - Scans over 11M+ rows when querying the 'PARTITIONS' table in Metastore DB
    - These correspond to a query on a Hive table with 11M+ partitions, which had been created unintentionally
    - A resolution: reduce partitions
  24. Hive table format limitation

    The format relies heavily on Hive Metastore to store metadata. Partitions are stored as rows in Metastore DB.
    Limitation: loading many partitions causes problems. A table with O(10K) partitions leads to high load and memory pressure (HIVE-13884).
    A workaround is necessary, such as reducing partitions.
  25. Diagnostics – Hive table format

    - Bottleneck: table metadata lookup performance is bound to the capacity of the central Hive Metastore and Metastore DB instances.
    - Inefficient data access: coarse-grained partitioning leads to scans over more data, and more data to write on partition regeneration.
    - Fewer opportunities for optimization: metadata cannot be enriched further, for example with per-file statistics for query optimization.
  26. Apache Iceberg An open table format for huge analytic datasets

  27. Storage File format Parquet ORC Avro HDFS S3 Table format

    SQL query engines Flink Spark Hive Trino Apache Iceberg An open table format for huge analytic datasets OSS
  28. What an Iceberg table looks like

    # Spark SQL
    create table sample (id int) using iceberg;
    insert into sample values (100);
    insert into sample values (200);
    select * from sample;

    # Files in HDFS
    sample
    ├── data
    │   ├── 00000-2-26bcfac0-91ba-4374-a879-b780cf0608c3-00001.parquet
    │   └── 00000-3-4bfb85d8-3283-48f7-980d-28ea115aed80-00001.parquet
    └── metadata
        ├── 00000-811eaf6e-b0f4-4bd7-8f87-a6df1d543b34.metadata.json
        ├── 00001-4041324f-1920-44f4-8ce6-6088ec663e0a.metadata.json
        ├── 00002-66aac2ec-8f9a-4de8-a679-428bb970b1ff.metadata.json
        ├── 2a67328f-8386-4d1a-873a-1034824e22f8-m0.avro
        ├── 91e78f4a-f1df-414f-835d-45488001bba9-m0.avro
        ├── snap-4758351318332926243-1-2a67328f-8386-4d1a-873a-1034824e22f8.avro
        └── snap-5465468679579016991-1-91e78f4a-f1df-414f-835d-45488001bba9.avro
  29. Key concept: snapshots

    - Snapshot: the state of a table at some point in time.
    - Each write & commit produces a new snapshot (s0 at t0, s1 at t1).
    - How an Iceberg table tracks data files: partition, schema, format, stats, file location.
  30. Metadata files to track data

    - Table metadata file: tracks the table schema, partitioning config, and snapshots.
    - Manifest list file: stores metadata about manifests, including partition stats.
    - Manifest file: lists data files, along with each file’s partition data tuple, stats, and tracking information.
    Diagram labels: snapshots s0, s1; manifests m0, m1, m2; data files d00, d01, d10, d20; Hive Metastore.
  31. Finding files necessary for a query

    1. Find the manifest list file from the current snapshot. Example: manifest-list = ml1
    2. Filter manifest files using partition value ranges stored in the manifest list file. Example: for manifest m2 and partition p, the range is [20, 29]
    3. Read each manifest to get data files. Example: d20 file path = hdfs://...
    Diagram labels: snapshots s0, s1; manifest list ml1; manifests m0, m1, m2; data files d00, d01, d10, d20.
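
The three lookup steps on this slide can be sketched in plain Python. This is only an illustrative model, not Iceberg's implementation: the `Manifest` and `DataFile` classes, the ranges for m0 and m1, and the per-file partition values are assumptions made up for the example. Only the [20, 29] range for m2 comes from the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataFile:
    path: str
    partition: int  # partition value shared by all rows in this file (assumed)


@dataclass
class Manifest:
    name: str
    lower: int  # smallest partition value among the listed files
    upper: int  # largest partition value among the listed files
    files: List[DataFile] = field(default_factory=list)


def plan_files(manifest_list: List[Manifest], lo: int, hi: int) -> List[str]:
    """Return paths of data files whose partition value lies in [lo, hi]."""
    selected = []
    for manifest in manifest_list:
        # Step 2: skip manifests whose partition range cannot overlap the query.
        if manifest.upper < lo or manifest.lower > hi:
            continue
        # Step 3: only now open the manifest and pick matching data files.
        selected += [f.path for f in manifest.files if lo <= f.partition <= hi]
    return selected


# Step 1: the current snapshot points at manifest list ml1.
ml1 = [
    Manifest("m0", 0, 9, [DataFile("d00", 3), DataFile("d01", 7)]),
    Manifest("m1", 10, 19, [DataFile("d10", 12)]),
    Manifest("m2", 20, 29, [DataFile("d20", 25)]),
]

print(plan_files(ml1, 20, 29))  # only m2 is opened
```

The point of the design is that manifests m0 and m1 are never read for this query, so planning cost depends on the matching metadata rather than on the total number of partitions.
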
  32. Further file filtering with per-file stats

    Manifest file stores per-file per-column stats calculated at the time of data write.
    Field | Type | Description
    file_path | string | Location URI with FS scheme
    lower_bounds | map<int,binary> | Map of column id to lower bound
    upper_bounds | map<int,binary> | Map of column id to upper bound
    Diagram labels: snapshots s0, s1; manifest list ml1; manifests m0, m1, m2; data files d00, d01, d10, d20.
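
As a rough sketch of how such per-file bounds allow files to be skipped, the snippet below models a manifest entry with integer bounds. Real Iceberg manifests store the bounds as binary values keyed by column id; the `ManifestEntry` class, the `may_contain` helper, the file paths, and the column id 1 are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ManifestEntry:
    file_path: str
    lower_bounds: Dict[int, int]  # column id -> smallest value in the file
    upper_bounds: Dict[int, int]  # column id -> largest value in the file


def may_contain(entry: ManifestEntry, column_id: int, value: int) -> bool:
    """Return False only when the stats prove the file cannot hold `value`."""
    lower = entry.lower_bounds.get(column_id)
    upper = entry.upper_bounds.get(column_id)
    if lower is None or upper is None:
        return True  # no stats for this column, so the file must be read
    return lower <= value <= upper


entries = [
    ManifestEntry("hdfs://example/d00.parquet", {1: 100}, {1: 150}),
    ManifestEntry("hdfs://example/d01.parquet", {1: 151}, {1: 200}),
    ManifestEntry("hdfs://example/d10.parquet", {}, {}),
]

# Equivalent of "WHERE id = 200", assuming column `id` has column id 1.
to_read = [e.file_path for e in entries if may_contain(e, 1, 200)]
print(to_read)
```

Files without stats are kept rather than skipped, because pruning is only safe when the bounds prove that no row can match.
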
  33. Differences & Benefits

    Aspect | Hive | Apache Iceberg | Benefit
    Metadata stored at | Hive Metastore | Files | Scalability
    Finer partitioning granularity | Limited | Unblocked | Efficiency
    Stats support | Per-partition | Per-file | Performance
  34. More benefits Serializable isolation Row-level deletes Incremental consumption Time travel

    Schema evolution Hidden partitioning
  35. Summary Hive table format coupled with central metadata store poses

    scalability issues. Apache Iceberg table tracks data with files and solves the scalability issues. It provides additional benefits for analytic goals. We are considering adopting it widely in LINE Data Platform.
  36. First part - End Thank you

  37. Takeshi ONO Speaker - Joined LINE in March 2019 -

    Working on the development of ingestion pipeline - Data Engineering 1 team
  38. Agenda - Overview of the existing log pipeline - Problems

    of the existing log pipeline - Solutions by adopting Iceberg - Details of Flink Iceberg application - Future works
  39. Overview of the existing log pipeline

  40. Existing log pipeline

    Diagram labels: Kafka, Flink, RAW table
    - End-to-end pipeline is supported with Protobuf/JSON format serialization.
    - Flink writes Protobuf/JSON raw data into HDFS in SequenceFile format.
    - Exactly-once delivery is supported.
    - We developed Hive Protobuf Serde libraries for accessing the RAW table.
  41. Existing log pipeline

    Diagram labels: Kafka, Flink, RAW table, Watcher, Hiveserver 2, Tez on YARN, ORC table
    - Watcher process detects RAW table updates.
    - Hive Tez engine converts the SequenceFile format table to ORC using INSERT OVERWRITE statements.
    - For loose coupling with the data, all Hive tables are 'external' tables.
  42. Problems of the existing log pipeline

  43. End-to-end high latency

    - Avoid creating many small HDFS files: Flink's BucketingSink supports truncating files to recover from the last checkpoint, so it can keep writing a single SequenceFile over multiple checkpoints.
    - User demand: reduce this latency and improve data freshness, in order to make data-driven decisions and build near-real-time data-powered products.
    - Current pipeline data processing latency is almost 2 hours before data becomes accessible to end users. Reasons:
      - RAW table SequenceFiles are flushed every hour to create hourly partitioned HDFS files.
      - Hourly ORC conversion runs for over 1k partitions of the above RAW tables.
  44. End-to-end high latency (cont’d)

    Diagram labels: Kafka, Flink, ORC table
    We tried developing a Flink app that writes ORC files directly:
    - Use Flink’s StreamingFileSink and OrcWriter
    - Write ORC files directly every few minutes
    However, small-file compaction is difficult to implement:
    - Compaction can delete files just before users’ jobs read them
    - There are too many partitions and varied sending patterns, making it difficult to manage at low cost without missing compactions
    This proposal was not adopted.
  45. Issues related to robustness

    Need to manage two types of Hive external tables:
    - RAW table and ORC table
    - Create/Alter/Drop table, Add/Drop partition, Sync schema
    - Sometimes manual operations are required
    Too much metadata for partitions/files:
    - Heavy partition scan against Hive Metastore
    - Heavy directory scan against NameNode
    Too many components or dependencies:
    - Flink, Watcher, HDFS, Hive Metastore, Hiveserver2, Yarn
  46. Limited schema evolution support

    Use position-based field mapping for ORC files:
    - For backward compatibility
    - Set orc.force.positional.evolution=true
    Users often request to:
    - Drop deprecated fields
    - Insert new fields in the contextually appropriate position
    Cannot delete/insert/move fields:
    - Only adding fields to a table/struct is supported
    - Rename is possible, but not recommended for query engine compatibility
  47. Solutions by adopting Iceberg

  48. New log pipeline

    Diagram labels: Kafka, Flink, Iceberg table, Table Optimizer, Spark on YARN
    - Implemented with Iceberg 0.12, Flink 1.12 and Spark 3.0
    - Flink job writes Iceberg files (ORC/Parquet) directly
    - Flush interval is 5 minutes
    - Use Spark actions for table maintenance
    - Table Optimizer process schedules Spark jobs: merge small files, expire snapshots, rewrite manifests, remove orphans, delete expired records
  49. End-to-end minimized latency

    - Flink job's checkpointing interval is 5 minutes
    - Iceberg's snapshot isolation enables ACID small files compaction
    snapshot-010 = [00, 01, 02, 03]
    snapshot-011 = [00, 01, 02, 03, 04]
    snapshot-012 = [05, 04]
    snapshot-013 = [05, 04, 06]
    Compaction rewrites files 00 to 03 into file 05, which first appears in snapshot-012.
    A table reader that starts reading at snapshot-011 will still be able to read 00, 01, 02, 03, 04 after compaction is complete and snapshot-012 is created.
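
The snapshot sequence on this slide can be modelled with immutable file lists. The sketch below is a simplified illustration under the assumption that a commit only ever adds a new snapshot and never edits an existing one; `commit_append` and `commit_compaction` are hypothetical helper names, not Iceberg API calls.

```python
# Each snapshot is an immutable tuple of data file names.
snapshots = {
    "snapshot-010": ("00", "01", "02", "03"),
    "snapshot-011": ("00", "01", "02", "03", "04"),
}


def commit_append(snapshots, parent, new_id, added):
    """New snapshot = parent's files plus one newly written file."""
    snapshots[new_id] = snapshots[parent] + (added,)


def commit_compaction(snapshots, parent, new_id, merged, replaced):
    """New snapshot = merged file plus every parent file that was not rewritten."""
    kept = tuple(f for f in snapshots[parent] if f not in replaced)
    snapshots[new_id] = (merged,) + kept


# A reader pins snapshot-011 before compaction commits.
reader_view = snapshots["snapshot-011"]

commit_compaction(snapshots, "snapshot-011", "snapshot-012",
                  merged="05", replaced={"00", "01", "02", "03"})
commit_append(snapshots, "snapshot-012", "snapshot-013", added="06")

print(snapshots["snapshot-012"])  # ('05', '04')
print(snapshots["snapshot-013"])  # ('05', '04', '06')
print(reader_view)                # ('00', '01', '02', '03', '04')
```

The reader's view is unchanged because the old snapshot still lists the original files; those files only become removable once the old snapshot itself is expired.
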
  50. Simple/Scalable architecture

    Reduce the load on NameNode and Hive Metastore:
    - To build query execution plans, it is no longer necessary to scan NameNode/Hive Metastore; only the per-table metadata files on HDFS are read
    - No such scan is needed for table-optimizer to check the status of a table/partition
    Ease partition management tasks with Iceberg's hidden partitioning feature:
    - No need to add partitions
    - Just run a DELETE WHERE statement to drop partitions
    No dependence on Hiveserver2 or Tez on Yarn:
    - Reduces sources of log pipeline troubles
    - Delay of table-optimizer jobs is not critical
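
To illustrate why hidden partitioning removes the need to add partitions by hand, here is a small sketch of a partition value derived from a timestamp column. It follows the definition of Iceberg's `hour` transform (hours elapsed since 1970-01-01 00:00:00 UTC); whether the log tables in this talk are partitioned by hour is an assumption, and `hour_partition` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def hour_partition(ts: datetime) -> int:
    """Hours elapsed since 1970-01-01T00:00:00Z for a timezone-aware timestamp."""
    return int(ts.timestamp()) // 3600


a = datetime(2021, 10, 1, 12, 5, tzinfo=timezone.utc)
b = datetime(2021, 10, 1, 12, 55, tzinfo=timezone.utc)
c = datetime(2021, 10, 1, 13, 0, tzinfo=timezone.utc)

# Records from the same hour land in the same partition without any
# explicit ADD PARTITION step; the value is computed from the column.
print(hour_partition(a) == hour_partition(b))
print(hour_partition(c) - hour_partition(a))
```

Because the partition value is a function of the row's own timestamp, a filter or DELETE expressed on the timestamp column is enough for the table format to work out which partitions are affected.
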
  51. Fully supported schema evolution

    - Support any changes in Protobuf schema
    - Based on field ID

    table.updateSchema()
        .addColumn("count", Types.LongType.get())
        .commit();
    table.updateSchema()
        .renameColumn("host", "host_name")
        .commit();
    table.updateSchema()
        .deleteColumn("event_2020")
        .commit();
    table.updateSchema()
        .addColumn("name_2", Types.StringType.get())
        .moveAfter("name_2", "name")
        .commit();

    "schema" : {
      "type" : "struct",
      "fields" : [ {
        "id" : 1,
        "name" : "timestamp",
        "required" : false,
        "type" : "timestamptz"
      }, {
        "id" : 2,
        "name" : "count",
        "required" : false,
        "type" : "long"
      }, {
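
A minimal sketch of why ID-based resolution makes rename, drop, add, and move safe for files that were already written. The stored row, the field IDs beyond 1 and 2, and the `project` helper are assumptions for illustration; the snippet models the idea only and does not use the Iceberg library.

```python
from typing import Dict, List, Optional, Tuple

# A row as stored in an older data file, keyed by field ID (values are made up).
stored_row: Dict[int, object] = {
    1: "2021-11-10T00:00:00Z",  # timestamp
    2: 10,                      # count
    3: "web01",                 # host
    4: "legacy",                # event_2020
    5: "alice",                 # name
}

# Current schema as ordered (field_id, name) pairs after these changes:
# host renamed to host_name, event_2020 dropped, name_2 added after name.
current_schema: List[Tuple[int, str]] = [
    (1, "timestamp"),
    (2, "count"),
    (3, "host_name"),
    (5, "name"),
    (6, "name_2"),
]


def project(row: Dict[int, object],
            schema: List[Tuple[int, str]]) -> Dict[str, Optional[object]]:
    """Read a stored row through a schema; IDs absent from the file read as None."""
    return {name: row.get(field_id) for field_id, name in schema}


print(project(stored_row, current_schema))
```

The old file is never rewritten: the renamed column still resolves through ID 3, the dropped column is simply not projected, and the new column reads as null for old rows.
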
  52. Summary

    Comparison of three approaches (RAW + ORC, ORC Writer, Iceberg) on four criteria: small files problem, end-to-end latency, schema evolution, and scalability.
  53. Details of Flink Iceberg application

  54. Message format conversion

    Convert Protobuf/JSON to Flink RowData:
    - Write records into Iceberg tables with Flink's DataStream<RowData>
    - Support for Iceberg-specific handling of Timestamp/Map/Enum types
    AS-IS:
    - Convert Protobuf/JSON to Avro using org.apache.avro.protobuf libraries
    - Convert Avro to RowData using Iceberg FlinkAvroReader
    TO-BE:
    - Skip Avro conversion to improve performance and reduce complexity
    - Develop a JSON converter by referring to the implementation of the flink-formats/flink-json package
    - For Protobuf, utilize the flink-formats/flink-protobuf package (to be released in Flink 1.15)
  55. Multiple Writers and single Committer

    - IcebergStreamWriter writes files
    - IcebergFilesCommitter commits snapshot
    - Support exactly-once delivery
    - FlinkSink is a wrapper of the Writer and Committer
    Pipeline: Consumer (parallelism N) → Converter (parallelism N) → Stream Writer (parallelism N) → Files Committer (parallelism 1)
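
One way to picture the many-writers, single-committer arrangement is the toy model below. It is a sketch of the general idea only, under the assumption that files are grouped per checkpoint and committed at most once per checkpoint; the `Committer` class and its method names are hypothetical and are not the IcebergFilesCommitter API.

```python
from collections import defaultdict
from typing import Dict, List


class Committer:
    """Single committer: gathers files from all writers, commits once per checkpoint."""

    def __init__(self) -> None:
        self.pending: Dict[int, List[str]] = defaultdict(list)
        self.snapshots: List[List[str]] = []
        self.max_committed_checkpoint = -1

    def receive(self, checkpoint_id: int, files: List[str]) -> None:
        # Writers hand over the files they closed for this checkpoint.
        self.pending[checkpoint_id].extend(files)

    def on_checkpoint_complete(self, checkpoint_id: int) -> None:
        if checkpoint_id <= self.max_committed_checkpoint:
            return  # replayed notification: already committed, do nothing
        files: List[str] = []
        for cp in sorted(c for c in self.pending if c <= checkpoint_id):
            files.extend(self.pending.pop(cp))
        if files:
            self.snapshots.append(files)  # one atomic table commit
        self.max_committed_checkpoint = checkpoint_id


committer = Committer()
committer.receive(1, ["writer0-cp1.orc"])
committer.receive(1, ["writer1-cp1.orc"])
committer.on_checkpoint_complete(1)
committer.on_checkpoint_complete(1)  # duplicate notification is ignored
print(committer.snapshots)
```

Keeping the commit step at parallelism 1 means each checkpoint produces one snapshot containing all writers' files, and remembering the last committed checkpoint keeps a replay from committing the same files twice.
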
  56. Dynamic DataStream Splitter

    - We don’t use FlinkSink, to avoid shuffle
    - Some tables have incoming data volume exceeding 1 GB/sec
    - Automatically adjust the number of active writers according to the incoming rate
    - Splitter is implemented with Flink Side Outputs
    Pipeline: Consumer (parallelism N) → Splitter → multiple Converter → Stream Writer chains (parallelism N each) → Files Committer (parallelism 1)
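
The slide does not say how the number of active writers is derived from the incoming rate. The sketch below shows one plausible rule, purely as an assumption: divide the observed rate by a per-writer capacity, clamp the result, and route records round-robin across the active writers. The function names, the 100 MB/s capacity, and the cap of 16 writers are all hypothetical.

```python
import math


def active_writers(rate_mb_per_sec: float,
                   per_writer_mb_per_sec: float,
                   max_writers: int) -> int:
    """Number of writer chains to keep active for the observed incoming rate."""
    needed = math.ceil(rate_mb_per_sec / per_writer_mb_per_sec)
    return max(1, min(max_writers, needed))


def route(record_index: int, n_active: int) -> int:
    """Pick the output (writer chain) for a record, round-robin over active ones."""
    return record_index % n_active


n = active_writers(1200, per_writer_mb_per_sec=100, max_writers=16)
print(n)                                # 12
print([route(i, n) for i in range(3)])  # [0, 1, 2]
```

Using fewer writers at low rates would mean fewer files per flush interval, which fits the small-files concern raised earlier in the talk, although that motivation is an inference rather than something the slide states.
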
  57. Future works

  58. Future works

    Wide adoption:
    - Revise centralized data pipelines to generate Apache Iceberg tables
    Advanced applications:
    - Incremental data consumption for lower end-to-end latency
    - CDC data pipeline and time travel access to manage state tables
    Research and development:
    - Develop and operate data pipelines with controlled data consumers reliably at scale
    - Make sure query engine integration works (Spark 3, Hive, Trino)
    - Build table migration strategies
  59. We are hiring! Data Platform Engineer https://linecorp.com/ja/career/position/1750 Site Reliability Engineer

    https://linecorp.com/ja/career/position/1751 Data Engineer https://linecorp.com/ja/career/position/1749
  60. Thank you