Adopting Apache Iceberg on LINE Data Platform

Tomoyuki Saito
LINE / IU Tech Forward team / Software Engineer
Takeshi Ono
LINE / Data Engineering 1 team / Software Engineer

https://linedevday.linecorp.com/2021/ja/sessions/64
https://linedevday.linecorp.com/2021/en/sessions/64
https://linedevday.linecorp.com/2021/ko/sessions/64

LINE DEVDAY 2021

November 10, 2021

Transcript

  1. Speaker
    Tomoyuki Saito
    - Senior Software Engineer, Data Platform dept.
    - Joined LINE as a new graduate in 2015
    - Career
    - Bot-backend server dev
    - Log ingestion pipeline dev (2016~)
    - Data platform dev

  2. Agenda
    - First part – Problems in query processing
    - LINE's Data Platform
    - Problems in query processing
    - A table format – Apache Iceberg
    - Second part – Revamping log data pipeline with Apache Iceberg

  3. LINE's Data Platform

  4. LINE's self-serve data platform
    Data Platform
    Services
    Data Science
    Machine Learning
    Governance
    ...
    Democratize data for business consumers

  5. Typical data flow
    Data sources
    Ingest
    Storage / Metadata
    Serve
    Data consumers
    Process

  6. LINE's data flow
    Data sources
    Ingest
    Storage / Metadata
    Serve
    Data consumers
    Process
    On-premise machines & data centers

  7. Scale – Quantity
    - Number of machines: 5,000+
    - Stored data volume: 290 PB+
    - Hive tables: 40,000+

  8. Scale – Throughput / Activities
    - Log ingestion rate: 17.5 M+ records / s
    - Jobs executed: 150,000+ jobs / day
    - Platform users: 700+

  9. Challenges
    Data sources
    Ingest
    Storage / Metadata
    Serve
    Data consumers
    Process
    Throughput
    Quantity

  10. Challenges
    Data sources
    Ingest
    Storage / Metadata
    Serve
    Data consumers
    Process
    Throughput
    Quantity

  11. Problems in
    query processing

  12. Data ETL with SQL
    Storage / Metadata
    Distributed SQL query engines
    SQL
    Spark Hive Trino Flink

  13. Distributed SQL execution
    SELECT name FROM employee
    → Parse → Analysis → Optimization → Planning → Execution Plan

  14. Distributed SQL execution
    SELECT name FROM employee
    → Parse → Analysis → Optimization → Planning → Execution Plan
    To build the plan, the engine must answer:
    - What data files are in the table?
    - How to read / write data?

  15. Distributed SQL execution
    SELECT name FROM employee
    → Parse → Analysis → Optimization → Planning → Execution Plan
    - What data files are in the table?
    - How to read / write data?
    Table format determines how to answer those questions.

  16. Table format
    Determines a way to organize data files to present them as a table.
    - Data discovery: answers what data files are in a table;
      hides the complexity of finding relevant files.
    - Data abstraction: answers how to read and write data;
      hides the complexity of underlying data structures and formats.

  17. The de-facto standard – Hive table format
    - Table metadata (stats, schema, format, serde, partitions) is manipulated and
      queried via Hive Metastore's Thrift interface (create_table, get_partitions, ...;
      see the sketch below).
    - Table metadata is stored in the Metastore DB (an RDBMS).
    - Data files are defined as the entire contents of directories,
      e.g. /table/date=2021-10-01/*
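    For illustration, a minimal sketch of how an engine talks to this Thrift interface
    through the Hive Metastore Java client; the Metastore URI, database, and table names
    are hypothetical, not taken from the talk:

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.Partition;
    import org.apache.hadoop.hive.metastore.api.Table;
    import java.util.List;

    public class MetastoreLookup {
      public static void main(String[] args) throws Exception {
        HiveConf conf = new HiveConf();
        conf.set("hive.metastore.uris", "thrift://metastore-host:9083");  // hypothetical host

        HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
        try {
          // get_table: schema, format, serde, and stats come back as table metadata
          Table table = client.getTable("default", "access_log");        // hypothetical table

          // get_partitions: each partition is a row in the Metastore DB's PARTITIONS table
          List<Partition> partitions = client.listPartitions("default", "access_log", (short) 100);
          partitions.forEach(p -> System.out.println(p.getSd().getLocation()));
        } finally {
          client.close();
        }
      }
    }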

  18. LINE's ETL infrastructure
    - SQL engines: Spark, Hive, Trino (interactive and scheduled / batch jobs)
    - Storage: HDFS
    - Metadata: Hive Metastore (Thrift API) backed by the Metastore DB

  19. LINE's ETL infrastructure
    - Same architecture as above: Spark / Hive / Trino on HDFS with Hive Metastore
      and the Metastore DB
    - Each component operates at large scale

  20. Metastore DB – Regular QPS over a week

  21. Metastore DB – Regular CPU usage over a week

  22. Metastore DB – Abnormal CPU usage
    Risk of outages with a big blast radius
    Observations:
    - Slow queries at the Metastore DB
    - Scans over 11M+ rows of the 'PARTITIONS' table in the Metastore DB
    - Corresponding to a query on a Hive table with 11M+ partitions, which had been created unintentionally
    - Resolution: reduce the number of partitions

  23. Hive table format limitation
    The format relies heavily on Hive Metastore to store metadata;
    partitions are stored as rows in the Metastore DB.
    Limitation: loading a table with O(10K) partitions causes high load and
    memory pressure on Hive Metastore (HIVE-13884).
    A workaround is necessary, such as reducing the number of partitions.

  24. Diagnostics – Hive table format
    - Bottleneck: table metadata lookup performance is bounded by the capacity of the
      central Hive Metastore and Metastore DB instances.
    - Inefficient data access: coarse-grained partitioning leads to scans over more data,
      and more data to write on partition regeneration.
    - Fewer opportunities for optimization: unable to enrich metadata further,
      e.g. with per-file statistics for query optimization.

  25. Apache Iceberg
    An open table format for huge analytic datasets

  26. Apache Iceberg – An open table format for huge analytic datasets (OSS)
    - SQL query engines: Flink, Spark, Hive, Trino
    - Table format: Apache Iceberg
    - File format: Parquet, ORC, Avro
    - Storage: HDFS, S3

  27. What an Iceberg table looks like
    # Spark SQL
    create table sample (id int) using iceberg;
    insert into sample values (100); insert into sample values (200);
    select * from sample;
    # Files in HDFS
    sample
    ├── data
    │ ├── 00000-2-26bcfac0-91ba-4374-a879-b780cf0608c3-00001.parquet
    │ └── 00000-3-4bfb85d8-3283-48f7-980d-28ea115aed80-00001.parquet
    └── metadata
    ├── 00000-811eaf6e-b0f4-4bd7-8f87-a6df1d543b34.metadata.json
    ├── 00001-4041324f-1920-44f4-8ce6-6088ec663e0a.metadata.json
    ├── 00002-66aac2ec-8f9a-4de8-a679-428bb970b1ff.metadata.json
    ├── 2a67328f-8386-4d1a-873a-1034824e22f8-m0.avro
    ├── 91e78f4a-f1df-414f-835d-45488001bba9-m0.avro
    ├── snap-4758351318332926243-1-2a67328f-8386-4d1a-873a-1034824e22f8.avro
    └── snap-5465468679579016991-1-91e78f4a-f1df-414f-835d-45488001bba9.avro

  28. Key concept
    Snapshot: the state of a table at some point in time.
    Each write & commit creates a new snapshot (s0 at t0, s1 at t1, ...).
    This is how an Iceberg table tracks its data files, together with their
    partition, schema, format, stats, and file location.

  29. Metadata files to track data
    - Table metadata file: tracks the table schema, partitioning config, and snapshots.
    - Manifest list file: stores metadata about manifests, including partition stats.
    - Manifest file: lists data files, along with each file's partition data tuple,
      stats, and tracking information.
    (Diagram: Hive Metastore points to the table metadata file; snapshots s0 and s1
    reference manifest lists; manifests m0, m1, m2 reference data files d00, d01, d10, d20.)

  30. Finding files necessary for a query
    1. Find the manifest list file from the current snapshot (e.g. manifest-list = ml1).
    2. Filter manifest files using the partition value ranges stored in the manifest list file
       (e.g. for manifest m2 and partition p, the range is [20, 29]).
    3. Read each manifest to get data files (e.g. d20's file path = hdfs://...).
       See the sketch below.
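    As an illustration, a minimal sketch of the same traversal through Iceberg's scan
    planning API; the Hadoop-catalog table path and the partition column name p are
    hypothetical. planFiles() walks the snapshot → manifest list → manifests chain shown
    above and applies the stored partition ranges and per-file stats as filters:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.FileScanTask;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.TableScan;
    import org.apache.iceberg.expressions.Expressions;
    import org.apache.iceberg.hadoop.HadoopTables;
    import org.apache.iceberg.io.CloseableIterable;

    public class PlanFilesExample {
      public static void main(String[] args) throws Exception {
        // Load the table directly from its HDFS location (hypothetical path)
        Table table = new HadoopTables(new Configuration())
            .load("hdfs://namenode/warehouse/sample");

        // Scan the current snapshot; partition ranges in the manifest list prune manifests,
        // and per-file column stats in the manifests prune data files
        TableScan scan = table.newScan().filter(Expressions.equal("p", 20));

        try (CloseableIterable<FileScanTask> tasks = scan.planFiles()) {
          for (FileScanTask task : tasks) {
            System.out.println(task.file().path());
          }
        }
      }
    }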

  31. Further file filtering with per-file stats
    Manifest file stores per-file, per-column stats calculated at the time of data write.
    Manifest entry fields (excerpt):
    - file_path (string): location URI with FS scheme
    - lower_bounds (map): map of column id to lower bound
    - upper_bounds (map): map of column id to upper bound

  32. Differences & Benefits
    - Metadata stored at: Hive Metastore (Hive) vs. files (Iceberg) → scalability
    - Partitioning granularity: limited (Hive) vs. finer, unblocked (Iceberg) → efficiency
    - Stats support: per-partition (Hive) vs. per-file (Iceberg) → performance

  33. More benefits
    - Serializable isolation
    - Row-level deletes
    - Incremental consumption (see the sketch below)
    - Time travel (see the sketch below)
    - Schema evolution
    - Hidden partitioning
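    As a hedged example of two of these features, time travel and incremental reads can be
    expressed through Spark's DataFrame reader options. The table identifier db.sample is
    hypothetical; the snapshot ids are reused from the file listing on the earlier slide:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class TimeTravelExample {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("iceberg-time-travel").getOrCreate();

        // Time travel: read the table as of a specific snapshot
        Dataset<Row> asOfSnapshot = spark.read()
            .format("iceberg")
            .option("snapshot-id", "4758351318332926243")
            .load("db.sample");

        // Incremental consumption: read only the data appended between two snapshots
        Dataset<Row> incremental = spark.read()
            .format("iceberg")
            .option("start-snapshot-id", "4758351318332926243")
            .option("end-snapshot-id", "5465468679579016991")
            .load("db.sample");

        asOfSnapshot.show();
        incremental.show();
      }
    }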

  34. Summary
    The Hive table format, coupled with a central metadata store, poses scalability issues.
    An Apache Iceberg table tracks data with files and solves these scalability issues.
    It provides additional benefits for analytic workloads.
    We are considering adopting it widely in the LINE Data Platform.

  35. First part - End
    Thank you

  36. Takeshi ONO
    Speaker
    - Joined LINE in March 2019
    - Working on the development of the ingestion pipeline
    - Data Engineering 1 team

  37. Agenda
    - Overview of the existing log pipeline
    - Problems of the existing log pipeline
    - Solutions by adopting Iceberg
    - Details of Flink Iceberg application
    - Future work

  38. Overview of the existing log
    pipeline

  39. Existing log pipeline (Kafka → Flink → RAW table)
    - The end-to-end pipeline supports Protobuf/JSON serialization.
    - Flink writes Protobuf/JSON raw data into HDFS in SequenceFile format.
    - Supports exactly-once delivery.
    - We developed Hive Protobuf SerDe libraries for accessing the RAW table.

  40. Existing log pipeline (Kafka → Flink → RAW table → Watcher → HiveServer2 / Tez on YARN → ORC table)
    - A Watcher process detects RAW table updates.
    - Hive's Tez engine converts the SequenceFile-format table to ORC using
      INSERT OVERWRITE statements.
    - For loose coupling with the data, all Hive tables are 'external' tables.

  41. Problems of the existing log
    pipeline

  42. End-to-end high latency
    Current pipeline latency is almost 2 hours before data becomes accessible to end users, because:
    - The RAW table SequenceFile is flushed every hour to create hourly partitioned HDFS files.
    - Hourly ORC conversion runs over 1k+ partitions of those RAW tables.
    Avoiding the creation of many small HDFS files
    - Flink's BucketingSink supports truncating files to recover from the last checkpoint, so it can
      keep writing a single SequenceFile over multiple checkpoints.
    User demand
    - Reduce this latency and improve data freshness, to enable data-driven decisions and
      near-real-time data-powered products.

  43. End-to-end high latency (cont'd)
    We tried developing a Flink app that writes ORC files directly (Kafka → Flink → ORC table):
    - Use Flink's StreamingFileSink and OrcWriter
    - Write ORC files directly every few minutes
    However, compaction of the small files is difficult to implement:
    - Compaction can delete files just before users' jobs read them
    - With too many partitions and varied sending patterns, it is difficult to manage at low cost
      without missing any compaction
    This proposal was not adopted.

  44. Issues related to robustness
    Need to manage two types of Hive external tables (RAW table and ORC table)
    - Create/Alter/Drop table, Add/Drop partition, sync schema
    - Sometimes manual operations are required
    Too much metadata for partitions/files
    - Heavy partition scans against Hive Metastore
    - Heavy directory scans against the NameNode
    Too many components and dependencies
    - Flink, Watcher, HDFS, Hive Metastore, HiveServer2, YARN

  45. Limited schema evolution support
    Position-based field mapping is used for ORC files
    - For backward compatibility
    - Set orc.force.positional.evolution=true
    Users often request to
    - Drop deprecated fields
    - Insert new fields in the contextually appropriate position
    But fields cannot be deleted/inserted/moved
    - Only adding fields to a table/struct is supported
    - Renaming is possible, but not recommended for query engine compatibility

  46. Solutions by adopting
    Iceberg

  47. New log pipeline (Kafka → Flink → Iceberg table; Table Optimizer → Spark on YARN)
    - Implemented with Iceberg 0.12, Flink 1.12, and Spark 3.0
    - The Flink job writes Iceberg data files (ORC/Parquet) directly
    - Flush interval is 5 minutes
    - Spark actions are used for table maintenance: merge small files, expire snapshots,
      rewrite manifests, remove orphans, delete expired records (see the sketch below)
    - A Table Optimizer process schedules these Spark jobs
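    A minimal sketch of what such maintenance jobs might look like with Iceberg's Spark
    actions; the entry point shown (SparkActions.get()) and the option and threshold values
    are assumptions and differ slightly across Iceberg releases (older ones use
    Actions.forTable instead):

    import org.apache.iceberg.Table;
    import org.apache.iceberg.spark.actions.SparkActions;

    public class TableMaintenance {
      public static void run(Table table) {
        // Merge small files written by the 5-minute Flink commits into larger ones
        SparkActions.get()
            .rewriteDataFiles(table)
            .option("target-file-size-bytes", String.valueOf(512L * 1024 * 1024))
            .execute();

        // Expire old snapshots so unreferenced metadata and data files can be cleaned up
        SparkActions.get()
            .expireSnapshots(table)
            .expireOlderThan(System.currentTimeMillis() - 7L * 24 * 60 * 60 * 1000)
            .execute();

        // Rewrite manifests to keep metadata reads fast
        SparkActions.get()
            .rewriteManifests(table)
            .execute();

        // Remove orphan files left behind by failed writes
        SparkActions.get()
            .deleteOrphanFiles(table)
            .execute();
      }
    }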

  48. End-to-end minimized latency
    - The Flink job's checkpointing interval is 5 minutes
    - Iceberg's snapshot isolation enables ACID compaction of small files
    Example (data files 00-06, where 05 is the compacted replacement for 00-03):
    - snapshot-010 = [00, 01, 02, 03]
    - snapshot-011 = [00, 01, 02, 03, 04]
    - snapshot-012 = [05, 04]
    - snapshot-013 = [05, 04, 06]
    A table reader that starts reading at snapshot-011 can still read 00, 01, 02, 03, 04
    after compaction completes and snapshot-012 is created.

  49. Simple/Scalable architecture
    Reduces the load on the NameNode and Hive Metastore
    - To build query execution plans, there is no longer any need to scan the NameNode or
      Hive Metastore; only the per-table metadata files on HDFS are read
    - The table-optimizer needs no such scan to check the status of tables/partitions
    Eases partition management with Iceberg's hidden partitioning (see the sketch below)
    - No need to add partitions
    - Just run a DELETE ... WHERE statement to drop partitions
    No dependence on HiveServer2 or Tez on YARN
    - Fewer sources of log pipeline trouble
    - Delays of table-optimizer jobs are not critical
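    For illustration, a hedged sketch of hidden partitioning and a partition drop through
    the Iceberg Java API; the schema, column names, namespace, and cutoff timestamp are
    hypothetical:

    import org.apache.iceberg.PartitionSpec;
    import org.apache.iceberg.Schema;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.Catalog;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.iceberg.expressions.Expressions;
    import org.apache.iceberg.types.Types;

    public class HiddenPartitioning {
      public static void example(Catalog catalog) {
        // Hidden partitioning: partition by hour(timestamp) without exposing a separate
        // partition column; writers and readers never manage partition values themselves
        Schema schema = new Schema(
            Types.NestedField.optional(1, "timestamp", Types.TimestampType.withZone()),
            Types.NestedField.optional(2, "message", Types.StringType.get()));
        PartitionSpec spec = PartitionSpec.builderFor(schema)
            .hour("timestamp")
            .build();
        Table table = catalog.createTable(TableIdentifier.of("logs", "sample"), schema, spec);

        // Dropping old partitions is a metadata-level delete by row filter,
        // equivalent to a DELETE ... WHERE statement on the partition source column
        table.newDelete()
            .deleteFromRowFilter(Expressions.lessThan("timestamp", "2021-10-01T00:00:00+00:00"))
            .commit();
      }
    }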

  50. Fully supported schema evolution
    - Supports any changes in the Protobuf schema
    - Based on field IDs
    table.updateSchema()
    .addColumn("count", Types.LongType.get())
    .commit();
    table.updateSchema()
    .renameColumn("host", "host_name")
    .commit();
    table.updateSchema()
    .deleteColumn("event_2020")
    .commit();
    table.updateSchema()
    .addColumn("name_2", Types.StringType.get())
    .moveAfter("name_2", "name")
    .commit();
    "schema" : {
    "type" : "struct",
    "fields" : [ {
    "id" : 1,
    "name" : "timestamp",
    "required" : false,
    "type" : "timestamptz"
    }, {
    "id" : 2,
    "name" : ”count",
    "required" : false,
    "type" : "long”
    }, {

  51. Summary
    Comparison of the three approaches (RAW + ORC pipeline, direct ORC writer, Iceberg) on:
    small files problem, end-to-end latency, schema evolution, and scalability.
    The Iceberg-based pipeline addresses all four.

  52. Details of Flink Iceberg
    application

  53. Message format conversion
    AS-IS
    - Convert Protobuf/JSON to Avro using the org.apache.avro.protobuf libraries
    - Convert Avro to RowData using Iceberg's FlinkAvroReader
    TO-BE: convert Protobuf/JSON directly to Flink RowData (see the sketch below)
    - Skip the Avro conversion to improve performance and reduce complexity
    - Develop a JSON converter by referring to the implementation of the flink-formats/flink-json package
    - For Protobuf, utilize the flink-formats/flink-protobuf package (to be released in Flink 1.15)
    - Write records into Iceberg tables with Flink's DataStream API
    - Support Iceberg-specific handling of Timestamp/Map/Enum types
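    A minimal sketch of the direct JSON-to-RowData idea, assuming a two-field schema of
    timestamp and message; the field names, order, and Jackson usage are illustrative,
    not the actual LINE converter:

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.flink.table.data.GenericRowData;
    import org.apache.flink.table.data.RowData;
    import org.apache.flink.table.data.StringData;
    import org.apache.flink.table.data.TimestampData;

    public class JsonToRowData {
      private static final ObjectMapper MAPPER = new ObjectMapper();

      // Convert one JSON log record directly into Flink's internal RowData,
      // skipping the intermediate Avro representation used in the AS-IS pipeline
      public static RowData convert(byte[] jsonBytes) throws Exception {
        JsonNode node = MAPPER.readTree(jsonBytes);

        GenericRowData row = new GenericRowData(2);
        // Iceberg's Flink writer expects timestamps as TimestampData
        row.setField(0, TimestampData.fromEpochMillis(node.get("timestamp").asLong()));
        row.setField(1, StringData.fromString(node.get("message").asText()));
        return row;
      }
    }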

  54. Multiple Writers and single Committer
    - IcebergStreamWriter writes files
    - IcebergFilesCommitter commits the snapshot
    - Supports exactly-once delivery
    - FlinkSink is a wrapper around the Writer and Committer (see the sketch below)
    (Operator chain: Consumer → Converter → StreamWriter, each with parallelism N;
    FilesCommitter with parallelism 1)
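    For reference, the wrapper mentioned above is typically wired up roughly as follows,
    a sketch against the Iceberg Flink module around 0.12; the table path and write
    parallelism are hypothetical:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.table.data.RowData;
    import org.apache.iceberg.flink.TableLoader;
    import org.apache.iceberg.flink.sink.FlinkSink;

    public class IcebergSinkWiring {
      public static void attachSink(DataStream<RowData> rows) {
        // Load the Iceberg table from its HDFS location (hypothetical path)
        TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://namenode/warehouse/logs/sample");

        // FlinkSink wraps IcebergStreamWriter (parallel, writes data files) and
        // IcebergFilesCommitter (parallelism 1, commits a snapshot per checkpoint)
        FlinkSink.forRowData(rows)
            .tableLoader(tableLoader)
            .writeParallelism(4)
            .build();
      }
    }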

  55. Dynamic DataStream Splitter
    - We don't use FlinkSink itself, to avoid a shuffle
    - Needed for tables with incoming data volume exceeding 1 GB/s
    - Automatically adjusts the number of active writers according to the incoming rate
    - The Splitter is implemented with Flink Side Outputs (see the sketch below)
    (Operator chain: Consumer → Splitter → multiple Converter → StreamWriter branches,
    each with parallelism N; a single FilesCommitter with parallelism 1)
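    The splitter can be sketched with Flink Side Outputs as below; the routing rule, the
    number of outputs, and all names are hypothetical, while the production logic adjusts
    the set of active writers based on the measured incoming rate:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.table.data.RowData;
    import org.apache.flink.util.Collector;
    import org.apache.flink.util.OutputTag;

    public class DynamicSplitter {
      // One OutputTag per downstream Converter → StreamWriter chain (two shown for brevity)
      static final OutputTag<RowData> WRITER_0 = new OutputTag<RowData>("writer-0") {};
      static final OutputTag<RowData> WRITER_1 = new OutputTag<RowData>("writer-1") {};

      public static SingleOutputStreamOperator<RowData> split(DataStream<RowData> input) {
        return input.process(new ProcessFunction<RowData, RowData>() {
          @Override
          public void processElement(RowData record, Context ctx, Collector<RowData> out) {
            // Route by a simple hash here; a real splitter would pick the target based on
            // the incoming rate and the number of currently active writers
            OutputTag<RowData> target = (record.hashCode() & 1) == 0 ? WRITER_0 : WRITER_1;
            ctx.output(target, record);
          }
        });
      }
    }

    // Usage: SingleOutputStreamOperator<RowData> main = DynamicSplitter.split(stream);
    //        DataStream<RowData> forWriter0 = main.getSideOutput(DynamicSplitter.WRITER_0);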

  56. Future work

  57. Future work
    Wide adoption
    - Revise centralized data pipelines to generate Apache Iceberg tables
    Advanced applications
    - Incremental data consumption for lower end-to-end latency
    - CDC data pipelines and time travel access to manage state tables
    Research and development
    - Develop and operate data pipelines with controlled data consumers, reliably and at scale
    - Make sure query engine integrations work (Spark 3, Hive, Trino)
    - Build table migration strategies

  58. We are hiring!
    Data Platform Engineer
    https://linecorp.com/ja/career/position/1750
    Site Reliability Engineer
    https://linecorp.com/ja/career/position/1751
    Data Engineer
    https://linecorp.com/ja/career/position/1749
