
[Presto Meetup] Optimizing Table Layout for Presto using Apache Hudi

Ahana
June 23, 2022

Ethan Guo & Vinoth Chandar - Apache Hudi/Onehouse


Transcript

  1. Optimizing Table Layout for Presto
    using Apache Hudi
    Ethan Guo & Vinoth Chandar
    Apache Hudi/Onehouse


  2. Speakers
    Vinoth Chandar
    ❏ CEO/Founder, onehouse.ai
    ❏ PMC Chair/Creator Apache Hudi
    ❏ Principal Eng@Uber
    (Data, Infra, Database, Networking)
    ❏ Principal Eng@Confluent
    (ksqlDB, Kafka, Streams)
    ❏ Staff Eng@Linkedin (Voldemort kv store)
    ❏ Sr Eng@Oracle (CDC/Goldengate/XStream)
    Ethan Guo
    ❏ Database [email protected]
    ❏ Committer@Apache Hudi
    ❏ Sr Eng@Uber
    (Data/Incremental Processing,
    Networking)


  3. Agenda
    1) Intro to Hudi
    2) Table Layout Optimizations
    3) Hudi Clustering
    4) Community
    5) Questions


  4. Hudi Intro
    Motivation, Concepts, Community


  5. Origins@Uber 2016
    Context
    ❏ Uber in hypergrowth
    ❏ Moving from warehouse to lake
    ❏ HDFS/Cloud storage is immutable
    Problems
    ❏ Extremely poor ingest performance
    ❏ Wasteful reading/writing
    ❏ Zero concurrency control or ACID


  6. Missing pieces: Upserts, Deletes & Incrementals


  7. Hudi Table APIs
    Transactional Data Lake a.k.a Lakehouse
    ❏ Serverless, transactional layer over lakes
    ❏ Multi-engine; storage decoupled from engine/compute
    ❏ Introduced Copy-On-Write and Merge-On-Read
    ❏ Change capture on lakes
    ❏ Ideas now heavily borrowed outside
    (Diagram: writers and queries over cloud storage holding files, metadata, and a txn log)
    https://eng.uber.com/hoodie/ (Mar 2017)


  8. Hudi Data Lake
    ❏ Database abstraction for cloud storage/hdfs
    ❏ Near real-time ingestion using ACID updates
    ❏ Incremental, Efficient ETL downstream
    ❏ Built-in table services


  9. Copy-On-Write Table
    ❏ commit time=0, Insert: A, B, C, D, E
      Files written: file1_t0.parquet (A, B), file2_t0.parquet (C, D), file3_t0.parquet (E)
      Snapshot query: A,B,C,D,E; Incremental query: A,B,C,D,E
    ❏ commit time=1, Update: A => A’, D => D’
      Files rewritten: file1_t1.parquet (A’, B), file2_t1.parquet (C, D’)
      Snapshot query: A’,B,C,D’,E; Incremental query: A’,D’
    ❏ commit time=2, Update: A’ => A”, E => E’, Insert: F
      Files rewritten: file1_t2.parquet (A”, B), file3_t2.parquet (E’, F)
      Snapshot query: A”,B,C,D’,E’,F; Incremental query: A”,E’,F
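The copy-on-write flow above can be sketched as a toy model in plain Python (an illustration only, not Hudi's actual implementation): every commit rewrites the whole file slice of any file group containing an updated key and stamps each record with the commit that last wrote it, so a snapshot query reads the latest slice per file group while an incremental query filters on commit time.

```python
# Toy Copy-On-Write table: commits rewrite whole file slices; records
# carry the commit time that last wrote them. Illustrative only.

class CowTable:
    def __init__(self, group_of):
        self.group_of = group_of   # routes an inserted key to a file group
        self.slices = {}           # group -> [(commit, {key: (value, commit)})]
        self.key_group = {}
        self.commit = -1

    def upsert(self, records):
        self.commit += 1
        touched = {}
        for key, value in records.items():
            group = self.key_group.setdefault(key, self.group_of(key))
            touched.setdefault(group, {})[key] = value
        for group, changes in touched.items():
            history = self.slices.setdefault(group, [])
            data = dict(history[-1][1]) if history else {}   # copy old file
            for key, value in changes.items():
                data[key] = (value, self.commit)             # stamp commit time
            history.append((self.commit, data))              # write new slice

    def snapshot(self):
        # Read the latest file slice of every file group.
        return {k: v for h in self.slices.values()
                for k, (v, _) in h[-1][1].items()}

    def incremental(self, since):
        # Return only records written after the given commit.
        return {k: v for h in self.slices.values()
                for k, (v, t) in h[-1][1].items() if t > since}


# Mirror the slide: file1 holds A, B; file2 holds C, D; file3 holds E, F.
t = CowTable(lambda k: {"A": 1, "B": 1, "C": 2, "D": 2, "E": 3, "F": 3}[k])
t.upsert({"A": "A", "B": "B", "C": "C", "D": "D", "E": "E"})  # commit 0
t.upsert({"A": "A'", "D": "D'"})                              # commit 1
t.upsert({"A": 'A"', "E": "E'", "F": "F"})                    # commit 2
```

After commit 2, `t.snapshot()` yields the full latest view (A", B, C, D', E', F) and `t.incremental(1)` yields just the records touched after commit 1 (A", E', F), matching the query results on the slide.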


  10. Merge-On-Read Table
    ❏ commit time=0, Insert: A, B, C, D, E
      Base files: file1_t0.parquet (A, B), file2_t0.parquet (C, D), file3_t0.parquet (E)
      Snapshot: A,B,C,D,E; Incremental: A,B,C,D,E; Read optimized: A,B,C,D,E
    ❏ commit time=1, Update: A => A’, D => D’
      Log files: .file1_t1.log (A’), .file2_t1.log (D’)
      Snapshot: A’,B,C,D’,E; Incremental: A’,D’; Read optimized: A,B,C,D,E
    ❏ commit time=2, Update: A’ => A”, E => E’, Insert: F
      Log files: .file1_t2.log (A”), .file3_t2.log (E’)
      Snapshot: A”,B,C,D’,E’,F; Incremental: A”,E’,F; Read optimized: A,B,C,D,E
    ❏ commit time=3, Compaction
      Base files rewritten: file1_t3.parquet (A”, B), file2_t3.parquet (C, D’), file3_t3.parquet (E’, F)
      Snapshot: A”,B,C,D’,E’,F; Incremental: A”,E’,F; Read optimized: A”,B,C,D’,E’,F
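The merge-on-read trade-off can likewise be sketched for a single file group (a toy model, not Hudi's code): updates append cheaply to a log, snapshot queries pay a merge cost at read time, read-optimized queries read only the possibly stale base file, and compaction folds the log back into a fresh base.

```python
# Toy Merge-On-Read file group: base parquet + appended log blocks.
# Illustrative only.

class MorFileGroup:
    def __init__(self, base):
        self.base = dict(base)   # latest base file, e.g. file1_t0.parquet
        self.log = []            # log blocks in commit order, e.g. .file1_t1.log

    def update(self, changes):
        self.log.append(dict(changes))   # cheap append, no file rewrite

    def read_optimized(self):
        return dict(self.base)           # base only: fast, possibly stale

    def snapshot(self):
        merged = dict(self.base)
        for block in self.log:           # merge log blocks onto the base
            merged.update(block)
        return merged

    def compact(self):
        self.base = self.snapshot()      # rewrite base, e.g. file1_t3.parquet
        self.log = []


fg = MorFileGroup({"A": "A", "B": "B"})
fg.update({"A": "A'"})   # commit 1 lands in the log
fg.update({"A": 'A"'})   # commit 2 lands in the log
# At this point snapshot() sees A" while read_optimized() still sees A.
fg.compact()
# After compaction both views agree again.
```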


  11. Table Structure


  12. File Group Structure


  13. The Hudi Platform
    ❏ Lake Storage (cloud object stores, HDFS, …)
    ❏ Open File/Data Formats (Parquet, HFile, Avro, ORC, …)
    ❏ Transactional Database Layer
      ❏ Concurrency Control (OCC, MVCC, non-blocking, lock providers, orchestration, scheduling, …)
      ❏ Table Services (cleaning, compaction, clustering, indexing, file sizing, …)
      ❏ Indexes (Bloom filter, HBase, bucket index, hash based, Lucene, …)
      ❏ Table Format (schema, file listings, stats, evolution, …)
      ❏ Lake Cache (columnar, transactional, mutable, WIP, …)
      ❏ Metaserver (stats, table service coordination, …)
    ❏ Execution/Runtimes: SQL query engines (Spark, Flink, Hive, Presto, Trino, Impala, Redshift, BigQuery, Snowflake, …)
    ❏ Platform Services (streaming/batch ingest, various sources, catalog sync, admin CLI, data quality, …)


  14. Hudi Ecosystem


  15. Table Layout
    Optimizations
    Primer, Clustering, Space-filling curves


  16. Factors affecting Query Performance
    ❏ Efficient metadata fetching -> table formats (file listings, column stats) + metastores
    ❏ Quality of plans -> SQL optimizers
    ❏ Speed of SQL -> engine specific (vectorized reading, serialization, shuffle algorithms, …)
    Focus for this talk: how to read fewer bytes of input? Done right, this can yield orders-of-magnitude speed-ups.


  17. Reading fewer bytes from Input Tables
    Indexes
    ❏ Helpful for selective queries, i.e. needles in haystacks
    ❏ B-trees, bloom filters, bitmaps, …
    Caching
    ❏ Eliminate access to storage in the common case
    ❏ Read-through, write-through, columnar vs row based
    Storage Layout
    ❏ Control how data is physically organized in storage
    ❏ Bucketing, Clustering
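To make the storage-layout point concrete, here is a minimal data-skipping sketch (file names and statistics are made up for illustration): given per-file min/max column statistics, a predicate prunes every file whose value range cannot match, and a clustered layout is precisely what makes those ranges narrow and disjoint.

```python
# Minimal data-skipping sketch: prune files whose [min, max] range
# for a column cannot satisfy an equality predicate. File names and
# stats below are hypothetical.

def prune(files, column, value):
    """Return only the files whose stats admit `column == value`."""
    return [f for f in files
            if f["stats"][column][0] <= value <= f["stats"][column][1]]

# Clustered layout: repo_id ranges are narrow and disjoint, so an
# equality predicate touches one file instead of all three.
clustered = [
    {"name": "f1.parquet", "stats": {"repo_id": (0, 99)}},
    {"name": "f2.parquet", "stats": {"repo_id": (100, 199)}},
    {"name": "f3.parquet", "stats": {"repo_id": (200, 299)}},
]
# Unclustered (ingestion-order) layout: every file spans the whole range.
unclustered = [
    {"name": "g1.parquet", "stats": {"repo_id": (0, 299)}},
    {"name": "g2.parquet", "stats": {"repo_id": (0, 299)}},
    {"name": "g3.parquet", "stats": {"repo_id": (0, 299)}},
]

print(len(prune(clustered, "repo_id", 150)))    # 1 file read
print(len(prune(unclustered, "repo_id", 150)))  # 3 files read
```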


  18. Clustering to micro partition data


  19. Clustering - Basic Idea


  20. Handling Multi-dimensional data
    ❏ Simplest clustering algorithm: sort the data by a set of fields f1, f2, …, fn
    ❏ Most effective for queries with
      ❏ f1 as predicate
      ❏ f1, f2 as predicates
      ❏ f1, f2, f3 as predicates
      ❏ …
    ❏ Effectiveness decreases right to left, e.g. with only f3 as predicate


  21. Space Curves
    ❏ Basic idea: multi-dimensional ordering/sorting
    ❏ Map multiple dimensions to a single dimension
    ❏ About a dozen exist in the literature, developed over a few decades
    Z-Order Curves
    ❏ Interleave the binary representations of the points
    ❏ The resulting z-value’s order depends on all fields
    Hilbert Curves
    ❏ Better ordering properties in high dimensions
    ❏ More expensive to build for higher orders
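A minimal Z-order sketch (illustrative, assuming fixed-width unsigned integer coordinates): interleaving the bits of each dimension produces a single sort key whose order depends on all fields at once.

```python
# Z-order sketch: interleave the bits of each dimension so that the
# resulting single key preserves locality in all dimensions at once.

def z_value(coords, bits=8):
    """Interleave `bits` bits from each coordinate into one integer."""
    z = 0
    for bit in range(bits):                       # most significant bit first
        for coord in coords:
            z = (z << 1) | ((coord >> (bits - 1 - bit)) & 1)
    return z

# Sorting a 4x4 grid of 2-D points by z-value visits them in the
# familiar "Z" pattern, keeping nearby points in both dimensions close.
points = [(x, y) for x in range(4) for y in range(4)]
points.sort(key=lambda p: z_value(p, bits=2))
```

Hilbert curves replace the plain interleave with recursive rotations, which gives better locality in high dimensions at a higher construction cost, as the slide notes.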


  22. Hudi Clustering
    Clustering configs, Presto benchmarks


  23. Hudi Clustering Goals
    Optimize data layout alongside ingestion
    ❏ Problem 1: faster ingestion -> smaller file sizes
    ❏ Problem 2: data locality for queries (e.g., by city) ≠ ingestion order (e.g., trips by time)
    ❏ Auto sizing, reorg data, no compromise on ingestion


  24. Hudi Clustering Service
    Self-managed table service
    ❏ Scheduling: identify target data, generate a plan in the timeline
    ❏ Running: execute the plan with a pluggable strategy
      ❏ Reorg data with linear sorting, Z-order, Hilbert, etc.
    ❏ “REPLACE” commit in the timeline
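The scheduling step can be illustrated with a toy planner (not Hudi's actual strategy code): pick files below a small-file limit and bin them into groups of roughly the target output size, each group to be rewritten into well-sized, sorted files.

```python
# Toy clustering scheduler: select files under a small-file limit and
# bin them into groups of roughly target_bytes each. A real planner
# would also track file IDs and sort keys; sizes here are illustrative.

def plan_clustering(file_sizes, small_file_limit, target_bytes):
    candidates = [s for s in file_sizes if s < small_file_limit]
    groups, current, total = [], [], 0
    for size in sorted(candidates, reverse=True):   # largest first
        if total + size > target_bytes and current:
            groups.append(current)                  # close a full group
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        groups.append(current)
    return groups

# Eight 40 MB small files with a 256 MB target (sizes in MB here)
# yield one group of six files and one group of two.
plan = plan_clustering([40] * 8, small_file_limit=100, target_bytes=256)
```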


  25. Hudi Deployment Models
    ❏ Schedule and run clustering inline or async, in different models


  26. Hudi ⇔ Presto Integration
    ❏ Hudi is supported via the Hive connector and a new Hudi connector
    PrestoDB and Apache Hudi: https://prestodb.io/blog/2020/08/04/prestodb-and-hudi
    ❏ Hive connector
      ❏ COW and MOR, snapshot and read optimized queries
      ❏ Recognizes Hudi file layout through native integration
      ❏ File listing based on Hudi table metadata
    ❏ New Hudi connector (merged on master)
      ❏ COW (snapshot) and MOR (read optimized) querying
      ❏ Opens up more optimization opportunities, e.g. data skipping, multi-modal indexes
      ❏ Optimized storage layout in Hudi tables readily available


  27. Experiment Setup
    ❏ Dataset: Github Archive Dataset, 12 months (21/04~22/03), 478GB (928M records)
    ❏ Cardinality: “type”/15, “repo_id”/64.9M, “hour_of_day”/24
    ❏ Hudi tables
    ❏ COW, partitioned by year and month, bulk inserts (followed by clustering)
    ❏ Sorting columns: “repo_id”, “type”, “hour_of_day” (in order)
    ❏ Different strategies: no clustering, linear sorting, Z-ordering
    ❏ Presto on EKS, 0.274-SNAPSHOT, with r5.4xlarge instances
    ❏ coordinator: 15 vCPU, 124 GB memory
    ❏ 4 workers: each has 15 vCPU, 124 GB memory
    ❏ query.max-memory=360GB, query.max-memory-per-node=90GB


  28. Configuring Hudi Clustering
    Spark data source as an example
    ❏ Enable inline clustering
    ❏ Control clustering frequency
    ❏ Configure sizes
    ❏ Set sorting columns
    ❏ Select layout optimization strategy
    Hudi clustering docs: https://hudi.apache.org/docs/clustering/
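The configuration snippet shown on this slide did not survive transcription; the options below reconstruct its gist from the linked Hudi clustering docs (key names as of roughly Hudi 0.11; verify against the docs for your version):

```python
# Spark datasource write options for inline clustering. Key names are
# taken from the Hudi clustering docs; verify them for your Hudi version.
hudi_clustering_opts = {
    # Enable inline clustering and control its frequency
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    # Configure sizes: which files to cluster, and how big outputs should be
    "hoodie.clustering.plan.strategy.small.file.limit": str(600 * 1024 * 1024),
    "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1024 * 1024 * 1024),
    # Set sorting columns (as used in the benchmark) and the layout strategy
    "hoodie.clustering.plan.strategy.sort.columns": "repo_id,type,hour_of_day",
    "hoodie.layout.optimize.strategy": "z-order",  # or "linear", "hilbert"
}

# These would typically be passed to a Spark write, e.g.:
# df.write.format("hudi").options(**hudi_clustering_opts).mode("append").save(path)
```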


  29. Benchmark
    ❏ Q1: How many activities for a particular repo?
      ❏ select count(*) from github_archive where repo_id = '76474200';
    ❏ Q2: How many issue comments?
      ❏ select count(*) from github_archive where type = 'IssueCommentEvent';
    Sorting columns: “repo_id”, “type”, “hour_of_day”
    Takeaways
    ❏ Linear sorting is a better fit for hierarchical data
    ❏ Z-ordering balances the locality across multi-dimensional data


  30. Community
    Adoption, Roadmap, Engaging with us


  31. What is the industry doing today?
    Uber rides - 250+ petabytes, from 24h+ to minutes of latency
    https://eng.uber.com/uber-big-data-platform/
    Package deliveries - real-time event analytics at PB scale
    https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/
    TikTok/Bytedance recommendation system - at exabyte scale
    http://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance
    Trading transactions - near real-time CDC from 4000+ Postgres
    https://robinhood.engineering/author-balaji-varadarajan-e3f496815ebf
    Real-time advertising for 20M+ concurrent viewers
    https://www.youtube.com/watch?v=mFpqrVxxwKc
    Store transactions - CDC & warehousing
    https://searchdatamanagement.techtarget.com/feature/Hudi-powering-data-lake-efforts-at-Walmart-and-Disney-Hotstar


  32. What is the industry doing today?
    Lake House Architecture @ Halodoc: Data Platform 2.0
    https://blogs.halodoc.io/lake-house-architecture-halodoc-data-platform-2-0/
    Incremental, multi-region data lake platform
    https://aws.amazon.com/blogs/architecture/insights-for-ctos-part-3-growing-your-business-with-modern-data-capabilities/
    Unified batch + streaming data lake on Hudi
    https://developpaper.com/apache-hudi-x-pulsar-meetup-hangzhou-station-is-hot-and-the-practice-dry-goods-are-waiting-for-you/
    150 source systems, ETL processing for 10,000+ tables
    https://aws.amazon.com/blogs/big-data/how-ge-aviation-built-cloud-native-data-pipelines-at-enterprise-scale-using-the-aws-platform/
    Streaming data lake for device data
    https://www.youtube.com/watch?v=8Q0kM-emMyo
    Near real-time grocery delivery tracking
    https://lambda.blinkit.com/origins-of-data-lake-at-grofers-6c011f94b86c


  33. The Community
    2100+ Slack members
    225+ contributors
    1000+ GH engagers
    20+ committers
    Pre-installed on 5 cloud providers
    Diverse PMC/Committers
    1M DLs/month (400% YoY)
    800B+ records/day (from even just 1 customer!)
    Rich community of participants


  34. How We Operate
    Friendly and diverse community
    - 20+ PMC members/committers from 10+ organizations
    Developers
    - Propose new RFCs (GitHub label: rfcs)
    - Dev list discussions, JIRA for issue tracking
    - Monthly minor releases, quarterly major releases
    Users
    - Weekly community on-call, office hours
    - Issue triage, bug filing process on GitHub
    ~2100 Slack members
    250+ contributors
    3000+ GH engagers
    ~20 PRs/week
    20+ committers
    10+ PMC members


  35. Roadmap
    Ongoing
      Txns/Database layer
      - 0.11 stabilization, post follow-ups
      - Reliability, usability low-hanging fruits
      - Non-keyed tables
      - Harden async indexing
      Engine Integrations
      - Presto/Trino connectors landing
      - Spark SQL performance
      Platform Services
      - Hardening continuous mode, Kafka Connect
    0.12 (July 2022)
      Txns/Database layer
      - Engine native writing
      - Indexed columns
      - Record level index
      - Eager conflict detection
      - New CDC format
      Engine Integrations
      - Extend ORC to all engines
      Platform Services
      - Snowflake external tables
      - Table management service
      - Kinesis, Pulsar sources
    0.12+ (Q3 2022)
      Txns/Database layer
      - Lock-free concurrency control
      - Federated storage layout
      - Lucene, bitmaps, … diverse types of indexes
      Engine Integrations
      - CDC integration with all engines
      - Multi-modal index full integration
      - Schema evolution GA
      - New file group reader
      Platform Services
      - Reliable incremental ingest for GCS
      - Error tables
    1.0 & Beyond (Q4 2022)
      Txns/Database layer
      - Infinite retention on the timeline
      - Time-travel updates, deletes
      - General purpose multi-table txns
      Engine Integrations
      - DML support from Hive, Presto, Trino, etc.
      - Early integrations with Dask, Ray, Rust, Python, C++
      Platform Services
      - Materialized views with Flink
      - Caching (experimental)
      - Metaserver GA
      - Airbyte integration


  36. Metaserver (Coming in 2022)
    Interesting fact: Hudi already has a metaserver
    - Runs on the Spark driver; serves FileSystem RPCs + queries on the timeline
    - Backed by RocksDB (pluggable)
    - Updated incrementally on every timeline action
    - Very useful in streaming jobs
    Data lakes need a new metaserver
    - Flat file metastores are cool? (really?)
    - Speed up planning by orders of magnitude


  37. Lake Cache (Coming in 2022)
    LRU cache à la a DB buffer pool
    Frequent commits => small objects/blocks
    - Today: aggressively run table services
    - Tomorrow: file group/Hudi file model aware caching
    - Mutable data => filesystem/block-level caches are not that effective
    Benefits
    - Great performance for CDC tables
    - Avoid open/close costs for small objects


  38. Engage With Our Community
    User Docs : https://hudi.apache.org
    Technical Wiki : https://cwiki.apache.org/confluence/display/HUDI
    Github : https://github.com/apache/hudi/
    Twitter : https://twitter.com/apachehudi
    Mailing list(s) : [email protected] (send an empty email to subscribe)
    [email protected] (actual mailing list)
    Slack : https://join.slack.com/t/apache-hudi/signup
    Community Syncs : https://hudi.apache.org/community/syncs


  39. Thanks!
    Questions?
