
[Presto Meetup] Optimizing Table Layout for Presto using Apache Hudi

June 23, 2022

Ethan Guo & Vinoth Chandar - Apache Hudi/Onehouse


  1. Speakers
    Vinoth Chandar
    ❏ CEO/Founder, onehouse.ai
    ❏ PMC Chair/Creator, Apache Hudi
    ❏ Principal Eng@Uber (Data, Infra, Database, Networking)
    ❏ Principal Eng@Confluent (ksqlDB, Kafka, Streams)
    ❏ Staff Eng@LinkedIn (Voldemort kv store)
    ❏ Sr Eng@Oracle (CDC/GoldenGate/XStream)
    Ethan Guo
    ❏ Database [email protected]
    ❏ Committer@Apache Hudi
    ❏ Sr Eng@Uber (Data/Incremental Processing, Networking)
  2. Agenda
    1) Intro to Hudi
    2) Table Layout Optimizations
    3) Hudi Clustering
    4) Community
    5) Questions
  3. Origins@Uber 2016
    Context
    ❏ Uber in hypergrowth
    ❏ Moving from warehouse to lake
    ❏ HDFS/Cloud storage is immutable
    Problems
    ❏ Extremely poor ingest performance
    ❏ Wasteful reading/writing
    ❏ Zero concurrency control or ACID
  4. Hudi Table APIs Transactional Data Lake a.k.a. Lakehouse
    ❏ Serverless, transactional layer over lakes
    ❏ Multi-engine, decoupled storage from engine/compute
    ❏ Introduced Copy-On-Write and Merge-On-Read
    ❏ Change capture on lakes
    ❏ Ideas now heavily borrowed outside
    [Diagram labels: Cloud Storage, Files, Metadata, Txn log, Writers, Queries]
    https://eng.uber.com/hoodie/ (Mar 2017)
  5. Hudi Data Lake
    ❏ Database abstraction for cloud storage/HDFS
    ❏ Near real-time ingestion using ACID updates
    ❏ Incremental, efficient ETL downstream
    ❏ Built-in table services
  6. Copy-On-Write Table
    ❏ commit time=0, Insert: A, B, C, D, E
      Files: file1_t0.parquet (A, B), file2_t0.parquet (C, D), file3_t0.parquet (E)
      Snapshot Query: A,B,C,D,E
      Incremental Query: A,B,C,D,E
    ❏ commit time=1, Update: A => A’, D => D’
      Files: file1_t1.parquet (A’, B), file2_t1.parquet (C, D’)
      Snapshot Query: A’,B,C,D’,E
      Incremental Query: A’,D’
    ❏ commit time=2, Update: A’ => A”, E => E’, Insert: F
      Files: file1_t2.parquet (A”, B), file3_t2.parquet (E’, F)
      Snapshot Query: A”,B,C,D’,E’,F
      Incremental Query: A”,E’,F
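The copy-on-write walkthrough above can be imitated with a small in-memory model. This is only a sketch under stated assumptions, not Hudi's implementation: the `cow_commit`, `snapshot_query`, and `incremental_query` names and the dictionary-based "files" are hypothetical, and each record simply remembers the commit time that last wrote it.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, int]                     # (value, commit time that last wrote it)
FileVersion = Tuple[int, Dict[str, Record]]  # (commit time, records in that file version)
Table = Dict[str, List[FileVersion]]         # file group name -> versions, oldest first


def cow_commit(table: Table, commit_time: int, writes: Dict[str, Dict[str, str]]) -> None:
    """Apply inserts/updates by rewriting each touched file group as a new version."""
    for file_group, changes in writes.items():
        versions = table.setdefault(file_group, [])
        merged = dict(versions[-1][1]) if versions else {}
        for key, value in changes.items():
            merged[key] = (value, commit_time)
        versions.append((commit_time, merged))


def snapshot_query(table: Table, as_of: int) -> Dict[str, str]:
    """Read the latest version of every file group committed at or before `as_of`."""
    rows: Dict[str, str] = {}
    for versions in table.values():
        visible = [records for t, records in versions if t <= as_of]
        if visible:
            rows.update({k: v for k, (v, _) in visible[-1].items()})
    return rows


def incremental_query(table: Table, after: int, as_of: int) -> Dict[str, str]:
    """Return only the records written by commits in the range (after, as_of]."""
    rows: Dict[str, str] = {}
    for versions in table.values():
        visible = [records for t, records in versions if t <= as_of]
        if visible:
            rows.update({k: v for k, (v, t) in visible[-1].items() if t > after})
    return rows


# Replay the three commits from the slide (ASCII quotes stand in for the primes).
table: Table = {}
cow_commit(table, 0, {"file1": {"A": "A", "B": "B"},
                      "file2": {"C": "C", "D": "D"},
                      "file3": {"E": "E"}})
cow_commit(table, 1, {"file1": {"A": "A'"}, "file2": {"D": "D'"}})
cow_commit(table, 2, {"file1": {"A": "A''"}, "file3": {"E": "E'", "F": "F"}})

print(snapshot_query(table, as_of=2))
print(incremental_query(table, after=1, as_of=2))
```

Note that untouched file groups get no new version at a commit, which mirrors the slide: file3 has no t1 file and file2 has no t2 file.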
  7. Merge-On-Read Table
    ❏ commit time=0, Insert: A, B, C, D, E
      Files: file1_t0.parquet (A, B), file2_t0.parquet (C, D), file3_t0.parquet (E)
      Snapshot Query: A,B,C,D,E
      Incremental Query: A,B,C,D,E
      Read Optimized Query: A,B,C,D,E
    ❏ commit time=1, Update: A => A’, D => D’
      Files: .file1_t1.log (A’), .file2_t1.log (D’)
      Snapshot Query: A’,B,C,D’,E
      Incremental Query: A’,D’
      Read Optimized Query: A,B,C,D,E
    ❏ commit time=2, Update: A’ => A”, E => E’, Insert: F
      Files: .file1_t2.log (A”), .file3_t2.log (E’)
      Snapshot Query: A”,B,C,D’,E’,F
      Incremental Query: A”,E’,F
      Read Optimized Query: A,B,C,D,E
    ❏ commit time=3, Compaction
      Files: file1_t3.parquet (A”, B), file2_t3.parquet (C, D’), file3_t3.parquet (E’, F)
      Snapshot Query: A”,B,C,D’,E’,F
      Incremental Query: A”,E’,F
      Read Optimized Query: A”,B,C,D’,E’,F
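The difference between snapshot and read-optimized queries on a merge-on-read table can be sketched the same way. This is a toy model under stated assumptions, not Hudi's file format: `FileGroup`, `mor_commit`, `snapshot_query`, `read_optimized_query`, and `compact` are hypothetical names, base files and log files are plain dictionaries, and all writes (including the insert of F) are simply appended to a log.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FileGroup:
    base: Dict[str, str]                                   # records in the columnar base file
    logs: List[Tuple[int, Dict[str, str]]] = field(default_factory=list)  # delta log blocks


def mor_commit(groups: Dict[str, FileGroup], commit_time: int,
               writes: Dict[str, Dict[str, str]]) -> None:
    """Append changes to a log file instead of rewriting the base file."""
    for name, changes in writes.items():
        groups[name].logs.append((commit_time, dict(changes)))


def snapshot_query(groups: Dict[str, FileGroup]) -> Dict[str, str]:
    """Merge base and log records at read time, so the latest data is visible."""
    rows: Dict[str, str] = {}
    for group in groups.values():
        rows.update(group.base)
        for _, changes in group.logs:
            rows.update(changes)
    return rows


def read_optimized_query(groups: Dict[str, FileGroup]) -> Dict[str, str]:
    """Read base files only: cheaper, but stale until the next compaction."""
    rows: Dict[str, str] = {}
    for group in groups.values():
        rows.update(group.base)
    return rows


def compact(groups: Dict[str, FileGroup]) -> None:
    """Fold the log records into a new base file and drop the logs."""
    for group in groups.values():
        for _, changes in group.logs:
            group.base.update(changes)
        group.logs.clear()


# Commit 0 writes the base files; commits 1 and 2 only append logs.
groups = {
    "file1": FileGroup({"A": "A", "B": "B"}),
    "file2": FileGroup({"C": "C", "D": "D"}),
    "file3": FileGroup({"E": "E"}),
}
mor_commit(groups, 1, {"file1": {"A": "A'"}, "file2": {"D": "D'"}})
mor_commit(groups, 2, {"file1": {"A": "A''"}, "file3": {"E": "E'", "F": "F"}})

snapshot_before = snapshot_query(groups)
read_optimized_before = read_optimized_query(groups)
compact(groups)  # corresponds to commit time=3 on the slide
read_optimized_after = read_optimized_query(groups)
print(read_optimized_before, read_optimized_after, sep="\n")
```

The trade-off the sketch shows is the one on the slide: writes stay cheap because only logs grow, and the read-optimized view catches up with the snapshot view once compaction runs.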
  8. The Hudi Platform
    ❏ Lake Storage (Cloud Object Stores, HDFS, …)
    ❏ Open File/Data Formats (Parquet, HFile, Avro, ORC, …)
    ❏ Concurrency Control (OCC, MVCC, Non-blocking, Lock providers, Orchestration, Scheduling, …)
    ❏ Table Services (cleaning, compaction, clustering, indexing, file sizing, …)
    ❏ Indexes (Bloom filter, HBase, Bucket index, Hash based, Lucene, …)
    ❏ Table Format (Schema, File listings, Stats, Evolution, …)
    ❏ Lake Cache (Columnar, transactional, mutable, WIP, …)
    ❏ Metaserver (Stats, table service coordination, …)
    ❏ SQL Query Engines (Spark, Flink, Hive, Presto, Trino, Impala, Redshift, BigQuery, Snowflake, …)
    ❏ Platform Services (Streaming/Batch ingest, various sources, Catalog sync, Admin CLI, Data Quality, …)
    [Diagram labels: Transactional Database Layer, Execution/Runtimes]
  9. Factors affecting Query Performance
    ❏ Efficient metadata fetching -> Table Formats (file listings, column stats) + Metastores
    ❏ Quality of plans -> SQL optimizers
    ❏ Speed of SQL -> Engine specific (vectorized reading, serialization, shuffle algorithms, …)
    Focus for this talk: How to read fewer bytes of input? Can result in orders-of-magnitude speed-up when implemented right.
  10. Reading fewer bytes from Input Tables
    Indexes
    ❏ Helpful for selective queries, i.e. needles in haystacks
    ❏ B-trees, bloom filters, bitmaps, …
    Caching
    ❏ Eliminate access to storage in the common case
    ❏ Read-through, write-through, columnar vs row based
    Storage Layout
    ❏ Control how data is physically organized in storage
    ❏ Bucketing, Clustering
  11. Handling Multi-dimensional data
    ❏ Simplest clustering algorithm: sort the data by a set of fields f1, f2, …, fn
    ❏ Most effective for queries with
      ❏ f1 as predicate
      ❏ f1, f2 as predicates
      ❏ f1, f2, f3 as predicates
      ❏ …
    ❏ Effectiveness decreases for fields further right in the sort order
      ❏ e.g. with only f3 as predicate
    [Diagram labels: f1, f2, f3]
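The effect of linear sorting on file pruning can be illustrated with a toy experiment. This sketch assumes a reader that skips a file whenever the predicate value falls outside that file's min/max range for the column; `write_files` and `files_to_read` are hypothetical helpers, and the 8 x 8 x 8 grid of integer rows is made-up data, not the benchmark dataset.

```python
from itertools import product
from typing import List, Tuple

Row = Tuple[int, int, int]  # (f1, f2, f3)


def write_files(rows: List[Row], rows_per_file: int) -> List[List[Row]]:
    """Split rows, in their given order, into fixed-size 'files'."""
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]


def files_to_read(files: List[List[Row]], column: int, value: int) -> int:
    """Count files whose min/max range for `column` could contain `value`."""
    count = 0
    for file_rows in files:
        values = [row[column] for row in file_rows]
        if min(values) <= value <= max(values):
            count += 1
    return count


# 512 rows sorted by (f1, f2, f3), written as 16 files of 32 rows each.
rows = sorted(product(range(8), repeat=3))
files = write_files(rows, rows_per_file=32)

# Number of files an equality predicate on each column has to open.
linear_counts = tuple(files_to_read(files, column, 3) for column in range(3))
print(linear_counts)  # (2, 8, 16)
```

A predicate on f1 touches 2 of 16 files, a predicate on f2 touches 8, and a predicate on f3 alone cannot skip any file, since every file spans the full range of f3.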
  12. Space Curves
    ❏ Basic idea: multi-dimensional ordering/sorting
    ❏ Map multiple dimensions to a single dimension
    ❏ About a dozen exist in the literature, developed over a few decades
    Z-Order Curves
    ❏ Interleaving binary representation of the points
    ❏ Resulting z-value’s order depends on all fields
    Hilbert Curves
    ❏ Better ordering properties for high dimensions
    ❏ More expensive to build, for higher orders
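The bit interleaving behind a Z-order curve can be sketched in a few lines. This is an illustrative version for small non-negative integers only; it is an assumption for demonstration and not how Hudi encodes typed columns. `z_value`, `write_files`, and `files_to_read` are hypothetical helpers, and pruning is modeled as a min/max range check per file on a made-up 8 x 8 x 8 grid.

```python
from itertools import product
from typing import List, Sequence, Tuple

Row = Tuple[int, int, int]  # (f1, f2, f3)


def z_value(coords: Sequence[int], bits: int) -> int:
    """Interleave the low `bits` bits of each coordinate into one integer."""
    z = 0
    for bit in range(bits):
        for dim, coord in enumerate(coords):
            z |= ((coord >> bit) & 1) << (bit * len(coords) + dim)
    return z


def write_files(rows: List[Row], rows_per_file: int) -> List[List[Row]]:
    """Split rows, in their given order, into fixed-size 'files'."""
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]


def files_to_read(files: List[List[Row]], column: int, value: int) -> int:
    """Count files whose min/max range for `column` could contain `value`."""
    count = 0
    for file_rows in files:
        values = [row[column] for row in file_rows]
        if min(values) <= value <= max(values):
            count += 1
    return count


# 512 rows ordered by z-value instead of by (f1, f2, f3), 32 rows per file.
rows = sorted(product(range(8), repeat=3), key=lambda row: z_value(row, bits=3))
files = write_files(rows, rows_per_file=32)

# Number of files an equality predicate on each column has to open.
z_counts = tuple(files_to_read(files, column, 3) for column in range(3))
print(z_counts)  # (8, 8, 4)
```

With this ordering every column gets some pruning out of 16 files: no single column is served as well as the leading column of a plain sort would be, but no column is left without any file skipping either.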
  13. Hudi Clustering Goals
    Optimize data layout alongside ingestion
    ❏ Problem 1: faster ingestion -> smaller file sizes
    ❏ Problem 2: data locality for query (e.g., by city) ≠ ingestion order (e.g., trips by time)
    ❏ Auto sizing, reorg data, no compromise on ingestion
  14. Hudi Clustering Service
    Self-managed table service
    ❏ Scheduling: identify target data, generate plan in timeline
    ❏ Running: execute plan with pluggable strategy
    ❏ Reorg data with linear sorting, Z-order, Hilbert, etc.
    ❏ “REPLACE” commit in timeline
  15. Hudi ⇔ Presto Integration
    ❏ Hudi supported via Hive and Hudi connector (new)
      PrestoDB and Apache Hudi: https://prestodb.io/blog/2020/08/04/prestodb-and-hudi
    ❏ Hive connector
      ❏ COW and MOR, snapshot and read optimized queries
      ❏ Recognizing Hudi file layout through native integration
      ❏ File listing based on Hudi table metadata
    ❏ New Hudi connector (merged on master)
      ❏ COW (Snapshot) and MOR (Read optimized) querying
      ❏ Open up more optimization opportunities
      ❏ E.g. data skipping, multi-modal indexes
    ❏ Optimized storage layout in Hudi tables readily available
  16. Experiment Setup
    ❏ Dataset: GitHub Archive dataset, 12 months (2021/04 to 2022/03), 478GB (928M records)
    ❏ Cardinality: “type”/15, “repo_id”/64.9M, “hour_of_day”/24
    ❏ Hudi tables
      ❏ COW, partitioned by year and month, bulk inserts (followed by clustering)
      ❏ Sorting columns: “repo_id”, “type”, “hour_of_day” (in order)
      ❏ Different strategies: no clustering, linear sorting, Z-ordering
    ❏ Presto on EKS, 0.274-SNAPSHOT, with r5.4xlarge instances
      ❏ coordinator: 15 vCPU, 124 GB memory
      ❏ 4 workers: each has 15 vCPU, 124 GB memory
      ❏ query.max-memory=360GB, query.max-memory-per-node=90GB
  17. Configuring Hudi Clustering
    Spark data source as an example
    ❏ Enable inline clustering
    ❏ Control clustering frequency
    ❏ Set sorting columns
    ❏ Configure sizes
    ❏ Select layout optimization strategy
    Hudi clustering docs: https://hudi.apache.org/docs/clustering/
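The slide's code did not survive in this transcript, so the following is a hedged reconstruction of what such a configuration might look like with the Spark data source from Python. The option keys are the ones I understand the Hudi clustering docs to use around the 0.11 release and should be checked against the Hudi version in use. The values (commit count, byte sizes) are illustrative assumptions, and `write_with_clustering` and `base_options` are hypothetical names.

```python
from typing import Dict

# Clustering-related write options; values are examples, not recommendations.
clustering_options: Dict[str, str] = {
    # Enable inline clustering (runs as part of the write).
    "hoodie.clustering.inline": "true",
    # Control clustering frequency: schedule clustering every N commits.
    "hoodie.clustering.inline.max.commits": "4",
    # Configure sizes: target size of clustered files, and which files count as small.
    "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1024 * 1024 * 1024),
    "hoodie.clustering.plan.strategy.small.file.limit": str(600 * 1024 * 1024),
    # Select layout optimization strategy (assumed values: linear, z-order, hilbert).
    "hoodie.layout.optimize.strategy": "z-order",
    # Set sorting columns, in the order used in the experiment.
    "hoodie.clustering.plan.strategy.sort.columns": "repo_id,type,hour_of_day",
}


def write_with_clustering(df, base_path: str, base_options: Dict[str, str]) -> None:
    """Write a Spark DataFrame to a Hudi table with the clustering options applied.

    `df` is assumed to be a pyspark.sql.DataFrame, and `base_options` is assumed to
    carry the usual table options (table name, record key, and so on).
    """
    options = {**base_options, **clustering_options}
    df.write.format("hudi").options(**options).mode("append").save(base_path)
```

Keeping the clustering settings in their own dictionary makes it easy to switch the strategy between linear sorting and Z-ordering for a comparison like the one in the benchmark.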
  18. Benchmark
    ❏ Q1: How many activities for a particular repo?
      ❏ select count(*) from github_archive where repo_id = '76474200';
    ❏ Q2: How many issue comments?
      ❏ select count(*) from github_archive where type = 'IssueCommentEvent';
    Sorting columns: “repo_id”, “type”, “hour_of_day”
    Takeaways
    ❏ Linear sorting is better suited to hierarchical data
    ❏ Z-ordering balances the locality for multi-dimensional data
  19. What is the industry doing today?
    Uber rides - 250+ Petabytes from 24h+ to minutes latency
      https://eng.uber.com/uber-big-data-platform/
    Package deliveries - real-time event analytics at PB scale
      https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/
    TikTok/Bytedance recommendation system - at Exabyte scale
      http://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance
    Trading transactions - Near real-time CDC from 4000+ postgres
      https://robinhood.engineering/author-balaji-varadarajan-e3f496815ebf
    Real-time advertising for 20M+ concurrent viewers
      https://www.youtube.com/watch?v=mFpqrVxxwKc
    Store transactions - CDC & Warehousing
      https://searchdatamanagement.techtarget.com/feature/Hudi-powering-data-lake-efforts-at-Walmart-and-Disney-Hotstar
  20. What is the industry doing today?
    Lake House Architecture @ Halodoc: Data Platform 2.0
      https://blogs.halodoc.io/lake-house-architecture-halodoc-data-platform-2-0/
    Incremental, Multi region data lake platform
      https://aws.amazon.com/blogs/architecture/insights-for-ctos-part-3-growing-your-business-with-modern-data-capabilities/
    Unified, batch + streaming data lake on Hudi
      https://developpaper.com/apache-hudi-x-pulsar-meetup-hangzhou-station-is-hot-and-the-practice-dry-goods-are-waiting-for-you/
    150 source systems, ETL processing for 10,000+ tables
      https://aws.amazon.com/blogs/big-data/how-ge-aviation-built-cloud-native-data-pipelines-at-enterprise-scale-using-the-aws-platform/
    Streaming data lake for device data
      https://www.youtube.com/watch?v=8Q0kM-emMyo
    Near real-time grocery delivery tracking
      https://lambda.blinkit.com/origins-of-data-lake-at-grofers-6c011f94b86c
  21. The Community
    2100+ Slack Members
    225+ Contributors
    1000+ GH Engagers
    20+ Committers
    Pre-installed on 5 cloud providers
    Diverse PMC/Committers
    1M DLs/month (400% YoY)
    800B+ Records/Day (from even just 1 customer!)
    Rich community of participants
  22. How We Operate?
    Friendly and diverse community
    - 20+ PMCs/Committers from 10+ organizations
    Developers
    - Propose new RFCs, GitHub label label:rfcs
    - Dev list discussions, JIRA for issue tracking
    - Monthly minor releases, quarterly major releases
    Users
    - Weekly community on-call, office hours
    - Issue triage, bug filing process on GitHub
    ~2100 Slack, 250+ Contributors, 3000+ GH Engagers, ~20 PRs/week, 20+ Committers, 10+ PMCs
  23. Roadmap
    Ongoing
      Txns/Database layer
      - 0.11 stabilization, post follow-ups
      - Reliability, usability low hanging fruits
      - Non-keyed tables
      - Harden async indexing
      Engine Integrations
      - Presto/Trino connectors landing
      - Spark SQL performance
      Platform Services
      - Hardening continuous mode, Kafka Connect
    0.12 (July 2022)
      Txns/Database layer
      - Engine native writing
      - Indexed columns
      - Record level index
      - Eager conflict detection
      - New CDC format
      Engine Integrations
      - Extend ORC to all engines
      Platform Services
      - Snowflake external tables
      - Table management service
      - Kinesis, Pulsar sources
    0.12+... (Q3 2022)
      Txns/Database layer
      - Lock free concurrency control
      - Federated storage layout
      - Lucene, bitmaps, … diverse types of indexes
      Engine Integrations
      - CDC integration with all engines
      - Multi-modal index full integration
      - Schema evol GA
      - New file group reader
      Platform Services
      - Reliable incremental ingest for GCS
      - Error tables
    1.0 & Beyond (Q4 2022)
      Txns/Database layer
      - Infinite retention on the timeline
      - Time-travel updates, deletes
      - General purpose multi table txns
      Engine Integrations
      - DML support from Hive, Presto, Trino etc.
      - Early integrations with Dask, Ray, Rust, Python, C++
      Platform Services
      - Materialized views with Flink
      - Caching (experimental)
      - Metaserver GA
      - Airbyte integration
  24. Metaserver (Coming in 2022)
    Interesting fact: Hudi has a metaserver already
    - Runs on Spark driver; serves FileSystem RPCs + queries on timeline
    - Backed by RocksDB/pluggable
    - Updated incrementally on every timeline action
    - Very useful in streaming jobs
    Data lakes need a new metaserver
    - Flat file metastores are cool? (really?)
    - Speed up planning by orders of magnitude
  25. Lake Cache (Coming in 2022)
    LRU Cache à la DB Buffer Pool
    Frequent Commits => Small objects/blocks
    - Today: aggressively run table services
    - Tomorrow: File Group/Hudi file model aware caching
    - Mutable data => FileSystem/Block level caches are not that effective
    Benefits
    - Great performance for CDC tables
    - Avoid open/close costs for small objects
  26. Engage With Our Community
    User Docs: https://hudi.apache.org
    Technical Wiki: https://cwiki.apache.org/confluence/display/HUDI
    Github: https://github.com/apache/hudi/
    Twitter: https://twitter.com/apachehudi
    Mailing list(s): [email protected] (send an empty email to subscribe), [email protected] (actual mailing list)
    Slack: https://join.slack.com/t/apache-hudi/signup
    Community Syncs: https://hudi.apache.org/community/syncs