[Presto Meetup] Optimizing Table Layout for Presto using Apache Hudi

Optimizing Table Layout for Presto using Apache Hudi
Ethan Guo & Vinoth Chandar
Apache Hudi/Onehouse

Speakers
Vinoth Chandar
❏ CEO/Founder, onehouse.ai
❏ PMC Chair/Creator, Apache Hudi
❏ Principal Eng@Uber (Data, Infra, Database, Networking)
❏ Principal Eng@Confluent (ksqlDB, Kafka, Streams)
❏ Staff Eng@Linkedin (Voldemort kv store)
❏ Sr Eng@Oracle (CDC/Goldengate/XStream)
Ethan Guo
❏ Database [email protected]
❏ Committer@Apache Hudi
❏ Sr Eng@Uber (Data/Incremental Processing, Networking)

Agenda
1) Intro to Hudi
2) Table Layout Optimizations
3) Hudi Clustering
4) Community
5) Questions

Hudi Intro
Motivation, Concepts, Community

Origins@Uber 2016
Context
❏ Uber in hypergrowth
❏ Moving from warehouse to lake
❏ HDFS/Cloud storage is immutable
Problems
❏ Extremely poor ingest performance
❏ Wasteful reading/writing
❏ Zero concurrency control or ACID

Missing pieces: Upserts, Deletes & Incrementals

Transactional Data Lake a.k.a. Lakehouse
❏ Serverless, transactional layer over lakes
❏ Multi-engine, decoupled storage from engine/compute
❏ Introduced Copy-On-Write and Merge-On-Read
❏ Change capture on lakes
❏ Ideas now heavily borrowed outside
[Diagram labels: Hudi Table APIs, Writers, Queries, Files, Metadata, Txn log, Cloud Storage]
https://eng.uber.com/hoodie/ (Mar 2017)

Hudi Data Lake
❏ Database abstraction for cloud storage/HDFS
❏ Near real-time ingestion using ACID updates
❏ Incremental, efficient ETL downstream
❏ Built-in table services

Copy-On-Write Table
❏ commit time=0, Insert: A, B, C, D, E
  ❏ Files: file1_t0.parquet (A, B), file2_t0.parquet (C, D), file3_t0.parquet (E)
  ❏ Snapshot Query: A,B,C,D,E
  ❏ Incremental Query: A,B,C,D,E
❏ commit time=1, Update: A => A’, D => D’
  ❏ Files: file1_t1.parquet (A’, B), file2_t1.parquet (C, D’)
  ❏ Snapshot Query: A’,B,C,D’,E
  ❏ Incremental Query: A’,D’
❏ commit time=2, Update: A’ => A”, E => E’, Insert: F
  ❏ Files: file1_t2.parquet (A”, B), file3_t2.parquet (E’, F)
  ❏ Snapshot Query: A”,B,C,D’,E’,F
  ❏ Incremental Query: A”,E’,F
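The Copy-On-Write timeline above can be mimicked with a small in-memory model. This is only an illustrative sketch, not Hudi's implementation: the `CowTable` class and its method names are hypothetical, and each "file" is just a Python dict. Every commit rewrites a full new version of each touched file group; a snapshot query reads the latest version of every file group, while an incremental query returns only the records changed within a commit range.

```python
class CowTable:
    """Toy copy-on-write table (hypothetical model, not the Hudi API)."""

    def __init__(self):
        # file group -> list of (commit_time, full_records, changed_keys),
        # appended in commit order
        self.file_groups = {}

    def commit(self, commit_time, writes):
        # Copy-on-write: each touched file group gets a full new file version.
        for file_group, changes in writes.items():
            versions = self.file_groups.setdefault(file_group, [])
            records = dict(versions[-1][1]) if versions else {}
            records.update(changes)
            versions.append((commit_time, records, set(changes)))

    def snapshot(self, as_of):
        # Latest file version per file group at or before `as_of`.
        result = {}
        for versions in self.file_groups.values():
            visible = [v for v in versions if v[0] <= as_of]
            if visible:
                result.update(visible[-1][1])
        return result

    def incremental(self, begin, end):
        # Only records written by commits in the range (begin, end].
        result = {}
        for versions in self.file_groups.values():
            for commit_time, records, changed in versions:
                if begin < commit_time <= end:
                    result.update({k: records[k] for k in changed})
        return result


table = CowTable()
table.commit(0, {"file1": {"A": "A", "B": "B"},
                 "file2": {"C": "C", "D": "D"},
                 "file3": {"E": "E"}})
table.commit(1, {"file1": {"A": "A'"}, "file2": {"D": "D'"}})
table.commit(2, {"file1": {"A": "A''"}, "file3": {"E": "E'", "F": "F"}})

snapshot_t1 = table.snapshot(as_of=1)
incremental_t1 = table.incremental(begin=0, end=1)
incremental_t2 = table.incremental(begin=1, end=2)
```

Here `snapshot_t1` holds the full table as of commit 1, while `incremental_t1` and `incremental_t2` hold only the records changed by commits 1 and 2 respectively.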

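Merge-On-Read Table
❏ commit time=0, Insert: A, B, C, D, E
  ❏ Files: file1_t0.parquet (A, B), file2_t0.parquet (C, D), file3_t0.parquet (E)
  ❏ Snapshot Query: A,B,C,D,E
  ❏ Incremental Query: A,B,C,D,E
  ❏ Read Optimized Query: A,B,C,D,E
❏ commit time=1, Update: A => A’, D => D’
  ❏ Files: .file1_t1.log (A’), .file2_t1.log (D’)
  ❏ Snapshot Query: A’,B,C,D’,E
  ❏ Incremental Query: A’,D’
  ❏ Read Optimized Query: A,B,C,D,E
❏ commit time=2, Update: A’ => A”, E => E’, Insert: F
  ❏ Files: .file1_t2.log (A”), .file3_t2.log (E’)
  ❏ Snapshot Query: A”,B,C,D’,E’,F
  ❏ Incremental Query: A”,E’,F
  ❏ Read Optimized Query: A,B,C,D,E
❏ commit time=3, Compaction
  ❏ Files: file1_t3.parquet (A”, B), file2_t3.parquet (C, D’), file3_t3.parquet (E’, F)
  ❏ Snapshot Query: A”,B,C,D’,E’,F
  ❏ Incremental Query: A”,E’,F
  ❏ Read Optimized Query: A”,B,C,D’,E’,F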
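The difference between snapshot and read optimized queries on a Merge-On-Read table can be sketched with a toy model. The `MorTable` class below is hypothetical and not the Hudi API; it assumes that updates and the insert of F are appended to per-file-group log blocks, and that compaction folds those blocks into new base files.

```python
class MorTable:
    """Toy merge-on-read table (hypothetical model, not the Hudi API)."""

    def __init__(self, base_files):
        # file group -> records in the columnar base file
        self.base = {fg: dict(recs) for fg, recs in base_files.items()}
        # file group -> list of delta log blocks, in commit order
        self.logs = {fg: [] for fg in base_files}

    def append_log(self, file_group, changes):
        # Writes are cheap: append a log block, do not rewrite the base file.
        self.logs[file_group].append(dict(changes))

    def read_optimized(self):
        # Read base files only; fast, but may return stale data.
        merged = {}
        for records in self.base.values():
            merged.update(records)
        return merged

    def snapshot(self):
        # Merge base files with log blocks at read time; always fresh.
        merged = self.read_optimized()
        for blocks in self.logs.values():
            for block in blocks:
                merged.update(block)
        return merged

    def compact(self):
        # Fold log blocks into the base files and drop the logs.
        for file_group, blocks in self.logs.items():
            for block in blocks:
                self.base[file_group].update(block)
            blocks.clear()


table = MorTable({"file1": {"A": "A", "B": "B"},
                  "file2": {"C": "C", "D": "D"},
                  "file3": {"E": "E"}})
table.append_log("file1", {"A": "A'"})             # commit time=1
table.append_log("file2", {"D": "D'"})
table.append_log("file1", {"A": "A''"})            # commit time=2
table.append_log("file3", {"E": "E'", "F": "F"})   # assumption: F goes to file3's log

ro_before = table.read_optimized()
snapshot_before = table.snapshot()
table.compact()                                    # commit time=3
ro_after = table.read_optimized()
```

Before compaction, `ro_before` still shows the original records while `snapshot_before` shows the merged view; after compaction the read optimized view catches up.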

Table Structure

File Group Structure

The Hudi Platform
❏ Lake Storage (Cloud Object Stores, HDFS, …)
❏ Open File/Data Formats (Parquet, HFile, Avro, Orc, …)
❏ Concurrency Control (OCC, MVCC, Non-blocking, Lock providers, Orchestration, Scheduling, ...)
❏ Table Services (cleaning, compaction, clustering, indexing, file sizing, ...)
❏ Indexes (Bloom filter, HBase, Bucket index, Hash based, Lucene, ...)
❏ Table Format (Schema, File listings, Stats, Evolution, …)
❏ Lake Cache (Columnar, transactional, mutable, WIP, ...)
❏ Metaserver (Stats, table service coordination, ...)
❏ SQL Query Engines (Spark, Flink, Hive, Presto, Trino, Impala, Redshift, BigQuery, Snowflake, ...)
❏ Platform Services (Streaming/Batch ingest, various sources, Catalog sync, Admin CLI, Data Quality, ...)
[Diagram labels: Transactional Database Layer, Execution/Runtimes]

Hudi Ecosystem

Table Layout Optimizations
Primer, Clustering, Space-filling curves

Factors affecting Query Performance
❏ Efficient metadata fetching -> Table Formats (file listings, column stats) + Metastores
❏ Quality of plans -> SQL optimizers
❏ Speed of SQL -> Engine specific (vectorized reading, serialization, shuffle algorithms, ...)
Focus for this talk: How to read fewer bytes of input? Can result in orders of magnitude speed-up when implemented right.

Reading fewer bytes from Input Tables
Indexes
❏ Helpful for selective queries, i.e., needles in haystacks
❏ B-trees, bloom filters, bitmaps, ...
Caching
❏ Eliminate access to storage in the common case
❏ Read-through, write-through, columnar vs row based
Storage Layout
❏ Control how data is physically organized in storage
❏ Bucketing, Clustering

Clustering to micro partition data

Clustering - Basic Idea

Handling Multi-dimensional data
❏ Simplest clustering algorithm: sort the data by a set of fields f1, f2, ..., fn.
❏ Most effective for queries with
  ❏ f1 as predicate
  ❏ f1, f2 as predicates
  ❏ f1, f2, f3 as predicates
  ❏ …
❏ Effectiveness decreases right to left
  ❏ e.g., with f3 as predicate
[Figure: sorted columns f1, f2, f3]
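To see why a linear sort favours the leading field, the sketch below sorts a small synthetic dataset by (f1, f2, f3), splits it into fixed-size "files", and counts how many files a min/max column-stats check cannot skip for an equality predicate on each field. The dataset, file size, and helper names are assumptions made for illustration; real engines prune using Parquet or table-format statistics.

```python
from itertools import product


def build_files(records, sort_key, rows_per_file):
    """Sort records and cut them into consecutive fixed-size files."""
    ordered = sorted(records, key=sort_key)
    return [ordered[i:i + rows_per_file]
            for i in range(0, len(ordered), rows_per_file)]


def files_to_read(files, column, value):
    """Count files whose [min, max] range for `column` may contain `value`."""
    count = 0
    for rows in files:
        lo = min(row[column] for row in rows)
        hi = max(row[column] for row in rows)
        if lo <= value <= hi:
            count += 1
    return count


# Synthetic data: every combination of f1, f2, f3 in 0..7 (512 records).
records = list(product(range(8), repeat=3))

# Linear (lexicographic) sort by f1, then f2, then f3; 32 rows per file.
linear_files = build_files(records, sort_key=lambda r: r, rows_per_file=32)

linear_pruning = {
    name: files_to_read(linear_files, column, value=3)
    for column, name in enumerate(["f1", "f2", "f3"])
}
print(len(linear_files), linear_pruning)
```

In this setup a predicate on f1 touches 2 of the 16 files, a predicate on f2 touches 8, and a predicate on f3 touches all 16, since every file spans the full range of f3.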

Space Curves
❏ Basic idea: multi-dimensional ordering/sorting
❏ Map multiple dimensions to a single dimension
❏ About a dozen exist in the literature, developed over a few decades
Z-Order Curves
❏ Interleave the binary representations of the points' coordinates
❏ Resulting z-value’s order depends on all fields
Hilbert Curves
❏ Better ordering properties for high dimensions
❏ More expensive to build, for higher orders
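Bit interleaving for a Z-order curve can be sketched in a few lines. The `z_value` helper below is a hypothetical illustration, not Hudi's implementation; it assumes small non-negative integer coordinates with a fixed bit width. The same synthetic dataset and min/max pruning check as in a linear sort comparison (all combinations of f1, f2, f3 in 0..7, 32 rows per file) is then used to count the files each predicate must read.

```python
from itertools import product


def z_value(point, bits):
    """Interleave the bits of each coordinate, most significant bit first."""
    z = 0
    for bit in range(bits - 1, -1, -1):
        for coord in point:  # round-robin across dimensions
            z = (z << 1) | ((coord >> bit) & 1)
    return z


def files_to_read(files, column, value):
    """Count files whose [min, max] range for `column` may contain `value`."""
    count = 0
    for rows in files:
        lo = min(row[column] for row in rows)
        hi = max(row[column] for row in rows)
        if lo <= value <= hi:
            count += 1
    return count


records = list(product(range(8), repeat=3))       # f1, f2, f3 in 0..7
ordered = sorted(records, key=lambda r: z_value(r, bits=3))
z_files = [ordered[i:i + 32] for i in range(0, len(ordered), 32)]

z_pruning = {
    name: files_to_read(z_files, column, value=3)
    for column, name in enumerate(["f1", "f2", "f3"])
}
print(z_pruning)
```

With Z-ordering, the three predicates read 4, 8, and 8 of the 16 files respectively. Compared with a lexicographic sort on the same data, the leading field loses some locality while the trailing field gains some, which is the balancing behaviour the curve is meant to provide.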

Hudi Clustering
Clustering configs, Presto benchmarks

Hudi Clustering Goals
Optimize data layout alongside ingestion
❏ Problem 1: faster ingestion -> smaller file sizes
❏ Problem 2: data locality for query (e.g., by city) ≠ ingestion order (e.g., trips by time)
❏ Auto sizing, reorg data, no compromise on ingestion

Hudi Clustering Service
Self-managed table service
❏ Scheduling: identify target data, generate plan in timeline
❏ Running: execute plan with pluggable strategy
❏ Reorg data with linear sorting, Z-order, Hilbert, etc.
❏ “REPLACE” commit in timeline

Hudi Deployment Models
❏ Schedule and run clustering inline or async, in different models

Hudi ⇔ Presto Integration
❏ Hudi supported via Hive and Hudi connector (new)
PrestoDB and Apache Hudi: https://prestodb.io/blog/2020/08/04/prestodb-and-hudi
❏ Hive connector
  ❏ COW and MOR, snapshot and read optimized queries
  ❏ Recognizing Hudi file layout through native integration
  ❏ File listing based on Hudi table metadata
❏ New Hudi connector (merged on master)
  ❏ COW (Snapshot) and MOR (Read optimized) querying
  ❏ Open up more optimization opportunities
  ❏ E.g., data skipping, multi-modal indexes
❏ Optimized storage layout in Hudi tables readily available

Experiment Setup
❏ Dataset: Github Archive Dataset, 12 months (21/04~22/03), 478GB (928M records)
  ❏ Cardinality: “type”/15, “repo_id”/64.9M, “hour_of_day”/24
❏ Hudi tables
  ❏ COW, partitioned by year and month, bulk inserts (followed by clustering)
  ❏ Sorting columns: “repo_id”, “type”, “hour_of_day” (in order)
  ❏ Different strategies: no clustering, linear sorting, Z-ordering
❏ Presto on EKS, 0.274-SNAPSHOT, with r5.4xlarge instances
  ❏ coordinator: 15 vCPU, 124 GB memory
  ❏ 4 workers: each has 15 vCPU, 124 GB memory
  ❏ query.max-memory=360GB, query.max-memory-per-node=90GB

Configuring Hudi Clustering
Spark data source as an example
❏ Enable inline clustering
❏ Control clustering frequency
❏ Configure sizes
❏ Set sorting columns
❏ Select layout optimization strategy
Hudi clustering docs: https://hudi.apache.org/docs/clustering/
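The slide's code screenshot is not preserved in this transcript, so the following is a hedged reconstruction of what such write options might look like, not the speakers' exact settings. The option keys are taken from the Hudi clustering documentation as best recalled and may differ between Hudi versions; the helper name `clustering_options` and the numeric values are illustrative assumptions. Verify the keys against the clustering docs for the version in use.

```python
def clustering_options(sort_columns, strategy="z-order", max_commits=4):
    """Build Hudi write options for inline clustering (illustrative values)."""
    allowed = {"linear", "z-order", "hilbert"}
    if strategy not in allowed:
        raise ValueError(f"strategy must be one of {sorted(allowed)}")
    return {
        # Enable inline clustering
        "hoodie.clustering.inline": "true",
        # Control clustering frequency: run after this many commits
        "hoodie.clustering.inline.max.commits": str(max_commits),
        # Configure sizes: target output file size, and the size below
        # which files are picked up as clustering candidates (assumed values)
        "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1024 ** 3),
        "hoodie.clustering.plan.strategy.small.file.limit": str(600 * 1024 ** 2),
        # Set sorting columns
        "hoodie.clustering.plan.strategy.sort.columns": ",".join(sort_columns),
        # Select layout optimization strategy
        "hoodie.layout.optimize.strategy": strategy,
    }


opts = clustering_options(["repo_id", "type", "hour_of_day"], strategy="z-order")

# Assumed usage with the Spark data source (requires a Spark session with the
# Hudi bundle on the classpath, plus the usual table name, record key, and
# path options; `df` and `base_path` are placeholders):
#   df.write.format("hudi").options(**opts).mode("append").save(base_path)
```

Keeping the options in a plain dict makes it easy to switch between linear sorting and Z-ordering when comparing layouts, as the experiment in this talk does.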

Benchmark
❏ Q1: How many activities for a particular repo?
  ❏ select count(*) from github_archive where repo_id = '76474200';
❏ Q2: How many issue comments?
  ❏ select count(*) from github_archive where type = 'IssueCommentEvent';
Sorting columns: “repo_id”, “type”, “hour_of_day”
Takeaways
❏ Linear sorting is better suited to hierarchical data
❏ Z-ordering balances the locality for multi-dimensional data

Community
Adoption, Roadmap, Engaging with us

What is the industry doing today?
❏ Uber rides - 250+ Petabytes, from 24h+ to minutes latency: https://eng.uber.com/uber-big-data-platform/
❏ Package deliveries - real-time event analytics at PB scale: https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/
❏ TikTok/Bytedance recommendation system - at Exabyte scale: http://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance
❏ Trading transactions - near real-time CDC from 4000+ postgres: https://robinhood.engineering/author-balaji-varadarajan-e3f496815ebf
❏ Real-time advertising for 20M+ concurrent viewers: https://www.youtube.com/watch?v=mFpqrVxxwKc
❏ Store transactions - CDC & Warehousing: https://searchdatamanagement.techtarget.com/feature/Hudi-powering-data-lake-efforts-at-Walmart-and-Disney-Hotstar

What is the industry doing today?
❏ Lake House Architecture @ Halodoc: Data Platform 2.0: https://blogs.halodoc.io/lake-house-architecture-halodoc-data-platform-2-0/
❏ Incremental, multi-region data lake platform: https://aws.amazon.com/blogs/architecture/insights-for-ctos-part-3-growing-your-business-with-modern-data-capabilities/
❏ Unified, batch + streaming data lake on Hudi: https://developpaper.com/apache-hudi-x-pulsar-meetup-hangzhou-station-is-hot-and-the-practice-dry-goods-are-waiting-for-you/
❏ 150 source systems, ETL processing for 10,000+ tables: https://aws.amazon.com/blogs/big-data/how-ge-aviation-built-cloud-native-data-pipelines-at-enterprise-scale-using-the-aws-platform/
❏ Streaming data lake for device data: https://www.youtube.com/watch?v=8Q0kM-emMyo
❏ Near real-time grocery delivery tracking: https://lambda.blinkit.com/origins-of-data-lake-at-grofers-6c011f94b86c

The Community
❏ 2100+ Slack Members
❏ 225+ Contributors
❏ 1000+ GH Engagers
❏ 20+ Committers
❏ Pre-installed on 5 cloud providers
❏ Diverse PMC/Committers
❏ 1M DLs/month (400% YoY)
❏ 800B+ Records/Day (from even just 1 customer!)
❏ Rich community of participants

How We Operate?
Friendly and diverse community - 20+ PMCs/Committers from 10+ organizations
Developers
- Propose new RFCs, Github label label:rfcs
- Dev list discussions, JIRA for issue tracking
- Monthly minor releases, quarterly major releases
Users
- Weekly community on-call, office hours
- Issue triage, bug filing process on Github
~2100 Slack, 250+ Contributors, 3000+ GH Engagers, ~20 PRs/week, 20+ Committers, 10+ PMCs

Roadmap
Ongoing
Txns/Database layer
- 0.11 stabilization, post follow-ups
- Reliability, usability low hanging fruits
- Non-keyed tables
- Harden async indexing
Engine Integrations
- Presto/Trino connectors landing
- Spark SQL performance
Platform Services
- Hardening continuous mode, Kafka Connect
0.12 (July 2022)
Txns/Database layer
- Engine native writing
- Indexed columns
- Record level index
- Eager conflict detection
- New CDC format
Engine Integrations
- Extend ORC to all engines
Platform Services
- Snowflake external tables
- Table management service
- Kinesis, Pulsar sources
0.12+... (Q3 2022)
Txns/Database layer
- Lock free concurrency control
- Federated storage layout
- Lucene, bitmaps, … diverse types of indexes
Engine Integrations
- CDC integration with all engines
- Multi-modal index full integration
- Schema evol GA
- New file group reader
Platform Services
- Reliable incremental ingest for GCS
- Error tables
1.0 & Beyond (Q4 2022)
Txns/Database layer
- Infinite retention on the timeline
- Time-travel updates, deletes
- General purpose multi table txns
Engine Integrations
- DML support from Hive, Presto, Trino etc.
- Early integrations with Dask, Ray, Rust, Python, C++
Platform Services
- Materialized views with Flink
- Caching (experimental)
- Metaserver GA
- Airbyte integration

Metaserver (Coming in 2022)
Interesting fact: Hudi has a metaserver already
- Runs on Spark driver; serves FileSystem RPCs + queries on timeline
- Backed by RocksDB/pluggable
- Updated incrementally on every timeline action
- Very useful in streaming jobs
Data lakes need a new metaserver
- Flat file metastores are cool? (really?)
- Speed up planning by orders of magnitude

Lake Cache (Coming in 2022)
LRU Cache a la DB Buffer Pool
Frequent Commits => Small objects/blocks
- Today: Aggressive table services
- Tomorrow: File Group/Hudi file model aware caching
- Mutable data => FileSystem/Block level caches are not that effective
Benefits
- Great performance for CDC tables
- Avoid open/close costs for small objects

Engage With Our Community
User Docs: https://hudi.apache.org
Technical Wiki: https://cwiki.apache.org/confluence/display/HUDI
Github: https://github.com/apache/hudi/
Twitter: https://twitter.com/apachehudi
Mailing list(s): [email protected] (send an empty email to subscribe), [email protected] (actual mailing list)
Slack: https://join.slack.com/t/apache-hudi/signup
Community Syncs: https://hudi.apache.org/community/syncs

Thanks! Questions?
