Slide 1

Optimizing Table Layout for Presto using Apache Hudi
Ethan Guo & Vinoth Chandar, Apache Hudi/Onehouse

Slide 2

Speakers

Vinoth Chandar
❏ CEO/Founder, onehouse.ai
❏ PMC Chair/Creator, Apache Hudi
❏ Principal Eng@Uber (Data, Infra, Database, Networking)
❏ Principal Eng@Confluent (ksqlDB, Kafka Streams)
❏ Staff Eng@Linkedin (Voldemort kv store)
❏ Sr Eng@Oracle (CDC/Goldengate/XStream)

Ethan Guo
❏ Database [email protected]
❏ Committer@Apache Hudi
❏ Sr Eng@Uber (Data/Incremental Processing, Networking)

Slide 3

Agenda
1) Intro to Hudi
2) Table Layout Optimizations
3) Hudi Clustering
4) Community
5) Questions

Slide 4

Hudi Intro
Motivation, Concepts, Community

Slide 5

Origins@Uber, 2016

Context
❏ Uber in hypergrowth
❏ Moving from warehouse to lake
❏ HDFS/Cloud storage is immutable

Problems
❏ Extremely poor ingest performance
❏ Wasteful reading/writing
❏ Zero concurrency control or ACID

Slide 6

Missing pieces: Upserts, Deletes & Incrementals

Slide 7

Hudi Table APIs
Transactional Data Lake a.k.a. Lakehouse
❏ Serverless, transactional layer over lakes
❏ Multi-engine; decoupled storage from engine/compute
❏ Introduced Copy-On-Write and Merge-On-Read
❏ Change capture on lakes
❏ Ideas now heavily borrowed outside

[Diagram: cloud storage holding files, metadata, and the txn log, with writers and queries on top]
https://eng.uber.com/hoodie/ (Mar 2017)

Slide 8

Hudi Data Lake
❏ Database abstraction for cloud storage/HDFS
❏ Near real-time ingestion using ACID updates
❏ Incremental, efficient ETL downstream
❏ Built-in table services

Slide 9

Copy-On-Write Table
[Diagram] Example timeline: commit t=0 inserts A, B, C, D, E into file1_t0, file2_t0, file3_t0 (parquet); commit t=1 updates A=>A’ and D=>D’, rewriting file1 and file2; commit t=2 updates A’=>A”, E=>E’ and inserts F, rewriting file1 and file3. Snapshot queries return the full latest state (A”, B, C, D’, E’, F after t=2); incremental queries return only the records changed at each commit (A’, D’ for t=1; A”, E’, F for t=2).
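The copy-on-write semantics above can be sketched in a few lines. This is an illustrative model only, not Hudi's actual storage format; `CowTable` and its methods are hypothetical names:

```python
# Illustrative sketch of Copy-On-Write semantics: each commit rewrites
# affected files in full, and the timeline records what changed when.

class CowTable:
    def __init__(self):
        self.state = {}       # key -> latest value (the merged snapshot)
        self.timeline = []    # list of (commit_time, {key: value}) deltas

    def commit(self, commit_time, changes):
        """Apply upserts; a real COW writer rewrites whole parquet files here."""
        self.state.update(changes)
        self.timeline.append((commit_time, dict(changes)))

    def snapshot_query(self):
        """Latest state of every record."""
        return dict(self.state)

    def incremental_query(self, since):
        """Only records changed after `since`; the latest value wins."""
        out = {}
        for t, delta in self.timeline:
            if t > since:
                out.update(delta)
        return out

table = CowTable()
table.commit(0, {"A": "A", "B": "B", "C": "C", "D": "D", "E": "E"})
table.commit(1, {"A": "A'", "D": "D'"})
table.commit(2, {"A": "A''", "E": "E'", "F": "F"})
```

Matching the diagram: the snapshot query after t=2 returns A'', B, C, D', E', F, while the incremental query since t=1 returns only A'', E', F.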

Slide 10

Merge-On-Read Table
[Diagram] Same example as Copy-On-Write, but the updates at t=1 and t=2 are appended to log files (.file1_t1.log, .file2_t1.log, .file1_t2.log, .file3_t2.log) instead of rewriting base parquet files. Snapshot queries merge base and log files (A”, B, C, D’, E’, F after t=2); read-optimized queries read only the base files (still A, B, C, D, E until compaction); incremental queries return only changed records. Compaction at commit t=3 merges the logs into new base files (file1_t3, file2_t3, file3_t3.parquet), after which read-optimized queries also return A”, B, C, D’, E’, F.
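The base-file/log-file split can be sketched the same way (illustrative only; `MorFileGroup` is a hypothetical name, not a Hudi class):

```python
# Illustrative sketch of Merge-On-Read: updates append to a log instead of
# rewriting the base file; snapshot queries merge base + log, read-optimized
# queries read only the base, and compaction folds the log back into the base.

class MorFileGroup:
    def __init__(self, base):
        self.base = dict(base)   # base parquet file contents
        self.log = {}            # appended log records (key -> value)

    def update(self, changes):
        self.log.update(changes)          # cheap append, no base rewrite

    def snapshot(self):
        merged = dict(self.base)
        merged.update(self.log)           # merge cost paid at read time
        return merged

    def read_optimized(self):
        return dict(self.base)            # fast, but may be stale

    def compact(self):
        self.base = self.snapshot()       # rewrite base with log applied
        self.log = {}

fg = MorFileGroup({"A": "A", "B": "B"})
fg.update({"A": "A'"})
```

Before compaction the read-optimized view still returns the old A while the snapshot view returns A'; after `fg.compact()` both agree.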

Slide 11

Table Structure

Slide 12

File Group Structure

Slide 13

The Hudi Platform

Transactional Database Layer
❏ Lake Storage (cloud object stores, HDFS, …)
❏ Open file/data formats (Parquet, HFile, Avro, ORC, …)
❏ Concurrency control (OCC, MVCC, non-blocking, lock providers, orchestration, scheduling, …)
❏ Table services (cleaning, compaction, clustering, indexing, file sizing, …)
❏ Indexes (bloom filter, HBase, bucket index, hash based, Lucene, …)
❏ Table format (schema, file listings, stats, evolution, …)
❏ Lake cache (columnar, transactional, mutable, WIP, …)
❏ Metaserver (stats, table service coordination, …)

Execution/Runtimes
❏ SQL query engines (Spark, Flink, Hive, Presto, Trino, Impala, Redshift, BigQuery, Snowflake, …)
❏ Platform services (streaming/batch ingest, various sources, catalog sync, admin CLI, data quality, …)

Slide 14

Hudi Ecosystem

Slide 15

Table Layout Optimizations
Primer, Clustering, Space-filling curves

Slide 16

Factors affecting Query Performance
❏ Efficient metadata fetching -> table formats (file listings, column stats) + metastores
❏ Quality of plans -> SQL optimizers
❏ Speed of SQL -> engine specific (vectorized reading, serialization, shuffle algorithms, …)

Focus for this talk: how to read fewer bytes of input? Can result in orders-of-magnitude speed-up when implemented right.

Slide 17

Reading fewer bytes from Input Tables

Indexes
❏ Helpful for selective queries, i.e., needles in haystacks
❏ B-trees, bloom filters, bitmaps, …

Caching
❏ Eliminate access to storage in the common case
❏ Read-through, write-through, columnar vs row based

Storage Layout
❏ Control how data is physically organized in storage
❏ Bucketing, Clustering
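As an example of the index route, here is a tiny Bloom filter sketch, the kind of per-file structure that lets a reader skip files for selective point lookups. This is an illustration of the idea only, not Hudi's implementation; the class and parameters are hypothetical:

```python
import hashlib

# Tiny Bloom filter sketch: answers "key definitely absent" or "key maybe
# present" per file, so files that definitely lack the key are never read.

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k   # m bits, k hash functions
        self.bits = 0           # bit array packed into one int

    def _positions(self, key):
        # Derive k positions from salted SHA-256 digests (illustrative choice).
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means certainly absent; True means present or false positive.
        return all(self.bits >> pos & 1 for pos in self._positions(key))

f = BloomFilter()
for key in ["repo_1", "repo_2", "repo_3"]:
    f.add(key)
```

Every inserted key reports "maybe present", while almost all absent keys report "definitely absent", which is exactly the skip signal a selective query needs.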

Slide 18

Clustering to micro partition data

Slide 19

Clustering - Basic Idea

Slide 20

Handling Multi-dimensional data
❏ Simplest clustering algorithm: sort the data by a set of fields f1, f2, …, fn
❏ Most effective for queries with
  ❏ f1 as predicate
  ❏ f1, f2 as predicates
  ❏ f1, f2, f3 as predicates
  ❏ …
❏ Effectiveness decreases right to left
  ❏ e.g., with f3 as predicate
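The falloff in effectiveness is easy to demonstrate with file-level min/max pruning over sorted data (an illustrative sketch with made-up data; the helper names are hypothetical):

```python
import itertools

# Data sorted lexicographically by (f1, f2, f3), then split into 8 "files"
# of 64 rows each. File-level [min, max] stats prune well on the leading
# sort column f1, but barely at all on the trailing column f3.

rows = sorted(itertools.product(range(8), range(8), range(8)))  # (f1, f2, f3)
files = [rows[i:i + 64] for i in range(0, len(rows), 64)]

def files_scanned(col, value):
    """Count files whose [min, max] range on `col` admits `col == value`."""
    return sum(
        1 for f in files
        if min(r[col] for r in f) <= value <= max(r[col] for r in f)
    )
```

A predicate on f1 touches a single file, while the same predicate on f3 touches all 8 files, because every file's f3 range spans the full domain.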

Slide 21

Space-filling Curves
❏ Basic idea: multi-dimensional ordering/sorting
❏ Map multiple dimensions to a single dimension
❏ About a dozen exist in the literature, over a few decades

Z-order Curves
❏ Interleave the binary representations of the points
❏ Resulting z-value’s order depends on all fields

Hilbert Curves
❏ Better ordering properties for high dimensions
❏ More expensive to build, for higher orders
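The bit interleaving behind z-values is compact enough to show directly (a two-dimensional sketch; production implementations generalize to n dimensions and non-integer types):

```python
# Z-value sketch: interleave the bits of two dimensions so that the
# resulting one-dimensional sort order depends on both fields.

def z_value(x, y, bits=8):
    z = 0
    for i in range(bits):
        z |= (x >> i & 1) << (2 * i)       # bits of x land on even positions
        z |= (y >> i & 1) << (2 * i + 1)   # bits of y land on odd positions
    return z
```

Sorting records by `z_value(x, y)` keeps points that are close in (x, y) close on disk, which is why pruning works for predicates on either field, unlike a plain lexicographic sort.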

Slide 22

Hudi Clustering
Clustering configs, Presto benchmarks

Slide 23

Hudi Clustering Goals
Optimize data layout alongside ingestion
❏ Problem 1: faster ingestion -> smaller file sizes
❏ Problem 2: data locality for query (e.g., by city) ≠ ingestion order (e.g., trips by time)
❏ Auto sizing, reorg data, no compromise on ingestion

Slide 24

Hudi Clustering Service
Self-managed table service
❏ Scheduling: identify target data, generate plan in timeline
❏ Running: execute plan with pluggable strategy
  ❏ Reorg data with linear sorting, Z-order, Hilbert, etc.
❏ “REPLACE” commit in timeline

Slide 25

Hudi Deployment Models ❏ Schedule and run clustering inline or async, in different models

Slide 26

Hudi ⇔ Presto Integration
❏ Hudi supported via Hive connector and a new Hudi connector
  PrestoDB and Apache Hudi: https://prestodb.io/blog/2020/08/04/prestodb-and-hudi
❏ Hive connector
  ❏ COW and MOR, snapshot and read-optimized queries
  ❏ Recognizes Hudi file layout through native integration
  ❏ File listing based on Hudi table metadata
❏ New Hudi connector (merged on master)
  ❏ COW (snapshot) and MOR (read optimized) querying
  ❏ Opens up more optimization opportunities
    ❏ e.g., data skipping, multi-modal indexes
❏ Optimized storage layout in Hudi tables readily available

Slide 27

Experiment Setup
❏ Dataset: GitHub Archive dataset, 12 months (2021/04 ~ 2022/03), 478 GB (928M records)
❏ Column cardinalities: “type”: 15, “repo_id”: 64.9M, “hour_of_day”: 24
❏ Hudi tables
  ❏ COW, partitioned by year and month, bulk inserts (followed by clustering)
  ❏ Sorting columns: “repo_id”, “type”, “hour_of_day” (in order)
  ❏ Different strategies: no clustering, linear sorting, Z-ordering
❏ Presto on EKS, 0.274-SNAPSHOT, with r5.4xlarge instances
  ❏ Coordinator: 15 vCPU, 124 GB memory
  ❏ 4 workers: 15 vCPU, 124 GB memory each
  ❏ query.max-memory=360GB, query.max-memory-per-node=90GB

Slide 28

Configuring Hudi Clustering
Spark data source as an example; the knobs shown on the slide:
❏ Enable inline clustering
❏ Control clustering frequency
❏ Set sorting columns
❏ Configure sizes
❏ Select layout optimization strategy
Hudi clustering docs: https://hudi.apache.org/docs/clustering/
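A sketch of what those knobs look like as Spark datasource options. The option names follow the Hudi clustering docs linked above, but the specific values are illustrative; verify names and defaults against your Hudi version:

```python
# Hypothetical clustering options for a Hudi Spark datasource write,
# mirroring the knobs on the slide (values are examples, not recommendations).

clustering_opts = {
    # enable inline clustering
    "hoodie.clustering.inline": "true",
    # control clustering frequency: schedule every 4 commits
    "hoodie.clustering.inline.max.commits": "4",
    # set sorting columns (the benchmark's order)
    "hoodie.clustering.plan.strategy.sort.columns": "repo_id,type,hour_of_day",
    # configure sizes: target output file size and small-file threshold
    "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1024 * 1024 * 1024),
    "hoodie.clustering.plan.strategy.small.file.limit": str(300 * 1024 * 1024),
    # select layout optimization strategy: "linear", "z-order", or "hilbert"
    "hoodie.layout.optimize.strategy": "z-order",
}
```

These would be attached to an ordinary Hudi write, e.g. `df.write.format("hudi").options(**clustering_opts)` alongside the usual table/key configs.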

Slide 29

Benchmark
❏ Q1: How many activities for a particular repo?
  select count(*) from github_archive where repo_id = '76474200';
❏ Q2: How many issue comments?
  select count(*) from github_archive where type = 'IssueCommentEvent';

Sorting columns: “repo_id”, “type”, “hour_of_day”

Takeaways
❏ Linear sorting suits hierarchical data better
❏ Z-ordering balances locality across multi-dimensional data

Slide 30

Community Adoption, Roadmap, Engaging with us

Slide 31

What is the industry doing today?
❏ Uber rides - 250+ petabytes, from 24h+ to minutes latency
  https://eng.uber.com/uber-big-data-platform/
❏ Package deliveries - real-time event analytics at PB scale
  https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/
❏ TikTok/Bytedance recommendation system - at exabyte scale
  http://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance
❏ Trading transactions - near real-time CDC from 4000+ Postgres
  https://robinhood.engineering/author-balaji-varadarajan-e3f496815ebf
❏ Real-time advertising for 20M+ concurrent viewers
  https://www.youtube.com/watch?v=mFpqrVxxwKc
❏ Store transactions - CDC & warehousing
  https://searchdatamanagement.techtarget.com/feature/Hudi-powering-data-lake-efforts-at-Walmart-and-Disney-Hotstar

Slide 32

What is the industry doing today?
❏ Lake House Architecture @ Halodoc: Data Platform 2.0
  https://blogs.halodoc.io/lake-house-architecture-halodoc-data-platform-2-0/
❏ Incremental, multi-region data lake platform
  https://aws.amazon.com/blogs/architecture/insights-for-ctos-part-3-growing-your-business-with-modern-data-capabilities/
❏ Unified, batch + streaming data lake on Hudi
  https://developpaper.com/apache-hudi-x-pulsar-meetup-hangzhou-station-is-hot-and-the-practice-dry-goods-are-waiting-for-you/
❏ 150 source systems, ETL processing for 10,000+ tables
  https://aws.amazon.com/blogs/big-data/how-ge-aviation-built-cloud-native-data-pipelines-at-enterprise-scale-using-the-aws-platform/
❏ Streaming data lake for device data
  https://www.youtube.com/watch?v=8Q0kM-emMyo
❏ Near real-time grocery delivery tracking
  https://lambda.blinkit.com/origins-of-data-lake-at-grofers-6c011f94b86c

Slide 33

The Community
❏ 2100+ Slack members
❏ 225+ contributors
❏ 1000+ GitHub engagers
❏ 20+ committers
❏ Pre-installed on 5 cloud providers
❏ Diverse PMC/committers
❏ 1M downloads/month (400% YoY)
❏ 800B+ records/day (from even just 1 customer!)
❏ Rich community of participants

Slide 34

How We Operate
Friendly and diverse community - 20+ PMCs/committers from 10+ organizations

Developers
- Propose new RFCs, GitHub label:rfcs
- Dev list discussions, JIRA for issue tracking
- Monthly minor releases, quarterly major releases

Users
- Weekly community on-call, office hours
- Issue triage, bug filing process on GitHub

~2100 Slack members | 250+ contributors | 3000+ GH engagers | ~20 PRs/week | 20+ committers | 10+ PMCs

Slide 35

Roadmap

Ongoing
Txns/Database layer
- 0.11 stabilization, post follow-ups
- Reliability, usability low-hanging fruits
- Non-keyed tables
- Harden async indexing
Engine Integrations
- Presto/Trino connectors landing
- Spark SQL performance
Platform Services
- Hardening continuous mode, Kafka Connect

0.12 (July 2022)
Txns/Database layer
- Engine native writing
- Indexed columns
- Record level index
- Eager conflict detection
- New CDC format
Engine Integrations
- Extend ORC to all engines
Platform Services
- Snowflake external tables
- Table management service
- Kinesis, Pulsar sources

0.12+ (Q3 2022)
Txns/Database layer
- Lock-free concurrency control
- Federated storage layout
- Lucene, bitmaps, … diverse types of indexes
Engine Integrations
- CDC integration with all engines
- Multi-modal index full integration
- Schema evolution GA
- New file group reader
Platform Services
- Reliable incremental ingest for GCS
- Error tables

1.0 & Beyond (Q4 2022)
Txns/Database layer
- Infinite retention on the timeline
- Time-travel updates, deletes
- General-purpose multi-table txns
Engine Integrations
- DML support from Hive, Presto, Trino, etc.
- Early integrations with Dask, Ray, Rust, Python, C++
Platform Services
- Materialized views with Flink
- Caching (experimental)
- Metaserver GA
- Airbyte integration

Slide 36

Metaserver (Coming in 2022)
Interesting fact: Hudi has a metaserver already
- Runs on the Spark driver; serves FileSystem RPCs + queries on timeline
- Backed by RocksDB (pluggable), updated incrementally on every timeline action
- Very useful in streaming jobs

Data lakes need a new metaserver
- Flat-file metastores are cool? (really?)
- Speed up planning by orders of magnitude

Slide 37

Lake Cache (Coming in 2022)
LRU cache ala DB buffer pool
Frequent commits => small objects/blocks
- Today: aggressive table services
- Tomorrow: file group/Hudi file model aware caching
- Mutable data => FileSystem/block-level caches are not that effective

Benefits
- Great performance for CDC tables
- Avoid open/close costs for small objects
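The buffer-pool idea reduces to a plain LRU policy; a minimal sketch (of the general technique only, not the planned Hudi component; `LruCache` is a hypothetical name):

```python
from collections import OrderedDict

# Minimal LRU cache sketch ala a DB buffer pool: recently read file groups
# stay hot; the least recently used entry is evicted at capacity.

class LruCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # insertion order doubles as recency order

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the LRU entry

cache = LruCache(capacity=2)
cache.put("fg1", "data1")
cache.put("fg2", "data2")
cache.get("fg1")          # touch fg1, making fg2 the eviction candidate
cache.put("fg3", "data3") # evicts fg2
```

The point of making the cache file-group aware (vs. block-level) is that a whole mutable file group can be kept or dropped as one unit as commits rewrite it.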

Slide 38

Engage With Our Community
User docs: https://hudi.apache.org
Technical wiki: https://cwiki.apache.org/confluence/display/HUDI
GitHub: https://github.com/apache/hudi/
Twitter: https://twitter.com/apachehudi
Mailing list(s):
  [email protected] (send an empty email to subscribe)
  [email protected] (actual mailing list)
Slack: https://join.slack.com/t/apache-hudi/signup
Community syncs: https://hudi.apache.org/community/syncs

Slide 39

Thanks! Questions?