
Building an Open Data Lakehouse on AWS with Presto and Apache Hudi

Ahana
March 31, 2023


You may be familiar with the Data Lakehouse, an emerging architecture that brings the flexibility, scale and cost management benefits of the data lake together with the data management capabilities of the data warehouse. In this workshop, we’ll get hands-on building an Open Data Lakehouse – an approach that brings open technologies and formats to your lakehouse.

For the purpose of this workshop, we’ll use Presto for the open source SQL query engine, Apache Hudi for ACID transactions, and AWS S3 for the data lake. You’ll get hands-on with Presto and Hudi. We’ll show you how to deploy each, connect them, set up your Hudi tables for ACID transactions, and finally run queries on your S3 data.


Transcript

  1. HANDS-ON VIRTUAL LAB
    Building an Open Data Lakehouse with Presto, Hudi, and AWS S3

  2. Let’s make some introductions
    Nadine Farah, Dev Rel at Onehouse
    Rohan Pednekar, Sr. Product Manager at Ahana

  3. Objective for Today
    Over the next 2 hours you will explore and understand how to build a Data
    Lakehouse using Presto, Hudi, and S3 in a hands-on lab environment.

  4. Agenda
    1) Introductions (10 mins)
    2) Understand the Technology (20 mins)
    a) What is a Data Lakehouse?
    b) Presto and Hudi overview
    3) Getting your hands dirty (80 mins)
    a) Lab1: Get started with Ahana console
    b) Lab2: Use Presto to query your AWS S3 Data Lake
    c) Optional Lab3: Federated Query with MySQL
    d) Lab4: Set up Apache Hudi on Presto
    e) Lab5: Hudi COW
    f) Lab6: Hudi MOR
    4) Summary and Close Out (10 mins)


  5. Guest Speaker
    Pratyaksh Sharma
    Software Engineer, Ahana
    Apache Hudi Committer

  6. Understanding the Technology
    Presto & Hudi

  7. Overview
    • Open Data Lakehouse: data lake + data warehouse
    • Hudi: Hadoop Upserts, Deletes and Incrementals
    • Presto: distributed SQL query engine for federated queries

  8. The Open Data Lakehouse (diagram)
    Data Warehouse: proprietary storage and proprietary SQL query processing,
    feeding reporting and dashboarding.
    Open Data Lakehouse: cloud data lake storage in open formats; SQL query
    processing plus ML and AI frameworks; serving data science, ML & AI, and
    reporting and dashboarding; with governance, discovery, quality & security
    across the stack.

  9. Checklist: Building your Open Data Lakehouse
    ❏ Moving data from OLTP databases to distributed storage - incremental or batch?
    ❏ Scattered data across various data sources
    ❏ Incremental pull for downstream pipelines
    ❏ Upserts and deletes
    ❏ Schema enforcement/evolution
    ❏ Late arriving updates
    ❏ Deduplicate incoming batch
    ❏ Latency vs data correctness? We need both
    ❏ Read and write amplification
    ❏ Efficient querying
    ❏ Interactive queries in real time
    ❏ ACID transactions
    ❏ Disaster recovery (duplicate data, wrong schema evolution etc)
    ❏ Storage management and file organization


  10. Apache Hudi


  11. What is Apache Hudi?
    • Pioneer of the lakehouse storage architecture
    • Provides database-like semantics over lake storage
    • Completely serverless
    • Built mainly to handle incremental workloads in a batch fashion
    • Originally developed at Uber
    • Efficient use of compute resources
    • Not just a table format, but a complete data platform
    • Built for data lake workloads
    • Supports both streaming and batch style pipelines
    • Table services to manage storage and metadata
    • Introduced the notions of Copy on Write vs Merge on Read
    • Different ways to think about the tradeoff between data freshness and query performance
    • These terms have since been widely adopted by other projects
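Hudi tables are typically created and written through Spark. As a minimal sketch in Spark SQL: the table, columns, and S3 path below are hypothetical, not from the workshop labs.

```sql
-- Hypothetical Copy on Write table; switch type to 'mor' for Merge on Read.
CREATE TABLE hudi_trips (
  trip_id STRING,
  rider   STRING,
  fare    DOUBLE,
  ts      BIGINT
) USING hudi
TBLPROPERTIES (
  type = 'cow',             -- storage layout: Copy on Write
  primaryKey = 'trip_id',   -- record key Hudi uses for upserts and deletes
  preCombineField = 'ts'    -- if a batch has duplicate keys, the latest ts wins
)
LOCATION 's3://my-bucket/hudi_trips';
```

Declaring a primary key and pre-combine field is what lets Hudi deduplicate incoming batches and apply updates and deletes in place, addressing several items on the checklist above.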

  12. 10,000 ft View of Hudi (diagram)

  13. Apache Hudi Layout (diagram)

  14. Copy on Write (diagram)

  15. Merge on Read (diagram)

  16. Presto


  17. What is Presto?
    • Open source, distributed MPP SQL query engine
    • Query in place
    • Federated querying
    • ANSI SQL compliant
    • Designed from the ground up for fast analytic queries against data of any size
    • Originally developed at Facebook
    • Proven on petabytes of data
    • SQL-on-anything
    • Federated, pluggable architecture supporting many connectors
    • Open source, hosted on GitHub: https://github.com/prestodb
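As a sketch of federated querying, one Presto query can join S3 data (via the Hive connector) with an operational MySQL table. The catalog, schema, table, and column names here are hypothetical and assume both catalogs are already configured:

```sql
SELECT c.name,
       sum(o.total_price) AS lifetime_value
FROM hive.sales.orders AS o      -- Parquet files on S3, via the Hive connector
JOIN mysql.crm.customers AS c    -- live table in MySQL
  ON o.customer_id = c.id
WHERE o.order_date >= DATE '2023-01-01'
GROUP BY c.name
ORDER BY lifetime_value DESC
LIMIT 10;
```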

  18. Presto Architecture (diagram)
    A Presto cluster consists of one coordinator and multiple workers.

  19. Presto Architecture (diagram)

  20. A Typical CDC Pipeline (diagram)

  21. Presto + Apache Hudi


  22. Evolution of the Presto-Hudi integration (diagram)

  23. Presto-Hudi connector
    - Snapshot queries for CoW tables
    - Snapshot and read optimized queries for MoR tables
    - Incremental and time travel queries not supported
    - No write support
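Enabling the connector amounts to adding a catalog file on each Presto node. A minimal sketch, assuming a Hive metastore at a placeholder thrift address:

```properties
# etc/catalog/hudi.properties (restart the cluster after adding)
connector.name=hudi
hive.metastore.uri=thrift://example-metastore:9083
```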

  24. Presto-Hudi connector (diagram)

  25. Query Execution: CoW
    • Query flow is similar to a regular Parquet table
    • ParquetPageSource is used for reading records

  26. Method call (diagram)

  27. Query Execution: MoR
    • Read optimized queries are similar to CoW
    • Relevant classes live in the hudi-hadoop-mr module:
      - HoodieParquetRealtimeInputFormat
      - RealtimeCompactedRecordReader
      - HoodieMergedLogRecordScanner
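When a MoR table is synced to the metastore, Hudi typically registers it under two names: a read optimized view (suffix _ro) and a real-time snapshot view (suffix _rt). A hypothetical trips table could then be queried both ways from Presto:

```sql
-- Read optimized: scans only compacted Parquet base files (faster, may lag).
SELECT count(*) FROM hudi.default.trips_ro;

-- Snapshot: merges base files with log files on the fly (fresher, more work).
SELECT count(*) FROM hudi.default.trips_rt;
```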

  28. Method call (diagram)

  29. Record Reader: MoR
    • hoodie.realtime.merge.skip=false by default
    • Log files and Parquet files are merged on the fly

  30. Upcoming features
    • Schema evolution for CoW (https://github.com/prestodb/presto/pull/18557)
    • Improve hudi partition pruning (https://github.com/prestodb/presto/pull/18482)
    • Data skipping, filter pushdown (https://github.com/prestodb/presto/pull/18606)
    • Asynchronous split generation (https://github.com/prestodb/presto/pull/18210)


  31. Ahana Cloud
    Fully-Managed Presto Service


  32. Managing Presto Is Complex
    Hadoop complexity
    ▪ /etc/presto/config.properties
    ▪ /etc/presto/node.properties
    ▪ /etc/presto/jvm.config
    ▪ Many hidden parameters that are difficult to tune
    Infra complexity
    ▪ No built-in catalog: users need to manage a Hive metastore or AWS Glue
    ▪ No data lake S3 integration
    Poor out-of-box performance
    ▪ No tuning
    ▪ No high-performance indexing
    ▪ Only basic optimizations, even for common queries

  33. Ahana Cloud for Presto
    1. Zero to Presto in 30 minutes: managed cloud service, no installation or configuration.
    2. Built for data teams of all experience levels.
    3. Moderate level of control over deployment, without complexity.
    4. Dedicated support from Presto experts.

  34. Ahana Cloud for Presto (diagram)
    Ahana Console (Control Plane): cluster orchestration, consolidated logging,
    security & access, billing & support.
    In-VPC Presto Clusters (Compute Plane): ad hoc, test, and prod clusters,
    connecting to data sources such as Glue, S3, RDS, and Elasticsearch.
    Ahana Cloud Account: the Ahana console oversees and manages every Presto cluster.
    Customer Cloud Account: in-VPC orchestration of Presto clusters, where
    metadata, monitoring, and data sources reside.

  35. Getting Your Hands Dirty
    Lab Guide: https://onehouse.readme.io/docs/read-apache-hudi-tables-from-presto


  36. Wrapping Up
    Conclusion


  37. Conclusion
    In this hands-on workshop you have:
    1. Learned about Ahana Cloud, the easiest-to-use Presto experience in the cloud
    2. Used Presto to query your AWS S3 data lake
    3. Learned the whys and hows of Apache Hudi

  38. Next Steps for You...
    • Ahana Cloud is available on the AWS Marketplace
    • Try the sample code on your own tables
    • Sign up for a 14-day free trial here: https://ahana.io/sign-up
    • Community edition
    • PrestoCon Day 2023 CFPs are open!

  39. Thank you!
    Stay Up-to-Date with Ahana
    Website: https://ahana.io/
    Blogs: https://ahana.io/blog/
    Twitter: @ahanaio


  40. How to get involved with Presto
    Join the Slack channel: prestodb.slack.com
    Write a blog for prestodb.io: prestodb.io/blog
    Join the virtual meetup group & present: meetup.com/prestodb
    Contribute to the project: github.com/prestodb

  41. How to get involved with Apache Hudi
    Join the Hudi Slack channel: https://bit.ly/hudi-slack-channel
    Create Hudi content: hudi.apache.org/blog, hudi.apache.org/videos
    Join the virtual meetup group & present: meetup.com/seattle-data-engineering-meetup-group/
    Contribute to the project: github.com/apache/hudi

  42. Questions?


  43. Open Data Lakehouse (diagram)
    Data is ingested via Spark (CREATE TABLE, ingest data, any changes) into
    Hudi tables on cloud data lakes, and queried through a unified SQL engine
    and SQL editor.