Building an Open Data Lakehouse on AWS with Presto and Apache Hudi

Slide 1

Slide 1 text

Hands-On Virtual Lab: Building an Open Data Lakehouse with Presto, Hudi, and AWS S3

Slide 2

Slide 2 text

Let’s make some introductions
• Nadine Farah, Dev Rel at Onehouse
• Rohan Pednekar, Sr. Product Manager at Ahana

Slide 3

Slide 3 text

Objective for Today
Over the next 2 hours you will explore and understand how to build a data lakehouse using Presto, Hudi, and S3 in a hands-on lab environment.

Slide 4

Slide 4 text

Agenda
1) Introductions (10 mins)
2) Understand the Technology (20 mins)
   a) What is a Data Lakehouse?
   b) Presto and Hudi overview
3) Getting your hands dirty (80 mins)
   a) Lab1: Get started with Ahana console
   b) Lab2: Use Presto to query your AWS S3 Data Lake
   c) Optional Lab3: Federated Query with MySQL
   d) Lab4: Set up Apache Hudi on Presto
   e) Lab5: Hudi COW
   f) Lab6: Hudi MOR
4) Summary and Close Out (10 mins)

Slide 5

Slide 5 text

Guest Speaker: Pratyaksh Sharma, Software Engineer at Ahana, Apache Hudi Committer

Slide 6

Slide 6 text

Understanding the Technology: Presto & Hudi

Slide 7

Slide 7 text

Overview
• Open Data Lakehouse: Data lake + Data warehouse
• Hudi (Hadoop Upserts, Deletes and Incrementals)
• Presto: Distributed SQL query engine for federated queries

Slide 8

Slide 8 text

Open Data Lakehouse
[Diagram labels, Data Warehouse side: Proprietary Storage; Proprietary SQL Query Processing; Reporting and Dashboarding]
[Diagram labels, Open Data Lakehouse side: Cloud Data Lake; Open Formats Storage; SQL Query Processing; ML and AI Frameworks; Governance, Discovery, Quality & Security; Data Science, ML, & AI; Reporting and Dashboarding]

Slide 9

Slide 9 text

Checklist: Building your Open Data Lakehouse
❏ Moving data from OLTP databases to distributed storage: incremental or batch?
❏ Scattered data across various data sources
❏ Incremental pull for downstream pipelines
❏ Upserts and deletes
❏ Schema enforcement/evolution
❏ Late arriving updates
❏ Deduplicate incoming batch
❏ Latency vs data correctness? We need both
❏ Read and write amplification
❏ Efficient querying
❏ Interactive queries in real time
❏ ACID transactions
❏ Disaster recovery (duplicate data, wrong schema evolution, etc.)
❏ Storage management and file organization
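Three of the checklist items (deduplicating an incoming batch, upserts, and late arriving updates) can be illustrated together. The sketch below is a toy in-memory model, not Hudi code: the `id` record key and the `ts` ordering field are hypothetical names chosen for the example, and the "table" is just a Python dict.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def dedupe_batch(batch: Iterable[Record], key: str, ordering: str) -> List[Record]:
    """Keep one record per key: the one with the largest ordering value."""
    latest: Dict[Any, Record] = {}
    for rec in batch:
        current = latest.get(rec[key])
        if current is None or rec[ordering] >= current[ordering]:
            latest[rec[key]] = rec
    return list(latest.values())


def upsert(table: Dict[Any, Record], batch: Iterable[Record], key: str, ordering: str) -> None:
    """Insert unseen keys; update existing keys unless the incoming record is older."""
    for rec in dedupe_batch(batch, key, ordering):
        existing = table.get(rec[key])
        if existing is None or rec[ordering] >= existing[ordering]:
            table[rec[key]] = rec


table: Dict[Any, Record] = {}

# First batch contains a duplicate key (id=1); only the newest version survives.
upsert(table, [
    {"id": 1, "ts": 10, "city": "SF"},
    {"id": 1, "ts": 12, "city": "NYC"},
    {"id": 2, "ts": 11, "city": "LA"},
], key="id", ordering="ts")

# A late arriving, older update for id=1 must not overwrite the newer row.
upsert(table, [{"id": 1, "ts": 5, "city": "stale"}], key="id", ordering="ts")
```

After both calls the table holds two rows, and the row for `id=1` still carries the `ts=12` version.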

Slide 10

Slide 10 text

Apache Hudi

Slide 11

Slide 11 text

What is Apache Hudi?
• Pioneer of the lakehouse storage architecture
• Provides database-like semantics over lake storage
• Completely serverless
• Mainly built to solve for incremental workloads in batch fashion
• Originally developed at Uber
• Efficient use of compute resources
• Not a table format; a complete data platform solution
• Built for data lake workloads
• Both streaming + batch style pipelines
• Table services to manage storage and metadata
• Introduced notions of Copy on Write vs Merge on Read
• Different ways to think about the tradeoff between data freshness and query performance
• Terms heavily borrowed outside Hudi

Slide 12

Slide 12 text

10,000 ft View of Hudi

Slide 13

Slide 13 text

Apache Hudi Layout

Slide 14

Slide 14 text

Copy on Write

Slide 15

Slide 15 text

Merge on Read
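The write-side difference between Copy on Write and Merge on Read can be sketched with a toy model. This is only an illustration under simplifying assumptions: one file group, records held in dicts, and "write cost" counted as the number of records written. The class names are hypothetical and none of this is Hudi's actual implementation.

```python
from typing import Dict, List


class CopyOnWriteFile:
    """Every update rewrites the whole base file as a new version."""

    def __init__(self, records: Dict[int, str]) -> None:
        self.versions: List[Dict[int, str]] = [dict(records)]
        self.records_written = len(records)

    def update(self, changes: Dict[int, str]) -> None:
        new_version = dict(self.versions[-1])
        new_version.update(changes)
        self.versions.append(new_version)
        self.records_written += len(new_version)  # full rewrite

    def read(self) -> Dict[int, str]:
        return dict(self.versions[-1])  # no merge work at read time


class MergeOnReadFile:
    """Updates are appended to a log; readers merge base + log."""

    def __init__(self, records: Dict[int, str]) -> None:
        self.base: Dict[int, str] = dict(records)
        self.log: List[Dict[int, str]] = []
        self.records_written = len(records)

    def update(self, changes: Dict[int, str]) -> None:
        self.log.append(dict(changes))
        self.records_written += len(changes)  # only the delta is written

    def read(self) -> Dict[int, str]:
        merged = dict(self.base)
        for block in self.log:  # merge cost is paid on every read
            merged.update(block)
        return merged

    def compact(self) -> None:
        """Fold the log into a new base file, as a compaction service would."""
        self.base = self.read()
        self.log = []
        self.records_written += len(self.base)


base = {i: f"v0-{i}" for i in range(1000)}
cow, mor = CopyOnWriteFile(base), MergeOnReadFile(base)
for f in (cow, mor):
    f.update({7: "v1-7"})  # change a single record
```

In this model both files return the same data, but the single-record update costs 1000 written records on the Copy on Write side and 1 on the Merge on Read side; the Merge on Read side pays instead at read time until the log is compacted.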

Slide 16

Slide 16 text

Presto

Slide 17

Slide 17 text

What is Presto?
• Open source, distributed MPP SQL query engine
• Query in Place
• Federated Querying
• ANSI SQL Compliant
• Designed from the ground up for fast analytic queries against data of any size
• Originally developed at Facebook
• Proven on petabytes of data
• SQL-On-Anything
• Federated pluggable architecture to support many connectors
• Open source, hosted on GitHub
• https://github.com/prestodb

Slide 18

Slide 18 text

Presto Architecture
[Diagram: a Presto cluster with one Coordinator and four Workers]

Slide 19

Slide 19 text

Presto Architecture

Slide 20

Slide 20 text

A Typical CDC Pipeline

Slide 21

Slide 21 text

Presto + Apache Hudi

Slide 22

Slide 22 text

Evolution of presto-hudi integration

Slide 23

Slide 23 text

Presto-hudi connector
- Snapshot queries for CoW tables
- Snapshot and read optimized queries for MoR tables
- Incremental and time travel queries not supported
- No write support
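The supported query types above map onto which table name a Presto query targets. The sketch below assumes the Hudi table was synced to the metastore with Hudi's usual naming for Merge on Read tables, where the read optimized view carries an `_ro` suffix and the snapshot (real-time) view an `_rt` suffix. The catalog, schema, and table names are hypothetical, and the helper functions are illustrative, not part of Presto or Hudi.

```python
def hudi_query_target(table: str, table_type: str, query_type: str = "snapshot") -> str:
    """Return the table name to query for a given Hudi table type and query type."""
    if table_type == "COPY_ON_WRITE":
        if query_type != "snapshot":
            raise ValueError(f"{query_type} queries are not covered for CoW tables here")
        return table
    if table_type == "MERGE_ON_READ":
        # Assumed metastore naming: <table>_rt for snapshot, <table>_ro for read optimized.
        suffix = {"snapshot": "_rt", "read_optimized": "_ro"}.get(query_type)
        if suffix is None:
            raise ValueError(f"{query_type} queries are not supported")
        return table + suffix
    raise ValueError(f"unknown table type: {table_type}")


def select_all(catalog: str, schema: str, table: str, table_type: str,
               query_type: str = "snapshot", limit: int = 10) -> str:
    """Build a simple SELECT statement against the chosen view."""
    target = hudi_query_target(table, table_type, query_type)
    return f"SELECT * FROM {catalog}.{schema}.{target} LIMIT {limit}"


# Hypothetical names: catalog "hive", schema "default", table "trips".
ro_sql = select_all("hive", "default", "trips", "MERGE_ON_READ", "read_optimized")
rt_sql = select_all("hive", "default", "trips", "MERGE_ON_READ", "snapshot")
cow_sql = select_all("hive", "default", "trips", "COPY_ON_WRITE")
```

Requesting an incremental or time travel query raises `ValueError` in this sketch, mirroring the "not supported" bullet.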

Slide 24

Slide 24 text

Presto-hudi connector

Slide 25

Slide 25 text

Query Execution: CoW
• Query flow similar to a regular parquet table
• ParquetPageSource used for reading records
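One way a CoW table differs from a plain parquet directory is that several versions of the same file group can sit side by side until older ones are cleaned, so a snapshot query should read only the newest version of each. The sketch below assumes base files are named `<fileId>_<writeToken>_<instantTime>.parquet` and that instant times are fixed-width timestamps that sort as strings. It ignores the commit timeline (a real reader would also skip files from incomplete commits), and the function is illustrative, not Presto or Hudi code.

```python
from typing import Dict, Iterable, List, Tuple


def latest_base_files(file_names: Iterable[str]) -> List[str]:
    """Pick the newest base file per file group from a partition listing."""
    latest: Dict[str, Tuple[str, str]] = {}  # fileId -> (instantTime, file name)
    for name in file_names:
        if not name.endswith(".parquet"):
            continue  # skip anything that is not a parquet base file
        stem = name[: -len(".parquet")]
        file_id, _write_token, instant = stem.split("_")
        if file_id not in latest or instant > latest[file_id][0]:
            latest[file_id] = (instant, name)
    return sorted(name for _, name in latest.values())


# Hypothetical partition listing: file group f1-0 has two versions.
listing = [
    "f1-0_1-0-1_20220101000000.parquet",
    "f1-0_1-0-1_20220102000000.parquet",
    "f2-0_1-0-1_20220101000000.parquet",
    ".hoodie_partition_metadata",
]
selected = latest_base_files(listing)
```

Once the file list is filtered this way, the remaining files can be read like any other parquet files, which is consistent with the query flow described above.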

Slide 26

Slide 26 text

Method call (diagram label: Upcoming features)

Slide 27

Slide 27 text

Query Execution: MoR
• Read Optimized queries similar to CoW.
• Relevant classes present in hudi-hadoop-mr module:
  - HoodieParquetRealtimeInputFormat
  - RealtimeCompactedRecordReader
  - HoodieMergedLogRecordScanner

Slide 28

Slide 28 text

Method call

Slide 29

Slide 29 text

Record Reader: MoR
• hoodie.realtime.merge.skip=false by default
• Log files and parquet files merged on the fly.
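The difference between a read optimized query and a snapshot query on a MoR table can be sketched as follows. This is a simplified model of "merged on the fly", not the logic of the hudi-hadoop-mr classes: the `FileSlice` type is hypothetical, rows are plain dicts keyed by record key, and representing a delete as `None` is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Row = Dict[str, object]


@dataclass
class FileSlice:
    """One base (parquet) file plus the log blocks written after it."""
    base: Dict[str, Row]
    log_blocks: List[Dict[str, Optional[Row]]] = field(default_factory=list)


def read_optimized(fs: FileSlice) -> List[Row]:
    """Read only the base file: fast, but misses changes still in the logs."""
    return list(fs.base.values())


def snapshot(fs: FileSlice) -> List[Row]:
    """Merge log blocks over the base file in write order, then return rows."""
    merged: Dict[str, Row] = dict(fs.base)
    for block in fs.log_blocks:
        for key, row in block.items():
            if row is None:
                merged.pop(key, None)  # delete marker (assumed representation)
            else:
                merged[key] = row      # insert or update
    return list(merged.values())


fs = FileSlice(
    base={
        "a": {"id": "a", "fare": 10},
        "b": {"id": "b", "fare": 20},
    },
    log_blocks=[
        {"a": {"id": "a", "fare": 15}},                 # update a
        {"b": None, "c": {"id": "c", "fare": 30}},      # delete b, insert c
    ],
)
ro_rows = read_optimized(fs)
rt_rows = snapshot(fs)
```

Here the read optimized view still shows the original fares for `a` and `b`, while the snapshot view reflects the update, the delete, and the insert.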

Slide 30

Slide 30 text

Upcoming features
• Schema evolution for CoW (https://github.com/prestodb/presto/pull/18557)
• Improve hudi partition pruning (https://github.com/prestodb/presto/pull/18482)
• Data skipping, filter pushdown (https://github.com/prestodb/presto/pull/18606)
• Asynchronous split generation (https://github.com/prestodb/presto/pull/18210)

Slide 31

Slide 31 text

Ahana Cloud Fully-Managed Presto Service

Slide 32

Slide 32 text

Managing Presto Is Complex
Hadoop complexity
▪ /etc/presto/config.properties
▪ /etc/presto/node.properties
▪ /etc/presto/jvm.config
Many hidden parameters; difficult to tune
Infra complexity
▪ No built-in catalog: users need to manage Hive metastore or AWS Glue
▪ No data lake S3 integration
Poor out-of-box perf
▪ No tuning
▪ No high-performance indexing
▪ Basic optimizations even for common queries

Slide 33

Slide 33 text

Ahana Cloud For Presto
1. Zero to Presto in 30 Minutes. Managed cloud service: No installation and configuration.
2. Built for data teams of all experience levels.
3. Moderate level of control of deployment without complexity.
4. Dedicated support from Presto experts.

Slide 34

Slide 34 text

Ahana Cloud for Presto
• Ahana Cloud Account: Ahana Console (Control Plane) with cluster orchestration, consolidated logging, security & access, and billing & support. Ahana console oversees and manages every Presto cluster.
• Customer Cloud Account: In-VPC Presto Clusters (Compute Plane), shown as ad hoc cluster 1, test cluster 2, and prod cluster N, alongside Glue, S3, RDS, and Elasticsearch. In-VPC orchestration of Presto clusters, where metadata, monitoring, and data sources reside.

Slide 35

Slide 35 text

Getting Your Hands Dirty
Lab Guide: https://onehouse.readme.io/docs/read-apache-hudi-tables-from-presto

Slide 36

Slide 36 text

Wrapping Up Conclusion

Slide 37

Slide 37 text

Conclusion
In this hands-on workshop you have:
1. Learned about Ahana Cloud, the easiest-to-use Presto experience in the cloud
2. Used Presto to query your AWS S3 Data Lake
3. Learned the whys and hows of Apache Hudi

Slide 38

Slide 38 text

Next Steps for You...
• Ahana Cloud is available on the AWS Marketplace
• Try the sample code for your own tables
• Sign up for a 14-day free trial here: https://ahana.io/sign-up
• Community edition
• PrestoCon Day 2023 CFPs are open!

Slide 39

Slide 39 text

Thank you!
Stay Up-to-Date with Ahana
• Website: https://ahana.io/
• Blogs: https://ahana.io/blog/
• Twitter: @ahanaio

Slide 40

Slide 40 text

How to get involved with Presto
• Join the Slack channel! prestodb.slack.com
• Write a blog for prestodb.io! prestodb.io/blog
• Join the virtual meetup group & present! meetup.com/prestodb
• Contribute to the project! github.com/prestodb

Slide 41

Slide 41 text

How to get involved with Apache Hudi
• Join the Hudi Slack channel: https://bit.ly/hudi-slack-channel
• Create Hudi content: hudi.apache.org/blog, hudi.apache.org/videos
• Join the virtual meetup group & present! meetup.com/seattle-data-engineering-meetup-group/
• Contribute to the project! github.com/apache/hudi

Slide 42

Slide 42 text

Questions?

Slide 43

Slide 43 text

Open Data Lakehouse
[Diagram labels: SQL Editor; Unified SQL Engine; Ingestion; Cloud Data Lakes; Hudi Tables; CREATE TABLE, Ingest Data, Any Changes via Spark]
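For the "via Spark" ingestion path, a Hudi write is mostly a matter of passing the right options to the Spark data source. The sketch below only builds that options dictionary so it runs without Spark installed. The option keys are Hudi's Spark data source write configs; the table name, field names, and the helper function itself are hypothetical.

```python
from typing import Dict


def hudi_write_options(table_name: str, record_key: str, precombine_field: str,
                       partition_field: str, table_type: str = "COPY_ON_WRITE",
                       operation: str = "upsert") -> Dict[str, str]:
    """Build the options for a Hudi write from Spark."""
    if table_type not in ("COPY_ON_WRITE", "MERGE_ON_READ"):
        raise ValueError(f"unknown table type: {table_type}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.table.type": table_type,
        "hoodie.datasource.write.operation": operation,
    }


# Hypothetical table and column names.
opts = hudi_write_options("trips", record_key="uuid", precombine_field="ts",
                          partition_field="city", table_type="MERGE_ON_READ")

# With PySpark and a Hudi Spark bundle on the classpath, the write would look like:
#   df.write.format("hudi").options(**opts).mode("append").save("s3://<bucket>/<path>")
```

Switching `table_type` between `COPY_ON_WRITE` and `MERGE_ON_READ` is what distinguishes the two table kinds exercised in the COW and MOR labs.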