Building an Open Data Lakehouse on AWS with Presto and Apache Hudi

March 31, 2023

You may be familiar with the Data Lakehouse, an emerging architecture that brings the flexibility, scale and cost management benefits of the data lake together with the data management capabilities of the data warehouse. In this workshop, we’ll get hands-on building an Open Data Lakehouse – an approach that brings open technologies and formats to your lakehouse.

For the purpose of this workshop, we’ll use Presto for the open source SQL query engine, Apache Hudi for ACID transactions, and AWS S3 for the data lake. You’ll get hands-on with Presto and Hudi. We’ll show you how to deploy each, connect them, set up your Hudi tables for ACID transactions, and finally run queries on your S3 data.
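
As a rough sketch of that final step, the snippet below shows how a client might run a query against a Hudi table through Presto using a Python DB-API style cursor. The catalog, schema, and table names (`hudi`, `default`, `trips`) are hypothetical placeholders, not names taken from the workshop labs. `_hoodie_commit_time` is one of the metadata columns Hudi adds to each record. The `StubCursor` class is an assumption made so the example runs without a cluster; it stands in for a real Presto cursor.

```python
from typing import Any, List, Tuple


def latest_commits_sql(catalog: str, schema: str, table: str, limit: int = 5) -> str:
    """Build a query that lists the most recent Hudi commit times in a table."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Quote each identifier so unusual names do not break the statement.
    qualified = ".".join(
        '"{}"'.format(part.replace('"', '""')) for part in (catalog, schema, table)
    )
    return (
        "SELECT _hoodie_commit_time, count(*) AS records "
        f"FROM {qualified} "
        "GROUP BY _hoodie_commit_time "
        "ORDER BY _hoodie_commit_time DESC "
        f"LIMIT {limit}"
    )


def run_query(cursor: Any, sql: str) -> List[Tuple]:
    """Execute sql on any DB-API style cursor and return all rows."""
    cursor.execute(sql)
    return list(cursor.fetchall())


class StubCursor:
    """Hypothetical stand-in for a Presto cursor: records SQL, returns canned rows."""

    def __init__(self, rows: List[Tuple]) -> None:
        self._rows = rows
        self.executed: List[str] = []

    def execute(self, sql: str) -> None:
        self.executed.append(sql)

    def fetchall(self) -> List[Tuple]:
        return self._rows


if __name__ == "__main__":
    sql = latest_commits_sql("hudi", "default", "trips", limit=3)
    print(sql)
    print(run_query(StubCursor([("20230331101500", 2)]), sql))
```

With the `presto-python-client` package, a real cursor would typically come from `prestodb.dbapi.connect(...).cursor()`; the host, port, and user for that call depend on your cluster and are not shown here.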



  1. Let’s make some introductions
    Nadine Farah, Dev Rel at Onehouse
    Rohan Pednekar, Sr. Product Manager at Ahana
  2. Objective for Today
    Over the next 2 hours you will explore and understand how to build a Data Lakehouse using Presto, Hudi, and S3 in a hands-on lab environment.
  3. Agenda
    1) Introductions (10 mins)
    2) Understand the Technology (20 mins)
       a) What is a Data Lakehouse?
       b) Presto and Hudi overview
    3) Getting your hands dirty (80 mins)
       a) Lab1: Get started with Ahana console
       b) Lab2: Use Presto to query your AWS S3 Data Lake
       c) Optional Lab3: Federated Query with MySQL
       d) Lab4: Set up Apache Hudi on Presto
       e) Lab5: Hudi COW
       f) Lab6: Hudi MOR
    4) Summary and Close Out (10 mins)
  4. Overview
    • Open Data Lakehouse: Data lake + Data warehouse
    • Hudi (Hadoop Upserts, Deletes and Incrementals)
    • Presto: Distributed SQL query engine for federated queries
  5. The Open Data Lakehouse
    [Diagram labels: Data Science, ML, & AI; Reporting and Dashboarding; Data Warehouse; Proprietary Storage; Proprietary SQL Query Processing; ML and AI Frameworks; SQL Query Processing; Cloud Data Lake; Open Formats Storage; Governance, Discovery, Quality & Security]
  6. Checklist: Building your Open Data Lakehouse
    ❏ Moving data from OLTP databases to distributed storage - incremental or batch?
    ❏ Scattered data across various data sources
    ❏ Incremental pull for downstream pipelines
    ❏ Upserts and deletes
    ❏ Schema enforcement/evolution
    ❏ Late arriving updates
    ❏ Deduplicate incoming batch
    ❏ Latency vs data correctness? We need both
    ❏ Read and write amplification
    ❏ Efficient querying
    ❏ Interactive queries in real time
    ❏ ACID transactions
    ❏ Disaster recovery (duplicate data, wrong schema evolution, etc.)
    ❏ Storage management and file organization
  7. What is Apache Hudi?
    • Pioneer of the lakehouse storage architecture
    • Provides database-like semantics over lake storage
    • Completely serverless
    • Mainly built to solve for incremental workloads in batch fashion
    • Originally developed at Uber
    • Efficient use of compute resources
    • Not a table format, but a complete data platform solution
    • Built for data lake workloads
    • Both streaming + batch style pipelines
    • Table services to manage storage and metadata
    • Introduced notions of Copy on Write vs Merge on Read
    • Different ways to think about the tradeoff between data freshness and query performance
    • Terms heavily borrowed outside the project
  8. What is Presto?
    • Open source, distributed MPP SQL query engine
    • Query in Place
    • Federated Querying
    • ANSI SQL Compliant
    • Designed from the ground up for fast analytic queries against data of any size
    • Originally developed at Facebook
    • Proven on petabytes of data
    • SQL-On-Anything
    • Federated pluggable architecture to support many connectors
    • Open source, hosted on GitHub
    • https://github.com/prestodb
  9. Presto-Hudi connector
    - Snapshot queries for CoW tables
    - Snapshot and read optimized queries for MoR tables
    - Incremental and time travel queries not supported
    - No write support
  10. Query Execution: CoW
    • Query flow similar to a regular Parquet table
    • ParquetPageSource used for reading records
  11. Query Execution: MoR
    • Read Optimized queries similar to CoW
    • Relevant classes present in hudi-hadoop-mr module
       - HoodieParquetRealtimeInputFormat
       - RealtimeCompactedRecordReader
       - HoodieMergedLogRecordScanner
  12. Upcoming features
    • Schema evolution for CoW (https://github.com/prestodb/presto/pull/18557)
    • Improve Hudi partition pruning (https://github.com/prestodb/presto/pull/18482)
    • Data skipping, filter pushdown (https://github.com/prestodb/presto/pull/18606)
    • Asynchronous split generation (https://github.com/prestodb/presto/pull/18210)
  13. Managing Presto Is Complex
    Hadoop complexity
    ▪ /etc/presto/config.properties
    ▪ /etc/presto/node.properties
    ▪ /etc/presto/jvm.config
    Many hidden parameters – difficult to tune
    Infra complexity
    ▪ No built-in catalog – users need to manage Hive metastore or AWS Glue
    ▪ No data lake S3 integration
    Poor out-of-box perf
    ▪ No tuning
    ▪ No high-performance indexing
    ▪ Only basic optimizations, even for common queries
  14. Ahana Cloud for Presto
    1. Zero to Presto in 30 minutes. Managed cloud service: no installation and configuration.
    2. Built for data teams of all experience levels.
    3. Moderate level of control of deployment without complexity.
    4. Dedicated support from Presto experts.
  15. Ahana Cloud for Presto
    Ahana Cloud Account: the Ahana console oversees and manages every Presto cluster
    ▪ Ahana Console (Control Plane): cluster orchestration, consolidated logging, security & access, billing & support
    Customer Cloud Account: In-VPC orchestration of Presto clusters, where metadata, monitoring, and data sources reside
    ▪ In-VPC Presto Clusters (Compute Plane): ad hoc cluster 1, test cluster 2, prod cluster N
    ▪ Glue, S3, RDS, Elasticsearch
  16. Conclusion
    In this hands-on workshop you have:
    1. Learned about Ahana Cloud, the easiest-to-use Presto experience in the cloud
    2. Used Presto to query your AWS S3 Data Lake
    3. Learned the whys and hows of Apache Hudi
  17. Next Steps for You...
    • Ahana Cloud is available on the AWS Marketplace
    • Try the sample code for your own tables
    • Sign up for a 14-day free trial here: https://ahana.io/sign-up
    • Community edition
    • PrestoCon Day 2023 CFPs are open!
  18. How to get involved with Presto
    • Join the Slack channel! prestodb.slack.com
    • Write a blog for prestodb.io! prestodb.io/blog
    • Join the virtual meetup group & present! meetup.com/prestodb
    • Contribute to the project! github.com/prestodb
  19. How to get involved with Apache Hudi
    • Join the Hudi Slack channel: https://bit.ly/hudi-slack-channel
    • Create Hudi content: hudi.apache.org/blog, hudi.apache.org/videos
    • Join the virtual meetup group & present! meetup.com/seattle-data-engineering-meetup-group/
    • Contribute to the project! github.com/apache/hudi
  20. Open Data Lakehouse
    [Diagram labels: SQL Editor; Unified SQL Engine; Ingestion; Cloud Data Lakes; Hudi Tables; CREATE TABLE, Ingest Data, Any Changes via Spark]
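
The CoW and MoR query paths described on the query execution slides above can be illustrated with a toy in-memory model. This is a conceptual sketch only, not Hudi's or Presto's implementation: the `ToyTable` class, its methods, and the dictionary-based "files" are all hypothetical. It assumes that a CoW upsert rewrites the base data, that a MoR upsert appends to a log, that a read optimized query reads base data only, and that a snapshot query merges log records over the base data at read time.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


class ToyTable:
    """Hypothetical in-memory model of a CoW or MoR table (illustration only)."""

    def __init__(self, table_type: str) -> None:
        if table_type not in ("cow", "mor"):
            raise ValueError("table_type must be 'cow' or 'mor'")
        self.table_type = table_type
        self.base: Dict[str, Record] = {}          # stands in for base files
        self.log: List[Tuple[str, Record]] = []    # stands in for log files (MoR only)

    def upsert(self, key: str, record: Record) -> None:
        if self.table_type == "cow":
            # Copy on write: produce a new version of the base data with the change applied.
            new_base = dict(self.base)
            new_base[key] = record
            self.base = new_base
        else:
            # Merge on read: append the change to the log and defer the merge.
            self.log.append((key, record))

    def read_optimized(self) -> Dict[str, Record]:
        """Read base data only: cheaper, but may miss changes still in the log."""
        return dict(self.base)

    def snapshot(self) -> Dict[str, Record]:
        """Merge log records over base data, so the latest writes are visible."""
        merged = dict(self.base)
        for key, record in self.log:
            merged[key] = record
        return merged

    def compact(self) -> None:
        """Fold the log into the base data, after which both query types agree."""
        self.base = self.snapshot()
        self.log = []


if __name__ == "__main__":
    mor = ToyTable("mor")
    mor.base = {"a": {"v": 1}}
    mor.upsert("a", {"v": 2})
    print(mor.read_optimized())  # base data only
    print(mor.snapshot())        # base data merged with the log
```

In this model a CoW table pays the merge cost on every write, so its two query types always agree, while a MoR table keeps writes cheap and pays the merge cost when a snapshot query runs or when the table is compacted.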