
Building an Open Data Lakehouse on AWS with Presto and Apache Hudi

Ahana
March 31, 2023


You may be familiar with the Data Lakehouse, an emerging architecture that brings the flexibility, scale and cost management benefits of the data lake together with the data management capabilities of the data warehouse. In this workshop, we’ll get hands-on building an Open Data Lakehouse – an approach that brings open technologies and formats to your lakehouse.

For the purpose of this workshop, we’ll use Presto for the open source SQL query engine, Apache Hudi for ACID transactions, and AWS S3 for the data lake. You’ll get hands-on with Presto and Hudi. We’ll show you how to deploy each, connect them, set up your Hudi tables for ACID transactions, and finally run queries on your S3 data.


Transcript

  1. HANDS-ON VIRTUAL LAB
    Building an Open Data Lakehouse with Presto, Hudi, and AWS S3

  2. Let’s make some introductions
    Nadine Farah, Dev Rel at Onehouse
    Rohan Pednekar, Sr. Product Manager at Ahana

  3. Objective for Today
    Over the next 2 hours you will explore and understand how to build a Data
    Lakehouse using Presto, Hudi, and S3 in a hands-on lab environment.

  4. Agenda
    1) Introductions (10 mins)
    2) Understand the Technology (20 mins)
    a) What is a Data Lakehouse?
    b) Presto and Hudi overview
    3) Getting your hands dirty (80 mins)
    a) Lab1: Get started with Ahana console
    b) Lab2: Use Presto to query your AWS S3 Data Lake
    c) Optional Lab3: Federated Query with MySQL
    d) Lab4: Set up Apache Hudi on Presto
    e) Lab5: Hudi COW
    f) Lab6: Hudi MOR
    4) Summary and Close Out (10 mins)


  5. Guest Speaker
    Pratyaksh Sharma
    Software Engineer, Ahana
    Apache Hudi Committer

  6. Understanding the Technology
    Presto & Hudi

  7. Overview
    • Open Data Lakehouse: data lake + data warehouse
    • Hudi: Hadoop Upserts, Deletes and Incrementals
    • Presto: distributed SQL query engine for federated queries

  8. The Open Data Lakehouse (diagram)
    Data Warehouse: proprietary storage and proprietary SQL query processing,
    feeding reporting and dashboarding.
    Open Data Lakehouse: cloud data lake storage in open formats; SQL query
    processing plus ML and AI frameworks; serving data science, ML & AI, and
    reporting and dashboarding; with governance, discovery, quality & security
    across the stack.

  9. Checklist: Building your Open Data Lakehouse
    ❏ Moving data from OLTP databases to distributed storage - incremental or batch?
    ❏ Scattered data across various data sources
    ❏ Incremental pull for downstream pipelines
    ❏ Upserts and deletes
    ❏ Schema enforcement/evolution
    ❏ Late arriving updates
    ❏ Deduplicate incoming batch
    ❏ Latency vs data correctness? We need both
    ❏ Read and write amplification
    ❏ Efficient querying
    ❏ Interactive queries in real time
    ❏ ACID transactions
    ❏ Disaster recovery (duplicate data, wrong schema evolution etc)
    ❏ Storage management and file organization


  10. Apache Hudi


  11. What is Apache Hudi?
    • Pioneer of the lakehouse storage architecture
    • Provides database-like semantics over lake storage
    • Completely serverless
    • Built mainly to handle incremental workloads in a batch fashion
    • Originally developed at Uber
    • Efficient use of compute resources
    • Not just a table format, but a complete data platform
    • Built for data lake workloads
    • Supports both streaming and batch style pipelines
    • Table services to manage storage and metadata
    • Introduced the notions of Copy on Write vs Merge on Read
    • Different ways to think about the tradeoff between data freshness and query performance
    • These terms have since been widely adopted by other projects
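Hudi tables are typically created and written through Spark. As a minimal sketch in Spark SQL: the table, columns, and S3 path below are hypothetical, not from the workshop labs.

```sql
-- Hypothetical Copy on Write table; switch type to 'mor' for Merge on Read.
CREATE TABLE hudi_trips (
  trip_id STRING,
  rider   STRING,
  fare    DOUBLE,
  ts      BIGINT
) USING hudi
TBLPROPERTIES (
  type = 'cow',             -- storage layout: Copy on Write
  primaryKey = 'trip_id',   -- record key Hudi uses for upserts and deletes
  preCombineField = 'ts'    -- if a batch has duplicate keys, the latest ts wins
)
LOCATION 's3://my-bucket/hudi_trips';
```

Declaring a primary key and pre-combine field is what lets Hudi deduplicate incoming batches and apply updates and deletes in place, addressing several items on the checklist above.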

  12. 10,000 ft View of Hudi (diagram)

  13. Apache Hudi Layout (diagram)

  14. Copy on Write (diagram)

  15. Merge on Read (diagram)

  16. Presto


  17. What is Presto?
    • Open source, distributed MPP SQL query engine
    • Query in place
    • Federated querying
    • ANSI SQL compliant
    • Designed from the ground up for fast analytic queries against data of any size
    • Originally developed at Facebook
    • Proven on petabytes of data
    • SQL-on-anything
    • Federated, pluggable architecture supporting many connectors
    • Open source, hosted on GitHub: https://github.com/prestodb
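As a sketch of federated querying, one Presto query can join S3 data (via the Hive connector) with an operational MySQL table. The catalog, schema, table, and column names here are hypothetical and assume both catalogs are already configured:

```sql
SELECT c.name,
       sum(o.total_price) AS lifetime_value
FROM hive.sales.orders AS o      -- Parquet files on S3, via the Hive connector
JOIN mysql.crm.customers AS c    -- live table in MySQL
  ON o.customer_id = c.id
WHERE o.order_date >= DATE '2023-01-01'
GROUP BY c.name
ORDER BY lifetime_value DESC
LIMIT 10;
```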

  18. Presto Architecture (diagram)
    A Presto cluster consists of one coordinator and multiple workers.

  19. Presto Architecture (diagram)

  20. A Typical CDC Pipeline (diagram)

  21. Presto + Apache Hudi


  22. Evolution of the Presto-Hudi integration (diagram)

  23. Presto-Hudi connector
    - Snapshot queries for CoW tables
    - Snapshot and read optimized queries for MoR tables
    - Incremental and time travel queries not supported
    - No write support
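Enabling the connector amounts to adding a catalog file on each Presto node. A minimal sketch, assuming a Hive metastore at a placeholder thrift address:

```properties
# etc/catalog/hudi.properties (restart the cluster after adding)
connector.name=hudi
hive.metastore.uri=thrift://example-metastore:9083
```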

  24. Presto-Hudi connector (diagram)

  25. Query Execution: CoW
    • Query flow is similar to a regular Parquet table
    • ParquetPageSource is used for reading records

  26. Method call (diagram)

  27. Query Execution: MoR
    • Read optimized queries are similar to CoW
    • Relevant classes live in the hudi-hadoop-mr module:
      - HoodieParquetRealtimeInputFormat
      - RealtimeCompactedRecordReader
      - HoodieMergedLogRecordScanner
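When a MoR table is synced to the metastore, Hudi typically registers it under two names: a read optimized view (suffix _ro) and a real-time snapshot view (suffix _rt). A hypothetical trips table could then be queried both ways from Presto:

```sql
-- Read optimized: scans only compacted Parquet base files (faster, may lag).
SELECT count(*) FROM hudi.default.trips_ro;

-- Snapshot: merges base files with log files on the fly (fresher, more work).
SELECT count(*) FROM hudi.default.trips_rt;
```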

  28. Method call (diagram)

  29. Record Reader: MoR
    • hoodie.realtime.merge.skip=false by default
    • Log files and Parquet files are merged on the fly

  30. Upcoming features
    • Schema evolution for CoW (https://github.com/prestodb/presto/pull/18557)
    • Improve hudi partition pruning (https://github.com/prestodb/presto/pull/18482)
    • Data skipping, filter pushdown (https://github.com/prestodb/presto/pull/18606)
    • Asynchronous split generation (https://github.com/prestodb/presto/pull/18210)


  31. Ahana Cloud
    Fully-Managed Presto Service


  32. Managing Presto Is Complex
    Hadoop complexity
    ▪ /etc/presto/config.properties
    ▪ /etc/presto/node.properties
    ▪ /etc/presto/jvm.config
    ▪ Many hidden parameters that are difficult to tune
    Infra complexity
    ▪ No built-in catalog: users need to manage a Hive metastore or AWS Glue
    ▪ No data lake S3 integration
    Poor out-of-box performance
    ▪ No tuning
    ▪ No high-performance indexing
    ▪ Only basic optimizations, even for common queries

  33. Ahana Cloud for Presto
    1. Zero to Presto in 30 minutes: managed cloud service, no installation or configuration.
    2. Built for data teams of all experience levels.
    3. Moderate level of control over deployment, without complexity.
    4. Dedicated support from Presto experts.

  34. Ahana Cloud for Presto (diagram)
    Ahana Console (Control Plane): cluster orchestration, consolidated logging,
    security & access, billing & support.
    In-VPC Presto Clusters (Compute Plane): ad hoc, test, and prod clusters,
    connecting to data sources such as Glue, S3, RDS, and Elasticsearch.
    Ahana Cloud Account: the Ahana console oversees and manages every Presto cluster.
    Customer Cloud Account: in-VPC orchestration of Presto clusters, where
    metadata, monitoring, and data sources reside.

  35. Getting Your Hands Dirty
    Lab Guide: https://onehouse.readme.io/docs/read-apache-hudi-tables-from-presto


  36. Wrapping Up
    Conclusion


  37. Conclusion
    In this hands-on workshop you have:
    1. Learned about Ahana Cloud, the easiest-to-use Presto experience in the cloud
    2. Used Presto to query your AWS S3 data lake
    3. Learned the whys and hows of Apache Hudi

  38. Next Steps for You...
    • Ahana Cloud is available on the AWS Marketplace
    • Try the sample code on your own tables
    • Sign up for a 14-day free trial here: https://ahana.io/sign-up
    • Community edition
    • PrestoCon Day 2023 CFPs are open!

  39. Thank you!
    Stay Up-to-Date with Ahana
    Website: https://ahana.io/
    Blogs: https://ahana.io/blog/
    Twitter: @ahanaio


  40. How to get involved with Presto
    Join the Slack channel: prestodb.slack.com
    Write a blog for prestodb.io: prestodb.io/blog
    Join the virtual meetup group & present: meetup.com/prestodb
    Contribute to the project: github.com/prestodb

  41. How to get involved with Apache Hudi
    Join the Hudi Slack channel: https://bit.ly/hudi-slack-channel
    Create Hudi content: hudi.apache.org/blog, hudi.apache.org/videos
    Join the virtual meetup group & present: meetup.com/seattle-data-engineering-meetup-group/
    Contribute to the project: github.com/apache/hudi

  42. Questions?


  43. Open Data Lakehouse (diagram)
    Data is ingested via Spark (CREATE TABLE, ingest data, any changes) into
    Hudi tables on cloud data lakes, and queried through a unified SQL engine
    and SQL editor.