Slide 1

HANDS-ON VIRTUAL LAB
Building an Open Data Lakehouse with Presto, Hudi, and AWS S3

Slide 2

WELCOME TO THE AHANA HANDS-ON VIRTUAL LAB
Your Lab Guides:
Jalpreet Singh Nanda, Software Engineer ([email protected])
Sivabalan Narayanan, Software Engineer ([email protected])

Slide 3

Objective for Today
Over the next 90 minutes you will explore and understand how to build a Data Lakehouse using Presto, Hudi, and S3 in a hands-on lab environment.

Slide 4

Agenda
1) Understand the technology (20-25 mins)
   a) What is a Data Lakehouse?
   b) Presto and Hudi overview
2) Getting your hands dirty (90 mins)
   a) Set up Hudi on Presto (nothing to do; it ships out of the box)
   b) Write and read data using Spark in Hudi format
   c) Apply inserts, updates, and deletes to Hudi
   d) Query Hudi data in S3 with Presto
   e) Set up the data streamer utility (DeltaStreamer) to ingest data seamlessly (autopilot)
3) Summary and close out (5-10 mins)

Slide 5

Understanding the Technology
Presto & Hudi

Slide 6

The Open Data Lakehouse
[Architecture diagram: SQL query processing serving reporting & dashboarding, data science, and in-data-lake transformation, on top of a cloud data lake scaling from TBs to PBs, with open data lake security, reliability, and governance]

Slide 7

What is Presto?
• Open source, distributed MPP SQL query engine
• Query in place
• Federated querying: a pluggable architecture supporting many connectors
• ANSI SQL compliant
• Designed from the ground up for fast analytic queries against data of any size
• Originally developed at Facebook
• Proven on petabytes of data
• SQL-on-anything
• Open source, hosted on GitHub: https://github.com/prestodb

Slide 8

Presto Overview
[Diagram: a Presto cluster consists of a Coordinator and multiple Workers]

Slide 9

Presto Use Cases
• Data lakehouse analytics
• Reporting & dashboarding
• Interactive ad hoc querying
• Transformation using SQL (ETL)
• Federated querying across data sources

Slide 10

Apache Hudi: Origins @ Uber, 2016
Context
❏ Uber in hypergrowth
❏ Moving from warehouse to lake
❏ HDFS/cloud storage is immutable
Problems
❏ Extremely poor ingest performance
❏ Wasteful reading/writing
❏ Zero concurrency control or ACID

Slide 11

Upserts, Deletes, Incrementals

Slide 12

What is Apache Hudi?
- Provides an abstraction over your lake storage
- Completely serverless
- Built with streaming as a first principle
- Incremental, efficient ETL downstream
- Self-managing services
- Supports many query engines
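Agenda item 2b previews writing data with Spark in Hudi format. Here is a minimal sketch of such a write in PySpark, following the option keys in the Hudi quickstart; the S3 path, table name, and field names (uuid, city, ts) are hypothetical stand-ins for the lab dataset.

```python
# Minimal PySpark upsert into a Hudi table on S3. The session must be
# launched with the Hudi bundle on the classpath, e.g.:
#   pyspark --packages org.apache.hudi:hudi-spark-bundle_2.12:<version>
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-write").getOrCreate()

base_path = "s3://my-lab-bucket/hudi/trips"  # hypothetical S3 location
df = spark.createDataFrame(
    [("id-1", "san_francisco", 1654000000), ("id-2", "chennai", 1654000001)],
    ["uuid", "city", "ts"],
)

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "uuid",      # unique record key
    "hoodie.datasource.write.partitionpath.field": "city",  # partition column
    "hoodie.datasource.write.precombine.field": "ts",       # dedupe ordering
    "hoodie.datasource.write.operation": "upsert",          # merge inserts/updates
}
df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

Re-running the same write with changed values for an existing uuid updates that record in place, which is exactly the upsert behavior the slide describes.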

Slide 13

Hudi for the Data Lake
Supports two table types: Copy on Write (COW) and Merge on Read (MOR).

Hudi Table Type    Queries supported
Copy-On-Write      Snapshot, Incremental
Merge-On-Read      Snapshot, Incremental, Read Optimized
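The table type is picked at write time via a single Hudi option; a minimal sketch, extending the hypothetical hudi_options dict from the write example above:

```python
# COPY_ON_WRITE is the default; switch to MERGE_ON_READ for write-heavy,
# latency-sensitive ingestion (updates land in log files and are merged
# at query time or by compaction).
hudi_options["hoodie.datasource.write.table.type"] = "MERGE_ON_READ"
```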

Slide 14

Hudi Table: Copy On Write
[Diagram: a COW table across three commits]
commit time=0 (Insert A, B, C, D, E): file1_t0.parquet (A, B), file2_t0.parquet (C, D), file3_t0.parquet (E)
commit time=1 (Update A => A', D => D'): rewritten as file1_t1.parquet (A', B), file2_t1.parquet (C, D')
commit time=2 (Update A' => A'', E => E'; Insert F): rewritten as file1_t2.parquet (A'', B), file3_t2.parquet (E', F)
Snapshot query results: t0: A,B,C,D,E; t1: A',B,C,D',E; t2: A'',B,C,D',E',F
Incremental query results: t0: A,B,C,D,E; t1: A',D'; t2: A'',E',F
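For concreteness, here is roughly how the two query types on this slide look from PySpark, reusing the hypothetical trips table from the earlier sketch; the option keys follow the Hudi docs, and the begin-instant value below is a placeholder for a real commit timestamp from the table's timeline.

```python
# Snapshot query: the latest value of every record
# (at commit time=2 this is A'', B, C, D', E', F).
snapshot_df = spark.read.format("hudi").load(base_path)

# Incremental query: only records changed after a given commit instant
# (after t1 this is just A'', E', F). Real instants are timestamps like
# "20220601120000"; "1" stands in for the slide's t1.
incr_opts = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": "1",
}
incr_df = spark.read.format("hudi").options(**incr_opts).load(base_path)
```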

Slide 15

Hudi Table: Merge On Read
[Diagram: a MOR table across three commits plus compaction]
commit time=0 (Insert A, B, C, D, E): base files file1_t0.parquet (A, B), file2_t0.parquet (C, D), file3_t0.parquet (E)
commit time=1 (Update A => A', D => D'): log files .file1_t1.log (A'), .file2_t1.log (D')
commit time=2 (Update A' => A'', E => E'; Insert F): log files .file1_t2.log (A''), .file3_t2.log (E', F)
commit time=3 (Compaction): base files file1_t3.parquet (A'', B), file2_t3.parquet (C, D'), file3_t3.parquet (E', F)
Snapshot query (merges base + log files): t0: A,B,C,D,E; t1: A',B,C,D',E; t2 and t3: A'',B,C,D',E',F
Read Optimized query (base files only): A,B,C,D,E until compaction; A'',B,C,D',E',F after commit time=3
Incremental query: t0: A,B,C,D,E; t1: A',D'; t2 and t3: A'',E',F
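The extra query type MOR adds, sketched in the same style as the previous example (option key per the Hudi docs):

```python
# Read optimized query on a MOR table: reads only the compacted base
# files and skips the log files. Faster scans, but stale until
# compaction (A,B,C,D,E before commit time=3; A'',B,C,D',E',F after).
ro_df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "read_optimized")
    .load(base_path)
)
```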

Slide 16

Hudi for the Data Lake
Choose Copy On Write if:
- Write cost is not an issue, but you need fast reads
- Workload is fairly well understood and not bursty
- You are bound by Parquet ingestion performance
- You want something simple to operate
Choose Merge On Read if:
- You need quick ingestion
- Workload can be spiky or can change in pattern
- You have some operational chops
- You need both read optimized and real-time views

Slide 17

Hudi for the Data Lake: DeltaStreamer
A self-managed ingestion tool for Hudi.
Sources: Kafka, S3, DFS-based sources, etc.
Step 1: Determine the checkpoint to consume from
Step 2: Consume from the source
Step 3: Apply any transformations
Step 4: Ingest to Hudi
Repeat steps 1 to 4 (in continuous mode)

Slide 18

Data streamer utility
A self-managed ingestion tool to ingest data into Hudi:
- Checkpointing
- Various connectors: Kafka, DFS log files, upstream tables, ...
- Fault tolerant and reliable (auto rollbacks)
- Automatic cleaning and compaction
- Filtering & SQL-based transformation
- Multi-table ingestion
- Sync-once or continuous mode
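DeltaStreamer ships in the hudi-utilities bundle and runs as a Spark job. A minimal sketch of launching it from Python; the flags follow the HoodieDeltaStreamer documentation, while the bundle version, S3 paths, and Kafka properties file are hypothetical.

```python
# Launch DeltaStreamer via spark-submit from Python; in practice you
# would run the same spark-submit command directly from a shell.
import subprocess

subprocess.run([
    "spark-submit",
    "--class", "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
    "hudi-utilities-bundle_2.12-0.11.0.jar",     # hypothetical version
    "--table-type", "COPY_ON_WRITE",
    "--source-class", "org.apache.hudi.utilities.sources.JsonKafkaSource",
    "--source-ordering-field", "ts",             # precombine/ordering field
    "--target-base-path", "s3://my-lab-bucket/hudi/trips",
    "--target-table", "trips",
    "--props", "s3://my-lab-bucket/conf/kafka-source.properties",
    "--continuous",                              # keep polling the source
], check=True)
```

Dropping --continuous gives the sync-once mode mentioned above: the job catches up from the last checkpoint and exits.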

Slide 19

The Hudi Platform
[Layered architecture diagram]
Platform Services: streaming/batch ingest, various sources, catalog sync, admin CLI, data quality, ...
SQL Query Engines (Execution/Runtimes): Spark, Flink, Hive, Presto, Trino, Impala, Redshift, BigQuery, Snowflake, ...
Transactional Database Layer:
- Metaserver (stats, table service coordination, ...)
- Lake Cache (columnar, transactional, mutable, WIP, ...)
- Table Format (schema, file listings, stats, evolution, ...)
- Indexes (Bloom filter, HBase, bucket index, hash based, Lucene, ...)
- Table Services (cleaning, compaction, clustering, indexing, file sizing, ...)
- Concurrency Control (OCC, MVCC, non-blocking, lock providers, orchestration, scheduling, ...)
Open File/Data Formats: Parquet, HFile, Avro, ORC, ...
Lake Storage: cloud object stores, HDFS, ...

Slide 20

Ahana Cloud
Fully-Managed Presto Service

Slide 21

Ahana Cloud for Presto
1. Zero to Presto in 30 minutes. Managed cloud service: no installation and configuration.
2. Built for data teams of all experience levels.
3. Moderate level of control over the deployment, without complexity.
4. Dedicated support from Presto experts.

Slide 22

Managing Presto Is Complex
Hadoop complexity
▪ /etc/presto/config.properties
▪ /etc/presto/node.properties
▪ /etc/presto/jvm.config
▪ Many hidden parameters that are difficult to tune
Just the query engine
▪ No built-in catalog: users need to manage a Hive metastore or AWS Glue
▪ No data lake S3 integration
Poor out-of-box performance
▪ No tuning
▪ No high-performance indexing
▪ Only basic optimizations, even for common queries

Slide 23

Ahana Cloud for Presto
[Architecture diagram]
Ahana Cloud Account: the Ahana Console (Control Plane) provides cluster orchestration, consolidated logging, security & access, and billing & support. The Ahana console oversees and manages every Presto cluster.
Customer Cloud Account: In-VPC Presto Clusters (Compute Plane), e.g. an ad hoc cluster, a test cluster, and a prod cluster, alongside Glue, S3, RDS, and Elasticsearch. In-VPC orchestration of Presto clusters, where metadata, monitoring, and data sources reside.

Slide 24

Getting Your Hands Dirty

Slide 25

Ahana Cloud Presto (Access Information)
Ahana Compute Plane and Superset access:
- URL: https://app.ahana.cloud (Compute Plane)
- URL: https://superset.ahanademo1.cp.ahana.cloud/superset/sqllab/ (Superset)
- Email address: [email protected]
- Password: Rapt0rX!
Jupyter Notebook:
- Password: Summ3r$un
CLI / JDBC access:
- Cluster username: presto
- Cluster password: Summ3r$un
TEMPORARY AWS access keys (needs AWS CLI):
- AWS_SECRET_ACCESS_KEY="QwD+N1XL/s1p/vUhWLqXCGUJy0Ko9myE18iEsy3N"
- AWS_ACCESS_KEY_ID="AKIATHY6BJDWF2PZJQDI"
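As a sketch of the CLI/JDBC access above from Python, using the presto-python-client package (pip install presto-python-client); the host and catalog names are assumptions, so copy the real cluster endpoint from the Ahana console.

```python
import prestodb

# Connect to the lab's Presto cluster over HTTPS with basic auth.
conn = prestodb.dbapi.connect(
    host="ahanademo1.cp.ahana.cloud",  # hypothetical endpoint
    port=443,
    http_scheme="https",
    user="presto",
    auth=prestodb.auth.BasicAuthentication("presto", "Summ3r$un"),
    catalog="ahana_hive",              # hypothetical catalog name
    schema="default",
)
cur = conn.cursor()
cur.execute("SHOW CATALOGS")
for row in cur.fetchall():
    print(row)
```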

Slide 26

Mechanisms for Capturing Change Data
a. Ideal / comprehensive: native CDC (Change Data Capture). Oracle, MySQL, Postgres, MSSQL, Salesforce, MongoDB, etc. all provide change log feeds.
   i. Robust, but complicated and systemically expensive
   ii. Streaming changes
   iii. Can segregate inserts, updates, and deletes
b. Simple / not full-fledged: incremental pulls based on a "last modified date" or incremental primary keys (see the sketch after this list).
   i. Not all tables have an auto-updated last-modified-date column
   ii. Polling based
   iii. No segregation of inserts and updates (every record is an upsert)
   iv. Cannot identify deletes unless they are logical deletions
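A minimal sketch of option (b), assuming a MySQL source table with a last_modified column; the connection details, table name, and checkpoint handling are hypothetical.

```python
# Incremental pull: poll for rows modified since the last checkpoint.
# Requires the MySQL JDBC driver on the Spark classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-pull").getOrCreate()

last_checkpoint = "2022-06-01 00:00:00"  # normally persisted durably

changed = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/sales")
    .option("user", "etl_user")
    .option("password", "etl_password")
    .option("query",
            f"SELECT * FROM orders WHERE last_modified > '{last_checkpoint}'")
    .load()
)
# Every fetched row is treated as an upsert; hard deletes never show up
# in this feed, which is exactly limitation (iv) above.
```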

Slide 27

Tools for CDC
• Oracle GoldenGate
• Qlik (Attunity)
• AWS DMS
• Fivetran, Matillion, StreamSets, Talend, etc.
• Debezium (open source)
• Others...

Slide 28

Demo
• Your instructor will walk you through the DMS setup for Change Data Capture that has been created for the lab
• The intention is ONLY to demonstrate ONE of the ways this can be set up

Slide 29

Presto Entity Hierarchy Cheat Sheet
1. catalog -> schema -> table
   select * from catalog.schema.table;
2. Show catalogs:
   show catalogs;
3. Show schemas in a catalog:
   show schemas from <catalog>;
4. Show tables in a schema:
   show tables from <catalog>.<schema>;
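Putting the hierarchy and the earlier Python connection together, a sketch of a federated query that joins across two catalogs; every catalog, schema, and table name here is a hypothetical stand-in for the ones provisioned in the lab.

```python
# One SQL statement spanning Hudi data on S3 and an operational database;
# Presto resolves each catalog.schema.table through its own connector.
cur = conn.cursor()
cur.execute("""
    SELECT o.order_id, o.amount, c.customer_name
    FROM ahana_hive.lab.orders_hudi o   -- Hudi table on S3
    JOIN mysql_cdc.sales.customers c    -- operational database
      ON o.customer_id = c.customer_id
    LIMIT 10
""")
print(cur.fetchall())
```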

Slide 30

Wrapping Up
Conclusion

Slide 31

Evolution of Hudi and Presto Integration

Slide 32

Hudi in Presto (>= 0.272)
❏ Metadata-based listing of Hudi tables, enabled via a session property
❏ A new DirectoryLister for Hudi that uses HoodieTableFileSystemView to fetch data files
❏ The file system view is loaded once and cached
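A sketch of toggling such a session property from the Python client. The property name below is an assumption (it differs across Presto versions), so run SHOW SESSION on your cluster to find the real one.

```python
# Enable metadata-based listing for Hudi tables, for this session only.
cur = conn.cursor()
cur.execute("SET SESSION hive.hudi_metadata_enabled = true")  # assumed name
cur.fetchall()  # drain the result so the statement completes
cur.execute("SELECT count(*) FROM ahana_hive.lab.orders_hudi")
print(cur.fetchall())
```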

Slide 33

Hudi in Presto (master...)
❏ On par with presto-hive
❏ Adoption of new Hudi features and optimizations
❏ Performance
❏ Stability

Slide 34

Hudi-Presto Connector
❏ Hive connector
   ❏ COW and MOR, snapshot and read optimized queries
   ❏ Recognizing the Hudi file layout through native integration
   ❏ File listing based on Hudi table metadata
❏ New Hudi connector (merged on master)
   ❏ COW (snapshot) and MOR (read optimized) querying
   ❏ Opens up more optimization opportunities, e.g. data skipping and multi-modal indexes

Slide 35

Adoption of New Hudi Features
Metadata table
- Efficient metadata fetching
- Uses HFile for faster scans
- Eliminates recursive file listing operations
- Pruning using column stats
Clustering (optimize data layout alongside ingestion)
- Problem 1: faster ingestion -> smaller file sizes
- Problem 2: data locality for queries (e.g., by city) ≠ ingestion order (e.g., trips by time)
- Auto sizing, data reorganization, no compromise on ingestion
- Space-filling curves have shown up to a 90% reduction in the number of rows scanned
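A sketch of the writer-side configs that switch these features on, extending the earlier hypothetical PySpark write; the keys follow the Hudi 0.11.x config reference, but verify them against your version before relying on them.

```python
feature_options = {
    # Metadata table: eliminates recursive file listings on the writer side
    "hoodie.metadata.enable": "true",
    # Column stats index in the metadata table, used for pruning
    "hoodie.metadata.index.column.stats.enable": "true",
    # Inline clustering: reorganize data layout alongside ingestion
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    # Sort by the query's locality column (e.g., city) instead of arrival order
    "hoodie.clustering.plan.strategy.sort.columns": "city",
}
(df.write.format("hudi")
   .options(**hudi_options)      # base options from the earlier write sketch
   .options(**feature_options)
   .mode("append")
   .save(base_path))
```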

Slide 36

Coming soon!
Querying change streams
- First-class support for incremental change streams
- Incremental query using the Hudi connector
- Scan only relevant files in the "incremental query window"
Data skipping
- Scan record ranges based on their keys' prefixes
- Locality pertaining to the predicate column
- Avoid reading the full index: O(#predicate columns) vs O(#table columns)
Write support with Presto

Slide 37

Engage with the community
User docs: https://hudi.apache.org
Technical wiki: https://cwiki.apache.org/confluence/display/HUDI
GitHub: https://github.com/apache/hudi/
Twitter: https://twitter.com/apachehudi
Mailing lists: [email protected] (send an empty email to subscribe); [email protected] (actual mailing list)
Slack: https://join.slack.com/t/apache-hudi/signup
Community syncs: https://hudi.apache.org/community/syncs

Slide 38

Note about creating Hudi tables in Glue
1. To interact directly via Glue ETL / Spark you need the Glue Hudi connector from the Marketplace (free): https://aws.amazon.com/marketplace/pp/prodview-6rofemcq6erku
2. Alternatively, you can create Hudi tables in Glue using:
   a. The Glue API
   b. The AWS Glue command line: https://docs.aws.amazon.com/cli/latest/reference/glue/create-table.html
   c. The Athena web console

Slide 39

Conclusion
In this hands-on workshop you have:
1. Learned about Ahana Cloud, the easiest-to-use Presto experience in the cloud
2. Learned the whys and hows of Hudi

Slide 40

Next Steps for You...
• Ahana Cloud is available on the AWS Marketplace
• Try the sample code on your own tables
• Sign up for a 14-day free trial here: https://ahana.io/sign-up

Slide 41

Thank you!
Stay up-to-date with Ahana:
Website: https://ahana.io/
Blogs: https://ahana.io/blog/
Twitter: @ahanaio

Slide 42

How to get involved with Presto
Join the Slack channel! prestodb.slack.com
Write a blog for prestodb.io! prestodb.io/blog
Join the virtual meetup group & present! meetup.com/prestodb
Contribute to the project! github.com/prestodb

Slide 43

Questions?