Slide 1

Slide 1 text

How to Build an Open Data Lakehouse Analytics Stack
Shawn Gordon, Sr. Developer Advocate

Slide 2

Slide 2 text

Why Open Data Lake Analytics?
• Enterprise Data
• Beyond Enterprise Data: IoT, Third-party, Telemetry, Event
• 1000X More Data: Terabytes to Petabytes
• Open & Flexible: Open Source, Open Formats
[Diagram: Data Warehouse serving Reporting & Dashboarding; Open Data Lakes serving Reporting & Dashboarding, Data Science, and In-data lake transformation]

Slide 3

Slide 3 text

What Is Open Data Lake Analytics?

Slide 4

Slide 4 text

The Traditional Data Warehouse
• Relational Database
• Columnar Structure
• In-Database Analytics
• Structured Data
• Modeled Data
• Extract, Transform, Load
• SQL Access
Challenges
• Expensive
• Difficult to Manage
• Costly to Maintain
• Limited Data
• Limited Access

Slide 5

Slide 5 text

The Drivers Behind Modernization
• Digital Transformation: More Data
• Real Time Events: Fast Data
• Modern Processing Techniques: Smart Data
The Deconstructed Database

Slide 6

Slide 6 text

The Traditional Data Lake
• File System Data Store / Object Store
• Structured / Semi-Structured Data
• Ingestion
• Discovery
• Data Science
• Notebook and Python Access
• Less expensive, but…
• Good enough performance
• Supports ~70% of DW workloads
• Different approach to governance

Slide 7

Slide 7 text

[Diagram: Data flows into a Data Warehouse (SQL query processing, 1-10 TB) serving Reporting & Dashboarding, and into a Cloud Data Lake (data processing, 1 TB to PB) serving Open Data Lake Analytics: Reporting & Dashboarding, Data Science, In-data lake transformation]

Slide 8

Slide 8 text

[Diagram: The Data Platform. Sources: Operational Data Stores, Third Party Data, Streaming & IoT Data, Semi-structured | unstructured Data. Pipelines: ETL, ELT, Data Engineering. Data Warehouse: Storage + Compute, SQL Query Processing, SQL Structured Workloads, 1-10 TB. Data Lake: Storage + Compute, Query & Processing, SQL Query Processing, 1 TB to PB. Also shown: Machine Learning, Virtualization / Federated Access. Consumers: Reporting, Dashboards, Visualizations, Notebooks, Custom Apps]

Slide 9

Slide 9 text

Cloud data lakes are driving open source SQL query engines
Presto is the de facto SQL engine for data lakes
https://db-engines.com/en/ranking_trend/relational+dbms

Slide 10

Slide 10 text

Merging the Data Warehouse and the Data Lake with a Distributed Query Engine
1. SQL Access
2. Data Lake and Data Warehouse Access
3. Unified Analytics
4. Distributed Queries
5. Limitless Scale
6. Complex Data Types
• Leverage Resources
• Better Insight
• More Use Cases
• Leverage Platforms
• Remove Limits
• Amplified Insight
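
The idea of one SQL layer over both a data warehouse and a data lake can be sketched in a few lines. This is only an illustration, not Presto itself: it uses Python's built-in sqlite3 module, with SQLite's ATTACH standing in for the separate catalogs a distributed query engine would expose. The database names (warehouse, lake), tables, and rows are all hypothetical.

```python
import sqlite3

# Stand-in for a distributed query engine: one SQL session that can see
# two separately stored databases. SQLite's ATTACH plays the role that
# catalogs play in an engine such as Presto; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")
conn.execute("ATTACH DATABASE ':memory:' AS lake")

# Modeled, structured data, as it might live in a data warehouse.
conn.execute("CREATE TABLE warehouse.customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO warehouse.customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)

# Raw event data, as it might land in a data lake.
conn.execute("CREATE TABLE lake.events (customer_id INTEGER, kind TEXT)")
conn.executemany(
    "INSERT INTO lake.events VALUES (?, ?)",
    [(1, "login"), (1, "purchase"), (2, "login")],
)

# One SQL statement joins across both stores.
rows = conn.execute(
    """
    SELECT c.name, COUNT(*) AS event_count
    FROM warehouse.customers AS c
    JOIN lake.events AS e ON e.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()

for name, event_count in rows:
    print(name, event_count)
```

A real distributed engine would also split the work across many workers and read open file formats from object storage; the sketch only shows the single-query, multiple-store access pattern.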

Slide 11

Slide 11 text

Data Lake Use Cases

Slide 12

Slide 12 text

Use Cases / Emerging Use Cases
• Data Lakehouse analytics
• Reporting & dashboarding
• Interactive querying use cases
• Transformation using SQL (ETL)
• Federated access across data sources
• SQL Data Science
• Customer-facing app analytics

Slide 13

Slide 13 text

The Data Lakehouse Components
• Dashboard / notebooks
• Compute / Query engine
• Operational catalog
• Table Format / Transaction manager
• Storage

Slide 14

Slide 14 text

The Data Lakehouse Stack

Slide 15

Slide 15 text

Considerations for Open Analytics Decision
Data, Analytics, Users, Platform
Platform: Cloud, Enterprise, Business, Cost
© 2021 Enterprise Management Associates, Inc. | @ema_research

Slide 16

Slide 16 text

Considerations for Any Unified Analytics Decision
Data: Structured, Semi-Structured, Real Time, Structured, Complex Data Types, Textual, Streaming

Slide 17

Slide 17 text

Considerations for Any Unified Analytics Decision
Data, Analytics, Users, Platform
Analytics: SQL, Python, Notebook, Search

Slide 18

Slide 18 text

Considerations for Any Unified Analytics Decision
Data, Analytics, Users, Platform
Users: Engineer, Analyst, Scientist, Business

Slide 19

Slide 19 text

Considerations for Any Unified Analytics Decision
Data, Analytics, Users, Platform
Platform: Cloud, Enterprise, Business, Cost

Slide 20

Slide 20 text

Considerations for Any Unified Analytics Decision
Platform: Cloud, Enterprise, Business, Cost
Cloud: Elasticity, Scale, Mobility, Globality

Slide 21

Slide 21 text

Considerations for Any Unified Analytics Decision
Platform: Cloud, Enterprise, Business, Cost
Enterprise: Security, Privacy, Governance, Unification

Slide 22

Slide 22 text

Considerations for Any Unified Analytics Decision
Platform: Cloud, Enterprise, Business, Cost
Business: Semantics, Logic, Value, Optimization

Slide 23

Slide 23 text

Considerations for Any Unified Analytics Decision
Platform: Cloud, Enterprise, Business, Cost
Cost: Forecast, Containment, Chargeback, Scale

Slide 24

Slide 24 text

Challenges with SQL on Open Data Lakes
Cloud DW / AWS serverless options get very expensive for growing data volumes
▪ Cloud data warehouse costs grow much faster than compute engine costs
▪ Serverless options like AWS Athena charge per query and get expensive
“Do it yourself” approach is complicated
▪ Big data skills in platform teams are limited
▪ Presto is complicated and operationally very time consuming
Presto on AWS, like AWS Athena, has limited capabilities and doesn’t scale
▪ Limited concurrency of 20 per account
▪ No visibility into cluster logs or query logs; no flexibility or control over scale
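
The cost point above can be made concrete with a little arithmetic. The sketch below compares a pay-per-query model with a cluster billed by running time and finds the query volume where they break even. It assumes per-query billing is proportional to data scanned. Every number in it is a hypothetical placeholder, not vendor pricing, and the function names are made up for illustration.

```python
def per_query_monthly_cost(queries, tb_scanned_per_query, price_per_tb):
    """Cost when billing scales with every query (pay-per-scan model)."""
    return queries * tb_scanned_per_query * price_per_tb


def cluster_monthly_cost(nodes, price_per_node_hour, hours_running):
    """Cost when billing depends on how long a cluster runs, not query count."""
    return nodes * price_per_node_hour * hours_running


def break_even_queries(cluster_cost, tb_scanned_per_query, price_per_tb):
    """Query count above which the per-query model costs more than the cluster."""
    return cluster_cost / (tb_scanned_per_query * price_per_tb)


# All figures below are hypothetical placeholders, not vendor pricing.
PRICE_PER_TB = 5.0   # assumed charge per TB scanned
TB_PER_QUERY = 0.5   # assumed average scan size per query
cluster = cluster_monthly_cost(nodes=4, price_per_node_hour=1.0, hours_running=200)
threshold = break_even_queries(cluster, TB_PER_QUERY, PRICE_PER_TB)

for queries in (100, 1_000, 10_000):
    serverless = per_query_monthly_cost(queries, TB_PER_QUERY, PRICE_PER_TB)
    print(f"{queries:>6} queries: per-query ${serverless:,.0f} vs cluster ${cluster:,.0f}")

print("break-even query count:", threshold)
```

With these placeholder inputs the per-query model is cheaper at low volume and more expensive past the break-even point; real numbers depend on actual scan sizes, pricing, and how long a cluster sits idle.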

Slide 25

Slide 25 text

Presto & Presto Community

Slide 26

Slide 26 text

Open Source Presto Overview
• Distributed SQL query engine
• Created at Facebook
• ANSI SQL on databases and data lakes
• Designed for interactive queries and access to petabytes of data
• Open source, hosted at https://github.com/prestodb

Slide 27

Slide 27 text

Presto Users

Slide 28

Slide 28 text

Ahana Overview

Slide 29

Slide 29 text

How Does Ahana Cloud Work?
~30 mins to create the compute plane
https://app.ahana.cloud/signup
Create Presto clusters in your account

Slide 30

Slide 30 text

Ahana Cloud for Presto
Ahana Cloud Account: Ahana Console (Control Plane)
• Cluster orchestration
• Consolidated logging
• Security & access
• Billing & support
The Ahana console oversees and manages every Presto cluster.
Customer Cloud Account: In-VPC Presto Clusters (Compute Plane)
• Ad hoc: Cluster 1
• Test: Cluster 2
• Prod: Cluster N
• Glue, S3, RDS, Elasticsearch
In-VPC orchestration of Presto clusters, where metadata, monitoring, and data sources reside.

Slide 31

Slide 31 text

Ahana Cloud Overview
1. Ahana Managed Service Console
2. Add data sources
3. Query data where it lives (in place) with Federated Connectors
4. Cluster management
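
For context on what "adding a data source" means underneath, open source Presto registers each source as a catalog described by a small properties file. The sketch below renders one for a MySQL source. It describes plain open source Presto rather than the Ahana console, and the catalog name (sales), host, user, password, and the render_catalog_properties helper are all hypothetical.

```python
def render_catalog_properties(properties):
    """Render a dict as the key=value lines of a Presto catalog file."""
    return "\n".join(f"{key}={value}" for key, value in properties.items()) + "\n"


# Hypothetical MySQL data source; host and credentials are placeholders.
mysql_catalog = {
    "connector.name": "mysql",
    "connection-url": "jdbc:mysql://mysql.example.internal:3306",
    "connection-user": "presto_reader",
    "connection-password": "change-me",
}

# In open source Presto this text would be saved as etc/catalog/sales.properties,
# which makes the source addressable as the catalog "sales".
text = render_catalog_properties(mysql_catalog)
print(text)

# Tables are then referenced as catalog.schema.table in SQL.
print("SELECT * FROM sales.crm.customers LIMIT 10")
```

Because every source becomes a catalog, a single query can reference tables from several catalogs at once, which is what querying data in place relies on.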

Slide 32

Slide 32 text

Case study: Securonix
[Diagram: NextGen SIEM, Cluster, AWS S3 Data Lake, Glue Metastore]
▪ Securonix provides security information and event management (SIEM) software
▪ They use Ahana for in-app SQL analytics on data from AWS S3 for threat hunting
▪ They pull in billions of events per day that get stored in S3
▪ With Ahana Cloud, they saw 3x better price performance compared with Presto on AWS

Slide 33

Slide 33 text

Ahana Cloud for Presto - Summary
▪ Brings SQL to AWS S3 with an open data lake
▪ Presto compute brought to your data in your VPC in your account
▪ Fully managed Presto cluster life cycle including idle-time management
▪ Query AWS DBs: RDS/MySQL, RDS/Postgres, Elasticsearch, Redshift
▪ Cloud-native and highly available running on Kubernetes
▪ Bring your own
  ▪ BI tool / Data Science Notebook
  ▪ Metadata Catalog
  ▪ Transaction Manager
Easy to use | 3x Price Performance | Open & Flexible

Slide 34

Slide 34 text

3/31/22