
UNLOCKING THE VALUE OF YOUR DATA LAKE

Ahana
October 06, 2021



Transcript

  1. Unlocking the Value of Your Data Lake
    Dipti Borkar
    Cofounder, Chief Product Officer & Chief Evangelist
    Chairperson | Community Team, Presto Foundation
  2. Today’s Speaker
    Dipti Borkar, Cofounder, Chief Product Officer and Chief Evangelist, Ahana
    Dipti is a Cofounder, CPO & Chief Evangelist of Ahana with over 15 years of experience in distributed data and database technology, including relational, NoSQL and federated systems. She is also the Presto Foundation Outreach Chairperson. Prior to Ahana, Dipti held VP roles at Alluxio, Kinetica and Couchbase. At Alluxio, she was Vice President of Products, and at Couchbase she held several leadership positions, including VP of Product Marketing, Head of Global Technical Sales and Head of Product Management. Earlier in her career, Dipti managed development teams at IBM DB2 Distributed, where she started as a database software engineer. Dipti holds an M.S. in Computer Science from UC San Diego and an MBA from the Haas School of Business at UC Berkeley.
  3. The Traditional Data Warehouse
    • Relational Database
    • Columnar Structure
    • In-Database Analytics
    • Structured Data
    • Modeled Data
    • Extract, Transform, Load
    • SQL Access
    Challenges
    • Expensive
    • Difficult to Manage
    • Costly to Maintain
    • Limited Data
    • Limited Access
  4. The Drivers Behind Modernization
    Digital Transformation • Real Time Events • Modern Processing Techniques
    More Data • Fast Data • Smart Data
    The Deconstructed Database
  5. Why Open Data Lake Analytics?
    • Enterprise Data
    • Beyond Enterprise Data: IoT, Third-party, Telemetry, Event
    • 1000X More Data: Terabytes to Petabytes
    • Open & Flexible: Open Source, Open Formats
    Diagram labels: Data Warehouse • Open Data Lakes • Reporting & Dashboarding • Data Science • In-data lake transformation
  6. The Traditional Data Lake
    • File System Data Store / Object Store
    • Structured / Semi-Structured Data
    • Ingestion
    • Discovery
    • Data Science
    • Notebook and Python Access
    • Less expensive, but…
    • Good enough performance
    • Supports ~70% of DW workloads
    • Different approach to governance
  7. The Next Data Warehouse is Open Data Lake Analytics
    Diagram labels: Data • SQL Query Processing • Data Warehouse • Cloud Data Lake • Data Processing • 1-10 TB • 1TB -> PB • Open Data Lake Analytics • Reporting & Dashboarding • Data Science • In-data lake transformation
  8. The Data Platform
    Diagram labels: Data Warehouse • Operational Data Stores • Third Party Data • Machine Learning • Semi- | unstructured Data • Virtualization / Federated Access • Streaming & IoT Data • SQL Query Processing • ETL • ELT • Data Engg • Storage • Compute • 1-10 TB • Query & Processing • SQL • Structured Workloads • 1TB -> PB • Data Lake • Reporting • Dashboards • Visualizations • Notebooks • Custom Apps
  9. Cloud data lake driving open source SQL query engines
    Presto is the De-Facto SQL Engine for Data Lakes
    https://db-engines.com/en/ranking_trend/relational+dbms
  10. Similarities with Modern Data Warehouse & The Modern Data Lake
    • Cloud-First • In-Memory Capabilities • Complex Data Types • Separate Storage & Compute • Expanded Analytics • Improved Performance • Storage Options • SQL Access
    • Cloud-First • In-Memory Capabilities • Columnar Data Types • Separate Storage & Compute • Expanded Analytics • Improved Performance • Storage Options • SQL Access
  11. Merging the Data Warehouse and the Data Lake with a Distributed Query Engine
    1. SQL Access
    2. Data Lake and Data Warehouse Access
    3. Unified Analytics
    4. Distributed Queries
    5. Limitless Scale
    6. Complex Data Types
    • Leverage Resources
    • Better Insight
    • More Use Cases
    • Leverage Platforms
    • Remove Limits
    • Amplified Insight
  12. Use Cases
    • Data Lakehouse analytics
    • Reporting & dashboarding
    • Interactive querying use cases
    • Transformation using SQL (ETL)
    • Federated access across data sources
    • SQL Data Science
    • Customer-facing app analytics
    Diagram label: Emerging use cases
  13. Considerations for Open Analytics Decision
    Data • Analytics • Users • Platform • Cloud • Enterprise • Business • Cost
    © 2021 Enterprise Management Associates, Inc. | @ema_research
  14. Considerations for Any Unified Analytics Decision
    Data
    Structured • Semi-Structured • Real Time Structured • Complex Data Types • Textual • Streaming
  15. Considerations for Any Unified Analytics Decision
    Data • Analytics • Users • Platform
    SQL • Python • Notebook • Search
  16. Considerations for Any Unified Analytics Decision
    Data • Analytics • Users • Platform
    Engineer • Analyst • Scientist • Business
  17. Considerations for Any Unified Analytics Decision
    Elasticity • Scale • Mobility • Globality
    Cloud • Enterprise • Business • Cost
  18. Considerations for Any Unified Analytics Decision
    Security • Privacy • Governance • Unification
    Cloud • Enterprise • Business • Cost
  19. Considerations for Any Unified Analytics Decision
    Semantics • Logic • Value • Optimization
    Cloud • Enterprise • Business • Cost
  20. Considerations for Any Unified Analytics Decision
    Forecast • Containment • Chargeback • Scale
    Cloud • Enterprise • Business • Cost
  21. Challenges with SQL on Open Data Lakes
    Cloud DW / AWS serverless options get very expensive for growing data volumes
    ▪ Cloud data warehouse costs grow much faster than compute engine costs
    ▪ Serverless options like AWS Athena charge per query and get expensive
    “Do it yourself” approach is complicated
    ▪ Big data skills in platform teams are limited
    ▪ Presto is complicated and operationally very time-consuming
    Presto on AWS, such as AWS Athena, has limited capabilities and doesn’t scale
    ▪ Limited concurrency of 20 per account
    ▪ No visibility into cluster logs or query logs; no flexibility or control over scale
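The cost argument on this slide can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes a serverless engine billed in proportion to the data its queries scan and a fixed-size cluster billed per node-hour. The function names and every price in it are hypothetical placeholders, not published rates.

```python
# Illustrative cost comparison. All prices are hypothetical placeholders.

HOURS_PER_MONTH = 730.0  # assumption: average number of hours in a month


def scan_billed_cost(tb_scanned: float, price_per_tb: float) -> float:
    """Monthly cost when billing is proportional to data scanned by queries."""
    return tb_scanned * price_per_tb


def cluster_cost(nodes: int, price_per_node_hour: float,
                 hours: float = HOURS_PER_MONTH) -> float:
    """Monthly cost of a fixed-size cluster that runs for `hours` hours."""
    return nodes * price_per_node_hour * hours


def break_even_tb(nodes: int, price_per_node_hour: float, price_per_tb: float,
                  hours: float = HOURS_PER_MONTH) -> float:
    """Scanned volume (TB/month) at which both billing models cost the same."""
    return cluster_cost(nodes, price_per_node_hour, hours) / price_per_tb


# Hypothetical inputs: 4 nodes at 0.50 per node-hour, 5.00 per TB scanned.
for tb in (10, 100, 1000):
    print(tb, scan_billed_cost(tb, 5.0), cluster_cost(4, 0.5))
print("break-even TB/month:", break_even_tb(4, 0.5, 5.0))
```

Under these made-up inputs the scan-billed cost grows linearly with volume while the cluster cost stays flat. The model ignores the extra nodes a cluster would need as volume grows, so it is a starting point for a comparison rather than a conclusion.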
  22. Open Source Presto Overview
    • Distributed SQL query engine
    • Created at [logo]
    • ANSI SQL on Databases, Data lakes
    • Designed to be interactive & access petabytes of data
    • Open source, hosted at https://github.com/prestodb
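Because Presto speaks SQL, client code can stay small. The sketch below is a minimal example, not Ahana's or Presto's own tooling: `run_query` is a hypothetical helper that accepts any Python DB-API 2.0 connection. An in-memory SQLite database stands in for the engine so the example runs without a cluster, and the `events` table is invented for illustration.

```python
import sqlite3
from typing import Any, List, Tuple


def run_query(conn: Any, sql: str) -> List[Tuple]:
    """Execute one SQL statement on a DB-API 2.0 connection and fetch all rows."""
    cur = conn.cursor()
    try:
        cur.execute(sql)
        return cur.fetchall()
    finally:
        cur.close()


# Stand-in connection so the sketch runs anywhere. Against a Presto cluster,
# a connection from a DB-API client library would be passed instead
# (assumption: such a client is installed and configured).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (kind TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?)",
    [("login",), ("login",), ("logout",)],
)

rows = run_query(
    conn,
    "SELECT kind, count(*) FROM events GROUP BY kind ORDER BY kind",
)
print(rows)
```

Keeping the helper independent of any one driver means the same calling code can be pointed at a different SQL engine by swapping the connection.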
  23. How does Ahana Cloud work?
    ~30 mins to create the compute plane
    https://app.ahana.cloud/signup
    Create Presto clusters in your account
  24. Ahana Cloud for Presto
    Ahana Console (Control Plane): Cluster Orchestration • Consolidated Logging • Security & Access • Billing & Support
    In-VPC Presto Clusters (Compute Plane): Ad Hoc Cluster 1 • Test Cluster 2 • Prod Cluster N
    Glue • S3 • RDS • Elasticsearch
    Ahana Cloud Account: Ahana console oversees and manages every Presto cluster
    Customer Cloud Account: In-VPC orchestration of Presto clusters, where metadata, monitoring, and data sources reside
  25. Ahana Cloud Overview
    1. Ahana Managed Service Console
    2. Add data sources
    3. Query data where it lives with Federated Connectors (in place)
    4. Cluster management
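Querying data in place relies on Presto addressing tables as `catalog.schema.table`, so one statement can join tables that live in different systems. The sketch below only builds such a statement as a string. The catalog names (`hive`, `mysql`), the schemas, the tables and the `account_id` column are all hypothetical; the real names depend on how the data sources were added to the cluster.

```python
import re

# Conservative identifier check so the generated SQL cannot be altered
# by unexpected characters in a name.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Return a Presto-style catalog.schema.table name."""
    for part in (catalog, schema, table):
        if not _IDENT.match(part):
            raise ValueError(f"unsafe identifier: {part!r}")
    return f"{catalog}.{schema}.{table}"


def federated_join_sql(lake_table: str, db_table: str, key: str) -> str:
    """Build a join between a data lake table and a database table."""
    if not _IDENT.match(key):
        raise ValueError(f"unsafe identifier: {key!r}")
    return (
        f"SELECT d.{key}, count(*) AS events\n"
        f"FROM {lake_table} AS e\n"
        f"JOIN {db_table} AS d ON e.{key} = d.{key}\n"
        f"GROUP BY d.{key}"
    )


# Hypothetical sources: events in S3 behind a Hive/Glue catalog,
# account records in a MySQL database.
events = qualified_name("hive", "security", "events")
accounts = qualified_name("mysql", "crm", "accounts")
sql = federated_join_sql(events, accounts, "account_id")
print(sql)
```

The resulting string could be submitted through any Presto client. Neither table has to be copied into the other system first, which is the point of the slide's "in place" wording.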
  26. Case study: Securonix
    NextGen SIEM Cluster • AWS S3 Data Lake • Glue Metastore
    ▪ Securonix provides security information and event management (SIEM) software
    ▪ They use Ahana for in-app SQL analytics on data from AWS S3 for threat hunting
    ▪ They pull in billions of events per day that get stored in S3
    ▪ With Ahana Cloud, they saw 3x better price performance compared with Presto on AWS
  27. Ahana Cloud for Presto - Summary
    ▪ Brings SQL on AWS S3 with an open data lake
    ▪ Presto compute brought to your data in your VPC in your account
    ▪ Fully managed Presto cluster life cycle including idle-time management
    ▪ Query AWS DBs: RDS/MySQL, RDS/Postgres, Elasticsearch, Redshift
    ▪ Cloud-native and highly available running on Kubernetes
    ▪ Bring your own
      ▪ BI tool / Data Science Notebook
      ▪ Metadata Catalog
      ▪ Transaction Manager
    Easy to use • 3x Price Performance • Open & Flexible