Data Warehouse or Data Lake, which one do I choose?

Ahana
May 20, 2022

May 2022 - Today’s data-driven companies have a choice to make: where do we store our data? As the move to the cloud continues, the choice comes down to the data warehouse (Snowflake et al.) or the data lake (AWS S3 et al.). There are pros and cons to each approach. Data warehouses give you strong data management and analytics, but they handle semi-structured and unstructured data poorly, couple storage and compute tightly, and bring expensive vendor lock-in. Data lakes, on the other hand, let you store all kinds of data and are extremely affordable, but they are meant only for storage and by themselves provide no direct value to an organization.

Enter the Open Data Lakehouse, the next evolution of the data stack that gives you the openness and flexibility of the data lake with the key aspects of the data warehouse like management and transaction support.

In this webinar, you’ll hear from Ali LeClerc, who will discuss the data landscape and why many companies are moving to an open data lakehouse. Ali will share more perspective on how to think about what fits best based on your use case and workloads, and how some real-world customers are using Presto, a SQL query engine, to bring analytics to the data lakehouse.

Transcript

  1. Data Warehouse or Data Lake, which do I choose? Ali LeClerc, Head of Community, Ahana; Chairperson, Community Team, Presto Foundation
  2. Today’s Agenda: Quick introduction - Data Warehouses, Data Lakes • Enter the Data Lakehouse • Presto for the Data Lakehouse • Real world use cases • Q&A
  3. The Traditional Data Warehouse: Relational Database • Columnar Structure • In-Database Analytics • Structured Data • Modeled Data • Extract, Transform, Load • SQL Access. Challenges: Expensive • Difficult to Manage • Costly to Maintain • Limited Data • Limited Access
  4. The Data Lake: File System Data Store • Semi-Structured Data • Ingestion • Discovery • Data Science • Notebook and Python Access. Challenges: File System Data Store • Semi-Structured Data • Ingestion • Discovery • Data Science • Notebook and Python Access
  5. The Drivers Behind Modernization: Digital Transformation • Real Time Events • Modern Processing Techniques • More Data • Fast Data • Smart Data • The Deconstructed Database
  6. Data Warehouse vs. Data Lake. Data Warehouse: Cloud-First • In-Memory Capabilities • Complex Data Types • Storage & Compute still loosely coupled • High Performance • SQL Access • Expensive. Data Lake: Cloud-First • In-Memory Capabilities • Open Formats • Columnar Data Types • Separate Storage & Compute • Expanded Analytics • Improved Performance • SQL Access • Cheaper
  7. Merging the Data Warehouse and the Data Lake with a Distributed Query Engine: 1. SQL Access 2. Data Lake and Data Warehouse Access 3. Unified Analytics 4. Distributed Queries 5. Limitless Scale 6. Complex Data Types • Leverage Resources • Better Insight • More Use Cases • Leverage Platforms • Remove Limits • Amplified Insight
  8. Open Data Lakehouse. The Next EDW is the Open Data Lakehouse. Data Science, ML, & AI Reporting and Dashboarding Data Warehouse Proprietary Storage Proprietary SQL Query Processing ML and AI Frameworks SQL Query Processing Cloud Data Lake Open Formats Storage Governance, Discovery, Quality & Security Reporting and Dashboarding
  9. Considerations for DW/DL: Data • Analytics • Users • Platform • Cloud • Enterprise • Business • Cost. © 2021 Enterprise Management Associates, Inc. | @ema_research
  10. Considerations for DW/DL: Data. Structured • Semi-Structured • Real Time • Structured • Complex Data Types • Textual • Streaming
  11. Considerations for DW/DL: Data • Analytics • Users • Platform. SQL • Python • Notebook • Search
  12. Considerations for DW/DL: Data • Analytics • Users • Platform. Engineer • Analyst • Scientist • Business
  13. Considerations for DW/DL: Elasticity • Scale • Mobility • Globality. Cloud • Enterprise • Business • Cost
  14. Considerations for DW/DL: Security • Privacy • Governance • Unification. Cloud • Enterprise • Business • Cost
  15. Considerations for DW/DL: Semantics • Logic • Value • Optimization. Cloud • Enterprise • Business • Cost
  16. Considerations for DW/DL: Forecast • Containment • Chargeback • Scale. Cloud • Enterprise • Business • Cost
  17. Open Source Presto Overview. SQL Query Processing. What is Presto? • Open source, distributed SQL query engine for the data lake & lakehouse • Designed from ground up for fast analytic queries against data of any size • Query in place - no need to move (ETL) data • Federated querying - join data from different source formats
  18. Presto Use Cases: Data Lakehouse analytics • Reporting & dashboarding • Interactive ad hoc querying • Transformation using SQL (ETL) • Federated querying across data sources
  19. Ahana Cloud For 1. Zero to Presto in 30 Minutes. Managed cloud service: no installation and configuration. 2. Built for data teams of all experience levels. 3. Moderate level of control of deployment without complexity. 4. Dedicated support from Presto experts.
  20. Blinkit • India’s instant delivery service • Moved from the Data Warehouse to the Open Data Lakehouse powered by Presto & Ahana to power 200K orders/day • “Everything delivered in 10 minutes” “Ahana is providing Blinkit with a SaaS managed service for Presto, providing the company with the advanced data management capabilities it needs to meet its instant delivery promise.” Satyam Krishna, Engineering Manager at Blinkit
  21. Securonix. NextGen SIEM Cluster • AWS S3 Data Lake • Glue Metastore ▪ Securonix makes security information and event management (SIEM) software ▪ They use Ahana for in-app SQL analytics on data from AWS S3 for threat hunting ▪ They pull in billions of events per day that get stored in S3 ▪ With Ahana Cloud, they saw 3x better price performance compared with Presto on AWS
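
The Presto overview slide describes federated querying as joining data from different sources in a single SQL statement. The sketch below illustrates only that idea, and it is not Presto code: Python's built-in sqlite3 module stands in for the query engine, and two attached in-memory databases (hypothetically named `lake` and `crm`) stand in for two separate sources. The table names, columns, and rows are invented for illustration.

```python
import sqlite3


def federated_join():
    """Join tables that live in two separate databases with one SQL statement."""
    # One connection plays the role of the query engine.
    conn = sqlite3.connect(":memory:")

    # Two attached databases stand in for two independent sources
    # (hypothetical names; in Presto these would be configured catalogs).
    conn.execute("ATTACH DATABASE ':memory:' AS lake")
    conn.execute("ATTACH DATABASE ':memory:' AS crm")

    conn.execute(
        "CREATE TABLE lake.orders (order_id INTEGER, customer_id INTEGER, amount REAL)"
    )
    conn.execute("CREATE TABLE crm.customers (customer_id INTEGER, name TEXT)")

    # Invented sample rows.
    conn.executemany(
        "INSERT INTO lake.orders VALUES (?, ?, ?)",
        [(1, 10, 25.0), (2, 11, 40.0), (3, 10, 15.0)],
    )
    conn.executemany(
        "INSERT INTO crm.customers VALUES (?, ?)",
        [(10, "Asha"), (11, "Ben")],
    )

    # A single query spans both sources; no data is copied between them first.
    rows = conn.execute(
        """
        SELECT c.name, SUM(o.amount) AS total
        FROM lake.orders AS o
        JOIN crm.customers AS c ON o.customer_id = c.customer_id
        GROUP BY c.name
        ORDER BY c.name
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for name, total in federated_join():
        print(name, total)
```

In Presto, tables are addressed by a catalog-qualified name in a similar way, so the join above should carry over in spirit, although connection setup and catalog configuration differ.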