
UNLOCKING THE VALUE OF YOUR DATA LAKE

Ahana
October 06, 2021

Transcript

  1. Unlocking the Value of Your Data Lake
    Dipti Borkar
    Cofounder, Chief Product Officer & Chief Evangelist
    Chairperson | Community Team
    Presto Foundation

  2. Today’s Speaker
    Dipti is a Cofounder, CPO & Chief Evangelist of Ahana with over 15 years of experience
    in distributed data and database technology, including relational, NoSQL and
    federated systems. She is also the Presto Foundation Outreach Chairperson. Prior to
    Ahana, Dipti held VP roles at Alluxio, Kinetica and Couchbase. At Alluxio, she was
    Vice President of Products, and at Couchbase she held several leadership positions,
    including VP of Product Marketing, Head of Global Technical Sales and Head of
    Product Management. Earlier in her career, Dipti managed development teams at
    IBM DB2 Distributed, where she started her career as a database software engineer.
    Dipti holds an M.S. in Computer Science from UC San Diego and an MBA from the
    Haas School of Business at UC Berkeley.
    Dipti Borkar, Cofounder, Chief Product Officer and Chief Evangelist, Ahana

  3. The Traditional Data Warehouse
    • Relational Database
    • Columnar Structure
    • In-Database Analytics
    • Structured Data
    • Modeled Data
    • Extract, Transform, Load
    • SQL Access

    Challenges
    • Expensive
    • Difficult to Manage
    • Costly to Maintain
    • Limited Data
    • Limited Access

  4. The Drivers Behind Modernization
    • Digital Transformation
    • Real Time Events
    • Modern Processing Techniques
    • More Data
    • Fast Data
    • Smart Data
    The Deconstructed Database

  5. Why Open Data Lake Analytics?
    • Beyond Enterprise Data: IoT, Third-party, Telemetry, Event
    • 1000X More Data: Terabytes to Petabytes
    • Open & Flexible: Open Source, Open Formats
    [Diagram labels: Enterprise Data; Data Warehouse; Open Data Lakes; Reporting & Dashboarding; Data Science; In-data lake transformation]

  6. The Traditional Data Lake
    • File System Data Store / Object Store
    • Structured / Semi-Structured Data
    • Ingestion
    • Discovery
    • Data Science
    • Notebook and Python Access
    • Less expensive, but…
    • Good enough performance
    • Supports ~70% of DW workloads
    • Different approach to governance

  7. The Next Data Warehouse is Open Data Lake Analytics
    • Data Warehouse: SQL Query Processing, 1-10 TB
    • Cloud Data Lake: Data Processing, 1TB -> PB
    [Diagram labels: Data; Open Data Lake Analytics; Reporting & Dashboarding; Data Science; In-data lake transformation]

  8. The Data Platform
    [Diagram labels: Data Warehouse; Operational Data Stores; Third Party Data; Machine Learning; Semi- | unstructured; Data Virtualization / Federated Access; Streaming & IoT Data; SQL Query Processing; ETL; ELT; Data Engg; Storage; Compute; 1-10 TB; Query & Processing; SQL; Structured Workloads; 1TB -> PB; Data Lake]
    [Consumers shown: Reporting; Dashboards; Visualizations; Notebooks; Custom Apps]

  9. Cloud data lake driving open source SQL query engines
    Presto is the De-Facto SQL Engine for Data Lakes
    https://db-engines.com/en/ranking_trend/relational+dbms

  10. Similarities with Modern Data Warehouse & The Modern Data Lake
    Modern Data Warehouse
    • Cloud-First
    • In-Memory Capabilities
    • Complex Data Types
    • Separate Storage & Compute
    • Expanded Analytics
    • Improved Performance
    • Storage Options
    • SQL Access

    Modern Data Lake
    • Cloud-First
    • In-Memory Capabilities
    • Columnar Data Types
    • Separate Storage & Compute
    • Expanded Analytics
    • Improved Performance
    • Storage Options
    • SQL Access

  11. Merging the Data Warehouse and the Data Lake with a Distributed Query Engine
    1. SQL Access
    2. Data Lake and Data Warehouse Access
    3. Unified Analytics
    4. Distributed Queries
    5. Limitless Scale
    6. Complex Data Types

    • Leverage Resources
    • Better Insight
    • More Use Cases
    • Leverage Platforms
    • Remove Limits
    • Amplified Insight

  12. Use Cases


  13. Use Cases
    • Data Lakehouse analytics
    • Reporting & dashboarding
    • Interactive querying use cases
    • Transformation using SQL (ETL)
    • Federated access across data sources
    • SQL Data Science
    • Customer-facing app analytics
    • Emerging use cases

  14. Data Lakehouse

  15. Considerations for Open Analytics Decision
    © 2021 Enterprise Management Associates, Inc. | @ema_research
    Data, Analytics, Users, Platform
    Cloud, Enterprise, Business, Cost

  16. Considerations for Any Unified Analytics Decision
    Data: Structured, Semi-Structured, Real Time, Structured, Complex Data Types, Textual, Streaming

  17. Considerations for Any Unified Analytics Decision
    Data, Analytics, Users, Platform
    SQL, Python, Notebook, Search

  18. Considerations for Any Unified Analytics Decision
    Data, Analytics, Users, Platform
    Engineer, Analyst, Scientist, Business

  19. Considerations for Any Unified Analytics Decision
    Data, Analytics, Users, Platform
    Cloud, Enterprise, Business, Cost

  20. Considerations for Any Unified Analytics Decision
    Elasticity, Scale, Mobility, Globality
    Cloud, Enterprise, Business, Cost

  21. Considerations for Any Unified Analytics Decision
    Security, Privacy, Governance, Unification
    Cloud, Enterprise, Business, Cost

  22. Considerations for Any Unified Analytics Decision
    Semantics, Logic, Value, Optimization
    Cloud, Enterprise, Business, Cost

  23. Considerations for Any Unified Analytics Decision
    Forecast, Containment, Chargeback, Scale
    Cloud, Enterprise, Business, Cost

  24. Challenges with SQL on Open Data Lakes
    Cloud DW / AWS Serverless options get very expensive for growing data volumes
    ▪ Cloud data warehouse costs grow much faster than compute engine costs
    ▪ Serverless options like AWS Athena charge per query and get expensive

    “Do it yourself” approach is complicated
    ▪ Big data skills in platform teams are limited
    ▪ Presto is complicated and operationally very time-consuming

    Presto on AWS like AWS Athena has limited capabilities and doesn’t scale
    ▪ Limited concurrency of 20 per account
    ▪ No visibility into cluster logs or query logs, and no flexibility or control over scale
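To make the per-query pricing point concrete, here is a minimal sketch that compares a pay-per-query model with a provisioned cluster. The class names, rates, and workload figures are hypothetical placeholders, not prices quoted in the deck or by any vendor.

```python
from dataclasses import dataclass


@dataclass
class PerQueryPricing:
    """Pay-per-query model: cost grows with the data each query scans."""
    usd_per_tb_scanned: float  # hypothetical rate, not a quoted price

    def monthly_cost(self, queries_per_month: int, tb_scanned_per_query: float) -> float:
        return queries_per_month * tb_scanned_per_query * self.usd_per_tb_scanned


@dataclass
class ClusterPricing:
    """Provisioned-cluster model: cost grows with node-hours, not query count."""
    usd_per_node_hour: float  # hypothetical rate
    nodes: int

    def monthly_cost(self, hours_running: float) -> float:
        return self.usd_per_node_hour * self.nodes * hours_running


def break_even_queries(per_query: PerQueryPricing,
                       tb_scanned_per_query: float,
                       cluster_monthly_usd: float) -> float:
    """Queries per month at which both models cost the same."""
    return cluster_monthly_usd / (tb_scanned_per_query * per_query.usd_per_tb_scanned)


if __name__ == "__main__":
    # All inputs below are made-up illustration values.
    per_query = PerQueryPricing(usd_per_tb_scanned=5.0)
    cluster = ClusterPricing(usd_per_node_hour=0.5, nodes=4)
    cluster_cost = cluster.monthly_cost(hours_running=200)
    print("cluster cost per month:", cluster_cost)
    print("per-query cost at 2000 queries:", per_query.monthly_cost(2000, 0.25))
    print("break-even queries per month:", break_even_queries(per_query, 0.25, cluster_cost))
```

Above the break-even query count, the per-query model is the more expensive one in this simplified view; real bills also depend on factors the sketch ignores, such as idle time and data layout.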

  25. Presto & Presto Community


  26. Open Source Presto Overview
    • Distributed SQL query engine
    • Created at [logo not captured in transcript]
    • ANSI SQL on databases and data lakes
    • Designed to be interactive and to access petabytes of data
    • Open source, hosted at https://github.com/prestodb
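The "distributed SQL query engine" bullet can be illustrated with a toy scatter-gather aggregation: workers compute partial counts over separate splits of the data, and a coordinator merges them. This is only an illustration of the general pattern, not Presto's implementation; the function names and sample rows are made up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def scan_split(rows):
    """Worker step: partial COUNT(*) GROUP BY event_type for one split."""
    return Counter(row["event_type"] for row in rows)


def run_query(splits, max_workers=4):
    """Coordinator step: fan splits out to workers, then merge partial counts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(scan_split, splits))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


if __name__ == "__main__":
    # Hypothetical rows standing in for files in a data lake.
    splits = [
        [{"event_type": "login"}, {"event_type": "alert"}],
        [{"event_type": "login"}],
    ]
    print(run_query(splits))
```

Because each split is aggregated independently, adding workers lets the same query cover more data without changing the SQL-level result.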

  27. Presto Users

  28. Ahana Overview


  29. How Ahana Cloud Works
    ~30 mins to create the compute plane
    https://app.ahana.cloud/signup
    Create Presto clusters in your account

  30. Ahana Cloud for Presto
    Ahana Cloud Account: Ahana Console (Control Plane)
    • CLUSTER ORCHESTRATION
    • CONSOLIDATED LOGGING
    • SECURITY & ACCESS
    • BILLING & SUPPORT
    Ahana console oversees and manages every Presto cluster

    Customer Cloud Account: In-VPC Presto Clusters (Compute Plane)
    • AD HOC CLUSTER 1
    • TEST CLUSTER 2
    • PROD CLUSTER N
    • Glue, S3, RDS, Elasticsearch
    In-VPC orchestration of Presto clusters, where metadata, monitoring, and data sources reside

  31. Ahana Cloud Overview
    1. Ahana Managed Service Console
    2. Add data sources
    3. Query data where it lives with Federated Connectors (in place)
    4. Cluster management
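Querying data in place across sources relies on Presto addressing every table as catalog.schema.table, so a single statement can join a data lake table to a database table. The sketch below only builds such a SQL string; the catalog, schema, table, and column names are hypothetical, and no Ahana or Presto client API is shown.

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Return a fully qualified Presto table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def build_federated_join(lake_table: str, db_table: str, join_key: str) -> str:
    """Build a SQL statement joining a data lake table to a database table."""
    return (
        f"SELECT d.{join_key}, count(*) AS events\n"
        f"FROM {lake_table} AS e\n"
        f"JOIN {db_table} AS d ON e.{join_key} = d.{join_key}\n"
        f"GROUP BY d.{join_key}"
    )


if __name__ == "__main__":
    # Hypothetical catalogs: "hive" for S3 data, "mysql" for an RDS database.
    lake = qualified("hive", "security", "events")
    db = qualified("mysql", "crm", "accounts")
    print(build_federated_join(lake, db, "account_id"))
```

The resulting statement could be submitted through whatever SQL client or BI tool is connected to the cluster; the data stays in S3 and the database rather than being copied first.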

  32. Case study: Securonix
    [Diagram labels: NextGen SIEM; Cluster; AWS S3 Data Lake; Glue Metastore]
    ▪ Securonix provides security information and event management (SIEM) software
    ▪ They use Ahana for in-app SQL analytics on data from AWS S3 for threat hunting
    ▪ They pull in billions of events per day that get stored in S3
    ▪ With Ahana Cloud, they saw 3x better price performance compared with Presto on AWS

  33. Ahana Cloud for Presto - Summary
    ▪ Brings SQL on AWS S3 with an open data lake
    ▪ Presto compute brought to your data in your VPC in your account
    ▪ Fully managed Presto cluster life cycle including idle-time management
    ▪ Query AWS DBs - RDS/MySQL, RDS/Postgres, Elasticsearch, Redshift
    ▪ Cloud-native and highly available running on Kubernetes
    ▪ Bring your own
      ▪ BI tool / Data Science Notebook
      ▪ Metadata Catalog
      ▪ Transaction Manager
    Easy to use
    3x Price Performance
    Open & Flexible