How to Build an Open Data Lake Analytics Stack

Ahana
October 12, 2021

Transcript

  1. How to Build an
    Open Data Lake
    Analytics Stack
    Dipti Borkar
    Cofounder, Chief Product Officer &
    Chief Evangelist
    Chairperson | Community Team
    Presto Foundation

  2. Today’s Speaker
    Dipti is a Cofounder, CPO & Chief Evangelist of Ahana with over 15 years of experience
    in distributed data and database technology, including relational, NoSQL and
    federated systems. She is also the Presto Foundation Outreach Chairperson. Prior to
    Ahana, Dipti held executive roles at Alluxio, Kinetica and Couchbase. At Alluxio, she
    was Vice President of Products, and at Couchbase she held several leadership
    positions, including VP of Product Marketing, Head of Global Technical Sales and
    Head of Product Management. Earlier in her career, Dipti managed development
    teams at IBM DB2 Distributed, where she started as a database software
    engineer. Dipti holds an M.S. in Computer Science from UC San Diego and an MBA
    from the Haas School of Business at UC Berkeley.
    Dipti Borkar
    Cofounder, Chief Product
    Officer and Chief Evangelist
    Ahana

  3. Why Open Data Lake Analytics?
    Data Warehouse
    • Enterprise Data
    • Reporting & Dashboarding
    Open Data Lakes
    • Beyond Enterprise Data: IoT, Third-party, Telemetry, Event
    • 1000X More Data: Terabytes to Petabytes
    • Open & Flexible: Open Source, Open Formats
    • Reporting & Dashboarding, Data Science, In-data lake transformation

  4. What is Open Data Lake Analytics

  5. The Traditional Data Warehouse
    • Relational Database
    • Columnar Structure
    • In-Database Analytics
    • Structured Data
    • Modeled Data
    • Extract, Transform, Load
    • SQL Access
    Challenges
    • Expensive
    • Difficult to Manage
    • Costly to Maintain
    • Limited Data
    • Limited Access

  6. The Drivers Behind Modernization
    Digital Transformation
    Real Time Events
    Modern Processing Techniques
    More Data
    Fast Data
    Smart Data
    The Deconstructed Database

  7. The Traditional Data Lake
    • File System Data Store / Object Store
    • Structured / Semi-Structured Data
    • Ingestion
    • Discovery
    • Data Science
    • Notebook and Python Access
    • Less expensive, but…
    • Good enough performance
    • Supports ~70% of DW workloads
    • Different approach to governance

  8. Diagram: data warehouse vs. cloud data lake
    Data Warehouse: SQL Query Processing, 1-10 TB, Reporting & Dashboarding
    Cloud Data Lake: Data Processing, 1TB -> PB, Open Data Lake Analytics
    Open Data Lake Analytics: Reporting & Dashboarding, Data Science, In-data lake transformation

  9. Data Warehouse
    Operational
    Data Stores
    Third Party
    Data
    Machine Learning
    Semi- | unstructured
    Data Virtualization /
    Federated Access
    Streaming &
    IoT Data
    SQL Query Processing
    SQL Query
    Processing
    The Data Platform
    ETL
    ELT
    Data
    Engg
    Storage
    Compute
    1-10 TB
    Query & Processing
    Storage Compute
    SQL
    Structured Workloads
    1TB -> PB
    Data Lake
    Reporting
    Dashboards
    Visualizations
    Notebooks
    Custom Apps

  10. Cloud data lake driving open source SQL query engines
    Presto is the De-Facto SQL Engine for Data Lakes
    https://db-engines.com/en/ranking_trend/relational+dbms

  11. Merging the Data Warehouse and the Data Lake with a Distributed Query Engine
    1. SQL Access
    2. Data Lake and Data Warehouse Access
    3. Unified Analytics
    4. Distributed Queries
    5. Limitless Scale
    6. Complex Data Types
    • Leverage Resources
    • Better Insight
    • More Use Cases
    • Leverage Platforms
    • Remove Limits
    • Amplified Insight
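The "Data Lake and Data Warehouse Access" and "Distributed Queries" points above can be illustrated with a minimal sketch. Presto addresses every table as `catalog.schema.table`, so a single SQL statement can join data lake files with a database table. The catalog, schema, table, and column names below (`hive`, `weblogs`, `mysql`, `crm`, `customer_id`) are hypothetical placeholders, not names from the talk, and the code only builds the query text; it does not connect to a cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    """A Presto table reference of the form catalog.schema.table."""
    catalog: str  # hypothetical catalog name, e.g. "hive" for the data lake
    schema: str
    name: str

    def fqn(self) -> str:
        """Return the fully qualified name used in Presto SQL."""
        return f"{self.catalog}.{self.schema}.{self.name}"


def federated_join(lake: Table, database: Table, key: str, limit: int = 10) -> str:
    """Build one SQL statement joining a data lake table to a database table."""
    return (
        f"SELECT l.{key}, count(*) AS events\n"
        f"FROM {lake.fqn()} AS l\n"
        f"JOIN {database.fqn()} AS w ON l.{key} = w.{key}\n"
        f"GROUP BY l.{key}\n"
        f"ORDER BY events DESC\n"
        f"LIMIT {limit}"
    )


# Hypothetical sources: event files in the lake, customer rows in a database.
events = Table("hive", "weblogs", "events")
customers = Table("mysql", "crm", "customers")
query = federated_join(events, customers, key="customer_id")
print(query)
```

The resulting string could be submitted through any Presto client; the point of the sketch is that the join crosses two catalogs without copying data into one system first.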

  12. Data Lake Use Cases

  13. Use Cases
    Emerging use cases
    Data Lakehouse analytics
    Reporting & dashboarding
    Interactive querying
    use cases
    Transformation using SQL (ETL)
    Federated access across data sources
    SQL
    Data Science
    Customer-facing app analytics

  14. The Data LakeHouse Components
    • Dashboard / notebooks
    • Compute / Query engine
    • Operational catalog
    • Table Format / Transaction manager
    • Storage

  15. The Data LakeHouse Stack

  16. Considerations for Open Analytics Decision
    © 2021 Enterprise Management Associates, Inc. | @ema_research
    Data Analytics Users Platform
    Cloud Enterprise Business Cost

  17. Considerations for Any Unified Analytics Decision
    Data Structured
    Semi-Structured
    Real Time
    Structured
    Complex Data Types
    Textual Streaming

  18. Considerations for Any Unified Analytics Decision
    Data Analytics Users Platform
    SQL Python Notebook Search

  19. Considerations for Any Unified Analytics Decision
    Data Analytics Users Platform
    Engineer Analyst Scientist Business

  20. Considerations for Any Unified Analytics Decision
    Data Analytics Users Platform
    Cloud Enterprise Business Cost

  21. Considerations for Any Unified Analytics Decision
    Elasticity Scale Mobility Globality
    Cloud Enterprise Business Cost

  22. Considerations for Any Unified Analytics Decision
    Security Privacy Governance Unification
    Cloud Enterprise Business Cost

  23. Considerations for Any Unified Analytics Decision
    Semantics Logic Value Optimization
    Cloud Enterprise Business Cost

  24. Considerations for Any Unified Analytics Decision
    Forecast Containment Chargeback Scale
    Cloud Enterprise Business Cost

  25. Challenges with SQL on Open Data Lakes
    Cloud DW / AWS serverless options get very expensive for growing data volumes
    ▪ Cloud data warehouse costs grow much faster than compute engine costs
    ▪ Serverless options like AWS Athena charge per query and get expensive
    “Do it yourself” approach is complicated
    ▪ Big data skills in platform teams are limited
    ▪ Presto is complicated and operationally very time consuming
    Presto on AWS, like AWS Athena, has limited capabilities and doesn’t scale
    ▪ Limited concurrency of 20 per account
    ▪ No visibility into cluster logs or query logs, no flexibility / control on scale
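The first challenge above, per-query charges growing with usage, can be made concrete with a rough cost model. This is a sketch under stated assumptions: per-query pricing is modeled as proportional to the data each query scans, the cluster is modeled as a fixed hourly rate per node, and every number below is an illustrative placeholder rather than a price quoted in the talk or by any vendor.

```python
import math


def per_query_monthly_cost(queries: int, tb_scanned_per_query: float,
                           price_per_tb: float) -> float:
    """Monthly bill when every query is charged for the data it scans."""
    return queries * tb_scanned_per_query * price_per_tb


def cluster_monthly_cost(nodes: int, price_per_node_hour: float,
                         hours: float = 730.0) -> float:
    """Monthly bill for a fixed-size cluster, independent of query count."""
    return nodes * price_per_node_hour * hours


def break_even_queries(cluster_cost: float, tb_scanned_per_query: float,
                       price_per_tb: float) -> int:
    """Smallest monthly query count at which per-query pricing
    costs at least as much as the fixed cluster."""
    return math.ceil(cluster_cost / (tb_scanned_per_query * price_per_tb))


# Placeholder inputs (assumptions for illustration, not real price quotes).
PRICE_PER_TB = 5.0
TB_PER_QUERY = 0.5
cluster = cluster_monthly_cost(nodes=4, price_per_node_hour=1.0)
threshold = break_even_queries(cluster, TB_PER_QUERY, PRICE_PER_TB)
print(f"cluster: ${cluster:,.0f}/month, break-even at {threshold} queries/month")
```

Under this model the per-query bill grows linearly with query volume while the cluster bill stays flat, so the comparison depends mostly on how many queries run and how much data each one scans.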

  26. Presto & Presto Community

  27. Open Source Presto Overview
    • Distributed SQL query engine
    • Created at Facebook
    • ANSI SQL on Databases, Data lakes
    • Designed to be interactive & access
    petabytes of data
    • Open source, hosted at
    https://github.com/prestodb

  28. Presto Users

  29. Ahana Overview

  30. How Ahana Cloud Works
    ~ 30 mins to create the compute plane
    https://app.ahana.cloud/signup Create Presto Clusters in your account

  31. Ahana Cloud for Presto
    Ahana Cloud Account
    • Ahana Console (Control Plane): Cluster Orchestration, Consolidated Logging, Security & Access, Billing & Support
    • Ahana console oversees and manages every Presto cluster
    Customer Cloud Account
    • In-VPC Presto Clusters (Compute Plane): Ad Hoc Cluster 1, Test Cluster 2, Prod Cluster N
    • Glue, S3, RDS, Elasticsearch
    • In-VPC orchestration of Presto clusters, where metadata, monitoring, and data sources reside

  32. Ahana Cloud Overview
    1. Ahana Managed Service
    Console
    2. Add data sources
    3. Query data where it lives with
    Federated Connectors (in place)
    4. Cluster management
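In open source Presto, step 2 above ("Add data sources") corresponds to one catalog properties file per source. The sketch below renders two such files. It is not Ahana's API, since the console handles this through its own interface. The catalog names (`lake`, `crm`), host, region, and credentials are hypothetical placeholders; the property keys are the standard Presto Hive (with Glue metastore) and MySQL connector settings.

```python
from typing import Dict


def render_catalog(properties: Dict[str, str]) -> str:
    """Render connector settings as the text of a Presto catalog .properties file."""
    return "".join(f"{key}={value}\n" for key, value in properties.items())


# Hypothetical catalogs: an S3 data lake described by Glue, and a MySQL database.
catalogs = {
    "lake": {
        "connector.name": "hive-hadoop2",
        "hive.metastore": "glue",
        "hive.metastore.glue.region": "us-east-1",  # placeholder region
    },
    "crm": {
        "connector.name": "mysql",
        "connection-url": "jdbc:mysql://db.example.internal:3306",  # placeholder host
        "connection-user": "presto",         # placeholder credentials
        "connection-password": "change-me",
    },
}

# Presto reads catalog definitions from etc/catalog/<name>.properties.
catalog_files = {
    f"etc/catalog/{name}.properties": render_catalog(props)
    for name, props in catalogs.items()
}

for path, text in catalog_files.items():
    print(f"# {path}\n{text}")
```

Once both catalogs are registered, tables are addressed as `lake.<schema>.<table>` and `crm.<schema>.<table>` and can be queried in place, which is what step 3 describes.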

  33. Case study: Securonix
    NextGen SIEM
    Cluster
    AWS S3 Data Lake
    Glue Metastore
    ▪ Securonix provides security information and
    event management (SIEM) software
    ▪ They use Ahana for in-app SQL
    analytics on data from AWS S3 for
    threat hunting
    ▪ They pull in billions of events per day
    that get stored in S3
    ▪ With Ahana Cloud, they saw 3x better
    price performance compared with
    Presto on AWS

  34. Ahana Cloud for Presto - Summary
    ▪ Brings SQL on AWS S3 with an open data lake
    ▪ Presto compute brought to your data in your
    VPC in your account
    ▪ Fully managed Presto cluster life cycle
    including idle-time management
    ▪ Query AWS DBs - RDS/MySQL, RDS/Postgres,
    Elasticsearch, Redshift
    ▪ Cloud-native and highly available running on
    Kubernetes
    ▪ Bring your own
    ▪ BI tool / Data Science Notebook
    ▪ Metadata Catalog
    ▪ Transaction Manager
    Easy to use
    3x Price Performance
    Open & Flexible

  35. 12/17/20
