Build & Query Secure S3 Data Lakes with Presto and Apache Ranger

Ahana
March 02, 2023

Looking at Presto and Apache Ranger to build a data lakehouse.

Transcript

  1. Feb 28 2023
    HANDS-ON VIRTUAL LAB
    Build & Query Secure S3 Data Lakes with Presto and Apache Ranger

  2. WELCOME TO THE AHANA HANDS-ON VIRTUAL LAB
    Your Lab Guide: Rohan Pednekar

  3. Objective for Today
    Over the next 90 minutes you will:
    ● Explore and understand Presto
    ● Get a walk-through of creating a new cluster using Ahana
    ● Query data sitting in S3 using Presto
    ● Convert data to different formats using Presto
    ● Get a walk-through of enabling Apache Ranger for Presto
    ● Run SQL queries to explore Ranger policies
    ● Run federated queries/joins across multiple sources, combining data in S3 and RDS/MySQL

  4. Agenda
    1) Understand the Technology (40 mins)
    a) What is Presto?
    b) What is Ahana Cloud?
    c) Apache Ranger Integration
    2) Getting your hands dirty (40 mins)
    3) Summary and Close Out (10 mins)

  5. Understanding The Technology
    Presto

  6. What is Presto?
    • Open source, distributed MPP SQL query engine
    • Query in Place
    • Federated Querying
    • ANSI SQL Compliant
    • Designed ground up for fast analytic queries against data of any size
    • Originally developed at Facebook
    • Proven on petabytes of data
    • SQL-On-Anything
    • Federated pluggable architecture to support many connectors
    • Open source, hosted under the Linux Foundation
    • https://github.com/prestodb

  7. Presto Overview
    [Diagram: a Presto cluster made up of one coordinator and four workers]

  8. Presto – It’s Exploding
    Presto is the de facto SQL engine
    https://db-engines.com/en/ranking_trend/relational+dbms
    [Chart: Spark SQL vs. Presto]
    “As one of our earliest members, Ahana has been strong supporters of the Presto Foundation since its launch in 2019.”

  9. Presto Users

  10. Presto Use Cases
    • Data lakehouse analytics
    • Reporting & dashboarding
    • Interactive ad hoc querying
    • Transformation using SQL (ETL)
    • Federated querying across data sources

  11. Presto Architecture

  12. What makes Presto different?
    • Scalable Architecture
    • Pluggable Connectors
    • Performance

  13. Scalable Architecture
    • Two roles - coordinator and worker
    • Easy scale up and scale down
    • Scale up to 1000 workers
    • Validated at web-scale companies
    [Diagram: a Presto cluster with a coordinator, three workers, and two new workers being added, connected to a data source]
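
To make the coordinator/worker scaling idea concrete, here is a toy Python model. It is only a sketch: it assumes a query breaks down into equally sized units of work ("splits") that are spread evenly over the workers, which is far simpler than what Presto's real scheduler does. The function names are made up for this illustration.

```python
def assign_splits(total_splits: int, workers: int) -> list[int]:
    """Spread splits as evenly as possible across workers (toy model)."""
    if total_splits < 0:
        raise ValueError("total_splits must be non-negative")
    if workers < 1:
        raise ValueError("a cluster needs at least one worker")
    base, extra = divmod(total_splits, workers)
    # The first `extra` workers take one additional split each.
    return [base + 1 if i < extra else base for i in range(workers)]


def busiest_worker_load(total_splits: int, workers: int) -> int:
    """Splits handled by the most loaded worker: a rough proxy for query time."""
    return max(assign_splits(total_splits, workers))


if __name__ == "__main__":
    # Same amount of work, progressively larger clusters.
    for n in (2, 4, 8):
        print(n, "workers:", busiest_worker_load(100, n), "splits on the busiest worker")
```

In this model, adding workers shrinks the load on the busiest worker, which is the intuition behind scaling a cluster up for heavier queries and back down afterwards.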

  14. Scalable Architecture
    [Diagram: query flow through a Presto cluster]
    • BI Tools/Notebooks/Clients: Presto CLI, Looker, JDBC, Superset, Tableau, Jupyter, ... send SQL and receive result sets
    • Presto Coordinator: Parser/analyzer, Planner, Scheduler, Metadata API, Data Location API
    • Workers: exchange data with each other (data shuffle) and reach the sources through Presto connectors
    • Any Database, Data Stream, or Storage: HDFS, Object Stores (S3), MySQL, Elasticsearch, Kafka, ...

  15. Pluggable Presto Connectors
    https://prestodb.io/docs/current/connector.html

  16. Presto Connector Data Model
    • Connector: Driver for a data source.
      Examples: HDFS, AWS S3, Cassandra, MySQL, SQL Server, Kafka
    • Catalog: Contains schemas from a data source specified by the connector
    • Schema: Namespace to organize tables.
    • Table: Set of unordered rows organized into columns with types.
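
The catalog / schema / table hierarchy above is what lets a single query reach more than one source: Presto addresses a table as `catalog.schema.table`. The Python sketch below only builds the SQL text and does not connect to a cluster. The catalog names `hive` and `mysql`, the schema `lab`, and the tables `orders` and `customers` are placeholders chosen for illustration, not names from the lab.

```python
import re

# Only plain identifiers are accepted, so nothing unexpected ends up in the SQL text.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Return the fully qualified name catalog.schema.table."""
    for part in (catalog, schema, table):
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"not a simple identifier: {part!r}")
    return f"{catalog}.{schema}.{table}"


def federated_join_sql(left: str, right: str, key: str) -> str:
    """Build a join between two fully qualified tables on a shared key column."""
    if not _IDENTIFIER.fullmatch(key):
        raise ValueError(f"not a simple identifier: {key!r}")
    return (
        "SELECT l.*, r.*\n"
        f"FROM {left} AS l\n"
        f"JOIN {right} AS r ON l.{key} = r.{key}"
    )


# Hypothetical example: one table in S3 (via a Hive catalog), one in MySQL.
S3_ORDERS = qualified_name("hive", "lab", "orders")
MYSQL_CUSTOMERS = qualified_name("mysql", "lab", "customers")
QUERY = federated_join_sql(S3_ORDERS, MYSQL_CUSTOMERS, "customer_id")

if __name__ == "__main__":
    print(QUERY)
```

The resulting statement could be pasted into the Presto CLI or any client, provided catalogs with those names are configured on the cluster.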

  17. Presto Hive Connector – Data File Types
    • Supported File Types
      • ORC
      • Parquet
      • Avro
      • RCFile
      • SequenceFile
      • JSON
      • Text
      • CSV
    • No data ingestion/duplication/movement needed
    • Query data in-place
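
Converting data between these file types is typically done with a `CREATE TABLE ... AS SELECT` statement that sets the Hive connector's `format` table property. The Python sketch below only generates such a statement. The table names are hypothetical, and the set of format values reflects my understanding of what the Hive connector accepts, so check it against the connector documentation for your Presto version.

```python
# Assumed values for the Hive connector's `format` table property.
HIVE_FORMATS = {
    "ORC", "PARQUET", "AVRO", "RCBINARY", "RCTEXT",
    "SEQUENCEFILE", "JSON", "TEXTFILE", "CSV",
}


def convert_table_sql(source: str, target: str, fmt: str) -> str:
    """Build a CTAS statement that rewrites `source` into `target` using `fmt`."""
    fmt = fmt.upper()
    if fmt not in HIVE_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    return (
        f"CREATE TABLE {target}\n"
        f"WITH (format = '{fmt}')\n"
        f"AS SELECT * FROM {source}"
    )


if __name__ == "__main__":
    # Hypothetical tables: a CSV-backed table rewritten as Parquet.
    print(convert_table_sql("hive.lab.orders_csv", "hive.lab.orders_parquet", "parquet"))
```

Columnar formats such as ORC and Parquet are the usual targets for analytic queries, which is why a conversion step like this tends to come early in a lab.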

  18. Introducing Ahana Cloud
    Fully-Managed Presto Service

  19. Ahana Cloud for Presto Managed Service
    • Gets data platform engineers up and running in minutes vs. days
    • Fully integrated & pre-configured
    • No ETL, in-place analytics

  20. Ahana Cloud for Presto
    [Diagram: control plane and compute plane]
    • Ahana Cloud Account: Ahana Console (Control Plane) covering cluster orchestration, consolidated logging, security & access, and billing & support. The Ahana console oversees and manages every Presto cluster.
    • Customer Cloud Account: In-VPC Presto Clusters (Compute Plane), shown as ad hoc cluster 1, test cluster 2, and prod cluster N, alongside Glue, S3, RDS, and Elasticsearch. In-VPC orchestration of Presto clusters, where metadata, monitoring, and data sources reside.

  21. Ahana Cloud – Reference Architecture
    • Distributed SQL engine with proven scalability
    • Interactive ANSI SQL queries
    • Query data where it lives with Federated Connectors (no ETL)
    • High concurrency
    • Separation of compute and storage
    • Cost Management Features

  22. Apache Ranger with Presto

  23. Why Apache Ranger
    ● An Open-Source Authorization Solution
    ● Cloud Agnostic
    ● Fine-grained access control
    ● Audit support
    ● Secured with SSL
    ● Easy to configure with Ahana Cloud

  24. Ranger Plugin Architecture
    • Extended to reuse Hive Ranger Plugin
    • Centralized, fine-grained access control with column-level, row-level policies across all clusters
    • Supports centralized auditing of user access
    • Secured with SSL Support
    • Simplified integration with Ahana Cloud
    • Enable, monitor and manage comprehensive data security across the data lake for user-triggered Hive or Glue Catalog queries
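
To illustrate what a column-level policy check amounts to, here is a deliberately simplified Python model. It is not Ranger's API or its evaluation logic, and it leaves out the row-level filtering and auditing mentioned above. The `Policy` class, the user names, and the table and column names are all invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """A toy allow-policy: who may perform which access on which columns of a table."""
    users: frozenset
    table: str
    columns: frozenset  # "*" stands for every column
    access: frozenset   # for example {"select"}


def is_allowed(policies, user, table, columns, access="select"):
    """Default deny: every requested column must be granted by some matching policy."""
    granted = set()
    for policy in policies:
        if user in policy.users and policy.table == table and access in policy.access:
            if "*" in policy.columns:
                return True
            granted |= policy.columns
    return set(columns) <= granted


# Hypothetical policies: analysts see two columns, admins see everything.
POLICIES = [
    Policy(frozenset({"analyst"}), "hive.lab.customers",
           frozenset({"customer_id", "country"}), frozenset({"select"})),
    Policy(frozenset({"admin"}), "hive.lab.customers",
           frozenset({"*"}), frozenset({"select"})),
]

if __name__ == "__main__":
    print(is_allowed(POLICIES, "analyst", "hive.lab.customers", ["customer_id"]))
    print(is_allowed(POLICIES, "analyst", "hive.lab.customers", ["customer_id", "email"]))
```

The point of the sketch is the default-deny shape: a query touching a column that no policy grants is rejected as a whole, which matches the behaviour one would expect when selecting a restricted column with Ranger enabled.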

  25. Fine Grained Access Control

  26. Getting Your Hands Dirty
    https://bit.ly/lakehouselab

  27. Wrapping Up
    Conclusion and Things to Try

  28. Conclusion
    In this hands-on workshop you have:
    1. Learned about Presto and Ahana Cloud
    2. Seen how to effortlessly create and manage Presto clusters
    3. Run fast federated SQL queries combining datasets from S3 and MySQL
    4. Run Presto queries with Apache Ranger enabled

  29. Things to try later...
    1. Scale Up and Scale Down your cluster
    2. Try queries with more/fewer workers; how does performance change?
    3. Fine-grained access control with Lake Formation
    4. Table formats for transactions, schema evolution, table versioning with Apache Hudi, Apache Iceberg, Delta Lake Connector

  30. Next Steps for You...
    • Ahana Cloud is available on the AWS Marketplace
    • Sign up for a 14-day free trial here: https://ahana.io/sign-up
    • Community Edition - Free forever service

  31. How to get involved with Presto
    • Join the Slack channel! prestodb.slack.com
    • Write a blog for prestodb.io! prestodb.io/blog
    • Join the virtual meetup group & present! meetup.com/prestodb
    • Contribute to the project! github.com/prestodb

  32. Questions?

  33. Thank you!
    Stay Up-to-Date with Ahana
    Website: https://ahana.io/
    Blogs: https://ahana.io/blog/
    Twitter: @ahanaio