Upgrade to Pro — share decks privately, control downloads, hide ads and more …

SQL on the Data Lake: Presto on AWS S3

Ahana
June 22, 2022

SQL on the Data Lake: Presto on AWS S3

Slides from the June 2022 Virtual Lab - SQL on the Data Lake: Presto on AWS S3

Ahana

June 22, 2022
Tweet

More Decks by Ahana

Other Decks in Technology

Transcript

  1. SQL on the Data Lake
    Presto on Amazon S3

    View Slide

  2. Agenda
    2
    • Data warehouses and data lake
    • Introduction to Presto
    • Presto Architecture
    • Presto CLI
    • Presto Data Source Model
    • Open File Formats
    • Metadata Catalogs
    • Federated Querying
    • Advanced Topics

    View Slide

  3. Data Warehouse and Data Lakes

    View Slide

  4. Enterprise Data Warehouse (EDW)
    Data Warehouse
    Proprietary Storage
    Proprietary
    SQL Query Processing
    Reporting and Dashboarding

    View Slide

  5. Cloud Data Lake
    Reporting and Dashboarding
    Data Warehouse
    Proprietary Storage
    Proprietary
    SQL Query Processing
    Cloud Data Lake
    Open
    Formats
    Storage

    View Slide

  6. Data Lake Storage
    6
    • Scalable, low-cost file storage.
    The scalability translates to the implementation as some
    distributed system. Low-cost is a relative term, and it’s typically in
    contrast to expensive specialized hardware.
    • Accessible to scalable compute engines

    View Slide

  7. Hadoop Distributed File System (HDFS)
    • Storage layer for Hadoop, the
    open source stack that
    ushered in Big Data to the
    mainstream in mid-2000s.
    • Distributed filesystem on
    commodity hardware
    (assumes hardware failure as
    first-class citizen)
    • “Data lake 1.0”

    View Slide

  8. Object Stores
    8
    • Object stores, particularly cloud-based ones, have become the
    storage of choice for modern data lakes.
    • Also stores files, but does not manage the files in a file hierarchy like
    filesystems. Simple PUT, GET, DELETE, and LIST.

    View Slide

  9. Amazon S3 (Simple Storage Service)
    • Amazon object storage
    service: industry-leading
    scalability, data availability,
    security, and performance
    • Introduced in 2006
    • Over 100 trillion objects
    worldwide
    • Inexpensive: $0.023 per GB

    View Slide

  10. Amazon S3 (Simple Storage Service)

    View Slide

  11. Cloud Data Lake
    Reporting and Dashboarding
    Data Warehouse
    Proprietary Storage
    Proprietary
    SQL Query Processing
    Cloud Data Lake
    Open
    Formats
    Storage

    View Slide

  12. Querying Data on S3
    Reporting and Dashboarding
    Data Warehouse
    Proprietary Storage
    Proprietary
    SQL Query Processing
    SQL Query Processing
    Cloud Data Lake
    Open
    Formats
    Storage
    Reporting and Dashboarding

    View Slide

  13. Introduction to Presto

    View Slide

  14. Presto

    View Slide

  15. Hadoop MapReduce

    View Slide

  16. Apache Hive
    • Open sourced in 2008 and
    originally SQL on top of HDFS
    • SQL-like dialect: HiveQL

    View Slide

  17. Apache Hive

    View Slide

  18. Presto
    • Successor to Hive for
    large-scale SQL. In-memory
    for low-latency performance.
    • Introduced in 2013 by Meta
    (Facebook)

    View Slide

  19. 19
    Presto
    Users

    View Slide

  20. Presto and Linux Foundation

    View Slide

  21. Why Presto?
    21
    • Scalable → Big data
    • Low-latency → Interactive experience, great for ad hoc use cases
    • Pluggable architecture to support multiple data sources
    • ANSI SQL → Lingua franca of data literate population
    (e.g. data analysts)
    • Open source and open governance → Diverse innovation, no vendor
    lock in

    View Slide

  22. Presto Architecture

    View Slide

  23. Coordinator and Workers
    • Coordinator. The “brains.”. Connects
    to clients, parses and plans queries,
    manages the Workers.
    • Worker. The “workhorse.” Executes
    tasks, fetches and processes data.
    Works together for distributed
    processing.

    View Slide

  24. Connector
    • Interface to data sources
    • Creates a table abstraction (e.g. tables, rows, columns) to underlying
    data source

    View Slide

  25. Architecture
    Details

    View Slide

  26. Demo: Presto CLI

    View Slide

  27. Connecting to Presto

    View Slide

  28. Presto CLI
    ./presto
    --server https://c1.ahanademo1.cp.ahana.cloud
    --password --user presto
    SELECT * FROM tpch.sf100.customer LIMIT 10;

    View Slide

  29. Presto Data Source Model

    View Slide

  30. TPC-H Benchmark

    View Slide

  31. TPC-H
    • Decision support benchmark
    widely used for database
    performance
    • Presto has a TPC-H connector
    built in with TPCH data.

    View Slide

  32. Presto Data Source Model
    • Data is organized in units of
    catalogs, schemas, and tables.
    • A table is like a database table.
    • Remember, though, Presto
    can connector to data sources
    that are NOT databases
    underneath. As we know, a
    data lake is just a bunch of
    files.
    tpch
    sf1
    sf100
    customer
    orders
    customer
    Table
    Schema
    Catalog

    View Slide

  33. Catalog
    • Represents a configured
    connector
    • You can have multiple
    catalogs using the same
    connector type.
    For example: I can have
    multiple MySQL catalogs to
    different or the same MySQL
    instance.
    SHOW CATALOGS;

    View Slide

  34. Catalog (cont.)
    34
    The term catalog is overloaded. If we need to disambiguate, we can
    call it the Presto catalog.

    View Slide

  35. Schema
    SHOW SCHEMAS in tpch;
    • Abstraction to organize a
    collection of tables
    • How it maps to the
    underlying data source is
    through the connector
    design. For relational
    sources, it’s straightforward;
    others, such as NoSQL, will be
    what makes sense for that
    source.

    View Slide

  36. Tables
    • Unordered rows with named
    typed columns
    SHOW TABLES in tpch.sf100;

    View Slide

  37. Referencing Tables
    37
    • Tables are referenced by a fully qualified name:
    catalog_name.schema_name.table_name
    Example: tpch.sf100.customer
    • Typing the fully qualified name can be cumbersome, especially if
    you primarily working in the same schema. You can tell Presto to
    scope assume a particular catalog-schema scope with:
    USE catalog_name.schema_name;
    Example: USE tpch.sf100;

    View Slide

  38. Lab Tables

    View Slide

  39. Sales Transactions
    transactions

    View Slide

  40. Customers
    customers

    View Slide

  41. Ahana Cloud for Presto

    View Slide

  42. Ahana Cloud For Presto
    1. Zero to Presto in 30 Minutes.
    Managed cloud service: No installation
    and configuration.
    2. Built for data teams of all experience
    level.
    3. Moderate level of control of
    deployment without complexity.
    4. Dedicated support from Presto
    experts.

    View Slide

  43. Personas

    View Slide

  44. Open File Formats

    View Slide

  45. Comma Separated Values (CSV)
    • Text file where column values are separated by commas
    "txn_datetime","cust_id","cc_num","category","amt"
    2018-01-01 00:57:44,"1300","6963165313246376","gas_transport",13.05
    2018-01-01 01:42:01,"6631","6761206684819706","kids_pets",43.74
    2018-01-01 02:01:37,"3569","676602124466","grocery_pos",6.4
    2018-01-01 02:08:38,"5670","4828986262424457062","gas_transport",66.15
    2018-01-01 02:49:08,"2779","1205887406197654","home",89.29

    View Slide

  46. JavaScript Object Notation (JSON)
    • Text file key-value parts (can be complex and nested)
    {"txn_datetime":"2018-01-01
    00:57:44","cust_id":"1300","cc_num":"6963165313246376","category":"gas
    _transport","amt":13.05}
    {"txn_datetime":"2018-01-01
    01:42:01","cust_id":"6631","cc_num":"6761206684819706","category":"kid
    s_pets","amt":43.74}
    {"txn_datetime":"2018-01-01
    02:01:37","cust_id":"3569","cc_num":"676602124466","category":"grocery
    _pos","amt":6.4}
    {"txn_datetime":"2018-01-01
    02:08:38","cust_id":"5670","cc_num":"4828986262424457062","category":"
    gas_transport","amt":66.15}

    View Slide

  47. Optimized Row Columnar (ORC)
    • Open-source
    column-oriented file format

    View Slide

  48. Parquet
    • Open-source
    column-oriented file format

    View Slide

  49. Presto Data Types
    • Boolean. This type captures boolean
    values true and false.
    • Integer Types
    • TINYINT. 8-bit signed 2’s comp.
    (min = -2^7, max = 2^7 - 1)
    • SMALLINT. 16-bit signed 2’s comp.
    (min = -2^15, max = 2^15 - 1)
    • INTEGER. 32-bit signed 2’s comp.
    (min = -2^31, max = 2^31 - 1)
    • BIGINT. 64-bit signed 2’s comp.
    (min = -2^63, max = 2^63 - 1)
    • Floating-Point Types
    • REAL. 32-bit inexact,
    variable-precision; IEEE 754
    • DOUBLE. 64-bit inexact,
    variable-precision; IEEE 754
    • Fixed-Precision Types
    • DECIMAL. Precision up to 38 digits.
    Performance is best up to 18 digits.
    • String Types
    • VARCHAR. Variable length character
    data with an optional maximum
    length.
    • CHAR. Fixed length character data
    • VARBINARY. Variable length binary
    data.
    • JSON. JSON value type: JSON object,
    a JSON array, a JSON number, a
    JSON string, true, false or null.

    View Slide

  50. Presto Data Types (cont.)
    • Date and Time
    • DATE. Calendar date (year, month,
    day).
    • TIME. Time of day (hour, minute,
    second, millisecond) without a time
    zone. Values of this type are parsed
    and rendered in the session time
    zone.
    • TIME WITH TIME ZONE. Time of day
    (hour, minute, second, millisecond)
    with a time zone. Values of this type
    are rendered using the time zone
    from the value.
    • TIMESTAMP. Instant in time that
    includes the date and time of day
    without a time zone. Values of this
    type are parsed and rendered in
    the session time zone.
    • TIMESTAMP WITH TIME ZONE. Instant
    in time that includes the date and time of
    day with a time zone. Values of this type
    are rendered using the time zone from
    the value.
    • INTERVAL YEAR TO MONTH. Span of
    years and months.
    • INTERVAL DAY TO SECOND. Span of
    days, hours, minutes, seconds and
    milliseconds.

    View Slide

  51. Metadata Catalogs

    View Slide

  52. Metadata Catalog
    Cloud Data Lake
    Open
    Formats
    Storage

    View Slide

  53. How do I know which files are make up the
    transactions table?
    Cloud Data Lake
    SQL Query Processing
    SELECT category, sum(amt) as total
    FROM glue.pq.transactions GROUP BY category;
    ?

    View Slide

  54. Metadata Catalog
    Cloud Data Lake
    SQL Query Processing
    SELECT category, sum(amt) as total
    FROM glue.pq.transactions GROUP BY category;
    Metadata Catalog
    Table transactions is located at:
    s3://ahana-labs-presto/glue/pq/transactions
    s3://ahana-labs-presto/glue/pq/transactions

    View Slide

  55. Data Catalog to Storage (S3) Mapping

    View Slide

  56. Hive Metastore

    View Slide

  57. Ahana-Managed Hive Metastore
    • For convenience, Ahana
    Cloud provides an
    out-of-the-box Hive
    Metastore (HMS) if you need
    a metadata catalog

    View Slide

  58. AWS Glue
    • AWS Glue provide a Glue
    metadata catalog which
    integrates nicely with
    Amazon S3

    View Slide

  59. Metadata Catalog
    • More than just file locations.
    • Can include other metadata,
    which can be useful to
    optimizing queries

    View Slide

  60. Optional: Federated Querying

    View Slide

  61. Federated Querying

    View Slide

  62. Advanced Topics

    View Slide

  63. Advanced Topics
    63
    • Optimizing data layout (e.g. partitioning) for performance
    • Data management capabilities
    • Fine grained access control
    • Table formats for transactions, schema evolution, table
    versioning: Apache Hudi, Apache Iceberg, Delta

    View Slide

  64. Open Data Lakehouse
    The Next EDW is the Open Data Lakehouse
    Data Science,
    ML, & AI
    Reporting and Dashboarding
    Data Warehouse
    Proprietary Storage
    Proprietary
    SQL Query Processing
    Governance,
    Discovery,
    Quality &
    Security
    ML and AI
    Frameworks
    SQL Query Processing
    Cloud Data Lake
    Open
    Formats
    Storage
    Reporting and Dashboarding

    View Slide