Upgrade to Pro — share decks privately, control downloads, hide ads and more …

SQL on the Data Lake: Presto on AWS S3

Ahana
June 22, 2022

SQL on the Data Lake: Presto on AWS S3

Slides from the June 2022 Virtual Lab - SQL on the Data Lake: Presto on AWS S3

Ahana

June 22, 2022
Tweet

More Decks by Ahana

Other Decks in Technology

Transcript

  1. Agenda 2 • Data warehouses and data lake • Introduction

    to Presto • Presto Architecture • Presto CLI • Presto Data Source Model • Open File Formats • Metadata Catalogs • Federated Querying • Advanced Topics
  2. Cloud Data Lake Reporting and Dashboarding Data Warehouse Proprietary Storage

    Proprietary SQL Query Processing Cloud Data Lake Open Formats Storage
  3. Data Lake Storage 6 • Scalable, low-cost file storage. The

    scalability translates to the implementation as some distributed system. Low-cost is a relative term, and it’s typically in contrast to expensive specialized hardware. • Accessible to scalable compute engines
  4. Hadoop Distributed File System (HDFS) • Storage layer for Hadoop,

    the open source stack that ushered in Big Data to the mainstream in mid-2000s. • Distributed filesystem on commodity hardware (assumes hardware failure as first-class citizen) • “Data lake 1.0”
  5. Object Stores 8 • Object stores, particularly cloud-based ones, have

    become the storage of choice for modern data lakes. • Also stores files, but does not manage the files in a file hierarchy like filesystems. Simple PUT, GET, DELETE, and LIST.
  6. Amazon S3 (Simple Storage Service) • Amazon object storage service:

    industry-leading scalability, data availability, security, and performance • Introduced in 2006 • Over 100 trillion objects worldwide • Inexpensive: $0.023 per GB
  7. Cloud Data Lake Reporting and Dashboarding Data Warehouse Proprietary Storage

    Proprietary SQL Query Processing Cloud Data Lake Open Formats Storage
  8. Querying Data on S3 Reporting and Dashboarding Data Warehouse Proprietary

    Storage Proprietary SQL Query Processing SQL Query Processing Cloud Data Lake Open Formats Storage Reporting and Dashboarding
  9. Apache Hive • Open sourced in 2008 and originally SQL

    on top of HDFS • SQL-like dialect: HiveQL
  10. Presto • Successor to Hive for large-scale SQL. In-memory for

    low-latency performance. • Introduced in 2013 by Meta (Facebook)
  11. Why Presto? 21 • Scalable → Big data • Low-latency

    → Interactive experience, great for ad hoc use cases • Pluggable architecture to support multiple data sources • ANSI SQL → Lingua franca of data literate population (e.g. data analysts) • Open source and open governance → Diverse innovation, no vendor lock in
  12. Coordinator and Workers • Coordinator. The “brains.”. Connects to clients,

    parses and plans queries, manages the Workers. • Worker. The “workhorse.” Executes tasks, fetches and processes data. Works together for distributed processing.
  13. Connector • Interface to data sources • Creates a table

    abstraction (e.g. tables, rows, columns) to underlying data source
  14. TPC-H • Decision support benchmark widely used for database performance

    • Presto has a TPC-H connector built in with TPCH data.
  15. Presto Data Source Model • Data is organized in units

    of catalogs, schemas, and tables. • A table is like a database table. • Remember, though, Presto can connector to data sources that are NOT databases underneath. As we know, a data lake is just a bunch of files. tpch sf1 sf100 customer orders customer Table Schema Catalog
  16. Catalog • Represents a configured connector • You can have

    multiple catalogs using the same connector type. For example: I can have multiple MySQL catalogs to different or the same MySQL instance. SHOW CATALOGS;
  17. Catalog (cont.) 34 The term catalog is overloaded. If we

    need to disambiguate, we can call it the Presto catalog.
  18. Schema SHOW SCHEMAS in tpch; • Abstraction to organize a

    collection of tables • How it maps to the underlying data source is through the connector design. For relational sources, it’s straightforward; others, such as NoSQL, will be what makes sense for that source.
  19. Referencing Tables 37 • Tables are referenced by a fully

    qualified name: catalog_name.schema_name.table_name Example: tpch.sf100.customer • Typing the fully qualified name can be cumbersome, especially if you primarily working in the same schema. You can tell Presto to scope assume a particular catalog-schema scope with: USE catalog_name.schema_name; Example: USE tpch.sf100;
  20. Ahana Cloud For Presto 1. Zero to Presto in 30

    Minutes. Managed cloud service: No installation and configuration. 2. Built for data teams of all experience level. 3. Moderate level of control of deployment without complexity. 4. Dedicated support from Presto experts.
  21. Comma Separated Values (CSV) • Text file where column values

    are separated by commas "txn_datetime","cust_id","cc_num","category","amt" 2018-01-01 00:57:44,"1300","6963165313246376","gas_transport",13.05 2018-01-01 01:42:01,"6631","6761206684819706","kids_pets",43.74 2018-01-01 02:01:37,"3569","676602124466","grocery_pos",6.4 2018-01-01 02:08:38,"5670","4828986262424457062","gas_transport",66.15 2018-01-01 02:49:08,"2779","1205887406197654","home",89.29
  22. JavaScript Object Notation (JSON) • Text file key-value parts (can

    be complex and nested) {"txn_datetime":"2018-01-01 00:57:44","cust_id":"1300","cc_num":"6963165313246376","category":"gas _transport","amt":13.05} {"txn_datetime":"2018-01-01 01:42:01","cust_id":"6631","cc_num":"6761206684819706","category":"kid s_pets","amt":43.74} {"txn_datetime":"2018-01-01 02:01:37","cust_id":"3569","cc_num":"676602124466","category":"grocery _pos","amt":6.4} {"txn_datetime":"2018-01-01 02:08:38","cust_id":"5670","cc_num":"4828986262424457062","category":" gas_transport","amt":66.15}
  23. Presto Data Types • Boolean. This type captures boolean values

    true and false. • Integer Types • TINYINT. 8-bit signed 2’s comp. (min = -2^7, max = 2^7 - 1) • SMALLINT. 16-bit signed 2’s comp. (min = -2^15, max = 2^15 - 1) • INTEGER. 32-bit signed 2’s comp. (min = -2^31, max = 2^31 - 1) • BIGINT. 64-bit signed 2’s comp. (min = -2^63, max = 2^63 - 1) • Floating-Point Types • REAL. 32-bit inexact, variable-precision; IEEE 754 • DOUBLE. 64-bit inexact, variable-precision; IEEE 754 • Fixed-Precision Types • DECIMAL. Precision up to 38 digits. Performance is best up to 18 digits. • String Types • VARCHAR. Variable length character data with an optional maximum length. • CHAR. Fixed length character data • VARBINARY. Variable length binary data. • JSON. JSON value type: JSON object, a JSON array, a JSON number, a JSON string, true, false or null.
  24. Presto Data Types (cont.) • Date and Time • DATE.

    Calendar date (year, month, day). • TIME. Time of day (hour, minute, second, millisecond) without a time zone. Values of this type are parsed and rendered in the session time zone. • TIME WITH TIME ZONE. Time of day (hour, minute, second, millisecond) with a time zone. Values of this type are rendered using the time zone from the value. • TIMESTAMP. Instant in time that includes the date and time of day without a time zone. Values of this type are parsed and rendered in the session time zone. • TIMESTAMP WITH TIME ZONE. Instant in time that includes the date and time of day with a time zone. Values of this type are rendered using the time zone from the value. • INTERVAL YEAR TO MONTH. Span of years and months. • INTERVAL DAY TO SECOND. Span of days, hours, minutes, seconds and milliseconds.
  25. How do I know which files are make up the

    transactions table? Cloud Data Lake SQL Query Processing SELECT category, sum(amt) as total FROM glue.pq.transactions GROUP BY category; ?
  26. Metadata Catalog Cloud Data Lake SQL Query Processing SELECT category,

    sum(amt) as total FROM glue.pq.transactions GROUP BY category; Metadata Catalog Table transactions is located at: s3://ahana-labs-presto/glue/pq/transactions s3://ahana-labs-presto/glue/pq/transactions
  27. Ahana-Managed Hive Metastore • For convenience, Ahana Cloud provides an

    out-of-the-box Hive Metastore (HMS) if you need a metadata catalog
  28. AWS Glue • AWS Glue provide a Glue metadata catalog

    which integrates nicely with Amazon S3
  29. Metadata Catalog • More than just file locations. • Can

    include other metadata, which can be useful to optimizing queries
  30. Advanced Topics 63 • Optimizing data layout (e.g. partitioning) for

    performance • Data management capabilities • Fine grained access control • Table formats for transactions, schema evolution, table versioning: Apache Hudi, Apache Iceberg, Delta
  31. Open Data Lakehouse The Next EDW is the Open Data

    Lakehouse Data Science, ML, & AI Reporting and Dashboarding Data Warehouse Proprietary Storage Proprietary SQL Query Processing Governance, Discovery, Quality & Security ML and AI Frameworks SQL Query Processing Cloud Data Lake Open Formats Storage Reporting and Dashboarding