SQL on the Data Lake: Presto on AWS S3

SQL on the Data Lake Presto on Amazon S3

Agenda 2 • Data warehouses and data lake • Introduction
to Presto • Presto Architecture • Presto CLI • Presto Data Source Model • Open File Formats • Metadata Catalogs • Federated Querying • Advanced Topics

Data Warehouse and Data Lakes

Enterprise Data Warehouse (EDW) Data Warehouse Proprietary Storage Proprietary SQL
Query Processing Reporting and Dashboarding

Cloud Data Lake Reporting and Dashboarding Data Warehouse Proprietary Storage
Proprietary SQL Query Processing Cloud Data Lake Open Formats Storage

Data Lake Storage 6 • Scalable, low-cost ﬁle storage. The
scalability translates to the implementation as some distributed system. Low-cost is a relative term, and it’s typically in contrast to expensive specialized hardware. • Accessible to scalable compute engines

Hadoop Distributed File System (HDFS) • Storage layer for Hadoop,
the open source stack that ushered in Big Data to the mainstream in mid-2000s. • Distributed ﬁlesystem on commodity hardware (assumes hardware failure as ﬁrst-class citizen) • “Data lake 1.0”

Object Stores 8 • Object stores, particularly cloud-based ones, have
become the storage of choice for modern data lakes. • Also stores files, but does not manage the files in a file hierarchy like filesystems. Simple PUT, GET, DELETE, and LIST.

Amazon S3 (Simple Storage Service) • Amazon object storage service:
industry-leading scalability, data availability, security, and performance • Introduced in 2006 • Over 100 trillion objects worldwide • Inexpensive: $0.023 per GB

Amazon S3 (Simple Storage Service)

Cloud Data Lake Reporting and Dashboarding Data Warehouse Proprietary Storage
Proprietary SQL Query Processing Cloud Data Lake Open Formats Storage

Querying Data on S3 Reporting and Dashboarding Data Warehouse Proprietary
Storage Proprietary SQL Query Processing SQL Query Processing Cloud Data Lake Open Formats Storage Reporting and Dashboarding

Introduction to Presto

Presto

Hadoop MapReduce

Apache Hive • Open sourced in 2008 and originally SQL
on top of HDFS • SQL-like dialect: HiveQL

Apache Hive

Presto • Successor to Hive for large-scale SQL. In-memory for
low-latency performance. • Introduced in 2013 by Meta (Facebook)

19 Presto Users

Presto and Linux Foundation

Why Presto? 21 • Scalable → Big data • Low-latency
→ Interactive experience, great for ad hoc use cases • Pluggable architecture to support multiple data sources • ANSI SQL → Lingua franca of data literate population (e.g. data analysts) • Open source and open governance → Diverse innovation, no vendor lock in

Presto Architecture

Coordinator and Workers • Coordinator. The “brains.”. Connects to clients,
parses and plans queries, manages the Workers. • Worker. The “workhorse.” Executes tasks, fetches and processes data. Works together for distributed processing.

Connector • Interface to data sources • Creates a table
abstraction (e.g. tables, rows, columns) to underlying data source

Architecture Details

Demo: Presto CLI

Connecting to Presto

Presto CLI ./presto --server https://c1.ahanademo1.cp.ahana.cloud --password --user presto SELECT *
FROM tpch.sf100.customer LIMIT 10;

Presto Data Source Model

TPC-H Benchmark

TPC-H • Decision support benchmark widely used for database performance
• Presto has a TPC-H connector built in with TPCH data.

Presto Data Source Model • Data is organized in units
of catalogs, schemas, and tables. • A table is like a database table. • Remember, though, Presto can connector to data sources that are NOT databases underneath. As we know, a data lake is just a bunch of ﬁles. tpch sf1 sf100 customer orders customer Table Schema Catalog

Catalog • Represents a conﬁgured connector • You can have
multiple catalogs using the same connector type. For example: I can have multiple MySQL catalogs to different or the same MySQL instance. SHOW CATALOGS;

Catalog (cont.) 34 The term catalog is overloaded. If we
need to disambiguate, we can call it the Presto catalog.

Schema SHOW SCHEMAS in tpch; • Abstraction to organize a
collection of tables • How it maps to the underlying data source is through the connector design. For relational sources, it’s straightforward; others, such as NoSQL, will be what makes sense for that source.

Tables • Unordered rows with named typed columns SHOW TABLES
in tpch.sf100;

Referencing Tables 37 • Tables are referenced by a fully
qualiﬁed name: catalog_name.schema_name.table_name Example: tpch.sf100.customer • Typing the fully qualiﬁed name can be cumbersome, especially if you primarily working in the same schema. You can tell Presto to scope assume a particular catalog-schema scope with: USE catalog_name.schema_name; Example: USE tpch.sf100;

Lab Tables

Sales Transactions transactions

Customers customers

Ahana Cloud for Presto

Ahana Cloud For Presto 1. Zero to Presto in 30
Minutes. Managed cloud service: No installation and conﬁguration. 2. Built for data teams of all experience level. 3. Moderate level of control of deployment without complexity. 4. Dedicated support from Presto experts.

Personas

Open File Formats

Comma Separated Values (CSV) • Text ﬁle where column values
are separated by commas "txn_datetime","cust_id","cc_num","category","amt" 2018-01-01 00:57:44,"1300","6963165313246376","gas_transport",13.05 2018-01-01 01:42:01,"6631","6761206684819706","kids_pets",43.74 2018-01-01 02:01:37,"3569","676602124466","grocery_pos",6.4 2018-01-01 02:08:38,"5670","4828986262424457062","gas_transport",66.15 2018-01-01 02:49:08,"2779","1205887406197654","home",89.29

JavaScript Object Notation (JSON) • Text ﬁle key-value parts (can
be complex and nested) {"txn_datetime":"2018-01-01 00:57:44","cust_id":"1300","cc_num":"6963165313246376","category":"gas _transport","amt":13.05} {"txn_datetime":"2018-01-01 01:42:01","cust_id":"6631","cc_num":"6761206684819706","category":"kid s_pets","amt":43.74} {"txn_datetime":"2018-01-01 02:01:37","cust_id":"3569","cc_num":"676602124466","category":"grocery _pos","amt":6.4} {"txn_datetime":"2018-01-01 02:08:38","cust_id":"5670","cc_num":"4828986262424457062","category":" gas_transport","amt":66.15}

Optimized Row Columnar (ORC) • Open-source column-oriented ﬁle format

Parquet • Open-source column-oriented ﬁle format

Presto Data Types • Boolean. This type captures boolean values
true and false. • Integer Types • TINYINT. 8-bit signed 2’s comp. (min = -2^7, max = 2^7 - 1) • SMALLINT. 16-bit signed 2’s comp. (min = -2^15, max = 2^15 - 1) • INTEGER. 32-bit signed 2’s comp. (min = -2^31, max = 2^31 - 1) • BIGINT. 64-bit signed 2’s comp. (min = -2^63, max = 2^63 - 1) • Floating-Point Types • REAL. 32-bit inexact, variable-precision; IEEE 754 • DOUBLE. 64-bit inexact, variable-precision; IEEE 754 • Fixed-Precision Types • DECIMAL. Precision up to 38 digits. Performance is best up to 18 digits. • String Types • VARCHAR. Variable length character data with an optional maximum length. • CHAR. Fixed length character data • VARBINARY. Variable length binary data. • JSON. JSON value type: JSON object, a JSON array, a JSON number, a JSON string, true, false or null.

Presto Data Types (cont.) • Date and Time • DATE.
Calendar date (year, month, day). • TIME. Time of day (hour, minute, second, millisecond) without a time zone. Values of this type are parsed and rendered in the session time zone. • TIME WITH TIME ZONE. Time of day (hour, minute, second, millisecond) with a time zone. Values of this type are rendered using the time zone from the value. • TIMESTAMP. Instant in time that includes the date and time of day without a time zone. Values of this type are parsed and rendered in the session time zone. • TIMESTAMP WITH TIME ZONE. Instant in time that includes the date and time of day with a time zone. Values of this type are rendered using the time zone from the value. • INTERVAL YEAR TO MONTH. Span of years and months. • INTERVAL DAY TO SECOND. Span of days, hours, minutes, seconds and milliseconds.

Metadata Catalogs

Metadata Catalog Cloud Data Lake Open Formats Storage

How do I know which ﬁles are make up the
transactions table? Cloud Data Lake SQL Query Processing SELECT category, sum(amt) as total FROM glue.pq.transactions GROUP BY category; ?

Metadata Catalog Cloud Data Lake SQL Query Processing SELECT category,
sum(amt) as total FROM glue.pq.transactions GROUP BY category; Metadata Catalog Table transactions is located at: s3://ahana-labs-presto/glue/pq/transactions s3://ahana-labs-presto/glue/pq/transactions

Data Catalog to Storage (S3) Mapping

Hive Metastore

Ahana-Managed Hive Metastore • For convenience, Ahana Cloud provides an
out-of-the-box Hive Metastore (HMS) if you need a metadata catalog

AWS Glue • AWS Glue provide a Glue metadata catalog
which integrates nicely with Amazon S3

Metadata Catalog • More than just ﬁle locations. • Can
include other metadata, which can be useful to optimizing queries

Optional: Federated Querying

Federated Querying

Advanced Topics

Advanced Topics 63 • Optimizing data layout (e.g. partitioning) for
performance • Data management capabilities • Fine grained access control • Table formats for transactions, schema evolution, table versioning: Apache Hudi, Apache Iceberg, Delta

Open Data Lakehouse The Next EDW is the Open Data
Lakehouse Data Science, ML, & AI Reporting and Dashboarding Data Warehouse Proprietary Storage Proprietary SQL Query Processing Governance, Discovery, Quality & Security ML and AI Frameworks SQL Query Processing Cloud Data Lake Open Formats Storage Reporting and Dashboarding

SQL on the Data Lake: Presto on AWS S3

SQL on the Data Lake: Presto on AWS S3

More Decks by Ahana

Other Decks in Technology

Featured

Transcript