SQL on the Data Lake: Presto on AWS S3

Slide 1

Slide 1 text

SQL on the Data Lake Presto on Amazon S3

Slide 2

Slide 2 text

Agenda 2 • Data warehouses and data lake • Introduction to Presto • Presto Architecture • Presto CLI • Presto Data Source Model • Open File Formats • Metadata Catalogs • Federated Querying • Advanced Topics

Slide 3

Slide 3 text

Data Warehouse and Data Lakes

Slide 4

Slide 4 text

Enterprise Data Warehouse (EDW) Data Warehouse Proprietary Storage Proprietary SQL Query Processing Reporting and Dashboarding

Slide 5

Slide 5 text

Cloud Data Lake Reporting and Dashboarding Data Warehouse Proprietary Storage Proprietary SQL Query Processing Cloud Data Lake Open Formats Storage

Slide 6

Slide 6 text

Data Lake Storage 6 • Scalable, low-cost ﬁle storage. The scalability translates to the implementation as some distributed system. Low-cost is a relative term, and it’s typically in contrast to expensive specialized hardware. • Accessible to scalable compute engines

Slide 7

Slide 7 text

Hadoop Distributed File System (HDFS) • Storage layer for Hadoop, the open source stack that ushered in Big Data to the mainstream in mid-2000s. • Distributed ﬁlesystem on commodity hardware (assumes hardware failure as ﬁrst-class citizen) • “Data lake 1.0”

Slide 8

Slide 8 text

Object Stores 8 • Object stores, particularly cloud-based ones, have become the storage of choice for modern data lakes. • Also stores files, but does not manage the files in a file hierarchy like filesystems. Simple PUT, GET, DELETE, and LIST.

Slide 9

Slide 9 text

Amazon S3 (Simple Storage Service) • Amazon object storage service: industry-leading scalability, data availability, security, and performance • Introduced in 2006 • Over 100 trillion objects worldwide • Inexpensive: $0.023 per GB

Slide 10

Slide 10 text

Amazon S3 (Simple Storage Service)

Slide 11

Slide 11 text

Cloud Data Lake Reporting and Dashboarding Data Warehouse Proprietary Storage Proprietary SQL Query Processing Cloud Data Lake Open Formats Storage

Slide 12

Slide 12 text

Querying Data on S3 Reporting and Dashboarding Data Warehouse Proprietary Storage Proprietary SQL Query Processing SQL Query Processing Cloud Data Lake Open Formats Storage Reporting and Dashboarding

Slide 13

Slide 13 text

Introduction to Presto

Slide 14

Slide 14 text

Presto

Slide 15

Slide 15 text

Hadoop MapReduce

Slide 16

Slide 16 text

Apache Hive • Open sourced in 2008 and originally SQL on top of HDFS • SQL-like dialect: HiveQL

Slide 17

Slide 17 text

Apache Hive

Slide 18

Slide 18 text

Presto • Successor to Hive for large-scale SQL. In-memory for low-latency performance. • Introduced in 2013 by Meta (Facebook)

Slide 19

Slide 19 text

19 Presto Users

Slide 20

Slide 20 text

Presto and Linux Foundation

Slide 21

Slide 21 text

Why Presto? 21 • Scalable → Big data • Low-latency → Interactive experience, great for ad hoc use cases • Pluggable architecture to support multiple data sources • ANSI SQL → Lingua franca of data literate population (e.g. data analysts) • Open source and open governance → Diverse innovation, no vendor lock in

Slide 22

Slide 22 text

Presto Architecture

Slide 23

Slide 23 text

Coordinator and Workers • Coordinator. The “brains.”. Connects to clients, parses and plans queries, manages the Workers. • Worker. The “workhorse.” Executes tasks, fetches and processes data. Works together for distributed processing.

Slide 24

Slide 24 text

Connector • Interface to data sources • Creates a table abstraction (e.g. tables, rows, columns) to underlying data source

Slide 25

Slide 25 text

Architecture Details

Slide 26

Slide 26 text

Demo: Presto CLI

Slide 27

Slide 27 text

Connecting to Presto

Slide 28

Slide 28 text

Presto CLI ./presto --server https://c1.ahanademo1.cp.ahana.cloud --password --user presto SELECT * FROM tpch.sf100.customer LIMIT 10;

Slide 29

Slide 29 text

Presto Data Source Model

Slide 30

Slide 30 text

TPC-H Benchmark

Slide 31

Slide 31 text

TPC-H • Decision support benchmark widely used for database performance • Presto has a TPC-H connector built in with TPCH data.

Slide 32

Slide 32 text

Presto Data Source Model • Data is organized in units of catalogs, schemas, and tables. • A table is like a database table. • Remember, though, Presto can connector to data sources that are NOT databases underneath. As we know, a data lake is just a bunch of ﬁles. tpch sf1 sf100 customer orders customer Table Schema Catalog

Slide 33

Slide 33 text

Catalog • Represents a conﬁgured connector • You can have multiple catalogs using the same connector type. For example: I can have multiple MySQL catalogs to different or the same MySQL instance. SHOW CATALOGS;

Slide 34

Slide 34 text

Catalog (cont.) 34 The term catalog is overloaded. If we need to disambiguate, we can call it the Presto catalog.

Slide 35

Slide 35 text

Schema SHOW SCHEMAS in tpch; • Abstraction to organize a collection of tables • How it maps to the underlying data source is through the connector design. For relational sources, it’s straightforward; others, such as NoSQL, will be what makes sense for that source.

Slide 36

Slide 36 text

Tables • Unordered rows with named typed columns SHOW TABLES in tpch.sf100;

Slide 37

Slide 37 text

Referencing Tables 37 • Tables are referenced by a fully qualiﬁed name: catalog_name.schema_name.table_name Example: tpch.sf100.customer • Typing the fully qualiﬁed name can be cumbersome, especially if you primarily working in the same schema. You can tell Presto to scope assume a particular catalog-schema scope with: USE catalog_name.schema_name; Example: USE tpch.sf100;

Slide 38

Slide 38 text

Lab Tables

Slide 39

Slide 39 text

Sales Transactions transactions

Slide 40

Slide 40 text

Customers customers

Slide 41

Slide 41 text

Ahana Cloud for Presto

Slide 42

Slide 42 text

Ahana Cloud For Presto 1. Zero to Presto in 30 Minutes. Managed cloud service: No installation and conﬁguration. 2. Built for data teams of all experience level. 3. Moderate level of control of deployment without complexity. 4. Dedicated support from Presto experts.

Slide 43

Slide 43 text

Personas

Slide 44

Slide 44 text

Open File Formats

Slide 45

Slide 45 text

Comma Separated Values (CSV) • Text ﬁle where column values are separated by commas "txn_datetime","cust_id","cc_num","category","amt" 2018-01-01 00:57:44,"1300","6963165313246376","gas_transport",13.05 2018-01-01 01:42:01,"6631","6761206684819706","kids_pets",43.74 2018-01-01 02:01:37,"3569","676602124466","grocery_pos",6.4 2018-01-01 02:08:38,"5670","4828986262424457062","gas_transport",66.15 2018-01-01 02:49:08,"2779","1205887406197654","home",89.29

Slide 46

Slide 46 text

JavaScript Object Notation (JSON) • Text ﬁle key-value parts (can be complex and nested) {"txn_datetime":"2018-01-01 00:57:44","cust_id":"1300","cc_num":"6963165313246376","category":"gas _transport","amt":13.05} {"txn_datetime":"2018-01-01 01:42:01","cust_id":"6631","cc_num":"6761206684819706","category":"kid s_pets","amt":43.74} {"txn_datetime":"2018-01-01 02:01:37","cust_id":"3569","cc_num":"676602124466","category":"grocery _pos","amt":6.4} {"txn_datetime":"2018-01-01 02:08:38","cust_id":"5670","cc_num":"4828986262424457062","category":" gas_transport","amt":66.15}

Slide 47

Slide 47 text

Optimized Row Columnar (ORC) • Open-source column-oriented ﬁle format

Slide 48

Slide 48 text

Parquet • Open-source column-oriented ﬁle format

Slide 49

Slide 49 text

Presto Data Types • Boolean. This type captures boolean values true and false. • Integer Types • TINYINT. 8-bit signed 2’s comp. (min = -2^7, max = 2^7 - 1) • SMALLINT. 16-bit signed 2’s comp. (min = -2^15, max = 2^15 - 1) • INTEGER. 32-bit signed 2’s comp. (min = -2^31, max = 2^31 - 1) • BIGINT. 64-bit signed 2’s comp. (min = -2^63, max = 2^63 - 1) • Floating-Point Types • REAL. 32-bit inexact, variable-precision; IEEE 754 • DOUBLE. 64-bit inexact, variable-precision; IEEE 754 • Fixed-Precision Types • DECIMAL. Precision up to 38 digits. Performance is best up to 18 digits. • String Types • VARCHAR. Variable length character data with an optional maximum length. • CHAR. Fixed length character data • VARBINARY. Variable length binary data. • JSON. JSON value type: JSON object, a JSON array, a JSON number, a JSON string, true, false or null.

Slide 50

Slide 50 text

Presto Data Types (cont.) • Date and Time • DATE. Calendar date (year, month, day). • TIME. Time of day (hour, minute, second, millisecond) without a time zone. Values of this type are parsed and rendered in the session time zone. • TIME WITH TIME ZONE. Time of day (hour, minute, second, millisecond) with a time zone. Values of this type are rendered using the time zone from the value. • TIMESTAMP. Instant in time that includes the date and time of day without a time zone. Values of this type are parsed and rendered in the session time zone. • TIMESTAMP WITH TIME ZONE. Instant in time that includes the date and time of day with a time zone. Values of this type are rendered using the time zone from the value. • INTERVAL YEAR TO MONTH. Span of years and months. • INTERVAL DAY TO SECOND. Span of days, hours, minutes, seconds and milliseconds.

Slide 51

Slide 51 text

Metadata Catalogs

Slide 52

Slide 52 text

Metadata Catalog Cloud Data Lake Open Formats Storage

Slide 53

Slide 53 text

How do I know which ﬁles are make up the transactions table? Cloud Data Lake SQL Query Processing SELECT category, sum(amt) as total FROM glue.pq.transactions GROUP BY category; ?

Slide 54

Slide 54 text

Metadata Catalog Cloud Data Lake SQL Query Processing SELECT category, sum(amt) as total FROM glue.pq.transactions GROUP BY category; Metadata Catalog Table transactions is located at: s3://ahana-labs-presto/glue/pq/transactions s3://ahana-labs-presto/glue/pq/transactions

Slide 55

Slide 55 text

Data Catalog to Storage (S3) Mapping

Slide 56

Slide 56 text

Hive Metastore

Slide 57

Slide 57 text

Ahana-Managed Hive Metastore • For convenience, Ahana Cloud provides an out-of-the-box Hive Metastore (HMS) if you need a metadata catalog

Slide 58

Slide 58 text

AWS Glue • AWS Glue provide a Glue metadata catalog which integrates nicely with Amazon S3

Slide 59

Slide 59 text

Metadata Catalog • More than just ﬁle locations. • Can include other metadata, which can be useful to optimizing queries

Slide 60

Slide 60 text

Optional: Federated Querying

Slide 61

Slide 61 text

Federated Querying

Slide 62

Slide 62 text

Advanced Topics

Slide 63

Slide 63 text

Advanced Topics 63 • Optimizing data layout (e.g. partitioning) for performance • Data management capabilities • Fine grained access control • Table formats for transactions, schema evolution, table versioning: Apache Hudi, Apache Iceberg, Delta

Slide 64

Slide 64 text

Open Data Lakehouse The Next EDW is the Open Data Lakehouse Data Science, ML, & AI Reporting and Dashboarding Data Warehouse Proprietary Storage Proprietary SQL Query Processing Governance, Discovery, Quality & Security ML and AI Frameworks SQL Query Processing Cloud Data Lake Open Formats Storage Reporting and Dashboarding