Introduction to Presto, an open source distributed SQL engine

Dipti Borkar, Co-Founder & CPO | Ahana
An introduction to Presto, an open source distributed SQL engine

2 Agenda
• What is Presto?
• History of federation
• Introduction to Presto
• What made Presto different?
• Scalable architecture
• Flexible connectors
• Performance
• The life of a query

3 Technology Cycles Rhyme: Data Federation
Timeline: 80's → 90's → 2000's → 2010's → 2020's
RDBMS → FDBMS
• FDBMS paper by McLeod / Heimbigner (1985)
• FDBMS paper by Sheth / Larson (1990)
FDBMS challenges: OLTP to DW wins
• Data warehouse becomes the source of truth
• Star schema becomes sacred
Cloud & Big Data
• Composite Software (founded 2001)
• Garlic paper by Laura Haas (2002) → DB2 Federated
• Google File System paper (2003)
• MapReduce paper (2004)
• Spark paper (2010)
Too many data sources, no one uber schema
• New cloud DW w/ data lakes
• Based on SQL
• Self-service platforms which enable self-service analytics
SQL federation makes a comeback
• Dremel paper (2010) → Drill paper (2012)
• SQL++ paper (2014) → Couchbase SQL++ engine (2018)
• Presto paper (2019), PartiQL (2019)

4 Presto: One of the Fastest Growing Open Source Projects in Data Analytics
Business Needs
• Data-driven decision making
• Businesses need more data to iterate over
Technology Trends
• Disaggregation of storage and compute
• The rise of data lakes

5 What is Presto?
• Distributed SQL query engine
• ANSI SQL on databases and data lakes
• Designed to be interactive
• Access to petabytes of data
• Open source, hosted on GitHub
• https://github.com/prestodb

6 Presto Overview

7 Common Questions
• Is Presto a database?
• How is it related to Hadoop?
• How is it different from a data warehouse?

8 Sample Presto deployment stack & use cases
• Ad hoc
• BI tools
• Dashboard
• A/B testing
• ETL/scheduled job
• Online service

9 What made Presto different?
• Scalable architecture
• Pluggable connectors
• Performance

10 Scalable Architecture
• Two roles: coordinator and worker
• Easy to scale up and scale down
• Scales up to 1000 workers
• Validated at web-scale companies
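
The two roles can be pictured with a toy sketch. It assumes a deliberately simplified model in which the coordinator divides the input into splits, hands them to workers, and combines the partial results. The names `coordinator` and `worker_task` are hypothetical, threads stand in for worker nodes, and the final merge happens in the coordinator function only for brevity; none of this is Presto's actual scheduler.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def worker_task(split: List[int]) -> int:
    """Worker role (hypothetical): process one split, return a partial sum."""
    return sum(split)


def coordinator(rows: List[int], num_workers: int) -> int:
    """Coordinator role (hypothetical): split the input, schedule, and merge."""
    # One split per worker; a real engine would derive splits from the storage layout.
    splits = [rows[i::num_workers] for i in range(num_workers)]
    # Threads stand in for worker nodes in this sketch.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker_task, splits))
    return sum(partials)


rows = list(range(1, 101))
# The answer does not depend on how many workers share the splits.
print(coordinator(rows, num_workers=2), coordinator(rows, num_workers=8))
```

Changing `num_workers` changes only how the work is divided, which is the property that makes scaling the worker pool up or down straightforward in this simplified picture.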

11 Scalable Architecture

12 Pluggable Presto Connectors

13 Presto Connector Data Model
• Connector: Driver for a data source. Examples: HDFS, AWS S3, Cassandra, MySQL, SQL Server, Kafka
• Catalog: Contains schemas from a data source specified by the connector
• Schema: Namespace to organize tables
• Table: Set of unordered rows organized into columns with types
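
The hierarchy above can be sketched as nested lookups, where a table is addressed by a fully qualified `catalog.schema.table` name. This is a minimal illustration, not Presto's metadata API: the classes, the `resolve` helper, and the catalog, schema, and table names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Table:
    # A table is a set of unordered rows organized into typed columns.
    columns: Dict[str, str]  # column name -> type name


@dataclass
class Schema:
    # A schema is a namespace that organizes tables.
    tables: Dict[str, Table] = field(default_factory=dict)


@dataclass
class Catalog:
    # A catalog exposes the schemas of one data source through a connector.
    connector: str
    schemas: Dict[str, Schema] = field(default_factory=dict)


def resolve(catalogs: Dict[str, Catalog], name: str) -> Table:
    """Look up a table by its fully qualified catalog.schema.table name."""
    catalog_name, schema_name, table_name = name.split(".")
    return catalogs[catalog_name].schemas[schema_name].tables[table_name]


# Hypothetical catalogs; every name below is made up for illustration.
catalogs = {
    "lake": Catalog("hive", {"web": Schema({"clicks": Table({"user_id": "bigint", "url": "varchar"})})}),
    "appdb": Catalog("mysql", {"crm": Schema({"users": Table({"id": "bigint", "name": "varchar"})})}),
}

print(resolve(catalogs, "lake.web.clicks").columns)
```

Because each catalog carries its own connector, a single query can name tables from different catalogs, which is what makes federation across data sources possible.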

14 Presto Hive Connector for Object Stores & File Systems

15 Presto Hive Connector – Access Control

16 Presto Hive Connector – Data File Types
Supported file types
• ORC
• Parquet
• Avro
• RCFile
• SequenceFile
• JSON
• Text
No data ingestion needed

17 Presto Druid Connector for real-time analytics

18 Why Presto is Fast
• In-memory processing
• Pull model
• Columnar storage and execution
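
As a rough illustration of the last two bullets, the sketch below chains operators as Python generators, so that each operator produces a batch only when the operator downstream asks for one, and each batch is stored column by column. This is a toy model under those assumptions, not Presto's execution engine; the operator names and the sample data are hypothetical.

```python
from typing import Dict, Iterator, List

Batch = Dict[str, List[int]]  # column name -> values for one batch of rows


def scan(columns: Dict[str, List[int]], batch_size: int) -> Iterator[Batch]:
    """Source operator: yields column batches only when the consumer asks."""
    n = len(next(iter(columns.values())))
    for start in range(0, n, batch_size):
        yield {name: vals[start:start + batch_size] for name, vals in columns.items()}


def filter_eq(batches: Iterator[Batch], column: str, value: int) -> Iterator[Batch]:
    """Keeps the rows where `column` equals `value`, one batch at a time."""
    for batch in batches:
        keep = [i for i, v in enumerate(batch[column]) if v == value]
        yield {name: [vals[i] for i in keep] for name, vals in batch.items()}


def sum_column(batches: Iterator[Batch], column: str) -> int:
    """Sink operator: pulls batches from upstream and reads a single column."""
    return sum(sum(batch[column]) for batch in batches)


# Hypothetical data, stored as columns rather than rows.
table = {"discount": [0, 5, 0, 0], "tax": [10, 20, 30, 40]}
total = sum_column(filter_eq(scan(table, batch_size=2), "discount", 0), "tax")
print(total)  # 80
```

Nothing is materialized between operators beyond the current batch, and the final sum reads only the `tax` column, which hints at why in-memory, columnar pipelines can be fast.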

19 The Life of a Query – Simple Scan

20 The Life of a Query – Join and Aggregation
    SELECT orders.orderkey, SUM(tax)
    FROM orders
    LEFT JOIN lineitem ON orders.orderkey = lineitem.orderkey
    WHERE discount = 0
    GROUP BY orders.orderkey
This example is from Presto: SQL on Everything
https://research.fb.com/publications/presto-sql-on-everything/
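
To make the result of this query concrete, the sketch below runs it on a few made-up rows. SQLite from the Python standard library is used only as a stand-in for Presto, the table contents are hypothetical, and an `ORDER BY` is appended purely to keep the output deterministic.

```python
import sqlite3

QUERY = """
SELECT orders.orderkey, SUM(tax)
FROM orders
LEFT JOIN lineitem ON orders.orderkey = lineitem.orderkey
WHERE discount = 0
GROUP BY orders.orderkey
ORDER BY orders.orderkey  -- added here only to make the output deterministic
"""


def run_example() -> list:
    """Run the slide's query on a tiny, made-up data set (SQLite stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (orderkey INTEGER);
        CREATE TABLE lineitem (orderkey INTEGER, tax REAL, discount REAL);
        INSERT INTO orders VALUES (1), (2), (3);
        INSERT INTO lineitem VALUES
            (1, 0.5, 0.0), (1, 0.25, 0.0), (1, 1.0, 0.1), (2, 0.125, 0.05);
        """
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


print(run_example())  # [(1, 0.75)]
```

Orders 2 and 3 drop out of the result: order 2 has no line item with a zero discount, and order 3 has no line items at all, so its `discount` is NULL after the left join and fails the `WHERE` test. On this data the filter on the right-hand table makes the left join behave like an inner join.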

21 Logical Plan - Do NOT Join Two Big Tables

22 Limitations
• Memory limitation (some spill-over support)
• Fault tolerance
• Single coordinator

Ahana Overview

24 Ahana At A Glance
• First PrestoDB-based company
• Named Best Big Data Startup of 2020 by Datanami
• Named CRN Top 10 Big Data Startup of 2020
• Investment from Google Ventures, Lux Ventures, Leslie Ventures
• Team of experts in cloud, database, and Presto
• Premier member of

26 Managing Presto Remains Complex
Hadoop complexity
▪ /etc/presto/config.properties
▪ /etc/presto/node.properties
▪ /etc/presto/jvm.config
▪ Many hidden parameters – difficult to tune
Just the query engine
▪ No built-in catalog – users need to manage Hive metastore or AWS Glue
▪ No data lake S3 integration
Poor out-of-box performance
▪ No tuning
▪ No high-performance indexing
▪ Only basic optimizations, even for common queries

27 How Ahana Cloud Works
• ~30 mins to create the compute plane
• https://app.ahana.cloud/signup
• Create Presto clusters in your account

28 Ahana Cloud – Reference Architecture
• Distributed SQL engine with proven scalability
• Interactive ANSI SQL queries
• Query data where it lives with federated connectors (no ETL)
• High concurrency
• Separation of compute and storage

29 Ahana Cloud for Presto
Ahana Cloud Account: Ahana Console (Control Plane)
• Cluster orchestration
• Consolidated logging
• Security & access
• Billing & support
Ahana console oversees and manages every Presto cluster
Customer Cloud Account: In-VPC Presto Clusters (Compute Plane)
• Ad hoc cluster 1
• Test cluster 2
• Prod cluster N
• Glue, S3, RDS, Elasticsearch
In-VPC orchestration of Presto clusters, where metadata, monitoring, and data sources reside

30 Ahana Cloud – Seamless Cluster Operations
[Diagram: a compute plane with clusters ReportingProd and DataEnggJobs, each a coordinator with workers, plus metastores and scale up/down; a user data plane with SumUp's Redshift, MySQL, Postgres, MongoDB (SSL / HTTPS)]
Operations shown
• Create 4 node cluster
• Add data source & auto-restart
• Create 2 node cluster
• Re-size
• Stop ($0 when stopped)
• Start cluster w/ saved config & data sources attached
With AWS EMR
▪ No click-button cluster restart, stop & start, or auto-restarts for catalog changes
▪ Cluster & data source configs and metastores are not preserved
▪ Re-started clusters are not auto upgraded to the latest Presto version

31 Ahana Cloud Summary
Ahana Cloud for Presto: Point & Query Cloud Service
Gives you Presto as a Cloud Data Warehouse in an open, disaggregated stack
• Managed Presto in-VPC in user account
• Built-in metadata catalog, data lake, Apache Superset
• Start, stop, restart, resize, terminate – end-to-end cluster life cycle management
• Amazon sources: S3, RDS/MySQL, RDS/Postgres, Elasticsearch, Redshift
• Highly available & scalable, running in containers on Kubernetes across AZs
• Flexible analytics stack with BYO metadata, data source, BI tool or notebook

Open source Presto

33 Presto Foundation: Community Driven

An Innovative 1st Year
• Project Aria
• RaptorX
• Presto-on-Spark
• Disaggregated Coordinator (a.k.a. Fireball)
• SQL Functions
• UDF Support
• Pinot Connector
• Druid Connector
• … and more

36 PrestoDB Advancements by the Community
1. Improved planner via Project Aria: PrestoDB can now push down entire expressions to the data source for some file formats like ORC.
   https://prestodb.io/blog/2019/12/23/improve-presto-planner
   https://engineering.fb.com/data-infrastructure/aria-presto/
2. Grouped execution of non-partitioned tables via Project Presto Unlimited
   https://prestodb.io/blog/2019/08/05/presto-unlimited-mpp-database-at-scale
   https://github.com/prestodb/presto/issues/12124
3. UDFs: dynamic SQL functions support
   https://prestodb.io/docs/current/admin/function-namespace-managers.html
4. Connectors: Pinot and Druid
   https://prestodb.io/docs/current/connector.html
   https://prestosql.io/docs/current/connector.html

40 Get Started
• Docker Sandbox for Presto: https://hub.docker.com/r/ahanaio/prestodb-sandbox
• AWS Sandbox AMI for Presto: https://ahana.io/tutorials/aws-sandbox/

41 Join the Presto Community
• Request a new feature or file a bug: github.com/prestodb/presto
• Meetup: meetup.com/prestodb
• Slack: prestodb.slack.com
• Twitter: @prestodb
Stay Up-to-Date with Ahana
• URL: ahana.io
• Twitter: @ahanaio

Q & A
And yes! We are hiring!

12/16/20

45 Data-Driven Companies Need Low Data Latency
Analysts and scientists need to answer questions.
“Data Latency” = the time it takes from a user having a question to the time they can actually answer it
1. User wants to track or explore some new data
2. User meets with Data Eng team to make a plan
3. Data team acquires data and checks access permissions
4. Data team builds and tests the ETLs and makes tables available to the user
5. Data team notifies the user so they can ask their questions
! Can be days or weeks of time
