Women in Big Data Meetup: An introduction to Presto, an open source distributed SQL engine

by Ahana

Slide 1

Slide 1 text

Dipti Borkar Co-Founder & CPO | Ahana An introduction to Presto, an open source distributed SQL engine

Slide 2

Slide 2 text

Founder Mom Immigrant Girl data geek (DB) Engineer always Product techie Team builder Open source believer Mixologist

Slide 3

Slide 3 text

3 Agenda • What is Presto? • History of federation • Introduction to Presto • What made Presto different? • Scalable architecture • Flexible Connectors • Performance • The life of a query

Slide 4

Slide 4 text

4 Technology Cycles Rhyme: Data Federation
• FDBMS Challenges RDBMS: FDBMS paper by McLeod / Heimbigner (1985); FDBMS paper by Sheth / Larson (1990)
• OLTP to DW Wins: Data Warehouse becomes the source of truth; Star schema becomes sacred
• Cloud & Big Data: Composite Software (founded 2001); Garlic paper by Laura Haas (2002) → DB2 Federated; Google File System paper (2003); MapReduce paper (2004); Spark paper (2010)
• Too many Data Sources, No one uber schema
• New Cloud DW w/ Data Lakes; Based on SQL; Self Service Platforms which enable Self-Service Analytics
• SQL Federation Makes Comeback: Dremel paper (2010) → Drill paper (2012); SQL++ paper (2014) → Couchbase SQL++ engine (2018); Presto paper (2019), PartiQL (2019)
• Timeline axis: 80’s, 90’s, 2000’s, 2010’s, 2020’s

Slide 5

Slide 5 text

5 Presto: One of the Fastest Growing Open Source Projects in Data Analytics Business Needs Data-driven decision making Businesses need more data to iterate over Technology Trends Disaggregation of Storage and Compute The rise of data lakes

Slide 6

Slide 6 text

6 What is Presto? • Distributed SQL query engine • ANSI SQL on databases and data lakes • Designed to be interactive • Access to petabytes of data • Open source, hosted on GitHub • https://github.com/prestodb

Slide 7

Slide 7 text

7 Presto Overview

Slide 8

Slide 8 text

8 Common Questions • Is Presto a database? • How is it related to Hadoop? • How is it different from a data warehouse?

Slide 9

Slide 9 text

9 Sample Presto deployment stack & use cases • Ad hoc • BI tools • Dashboard • A/B testing • ETL/scheduled job • Online service

Slide 10

Slide 10 text

10 What made Presto different? • Scalable architecture • Pluggable Connectors • Performance

Slide 11

Slide 11 text

11 Scalable Architecture • Two roles: coordinator and worker • Easy scale up and scale down • Scale up to 1000 workers • Validated at web-scale companies
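The two roles named on this slide can be illustrated with a toy sketch. This is not Presto's scheduler: the `Worker` and `Coordinator` classes, the round-robin policy, and the list-of-numbers "splits" are all assumptions made for illustration. It only shows why adding workers changes how work is spread, not the result.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker that processes only the splits assigned to it."""
    name: str
    splits: list = field(default_factory=list)

    def run(self):
        # Each worker produces a partial result for its own splits.
        return sum(sum(split) for split in self.splits)


class Coordinator:
    """Hypothetical coordinator: assigns splits, then merges partial results."""

    def __init__(self, workers):
        self.workers = workers

    def execute(self, splits):
        for worker in self.workers:
            worker.splits = []
        # Round-robin assignment; an illustrative policy only.
        for i, split in enumerate(splits):
            self.workers[i % len(self.workers)].splits.append(split)
        # Merge the partial results returned by the workers.
        return sum(worker.run() for worker in self.workers)


splits = [[1, 2], [3, 4], [5, 6], [7, 8]]
two = Coordinator([Worker("w1"), Worker("w2")]).execute(splits)
four = Coordinator([Worker(f"w{i}") for i in range(4)]).execute(splits)
print(two, four)
```

Both cluster sizes return the same total; only the number of splits each worker handles differs, which is the property that makes scaling up and down straightforward in this simplified picture.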

Slide 12

Slide 12 text

12 Scalable Architecture

Slide 13

Slide 13 text

13 Pluggable Presto Connectors

Slide 14

Slide 14 text

14 Presto Connector Data Model • Connector: Driver for a data source. • Examples: HDFS, AWS S3, Cassandra, MySQL, SQL Server, Kafka • Catalog: Contains schemas from a data source specified by the connector • Schema: Namespace to organize tables. • Table: Set of unordered rows organized into columns with types.
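The catalog / schema / table hierarchy above is what a fully qualified table name of the form catalog.schema.table addresses. A minimal sketch of that naming scheme follows; the `parse_table_name` helper, the `CATALOGS` registry, and the example names are hypothetical, and quoted identifiers are deliberately not handled.

```python
from typing import NamedTuple


class QualifiedTable(NamedTuple):
    catalog: str  # which configured data source (backed by a connector)
    schema: str   # namespace inside that catalog
    table: str    # rows and typed columns


# Hypothetical registry mapping catalog names to the connector behind them.
CATALOGS = {"hive": "Hive connector", "mysql": "MySQL connector"}


def parse_table_name(name: str) -> QualifiedTable:
    """Split 'catalog.schema.table' into its three parts (no quoting support)."""
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {name!r}")
    return QualifiedTable(*parts)


def connector_for(name: str) -> str:
    """Look up which connector would serve a fully qualified table name."""
    table = parse_table_name(name)
    if table.catalog not in CATALOGS:
        raise KeyError(f"unknown catalog {table.catalog!r}")
    return CATALOGS[table.catalog]


clicks = parse_table_name("hive.web.clicks")
print(clicks, connector_for("mysql.crm.customers"))
```

Because the catalog is part of the name, a single query can reference tables that live in different data sources.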

Slide 15

Slide 15 text

15 Presto Hive Connector for Object stores & File systems

Slide 16

Slide 16 text

16 Presto Hive Connector – Access Control

Slide 17

Slide 17 text

17 Presto Hive Connector – Data File Types • Supported File Types • ORC • Parquet • Avro • RCFile • SequenceFile • JSON • Text • No data ingestion needed

Slide 18

Slide 18 text

18 Presto Druid Connector for real-time analytics

Slide 19

Slide 19 text

19 Why Presto is Fast • In-Memory processing • Pull model • Columnar storage and execution
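A rough sketch of two of the ideas on this slide, columnar data and a pull model, using Python generators. This is an illustration under simplifying assumptions, not Presto's operator code: the `scan`, `filter_pages`, and `sum_column` functions and the sample table are hypothetical. Data is held as one list per column, and each operator pulls small in-memory pages from the operator upstream of it only when it needs them.

```python
def scan(columns, page_size):
    """Yield pages (dicts of column slices); nothing is read until a consumer asks."""
    row_count = len(next(iter(columns.values())))
    for start in range(0, row_count, page_size):
        yield {name: values[start:start + page_size]
               for name, values in columns.items()}


def filter_pages(pages, column, predicate):
    """Pull pages from upstream and keep only the rows matching the predicate."""
    for page in pages:
        keep = [i for i, value in enumerate(page[column]) if predicate(value)]
        yield {name: [values[i] for i in keep] for name, values in page.items()}


def sum_column(pages, column):
    """Final consumer: pulls pages one at a time and aggregates a single column."""
    return sum(sum(page[column]) for page in pages)


# Columnar layout: one list per column instead of one record per row.
table = {
    "orderkey": [1, 2, 3, 4, 5, 6],
    "discount": [0, 5, 0, 0, 10, 0],
    "tax": [10, 20, 30, 40, 50, 60],
}

pipeline = filter_pages(scan(table, page_size=2), "discount", lambda d: d == 0)
total_tax = sum_column(pipeline, "tax")
print(total_tax)
```

No page is produced until `sum_column` asks for one, and intermediate results stay in memory as small pages rather than being written out between steps.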

Slide 20

Slide 20 text

20 The Life of a Query – Simple Scan

Slide 21

Slide 21 text

21 The Life of a Query – Join and Aggregation

    SELECT orders.orderkey, SUM(tax)
    FROM orders
    LEFT JOIN lineitem ON orders.orderkey = lineitem.orderkey
    WHERE discount = 0
    GROUP BY orders.orderkey

This example is from “Presto: SQL on Everything”: https://research.fb.com/publications/presto-sql-on-everything/
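To see what the join-and-aggregation query on this slide returns, it can be run against a few rows in SQLite from Python's standard library. SQLite is only a stand-in here and says nothing about how Presto plans or executes the query; the table definitions and sample rows are invented for illustration, and an ORDER BY is added so the output order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical miniature versions of the two tables used on the slide.
conn.executescript("""
    CREATE TABLE orders (orderkey INTEGER PRIMARY KEY);
    CREATE TABLE lineitem (orderkey INTEGER, tax REAL, discount REAL);
    INSERT INTO orders VALUES (1), (2), (3);
    INSERT INTO lineitem VALUES
        (1, 1.5, 0), (1, 2.5, 0), (2, 4.0, 0.1), (3, 3.0, 0);
""")

# The slide's query, plus ORDER BY for a stable result order.
query = """
    SELECT orders.orderkey, SUM(tax)
    FROM orders
    LEFT JOIN lineitem ON orders.orderkey = lineitem.orderkey
    WHERE discount = 0
    GROUP BY orders.orderkey
    ORDER BY orders.orderkey
"""

rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

Order 2 drops out because its only line item has a nonzero discount. Since the WHERE clause tests a column from `lineitem`, orders with no matching line item are also filtered out, so in this query the LEFT JOIN behaves like an inner join.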

Slide 22

Slide 22 text

22 Logical Plan - Do NOT Join Two Big Tables

Slide 23

Slide 23 text

23 Limitations • Memory Limitation • Fault Tolerance • Single Coordinator

Slide 24

Slide 24 text

24 Get started • Docker Sandbox for Presto: https://hub.docker.com/r/ahanaio/prestodb-sandbox • AWS Sandbox AMI for Presto: https://ahana.io/tutorials/aws-sandbox/

Slide 25

Slide 25 text

25 Ahana • SQL analytics company based on Presto • Team of experts in cloud, database, and Presto • Investment from Google Ventures • Named CRN Top 10 Big Data Startup of 2020 • Premier member of the Presto Foundation
“[Ahana founders] have been strong supporters of the Presto Foundation since its launch in September 2019”
“We are excited to welcome Ahana, as the first and only company focused on supporting Presto of the Presto Foundation”

Slide 26

Slide 26 text

https://events.linuxfoundation.org/prestocon/ PRESTO20WIBD Free for WiBD Members

Slide 27

Slide 27 text

27 Join the Presto Community • Request a new feature or file a bug: github.com/prestodb/presto • Slack: prestodb.slack.com • Twitter: @prestodb
Stay Up-to-Date with Ahana • URL: ahana.io • Twitter: @ahanaio

Slide 28

Slide 28 text

Q & A And yes! We are hiring!

Slide 29

Slide 29 text

8/27/20

Slide 30

Slide 30 text

30 Presto Foundation: Community Driven

Slide 31

Slide 31 text

31 Data-Driven Companies need Low Data Latency
Analysts and Scientists need to answer questions.
“Data Latency” = the time it takes from a user having a question to the time they can actually answer it:
1. User wants to track or explore some new data
2. User meets with Data Eng team to make a plan
3. Data team acquires data and checks access permissions
4. Build and test the ETLs and make tables available to user
5. Notify the user so they can ask their questions
! Can be days or weeks of time