Women in Big Data Meetup: An introduction to Presto, an open source distributed SQL engine

Dipti Borkar, Co-Founder & CPO | Ahana
An introduction to Presto, an open source distributed SQL engine

Founder Mom Immigrant Girl data geek (DB) Engineer always Product techie Team builder Open source believer Mixologist

3 Agenda
• What is Presto?
• History of federation
• Introduction to Presto
• What made Presto different?
• Scalable architecture
• Flexible Connectors
• Performance
• The life of a query

4 Technology Cycles Rhyme: Data Federation
Timeline: 80's, 90's, 2000's, 2010's, 2020's
• FDBMS Challenges
• RDBMS
• FDBMS Paper by McLeod / Heimbigner (1985)
• FDBMS Paper by Sheth / Larson (1990)
• OLTP to DW Wins
• Data Warehouse becomes the source of truth
• Star schema becomes sacred
• Cloud & Big Data
• Composite Software (founded 2001)
• Garlic Paper by Laura Haas (2002) → DB2 Federated
• Google File System Paper (2003)
• MapReduce paper (2004)
• Spark Paper (2010)
• Too many Data Sources, No one uber schema
• New Cloud DW w/ Data Lakes
• Based on SQL
• Self Service Platforms which enable Self-Service Analytics
• SQL Federation Makes Comeback
• Dremel Paper (2010) → Drill paper (2012)
• SQL++ paper (2014) → Couchbase SQL++ engine (2018)
• Presto paper (2019), PartiQL (2019)

5 Presto: One of the Fastest Growing Open Source Projects in Data Analytics
Business Needs
• Data-driven decision making
• Businesses need more data to iterate over
Technology Trends
• Disaggregation of Storage and Compute
• The rise of data lakes

6 What is Presto?
• Distributed SQL query engine
• ANSI SQL on databases and data lakes
• Designed to be interactive
• Access to petabytes of data
• Open source, hosted on GitHub
• https://github.com/prestodb

7 Presto Overview

8 Common Questions
• Is Presto a database?
• How is it related to Hadoop?
• How is it different from a data warehouse?

9 Sample Presto deployment stack & use cases
• Ad hoc
• BI tools
• Dashboard
• A/B testing
• ETL/scheduled job
• Online service

10 What made Presto different?
• Scalable architecture
• Pluggable Connectors
• Performance

11 Scalable Architecture
• Two roles: coordinator and worker
• Easy to scale up and scale down
• Scales up to 1,000 workers
• Validated at web-scale companies
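The two roles can be pictured with a toy model: a coordinator hands units of work (splits) to workers, so adding workers lowers the load on each. This is only a sketch. The round-robin policy, the split names, and the worker names are assumptions made for illustration and do not describe Presto's actual scheduler.

```python
from collections import defaultdict


def assign_splits(splits, workers):
    """Toy coordinator: hand out splits to workers round-robin.

    Round-robin is an assumption for illustration only; it is not
    a description of Presto's real scheduling logic.
    """
    if not workers:
        raise ValueError("at least one worker is required")
    assignment = defaultdict(list)
    for i, split in enumerate(splits):
        assignment[workers[i % len(workers)]].append(split)
    return dict(assignment)


# Hypothetical split and worker names.
splits = [f"split-{n}" for n in range(12)]

small_cluster = assign_splits(splits, ["worker-1", "worker-2"])
large_cluster = assign_splits(
    splits, ["worker-1", "worker-2", "worker-3", "worker-4"]
)

# Splits per worker: 6 each with two workers, 3 each with four.
print({w: len(s) for w, s in small_cluster.items()})
print({w: len(s) for w, s in large_cluster.items()})
```

Doubling the worker count halves the splits each worker handles, which is the intuition behind scaling a cluster up or down.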

12 Scalable Architecture

13 Pluggable Presto Connectors

14 Presto Connector Data Model
• Connector: Driver for a data source.
• Example: HDFS, AWS S3, Cassandra, MySQL, SQL Server, Kafka
• Catalog: Contains schemas from a data source specified by the connector
• Schemas: Namespace to organize tables.
• Tables: Set of unordered rows organized into columns with types.
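The hierarchy above (catalog, then schema, then table) is what lets a Presto query address a table as catalog.schema.table. A minimal sketch of that naming scheme follows. The classes, the `resolve` helper, and the catalog, schema, and table names are all hypothetical and are not part of Presto's API.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    columns: dict                                # column name -> type name
    rows: list = field(default_factory=list)     # unordered rows


@dataclass
class Catalog:
    connector: str                               # connector (driver) behind this catalog
    schemas: dict = field(default_factory=dict)  # schema name -> {table name -> Table}


def resolve(catalogs, qualified_name):
    """Resolve a 'catalog.schema.table' name to a Table object."""
    catalog_name, schema_name, table_name = qualified_name.split(".")
    return catalogs[catalog_name].schemas[schema_name][table_name]


# Hypothetical catalogs: one backed by a Hive connector, one by MySQL.
catalogs = {
    "hive": Catalog(
        connector="hive",
        schemas={
            "web": {
                "page_views": Table(columns={"user_id": "bigint", "url": "varchar"})
            }
        },
    ),
    "mysql": Catalog(
        connector="mysql",
        schemas={
            "crm": {"users": Table(columns={"user_id": "bigint", "name": "varchar"})}
        },
    ),
}

table = resolve(catalogs, "hive.web.page_views")
print(sorted(table.columns))
```

Because each catalog carries its own connector, two tables in one query can come from different systems, which is the federation idea the deck describes.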

15 Presto Hive Connector for Object stores & File systems

16 Presto Hive Connector – Access Control

17 Presto Hive Connector – Data File Types
• Supported File Types
• ORC
• Parquet
• Avro
• RCFile
• SequenceFile
• JSON
• Text
• No data ingestion needed

18 Presto Druid Connector for real-time analytics

19 Why Presto is Fast
• In-Memory processing
• Pull model
• Columnar storage and execution
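One way to picture these three ideas together is a chain of Python generators: the consumer at the end pulls batches from upstream, every batch stays in memory, and each batch is stored column by column. This is an illustrative sketch only. The operator names, batch layout, and data are assumptions and do not mirror Presto's internal classes.

```python
def scan(columns, batch_size):
    """Yield column-oriented batches: a dict of column name -> list of values."""
    n = len(next(iter(columns.values())))
    for start in range(0, n, batch_size):
        yield {
            name: values[start:start + batch_size]
            for name, values in columns.items()
        }


def filter_zero_discount(batches):
    """Keep only the row positions whose discount is 0, column by column."""
    for batch in batches:
        keep = [i for i, d in enumerate(batch["discount"]) if d == 0]
        yield {name: [values[i] for i in keep] for name, values in batch.items()}


def sum_tax(batches):
    """Final operator: pulling batches here drives the whole pipeline."""
    total = 0
    for batch in batches:
        total += sum(batch["tax"])
    return total


# Hypothetical in-memory columns; tax and discount are integers for simplicity.
lineitem = {"tax": [5, 2, 4, 3], "discount": [0, 0, 10, 0]}

total = sum_tax(filter_zero_discount(scan(lineitem, batch_size=2)))
print(total)  # 10
```

Nothing is scanned until `sum_tax` asks for the next batch, and no intermediate result is written out between operators.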

20 The Life of a Query – Simple Scan

21 The Life of a Query – Join and Aggregation
SELECT orders.orderkey, SUM(tax)
FROM orders
LEFT JOIN lineitem ON orders.orderkey = lineitem.orderkey
WHERE discount = 0
GROUP BY orders.orderkey
This example is from Presto: SQL on Everything https://research.fb.com/publications/presto-sql-on-everything/
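To see what this query returns, it can be run unchanged on a tiny dataset. The sketch below uses Python's built-in SQLite as a stand-in, not Presto, so it shows the query's semantics and nothing about distributed execution. The table contents are made up, and it assumes `tax` and `discount` are columns of `lineitem` (as in TPC-H), stored here as integers for simplicity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (orderkey INTEGER PRIMARY KEY);
    CREATE TABLE lineitem (orderkey INTEGER, tax INTEGER, discount INTEGER);
    INSERT INTO orders VALUES (1), (2), (3), (4);
    -- Order 3 has no line items; order 2 only has a discounted one.
    INSERT INTO lineitem VALUES (1, 5, 0), (1, 2, 0), (2, 4, 10), (4, 3, 0);
    """
)

# The query from the slide, unchanged.
query = """
    SELECT orders.orderkey, SUM(tax)
    FROM orders
    LEFT JOIN lineitem ON orders.orderkey = lineitem.orderkey
    WHERE discount = 0
    GROUP BY orders.orderkey
"""

# Sort in Python because the query has no ORDER BY.
rows = sorted(conn.execute(query).fetchall())
conn.close()

print(rows)  # [(1, 7), (4, 3)]
```

Order 3 is absent from the result: the LEFT JOIN produces a row with a NULL discount for it, and `discount = 0` is not true for NULL, so the WHERE clause removes it.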

22 Logical Plan - Do NOT Join Two Big Tables

23 Limitations
• Memory Limitation
• Fault Tolerance
• Single Coordinator

24 Get started
• Docker Sandbox for Presto: https://hub.docker.com/r/ahanaio/prestodb-sandbox
• AWS Sandbox AMI for Presto: https://ahana.io/tutorials/aws-sandbox/

25 Ahana
• SQL analytics company based on Presto
• Team of experts in cloud, database, and Presto
• Investment from Google Ventures
• Named CRN Top 10 Big Data Startup of 2020
• Premier member of
“[Ahana founders] have been strong supporters of the Presto Foundation since its launch in September 2019”
“We are excited to welcome Ahana, as the first and only company focused on supporting Presto of the Presto Foundation”

https://events.linuxfoundation.org/prestocon/ PRESTO20WIBD Free for WiBD Members

27 Join the Presto Community
• Request a new feature or file a bug: github.com/prestodb/presto
• Slack: prestodb.slack.com
• Twitter: @prestodb
Stay Up-to-Date with Ahana
• URL: ahana.io
• Twitter: @ahanaio

Q & A
And yes! We are hiring!

8/27/20

30 Presto Foundation: Community Driven

31 Data-Driven Companies need Low Data Latency
Analysts and Scientists need to answer questions
“Data Latency” = The time it takes from a user having a question to the time they can actually answer it
1. User wants to track or explore some new data
2. User meets with Data Eng team to make a plan
3. Data team acquires data and checks access permissions
4. Build and test the ETLs and make tables available to user
5. Notify the user so they can ask their questions
Can be days or weeks of time!
