Level 101 for Presto: What is PrestoDB?

Ahana
July 28, 2020

Presto is a widely adopted distributed SQL engine for federated querying across multiple data sources. With Presto, you can run ad hoc queries on data in place. For today’s “data hacker”, Presto helps shorten time to discovery and the time spent on ad hoc analysis.

In Level 101, you’ll get an overview of Presto, including:

A high-level overview of Presto and its most common use cases
The problems it solves and why you should use it
A live, hands-on demo of getting Presto running on Docker
Real-world example: How Twitter uses Presto at scale

Transcript

  1. Level 101 for Presto
    SQL on Everything
    Part 1 of the Tech Talk Series for Presto
    What is PrestoDB? What’s the difference?
    Beinan Wang, Sr. Software Engineer, Twitter
    Dipti Borkar, Co-Founder & CPO, Ahana
  2. Presto 101 Outline
    • What is Presto?
    • How are we using Presto?
    • What made Presto different?
      ◦ Scalable architecture
      ◦ Flexible Connectors
      ◦ Performance
    • The life of a query
  3. What is Presto?
    • Distributed SQL query engine
      ◦ ANSI SQL on Hadoop, Kafka, Druid, etc.
      ◦ Designed to be interactive
      ◦ Access to petabytes of data
    • Open source, hosted on GitHub
      ◦ https://github.com/prestodb
    • Open question:
      ◦ Is Presto a database?
  4. How are we using Presto?
    • Ad hoc
    • BI tools
    • Dashboard
    • A/B testing
    • ETL/scheduled jobs
    • Online service *
  5. What made Presto different?
    • Scalable architecture
    • Pluggable Connectors
    • Performance
  6. Scalable architecture
    • Two roles -- coordinator and worker
    • Easy to scale up and scale down
      ◦ Scale up to 1000 workers*
      ◦ Fits web-scale companies
  7. Pluggable Presto Connectors

  8. Presto Connector Data Model
    • Connector: Driver for a data source.
      ◦ Examples: HDFS, Cassandra, Kafka, SQL Server
    • Catalog: Contains schemas from a data source specified by the connector.
    • Schemas: Namespace to organize tables.
    • Tables: Set of unordered rows organized into columns with types.
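The catalog / schema / table hierarchy above means a table can be addressed by a dotted, fully qualified name. The following is a minimal Python sketch of that naming scheme, assuming missing parts are filled in from session defaults. `TableName`, `resolve`, and the example names `hive`, `web`, `tpch`, and `tiny` are hypothetical and used for illustration only; this is not Presto's actual resolver and it ignores quoted identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableName:
    """A fully qualified table reference: catalog.schema.table."""
    catalog: str
    schema: str
    table: str


def resolve(name: str, default_catalog: str, default_schema: str) -> TableName:
    """Fill in the missing parts of a dotted name from session defaults."""
    parts = name.split(".")
    if len(parts) == 3:
        # catalog.schema.table: nothing to fill in.
        return TableName(*parts)
    if len(parts) == 2:
        # schema.table: use the session's default catalog.
        return TableName(default_catalog, *parts)
    if len(parts) == 1:
        # table only: use both session defaults.
        return TableName(default_catalog, default_schema, parts[0])
    raise ValueError(f"too many name parts: {name!r}")


# Hypothetical usage: the same session can address tables in two catalogs.
full = resolve("hive.web.orders", "tpch", "tiny")
short = resolve("orders", "tpch", "tiny")
```

Because the catalog is part of the name, a single query can refer to tables served by different connectors.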
  9. Presto Hive Connector
  10. Presto Hive Connector -- Access Control
  11. Presto Hive Connector -- Data File Types
    • Supported File Types
      ◦ ORC
      ◦ Parquet
      ◦ Avro
      ◦ RCFile
      ◦ SequenceFile
      ◦ JSON
      ◦ Text
    • No data ingestion needed
  12. Presto Druid Connector

  13. Why Presto is Fast
    • In-Memory processing
    • Pull model
    • Columnar storage and execution
    • Bytecode generation
  14. The Life of a Query -- Simple Scan
    SELECT *
    FROM orders
    WHERE discount = 0
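To make the coordinator and worker roles concrete for this scan, here is a toy Python simulation, assuming the table is cut into chunks ("splits") that workers filter independently before the results are gathered. The function names, the split size, and the sample rows are all hypothetical, and the "workers" are a plain loop rather than separate processes.

```python
from typing import Dict, List

Row = Dict[str, float]


def make_splits(rows: List[Row], split_size: int) -> List[List[Row]]:
    """Coordinator side: cut the table into fixed-size chunks (splits)."""
    return [rows[i:i + split_size] for i in range(0, len(rows), split_size)]


def scan_split(split: List[Row]) -> List[Row]:
    """Worker side: scan one split and keep rows where discount = 0."""
    return [row for row in split if row["discount"] == 0]


def run_query(rows: List[Row], split_size: int = 2) -> List[Row]:
    """Hand each split to a worker (here: a plain loop) and gather the output."""
    result: List[Row] = []
    for split in make_splits(rows, split_size):
        result.extend(scan_split(split))
    return result


# Hypothetical sample data standing in for the orders table.
orders = [
    {"orderkey": 1, "discount": 0.0},
    {"orderkey": 2, "discount": 0.1},
    {"orderkey": 3, "discount": 0.0},
    {"orderkey": 4, "discount": 0.05},
    {"orderkey": 5, "discount": 0.0},
]
matching = run_query(orders)
```

Each split is filtered without looking at any other split, which is why a scan like this can be spread over more workers as the data grows.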
  15. The Life of a Query -- Join and Aggregation
    SELECT orders.orderkey, SUM(tax)
    FROM orders
    LEFT JOIN lineitem ON orders.orderkey = lineitem.orderkey
    WHERE discount = 0
    GROUP BY orders.orderkey
    This example is from Presto: SQL on Everything https://research.fb.com/publications/presto-sql-on-everything/
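The semantics of this query can be traced on a few rows with a textbook hash join followed by a grouped sum. The Python sketch below makes several assumptions: `tax` and `discount` are taken to be columns of `lineitem`, the sample rows are invented, and the build/probe structure is a generic hash join rather than a description of the plan Presto actually produces.

```python
from collections import defaultdict

# Hypothetical sample rows; tax and discount are assumed to live on lineitem.
orders = [{"orderkey": 1}, {"orderkey": 2}, {"orderkey": 3}]
lineitem = [
    {"orderkey": 1, "tax": 0.5, "discount": 0.0},
    {"orderkey": 1, "tax": 0.25, "discount": 0.0},
    {"orderkey": 2, "tax": 1.0, "discount": 0.1},
]


def join_and_aggregate(orders, lineitem):
    # Build phase: hash the lineitem rows by the join key.
    by_key = defaultdict(list)
    for item in lineitem:
        by_key[item["orderkey"]].append(item)

    totals = defaultdict(float)
    for order in orders:
        # Probe phase: a LEFT JOIN keeps an order with no matches by
        # pairing it with a row of NULLs (None here).
        matches = by_key.get(order["orderkey"]) or [{"tax": None, "discount": None}]
        for item in matches:
            # WHERE discount = 0 is applied to the joined row; NULL = 0 is
            # not true, so orders without line items drop out.
            if item["discount"] == 0:
                # GROUP BY orders.orderkey with SUM(tax).
                totals[order["orderkey"]] += item["tax"]
    return dict(totals)


totals = join_and_aggregate(orders, lineitem)
```

On this data only order 1 survives: order 2 has a non-zero discount and order 3 has no line items, so the filter on `discount` removes the rows the LEFT JOIN would otherwise have kept.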
  16. Logical Plan -- do NOT join two big tables
    This example is from Presto: SQL on Everything https://research.fb.com/publications/presto-sql-on-everything/
  17. Limitations
    • Memory Limitation
    • Fault Tolerance
    • Single Point of Failure: Coordinator
  18. Time for a demo!
    • Local setup: query TPC-DS
    • Cloud setup: query S3 / Parquet
  19. Docker Sandbox for Presto https://hub.docker.com/r/ahanaio/prestodb-sandbox
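Once a sandbox coordinator is running, queries can be submitted over Presto's HTTP client protocol: the statement is POSTed to `/v1/statement`, and the client follows each `nextUri` in the replies until none is returned. The Python sketch below uses only the standard library. The address `http://localhost:8080`, the user name `demo`, and the helper names `http_fetch` and `run_sql` are assumptions for illustration; adjust them to however the container's port is published.

```python
import json
import urllib.request
from typing import Callable, List, Optional


def http_fetch(url: str, body: Optional[bytes], headers: dict) -> dict:
    """POST when a body is given, otherwise GET; decode the JSON reply."""
    request = urllib.request.Request(url, data=body, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def run_sql(
    sql: str,
    base_url: str = "http://localhost:8080",  # assumed sandbox address
    user: str = "demo",                       # assumed user name
    fetch: Callable[[str, Optional[bytes], dict], dict] = http_fetch,
) -> List[list]:
    """Submit a statement and collect all result rows."""
    headers = {"X-Presto-User": user}
    page = fetch(base_url + "/v1/statement", sql.encode("utf-8"), headers)
    rows: List[list] = []
    while True:
        if "error" in page:
            raise RuntimeError(page["error"].get("message", "query failed"))
        # A page may carry rows, a link to the next page, or both.
        rows.extend(page.get("data", []))
        next_uri = page.get("nextUri")
        if next_uri is None:
            return rows
        page = fetch(next_uri, None, headers)


# Hypothetical usage against a running sandbox:
# print(run_sql("SHOW CATALOGS"))
```

The `fetch` parameter exists only so the paging loop can be exercised without a server; for real work an existing Presto client library is likely a better choice.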

  20. AWS Sandbox AMI for Presto https://ahana.io/tutorials/aws-sandbox/

  21. Q&A

  22. Join the Presto Community
    • Request a new feature or file a bug: github.com/prestodb/presto
    • Slack: prestodb.slack.com
    • Twitter: @prestodb
    Stay up-to-date with Ahana
    • URL: ahana.io
    • Twitter: @ahanaio