
Introduction to Presto, an open source distributed SQL engine

Ahana
December 16, 2020



Transcript

  1. An introduction to Presto, an open source distributed SQL engine
    Dipti Borkar, Co-Founder & CPO, Ahana
  2. Agenda
    • What is Presto?
    • History of federation
    • Introduction to Presto
    • What made Presto different?
    • Scalable architecture
    • Flexible Connectors
    • Performance
    • The life of a query
  3. Technology Cycles Rhyme: Data Federation
    Timeline axis: 80’s, 90’s, 2000’s, 2010’s, 2020’s
    FDBMS Challenges RDBMS
    • FDBMS paper by McLeod / Heimbigner (1985)
    • FDBMS paper by Sheth / Larson (1990)
    OLTP to DW Wins
    • Data Warehouse becomes the source of truth
    • Star schema becomes sacred
    Cloud & Big Data
    • Composite Software (founded 2001)
    • Garlic paper by Laura Haas (2002) → DB2 Federated
    • Google File System paper (2003)
    • MapReduce paper (2004)
    • Spark paper (2010)
    • Too many data sources, no one uber schema
    • New Cloud DW w/ Data Lakes based on SQL
    • Self-service platforms which enable self-service analytics
    SQL Federation Makes Comeback
    • Dremel paper (2010) → Drill paper (2012)
    • SQL++ paper (2014) → Couchbase SQL++ engine (2018)
    • Presto paper (2019), PartiQL (2019)
  4. Presto: One of the Fastest Growing Open Source Projects in Data Analytics
    Business Needs
    • Data-driven decision making
    • Businesses need more data to iterate over
    Technology Trends
    • Disaggregation of Storage and Compute
    • The rise of data lakes
  5. What is Presto?
    • Distributed SQL query engine
    • ANSI SQL on databases and data lakes
    • Designed to be interactive
    • Access to petabytes of data
    • Open source, hosted on GitHub
    • https://github.com/prestodb
  6. Common Questions
    • Is Presto a database?
    • How is it related to Hadoop?
    • How is it different from a data warehouse?
  7. Sample Presto deployment stack & use cases
    • Ad hoc
    • BI tools
    • Dashboard
    • A/B testing
    • ETL/scheduled job
    • Online service
  8. Scalable Architecture
    • Two roles: coordinator and worker
    • Easy scale up and scale down
    • Scale up to 1000 workers
    • Validated at web-scale companies
  9. Presto Connector Data Model
    • Connector: Driver for a data source.
    • Example: HDFS, AWS S3, Cassandra, MySQL, SQL Server, Kafka
    • Catalog: Contains schemas from a data source specified by the connector.
    • Schemas: Namespace to organize tables.
    • Tables: Set of unordered rows organized into columns with types.
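The catalog, schema, and table hierarchy on this slide can be sketched as nested lookups. This is a minimal illustration, not Presto's internal representation: the class names, the `resolve` helper, and the sample catalogs (`hive`, `mysql`) with their schemas and tables are all hypothetical. The one detail taken from Presto itself is that a table is addressed by a three-part name, `catalog.schema.table`.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Table:
    name: str
    columns: Dict[str, str]  # column name -> type name


@dataclass
class Schema:
    name: str
    tables: Dict[str, Table] = field(default_factory=dict)


@dataclass
class Catalog:
    name: str
    connector: str  # name of the connector that serves this catalog
    schemas: Dict[str, Schema] = field(default_factory=dict)


def resolve(catalogs: Dict[str, Catalog], qualified_name: str) -> Table:
    """Look up a table by its three-part name: catalog.schema.table."""
    catalog_name, schema_name, table_name = qualified_name.split(".")
    return catalogs[catalog_name].schemas[schema_name].tables[table_name]


# Hypothetical sample data: two catalogs backed by different connectors.
orders = Table("orders", {"orderkey": "bigint", "totalprice": "double"})
customers = Table("customers", {"id": "bigint", "name": "varchar"})
catalogs = {
    "hive": Catalog("hive", "hive-hadoop2", {"sales": Schema("sales", {"orders": orders})}),
    "mysql": Catalog("mysql", "mysql", {"crm": Schema("crm", {"customers": customers})}),
}

table = resolve(catalogs, "hive.sales.orders")
print(table.name, table.columns)
```

Because every table name starts with a catalog, a single query can reference tables served by different connectors, which is what makes federation possible.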
  10. Presto Hive Connector – Data File Types
    • Supported File Types
    • ORC
    • Parquet
    • Avro
    • RCFile
    • SequenceFile
    • JSON
    • Text
    • No data ingestion needed
  11. Why Presto is Fast
    • In-Memory processing
    • Pull model
    • Columnar storage and execution
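Two of the ideas named on this slide, columnar layout and a pull model, can be illustrated in a few lines. This is a toy sketch under stated assumptions, not how Presto implements either: the sample rows, the `scan` and `total` helpers, and the batch size are all hypothetical.

```python
from typing import Dict, Iterable, Iterator, List

# Row-oriented layout: every record carries every field.
rows = [
    {"orderkey": 1, "tax": 0.5, "comment": "a"},
    {"orderkey": 2, "tax": 1.0, "comment": "b"},
    {"orderkey": 3, "tax": 0.25, "comment": "c"},
    {"orderkey": 4, "tax": 0.25, "comment": "d"},
]

# Column-oriented layout: one list per column, so a query that needs
# only "tax" never touches "orderkey" or "comment".
columns: Dict[str, list] = {name: [r[name] for r in rows] for name in rows[0]}


def scan(column: List[float], batch_size: int) -> Iterator[List[float]]:
    """Yield one batch at a time; nothing is produced until the consumer asks."""
    for start in range(0, len(column), batch_size):
        yield column[start:start + batch_size]


def total(batches: Iterable[List[float]]) -> float:
    """Downstream operator: pulls batches from upstream and sums them in memory."""
    return sum(sum(batch) for batch in batches)


tax_total = total(scan(columns["tax"], batch_size=2))
print(tax_total)
```

The generator stands in for the pull model: the consumer drives the pipeline by requesting the next batch, and intermediate data stays in memory rather than being written out between steps.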
  12. The Life of a Query – Join and Aggregation
    SELECT orders.orderkey, SUM(tax)
    FROM orders
    LEFT JOIN lineitem ON orders.orderkey = lineitem.orderkey
    WHERE discount = 0
    GROUP BY orders.orderkey
    This example is from Presto: SQL on Everything
    https://research.fb.com/publications/presto-sql-on-everything/
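To make the semantics of this query concrete, the sketch below evaluates it over a few in-memory rows in logical order: hash join, filter, then aggregate. It assumes `tax` and `discount` are columns of `lineitem` (as in TPC-H); the sample rows are invented for illustration. It is not the distributed plan Presto would produce, and a real optimizer may reorder these steps.

```python
from collections import defaultdict

# Hypothetical sample data.
orders = [{"orderkey": 1}, {"orderkey": 2}, {"orderkey": 3}]
lineitem = [
    {"orderkey": 1, "tax": 2, "discount": 0},
    {"orderkey": 1, "tax": 3, "discount": 0},
    {"orderkey": 2, "tax": 5, "discount": 1},
]

# Build side of the hash join: index lineitem rows by join key.
build = defaultdict(list)
for item in lineitem:
    build[item["orderkey"]].append(item)

# Probe side: LEFT JOIN keeps orders with no match, padded with None (NULL).
joined = []
for order in orders:
    matches = build.get(order["orderkey"])
    if matches:
        for item in matches:
            joined.append({"orderkey": order["orderkey"],
                           "tax": item["tax"],
                           "discount": item["discount"]})
    else:
        joined.append({"orderkey": order["orderkey"], "tax": None, "discount": None})

# WHERE discount = 0: None never equals 0, so unmatched orders drop out here.
filtered = [row for row in joined if row["discount"] == 0]

# GROUP BY orders.orderkey with SUM(tax).
totals = defaultdict(int)
for row in filtered:
    totals[row["orderkey"]] += row["tax"]

result = dict(totals)
print(result)
```

Note that the filter on a `lineitem` column discards the NULL-padded rows the LEFT JOIN preserved, so under the assumption above the query returns the same rows an inner join would.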
  13. Ahana At A Glance
    • First PrestoDB-based Company
    • Named Best Big Data Startup of 2020 by Datanami
    • Named CRN Top 10 Big Data Startup of 2020
    • Investment from Google Ventures, Lux Ventures, Leslie Ventures
    • Team of experts in cloud, database, and Presto
    • Premier member of
  15. Managing Presto Remains Complex
    Hadoop complexity
    ▪ /etc/presto/config.properties
    ▪ /etc/presto/node.properties
    ▪ /etc/presto/jvm.config
    Many hidden parameters – difficult to tune
    Just the query engine
    ▪ No built-in catalog – users need to manage Hive metastore or AWS Glue
    ▪ No data lake S3 integration
    Poor out-of-box perf
    ▪ No tuning
    ▪ No high-performance indexing
    ▪ Only basic optimizations, even for common queries
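As a rough idea of what goes into `/etc/presto/config.properties`, the sketch below renders a coordinator and a worker variant. The property names follow the Presto deployment documentation; the helper `render_config`, the port, the memory limits, and the discovery URI are placeholder assumptions, not recommended settings.

```python
def render_config(is_coordinator: bool, discovery_uri: str, port: int = 8080) -> str:
    """Render a minimal config.properties body for one Presto node."""
    lines = [f"coordinator={'true' if is_coordinator else 'false'}"]
    if is_coordinator:
        # Keep query work off the coordinator and run the discovery
        # service inside it so workers can register themselves.
        lines.append("node-scheduler.include-coordinator=false")
        lines.append("discovery-server.enabled=true")
    lines.append(f"http-server.http.port={port}")
    lines.append("query.max-memory=50GB")           # placeholder value
    lines.append("query.max-memory-per-node=1GB")   # placeholder value
    lines.append(f"discovery.uri={discovery_uri}")
    return "\n".join(lines) + "\n"


# Hypothetical discovery endpoint, normally the coordinator's address.
uri = "http://example.net:8080"
coordinator_config = render_config(True, uri)
worker_config = render_config(False, uri)
print(coordinator_config)
print(worker_config)
```

Even this minimal case needs a different file per role, on top of `node.properties` and `jvm.config` on every node, which gives a sense of the operational overhead the slide describes.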
  16. How Ahana Cloud works
    • Sign up: https://app.ahana.cloud/signup
    • ~30 mins to create the compute plane
    • Create Presto clusters in your account
  17. Ahana Cloud – Reference Architecture
    • Distributed SQL engine with proven scalability
    • Interactive ANSI SQL queries
    • Query data where it lives with Federated Connectors (no ETL)
    • High concurrency
    • Separation of compute and storage
  18. Ahana Cloud for Presto
    Ahana Cloud Account: Ahana Console (Control Plane)
    • Cluster orchestration
    • Consolidated logging
    • Security & access
    • Billing & support
    Ahana console oversees and manages every Presto cluster
    Customer Cloud Account: In-VPC Presto Clusters (Compute Plane)
    • Ad hoc cluster 1
    • Test cluster 2
    • Prod cluster N
    • Glue, S3, RDS, Elasticsearch
    In-VPC orchestration of Presto clusters, where metadata, monitoring, and data sources reside
  19. Ahana Cloud – Seamless Cluster Operations
    [Diagram: a user data plane (metastores; SumUp’s Redshift, MySQL, Postgres, MongoDB over SSL / HTTPS) connected to a compute plane with two clusters, ReportingProd and DataEnggJobs, each a coordinator with several workers]
    Operations shown
    • Create 4 node cluster
    • Create 2 node cluster
    • Add data source & auto-restart
    • Re-size (scale up/down)
    • Stop ($0 when stopped)
    • Start cluster w/ saved config & data sources attached
    AWS EMR does not allow for
    ▪ Cluster click-button restart, stop & start, auto-restarts for catalog changes
    ▪ Cluster & data source configs and metastores are not preserved
    ▪ Re-started clusters are not auto upgraded to latest Presto version
  20. Ahana Cloud Summary
    Ahana Cloud for Presto: Point & Query Cloud Service
    • Gives you Presto as a Cloud Data Warehouse in an open, disaggregated stack
    • Managed Presto in-VPC in user account
    • Built-in metadata catalog, data lake, Apache Superset
    • Start, stop, restart, resize, terminate – end-to-end cluster life cycle management
    • Amazon sources: S3, RDS/MySQL, RDS/Postgres, Elasticsearch, Redshift
    • Highly available & scalable, running in containers on Kubernetes across AZs
    • Flexible analytics stack with BYO metadata, data source, BI tool or notebook
  21. An Innovative 1st Year
    • Project Aria
    • RaptorX
    • Presto-on-Spark
    • Disaggregated Coordinator (a.k.a. Fireball)
    • SQL Functions
    • UDF Support
    • Pinot Connector
    • Druid Connector
    • … and more ...
  22. PrestoDB Advancements by the Community
    1. Improved planner via Project Aria: PrestoDB can now push down entire expressions to the data source for some file formats like ORC.
       https://prestodb.io/blog/2019/12/23/improve-presto-planner
       https://engineering.fb.com/data-infrastructure/aria-presto/
    2. Grouped execution of non-partitioned tables via Project Presto Unlimited
       https://prestodb.io/blog/2019/08/05/presto-unlimited-mpp-database-at-scale
       https://github.com/prestodb/presto/issues/12124
    3. UDFs: dynamic SQL functions support
       https://prestodb.io/docs/current/admin/function-namespace-managers.html
    4. Connectors: Pinot and Druid
       https://prestodb.io/docs/current/connector.html
       https://prestosql.io/docs/current/connector.html
  26. Join the Presto Community
    • Request a new feature or file a bug: github.com/prestodb/presto
    • Meetup: meetup.com/prestodb
    • Slack: prestodb.slack.com
    • Twitter: @prestodb
    Stay Up-to-Date with Ahana
    • URL: ahana.io
    • Twitter: @ahanaio
  27. Data-Driven Companies Need Low Data Latency
    Analysts and scientists need to answer questions.
    “Data Latency” = the time from a user having a question to the time they can actually answer it
    1. User wants to track or explore some new data
    2. User meets with Data Eng team to make a plan
    3. Data team acquires data and checks access permissions
    4. Build and test the ETLs and make tables available to user
    5. Notify the user so they can ask their questions
    This can take days or weeks.