Hands-on Virtual Lab: Getting Started with Presto

Ahana
April 21, 2021

Learn more about Presto and how to get started with Presto using Ahana Cloud.




  1. Hands-On Virtual Lab: Getting Started With Presto

    April 21, 2021
    Asif Kazi, Solutions Engineer
  3. Objective for Today

    Over the next 90 minutes you will:
    • Explore and understand Presto using a hands-on lab environment
    • Start by creating a 1-node cluster on your desktop/laptop
    • Move to a production-ready, fully managed, containerized Kubernetes environment in the cloud
    • Query SQL and object data sources using ANSI SQL
    • Run federated queries/joins across multiple sources, combining data in S3 and RDS/MySQL
  4. Agenda

    1) Downloads and Cluster Creation (background task, 15-20 mins)
    2) Understand the Technology (15-20 mins)
       a) What is Presto?
       b) What is Ahana Cloud?
    3) Getting Your Hands Dirty (60 mins)
    4) Summary and Close Out (5 mins)
  5. Downloads and Cluster Creation: Environment Readiness

  6. Lab 0: Setup

    Follow the instructions outlined in Section 3.0 of the accompanying guide.
  7. Understanding the Technology: Presto

  8. What is Presto?

    • Open source, distributed MPP SQL query engine
    • Query in place
    • Federated querying
    • ANSI SQL compliant
    • Designed from the ground up for fast analytic queries against data of any size
    • Originally developed at Facebook
    • Proven on petabytes of data
    • SQL-on-Anything
    • Federated, pluggable architecture to support many connectors
    • Open source, hosted on GitHub
  9. Presto Overview

    [Diagram: a Presto cluster with one coordinator and four workers]

  10. Presto: One of the Fastest Growing Open Source Projects in Data Analytics

    Business Needs
    • Data-driven decision making
    • Businesses need more data to iterate over

    Technology Trends
    • Disaggregation of storage and compute
    • The rise of data lakes
  11. Common Questions

    • Is Presto a database?
    • How is it related to Hadoop?
    • How is it different from a data warehouse?
  12. Presto Use Cases

    • Data lakehouse analytics
    • Reporting & dashboarding
    • Interactive ad hoc querying
    • Transformation using SQL (ETL)
    • Federated querying across data sources
  13. Presto Architecture

  14. What Makes Presto Different?

    • Scalable architecture
    • Pluggable connectors
    • Performance

  15. Scalable Architecture

    • Two roles: coordinator and worker
    • Easy scale up and scale down
    • Scale up to 1000 workers
    • Validated at web-scale companies

    [Diagram: a Presto cluster with a coordinator and workers in front of a data source; new workers join the cluster]
  16. Scalable Architecture

    [Diagram: Clients (BI tools, notebooks, Presto CLI, Looker, JDBC, Superset, Tableau, Jupyter, ...) send SQL to the Presto coordinator and receive result sets. Coordinator components: parser/analyzer, planner, scheduler, Metadata API, Data Location API. Workers exchange data through data shuffles and read through Presto connectors from any database, data stream, or storage: HDFS, object stores (S3), MySQL, Elasticsearch, Kafka, ...]
  17. Pluggable Presto Connectors

  18. Presto Connector Data Model

    • Connector: Driver for a data source. Examples: HDFS, AWS S3, Cassandra, MySQL, SQL Server, Kafka
    • Catalog: Contains schemas from a data source specified by the connector
    • Schema: Namespace to organize tables
    • Table: Set of unordered rows organized into columns with types
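
The catalog / schema / table hierarchy above is what makes a fully qualified Presto table name a three-part, dot-separated identifier. The following is a minimal sketch of that naming rule only; the helper `parse_table_name` and the catalog name `mysql` are hypothetical and used for illustration, and quoted identifiers are deliberately not handled.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    """The three levels of the connector data model."""
    catalog: str  # a configured data source (name chosen by the admin)
    schema: str   # namespace inside that catalog
    table: str    # table inside that schema


def parse_table_name(qualified: str) -> TableName:
    """Split 'catalog.schema.table' into its parts.

    Hypothetical helper: it assumes plain, unquoted identifiers.
    """
    parts = qualified.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {qualified!r}")
    return TableName(*parts)


# 'mysql' is an assumed catalog name; demodb.movies_metadata is from the lab data.
name = parse_table_name("mysql.demodb.movies_metadata")
print(name.catalog, name.schema, name.table)
```

Because every table is addressed through its catalog, a single query can reference tables from different data sources side by side.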
  19. Presto Hive Connector for Object Stores & File Systems

    [Diagram: coordinator, two workers, the Hive Metastore (reached over the Thrift API), and S3, with splits of the S3 data handed to the workers]
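
To make the idea of splits concrete, here is a toy model of a coordinator handing splits to workers in round-robin order. This is only an illustration under simplifying assumptions, not Presto's actual scheduler; the function `assign_splits` and the split and worker names are hypothetical.

```python
from itertools import cycle
from typing import Dict, List


def assign_splits(splits: List[str], workers: List[str]) -> Dict[str, List[str]]:
    """Toy coordinator: distribute splits across workers round-robin.

    Hypothetical helper for illustration; the real scheduler also considers
    factors such as worker load, which this sketch ignores.
    """
    if not workers:
        raise ValueError("at least one worker is required")
    assignment: Dict[str, List[str]] = {w: [] for w in workers}
    # zip stops when the splits run out; cycle repeats the workers forever.
    for split, worker in zip(splits, cycle(workers)):
        assignment[worker].append(split)
    return assignment


splits = [f"split-{i}" for i in range(6)]
print(assign_splits(splits, ["worker-1", "worker-2"]))
print(assign_splits(splits, ["worker-1", "worker-2", "worker-3"]))
```

With two workers each one receives three splits; adding a third worker drops that to two each, which is the intuition behind scaling a cluster by adding workers.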
  20. Presto Hive Connector: Data File Types

    Supported file types:
    • ORC
    • Parquet
    • Avro
    • RCFile
    • CSV
    • SequenceFile
    • JSON
    • Text

    • No data ingestion/duplication/movement needed
    • Query data in place
  21. Why Is Presto Fast?

    • In-memory processing
    • Pull model
    • Columnar storage & execution
  22. Introducing Ahana Cloud: Fully-Managed Presto Service

  23. Ahana Cloud is a fully-managed, cloud-native Presto service

  24. Managing Presto Is Complex

    Hadoop complexity
    ▪ /etc/presto/
    ▪ /etc/presto/
    ▪ /etc/presto/jvm.config
    Many hidden parameters, difficult to tune

    Just the query engine
    ▪ No built-in catalog: users need to manage Hive Metastore or AWS Glue
    ▪ No data lake S3 integration

    Poor out-of-box performance
    ▪ No tuning
    ▪ No high-performance indexing
    ▪ Only basic optimizations, even for common queries
  25. Ahana Cloud: Reference Architecture

    • Distributed SQL engine with proven scalability
    • Interactive ANSI SQL queries
    • Query data where it lives with federated connectors (no ETL)
    • High concurrency
    • Separation of compute and storage

  26. Ahana Cloud for Presto

    [Diagram: The Ahana Cloud account hosts the Ahana console (access, billing & support), which oversees and manages every Presto cluster. The customer cloud account hosts the in-VPC Presto clusters (compute plane): ad hoc cluster 1, test cluster 2, prod cluster N, alongside Glue, S3, RDS, and Elasticsearch. In-VPC orchestration of Presto clusters, where metadata, monitoring, and data sources reside.]
  27. Ahana Cloud Summary

    Ahana Cloud for Presto: Point & Query Cloud Service
    • Managed Presto in-VPC in the user's account
    • Built-in metadata catalog, data lake, and Apache Superset
    • Start, stop, restart, resize, terminate: end-to-end cluster life cycle management
    • Amazon sources: S3, RDS/MySQL, RDS/Postgres, Elasticsearch, Redshift
    • Highly available and scalable, running in containers on Kubernetes across AZs
    • Flexible analytics stack with BYO metadata, data source, BI tool, or notebook

    Gives you Presto as a cloud data warehouse in an open, disaggregated stack.
  28. Getting Your Hands Dirty: Single Node to Production Scale

  29. Local Presto

  30. Lab 1: Local Presto Configuration

    Follow the instructions outlined in Section 3.1 of the accompanying guide.

  31. Lab 2: Start Your Configured Local Presto

    Follow the instructions outlined in Section 3.2 of the accompanying guide.

  32. Lab 3: Query Your Catalogs

    Follow the instructions outlined in Section 3.3 of the accompanying guide.
  33. Ahana Cloud Presto

  34. Lab 4: Connect to Your Ahana Cloud Presto

    Follow the instructions outlined in Section 3.4 of the accompanying guide.

  35. Lab 5: Connect Superset to Ahana

    Follow the instructions outlined in Section 3.5 of the accompanying guide.

  36. Lab 6: Query MySQL

    Follow the instructions outlined in Section 3.6 of the accompanying guide.

  37. Wrapping Up: Conclusion and Things to Try

  38. Things to Try Later

    1. Scale your cluster up and down
    2. Check your cluster’s PrestoDB console when running SQL
    3. Try queries with more or fewer workers; how does performance change?
    4. Try queries using Parquet format instead of ORC
    5. Delete your cluster
  39. Conclusion

    In this hands-on workshop you have:
    1. Learned about Presto and Ahana Cloud
    2. Effortlessly created and managed Presto clusters
    3. Run fast federated SQL queries combining datasets from S3 and MySQL using a BI tool
    4. Evaluated Ahana’s flexibility to scale out and scale in based on demand, stop and start clusters, and change data sources
  40. Next Steps for You

    • Ahana Cloud is available on the AWS Marketplace
    • Sign up for a 14-day free trial here:
  41. Thank You!

    Stay up-to-date with Ahana
    Website:
    Blogs:
    Twitter: @ahanaio
  42. How to Get Involved with Presto

    • Join the Slack channel!
    • Write a blog for!
    • Join the virtual meetup group & present!
    • Contribute to the project!
  43. Questions?

  44. Additional Information

    The sample data used for the lab:

    Ratings data layout (26.2m records, S3, CSV):
      userid (int)
      movieid (int)
      rating (double)
      datetime (timestamp)

    movies_metadata (MySQL) DDL (5.3k rows, RDS/MySQL table):

      create table demodb.movies_metadata(
         adult varchar(5)
        ,belongs_to_collection varchar(5000)
        ,budget integer
        ,genres varchar(1000)
        ,homepage varchar(256)
        ,id integer
        ,imdb_id varchar(12)
        ,original_language char(2)
        ,original_title varchar(256)
        ,overview varchar(5000)
        ,popularity double
        ,poster_path varchar(256)
        ,production_companies varchar(1000)
        ,production_countries varchar(1000)
        ,release_date date
        ,revenue integer
        ,runtime double
        ,spoken_languages varchar(1000)
        ,status varchar(12)
        ,tagline varchar(256)
        ,title varchar(256)
        ,video boolean
        ,vote_average double
        ,vote_count integer
      );

    All the movie datasets and more are available from :
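
As an illustration of the kind of federated join the labs build toward, the sketch below assembles a Presto SQL statement that joins the S3 ratings data with the MySQL `movies_metadata` table. Several names are assumptions and not taken from the lab guide: the catalog names `hive` and `mysql`, the schema `default`, the table name `ratings`, and the join key `movieid = id`. Adjust them to whatever your cluster actually exposes.

```python
def build_federated_query(
    hive_catalog: str = "hive",      # assumed catalog name for the S3 data
    hive_schema: str = "default",    # assumed schema holding the ratings table
    mysql_catalog: str = "mysql",    # assumed catalog name for RDS/MySQL
    limit: int = 10,
) -> str:
    """Return a Presto SQL string joining S3 ratings with MySQL movie metadata.

    Only demodb.movies_metadata and the column names come from the slides;
    the 'ratings' table name and the join key are assumptions.
    """
    return f"""
SELECT m.title,
       avg(r.rating) AS avg_rating,
       count(*)      AS num_ratings
FROM {hive_catalog}.{hive_schema}.ratings AS r
JOIN {mysql_catalog}.demodb.movies_metadata AS m
  ON r.movieid = m.id
GROUP BY m.title
ORDER BY num_ratings DESC
LIMIT {int(limit)}
""".strip()


print(build_federated_query())
```

The resulting string can be pasted into the Presto CLI or a BI tool such as Superset; both tables are addressed by catalog, so no data has to be copied between S3 and MySQL first.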
  45. Data-Driven Organizations Modernize from Warehouses to Open Cloud Data Lake Analytics

    Cloud Data Warehouse
    Key characteristics:
    • Ingest enterprise data
    • Data volumes growing slowly
    • Best performance on a subset of company data
    • Lock-in

    Open Data Lake Analytics
    Key characteristics:
    • Query in place
    • Data sources and volumes growing fast
    • Good to great performance to query all data (1000X more)
    • Open, disaggregated
    Pain: complex infrastructure needed to query data lakes

    [Diagram labels: Modernized Platform; Cloud Data Lake (structured, semi-structured, and unstructured data); 80% of data in S3, 20% other data sources; Query Engine; Hive (catalog); BI Tools / Notebook Integrations; BI Reports; Data Marts; ETL; External Data; Operational Data]