Hands-on Virtual Lab: Getting Started with Presto

Ahana
April 21, 2021

Learn more about Presto and how to get started with Presto using Ahana Cloud.

Transcript

  1. Objective for Today
    Over the next 90 minutes you will:
    • Explore and understand Presto using a hands-on lab environment
    • Start by creating a 1-node cluster on your desktop/laptop
    • Move to a production-ready, fully managed, containerized Kubernetes environment in the cloud
    • Query SQL and object data sources using ANSI SQL
    • Run federated queries/joins across multiple sources, combining data in S3 and RDS/MySQL
  2. Agenda
    1) Downloads and Cluster Creation (Background Task, 15-20 mins)
    2) Understand the Technology (15-20 mins)
       a) What is Presto?
       b) What is Ahana Cloud?
    3) Getting your hands dirty (60 mins)
    4) Summary and Close Out (5 mins)
  3. What is Presto?
    • Open source, distributed MPP SQL query engine
    • Query in place
    • Federated querying
    • ANSI SQL compliant
    • Designed from the ground up for fast analytic queries against data of any size
    • Originally developed at Facebook
    • Proven on petabytes of data
    • SQL-on-Anything
    • Federated, pluggable architecture to support many connectors
    • Open source, hosted on GitHub
    • https://github.com/prestodb
  4. Presto: One of the Fastest Growing Open Source Projects in Data Analytics
    Business Needs
    • Data-driven decision making
    • Businesses need more data to iterate over
    Technology Trends
    • Disaggregation of storage and compute
    • The rise of data lakes
  5. Common Questions
    • Is Presto a database?
    • How is it related to Hadoop?
    • How is it different from a data warehouse?
  6. Presto Use Cases
    • Data lakehouse analytics
    • Reporting & dashboarding
    • Interactive ad hoc querying
    • Transformation using SQL (ETL)
    • Federated querying across data sources
  7. Scalable Architecture
    • Two roles: coordinator and worker
    • Easy scale up and scale down
    • Scale up to 1000 workers
    • Validated at web-scale companies
    [Diagram: a Presto cluster with one coordinator and several workers in front of a data source; new workers can be added to the cluster]
  8. Scalable Architecture
    [Diagram, top to bottom:
    • BI tools/notebooks/clients (Presto CLI, Looker, JDBC, Superset, Tableau, Jupyter, ...) send SQL and receive result sets
    • Presto coordinator: parser/analyzer, planner, scheduler, with the Metadata API and Data Location API
    • Workers, exchanging data through data shuffles
    • Presto connectors to any database, data stream, or storage: HDFS, object stores (S3), MySQL, Elasticsearch, Kafka, ...]
  9. Presto Connector Data Model
    • Connector: Driver for a data source.
      Examples: HDFS, AWS S3, Cassandra, MySQL, SQL Server, Kafka
    • Catalog: Contains schemas from a data source specified by the connector.
    • Schema: Namespace to organize tables.
    • Table: Set of unordered rows organized into columns with types.
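The catalog, schema, and table levels on this slide are what a fully qualified Presto table name spells out: `catalog.schema.table`. A minimal Python sketch of that naming, assuming a hypothetical catalog called `ahana_mysql` (the deck does not name its catalogs); `demodb.movies_metadata` is the MySQL table from the lab's sample data:

```python
from typing import NamedTuple


class TableName(NamedTuple):
    """A fully qualified Presto table reference: catalog.schema.table."""
    catalog: str  # one configured connector instance, e.g. a MySQL server
    schema: str   # namespace inside that catalog
    table: str    # the table itself

    def qualified(self) -> str:
        return f"{self.catalog}.{self.schema}.{self.table}"


def parse_table_name(name: str) -> TableName:
    """Split 'catalog.schema.table' into its three parts.

    Simplification: quoted identifiers that contain dots are not handled.
    """
    parts = name.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected catalog.schema.table, got {name!r}")
    return TableName(*parts)


# 'ahana_mysql' is a hypothetical catalog name used only for illustration.
movies = parse_table_name("ahana_mysql.demodb.movies_metadata")
print(movies.catalog, movies.schema, movies.table)
```

Because the catalog is part of the name, one query can mention tables from different connectors side by side, which is what makes federated joins possible.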
  10. Presto Hive Connector for Object Stores & File Systems
    [Diagram: coordinator, Hive Metastore (Thrift API), and two workers, each reading a split from S3]
  11. Presto Hive Connector – Data File Types
    • Supported file types: ORC, Parquet, Avro, RCFile, SequenceFile, JSON, Text, CSV
    • No data ingestion/duplication/movement needed
    • Query data in place
  12. Managing Presto Is Complex
    Hadoop complexity
    ▪ /etc/presto/config.properties
    ▪ /etc/presto/node.properties
    ▪ /etc/presto/jvm.config
    Many hidden parameters – difficult to tune
    Just the query engine
    ▪ No built-in catalog – users need to manage Hive metastore or AWS Glue
    ▪ No data lake S3 integration
    Poor out-of-box perf
    ▪ No tuning
    ▪ No high-performance indexing
    ▪ Only basic optimizations, even for common queries
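To make the three files on this slide concrete, here is a minimal sketch of what a single-node lab setup might put in them, rendered from Python so the text can be inspected before anything is written to disk. The property names are standard Presto deployment settings; every value (port, heap size, node id, environment name) is an illustrative assumption, not a setting taken from the deck or the lab guide.

```python
# Illustrative single-node settings: the coordinator also acts as a worker.
# All values below are placeholders, not tuned recommendations.
CONFIG_PROPERTIES = {
    "coordinator": "true",
    "node-scheduler.include-coordinator": "true",
    "http-server.http.port": "8080",
    "discovery-server.enabled": "true",
    "discovery.uri": "http://localhost:8080",
}

NODE_PROPERTIES = {
    "node.environment": "lab",
    "node.id": "lab-node-1",  # hypothetical; must be unique per node
    "node.data-dir": "/var/presto/data",
}

JVM_CONFIG = ["-server", "-Xmx4G", "-XX:+UseG1GC"]  # one JVM option per line


def render_properties(props: dict) -> str:
    """Render a dict as key=value lines, the format of *.properties files."""
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


def render_all() -> dict:
    """Map each file path from the slide to the text it would contain."""
    return {
        "/etc/presto/config.properties": render_properties(CONFIG_PROPERTIES),
        "/etc/presto/node.properties": render_properties(NODE_PROPERTIES),
        "/etc/presto/jvm.config": "\n".join(JVM_CONFIG) + "\n",
    }


files = render_all()
for path, text in files.items():
    print(f"# {path}\n{text}")
```

Even this small setup spreads its settings over three files, which is the slide's point about operational complexity.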
  13. Ahana Cloud – Reference Architecture
    • Distributed SQL engine with proven scalability
    • Interactive ANSI SQL queries
    • Query data where it lives with federated connectors (no ETL)
    • High concurrency
    • Separation of compute and storage
  14. Ahana Cloud for Presto
    Ahana Cloud Account: the Ahana console oversees and manages every Presto cluster
    • Ahana Console (Control Plane): cluster orchestration, consolidated logging, security & access, billing & support
    Customer Cloud Account: in-VPC orchestration of Presto clusters, where metadata, monitoring, and data sources reside
    • In-VPC Presto Clusters (Compute Plane): ad hoc cluster 1, test cluster 2, prod cluster N
    • Glue, S3, RDS, Elasticsearch
  15. Ahana Cloud Summary
    Ahana Cloud for Presto: Point & Query Cloud Service
    Gives you Presto as a cloud data warehouse in an open, disaggregated stack
    • Built-in metadata catalog, data lake, Apache Superset
    • Managed Presto in-VPC in user account
    • Start, stop, restart, resize, terminate: end-to-end cluster life cycle management
    • Amazon sources: S3, RDS/MySQL, RDS/Postgres, Elasticsearch, Redshift
    • Highly available & scalable, running in containers on Kubernetes across AZs
    • Flexible analytics stack with BYO metadata, data source, BI tool or notebook
  16. Lab 2: Start Your Configured Local Presto
    Follow the instructions outlined in Section 3.2 of the accompanying guide
  17. Lab 4: Connect to Your Ahana Cloud Presto
    Follow the instructions outlined in Section 3.4 of the accompanying guide
  18. Lab 5: Connect Superset to Ahana
    Follow the instructions outlined in Section 3.5 of the accompanying guide
  19. Things to try later...
    1. Scale your cluster up and down
    2. Check your cluster’s PrestoDB Console when running SQL
    3. Try queries with more/fewer workers; how does performance change?
    4. Try queries using Parquet format instead of ORC
    5. Delete your cluster
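For the Parquet-versus-ORC experiment, one possible approach is to copy a table into the other file format with a `CREATE TABLE ... AS SELECT` statement, which the Presto Hive connector supports through its `format` table property. The sketch below only builds the SQL text; the table names are hypothetical and the lab guide may take a different route.

```python
SUPPORTED_FORMATS = {"ORC", "PARQUET"}  # the two formats compared on the slide


def ctas_in_format(source: str, target: str, file_format: str) -> str:
    """Build a Presto CTAS statement that rewrites `source` in another file format."""
    fmt = file_format.upper()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {file_format!r}")
    return (
        f"CREATE TABLE {target}\n"
        f"WITH (format = '{fmt}')\n"
        f"AS SELECT * FROM {source}"
    )


# Hypothetical table names, for illustration only.
statement = ctas_in_format(
    source="ahana_hive.default.ratings_orc",
    target="ahana_hive.default.ratings_parquet",
    file_format="parquet",
)
print(statement)
```

Running the same query against both tables, with the same number of workers, keeps the comparison down to the file format alone.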
  20. Conclusion
    In this hands-on workshop you have:
    1. Learned about Presto and Ahana Cloud
    2. Effortlessly created and managed Presto clusters
    3. Run fast federated SQL queries combining datasets from S3 and MySQL using a BI tool
    4. Evaluated Ahana’s flexibility to scale out and scale in based on demand, stop and start clusters, and change data sources
  21. Next Steps for You...
    • Ahana Cloud is available on the AWS Marketplace
    • Sign up for a 14-day free trial here: https://ahana.io/sign-up
  22. How to get involved with Presto
    • Join the Slack channel! prestodb.slack.com
    • Write a blog for prestodb.io! prestodb.io/blog
    • Join the virtual meetup group & present! meetup.com/prestodb
    • Contribute to the project! github.com/prestodb
  23. Additional Information
    The sample data used for the Lab
    Ratings data layout (26.2m records, S3, CSV):
      userid (int)
      movieid (int)
      rating (double)
      datetime (timestamp)
    movies_metadata (MySQL) DDL (5.3k rows, RDS/MySQL table):
      create table demodb.movies_metadata(
         adult varchar(5)
        ,belongs_to_collection varchar(5000)
        ,budget integer
        ,genres varchar(1000)
        ,homepage varchar(256)
        ,id integer
        ,imdb_id varchar(12)
        ,original_language char(2)
        ,original_title varchar(256)
        ,overview varchar(5000)
        ,popularity double
        ,poster_path varchar(256)
        ,production_companies varchar(1000)
        ,production_countries varchar(1000)
        ,release_date date
        ,revenue integer
        ,runtime double
        ,spoken_languages varchar(1000)
        ,status varchar(12)
        ,tagline varchar(256)
        ,title varchar(256)
        ,video boolean
        ,vote_average double
        ,vote_count integer
      );
    All the movie datasets and more are available from: https://www.kaggle.com/rounakbanik/the-movies-dataset
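With the ratings in S3 and `movies_metadata` in MySQL, a federated query joins the two through their catalogs. The sketch below builds one such query as text. The catalog and schema names are hypothetical, and joining `ratings.movieid` to `movies_metadata.id` is an assumption: the slides do not say how the two datasets are keyed, so check this against the lab guide before relying on the results.

```python
def federated_ratings_sql(ratings_table: str, movies_table: str, limit: int = 10) -> str:
    """Build a Presto query that joins S3 ratings to MySQL movie metadata.

    Assumption: ratings.movieid corresponds to movies_metadata.id.
    """
    return (
        "SELECT m.title,\n"
        "       count(*) AS num_ratings,\n"
        "       avg(r.rating) AS avg_rating\n"
        f"FROM {ratings_table} AS r\n"
        f"JOIN {movies_table} AS m\n"
        "  ON r.movieid = m.id\n"
        "GROUP BY m.title\n"
        "ORDER BY num_ratings DESC\n"
        f"LIMIT {int(limit)}"
    )


# Hypothetical catalog/schema names; only demodb.movies_metadata is from the slide.
sql = federated_ratings_sql(
    ratings_table="ahana_hive.default.ratings",
    movies_table="ahana_mysql.demodb.movies_metadata",
)
print(sql)
```

The query itself is ordinary ANSI SQL; the only federated part is that the two table names start with different catalogs.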
  24. Data Driven Organizations Modernize from Warehouses to Open Cloud Data Lake Analytics
    Cloud Data Warehouse, key characteristics:
    • Ingest enterprise data
    • Data volumes growing slowly
    • Best performance on a subset of company data
    • Lock-in
    Open Data Lake Analytics, key characteristics:
    • Query in place
    • Data sources and volumes growing fast
    • Good-to-great perf to query all data (1000X more)
    • Open, disaggregated
    Pain: complex infrastructure needed to query data lakes
    [Diagram: a cloud data warehouse (operational data, external data, ETL, data marts, BI reports) beside a modernized platform (BI tools / notebook integrations, query engine, Hive catalog, and a cloud data lake of structured, semi-structured and unstructured data: 80% of data in S3, 20% other data sources)]