Hands-on Virtual Lab: Getting Started with Presto

HANDS-ON VIRTUAL LAB: Getting Started With Presto
April 21, 2021

WELCOME TO THE AHANA HANDS-ON VIRTUAL LAB
Your Lab Guide: Asif Kazi, Solutions Engineer
[email protected]

Objective for Today
Over the next 90 minutes you will:
• Explore and understand Presto using a hands-on lab environment
• Start by creating a 1-node cluster on your desktop/laptop
• Move to a production-ready, fully managed, containerized Kubernetes environment in the cloud
• Query SQL and object data sources using ANSI SQL
• Run federated queries/joins across multiple sources, combining data in S3 and RDS/MySQL

Agenda
1) Downloads and Cluster Creation (Background Task - 15-20 mins)
2) Understand the Technology (15-20 mins)
   a) What is Presto?
   b) What is Ahana Cloud?
3) Getting your hands dirty (60 mins)
4) Summary and Close Out (5 mins)

Downloads and Cluster Creation: Environment Readiness

Lab 0: Setup
Follow the instructions outlined in Section 3.0 of the accompanying guide.

Understanding The Technology: Presto

What is Presto?
• Open source, distributed MPP SQL query engine
• Query in place
• Federated querying
• ANSI SQL compliant
• Designed from the ground up for fast analytic queries against data of any size
• Originally developed at Facebook
• Proven on petabytes of data
• SQL-on-Anything
• Federated, pluggable architecture to support many connectors
• Open source, hosted on GitHub
• https://github.com/prestodb

Presto Overview
[Diagram: a Presto cluster with one coordinator and four workers]

Presto: One of the Fastest Growing Open Source Projects in Data Analytics
Business Needs
• Data-driven decision making
• Businesses need more data to iterate over
Technology Trends
• Disaggregation of storage and compute
• The rise of data lakes

Common Questions
• Is Presto a database?
• How is it related to Hadoop?
• How is it different from a data warehouse?

Presto Use Cases
• Data lakehouse analytics
• Reporting & dashboarding
• Interactive ad hoc querying
• Transformation using SQL (ETL)
• Federated querying across data sources

Presto Architecture

What makes Presto different?
• Scalable architecture
• Pluggable connectors
• Performance

Scalable Architecture
• Two roles: coordinator and worker
• Easy scale up and scale down
• Scale up to 1000 workers
• Validated at web-scale companies
[Diagram: a Presto cluster with a coordinator, three workers, and two new workers being added, connected to a data source]

Scalable Architecture
[Diagram labels]
• BI tools/notebooks/clients: Presto CLI, Looker, JDBC, Superset, Tableau, Jupyter, ...
• Clients send SQL and receive result sets
• Presto coordinator: parser/analyzer, planner, scheduler, Metadata API, Data Location API
• Workers: data shuffle between workers
• Presto connectors to any database, data stream, or storage: HDFS, object stores (S3), MySQL, Elasticsearch, Kafka, ...

Pluggable Presto Connectors

Presto Connector Data Model
• Connector: Driver for a data source.
  ▪ Examples: HDFS, AWS S3, Cassandra, MySQL, SQL Server, Kafka
• Catalog: Contains schemas from a data source specified by the connector.
• Schemas: Namespace to organize tables.
• Tables: Set of unordered rows organized into columns with types.
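The catalog, schema, and table levels above combine into dotted table references of the form catalog.schema.table. The sketch below is only an illustration of that naming scheme, not Presto's own resolver: the TableName type, the parse_table_name helper, and the default catalog and schema values ("hive", "default") are hypothetical, and quoted identifiers are not handled.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    """A fully qualified table reference: catalog.schema.table."""
    catalog: str
    schema: str
    table: str


def parse_table_name(name: str,
                     default_catalog: str = "hive",      # assumed session default
                     default_schema: str = "default"     # assumed session default
                     ) -> TableName:
    """Resolve a dotted reference, filling missing levels from the defaults."""
    parts = name.split(".")
    if any(part == "" for part in parts):
        raise ValueError(f"not a valid table reference: {name!r}")
    if len(parts) == 3:
        return TableName(parts[0], parts[1], parts[2])
    if len(parts) == 2:
        return TableName(default_catalog, parts[0], parts[1])
    if len(parts) == 1:
        return TableName(default_catalog, default_schema, parts[0])
    raise ValueError(f"not a valid table reference: {name!r}")
```

For example, parse_table_name("mysql.demodb.movies_metadata") yields catalog "mysql", schema "demodb", and table "movies_metadata", while a bare "ratings" falls back to the assumed defaults. The catalog name "mysql" here is an assumption; the real name is whatever the catalog was called when it was configured.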

Presto Hive Connector for Object Stores & File Systems
[Diagram: a coordinator and two workers; the Hive Metastore is reached through the Thrift API, and the workers read splits from S3]

Presto Hive Connector – Data File Types
• Supported file types:
  ▪ ORC
  ▪ Parquet
  ▪ Avro
  ▪ RCFile
  ▪ CSV
  ▪ SequenceFile
  ▪ JSON
  ▪ Text
• No data ingestion/duplication/movement needed
• Query data in place

Why Presto is Fast?
• In-memory processing
• Pull model
• Columnar storage & execution

Introducing Ahana Cloud: Fully-Managed Presto Service

Ahana Cloud is a fully managed, cloud-native Presto service

Managing Presto Is Complex
• Hadoop complexity
  ▪ /etc/presto/config.properties
  ▪ /etc/presto/node.properties
  ▪ /etc/presto/jvm.config
  ▪ Many hidden parameters – difficult to tune
• Just the query engine
  ▪ No built-in catalog – users need to manage Hive metastore or AWS Glue
  ▪ No data lake S3 integration
• Poor out-of-box performance
  ▪ No tuning
  ▪ No high-performance indexing
  ▪ Only basic optimizations, even for common queries
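To give a feel for what two of the files named above contain, here is a minimal sketch that renders a single-node config.properties and node.properties. The property names are standard Presto deployment properties, but every value (port, environment name, node id, data directory) and both helper functions are illustrative assumptions; the settings in the accompanying lab guide take precedence. jvm.config is left out because it holds JVM flags rather than key=value pairs.

```python
def render_properties(props: dict) -> str:
    """Render a dict as Java-style key=value lines, one per entry."""
    return "".join(f"{key}={value}\n" for key, value in props.items())


def single_node_config(port: int = 8080,                     # assumed port
                       node_id: str = "lab-node-1",          # hypothetical id
                       data_dir: str = "/var/presto/data"    # hypothetical path
                       ) -> dict:
    """Return file name -> file content for a one-node cluster.

    One process acts as both coordinator and worker, which is why
    node-scheduler.include-coordinator is set to true.
    """
    config = {
        "coordinator": "true",
        "node-scheduler.include-coordinator": "true",
        "http-server.http.port": str(port),
        "discovery-server.enabled": "true",
        "discovery.uri": f"http://localhost:{port}",
    }
    node = {
        "node.environment": "lab",   # assumed environment name
        "node.id": node_id,
        "node.data-dir": data_dir,
    }
    return {
        "config.properties": render_properties(config),
        "node.properties": render_properties(node),
    }


if __name__ == "__main__":
    for file_name, content in single_node_config().items():
        print(f"# {file_name}")
        print(content)
```

Even this small case needs eight settings across two files, before any catalog or memory tuning is added.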

Ahana Cloud – Reference Architecture
• Distributed SQL engine with proven scalability
• Interactive ANSI SQL queries
• Query data where it lives with federated connectors (no ETL)
• High concurrency
• Separation of compute and storage

Ahana Cloud for Presto
Ahana Cloud Account: Ahana Console (Control Plane)
• Cluster orchestration
• Consolidated logging
• Security & access
• Billing & support
The Ahana console oversees and manages every Presto cluster.
Customer Cloud Account: In-VPC Presto Clusters (Compute Plane)
• Ad hoc cluster 1
• Test cluster 2
• Prod cluster N
• Glue, S3, RDS, Elasticsearch
In-VPC orchestration of Presto clusters, where metadata, monitoring, and data sources reside.

Ahana Cloud Summary
Ahana Cloud for Presto: Point & Query Cloud Service
Gives you Presto as a cloud data warehouse in an open, disaggregated stack
• Built-in metadata catalog, data lake, Apache Superset
• Managed Presto in-VPC in the user's account
• Start, stop, restart, resize, terminate: end-to-end cluster life cycle management
• Amazon sources: S3, RDS/MySQL, RDS/Postgres, Elasticsearch, Redshift
• Highly available & scalable, running in containers on Kubernetes across AZs
• Flexible analytics stack with BYO metadata, data source, BI tool, or notebook

Getting Your Hands Dirty: Single Node to Production Scale

Local Presto

Lab 1: Local Presto Configuration
Follow the instructions outlined in Section 3.1 of the accompanying guide.

Lab 2: Start Your Configured Local Presto
Follow the instructions outlined in Section 3.2 of the accompanying guide.

Lab 3: Query Your Catalogs
Follow the instructions outlined in Section 3.3 of the accompanying guide.

Ahana Cloud Presto

Lab 4: Connect to Your Ahana Cloud Presto
Follow the instructions outlined in Section 3.4 of the accompanying guide.

Lab 5: Connect Superset to Ahana
Follow the instructions outlined in Section 3.5 of the accompanying guide.

Lab 6: Query MySQL
Follow the instructions outlined in Section 3.6 of the accompanying guide.

Wrapping Up: Conclusion and Things to Try

Things to try later...
1. Scale your cluster up and down
2. Check your cluster’s PrestoDB Console when running SQL
3. Try queries with more or fewer workers; how does performance change?
4. Try queries using Parquet format instead of ORC
5. Delete your cluster

Conclusion
In this hands-on workshop you have:
1. Learned about Presto and Ahana Cloud
2. Effortlessly created and managed Presto clusters
3. Run fast federated SQL queries combining datasets from S3 and MySQL using a BI tool
4. Evaluated Ahana’s flexibility to scale out and scale in based on demand, stop and start clusters, and change data sources

Next Steps for You...
• Ahana Cloud is available on the AWS Marketplace
• Sign up for a 14-day free trial here: https://ahana.io/sign-up

Thank you!
Stay Up-to-Date with Ahana
• Website: https://ahana.io/
• Blogs: https://ahana.io/blog/
• Twitter: @ahanaio

How to get involved with Presto
• Join the Slack channel! prestodb.slack.com
• Write a blog for prestodb.io! prestodb.io/blog
• Join the virtual meetup group & present! meetup.com/prestodb
• Contribute to the project! github.com/prestodb

Questions?

Additional Information
The sample data used for the lab:
• Ratings data layout: userid (int), movieid (int), rating (double), datetime (timestamp)
• 26.2m records, S3, CSV
• 5.3k rows, RDS/MySQL table
All the movie datasets and more are available from: https://www.kaggle.com/rounakbanik/the-movies-dataset

movies_metadata (MySQL) DDL:
    create table demodb.movies_metadata(
        adult varchar(5),
        belongs_to_collection varchar(5000),
        budget integer,
        genres varchar(1000),
        homepage varchar(256),
        id integer,
        imdb_id varchar(12),
        original_language char(2),
        original_title varchar(256),
        overview varchar(5000),
        popularity double,
        poster_path varchar(256),
        production_companies varchar(1000),
        production_countries varchar(1000),
        release_date date,
        revenue integer,
        runtime double,
        spoken_languages varchar(1000),
        status varchar(12),
        tagline varchar(256),
        title varchar(256),
        video boolean,
        vote_average double,
        vote_count integer
    );
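The shape of the join between the ratings data and movies_metadata can be tried without a cluster. The sketch below uses Python's built-in sqlite3 module with two attached in-memory databases as a stand-in for two separate sources; it is not Presto, and in the lab the tables would be addressed through whatever catalog names were configured. The join key (ratings.movieid = movies_metadata.id) is an assumption, the rows are invented toy values, and only two of the movies_metadata columns are created.

```python
import sqlite3


def demo_join() -> list:
    """Join ratings in one database with movie titles in another."""
    conn = sqlite3.connect(":memory:")
    # A second, separate in-memory database standing in for the MySQL source.
    conn.execute("ATTACH DATABASE ':memory:' AS demodb")

    conn.execute(
        "CREATE TABLE ratings "
        "(userid INTEGER, movieid INTEGER, rating REAL, datetime TEXT)"
    )
    conn.execute("CREATE TABLE demodb.movies_metadata (id INTEGER, title TEXT)")

    # Invented toy rows, not taken from the lab dataset.
    conn.executemany(
        "INSERT INTO ratings VALUES (?, ?, ?, ?)",
        [
            (1, 10, 4.0, "2021-04-21 10:00:00"),
            (2, 10, 5.0, "2021-04-21 10:05:00"),
            (1, 20, 3.0, "2021-04-21 10:10:00"),
        ],
    )
    conn.executemany(
        "INSERT INTO demodb.movies_metadata VALUES (?, ?)",
        [(10, "Movie A"), (20, "Movie B")],
    )

    rows = conn.execute(
        """
        SELECT m.title, COUNT(*) AS num_ratings, AVG(r.rating) AS avg_rating
        FROM ratings AS r
        JOIN demodb.movies_metadata AS m ON r.movieid = m.id
        GROUP BY m.title
        ORDER BY m.title
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for title, num_ratings, avg_rating in demo_join():
        print(title, num_ratings, avg_rating)
```

The SELECT is plain ANSI-style SQL, so a similar statement should carry over to the lab once the table references are replaced with the catalog-qualified names used there.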

Data Driven Organizations Modernize from Warehouses to Open Cloud Data Lake Analytics
Cloud Data Warehouse
Key characteristics:
• Ingest enterprise data
• Data volumes growing slowly
• Best performance on a subset of company data
• Lock-in
Open Data Lake Analytics
Key characteristics:
• Query in place
• Data sources and volumes growing fast
• Good-Great perf to query all data (1000X more)
• Open, disaggregated
Pain: Complex infrastructure needed to query data lakes
[Diagram labels: Modernized Platform; BI Tools / Notebook Integrations; BI Reports; Data Marts; ETL; External Data; Operational Data; Cloud Data Lake with structured, semi-structured and unstructured data; 80% of data in S3, 20% other data sources; Query Engine; Hive (Catalog)]
