Introduction to Amazon Athena

© 2020, Amazon Web Services, Inc. or its Affiliates. All
rights reserved. Sungmin Kim, AWS Solutions Architect Amazon Athena Interactive query service to analyze data in Amazon S3 using standard

rights reserved. Agenda • Amazon Athena Overview • Athena Design Patterns • Athena in Action • Create External Tables • Data Formats and SerDes • Partitions • Optimize and Secure • Workgroups • Views • CTAS • Monitoring & Auditing • Summary

rights reserved. 1. Introduction to Amazon Athena

rights reserved. Amazon Athena Serverless Distributed (SQL) Query Engine for Amazon S3 Distributed SQL Query Engine Amazon S3

rights reserved. Why Amazon Athena ? • Decouple storage from compute • Serverless – No infrastructure or resources to manage • Pay only for data scanned • Schema on read – Same data, many views • Secure – IAM/SAMLv2 for authentication; Encryption at rest & in transit • Standard compliant and open storage file formats • Built on powerful community supported OSS solutions

rights reserved. Simple Pricing • DDL operations – FREE • SQL operations – FREE • Query concurrency – FREE • Data scanned - $5 / TB • Standard S3 rates for storage, requests, and data transfers apply

rights reserved. Runs standard SQL • Uses Presto with ANSI SQL support • Works with standard data formats • CSV • Apache Weblogs • JSON • Parquet • ORC • Handles complex queries • Large Joins • Window functions • Arrays

rights reserved. Fast Performance for Large Data Sets • Fast, ad-hoc queries • Executes queries in parallel • No provisioning extra resources for complex queries • Scales automatically • Amazon Athena Federated Query

rights reserved. Storage Query Engine Scale-up Scale-out How to handle Large Data Sets?

rights reserved. Storage Storage Interface Layer Storage Compute Decouple storage from compute Compute Scale-out

rights reserved. Storage Interface Layer Scale-out Storage Scale-out Compute Distributed Processing Framework Distributed File System (DFS) ex) HDFS, Amazon S3, … ex) Hadoop MapReduce, Apache Spark, Flink Apache Hive, Presto Amazon Athena, … Distributed Data Processing System

rights reserved. 2. Athena Design Patterns

rights reserved. 1: Ad-hoc use-case

rights reserved. S3 Athena AWS Glue Data Catalog Query data Hot data Warm & cold data Application request 2: SaaS use-case

rights reserved. AWS service logs Application logs Data sourced from external vendors S3 Athena Update table partition Query data S3 Athena CTAS and INSERT INTO to ETL 3: ETL and query use-case Glue Data Catalog

rights reserved. S3 Athena Query data S3 4: Data science exploration and feature engineering Glue Data Catalog Raw Data Transformed data SageMaker

rights reserved. Comparison of SQL Processing engines Data Structure Semi Semi Semi Full Languages API/SQL SQL SQL SQL Data Store S3 (Glue), S3/HDFS (Spark) S3/HDFS S3 Local Use case Transformation SQL Queries for S3/HDFS Serverless SQL Queries for S3 Fully Featured SQL Database Performance Glue Amazon Athena Amazon Redshift Amazon EMR/ Amazon EMR

rights reserved. 3. Athena in Action

rights reserved. Create External Tables • Use Apache Hive DDL to create table • Run DDL statements using the Athena console • Via a JDBC driver, using SQL workbench • Using the Athena create table wizard • Create “external” table in DDL • Creates a view of the data • Deleting table doesn’t delete the data in S3 • Schema-on-read • Projects your schema onto data at query execution time • No need for data loading or transformation

rights reserved. Navigate to ’Saved Queries’ to get DDL

rights reserved. Create External Table using DDL

rights reserved. New Flights Parquet Table Created

rights reserved. Save Query Results • When you run an Athena query for the first time, an S3 bucket called "aws-athena-query-results-<account_id>" is created on your account, where "<account_id>" is replaced with your AWS account ID. • The results of all Athena queries are stored in this S3 bucket using the year, month, and day the query was run, the hexadecimal Athena query ID, and the region the query was run in the following format:

rights reserved. Athena S3 $ aws s3 ls s3://ads/impressions/ PRE dt=2009-04-12-13-00/ PRE dt=2009-04-12-13-05/ PRE dt=2009-04-12-13-10/ PRE dt=2009-04-12-13-15/ PRE dt=2009-04-12-13-20/ PRE dt=2009-04-12-14-00/ PRE dt=2009-04-12-14-05/ PRE dt=2009-04-12-14-10/ PRE dt=2009-04-12-14-15/ PRE dt=2009-04-12-14-20/ PRE dt=2009-04-12-15-00/ PRE dt=2009-04-12-15-05/ ….. ------------------------------------------- 2009-04-12-13-20 ap3HcVKAWfXtgIPu6WpuUfAfL0DQEc 2009-04-12-13-20 17uchtodoS9kdeQP1x0XThKl5IuRsV 2009-04-12-13-20 JOUf1SCtRwviGw8sVcghqE5h0nkgtp 2009-04-12-13-20 NQ2XP0J0dvVbCXJ0pb4XvqJ5A4QxxH 2009-04-12-13-20 fFAItiBMsgqro9kRdIwbeX60SROaxr 2009-04-12-13-20 V4og4R9W6G3QjHHwF7gI1cSqig5D1G 2009-04-12-13-20 hPEPtBwk45msmwWTxPVVo1kVu4v11b 2009-04-12-13-20 v0SkfxegheD90gp31UCr6FplnKpx6i 2009-04-12-13-20 1iD9odVgOIi4QWkwHMcOhmwTkWDKfj 2009-04-12-13-20 b31tJiIA25CK8eDHQrHnbcknfSndUk SELECT dt, impressionId FROM impressions WHERE dt < '2009-04-12-14-00’ AND dt >='2009-04-12-13-00’ ORDER BY dt DESC LIMIT 100; How to execute query?

rights reserved. SELECT dt, impressionId FROM impressions WHERE dt < '2009-04-12-14-00’ AND dt >='2009-04-12-13-00’ ORDER BY dt DESC LIMIT 100; Data Schema & Location CREATE EXTERNAL TABLE impressions ( requestBeginTime timestamp, adId string, impressionId string, referrer string, userAgent string, ip varchar(15), number integer, modelId string, requestEndTime timestamp, timers struct<modelLookup:string, requestTime:float>, hostname string, sessionId string) PARTITIONED BY (dt string) ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe’ LOCATION 's3://ads/tables/impressions/'; Athena S3 Schema-On Read

rights reserved. 3.1 Tuning and Design Patterns

rights reserved. Data Formats and SerDes • Current data formats and SerDes • Standard • CSV, TSV • JSON • Apache web logs • Custom-Delimited Files • Open source columnar formats • Parquet • ORC

rights reserved. Data Formats in Create Table

rights reserved. Column Data Types • Column Data Types • Primitives ( Integer, Double, Date, String, Decimal, Varchar…) • ARRAY • MAP • STRUCT • Nested JSON • Query using structures and array of structures

rights reserved. Example CREATE EXTERNAL TABLE impressions ( requestBeginTime timestamp, adId string, impressionId string, referrer string, userAgent string, ip varchar(15), number integer, modelId string, requestEndTime timestamp, timers struct<modelLookup:string, requestTime:float>, hostname string, sessionId string) PARTITIONED BY (dt string) ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe’ LOCATION 's3://ads/tables/impressions/'; Athena S3 Primitives Struct

rights reserved. 3.2 Partitions

rights reserved. Partitions • Partition on any column in table • Limits the amount of data each query scans • Common practice • Multi-level time based partition • Example • PARTITIONED BY (year STRING, month STRING, day STRING) • Performance and cost savings on queries filtering by day, month and year

rights reserved. Example CREATE EXTERNAL TABLE impressions ( requestBeginTime timestamp, adId string, impressionId string, referrer string, userAgent string, ip varchar(15), number integer, modelId string, requestEndTime timestamp, timers struct<modelLookup:string, requestTime:float>, hostname string, sessionId string) PARTITIONED BY (dt string) ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe’ LOCATION 's3://ads/tables/impressions/'; Athena S3

rights reserved. Partitions in Create Table

rights reserved. How does partitioning work? • Flight dataset in Parquet • Partitioned by year • Group by query • Top 10 airports with the most departures since 2000 • Where year >= 2000 • Athena doesn’t have to scan data in partitions before year 2000 • Amount of data scanned affects pricing • 1 min 6s, 3.11GB for unpartitioned • 37.37 s, 2.25GB for CSV • 9.32 secs, 0.04GB for Parquet

rights reserved. How does partitioning work? CSV: 37.37 s, 2.25GB No partitions: 1min 6s, 3.11GB

rights reserved. Using Partition Projection • Partition values and locations are calculated from configuration • Partition pruning gathers metadata and "prunes" it to only the partitions that apply to your query • Allows Athena to avoid calling GetPartitions To use partition projection, you specify the ranges of partition values and projection types for each partition column in the table properties in the AWS Glue Data Catalog or in your external Hive metastore.

rights reserved. Data Catalog (Meta Store) table 이름 데이터 위치 flight_csv s3://athena-examples/flight/ 2005 flight_csv s3://athena-examples/flight/ 2006 flight_csv s3://athena-examples/flight/ 2007 flight_csv s3://athena-examples/flight/ 2008 …. 2016 flight_csv s3://athena-examples/flight/ 파티션(year) S3 $ aws s3 ls s3://athena-examples/flight/ PRE year=2005/ PRE year=2006/ PRE year=2007/ PRE year=2008/ PRE year=2009/ PRE year=2010/ PRE year=2011/ PRE year=2012/ PRE year=2013/ PRE year=2014/ PRE year=2015/ PRE year=2016/ S3의 디렉터리 구조에 맞게 파티션 정보 생성 새로운 파티션을 MetaStore에 추가하기 Data Catalog (Meta Store) – Schema, Location

rights reserved. Athena S3 Meta Store (1) read all data (2) filtering Athena S3 (1) read data located in specific partitions (3) execute query (2) execute query • Partitions ✓ • Scan할 데이터 양이 적음 • I/O 속도 향상 비용 감소 Meta Store • Partitions ✗ • Scan할 데이터 양이 많음 • I/O 속도 저하 비용 증가 SELECT origin, count(*) as total_departures FROM flight_csv WHERE year >= ‘2000’ GROUP BY origin ORDER BY total_departures DESC LIMIT 10

rights reserved. Partition by Running MSCK Repair • If your data is already partitioned in Hive format • Use MSCK repair table command. • To sync all new data in S3 with Hive metastore • Example: • msck repair table flights_csv • show partitions flights_csv

rights reserved. Partition by Manually Adding Partitions • If data is not partitioned in Hive format • Cannot use MSCK repair table command • Use ALTER TABLE ADD PARTITION to add each partition manually • Look at automating adding partitions • Example: – ALTER TABLE elb_logs_raw_native_part ADD PARTITION (year='2015',month='01',day='01’) location 's3://athena- examples/elb/raw/2015/01/01/'

rights reserved. Data Catalog (Meta Store) table 이름 데이터 위치 flight_csv s3://athena-examples/flight/ 2005 flight_csv s3://athena-examples/flight/ 2006 flight_csv s3://athena-examples/flight/ 2007 flight_csv s3://athena-examples/flight/ 2008 …. 2016 flight_csv s3://athena-examples/flight/ 파티션(year) S3 $ aws s3 ls s3://athena-examples/flight/ PRE year=2005/ PRE year=2006/ PRE year=2007/ PRE year=2008/ PRE year=2009/ PRE year=2010/ PRE year=2011/ PRE year=2012/ PRE year=2013/ PRE year=2014/ PRE year=2015/ PRE year=2016/ S3의 디렉터리 구조에 맞게 파티션 정보 생성 새로운 파티션을 MetaStore에 추가하기 Data Catalog (Meta Store) – Schema, Location ALTER TABLE … ADD PARTITIONS … MSCK REPAIR …

rights reserved. Convert to Columnar Formats - Parquet/ORC • Convert to Parquet/ORC, use EMR • Create EMR cluster with Hive or Spark • In the step section of create cluster • Specify script in S3, Hive or PySpark • Point to the input data in text/CSV/JSON • Create output data in Parquet/ORC in S3 • Use Athena CTAS queries to create a new table with Parquet data from a source table in a different format. • CREATE TABLE new_table WITH ( format = 'Parquet', parquet_compression = 'SNAPPY’) AS SELECT * F ROM old_table;

rights reserved. Different workloads - OLTP vs. OLAP OLTP • Online transaction processing • Lots of small operations involving whole rows OLAP • Online analytical processing • Few large operations involving subset of all columns Assumption: I/O is expensive (memory, disk, network, …)

rights reserved. A0 B0 C0 A1 B1 C1 A2 B2 C2 A3 B3 C3 A4 B4 C4 A5 B5 C5 Row-wise vs. Columnar Row-wise • Horizontal partitioning • OLTP ✓, OLAP ✗ Columnar • Vertical partitioning • OLTP ✗, OLAP ✓ • Free projection pushdown • Compression opportunities Columnar A0 B0 C0 A1 B1 C1 A2 B2 C2 A3 B3 C3 A4 B4 C4 A5 B5 C5 Row 0 Row 1 Row 2 Row 3 Row 4 Row 5 Col A Col B Col C A0 A1 A2 A3 A4 A5 B0 B1 B2 B3 B4 B5 C0 C1 C2 C3 C4 C5 Row-wise

rights reserved. 3.3 Optimize and Secure

rights reserved. Best Practices • Partition your data • Divides your table into parts and keeps the related data together based on column values such as date, country, region, etc • Bucket your data • You can specify one or more columns containing rows that you want to group together, and put those rows into multiple buckets. • Compress and split files • Can speed up your queries significantly • Optimize File Sizes • Ensuring that your file formats are splittable helps with parallelism

rights reserved. Best Practices Continued • Optimize columnar data store generation • Apache Parquet and Apache ORC • Optimize Queries • ORDER BY • JOIN • GROUP BY • LIKE • Only include the columns that you need

rights reserved. Access Control for Athena • Access Control • AWS Identity and Access Management (IAM) • Access Control Lists (ACLs) • Amazon S3 bucket policies. • IAM policies for S3 • IAM is natively integrated with S3 • Grant IAM users fine-grained control to S3 • Restrict users from querying it using Athena

rights reserved. Access Control Continued • To run queries in Athena, you must have the appropriate permissions for the following: • Athena API actions including additional actions for Athena workgroups • Amazon S3 locations where the underlying data to query is stored. • Metadata and resources that you store in the AWS Glue Data Catalog, such as databases and tables, including additional actions for encrypted metadata.

rights reserved. 3.4 Workgroups

rights reserved. Athena Workgroups Athena Workgroups are used to isolate queries between different teams, workloads or applications, and to set limits on amount of data each query or the entire workgroup can process Workload Isolation Query Metrics Cost Controls

rights reserved. Workgroups – Workload Isolation Unique query output location per Workgroup Encrypt results with unique AWS KMS key per Workgroup Collect and publish aggregated metrics per Workgroup to AWS CloudWatch Use Workgroup settings eliminating need to configure individual users

rights reserved. Workgroups – Metric Reporting Total bytes scanned per Workgroup Total failed queries per Workgroup Total successful queries per Workgroup Total query execution time per Workgroup

rights reserved. Workgroups – Cost Controls • Per query data scanned threshold; exceeding, will cancel query • Trigger alarms to notify of increasing usage and cost • Disable Workgroup when all queries exceed a maximum threshold Any Athena metric: successful/failed & total queries, query run time, etc.

rights reserved. Setting Up Workgroups 1. Decide which workgroups to create. For example, you can decide the following: • Who can run queries in each workgroup, and who owns workgroup configuration. This determines IAM policies you create. 2. Create workgroups as needed, and add tags to them. • Open the Athena console, choose the Workgroup:<workgroup_name> tab, and then choose Create workgroup. 3. Create IAM policies for your users, groups, or roles to enable their access to workgroups. 4. Specify a location in Amazon S3 for query results and encryption settings

rights reserved. Default Workgroup By default, if you have not created any workgroups, all queries in your account run in the primary workgroup

rights reserved. 3.5 Views

rights reserved. View • A view in Amazon Athena is a logical, not a physical table. • The query that defines a view runs each time the view is referenced in a query. • Views do not contain any data and do not write data. • Instead, the query specified by the view runs each time you reference the view by another query.

rights reserved. Create a View • Creates a new view from a specified SELECT query. • The view is a logical table that can be referenced by future queries. • Flow: • Create View: • Write View:

rights reserved. 3.6 CTAS

rights reserved. CREATE TABLE AS SELECT (CTAS) • Creates a new table in Athena from the results of a SELECT statement from another query • Athena stores data files created by the CTAS statement in a specified location in Amazon S3

rights reserved. CREATE TABLE new_table WITH ( external_location = 's3://my_athena_results/my_parquet_stas_table/', format = 'Parquet', parquet_compression = 'SNAPPY' ) AS SELECT * FROM old_table; See more examples in https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html CTAS Example

rights reserved. Use CTAS queries to: • Create tables from query results in one step, without repeatedly querying raw data sets. This makes it easier to work with raw data sets. • Transform query results into other storage formats, such as Parquet and ORC. This improves query performance and reduces query costs in Athena. • Create copies of existing tables that contain only the data you need.

rights reserved. Data Processing: ETL vs. ELT Source1 Source2 Target Source1 Source2 Staging tables Final tables Target (MPP database) Extract Transform Load E → T → L Extract & Load Transform E → L → T Source3 Source3

rights reserved. 3.7 Monitoring & Auditing

rights reserved. Workgroups • Separate users, teams, applications, or workloads • Enable queries on Requester Pays buckets in Amazon S3 • In order to: • Set limits on amount of data each query or the entire workgroup can process • Track costs Ø Decide which workgroups to create. For example, you can decide the following: • Who can run queries in each workgroup, and who owns workgroup configuration. This determines IAM policies you create. Ø Create workgroups as needed, and add tags to them. • Open the Athena console, choose the Workgroup:<workgroup_name> tab, and then choose Create workgroup. Ø Create IAM policies for your users, groups, or roles to enable their access to workgroups. Ø Specify a location in Amazon S3 for query results and encryption settings

rights reserved. CloudTrail Monitor Athena with AWS CloudTrail

rights reserved. Monitoring Athena Queries with CloudWatch Metrics Use CloudWatch Events with Athena: • EngineExecutionTime – in milliseconds • ProcessedBytes – the total amount of data scanned per DML query • QueryPlanningTime – in milliseconds • QueryQueueTime – in milliseconds • ServiceProcessingTime – in milliseconds • TotalExecutionTime – in milliseconds, for DDL and DML queries

rights reserved. Monitoring Athena Queries with CloudWatch Events • Use simple rules to match events and route them to one or more target functions or streams. • Respond to operational changes and takes corrective action as necessary, by • Sending messages to respond to the environment • Activating functions • Making changes • Capturing state information

rights reserved. 4. Summary

rights reserved. Key Benefits of Athena • Fast ad-hoc queries • Query directly against data in S3 • Use standard ANSI SQL • No infrastructure to setup and manage • Use IAM for security • JDBC driver for BI tools and SQL clients • Pay per query, depending on data scanned

rights reserved. Getting Started Tutorials • Amazon Athena in Action: Workshop • Amazon Athena Getting Started Cookbook Examples • Data visualization and anomaly detection using Amazon Athena and Pandas from Amazon SageMaker (2020-09-25) • Prepare data for model-training and invoke machine learning models with Amazon Athena (2019-11-26) Tools • AWS Data Wrangler: Pandas on AWS • PyAthena: a python client for Amazon Athena

rights reserved. Questions?

Introduction to Amazon Athena

Introduction to Amazon Athena

More Decks by Sungmin Kim

Other Decks in Technology

Featured

Transcript