Slide 55
Data Source
Buffering limits: 1-15 min or 1-128 MiB, whichever is reached first
❖ Amazon Athena pricing: you pay for each SQL query you run, based on the number of bytes it scans
✓ Data compression reduces the number of bytes scanned
✓ Columnar data formats (e.g., Parquet, ORC) let a query scan only the columns it needs
✓ Data partitioning restricts the scan to the partitions that match the query's filter
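A rough sketch of how the three tips combine into query cost. The price per TB, the per-query minimum, and all of the ratios are assumptions used for illustration, not values from the slide; check the current Athena pricing for your region.

```python
TIB = 1024 ** 4
MIB = 1024 ** 2


def bytes_after_optimizations(raw_bytes, compression_ratio=1.0,
                              column_fraction=1.0, partition_fraction=1.0):
    """Estimate bytes scanned once the tips are applied.

    compression_ratio:  raw size / compressed size (hypothetical, e.g. 4.0)
    column_fraction:    share of the data held by the columns the query reads
    partition_fraction: share of partitions the WHERE clause selects
    """
    return raw_bytes * column_fraction * partition_fraction / compression_ratio


def estimate_scan_cost(bytes_scanned, price_per_tb=5.0, min_bytes=10 * MIB):
    """Estimate the cost of one query in USD.

    price_per_tb and min_bytes are assumed defaults, not authoritative prices.
    """
    billed = max(bytes_scanned, min_bytes)
    return billed / TIB * price_per_tb


if __name__ == "__main__":
    raw = 1 * TIB
    scanned = bytes_after_optimizations(
        raw, compression_ratio=4.0, column_fraction=0.5, partition_fraction=0.1)
    print(f"full scan:      ${estimate_scan_cost(raw):.2f}")
    print(f"optimized scan: ${estimate_scan_cost(scanned):.4f}")
```

The factors multiply, which is why combining compression, a columnar format, and partitioning tends to matter more than any one of them alone.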
s3://bucket/csv/year=?/month=?/day=?/hour=?/
1.csv, 10MiB
2.csv, 9.5MiB
…
100.csv, 11MiB
many small files
s3://bucket/parquet/year=?/month=?/day=?/hour=?/
1.parquet, 100MiB
2.parquet, 90.5MiB
…
5.parquet, 110MiB
a few large files
[CTAS Query]
CREATE TABLE new_table
WITH (
  external_location = '{location}',
  format = 'PARQUET',
  parquet_compression = 'SNAPPY')
AS SELECT *
FROM old_table
WHERE year = {year}
  AND month = {month}
  AND day = {day}
  AND hour = {hour}
WITH DATA
trigger
Amazon Athena Performance Tips
Run a CTAS query every hour to merge many small files into a few large files
S3 (tier-0)
S3 (tier-1)
Athena
time-based event (e.g., 1hr)
Kinesis Data Streams
Kinesis Data Firehose
Dashboard