Slide 1

Slide 1 text

Logs, AWS Athena, AWS Glue

Slide 2

Slide 2 text

Logging ( Log Types ) 1. Nginx server logs - Combined Log Format 2. Impression / click / device logs -> Nested JSON Format 3. Allocation logs -> JSON 4. Buzzscreen API logs -> IDK

Slide 3

Slide 3 text

Logging ( Log Types ) -> Various formats, Various data locations, -> For simple, unrepeated queries : Use Athena for short interactive queries -> For complex or repeated analysis on the same data schema : Relationalize (Transform) and load onto Redshift.

Slide 4

Slide 4 text

Athena Benefits 1. We only pay for amount of data accessed -> Fast and free data aggregation a. $5 for every TB accessed 2. Serverless -> No need to manage computing resources 3. Freaky fast.

Slide 5

Slide 5 text

Purpose : Access only the data you need 1. Automatic Version a. MSCK REPAIR TABLE bs_go_general 2. Manual Version a. ALTER TABLE bs_go_general ADD IF NOT EXISTS PARTITION (year = {year}, month = {month}, day = {day}) LOCATION 's3://buzzvil-log/prod/buzzscreen/buzzscreen-go-general/{year}/{month}/{day}/'; Table Partitioning

Slide 6

Slide 6 text

Query Language Presto based SQL 1. Supports wide range of aggregate, window, and array functions a. https://prestodb.io/docs/0.172/functions.html 2. In memory SQL engine a. We don’t need to manage this 3. Optimized for star schema joins (1 large Fact table and many smaller dimension tables)

Slide 7

Slide 7 text

Improving Athena Performance 1. Columnar Formats (Parquet, ORC) 2. Narrower partitions (hours instead of days) 3. Eliminate the need for joins

Slide 8

Slide 8 text

Improving Athena Performance - Columnar Formats (Parquet, ORC)

Slide 9

Slide 9 text

Glue 1. Fully Managed ETL 2. Easily track and propagate changes in data schema from S3 to data sink 3. Manage Athena, S3, and Redshift at the same time 4. Relationalize data in S3, keep track of schema, sync necessary data to redshift or use Spectrum if applicable

Slide 10

Slide 10 text

Glue - Examples https://aws.amazon.com/blogs/big-data/simplify-querying-nested-json-with-the-aws-glu e-relationalize-transform/ https://aws.amazon.com/blogs/big-data/build-a-data-lake-foundation-with-aws-glue-and -amazon-s3/

Slide 11

Slide 11 text

Utilize Glue? 1. Convert various logs (Nested JSON, nginx logs) into more efficiently queryable formats ( Parquet, Relationalized JSON, etc ) 2. Decouple ETL process from the rest of the service 3. Easily monitor data loading procedures (CloudWatch) a. https://docs.aws.amazon.com/glue/latest/dg/monitor-glue.html