Logs, AWS Athena, AWS Glue

Logging ( Log Types ) 1. Nginx server logs -
Combined Log Format 2. Impression / click / device logs -> Nested JSON Format 3. Allocation logs -> JSON 4. Buzzscreen API logs -> IDK

Logging ( Log Types ) -> Various formats, Various data
locations, -> For simple, unrepeated queries : Use Athena for short interactive queries -> For complex or repeated analysis on the same data schema : Relationalize (Transform) and load onto Redshift.

Athena Benefits 1. We only pay for amount of data
accessed -> Fast and free data aggregation a. $5 for every TB accessed 2. Serverless -> No need to manage computing resources 3. Freaky fast.

Purpose : Access only the data you need 1. Automatic
Version a. MSCK REPAIR TABLE bs_go_general 2. Manual Version a. ALTER TABLE bs_go_general ADD IF NOT EXISTS PARTITION (year = {year}, month = {month}, day = {day}) LOCATION 's3://buzzvil-log/prod/buzzscreen/buzzscreen-go-general/{year}/{month}/{day}/'; Table Partitioning

Query Language Presto based SQL 1. Supports wide range of
aggregate, window, and array functions a. https://prestodb.io/docs/0.172/functions.html 2. In memory SQL engine a. We don’t need to manage this 3. Optimized for star schema joins (1 large Fact table and many smaller dimension tables)

Improving Athena Performance 1. Columnar Formats (Parquet, ORC) 2. Narrower
partitions (hours instead of days) 3. Eliminate the need for joins

Improving Athena Performance - Columnar Formats (Parquet, ORC)

Glue 1. Fully Managed ETL 2. Easily track and propagate
changes in data schema from S3 to data sink 3. Manage Athena, S3, and Redshift at the same time 4. Relationalize data in S3, keep track of schema, sync necessary data to redshift or use Spectrum if applicable

Glue - Examples https://aws.amazon.com/blogs/big-data/simplify-querying-nested-json-with-the-aws-glu e-relationalize-transform/ https://aws.amazon.com/blogs/big-data/build-a-data-lake-foundation-with-aws-glue-and -amazon-s3/

Utilize Glue? 1. Convert various logs (Nested JSON, nginx logs)
into more efficiently queryable formats ( Parquet, Relationalized JSON, etc ) 2. Decouple ETL process from the rest of the service 3. Easily monitor data loading procedures (CloudWatch) a. https://docs.aws.amazon.com/glue/latest/dg/monitor-glue.html

Logs, AWS Athena, AWS Glue

Logs, AWS Athena, AWS Glue

Buzzvil

More Decks by Buzzvil

Other Decks in Programming

Featured

Transcript