Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Logs, AWS Athena, AWS Glue

Buzzvil
February 27, 2018

Logs, AWS Athena, AWS Glue

By Andy

Buzzvil

February 27, 2018
Tweet

More Decks by Buzzvil

Other Decks in Programming

Transcript

  1. Logging ( Log Types ) 1. Nginx server logs -

    Combined Log Format 2. Impression / click / device logs -> Nested JSON Format 3. Allocation logs -> JSON 4. Buzzscreen API logs -> IDK
  2. Logging ( Log Types ) -> Various formats, Various data

    locations, -> For simple, unrepeated queries : Use Athena for short interactive queries -> For complex or repeated analysis on the same data schema : Relationalize (Transform) and load onto Redshift.
  3. Athena Benefits 1. We only pay for amount of data

    accessed -> Fast and free data aggregation a. $5 for every TB accessed 2. Serverless -> No need to manage computing resources 3. Freaky fast.
  4. Purpose : Access only the data you need 1. Automatic

    Version a. MSCK REPAIR TABLE bs_go_general 2. Manual Version a. ALTER TABLE bs_go_general ADD IF NOT EXISTS PARTITION (year = {year}, month = {month}, day = {day}) LOCATION 's3://buzzvil-log/prod/buzzscreen/buzzscreen-go-general/{year}/{month}/{day}/'; Table Partitioning
  5. Query Language Presto based SQL 1. Supports wide range of

    aggregate, window, and array functions a. https://prestodb.io/docs/0.172/functions.html 2. In memory SQL engine a. We don’t need to manage this 3. Optimized for star schema joins (1 large Fact table and many smaller dimension tables)
  6. Improving Athena Performance 1. Columnar Formats (Parquet, ORC) 2. Narrower

    partitions (hours instead of days) 3. Eliminate the need for joins
  7. Glue 1. Fully Managed ETL 2. Easily track and propagate

    changes in data schema from S3 to data sink 3. Manage Athena, S3, and Redshift at the same time 4. Relationalize data in S3, keep track of schema, sync necessary data to redshift or use Spectrum if applicable
  8. Utilize Glue? 1. Convert various logs (Nested JSON, nginx logs)

    into more efficiently queryable formats ( Parquet, Relationalized JSON, etc ) 2. Decouple ETL process from the rest of the service 3. Easily monitor data loading procedures (CloudWatch) a. https://docs.aws.amazon.com/glue/latest/dg/monitor-glue.html