Slide 1

Slide 1 text

Cost-Effective SLO Error Budget Monitoring with Athena and CloudWatch
2024-08-24 - 2024-08-25
JAWS PANKRATION 2024
https://jawspankration2024.jaws-ug.jp/en/
Takashi Iwamoto
VPoT (Vice President of Technology) at ENECHANGE Ltd.

Slide 2

Slide 2 text

Hi, I Am Takashi Iwamoto
AWS Community Builder (Cloud Operations)
Former Cloud Support Associate at AWS Japan
https://x.com/iwamot
https://www.linkedin.com/in/iwamot/

Slide 3

Slide 3 text

Today, I'll Share My Ideas on SLO Monitoring in AWS

Slide 4

Slide 4 text

The Site Reliability Workbook Is an SRE Must-Read
https://sre.google/workbook/table-of-contents/

Slide 5

Slide 5 text

It Recommends Multiwindow, Multi-Burn-Rate Alerts

expr: (
        job:slo_errors_per_request:ratio_rate1h{job="myjob"} > (14.4*0.001)
      and
        job:slo_errors_per_request:ratio_rate5m{job="myjob"} > (14.4*0.001)
      )
    or
      (
        job:slo_errors_per_request:ratio_rate6h{job="myjob"} > (6*0.001)
      and
        job:slo_errors_per_request:ratio_rate30m{job="myjob"} > (6*0.001)
      )

https://sre.google/workbook/alerting-on-slos/
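The factors 14.4 and 6 in the expression above are burn rates: the share of the error budget consumed in a window, scaled by how many such windows fit in the SLO period. A minimal Python sketch of that arithmetic, assuming a 30-day SLO period and a 99.9% SLO (which matches the 0.001 factor); the function names are illustrative and do not come from the slides:

```python
SLO_PERIOD_HOURS = 30 * 24  # assumed 30-day SLO period
ERROR_BUDGET = 0.001        # 1 - 0.999, the 0.001 factor in the expression


def burn_rate(budget_consumed: float, window_hours: float) -> float:
    """Burn rate that uses up `budget_consumed` of the budget in `window_hours`."""
    return budget_consumed * SLO_PERIOD_HOURS / window_hours


def error_rate_threshold(budget_consumed: float, window_hours: float) -> float:
    """Error-rate threshold for an alert on the given long window."""
    return burn_rate(budget_consumed, window_hours) * ERROR_BUDGET


fast = error_rate_threshold(0.02, 1)  # 2% of the budget in 1 hour -> burn rate 14.4
slow = error_rate_threshold(0.05, 6)  # 5% of the budget in 6 hours -> burn rate 6
print(f"fast: {fast:.2%}, slow: {slow:.2%}")
```

The two results correspond to the 1.44% and 0.6% error-rate thresholds used for the alarms.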

Slide 6

Slide 6 text

An Alert Implicitly Requires 8 Metrics and 4 Alarms

job:slo_errors_per_request:ratio_rate1h{job="myjob"} > (14.4*0.001)
    Metric #1: Bad events in 1 hour
    Metric #2: Total events in 1 hour
    Alarm #1: Metric #1 / Metric #2 > 1.44%
job:slo_errors_per_request:ratio_rate5m{job="myjob"} > (14.4*0.001)
job:slo_errors_per_request:ratio_rate6h{job="myjob"} > (6*0.001)
job:slo_errors_per_request:ratio_rate30m{job="myjob"} > (6*0.001)
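Each of the four conditions needs two metrics (bad and total events) and one alarm on their ratio, which gives 8 metrics and 4 alarms. In CloudWatch, one way to express "Metric #1 / Metric #2 > threshold" is a metric math alarm. The sketch below only builds the keyword arguments for `put_metric_alarm`; the namespace, the 5-minute publishing period, the `Maximum` statistic, and the missing-data handling are assumptions, not details from the slides:

```python
NAMESPACE = "SLO/ALB"  # hypothetical namespace, not from the slides
PUBLISH_PERIOD = 300   # assumes each metric is published every 5 minutes


def error_rate_alarm(window: str, threshold_percent: float) -> dict:
    """Build put_metric_alarm keyword arguments for one time window."""

    def source(metric_id: str, metric_name: str) -> dict:
        # Raw metric used as an input to the expression; not alarmed on directly.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": NAMESPACE,
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "TimeWindow", "Value": window}],
                },
                "Period": PUBLISH_PERIOD,
                "Stat": "Maximum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"error-rate-in-{window}-greater-than-{threshold_percent:g}%",
        "Metrics": [
            source("bad", "BadEvents"),
            source("total", "TotalEvents"),
            {
                "Id": "rate",
                "Expression": "100 * bad / total",  # error rate in percent
                "Label": "Error rate (%)",
                "ReturnData": True,  # the alarm evaluates this expression
            },
        ],
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": threshold_percent,
        "EvaluationPeriods": 1,
        "TreatMissingData": "notBreaching",  # assumption: no traffic is not an outage
    }


alarm = error_rate_alarm("1h", 1.44)
print(alarm["AlarmName"])  # error-rate-in-1h-greater-than-1.44%
# With boto3: boto3.client("cloudwatch").put_metric_alarm(**alarm)
```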

Slide 7

Slide 7 text

Additionally, an Alert Requires 1 Composite Alarm

(
    ALARM("error-rate-in-1h-greater-than-1.44%")      -- Alarm #1
    AND ALARM("error-rate-in-5m-greater-than-1.44%")  -- Alarm #2
)
OR
(
    ALARM("error-rate-in-6h-greater-than-0.6%")       -- Alarm #3
    AND ALARM("error-rate-in-30m-greater-than-0.6%")  -- Alarm #4
)
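The rule above can also be generated from the alarm names, which keeps the long-window and short-window pairs in one place. A small sketch; the helper name and the composite alarm name in the comment are hypothetical, and the `-- Alarm #N` annotations are labels for the reader, so they are left out of the generated string:

```python
def composite_alarm_rule(pairs: list[tuple[str, str]]) -> str:
    """Join (long-window, short-window) alarm-name pairs into one AlarmRule."""
    groups = [f'(ALARM("{long}") AND ALARM("{short}"))' for long, short in pairs]
    return " OR ".join(groups)


rule = composite_alarm_rule([
    ("error-rate-in-1h-greater-than-1.44%", "error-rate-in-5m-greater-than-1.44%"),
    ("error-rate-in-6h-greater-than-0.6%", "error-rate-in-30m-greater-than-0.6%"),
])
print(rule)
# With boto3, the rule could be registered roughly like this:
# boto3.client("cloudwatch").put_composite_alarm(
#     AlarmName="slo-burn-rate",  # hypothetical name
#     AlarmRule=rule,
# )
```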

Slide 8

Slide 8 text

CloudWatch Provides All These Capabilities!

Custom Metrics
Alarms
Composite Alarms: ( and ) or ( and )

Slide 9

Slide 9 text

Plus, CloudWatch Is Cost-Effective

Custom Metrics:   $0.30/month * 8
Alarms:           $0.10/month * 4
Composite Alarms: $0.50/month * 1
Total:            $3.30/month
(based on the us-east-1 region)
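The total can be checked, and scaled to more than one alert, with a few lines of Python. This sketch uses the unit prices quoted on the slide and assumes every alert needs its own 8 metrics, 4 alarms, and 1 composite alarm; Athena and Lambda costs are not included:

```python
from decimal import Decimal

# Monthly unit prices as quoted on the slide (us-east-1)
PRICE = {
    "custom_metric": Decimal("0.30"),
    "alarm": Decimal("0.10"),
    "composite_alarm": Decimal("0.50"),
}
# Resources needed per multiwindow, multi-burn-rate alert
COUNT = {"custom_metric": 8, "alarm": 4, "composite_alarm": 1}


def monthly_cost(alerts: int = 1) -> Decimal:
    """CloudWatch cost in USD per month for the given number of alerts."""
    per_alert = sum(PRICE[kind] * COUNT[kind] for kind in PRICE)
    return per_alert * alerts


print(monthly_cost())   # 3.30
print(monthly_cost(3))  # 9.90
```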

Slide 10

Slide 10 text

Now, Let's Consider How to Monitor ALB

Slide 11

Slide 11 text

We Can Aggregate ALB Access Logs with Athena

SELECT COUNT(request_verb) AS count, request_verb, client_ip
FROM alb_access_logs
GROUP BY request_verb, client_ip
LIMIT 100;

https://docs.aws.amazon.com/athena/latest/ug/query-alb-access-logs-examples.html

Slide 12

Slide 12 text

Athena Allows for Flexible SQL Queries

SELECT
  count(*) AS total_events,
  -- 429 Too Many Requests errors indicate server unavailability
  count_if(elb_status_code >= 500 OR elb_status_code = 429) AS bad_events
FROM alb_access_logs
WHERE request_verb = 'POST'
  AND url_extract_path(request_url) = '/path/to/critical-user-journey'
  AND time BETWEEN '2024-08-24T14:20:00' AND '2024-08-24T14:25:00'
  AND day = '2024/08/24' -- partition key
;
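To run the query above for each time window, the time range and the partition key have to be filled in per execution. A hedged sketch of a query builder: it reuses the table, columns, and path from the slide, while the function name is illustrative and the `day IN (...)` form is a variation introduced here so that windows crossing midnight (UTC is assumed) still hit every partition they touch:

```python
from datetime import datetime, timedelta

QUERY_TEMPLATE = """\
SELECT
  count(*) AS total_events,
  count_if(elb_status_code >= 500 OR elb_status_code = 429) AS bad_events
FROM alb_access_logs
WHERE request_verb = 'POST'
  AND url_extract_path(request_url) = '/path/to/critical-user-journey'
  AND time BETWEEN '{start}' AND '{end}'
  AND day IN ({days})
"""


def build_query(end: datetime, window: timedelta) -> str:
    """Render the aggregation query for the window that ends at `end`."""
    start = end - window
    # Collect every day partition the window touches, formatted as 'YYYY/MM/DD'.
    days = []
    day = start.date()
    while day <= end.date():
        days.append(day.strftime("'%Y/%m/%d'"))
        day += timedelta(days=1)
    return QUERY_TEMPLATE.format(
        start=start.isoformat(timespec="seconds"),
        end=end.isoformat(timespec="seconds"),
        days=", ".join(days),
    )


print(build_query(datetime(2024, 8, 24, 14, 25), timedelta(minutes=5)))
```

The rendered string could then be passed to Athena, for example through the `QueryString` parameter of boto3's `start_query_execution`.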

Slide 13

Slide 13 text

Executing Queries Periodically Can Create 8 Metrics

Dimension: TimeWindow   Metric Name
5m                      BadEvents
5m                      TotalEvents
30m                     BadEvents
30m                     TotalEvents
1h                      BadEvents
1h                      TotalEvents
6h                      BadEvents
6h                      TotalEvents
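The eight metrics in the table map naturally onto the `MetricData` list accepted by CloudWatch's `put_metric_data`. The sketch below only builds that payload; the metric names and the `TimeWindow` dimension come from the slide, whereas the namespace in the comment, the `Count` unit, and the sample numbers are assumptions:

```python
from datetime import datetime, timezone


def metric_data(window: str, bad: int, total: int, timestamp: datetime) -> list[dict]:
    """Build the two MetricData entries (BadEvents, TotalEvents) for one window."""
    dimensions = [{"Name": "TimeWindow", "Value": window}]
    return [
        {
            "MetricName": name,
            "Dimensions": dimensions,
            "Timestamp": timestamp,
            "Value": float(value),
            "Unit": "Count",
        }
        for name, value in (("BadEvents", bad), ("TotalEvents", total))
    ]


# Made-up (bad, total) counts per window, as a query run might return them
counts = {"5m": (1, 400), "30m": (4, 2500), "1h": (9, 5100), "6h": (30, 31000)}
now = datetime(2024, 8, 24, 14, 25, tzinfo=timezone.utc)
payload = [entry for window, (bad, total) in counts.items()
           for entry in metric_data(window, bad, total, now)]
print(len(payload))  # 8
# With boto3: boto3.client("cloudwatch").put_metric_data(
#     Namespace="SLO/ALB",  # hypothetical namespace
#     MetricData=payload,
# )
```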

Slide 14

Slide 14 text

EventBridge and Lambda Are a Good Fit for These Jobs

Slide 15

Slide 15 text

Option: Leverage RDS to Reduce Athena Costs

TimeWindow   Aggregate with   Put Metric Data to
5m           Athena           CloudWatch, RDS
30m          RDS              CloudWatch
1h           RDS              CloudWatch
6h           RDS              CloudWatch
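The idea in the table is that only the 5-minute aggregate scans the access logs; the longer windows are sums over stored 5-minute rows. A minimal sketch of that roll-up, using an in-memory SQLite database as a stand-in for RDS; the table name `slo_5m`, its columns, and the sample data are hypothetical:

```python
import sqlite3
from datetime import datetime, timedelta


def rollup(conn: sqlite3.Connection, end: datetime, window: timedelta) -> tuple[int, int]:
    """Sum stored 5-minute aggregates whose window ends in (end - window, end]."""
    start = end - window
    row = conn.execute(
        "SELECT COALESCE(SUM(bad_events), 0), COALESCE(SUM(total_events), 0) "
        "FROM slo_5m WHERE window_end > ? AND window_end <= ?",
        (start.isoformat(), end.isoformat()),
    ).fetchone()
    return row[0], row[1]


conn = sqlite3.connect(":memory:")  # stand-in for the RDS database
conn.execute(
    "CREATE TABLE slo_5m ("
    "window_end TEXT PRIMARY KEY, bad_events INTEGER, total_events INTEGER)"
)
end = datetime(2024, 8, 24, 15, 0)
# Twelve 5-minute rows covering 14:00-15:00, each with 1 bad event out of 100
rows = [((end - timedelta(minutes=5 * i)).isoformat(), 1, 100) for i in range(12)]
conn.executemany("INSERT INTO slo_5m VALUES (?, ?, ?)", rows)

print(rollup(conn, end, timedelta(minutes=30)))  # (6, 600)
print(rollup(conn, end, timedelta(hours=1)))     # (12, 1200)
```

Storing the timestamps as ISO 8601 strings keeps the range comparison correct under plain string ordering; a real RDS schema would more likely use a timestamp column.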

Slide 16

Slide 16 text

Alarm Setup Is Simple, So I'll Skip the Details Here

Slide 17

Slide 17 text

For Notifications, Use Alarms with Chatbot

Slide 18

Slide 18 text

For Visualization, We Have CloudWatch Dashboards

Dimension: TimeWindow   Metric Name
30d                     BadEvents
30d                     TotalEvents
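From the two 30-day metrics, the remaining error budget can be derived with the usual definition: the budget is the number of bad events the SLO allows, and the remaining share is one minus the fraction already used. A short sketch, assuming a 99.9% SLO; the function name and sample numbers are illustrative:

```python
def error_budget_remaining(bad_events: int, total_events: int, slo: float = 0.999) -> float:
    """Fraction of the error budget left; negative once the budget is overspent."""
    if total_events == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_bad = (1 - slo) * total_events
    return 1 - bad_events / allowed_bad


# Example: 250 bad events out of 1,000,000 in 30 days against a 99.9% SLO
print(f"{error_budget_remaining(250, 1_000_000):.1%}")  # 75.0%
```

On a CloudWatch dashboard, the same ratio could be drawn with a metric math expression over the 30d BadEvents and TotalEvents metrics.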

Slide 19

Slide 19 text

These Are My Frugal Monitoring Ideas. Thank You!

CloudWatch provides all SLO monitoring capabilities and is cost-effective
Athena can aggregate ALB access logs with flexible SQL queries
EventBridge and Lambda are a good fit for executing queries periodically