iwamot
August 24, 2024

Cost-Effective SLO Error Budget Monitoring with Athena and CloudWatch

2024-08-24 - 2024-08-25
JAWS PANKRATION 2024
https://jawspankration2024.jaws-ug.jp/en/

Transcript

  1. Cost-Effective SLO Error Budget Monitoring with Athena and CloudWatch

    2024-08-24 - 2024-08-25
    JAWS PANKRATION 2024
    https://jawspankration2024.jaws-ug.jp/en/
    Takashi Iwamoto
    VPoT (Vice President of Technology) at ENECHANGE Ltd.
  2. Hi, I Am Takashi Iwamoto

    AWS Community Builder (Cloud Operations)
    Former Cloud Support Associate at AWS Japan
    https://x.com/iwamot
    https://www.linkedin.com/in/iwamot/
  3. It Recommends Multiwindow, Multi-Burn-Rate Alerts

    expr: (
            job:slo_errors_per_request:ratio_rate1h{job="myjob"} > (14.4*0.001)
          and
            job:slo_errors_per_request:ratio_rate5m{job="myjob"} > (14.4*0.001)
          )
        or
          (
            job:slo_errors_per_request:ratio_rate6h{job="myjob"} > (6*0.001)
          and
            job:slo_errors_per_request:ratio_rate30m{job="myjob"} > (6*0.001)
          )

    https://sre.google/workbook/alerting-on-slos/
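
The thresholds in the expression above are a burn rate multiplied by the error budget (0.001, i.e. a 99.9% SLO). A minimal sketch of that arithmetic and of the two-window condition follows; the function names and the dictionary keys are illustrative assumptions, not part of the talk.

```python
def burn_rate_threshold(burn_rate: float, error_budget: float = 0.001) -> float:
    """Error-ratio threshold for a burn rate, e.g. 14.4 * 0.001 = 1.44%."""
    return burn_rate * error_budget


def should_alert(error_ratio: dict, error_budget: float = 0.001) -> bool:
    """Evaluate the multiwindow, multi-burn-rate condition from the slide.

    error_ratio maps a window label ("5m", "30m", "1h", "6h") to the
    observed bad/total ratio over that window (hypothetical input shape).
    """
    fast = burn_rate_threshold(14.4, error_budget)  # 1h and 5m windows
    slow = burn_rate_threshold(6.0, error_budget)   # 6h and 30m windows
    return (
        (error_ratio["1h"] > fast and error_ratio["5m"] > fast)
        or (error_ratio["6h"] > slow and error_ratio["30m"] > slow)
    )
```

The short window in each pair keeps the alert from firing long after the errors have stopped, which is why both windows must exceed the same threshold.
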
  4. An Alert Implicitly Requires 8 Metrics and 4 Alarms

    job:slo_errors_per_request:ratio_rate1h{job="myjob"} > (14.4*0.001)
        Metric #1: Bad events in 1 hour
        Metric #2: Total events in 1 hour
        Alarm #1: Metric #1 / Metric #2 > 1.44%
    job:slo_errors_per_request:ratio_rate5m{job="myjob"} > (14.4*0.001)
    job:slo_errors_per_request:ratio_rate6h{job="myjob"} > (6*0.001)
    job:slo_errors_per_request:ratio_rate30m{job="myjob"} > (6*0.001)
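
One way to express "Metric #1 / Metric #2 > 1.44%" is a CloudWatch metric math alarm. The sketch below only builds the request parameters; it assumes the BadEvents and TotalEvents metric names and the TimeWindow dimension that appear later in this deck, while the namespace, the statistic, and the 5-minute period are hypothetical choices.

```python
def error_rate_alarm_params(window, threshold, period_seconds=300,
                            namespace="SLO/MyService"):
    """Build keyword arguments for a metric math alarm on bad / total.

    namespace, period_seconds and the "Maximum" statistic are assumptions
    made for illustration; adjust them to how the metrics are published.
    """
    def stat(metric_id, metric_name):
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": namespace,
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "TimeWindow", "Value": window}],
                },
                "Period": period_seconds,
                "Stat": "Maximum",
            },
            "ReturnData": False,  # inputs to the expression only
        }

    return {
        "AlarmName": f"error-rate-in-{window}-greater-than-{threshold * 100:g}%",
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": threshold,
        "EvaluationPeriods": 1,
        "Metrics": [
            stat("bad", "BadEvents"),
            stat("total", "TotalEvents"),
            {"Id": "ratio", "Expression": "bad / total",
             "Label": "ErrorRate", "ReturnData": True},
        ],
    }

# With boto3 installed, the result could be passed on as:
#   boto3.client("cloudwatch").put_metric_alarm(**error_rate_alarm_params("1h", 0.0144))
```

Calling the function once per window (1h, 5m, 6h, 30m) gives the four alarms, each reading two of the eight metrics.
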
  5. Additionally, an Alert Requires 1 Composite Alarm

    (
      ALARM("error-rate-in-1h-greater-than-1.44%")    -- Alarm #1
      AND
      ALARM("error-rate-in-5m-greater-than-1.44%")    -- Alarm #2
    )
    OR
    (
      ALARM("error-rate-in-6h-greater-than-0.6%")     -- Alarm #3
      AND
      ALARM("error-rate-in-30m-greater-than-0.6%")    -- Alarm #4
    )
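
The rule above can be generated from the alarm names rather than typed by hand. This is a small sketch with a hypothetical helper name; the "-- Alarm #N" annotations on the slide are labels for the reader and are left out of the generated rule.

```python
def composite_alarm_rule(window_pairs):
    """Join (long-window alarm, short-window alarm) pairs into one rule.

    Each pair becomes `(ALARM("a") AND ALARM("b"))`; pairs are OR-ed.
    """
    groups = [
        f'(ALARM("{long_alarm}") AND ALARM("{short_alarm}"))'
        for long_alarm, short_alarm in window_pairs
    ]
    return " OR ".join(groups)


rule = composite_alarm_rule([
    ("error-rate-in-1h-greater-than-1.44%", "error-rate-in-5m-greater-than-1.44%"),
    ("error-rate-in-6h-greater-than-0.6%", "error-rate-in-30m-greater-than-0.6%"),
])

# With boto3 installed, the rule could be registered via:
#   boto3.client("cloudwatch").put_composite_alarm(AlarmName="...", AlarmRule=rule)
```
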
  6. Plus, CloudWatch Is Cost-Effective

    Custom Metrics:    $0.30/month * 8
    Alarms:            $0.10/month * 4
    Composite Alarms:  $0.50/month * 1
    Total:             $3.30/month
    (based on us-east-1 region)
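
The total can be checked, and scaled to more than one SLO, with a few lines of arithmetic. The unit prices are the ones quoted on the slide; the function and key names are illustrative, and only the CloudWatch items listed above are counted.

```python
from decimal import Decimal

# Monthly unit prices as quoted on the slide (us-east-1).
UNIT_PRICE = {
    "custom_metric": Decimal("0.30"),
    "alarm": Decimal("0.10"),
    "composite_alarm": Decimal("0.50"),
}

# Resources needed for one multiwindow, multi-burn-rate alert.
PER_ALERT = {"custom_metric": 8, "alarm": 4, "composite_alarm": 1}


def monthly_cost(alert_count: int = 1) -> Decimal:
    """CloudWatch cost per month for the given number of alerts."""
    per_alert = sum(
        (UNIT_PRICE[item] * count for item, count in PER_ALERT.items()),
        Decimal("0"),
    )
    return per_alert * alert_count
```
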
  7. We Can Aggregate ALB Access Logs with Athena

    SELECT COUNT(request_verb) AS count,
           request_verb,
           client_ip
    FROM alb_access_logs
    GROUP BY request_verb, client_ip
    LIMIT 100;

    https://docs.aws.amazon.com/athena/latest/ug/query-alb-access-logs-examples.html
  8. Athena Allows for Flexible SQL Queries

    SELECT count(*) AS total_events,
           -- 429 Too Many Requests errors indicate server unavailability
           count_if(elb_status_code >= 500 OR elb_status_code = 429) AS bad_events
    FROM alb_access_logs
    WHERE request_verb = 'POST'
      AND url_extract_path(request_url) = '/path/to/critical-user-journey'
      AND time BETWEEN '2024-08-24T14:20:00' AND '2024-08-24T14:25:00'
      AND day = '2024/08/24' -- partition key
    ;
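
To run the query above on a schedule, the time range and the partition key have to be filled in for each window. The sketch below is one way to do that, assuming the same table, columns, and `day` partition format as the slide; the function name is hypothetical, and using `day IN (...)` for windows that cross midnight is an addition of this sketch rather than something shown in the talk.

```python
from datetime import datetime, timedelta

QUERY_TEMPLATE = """\
SELECT count(*) AS total_events,
       count_if(elb_status_code >= 500 OR elb_status_code = 429) AS bad_events
FROM alb_access_logs
WHERE request_verb = 'POST'
  AND url_extract_path(request_url) = '/path/to/critical-user-journey'
  AND time BETWEEN '{start}' AND '{end}'
  AND day IN ({days})
"""


def build_query(window_end: datetime, window: timedelta) -> str:
    """Render the SLO query for the window that ends at window_end."""
    window_start = window_end - window

    # Collect every day partition the window touches (a 6h window may span two).
    days = []
    day = window_start.date()
    while day <= window_end.date():
        days.append(day.strftime("%Y/%m/%d"))
        day += timedelta(days=1)

    return QUERY_TEMPLATE.format(
        start=window_start.strftime("%Y-%m-%dT%H:%M:%S"),
        end=window_end.strftime("%Y-%m-%dT%H:%M:%S"),
        days=", ".join(f"'{d}'" for d in days),
    )
```

The rendered string could then be submitted with the Athena `start_query_execution` API; that call is left out here so the sketch runs without AWS credentials.
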
  9. Executing Queries Periodically Can Create 8 Metrics

    Dimension: TimeWindow    Metric Name
    5m                       BadEvents
    5m                       TotalEvents
    30m                      BadEvents
    30m                      TotalEvents
    1h                       BadEvents
    1h                       TotalEvents
    6h                       BadEvents
    6h                       TotalEvents
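
Each query result maps to two data points that share a TimeWindow dimension value. A minimal sketch of the PutMetricData payload follows; the metric names and dimension come from the table above, while the function name, the `Count` unit, and the namespace in the closing comment are assumptions.

```python
def build_metric_data(window: str, bad_events: int, total_events: int) -> list:
    """Build the MetricData entries for one time window's query result."""
    dimensions = [{"Name": "TimeWindow", "Value": window}]
    return [
        {
            "MetricName": "BadEvents",
            "Dimensions": dimensions,
            "Value": float(bad_events),
            "Unit": "Count",
        },
        {
            "MetricName": "TotalEvents",
            "Dimensions": dimensions,
            "Value": float(total_events),
            "Unit": "Count",
        },
    ]

# With boto3 installed, the entries could be published via:
#   boto3.client("cloudwatch").put_metric_data(
#       Namespace="SLO/MyService",  # hypothetical namespace
#       MetricData=build_metric_data("5m", 3, 1200),
#   )
```
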
  10. Option: Leverage RDS to Reduce Athena Costs

    TimeWindow    Aggregate with    Put Metric Data to
    5m            Athena            CloudWatch, RDS
    30m           RDS               CloudWatch
    1h            RDS               CloudWatch
    6h            RDS               CloudWatch
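
The idea in the table above is that only the 5-minute counts come from Athena; the longer windows are sums of stored 5-minute rows. The sketch below illustrates that roll-up with an in-memory SQLite database standing in for RDS. The table name, column names, and helper functions are all hypothetical, and `INSERT OR REPLACE` is SQLite syntax that would differ on other engines.

```python
import sqlite3
from datetime import datetime, timedelta


def open_store() -> sqlite3.Connection:
    """Create the stand-in store for 5-minute aggregates."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE five_minute_counts ("
        " window_end TEXT PRIMARY KEY,"
        " bad_events INTEGER NOT NULL,"
        " total_events INTEGER NOT NULL)"
    )
    return conn


def store_5m(conn, window_end: datetime, bad: int, total: int) -> None:
    """Save one 5-minute result (the only step that needs Athena)."""
    conn.execute(
        "INSERT OR REPLACE INTO five_minute_counts VALUES (?, ?, ?)",
        (window_end.isoformat(), bad, total),
    )


def rollup(conn, window_end: datetime, window: timedelta):
    """Sum stored 5-minute rows whose end falls in (window_end - window, window_end]."""
    start = window_end - window
    row = conn.execute(
        "SELECT COALESCE(SUM(bad_events), 0), COALESCE(SUM(total_events), 0)"
        " FROM five_minute_counts"
        " WHERE window_end > ? AND window_end <= ?",
        (start.isoformat(), window_end.isoformat()),
    ).fetchone()
    return row[0], row[1]
```

Because the timestamps are stored in a fixed-width ISO 8601 form, comparing them as strings gives the same order as comparing them as times.
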
  11. These Are My Frugal Monitoring Ideas. Thank You!

    CloudWatch provides all SLO monitoring capabilities and is cost-effective
    Athena can aggregate ALB access logs with flexible SQL queries
    EventBridge and Lambda are fit to execute queries periodically