Getting Started with Low-Cost SLO Error Budget Monitoring Using Athena and CloudWatch

iwamot
June 23, 2023

2023-06-24
Reject Day 2023
https://connpass.com/event/282843/

Transcript

  1. Monitoring multiple burn rates is recommended

    expr: (
            job:slo_errors_per_request:ratio_rate1h{job="myjob"} > (14.4*0.001)
          and
            job:slo_errors_per_request:ratio_rate5m{job="myjob"} > (14.4*0.001)
          )
        or
          (
            job:slo_errors_per_request:ratio_rate6h{job="myjob"} > (6*0.001)
          and
            job:slo_errors_per_request:ratio_rate30m{job="myjob"} > (6*0.001)
          )
    severity: page
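
The paging condition on this slide can be sketched in Python to make the logic explicit. This is a minimal illustration, not the author's implementation: the function name `should_page`, the `RULES` table, and the window keys are hypothetical, and the error budget of 0.001 is taken from the expression above.

```python
from typing import Mapping

# Error budget taken from the slide's expression (0.001).
ERROR_BUDGET = 0.001

# Hypothetical rule table: (long window, short window, burn-rate factor).
RULES = [
    ("1h", "5m", 14.4),
    ("6h", "30m", 6.0),
]


def should_page(error_ratio: Mapping[str, float]) -> bool:
    """Return True when any rule has BOTH its long and short window above the threshold."""
    return any(
        error_ratio[long_w] > factor * ERROR_BUDGET
        and error_ratio[short_w] > factor * ERROR_BUDGET
        for long_w, short_w, factor in RULES
    )
```

Requiring the short window as well as the long one means the condition stops being true soon after the error rate recovers.
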
  2. Implementing the condition as a composite alarm

    expr: (
            job:slo_errors_per_request:ratio_rate1h{job="myjob"} > (14.4*0.001)
          and
            job:slo_errors_per_request:ratio_rate5m{job="myjob"} > (14.4*0.001)
          )
        or
          (
            job:slo_errors_per_request:ratio_rate6h{job="myjob"} > (6*0.001)
          and
            job:slo_errors_per_request:ratio_rate30m{job="myjob"} > (6*0.001)
          )
    severity: page
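
One way to express the same AND/OR structure as a CloudWatch composite alarm rule is sketched below. It assumes four ordinary metric alarms already exist, one per window; the alarm names are hypothetical and do not come from the deck. The code only builds the rule string and makes no AWS calls.

```python
from typing import Iterable, Tuple


def alarm(name: str) -> str:
    """Render a reference to a child alarm in composite alarm rule syntax."""
    return f'ALARM("{name}")'


def composite_rule(pairs: Iterable[Tuple[str, str]]) -> str:
    """OR together (long window AND short window) alarm pairs."""
    return " OR ".join(f"({alarm(a)} AND {alarm(b)})" for a, b in pairs)


# Hypothetical child alarm names, one per burn-rate window.
RULE = composite_rule([
    ("purchase-availability-1h", "purchase-availability-5m"),
    ("purchase-availability-6h", "purchase-availability-30m"),
])
```

The resulting string could then be passed as the `AlarmRule` argument of boto3's `put_composite_alarm`.
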
  3. Designing the custom metrics

    EnvironmentName  CriticalUserJourneyName  Category      TimeWindow  MetricName
    prod-example     purchase                 Availability  5m          TotalCount
    prod-example     purchase                 Availability  5m          BadCount
    prod-example     purchase                 Availability  30m         TotalCount
    prod-example     purchase                 Availability  30m         BadCount
    prod-example     purchase                 Availability  1h          TotalCount
    prod-example     purchase                 Availability  1h          BadCount
    prod-example     purchase                 Availability  6h          TotalCount
    prod-example     purchase                 Availability  6h          BadCount
    prod-example     purchase                 Availability  ...         ...
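
As a rough illustration of how one row pair in this design maps onto CloudWatch metric data, the sketch below builds the `TotalCount` and `BadCount` entries for a single time window. The helper name `build_metric_data` and the sample values are hypothetical; the dictionary layout follows the `MetricData` shape accepted by boto3's `put_metric_data`.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def build_metric_data(
    environment: str,
    journey: str,
    category: str,
    window: str,
    total: int,
    bad: int,
    timestamp: datetime,
) -> List[Dict[str, Any]]:
    """Build TotalCount/BadCount entries that share one set of dimensions."""
    dimensions = [
        {"Name": "EnvironmentName", "Value": environment},
        {"Name": "CriticalUserJourneyName", "Value": journey},
        {"Name": "Category", "Value": category},
        {"Name": "TimeWindow", "Value": window},
    ]
    return [
        {
            "MetricName": name,
            "Dimensions": dimensions,
            "Timestamp": timestamp,
            "Value": float(value),
            "Unit": "Count",
        }
        for name, value in (("TotalCount", total), ("BadCount", bad))
    ]


# Sample values are illustrative only.
SAMPLE = build_metric_data(
    "prod-example", "purchase", "Availability", "5m",
    total=35, bad=0,
    timestamp=datetime(2023, 5, 21, 11, 35, tzinfo=timezone.utc),
)
```

The list could be sent with `put_metric_data(Namespace=..., MetricData=SAMPLE)`, where the namespace is whatever the team chooses.
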
  4. Designing the custom metrics

    EnvironmentName  CriticalUserJourneyName  Category  Percentile  TimeWindow  MetricName
    prod-example     purchase                 Latency   -           5m          TotalCount
    prod-example     purchase                 Latency   95          5m          BadCount
    prod-example     purchase                 Latency   50          5m          BadCount
    prod-example     purchase                 Latency   -           30m         TotalCount
    prod-example     purchase                 Latency   95          30m         BadCount
    prod-example     purchase                 Latency   50          30m         BadCount
    prod-example     purchase                 Latency   ...         ...         ...
  5. Aggregating ALB logs with Athena

    WITH params AS (
      SELECT
        ? param_data_point_time,
        ? param_http_method,
        ? param_path,
        ? param_latency_p50_threshold,
        ? param_latency_p95_threshold
    )
    SELECT
      count(*),
      count_if(is_bad_for_availability),
      count_if(NOT is_bad_for_availability AND latency > 0),
      count_if(NOT is_bad_for_availability AND latency > param_latency_p50_threshold),
      count_if(NOT is_bad_for_availability AND latency > param_latency_p95_threshold)
    FROM (
      SELECT
        (elb_status_code >= 500 OR elb_status_code = 429) is_bad_for_availability,
        (request_processing_time + target_processing_time + response_processing_time) latency
      FROM alb_logs_table, params
      WHERE day >= ?
        AND request_verb = param_http_method
        AND regexp_like(url_extract_path(request_url), concat(concat('^', param_path), '$'))
        AND date_format(
              from_unixtime(CAST(floor(to_unixtime(from_iso8601_timestamp(time))) AS int) / 300 * 300 + 300),
              '%Y-%m-%d %H:%i:%s'
            ) = param_data_point_time
    ) t, params
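
The `/ 300 * 300 + 300` arithmetic in the query assigns each log line to the end of its 5-minute bucket. The Python sketch below mirrors that calculation so the boundary behavior is easy to check; the function name `data_point_time` is hypothetical, and the input is assumed to be an ISO 8601 UTC timestamp with a trailing `Z`, as in the ALB log `time` field.

```python
import math
from datetime import datetime, timezone

BUCKET_SECONDS = 300  # 5-minute buckets, as in the query


def data_point_time(log_time_iso: str) -> str:
    """Map a log timestamp to the end of its 5-minute bucket (UTC)."""
    # Replace the trailing "Z" so older Python versions can parse the value.
    ts = datetime.fromisoformat(log_time_iso.replace("Z", "+00:00"))
    epoch = math.floor(ts.timestamp())
    bucket_end = epoch // BUCKET_SECONDS * BUCKET_SECONDS + BUCKET_SECONDS
    return datetime.fromtimestamp(bucket_end, tz=timezone.utc).strftime(
        "%Y-%m-%d %H:%M:%S"
    )
```

Note that a request logged exactly on a boundary, such as 11:35:00, falls into the next data point (11:40:00), because 300 seconds are always added after rounding down.
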
  6. Re-aggregating ALB logs in RDS

    critical_user_journey_name  data_point_time
    purchase                    2023-05-21 11:35:00

    total_count_for_availability  bad_count_for_availability
    35                            0

    total_count_for_latency  bad_count_for_latency_p50  bad_count_for_latency_p95
    35                       15                         1

    http_method  path     latency_p50_threshold  latency_p95_threshold  month
    POST         /orders  0.8                    2.0                    2023/05
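
To show how the latency counts in this sample row might be read, the sketch below computes the share of slow requests per percentile. It rests on an assumption not stated in the deck: that a pN latency objective allows at most (100 - N)% of requests to exceed the corresponding threshold. The function names are hypothetical.

```python
from typing import Mapping

# Latency counts copied from the sample row on the slide.
ROW = {
    "total_count_for_latency": 35,
    "bad_count_for_latency_p50": 15,
    "bad_count_for_latency_p95": 1,
}


def latency_bad_ratio(row: Mapping[str, int], percentile: int) -> float:
    """Fraction of counted requests slower than the percentile's threshold."""
    total = row["total_count_for_latency"]
    if total == 0:
        return 0.0
    return row[f"bad_count_for_latency_p{percentile}"] / total


def within_target(row: Mapping[str, int], percentile: int) -> bool:
    """Assumption: a pN objective tolerates up to (100 - N)% slow requests."""
    allowed = 1 - percentile / 100
    return latency_bad_ratio(row, percentile) <= allowed
```

Under that assumption, 15 of 35 requests exceeding the p50 threshold stays below the 50% allowance, and 1 of 35 exceeding the p95 threshold stays below the 5% allowance.
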
  7. Re-aggregating ALB logs in RDS

    WITH params AS (
      SELECT
        TO_TIMESTAMP(%s, 'YYYY-MM-DD HH24:MI:SS') AT TIME ZONE 'UTC' AS param_to_data_point_timestamp,
        %s AS param_critical_user_journey_name
    ),
    durations AS (
      SELECT INTERVAL '30 minutes' AS duration
      UNION ALL SELECT INTERVAL '1 hour'
      UNION ALL SELECT INTERVAL '6 hours'
      UNION ALL SELECT INTERVAL '3 days'
      UNION ALL SELECT INTERVAL '28 days'
    )
    SELECT
      SUM(c.total_count_for_availability),
      SUM(c.bad_count_for_availability),
      SUM(c.total_count_for_latency),
      SUM(c.bad_count_for_latency_p50),
      SUM(c.bad_count_for_latency_p95)
    FROM counts_table c, params p, durations d
    WHERE c.critical_user_journey_name = p.param_critical_user_journey_name
      AND c.data_point_time <= p.param_to_data_point_timestamp
      AND c.data_point_time > p.param_to_data_point_timestamp - d.duration
    GROUP BY d.duration
    ORDER BY d.duration
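
The windowing in this query, which sums every 5-minute row whose `data_point_time` lies in the half-open interval (to - duration, to], can be sketched in plain Python. The `rollup` function, the window labels, and the sample points are hypothetical, and only two of the five count columns are carried to keep the example short.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple

# Same durations as the query's `durations` CTE.
WINDOWS = {
    "30m": timedelta(minutes=30),
    "1h": timedelta(hours=1),
    "6h": timedelta(hours=6),
    "3d": timedelta(days=3),
    "28d": timedelta(days=28),
}

Point = Tuple[datetime, int, int]  # (data_point_time, total_count, bad_count)


def rollup(points: Iterable[Point], to_time: datetime) -> Dict[str, Tuple[int, int]]:
    """Sum counts per window over (to_time - duration, to_time]."""
    points = list(points)
    result = {}
    for label, duration in WINDOWS.items():
        total = bad = 0
        for t, total_count, bad_count in points:
            if to_time - duration < t <= to_time:
                total += total_count
                bad += bad_count
        result[label] = (total, bad)
    return result


# Illustrative data points, not taken from the deck.
SAMPLE_POINTS = [
    (datetime(2023, 5, 21, 11, 35), 35, 0),
    (datetime(2023, 5, 21, 11, 5), 10, 1),
    (datetime(2023, 5, 21, 10, 35), 5, 2),
]
ROLLED = rollup(SAMPLE_POINTS, datetime(2023, 5, 21, 11, 35))
```

One difference from the SQL: a window with no matching rows yields `(0, 0)` here, whereas the `GROUP BY` in the query returns no row for it.
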