Typical ops alerts: DB query too slow; CPU/Memory too high; health check failed; notifications sent to private Slack channels. These alerts ring all the time, and when they aren't ringing, people assume something is wrong.
Track response time over 5m and response error ratios over 5m.
SLO -> Objective: an availability goal, e.g. 99% of responses faster than 500ms and 99% availability during 4 weeks, i.e. an error rate lower than 1%. Alert when your service level objective (SLO) is affected.
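A minimal sketch of recording those two signals, assuming a hypothetical http_requests_total counter and http_request_duration_seconds histogram (both with a code label) scraped by Prometheus:

groups:
  - name: my-app-slo-signals
    rules:
      # Error ratio: share of 5xx responses over the last 5 minutes, per job.
      - record: job:http_request_error_ratio:rate5m
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m])) by (job)
            /
          sum(rate(http_requests_total[5m])) by (job)
      # p95 latency of good (non-5xx) requests over the last 5 minutes, per job.
      - record: job:http_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{code!~"5.."}[5m])) by (job, le))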
Description -> Alert on the proportion of traffic affected, i.e. 5% of requests are failing as measured over the last 5 minutes. Avoid alerting on static thresholds (e.g. > 5 errors seen) and on small periods such as 1 min; they are too noisy.
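A sketch of such an alert, built on the hypothetical job:http_request_error_ratio:rate5m recording rule above:

- alert: MyAppHighErrorRatio
  # Fires when more than 5% of requests fail, measured over the last 5 minutes.
  expr: job:http_request_error_ratio:rate5m{job="my-app"} > 0.05
  for: 5m
  labels:
    severity: page
  annotations:
    description: "{{ $value | humanizePercentage }} of requests to my-app are failing (last 5m)."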
Description -> Alert on the 95th percentile latency of good (non-5xx) user requests, e.g. when it exceeds 10s as measured over the last 5 minutes. Avoid including 5xx responses (they're typically slow), alerting on 99th percentiles, and using small periods such as 1 min; they are too noisy.
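A sketch of the latency alert, using the hypothetical p95 recording rule above; the 10s threshold is only an illustration:

- alert: MyAppSlowGoodRequests
  # p95 latency of good (non-5xx) requests over the last 5 minutes.
  expr: job:http_request_duration_seconds:p95_5m{job="my-app"} > 10
  for: 5m
  labels:
    severity: page
  annotations:
    description: "95th percentile latency of good requests to my-app has been above 10s for the last 5 minutes."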
Dashboards to investigate the issue. How to mitigate the issue, with bash scripts and so on. Hint: if there's neither a clear action nor dashboards to investigate the alert, just remove the alert.
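One way to attach that context directly to an alert, with hypothetical dashboard and runbook URLs, is through annotations:

annotations:
  description: "More than 5% of requests to my-app are failing (last 5m)."
  dashboard: "https://grafana.example.com/d/my-app-overview"
  runbook: "https://wiki.example.com/runbooks/my-app-errors"  # investigation steps, mitigation scripts, etc.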
-> ingress:api_v2_http_5xx_increased:rate15m > 0.01 Description -> The percentage of HTTP 5xx status codes has increased on My App. Avoid alerting on small periods such as 1-10 min; they are too noisy for this kind of alert.
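A sketch of how the recording rule behind that expression could be defined, assuming the nginx_ingress_controller_requests counter exposed by ingress-nginx (your metric name and labels may differ):

groups:
  - name: ingress
    rules:
      # Assumed definition: ratio of 5xx responses on the api-v2 ingress over the last 15 minutes.
      - record: ingress:api_v2_http_5xx_increased:rate15m
        expr: |
          sum(rate(nginx_ingress_controller_requests{ingress="api-v2", status=~"5.."}[15m]))
            /
          sum(rate(nginx_ingress_controller_requests{ingress="api-v2"}[15m]))
      - alert: MyAppIngress5xxIncreased
        expr: ingress:api_v2_http_5xx_increased:rate15m > 0.01
        for: 15m
        labels:
          severity: warning
        annotations:
          description: "Percentage of HTTP 5xx responses on My App is above 1% over the last 15 minutes."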