leak; a high process crash (panic/OOM) rate is clearly abnormal for the system, and that information should be shown to people debugging issues, even though it is one of many potential causes rather than a symptom of user pain. It should therefore be a diagnostic message rather than a paging alert. And rather than crashing outright and losing internal telemetry, we will consider applying backpressure (returning an unhealthy status code) to incoming requests when we are resource constrained, so that the telemetry carrying that vital data keeps flowing.
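As a rough illustration of the backpressure idea, the sketch below rejects new requests with an unhealthy status code once memory utilization crosses a threshold, instead of letting the process run into an OOM crash. The function name `admission_status`, the 90% threshold, and the choice of HTTP 503 are assumptions made for this example, not details from the original design.

```python
HTTP_OK = 200
HTTP_SERVICE_UNAVAILABLE = 503  # assumed "unhealthy" status code for this sketch


def admission_status(used_bytes: int, limit_bytes: int, shed_threshold: float = 0.9) -> int:
    """Decide whether to accept a request or push back on the caller.

    used_bytes / limit_bytes is the current memory utilization. Once it
    reaches shed_threshold (hypothetical default: 90%), new requests are
    rejected so the process stays alive and keeps emitting telemetry.
    """
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    utilization = used_bytes / limit_bytes
    if utilization >= shed_threshold:
        return HTTP_SERVICE_UNAVAILABLE
    return HTTP_OK
```

A request handler would call `admission_status` before doing any work and return early on 503. Well-behaved callers can then retry or back off, and the rejection itself becomes a signal that shows up in telemetry.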
longer time window. See the table below:

Severity   Long Window   Short Window   Burn Rate Factor   Error Budget Consumed
P1         1h            5m             14.4               2%
P2         6h            30m            6                  5%

Tickets for all the rest.
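The figures in the table are consistent with a 30-day SLO period: burning budget at 14.4 times the sustainable rate for 1 hour consumes 14.4 × 1 / 720 = 2% of the budget, and a factor of 6 over 6 hours consumes 6 × 6 / 720 = 5%. The sketch below reproduces that arithmetic and shows one way a two-window check might be evaluated. The 30-day period, the function names, and the rule that both windows must exceed the threshold are assumptions for this example.

```python
SLO_PERIOD_HOURS = 30 * 24  # assumption: 30-day SLO period (720 hours)


def budget_consumed(burn_rate: float, long_window_hours: float,
                    period_hours: float = SLO_PERIOD_HOURS) -> float:
    """Fraction of the error budget spent if burn_rate holds for the long window."""
    return burn_rate * long_window_hours / period_hours


def should_alert(long_error_rate: float, short_error_rate: float,
                 slo_target: float, burn_rate: float) -> bool:
    """Fire only if both windows exceed burn_rate times the budgeted error rate.

    The long window shows that a meaningful share of budget is gone; the
    short window shows that the problem is still happening right now.
    """
    threshold = burn_rate * (1.0 - slo_target)
    return long_error_rate > threshold and short_error_rate > threshold


# P1 row: factor 14.4 over 1h; P2 row: factor 6 over 6h.
p1_budget = budget_consumed(14.4, 1)
p2_budget = budget_consumed(6, 6)
```

With a 99.9% SLO, the P1 threshold is an error rate of about 1.44%: `should_alert(0.02, 0.02, 0.999, 14.4)` fires, while `should_alert(0.02, 0.001, 0.999, 14.4)` does not, because the short window shows the errors have already subsided.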
alerts incurs too much toil. Stick with a few alerts, and alert on SLOs. Guidelines don't scale; invest in tooling instead. Leverage automation to make a product out of alerts.
1h
group_wait => bundle alerts for first notification
group_interval => send notification for new or resolved alerts
repeat_interval => remind users that alerts are still firing