without proper monitoring. Has the API success rate slightly dropped over time? What if a portion of customers are getting logged out? What if the CDN caches 4xx/5xx response codes for 1 minute due to a misconfig? What if users from Japan or Australia complain about the website home page being slow?
on data to drive decisions; otherwise they will always have to depend exclusively on the opinions and feelings of the most senior engineer around. You don't have software ownership if you don't monitor it.
on top of Kubernetes as easy as possible. It introduces additional resources (CRDs) in Kubernetes: Prometheus, ServiceMonitor, Alertmanager and PrometheusRule. Example:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
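A minimal sketch of such a Prometheus resource; the name, namespace, replica count and serviceMonitorSelector label below are assumptions for illustration, not taken from the original deployment:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s                      # hypothetical name
  namespace: monitoring          # hypothetical namespace
spec:
  replicas: 2                    # HA pair, assumption
  serviceAccountName: prometheus
  serviceMonitorSelector:
    matchLabels:
      team: platform             # hypothetical selector label
  resources:
    requests:
      memory: 400Mi
The operator then deploys Prometheus from this spec and wires in every ServiceMonitor matching the selector.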
Full k8s cluster monitoring. Golden Signals with Ingress and cAdvisor metrics. Thanos for a global view and long-term storage (LTS). The compactor doesn't like empty blocks? OK, we've removed them all from S3. It couldn't compact Istio metrics? OK, we go without the compactor then.
term queries can bring back huge amounts of data, affecting overall store availability. Product engineering services need historical data for data-driven decisions.
metrics with high cardinality or with too many time series from the S3 bucket. If the object store has metrics with high cardinality or too many series and the compactor fails, purging that data might be the only solution.
when the SLI is in danger:
rules:
- alert: TooManyThanosQueriesErrors
  annotations:
    summary: Percentage of gRPC errors is too high on Thanos
    description: 'Thanos Query has too many gRPC query errors'
    runbook_url: https://github.com/hellofresh/runbooks/blob/mas
  expr: |
    sum(thanos_query:grpc_queries_errors:rate5m)
      /
    sum(thanos_query:grpc_queries:rate5m)
    > 0.05
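The alert relies on two recording rules. A minimal sketch of what they could look like, assuming the underlying counter is the standard grpc_client_handled_total exposed by Thanos Query and a hypothetical job label of thanos-query:
groups:
- name: thanos-query.rules
  rules:
  - record: thanos_query:grpc_queries:rate5m
    expr: sum(rate(grpc_client_handled_total{job="thanos-query"}[5m]))
  - record: thanos_query:grpc_queries_errors:rate5m
    expr: sum(rate(grpc_client_handled_total{job="thanos-query", grpc_code!="OK"}[5m]))
Precomputing the rates keeps the alert expression cheap to evaluate on every rule iteration.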
Targets
Find rules that take too long to evaluate: prometheus_rule_group_iterations_missed_total
Find metrics that have the most time series: topk(10, count by (__name__, job) ({__name__=~".+"}))
Find expensive recording rules: prometheus_rule_group_last_duration_seconds
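For example, a hedged way to surface slow rule groups with these metrics, assuming the standard prometheus_rule_group_* gauges and counters are scraped:
# Rule groups that miss iterations (evaluation slower than the interval)
rate(prometheus_rule_group_iterations_missed_total[5m]) > 0

# Rule groups whose last evaluation took more than half of their interval
prometheus_rule_group_last_duration_seconds
  > 0.5 * prometheus_rule_group_interval_seconds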
which introduce high cardinality or that you don't use. Increase scrape_interval and evaluation_interval. sample_limit causes a scrape to fail if more than the given number of time series is returned.
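A minimal scrape_config sketch illustrating both knobs; the job name, target and concrete values are assumptions:
global:
  scrape_interval: 60s        # raised from the usual 15s/30s default to reduce load
  evaluation_interval: 60s

scrape_configs:
  - job_name: my-service      # hypothetical job
    sample_limit: 10000       # the scrape fails if more series than this are returned
    static_configs:
      - targets: ['my-service:9090']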
Eng. services rely on it for SLOs; queries towards it go through both Prom and Thanos.
metric_relabel_configs:
- separator: ;
  regex: (controller_namespace|controller_pod|endpoint|exporte
  replacement: $1
  action: labeldrop
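With the Prometheus Operator, the same label drop can be expressed on the ServiceMonitor instead. A sketch assuming a hypothetical ingress-nginx ServiceMonitor, with the regex shortened for illustration:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ingress-nginx                          # hypothetical name
  namespace: monitoring                        # hypothetical namespace
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx    # hypothetical selector
  endpoints:
  - port: metrics
    metricRelabelings:
    - action: labeldrop
      regex: (controller_namespace|controller_pod|endpoint)
Dropping these labels before ingestion is what actually reduces the series count; relabeling at query time would not help.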
Query Availability
Attack: Thanos Sidecar pod deletion
Scope: multiple pods
Expected results: the rate of good Thanos Querier gRPC queries is not affected; the TooManyQueriesErrors alert should not re…