How we monitor
Monitoring Stack
Prometheus Operator: we run on k8s
Thanos, Grafana
Legacy
Graphite & StatsD
Prometheus on EC2
Deployment
Prometheus Operator helm chart deploys:
Prometheus, Alertmanager, Grafana servers
Host node_exporter
kube-state-metrics
Default set of Alerts and Dashboards
The helm chart IS NOT kube-prometheus
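A minimal install sketch, assuming the upstream stable/prometheus-operator chart; the release name and namespace are illustrative, and the values file name is taken from the path quoted later:

helm upgrade --install prometheus-operator stable/prometheus-operator \
  --namespace monitoring \
  -f values-live-eks.yaml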
Limitations
Alerting over longer time windows is not possible; retention is too low.
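For example, a hypothetical rule that needs a lookback window longer than the local retention cannot be evaluated reliably (metric and thresholds are illustrative):

groups:
- name: capacity
  rules:
  - alert: LowMemoryTrend
    # needs 3 days of samples, but the namespaced Prometheus only keeps 12h
    expr: avg_over_time(node_memory_MemAvailable_bytes[3d]) < 1e9
    for: 1h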
Wrapping up
An HA Prometheus pair per namespace is automatically provisioned,
although they all run in the monitoring namespace (see the sketch after this list).
The platform on-call team is paged if something goes wrong.
Shared responsibility between Cloud Runtime and SRE squads.
Namespaced Prometheus pods have 12h of retention.
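A rough sketch of what that provisioning amounts to, assuming the Prometheus Operator CRD; the team name and namespace label are illustrative:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: team-a              # one instance per team/namespace (name is illustrative)
  namespace: monitoring     # all instances live in the monitoring namespace
spec:
  replicas: 2               # HA pair
  retention: 12h
  serviceMonitorSelector: {}
  serviceMonitorNamespaceSelector:
    matchLabels:
      team: team-a          # only pick up ServiceMonitors from the owning namespace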
Routing Alerts
charts/external/kube-prometheus/values-live-eks.yaml
alertmanager:
  config:
    route:
      receiver: 'default-receiver'
      routes:
      - match:
          tribe: platform
        receiver: 'platform-opsgenie'
      # we need to do the step above for every tribe
      - match:
          tribe: page
        receiver: 'opsgenie'
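The rendered routing tree can be checked with amtool (ships with Alertmanager); the config file path here is illustrative:

amtool config routes test --config.file=alertmanager.yaml tribe=platform
# prints the receiver a matching alert would be routed to, e.g. platform-opsgenie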
Why Thanos?
Prometheus federation is hard
Global View
Long term storage (LTS)
Downsampling
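A sketch of how the Thanos components map to these points; hostnames, ports, and paths are illustrative:

# Sidecar (next to each Prometheus) uploads TSDB blocks to object storage -> long term storage
thanos sidecar \
  --prometheus.url http://localhost:9090 \
  --tsdb.path /prometheus \
  --objstore.config-file /etc/thanos/objstore.yml

# Querier fans out to all sidecars/stores -> global view
thanos query \
  --store prometheus-a-sidecar:10901 \
  --store prometheus-b-sidecar:10901

# Compactor compacts and downsamples blocks in object storage -> downsampling
thanos compact --objstore.config-file /etc/thanos/objstore.yml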
Prometheus Federation
Global Prometheus that pulls aggregated metrics from datacenter
Prometheus servers.
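An illustrative scrape config for such a global Prometheus, pulling only pre-aggregated series (e.g. recording rules prefixed with job:) from one datacenter's /federate endpoint; the hostname is made up:

scrape_configs:
- job_name: 'federate-dc1'
  honor_labels: true
  metrics_path: /federate
  params:
    'match[]':
    - '{__name__=~"job:.*"}'
  static_configs:
  - targets:
    - 'prometheus-dc1.example.com:9090'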
Reduce Load
Increase scrape_interval and evaluation_interval
On the ServiceMonitor:
kubectl edit servicemonitors service-monitor-name
endpoints:
- interval: 45s
  ...
PS: Keep in mind that it’s not practical to increase these beyond about 2 minutes: Prometheus only looks back 5 minutes for the most recent sample, so with longer intervals a single missed scrape makes series appear stale.
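For reference, a complete ServiceMonitor sketch with the relaxed interval in place; the names and labels are illustrative:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service
  labels:
    team: team-a
spec:
  selector:
    matchLabels:
      app: my-service
  endpoints:
  - port: metrics
    interval: 45s    # relaxed to reduce scrape load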