Slide 1

Slide 1 text

Effective infrastructure monitoring with Grafana David Kaltschmidt @davkals All Systems Go! Sept. 2019 #AllSystemsGo

Slide 2

Slide 2 text

I’m David
- Working on Explore, Prometheus, and Loki at Grafana Labs
- Previously: Unifying metrics/logs/traces at Kausal, work on WeaveScope
- david@grafana.com
- Twitter: @davkals

Slide 3

Slide 3 text

Monitoring at Grafana Labs

Slide 4

Slide 4 text

Monitoring at Grafana Labs
- K8s on GKE
- Prometheus for metrics
- Loki for logs
- Jaeger for distributed tracing
- Monitoring mixins

Slide 5

Slide 5 text

Monitoring by alerting Photo by Randy Tarampi on Unsplash

Slide 6

Slide 6 text

Monitoring by alerting
Photo by Randy Tarampi on Unsplash
- Alerts are part of monitoring mixins
- Ideally linked to a runbook

Slide 7

Slide 7 text

Prometheus with Node Exporter

Slide 8

Slide 8 text

Node exporter arp bcache bonding boottime conntrack cpu diskstats entropy filesystem ipvs loadavg meminfo netclass netstat nfs pressure sockstat stat textfile time uname vmstat xfs zfs

Slide 9

Slide 9 text

Node exporter arp bcache bonding boottime conntrack cpu diskstats entropy filesystem ipvs loadavg meminfo netclass netstat nfs pressure sockstat stat textfile time uname vmstat xfs zfs Kimchi!!! https://gitlab.com/bjk-gitlab

Slide 10

Slide 10 text

Mapping of Linux observability tools

Slide 11

Slide 11 text

System chart
[Diagram of the Linux stack. Software: Applications, System Libraries, System Call Interface, VFS, Sockets, Scheduler, File Systems, TCP/UDP, Volume Manager, IP, Block Device Interface, Ethernet, Virtual Memory, Device Drivers. Hardware: I/O Bridge, I/O Controller, Disk, Network Controller, Swap, Port, CPU, DRAM]

Slide 12

Slide 12 text

System chart with node_exporter metrics: node_….
[System chart annotated with collectors: up, vmstat, filesystem, xfs, zfs, drbd, diskstats, mdadm, bcache, conntrack, arp, netclass, netdev, wifi, infiniband, bonding, netstat. Hardware: thermal_zone, edac, entropy, hwmon, timex]

Slide 13

Slide 13 text

CPU utilisation: seconds spent on the CPU per second
[System chart]
(1 - avg without (cpu, mode) (rate(node_cpu_seconds_total{mode="idle"}[1m]))) * 100
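As a rough illustration of what this query computes, the Python sketch below derives the same percentage from two scrapes of the per-CPU idle counters. The function name and the sample values are hypothetical, and the sketch ignores the counter-reset handling and extrapolation that Prometheus' rate() performs.

```python
def cpu_utilisation_percent(idle_start, idle_end, window_seconds):
    """Return CPU utilisation in percent, averaged over all CPUs.

    idle_start and idle_end hold one idle-seconds counter value per CPU,
    taken at the start and end of the window (hypothetical inputs).
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if not idle_start or len(idle_start) != len(idle_end):
        raise ValueError("need the same, non-zero number of CPUs in both scrapes")

    # Per-CPU idle rate: idle seconds accumulated per wall-clock second.
    idle_rates = [
        (end - start) / window_seconds
        for start, end in zip(idle_start, idle_end)
    ]
    # Average the idle rate across CPUs, then invert it to get "busy".
    average_idle = sum(idle_rates) / len(idle_rates)
    return (1 - average_idle) * 100
```

For example, two CPUs that were idle for 30 s and 60 s of a 60 s window average 75% idle, so the function reports 25% utilisation.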

Slide 14

Slide 14 text

CPU saturation: Load average / number of CPUs
[System chart]
node_load1 / count without (cpu) (count without (mode) (node_cpu_seconds_total))

Slide 15

Slide 15 text

CPU saturation: Track waiting time instead of number of processes
[System chart]
node_load1
node_pressure...
Needs Linux (kernel 4.20+ and/or CONFIG_PSI)
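To see the raw pressure stall information (PSI) that the kernel exposes, the sketch below parses /proc/pressure/cpu, whose lines have the form "some avg10=0.00 avg60=0.00 avg300=0.00 total=0" according to the kernel PSI documentation. The function names are hypothetical, and this is only a way to inspect the data by hand, not how node_exporter itself reads it.

```python
def parse_psi_line(line):
    """Parse one PSI line into (kind, fields).

    kind is "some" or "full"; fields maps avg10/avg60/avg300 (percent)
    and total (stall time in microseconds) to floats.
    """
    kind, *pairs = line.split()
    fields = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        fields[key] = float(value)
    return kind, fields


def read_cpu_pressure(path="/proc/pressure/cpu"):
    """Read all PSI lines for the CPU; needs a kernel built with PSI support."""
    with open(path) as handle:
        return dict(parse_psi_line(line) for line in handle if line.strip())
```

The "total" field is a monotonically increasing counter, so taking its rate yields waiting time per second, which is the quantity the slide suggests tracking instead of a process count.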

Slide 16

Slide 16 text

Memory utilisation and “saturation”
[System chart, highlighted: Virtual Memory, DRAM]
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
rate(node_vmstat_pgpgin[1m]) + rate(node_vmstat_pgpgout[1m])

Slide 17

Slide 17 text

Disk utilisation and disk IO queue length
[System chart, highlighted: I/O Controller, Disk, Swap]
rate(node_disk_io_time_seconds_total[5m])
iostat avgqu-sz equivalent: rate(node_disk_io_time_weighted_seconds_total[5m])
Details: https://www.robustperception.io/mapping-iostat-to-the-node-exporters-node_disk_-metrics

Slide 18

Slide 18 text

Available disk space and disk space alerts; multiple alerts with varying severity
[System chart, highlighted: VFS, File Systems, Volume Manager, Block Device Interface]
1 - (max by (device) (node_filesystem_avail_bytes) / max by (device) (node_filesystem_size_bytes))
node_filesystem_avail_bytes/node_filesystem_size_bytes < 0.4 and predict_linear(node_filesystem_avail_bytes[6h], 24*60*60) < 0 and node_filesystem_readonly == 0
for 1h
severity: 'warning'
Details: https://github.com/prometheus/node_exporter/tree/master/docs/node-mixin
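The alert hinges on predict_linear, which fits a straight line through recent samples and extrapolates it. The Python sketch below shows the idea with an ordinary least-squares fit; the function names and sample data are hypothetical, and unlike Prometheus, which extrapolates from the evaluation time, this version extrapolates from the last sample.

```python
def predict_linear(samples, seconds_ahead):
    """Extrapolate a least-squares line through (timestamp, value) samples.

    Returns the predicted value seconds_ahead after the last sample.
    """
    count = len(samples)
    if count < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / count
    mean_v = sum(v for _, v in samples) / count
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept


def disk_warning(avail_bytes, size_bytes, samples, read_only):
    """Mirror the three alert conditions for one filesystem."""
    return (
        avail_bytes / size_bytes < 0.4
        and predict_linear(samples, 24 * 60 * 60) < 0
        and not read_only
    )
```

A filesystem that loses 10 units per hour from 100 is predicted to go negative within a day, so the second condition holds; the "for 1h" clause, which requires the condition to persist, is not modelled here.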

Slide 19

Slide 19 text

Network throughput and
[System chart, highlighted: Sockets, TCP/UDP, IP, Ethernet, Network Controller]
rate(node_network_receive_bytes_total{device!="lo"}[5m])
rate(node_network_transmit_bytes_total{device!="lo"}[5m])
rate(node_network_receive_drop_total{device!="lo"}[5m])
rate(node_network_transmit_drop_total{device!="lo"}[5m])

Slide 20

Slide 20 text

K8s host connectivity issues: conntrack table limits
[System chart, highlighted: Sockets, TCP/UDP, IP, Ethernet]
node_nf_conntrack_entries / node_nf_conntrack_entries_limit
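To check the same ratio by hand on a single host, the sketch below reads the kernel's conntrack counters from /proc/sys/net/netfilter, which are only present when the conntrack module is loaded. The function names and the 0.75 threshold are assumptions for illustration, not values from the talk.

```python
def conntrack_usage(entries, limit):
    """Return the fraction of the conntrack table that is in use."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return entries / limit


def read_conntrack_usage(base="/proc/sys/net/netfilter"):
    """Read the current entry count and table limit from the kernel."""
    with open(base + "/nf_conntrack_count") as handle:
        entries = int(handle.read())
    with open(base + "/nf_conntrack_max") as handle:
        limit = int(handle.read())
    return conntrack_usage(entries, limit)


def conntrack_nearly_full(usage, threshold=0.75):
    """Hypothetical alert condition: the table is above the threshold."""
    return usage > threshold
```

Once the table is full the kernel drops new connections, so alerting well before a ratio of 1.0 leaves time to raise the limit.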

Slide 21

Slide 21 text

Let’s build a conntrack alert

Slide 22

Slide 22 text

Collector gotchas
- Some collectors produce lots of time-series (filesystem, firewalls, networking, systemd)
- Some legacy collectors execute scripts or programs
- Dropping via relabeling rules is tedious
- Pro tip: run two node exporters, one with minimal and one with full configuration
- 10x savings on number of time-series
- Low overhead since everything is lazily loaded on scrape
https://github.com/RichiH Entropy!!

Slide 23

Slide 23 text

Bonus: Node exporter’s textfile collector Coffee!!! https://github.com/beorn7

Slide 24

Slide 24 text

Textfile collector
- Includes text files from a given directory to be part of the scrape
- Make sure to write atomically
- Don’t let node_exporter run scripts as root, use crontab as root to write output to text file instead
# INFO Last time Ansible successfully ran
ansible_last_run_timestamp 1568903175
# INFO Last time backup successfully ran
backup_last_run_timestamp 1568903175
# INFO Track which features are enabled on host
my_bare_metal_feature_enabled 1
# INFO Track SSD wearout
smartmon_media_wearout_indicator 95665
https://github.com/prometheus-community/node-exporter-textfile-collector-scripts
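One common way to write atomically is to write to a temporary file in the same directory and then rename it over the target, so a scrape never sees a half-written file. The Python sketch below does that; the function name, the file name in the usage note, and the directory are hypothetical, and it relies on the textfile collector only reading files that end in .prom.

```python
import os
import tempfile


def write_metrics_atomically(directory, filename, lines):
    """Write metric lines to directory/filename via a same-directory rename."""
    # The temporary file ends in .tmp, so the collector ignores it.
    fd, tmp_path = tempfile.mkstemp(
        dir=directory, prefix="." + filename, suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write("\n".join(lines) + "\n")
        # mkstemp creates the file with mode 0600; make it readable for
        # the user that node_exporter runs as.
        os.chmod(tmp_path, 0o644)
        # A rename within one filesystem replaces the target atomically.
        os.replace(tmp_path, os.path.join(directory, filename))
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A cron job running as root could call, for instance, write_metrics_atomically(textfile_dir, "backup.prom", ["backup_last_run_timestamp 1568903175"]), where textfile_dir is whatever directory the node exporter's textfile collector is configured to read.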

Slide 25

Slide 25 text

Fleet monitoring

Slide 26

Slide 26 text

Example from dashboards.gitlab.com based on node_uname_info

Slide 27

Slide 27 text

Deviations: Gamifying fleet management

Slide 28

Slide 28 text

Fleet overview: Example from dashboards.gitlab.com

Slide 29

Slide 29 text

Fleet overview: cluster vs. node
Template query: label_values(node_exporter_build_info, instance)
Node exporter mixin: https://github.com/prometheus/node_exporter/tree/master/docs/node-mixin

Slide 30

Slide 30 text

Thank you! Questions? @davkals We're hiring remote/EU/US-east. https://grafana.com/about/careers/

Slide 31

Slide 31 text

Log aggregation with Loki

Slide 32

Slide 32 text

See your logs in Grafana

Slide 33

Slide 33 text

Deploy Loki
Bare metal options:
- static service discovery
- systemd journal support
Example documentation: https://github.com/grafana/loki/blob/master/docs/promtail/examples.md