Slide 1

Move over Graphite, Prometheus is here

#phptek @mheap

Slide 2

Metrics?

Slide 3

How much disk space are we using? What’s the average CPU utilisation?

Slide 4

How many 500 errors in the last 5 minutes? Is our data processing rate better, worse, or the same as this time last week? How many concurrent users do we have?

Slide 5

How many active phone calls are there? What’s the average call duration? How many calls have there been to 441234567890 today?

Slide 6

Slide 7

Slide 8

call.ringing

Slide 9

call.447700900000.ringing

Slide 10

ded2585.call.447700900000.ringing

Slide 11

ded2585.call.447700900000.ringing
ded2585.call.447700900000.answered
ded2585.call.447700900000.complete

Slide 12

ded2585.call.*.placed

Slide 13

*.call.*.placed

Slide 14

*.call.447700900000.placed

Slide 15

Slide 16

Slide 17

Slide 18

Slide 19

Hello, I’m Michael

Slide 20

@mheap

Slide 21

Slide 22

Slide 23

Open Source

Slide 24

Mostly Go (A little Ruby)

Slide 25

Cloud Native Computing Foundation

Slide 26

Slide 27

Slide 28

How does it work?

Slide 29

Pulls metrics
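"Pulls metrics" is the key design point: Prometheus scrapes each target's HTTP /metrics endpoint on an interval, rather than having targets push data to it. A hand-rolled sketch of a scrape target in Python (the app_active_users gauge and port 9100 are made up for illustration; a real service would normally use an official client library):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def metrics_payload():
    # Hypothetical gauge; a real app would measure this, not hard-code it.
    return (
        "# HELP app_active_users Current number of active users.\n"
        "# TYPE app_active_users gauge\n"
        "app_active_users 42\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus scrapes GET /metrics and expects the plain-text format.
        if self.path == "/metrics":
            body = metrics_payload().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()


def serve(port=9100):
    # Blocks forever; call this from your entrypoint to expose the target.
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

The pull direction matters: the Prometheus server decides when and how often to collect, so a slow or dead target costs it one failed scrape instead of a backlog of pushed data.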

Slide 30

Disk storage

Slide 31

Efficient collection

Slide 32

Consul integration

Slide 33

prometheus.yml
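prometheus.yml is the server's main configuration file. A minimal sketch of a scrape config (the job names, targets, and 15s interval here are illustrative, not taken from the talk):

```yaml
global:
  scrape_interval: 15s        # how often to pull from each target

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]   # Prometheus scrapes itself
  - job_name: node
    static_configs:
      - targets: ["localhost:9100"]   # node_exporter's default port
```

Static targets are the simplest case; service discovery (such as the Consul integration mentioned earlier) can fill in the target list dynamically instead.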

Slide 34

# HELP node_filesystem_free_bytes Filesystem free space in bytes.
# TYPE node_filesystem_free_bytes gauge
node_filesystem_free_bytes{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 4.9138515968e+10
node_filesystem_free_bytes{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 3.62441240576e+11
node_filesystem_free_bytes{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 4.36741931008e+11

Slide 35

# HELP node_filesystem_free_bytes Filesystem free space in bytes.

Slide 36

node_filesystem_free_bytes{ device="/dev/disk1s1", fstype="apfs", mountpoint="/Volumes/Macintosh HD" } 4.9138515968e+10

Slide 37
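A sample line like this one has three parts: a metric name, an optional set of labels, and a value. A rough stdlib-only Python sketch that pulls those parts apart (simplified; the real exposition-format grammar also handles escaping and optional timestamps):

```python
import re

# One sample line: name, optional {labels}, then a value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s*(?P<value>\S+)$'
)


def parse_sample(line):
    """Split one exposition-format sample into (name, labels_dict, value)."""
    m = SAMPLE_RE.match(line.strip())
    if m is None:
        raise ValueError("not a sample line: %r" % line)
    # Simplified label parsing: assumes no escaped quotes inside values.
    labels = dict(
        re.findall(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"', m.group("labels") or "")
    )
    return m.group("name"), labels, float(m.group("value"))
```

Running it against the slide's sample yields the name node_filesystem_free_bytes, three labels identifying the device, and the free-space value as a float.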

Slide 38

node_filesystem_free_bytes
billing_notifications_total
process_cpu_seconds_total
http_request_duration_seconds

Slide 39

Slide 40

Slide 41

Slide 42

Slide 43

Slide 44

Slide 45

node_filesystem_free_bytes{ device="/dev/disk1s1", fstype="apfs", mountpoint="/Volumes/Macintosh HD" } 4.9138515968e+10

Slide 46

Slide 47

Slide 48

# HELP go_gc_duration_seconds A summary of the GC invocation durations. # TYPE go_gc_duration_seconds summary go_gc_duration_seconds{quantile="0"} 0 go_gc_duration_seconds{quantile="0.25"} 0 go_gc_duration_seconds{quantile="0.5"} 0 go_gc_duration_seconds{quantile="0.75"} 0 go_gc_duration_seconds{quantile="1"} 0 go_gc_duration_seconds_sum 0 go_gc_duration_seconds_count 0
# HELP go_goroutines Number of goroutines that currently exist. # TYPE go_goroutines gauge go_goroutines 6
# HELP go_info Information about the Go environment. # TYPE go_info gauge go_info{version="go1.10"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use. # TYPE go_memstats_alloc_bytes gauge go_memstats_alloc_bytes 827952
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed. # TYPE go_memstats_alloc_bytes_total counter go_memstats_alloc_bytes_total 827952
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table. # TYPE go_memstats_buck_hash_sys_bytes gauge go_memstats_buck_hash_sys_bytes 1.443286e+06
# HELP go_memstats_frees_total Total number of frees. # TYPE go_memstats_frees_total counter go_memstats_frees_total 243
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started. # TYPE go_memstats_gc_cpu_fraction gauge go_memstats_gc_cpu_fraction 0
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata. # TYPE go_memstats_gc_sys_bytes gauge go_memstats_gc_sys_bytes 169984
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use. # TYPE go_memstats_heap_alloc_bytes gauge go_memstats_heap_alloc_bytes 827952
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used. # TYPE go_memstats_heap_idle_bytes gauge go_memstats_heap_idle_bytes 761856
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use. # TYPE go_memstats_heap_inuse_bytes gauge go_memstats_heap_inuse_bytes 1.990656e+06
# HELP go_memstats_heap_objects Number of allocated objects. # TYPE go_memstats_heap_objects gauge go_memstats_heap_objects 7710
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS. # TYPE go_memstats_heap_released_bytes gauge go_memstats_heap_released_bytes 0
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system. # TYPE go_memstats_heap_sys_bytes gauge go_memstats_heap_sys_bytes 2.752512e+06
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection. # TYPE go_memstats_last_gc_time_seconds gauge go_memstats_last_gc_time_seconds 0
# HELP go_memstats_lookups_total Total number of pointer lookups. # TYPE go_memstats_lookups_total counter go_memstats_lookups_total 5
# HELP go_memstats_mallocs_total Total number of mallocs. # TYPE go_memstats_mallocs_total counter go_memstats_mallocs_total 7953
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures. # TYPE go_memstats_mcache_inuse_bytes gauge go_memstats_mcache_inuse_bytes 6944
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system. # TYPE go_memstats_mcache_sys_bytes gauge go_memstats_mcache_sys_bytes 16384
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures. # TYPE go_memstats_mspan_inuse_bytes gauge go_memstats_mspan_inuse_bytes 30096
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system. # TYPE go_memstats_mspan_sys_bytes gauge go_memstats_mspan_sys_bytes 32768
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place. # TYPE go_memstats_next_gc_bytes gauge go_memstats_next_gc_bytes 4.473924e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations. # TYPE go_memstats_other_sys_bytes gauge go_memstats_other_sys_bytes 1.059618e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator. # TYPE go_memstats_stack_inuse_bytes gauge go_memstats_stack_inuse_bytes 393216
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator. # TYPE go_memstats_stack_sys_bytes gauge go_memstats_stack_sys_bytes 393216
# HELP go_memstats_sys_bytes Number of bytes obtained from system. # TYPE go_memstats_sys_bytes gauge go_memstats_sys_bytes 5.867768e+06
# HELP go_threads Number of OS threads created. # TYPE go_threads gauge go_threads 7
# HELP node_cpu_seconds_total Seconds the cpus spent in each mode. # TYPE node_cpu_seconds_total counter node_cpu_seconds_total{cpu="0",mode="idle"} 537140.75 node_cpu_seconds_total{cpu="0",mode="nice"} 0 node_cpu_seconds_total{cpu="0",mode="system"} 202810.13 node_cpu_seconds_total{cpu="0",mode="user"} 236956.35 node_cpu_seconds_total{cpu="1",mode="idle"} 789924.55 node_cpu_seconds_total{cpu="1",mode="nice"} 0 node_cpu_seconds_total{cpu="1",mode="system"} 76430.46 node_cpu_seconds_total{cpu="1",mode="user"} 110379.86 node_cpu_seconds_total{cpu="2",mode="idle"} 521434.82 node_cpu_seconds_total{cpu="2",mode="nice"} 0 node_cpu_seconds_total{cpu="2",mode="system"} 206715.68 node_cpu_seconds_total{cpu="2",mode="user"} 248584.66 node_cpu_seconds_total{cpu="3",mode="idle"} 788754.35 node_cpu_seconds_total{cpu="3",mode="nice"} 0 node_cpu_seconds_total{cpu="3",mode="system"} 76188.77 node_cpu_seconds_total{cpu="3",mode="user"} 111791.47
# HELP node_disk_read_bytes_total The total number of bytes read successfully. # TYPE node_disk_read_bytes_total counter node_disk_read_bytes_total{device="disk0"} 6.22708862976e+11 node_disk_read_bytes_total{device="disk3"} 1.12842752e+08
# HELP node_disk_read_seconds_total The total number of seconds spent by all reads. # TYPE node_disk_read_seconds_total counter node_disk_read_seconds_total{device="disk0"} 22165.627411002 node_disk_read_seconds_total{device="disk3"} 67.88703918
# HELP node_disk_read_sectors_total The total number of sectors read successfully. # TYPE node_disk_read_sectors_total counter node_disk_read_sectors_total{device="disk0"} 4327.06494140625 node_disk_read_sectors_total{device="disk3"} 7.34765625
# HELP node_disk_reads_completed_total The total number of reads completed successfully. # TYPE node_disk_reads_completed_total counter node_disk_reads_completed_total{device="disk0"} 1.7723658e+07 node_disk_reads_completed_total{device="disk3"} 3762
# HELP node_disk_write_seconds_total This is the total number of seconds spent by all writes. # TYPE node_disk_write_seconds_total counter node_disk_write_seconds_total{device="disk0"} 8632.255762983 node_disk_write_seconds_total{device="disk3"} 0
# HELP node_disk_writes_completed_total The total number of writes completed successfully. # TYPE node_disk_writes_completed_total counter node_disk_writes_completed_total{device="disk0"} 1.9779856e+07 node_disk_writes_completed_total{device="disk3"} 0
# HELP node_disk_written_bytes_total The total number of bytes written successfully. # TYPE node_disk_written_bytes_total counter node_disk_written_bytes_total{device="disk0"} 6.94838308864e+11 node_disk_written_bytes_total{device="disk3"} 0
# HELP node_disk_written_sectors_total The total number of sectors written successfully. # TYPE node_disk_written_sectors_total counter node_disk_written_sectors_total{device="disk0"} 4829.06640625 node_disk_written_sectors_total{device="disk3"} 0
# HELP node_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which node_exporter was built. # TYPE node_exporter_build_info gauge node_exporter_build_info{branch="HEAD",goversion="go1.10",revision="002c1ca02917406cbecc457162e2bdb1f29c2f49",version="0.16.0-rc.0"} 1
# HELP node_filesystem_avail_bytes Filesystem space available to non-root users in bytes. # TYPE node_filesystem_avail_bytes gauge node_filesystem_avail_bytes{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 4.7416078336e+10 node_filesystem_avail_bytes{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 3.58532878336e+11 node_filesystem_avail_bytes{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 3.58565429248e+11 node_filesystem_avail_bytes{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 2.322432e+07 node_filesystem_avail_bytes{device="map -hosts",fstype="autofs",mountpoint="/net"} 0 node_filesystem_avail_bytes{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
# HELP node_filesystem_device_error Whether an error occurred while getting statistics for the given device. # TYPE node_filesystem_device_error gauge node_filesystem_device_error{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 0 node_filesystem_device_error{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 0 node_filesystem_device_error{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 0 node_filesystem_device_error{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 0 node_filesystem_device_error{device="map -hosts",fstype="autofs",mountpoint="/net"} 0 node_filesystem_device_error{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
# HELP node_filesystem_files Filesystem total file nodes. # TYPE node_filesystem_files gauge node_filesystem_files{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 9.223372036854776e+18 node_filesystem_files{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 9.223372036854776e+18 node_filesystem_files{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 9.223372036854776e+18 node_filesystem_files{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 4.294967279e+09 node_filesystem_files{device="map -hosts",fstype="autofs",mountpoint="/net"} 0 node_filesystem_files{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
# HELP node_filesystem_files_free Filesystem total free file nodes. # TYPE node_filesystem_files_free gauge node_filesystem_files_free{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 9.223372036854352e+18 node_filesystem_files_free{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 9.223372036853541e+18 node_filesystem_files_free{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 9.223372036854776e+18 node_filesystem_files_free{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 4.294964965e+09 node_filesystem_files_free{device="map -hosts",fstype="autofs",mountpoint="/net"} 0 node_filesystem_files_free{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
# HELP node_filesystem_free_bytes Filesystem free space in bytes. # TYPE node_filesystem_free_bytes gauge node_filesystem_free_bytes{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 4.9138515968e+10 node_filesystem_free_bytes{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 3.62441240576e+11 node_filesystem_free_bytes{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 4.36741931008e+11 node_filesystem_free_bytes{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 2.322432e+07 node_filesystem_free_bytes{device="map -hosts",fstype="autofs",mountpoint="/net"} 0 node_filesystem_free_bytes{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
# HELP node_filesystem_readonly Filesystem read-only status. # TYPE node_filesystem_readonly gauge node_filesystem_readonly{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 0 node_filesystem_readonly{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 0 node_filesystem_readonly{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 0 node_filesystem_readonly{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 1 node_filesystem_readonly{device="map -hosts",fstype="autofs",mountpoint="/net"} 0 node_filesystem_readonly{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
# HELP node_filesystem_size_bytes Filesystem size in bytes. # TYPE node_filesystem_size_bytes gauge node_filesystem_size_bytes{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 5.9999997952e+10 node_filesystem_size_bytes{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 4.3996317696e+11 node_filesystem_size_bytes{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 4.3996317696e+11 node_filesystem_size_bytes{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 1.3418496e+08 node_filesystem_size_bytes{device="map -hosts",fstype="autofs",mountpoint="/net"} 0 node_filesystem_size_bytes{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
# HELP node_load1 1m load average. # TYPE node_load1 gauge node_load1 2.451171875
# HELP node_load15 15m load average. # TYPE node_load15 gauge node_load15 2.7646484375
# HELP node_load5 5m load average. # TYPE node_load5 gauge node_load5 2.6083984375
# HELP node_memory_active_bytes_total Memory information field active_bytes_total. # TYPE node_memory_active_bytes_total gauge node_memory_active_bytes_total 3.251331072e+09
# HELP node_memory_bytes_total Memory information field bytes_total. # TYPE node_memory_bytes_total gauge node_memory_bytes_total 1.7179869184e+10
# HELP node_memory_free_bytes_total Memory information field free_bytes_total. # TYPE node_memory_free_bytes_total gauge node_memory_free_bytes_total 5.61926144e+08
# HELP node_memory_inactive_bytes_total Memory information field inactive_bytes_total. # TYPE node_memory_inactive_bytes_total gauge node_memory_inactive_bytes_total 3.997949952e+09
# HELP node_memory_swapped_in_pages_total Memory information field swapped_in_pages_total. # TYPE node_memory_swapped_in_pages_total gauge node_memory_swapped_in_pages_total 2.51926528e+09
# HELP node_memory_swapped_out_pages_total Memory information field swapped_out_pages_total. # TYPE node_memory_swapped_out_pages_total gauge node_memory_swapped_out_pages_total 3.131211776e+09
# HELP node_memory_wired_bytes_total Memory information field wired_bytes_total. # TYPE node_memory_wired_bytes_total gauge node_memory_wired_bytes_total 3.211726848e+09
# HELP node_network_receive_bytes_total Network device statistic receive_bytes. # TYPE node_network_receive_bytes_total counter node_network_receive_bytes_total{device="XHC0"} 0 node_network_receive_bytes_total{device="XHC1"} 0 node_network_receive_bytes_total{device="XHC20"} 0 node_network_receive_bytes_total{device="awdl0"} 5120 node_network_receive_bytes_total{device="bridge0"} 0 node_network_receive_bytes_total{device="en0"} 1.214772224e+09 node_network_receive_bytes_total{device="en1"} 0 node_network_receive_bytes_total{device="en2"} 0 node_network_receive_bytes_total{device="en3"} 0 node_network_receive_bytes_total{device="en4"} 0 node_network_receive_bytes_total{device="en5"} 1.000448e+06 node_network_receive_bytes_total{device="gif0"} 0 node_network_receive_bytes_total{device="lo0"} 2.01657344e+09 node_network_receive_bytes_total{device="p2p0"} 0 node_network_receive_bytes_total{device="stf0"} 0 node_network_receive_bytes_total{device="utun0"} 0 node_network_receive_bytes_total{device="utun1"} 505856 node_network_receive_bytes_total{device="utun2"} 23552 node_network_receive_bytes_total{device="utun3"} 46080 node_network_receive_bytes_total{device="utun4"} 0 node_network_receive_bytes_total{device="utun5"} 0 node_network_receive_bytes_total{device="utun6"} 0 node_network_receive_bytes_total{device="vboxnet0"} 1.631232e+06
# HELP node_network_receive_errs_total Network device statistic receive_errs. # TYPE node_network_receive_errs_total counter node_network_receive_errs_total{device="XHC0"} 0 node_network_receive_errs_total{device="XHC1"} 0 node_network_receive_errs_total{device="XHC20"} 0 node_network_receive_errs_total{device="awdl0"} 0 node_network_receive_errs_total{device="bridge0"} 0 node_network_receive_errs_total{device="en0"} 0 node_network_receive_errs_total{device="en1"} 0 node_network_receive_errs_total{device="en2"} 0 node_network_receive_errs_total{device="en3"} 0 node_network_receive_errs_total{device="en4"} 0 node_network_receive_errs_total{device="en5"} 0 node_network_receive_errs_total{device="gif0"} 0 node_network_receive_errs_total{device="lo0"} 0 node_network_receive_errs_total{device="p2p0"} 0 node_network_receive_errs_total{device="stf0"} 0 node_network_receive_errs_total{device="utun0"} 0 node_network_receive_errs_total{device="utun1"} 0 node_network_receive_errs_total{device="utun2"} 0 node_network_receive_errs_total{device="utun3"} 0 node_network_receive_errs_total{device="utun4"} 0 node_network_receive_errs_total{device="utun5"} 0 node_network_receive_errs_total{device="utun6"} 0 node_network_receive_errs_total{device="vboxnet0"} 0
# HELP node_network_receive_multicast_total Network device statistic receive_multicast. # TYPE node_network_receive_multicast_total counter node_network_receive_multicast_total{device="XHC0"} 0 node_network_receive_multicast_total{device="XHC1"} 0 node_network_receive_multicast_total{device="XHC20"} 0 node_network_receive_multicast_total{device="awdl0"} 33 node_network_receive_multicast_total{device="bridge0"} 0 node_network_receive_multicast_total{device="en0"} 5.331321e+06 node_network_receive_multicast_total{device="en1"} 0 node_network_receive_multicast_total{device="en2"} 0 node_network_receive_multicast_total{device="en3"} 0 node_network_receive_multicast_total{device="en4"} 0 node_network_receive_multicast_total{device="en5"} 4 node_network_receive_multicast_total{device="gif0"} 0 node_network_receive_multicast_total{device="lo0"} 266605 node_network_receive_multicast_total{device="p2p0"} 0 node_network_receive_multicast_total{device="stf0"} 0 node_network_receive_multicast_total{device="utun0"} 0 node_network_receive_multicast_total{device="utun1"} 0 node_network_receive_multicast_total{device="utun2"} 0 node_network_receive_multicast_total{device="utun3"} 0 node_network_receive_multicast_total{device="utun4"} 0 node_network_receive_multicast_total{device="utun5"} 0 node_network_receive_multicast_total{device="utun6"} 0 node_network_receive_multicast_total{device="vboxnet0"} 98
# HELP node_network_receive_packets_total Network device statistic receive_packets. # TYPE node_network_receive_packets_total counter node_network_receive_packets_total{device="XHC0"} 0 node_network_receive_packets_total{device="XHC1"} 0 node_network_receive_packets_total{device="XHC20"} 0 node_network_receive_packets_total{device="awdl0"} 42 node_network_receive_packets_total{device="bridge0"} 0 node_network_receive_packets_total{device="en0"} 5.6394197e+07 node_network_receive_packets_total{device="en1"} 0 node_network_receive_packets_total{device="en2"} 0 node_network_receive_packets_total{device="en3"} 0 node_network_receive_packets_total{device="en4"} 0 node_network_receive_packets_total{device="en5"} 4299 node_network_receive_packets_total{device="gif0"} 0 node_network_receive_packets_total{device="lo0"} 3.243677e+06 node_network_receive_packets_total{device="p2p0"} 0 node_network_receive_packets_total{device="stf0"} 0 node_network_receive_packets_total{device="utun0"} 0 node_network_receive_packets_total{device="utun1"} 3548 node_network_receive_packets_total{device="utun2"} 168 node_network_receive_packets_total{device="utun3"} 226 node_network_receive_packets_total{device="utun4"} 0 node_network_receive_packets_total{device="utun5"} 0 node_network_receive_packets_total{device="utun6"} 0 node_network_receive_packets_total{device="vboxnet0"} 1533
# HELP node_network_transmit_bytes_total Network device statistic transmit_bytes. # TYPE node_network_transmit_bytes_total counter node_network_transmit_bytes_total{device="XHC0"} 0 node_network_transmit_bytes_total{device="XHC1"} 0 node_network_transmit_bytes_total{device="XHC20"} 0 node_network_transmit_bytes_total{device="awdl0"} 1.50016e+06 node_network_transmit_bytes_total{device="bridge0"} 0 node_network_transmit_bytes_total{device="en0"} 2.575358976e+09 node_network_transmit_bytes_total{device="en1"} 0 node_network_transmit_bytes_total{device="en2"} 0 node_network_transmit_bytes_total{device="en3"} 0 node_network_transmit_bytes_total{device="en4"} 0 node_network_transmit_bytes_total{device="en5"} 483328 node_network_transmit_bytes_total{device="gif0"} 0 node_network_transmit_bytes_total{device="lo0"} 2.01657344e+09 node_network_transmit_bytes_total{device="p2p0"} 0 node_network_transmit_bytes_total{device="stf0"} 0 node_network_transmit_bytes_total{device="utun0"} 0 node_network_transmit_bytes_total{device="utun1"} 493568 node_network_transmit_bytes_total{device="utun2"} 23552 node_network_transmit_bytes_total{device="utun3"} 46080 node_network_transmit_bytes_total{device="utun4"} 0 node_network_transmit_bytes_total{device="utun5"} 0 node_network_transmit_bytes_total{device="utun6"} 0 node_network_transmit_bytes_total{device="vboxnet0"} 1.695744e+06
# HELP node_network_transmit_errs_total Network device statistic transmit_errs. # TYPE node_network_transmit_errs_total counter node_network_transmit_errs_total{device="XHC0"} 0 node_network_transmit_errs_total{device="XHC1"} 0 node_network_transmit_errs_total{device="XHC20"} 0 node_network_transmit_errs_total{device="awdl0"} 0 node_network_transmit_errs_total{device="bridge0"} 0 node_network_transmit_errs_total{device="en0"} 0 node_network_transmit_errs_total{device="en1"} 0 node_network_transmit_errs_total{device="en2"} 0 node_network_transmit_errs_total{device="en3"} 0 node_network_transmit_errs_total{device="en4"} 0 node_network_transmit_errs_total{device="en5"} 0 node_network_transmit_errs_total{device="gif0"} 0 node_network_transmit_errs_total{device="lo0"} 0 node_network_transmit_errs_total{device="p2p0"} 0 node_network_transmit_errs_total{device="stf0"} 0 node_network_transmit_errs_total{device="utun0"} 0 node_network_transmit_errs_total{device="utun1"} 0 node_network_transmit_errs_total{device="utun2"} 0 node_network_transmit_errs_total{device="utun3"} 0 node_network_transmit_errs_total{device="utun4"} 0 node_network_transmit_errs_total{device="utun5"} 0 node_network_transmit_errs_total{device="utun6"} 0 node_network_transmit_errs_total{device="vboxnet0"} 0
# HELP node_network_transmit_multicast_total Network device statistic transmit_multicast. # TYPE node_network_transmit_multicast_total counter node_network_transmit_multicast_total{device="XHC0"} 0 node_network_transmit_multicast_total{device="XHC1"} 0 node_network_transmit_multicast_total{device="XHC20"} 0 node_network_transmit_multicast_total{device="awdl0"} 0 node_network_transmit_multicast_total{device="bridge0"} 0 node_network_transmit_multicast_total{device="en0"} 0 node_network_transmit_multicast_total{device="en1"} 0 node_network_transmit_multicast_total{device="en2"} 0 node_network_transmit_multicast_total{device="en3"} 0 node_network_transmit_multicast_total{device="en4"} 0 node_network_transmit_multicast_total{device="en5"} 0 node_network_transmit_multicast_total{device="gif0"} 0 node_network_transmit_multicast_total{device="lo0"} 0 node_network_transmit_multicast_total{device="p2p0"} 0 node_network_transmit_multicast_total{device="stf0"} 0 node_network_transmit_multicast_total{device="utun0"} 0 node_network_transmit_multicast_total{device="utun1"} 0 node_network_transmit_multicast_total{device="utun2"} 0 node_network_transmit_multicast_total{device="utun3"} 0 node_network_transmit_multicast_total{device="utun4"} 0 node_network_transmit_multicast_total{device="utun5"} 0 node_network_transmit_multicast_total{device="utun6"} 0 node_network_transmit_multicast_total{device="vboxnet0"} 0
# HELP node_network_transmit_packets_total Network device statistic transmit_packets. # TYPE node_network_transmit_packets_total counter node_network_transmit_packets_total{device="XHC0"} 0 node_network_transmit_packets_total{device="XHC1"} 0 node_network_transmit_packets_total{device="XHC20"} 0 node_network_transmit_packets_total{device="awdl0"} 6691 node_network_transmit_packets_total{device="bridge0"} 1 node_network_transmit_packets_total{device="en0"} 3.2582836e+07 node_network_transmit_packets_total{device="en1"} 0 node_network_transmit_packets_total{device="en2"} 0 node_network_transmit_packets_total{device="en3"} 0 node_network_transmit_packets_total{device="en4"} 0 node_network_transmit_packets_total{device="en5"} 4145 node_network_transmit_packets_total{device="gif0"} 0 node_network_transmit_packets_total{device="lo0"} 3.243677e+06 node_network_transmit_packets_total{device="p2p0"} 0 node_network_transmit_packets_total{device="stf0"} 0 node_network_transmit_packets_total{device="utun0"} 2 node_network_transmit_packets_total{device="utun1"} 3236 node_network_transmit_packets_total{device="utun2"} 160 node_network_transmit_packets_total{device="utun3"} 223 node_network_transmit_packets_total{device="utun4"} 2 node_network_transmit_packets_total{device="utun5"} 2 node_network_transmit_packets_total{device="utun6"} 2 node_network_transmit_packets_total{device="vboxnet0"} 73766
# HELP node_scrape_collector_duration_seconds node_exporter: Duration of a collector scrape. # TYPE node_scrape_collector_duration_seconds gauge node_scrape_collector_duration_seconds{collector="cpu"} 0.00013298 node_scrape_collector_duration_seconds{collector="diskstats"} 0.000803364 node_scrape_collector_duration_seconds{collector="filesystem"} 0.000119007 node_scrape_collector_duration_seconds{collector="loadavg"} 2.3448e-05 node_scrape_collector_duration_seconds{collector="meminfo"} 5.3036e-05 node_scrape_collector_duration_seconds{collector="netdev"} 0.000338404 node_scrape_collector_duration_seconds{collector="textfile"} 1.7727e-05 node_scrape_collector_duration_seconds{collector="time"} 2.8571e-05
# HELP node_scrape_collector_success node_exporter: Whether a collector succeeded. # TYPE node_scrape_collector_success gauge node_scrape_collector_success{collector="cpu"} 1 node_scrape_collector_success{collector="diskstats"} 1 node_scrape_collector_success{collector="filesystem"} 1 node_scrape_collector_success{collector="loadavg"} 1 node_scrape_collector_success{collector="meminfo"} 1 node_scrape_collector_success{collector="netdev"} 1 node_scrape_collector_success{collector="textfile"} 1 node_scrape_collector_success{collector="time"} 1
# HELP node_textfile_scrape_error 1 if there was an error opening or reading a file, 0 otherwise # TYPE node_textfile_scrape_error gauge node_textfile_scrape_error 0
# HELP node_time_seconds System time in seconds since epoch (1970). # TYPE node_time_seconds gauge node_time_seconds 1.5210412225783854e+09
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served. # TYPE promhttp_metric_handler_requests_in_flight gauge promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code. # TYPE promhttp_metric_handler_requests_total counter promhttp_metric_handler_requests_total{code="200"} 0 promhttp_metric_handler_requests_total{code="500"} 0 promhttp_metric_handler_requests_total{code="503"} 0

Slide 49

# HELP node_filesystem_readonly Filesystem read-only status.
# TYPE node_filesystem_readonly gauge
node_filesystem_readonly{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 0
node_filesystem_readonly{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 0
node_filesystem_readonly{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 0
node_filesystem_readonly{device="map -hosts",fstype="autofs",mountpoint="/net"} 0
node_filesystem_readonly{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
# HELP node_filesystem_size_bytes Filesystem size in bytes.
# TYPE node_filesystem_size_bytes gauge
node_filesystem_size_bytes{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 5.9999997952e+10
node_filesystem_size_bytes{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 4.3996317696e+11
node_filesystem_size_bytes{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 4.3996317696e+11
node_filesystem_size_bytes{device="map -hosts",fstype="autofs",mountpoint="/net"} 0
node_filesystem_size_bytes{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 2.451171875
# HELP node_load15 15m load average.
# TYPE node_load15 gauge
node_load15 2.7646484375

Slide 50

Slide 50 text

#phptek @mheap Node Exporter?

Slide 51

Slide 51 text

#phptek @mheap Exporter?

Slide 52

Slide 52 text

#phptek @mheap Exposes metrics

Slide 53

Slide 53 text

#phptek @mheap node_exporter

Key         Description
arp         Exposes ARP statistics from /proc/net/arp.
cpu         Exposes CPU statistics.
filesystem  Exposes filesystem statistics, such as disk space used.
ipvs        Exposes IPVS status from /proc/net/ip_vs and stats from /proc/net/ip_vs_stats.
netstat     Exposes network statistics from /proc/net/netstat. This is the same information as netstat -s.
uname       Exposes system information as provided by the uname system call.

Slide 54

Slide 54 text

#phptek @mheap mysqld_exporter

Key                      Description
perf_schema.tablelocks   Collect metrics from performance_schema.table_lock_waits_summary_by_table
info_schema.processlist  Collect thread state counts from information_schema.processlist
binlog_size              Collect the current size of all registered binlog files
auto_increment.columns   Collect auto_increment columns and max values from information_schema

Slide 55

Slide 55 text

#phptek @mheap haproxy_exporter

Key                      Description
current_queue            Current number of queued requests assigned to this server
current_sessions         Current number of active sessions
bytes_in_total           Current total of incoming bytes
connection_errors_total  Total of connection errors

Slide 56

Slide 56 text

#phptek @mheap memcached_exporter

Key                  Description
bytes_read           Total number of bytes read by this server from network
connections_total    Total number of connections opened since the server started running
items_evicted_total  Total number of valid items removed from cache to free memory for new items
commands_total       Total number of all requests broken down by command (get, set, etc.) and status per slab

Slide 57

Slide 57 text

#phptek @mheap Create your own metrics

Slide 58

Slide 58 text

#phptek @mheap calls_received_total{ network="o2", number="447700900000", type="mobile" } 11

Slide 59

Slide 59 text

#phptek @mheap calls_received_total{ network="o2", number="447700900000", type="mobile" } 11

Slide 60

Slide 60 text

#phptek @mheap calls_received_total{ network="o2", number="447700900000", type="mobile" } 11

Slide 61

Slide 61 text

#phptek @mheap calls_received_total{ network="o2", number="447700900000", type="mobile" } 11

Slide 62

Slide 62 text

#phptek @mheap calls_received_total{ network="o2", number="447700900000", type="mobile" } 11

Slide 63

Slide 63 text

#phptek @mheap calls_received_total{ network="o2", number="447700900000", type="mobile" } 11

Slide 64

Slide 64 text

#phptek @mheap
Increment a counter
Serve on /metrics
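Concretely, "serve on /metrics" just means returning plain text in the Prometheus exposition format. A minimal sketch (Python, purely illustrative — a real app would use a client library to do this):

```python
def render_counter(name, help_text, samples):
    """Render a counter in the Prometheus text exposition format.

    samples: list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

body = render_counter(
    "calls_received_total",
    "Total calls received.",
    [({"network": "o2", "number": "447700900000", "type": "mobile"}, 11)],
)
print(body)
```

Serving that string with Content-Type text/plain on /metrics is all Prometheus needs to scrape it.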

Slide 65

Slide 65 text

#phptek @mheap
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: nexmo_calls
    static_configs:
      - targets: ['localhost:3000']

Slide 66

Slide 66 text

#phptek @mheap
Increment a counter
Serve on /metrics

Slide 67

Slide 67 text

#phptek @mheap Hard in PHP

Slide 68

Slide 68 text

#phptek @mheap Pushgateway
The Prometheus Pushgateway exists to allow ephemeral and batch jobs to expose their metrics to Prometheus. Since these kinds of jobs may not exist long enough to be scraped, they can instead push their metrics to a Pushgateway. The Pushgateway then exposes these metrics to Prometheus.
https://github.com/prometheus/pushgateway
https://github.com/Lazyshot/prometheus-php
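What a push boils down to: a PUT to the Pushgateway's /metrics/job/&lt;job&gt;(/&lt;label&gt;/&lt;value&gt;)* path with a text-format body. A sketch in Python for brevity; the host, job, and metric names here are made up:

```python
from urllib.parse import quote

def pushgateway_request(base_url, job, metric, value, grouping=None):
    """Build the URL and body for a Pushgateway push.

    PUT replaces all metrics for the grouping key; POST would merge.
    """
    path = f"/metrics/job/{quote(job, safe='')}"
    for k, v in (grouping or {}).items():
        path += f"/{quote(k, safe='')}/{quote(v, safe='')}"
    body = f"# TYPE {metric} counter\n{metric} {value}\n"
    return base_url + path, body

url, body = pushgateway_request("http://localhost:9091", "nexmo_calls",
                                "calls_received_total", 11,
                                grouping={"instance": "worker-1"})
# To actually send it (network call, so left commented):
# import urllib.request
# urllib.request.urlopen(urllib.request.Request(url, data=body.encode(), method="PUT"))
```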

Slide 69

Slide 69 text

#phptek @mheap Things to know

Slide 70

Slide 70 text

#phptek @mheap > 5-10 labels is bad
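The reason: every unique combination of label values is its own time series, so cardinality multiplies across labels. A back-of-the-envelope check (Python; the cardinalities below are invented for illustration):

```python
from math import prod

def series_count(label_cardinalities):
    """Worst-case number of time series a single metric can produce:
    the product of each label's distinct-value count."""
    return prod(label_cardinalities.values())

few_labels = series_count({"network": 10, "type": 3})
# A per-phone-number label blows the series count up
with_number = series_count({"network": 10, "type": 3, "number": 50_000})
```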

Slide 71

Slide 71 text

#phptek @mheap Secure /metrics

Slide 72

Slide 72 text

#phptek @mheap Metric Types

Slide 73

Slide 73 text

#phptek @mheap Counters calls_placed_total

Slide 74

Slide 74 text

#phptek @mheap Gauges calls_active

Slide 75

Slide 75 text

#phptek @mheap Histograms calls_duration

Slide 76

Slide 76 text

#phptek @mheap Summaries calls_duration

Slide 77

Slide 77 text

#phptek @mheap
calls_duration_bucket{le="10",network="o2",number="447700900000",type="mobile"} 16
calls_duration_bucket{le="30",network="o2",number="447700900000",type="mobile"} 63
calls_duration_bucket{le="60",network="o2",number="447700900000",type="mobile"} 123
calls_duration_bucket{le="120",network="o2",number="447700900000",type="mobile"} 253
calls_duration_bucket{le="300",network="o2",number="447700900000",type="mobile"} 618
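histogram_quantile() estimates a quantile from these cumulative buckets by finding the bucket the target rank falls into and interpolating linearly inside it. A sketch of that calculation (assuming, for illustration, the +Inf bucket adds nothing beyond le="300"):

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets by linear
    interpolation inside the bucket where the target rank falls —
    the same idea as PromQL's histogram_quantile().

    buckets: sorted list of (upper_bound, cumulative_count).
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if rank <= count:
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# The bucket counts from the slide
buckets = [(10, 16), (30, 63), (60, 123), (120, 253), (300, 618)]
median = histogram_quantile(0.5, buckets)  # falls in the (120, 300] bucket
```

The estimate is only as good as the bucket layout: everything inside a bucket is assumed uniformly distributed.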

Slide 78

Slide 78 text

#phptek @mheap
calls_duration{quantile="0.5"} 85
calls_duration{quantile="0.9"} 123
calls_duration{quantile="0.99"} 221
calls_duration_sum 13130
calls_duration_count 6

Slide 79

Slide 79 text

#phptek @mheap
Counters: use for counting events that happen (e.g. total number of requests) and query using rate()
Gauges: use to instrument the current state of a metric (e.g. memory usage, jobs in queue)
Histograms: use to sample observations in order to analyse the distribution of a data set (e.g. request latency)
Summaries: use for pre-calculated quantiles on the client side, but be mindful of calculation cost and aggregation limitations

Slide 80

Slide 80 text

#phptek @mheap Show me graphs

Slide 81

Slide 81 text

#phptek @mheap

Slide 82

Slide 82 text

#phptek @mheap PromQL

Slide 83

Slide 83 text

#phptek @mheap
calls_placed_total

Element => Value
calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="BT",number="441234567890",type="landline"} 4
calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="BT",number="442079460000",type="landline"} 8
calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="BT",number="442079460000",type="landline"} 1
calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="o2",number="447700900000",type="mobile"} 6
calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="o2",number="447908249481",type="mobile"} 7

Slide 84

Slide 84 text

#phptek @mheap
calls_placed_total{number="441234567890"}

Element => Value
calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="BT",number="441234567890",type="landline"} 4

Slide 85

Slide 85 text

#phptek @mheap
calls_placed_total{number="441234567890"}[3m]

calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="BT",number="441234567890",type="landline"} =>
3 @1521482766.23
4 @1521482769.23
12 @1521482772.229
16 @1521482775.229
21 @1521482778.23
25 @1521482781.23
27 @1521482784.229
31 @1521482787.229
35 @1521482790.229
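rate() operates on exactly this kind of range vector: it computes the per-second increase, compensating for counter resets. A simplified sketch using the samples above (the real rate() also extrapolates to the edges of the window):

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples.

    A counter that goes down has been reset, so the post-reset value is
    counted as fresh increase. Simplified compared to PromQL's rate(),
    which additionally extrapolates to the window boundaries.
    """
    increase, prev = 0.0, samples[0][1]
    for _, value in samples[1:]:
        if value < prev:          # counter reset detected
            increase += value
        else:
            increase += value - prev
        prev = value
    return increase / (samples[-1][0] - samples[0][0])

samples = [(1521482766.23, 3), (1521482769.23, 4), (1521482772.229, 12),
           (1521482775.229, 16), (1521482778.23, 21), (1521482781.23, 25),
           (1521482784.229, 27), (1521482787.229, 31), (1521482790.229, 35)]
per_second = simple_rate(samples)  # roughly 32 calls over 24 seconds
```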

Slide 86

Slide 86 text

#phptek @mheap
calls_placed_total{number="441234567890"}[3m] offset 1w

calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="BT",number="441234567890",type="landline"} =>
2 @1521311766.23
7 @1521311769.23
18 @1521311772.229
20 @1521311775.229
27 @1521311778.23
28 @1521311781.23
30 @1521311784.229
36 @1521311787.229
39 @1521311790.229

Slide 87

Slide 87 text

#phptek @mheap
# Total number of calls regardless of any labels
sum(calls_placed_total)

# Total number of calls, broken down by the number label
sum(calls_placed_total) by (number)

# Total per-second rate over the last 5 minutes by number
sum(rate(calls_placed_total[5m])) by (number)

Slide 88

Slide 88 text

#phptek @mheap sum(calls_placed_total{network="EE", type="mobile"})

Slide 89

Slide 89 text

#phptek @mheap sum(calls_placed_total{network=~"E.*", type="mobile"})

Slide 90

Slide 90 text

#phptek @mheap sum(calls_placed_total{network!="EE", type="mobile"})

Slide 91

Slide 91 text

#phptek @mheap rate(calls_duration_sum{network="EE"}[5m]) / rate(calls_duration_count{network="EE"}[5m])

Slide 92

Slide 92 text

#phptek @mheap histogram_quantile( 0.95, calls_duration_bucket{number=~"[[number]]"} )

Slide 93

Slide 93 text

#phptek @mheap sum without (duration) (rate(calls_placed_total{number=~"[[number]]"} [3m]))

Slide 94

Slide 94 text

#phptek @mheap predict_linear(calls_active[1h], 86400)
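predict_linear() fits a least-squares line through the samples in the range and extrapolates it the given number of seconds past the last sample. A sketch of that idea with synthetic data (a perfectly linear 2 calls per minute):

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares line through (timestamp, value) samples,
    extrapolated seconds_ahead past the last sample — the same idea
    as PromQL's predict_linear()."""
    n = len(samples)
    t0 = samples[0][0]
    xs = [t - t0 for t, _ in samples]
    ys = [v for _, v in samples]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope * (xs[-1] + seconds_ahead) + intercept

# Synthetic samples: one per minute for an hour, growing 2/minute
samples = [(t, 2 * t / 60) for t in range(0, 3600, 60)]
predicted = predict_linear(samples, 86400)  # value one day past the window
```

This is also how the DiskFullInFourHours alert later in the deck works: extrapolate free bytes forward and alert when the line crosses zero.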

Slide 95

Slide 95 text

#phptek @mheap Grafana

Slide 96

Slide 96 text

No content

Slide 97

Slide 97 text

No content

Slide 98

Slide 98 text

No content

Slide 99

Slide 99 text

#phptek @mheap Gauges calls_active

Slide 100

Slide 100 text

#phptek @mheap

Slide 101

Slide 101 text

#phptek @mheap Counters calls_placed_total

Slide 102

Slide 102 text

#phptek @mheap

Slide 103

Slide 103 text

#phptek @mheap

Slide 104

Slide 104 text

#phptek @mheap Histograms calls_duration

Slide 105

Slide 105 text

#phptek @mheap

Slide 106

Slide 106 text

#phptek @mheap

Slide 107

Slide 107 text

#phptek @mheap

Slide 108

Slide 108 text

#phptek @mheap Version 5

Slide 109

Slide 109 text

#phptek @mheap Alertmanager

Slide 110

Slide 110 text

#phptek @mheap Alertmanager rules

Slide 111

Slide 111 text

#phptek @mheap
alert: CallsMonitorDown
expr: up{job="nexmo_calls"} == 0
for: 5m
labels:
  severity: critical

Slide 112

Slide 112 text

#phptek @mheap
alert: LotsOfJobsInQueue
expr: sum(jobs_in_queue) > 100
for: 5m
labels:
  severity: major

Slide 113

Slide 113 text

#phptek @mheap
alert: DiskFullInFourHours
expr: predict_linear(node_filesystem_free{job="node"}[1h], 4 * 3600) < 0
for: 5m
labels:
  severity: major

Slide 114

Slide 114 text

#phptek @mheap
alert: HighCallsBeingPlacedOnLandline
expr: rate(calls_placed_total{network=~".*", type="landline"}[1m]) > 10
for: 5m
labels:
  severity: critical
annotations:
  description: 'Unusually high call count on {{ $labels.network }}'
  summary: 'High call count on {{ $labels.network }}'

Slide 115

Slide 115 text

#phptek @mheap Alertmanager alerts

Slide 116

Slide 116 text

#phptek @mheap
[ smtp_from: <tmpl_string> ]
[ slack_api_url: <secret> ]
[ victorops_api_key: <secret> ]
[ victorops_api_url: <string> | default = "https://alert.victorops.com/integrations/generic/20131114/alert/" ]
[ pagerduty_url: <string> | default = "https://events.pagerduty.com/v2/enqueue" ]
[ opsgenie_api_key: <secret> ]
[ opsgenie_api_url: <string> | default = "https://api.opsgenie.com/" ]
[ hipchat_api_url: <string> | default = "https://api.hipchat.com/" ]
[ hipchat_auth_token: <secret> ]

Slide 117

Slide 117 text

#phptek @mheap Alertmanager routes

Slide 118

Slide 118 text

#phptek @mheap
route:
  receiver: 'default-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]
  routes:
    - receiver: 'database-pager'
      group_wait: 10s
      match_re:
        service: mysql|cassandra
    - receiver: 'frontend-pager'
      group_by: [product, environment]
      match:
        team: frontend
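Routing walks the tree top-down: the first child route whose matchers fit the alert's labels wins, otherwise the parent's receiver applies. A simplified one-level sketch of that logic (match_re is anchored, as in Alertmanager):

```python
import re

def pick_receiver(route, labels):
    """First matching child route wins; fall back to the parent receiver.
    Simplified: one level deep, no `continue` flag."""
    for child in route.get("routes", []):
        exact = all(labels.get(k) == v
                    for k, v in child.get("match", {}).items())
        regex = all(re.fullmatch(v, labels.get(k, ""))
                    for k, v in child.get("match_re", {}).items())
        if exact and regex:
            return child["receiver"]
    return route["receiver"]

# The routing tree from the slide
route = {
    "receiver": "default-receiver",
    "routes": [
        {"receiver": "database-pager", "match_re": {"service": "mysql|cassandra"}},
        {"receiver": "frontend-pager", "match": {"team": "frontend"}},
    ],
}
```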

Slide 119

Slide 119 text

#phptek @mheap
route:
  receiver: 'default-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]
  routes:
    - receiver: 'database-pager'
      group_wait: 10s
      match_re:
        service: mysql|cassandra
    - receiver: 'frontend-pager'
      group_by: [product, environment]
      match:
        team: frontend

Slide 120

Slide 120 text

#phptek @mheap
route:
  receiver: 'default-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]
  routes:
    - receiver: 'database-pager'
      group_wait: 10s
      match_re:
        service: mysql|cassandra
    - receiver: 'frontend-pager'
      group_by: [product, environment]
      match:
        team: frontend

Slide 121

Slide 121 text

#phptek @mheap
route:
  receiver: 'default-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]
  routes:
    - receiver: 'database-pager'
      group_wait: 10s
      match_re:
        service: mysql|cassandra
    - receiver: 'frontend-pager'
      group_by: [product, environment]
      match:
        team: frontend

Slide 122

Slide 122 text

#phptek @mheap
receivers:
  - name: 'team-X-mails'
    email_configs:
      - to: '[email protected]'
  - name: 'team-X-pager'
    email_configs:
      - to: '[email protected]'
    pagerduty_configs:
      - routing_key: <team-X-key>
  - name: 'team-Y-mails'
    email_configs:
      - to: '[email protected]'
  - name: 'team-Y-pager'
    pagerduty_configs:
      - routing_key: <team-Y-key>
  - name: 'team-DB-pager'
    pagerduty_configs:
      - routing_key: <team-DB-key>

Slide 123

Slide 123 text

#phptek @mheap Alertmanager inhibits

Slide 124

Slide 124 text

#phptek @mheap
! Database is down
! User login failure > 100
! Report generation failure > 15
! GET /healthcheck returned 500

Slide 125

Slide 125 text

#phptek @mheap
! Database is down
! User login failure > 100
! Report generation failure > 15
! GET /healthcheck returned 500

Slide 126

Slide 126 text

#phptek @mheap
! Database is down
! User login failure > 100
! Report generation failure > 15
! GET /healthcheck returned 500

Slide 127

Slide 127 text

#phptek @mheap
! Database is down
! User login failure > 100
! Report generation failure > 15
! GET /healthcheck returned 500

Slide 128

Slide 128 text

#phptek @mheap The DB is the root cause

Slide 129

Slide 129 text

#phptek @mheap
inhibit_rules:
  - source_match:
      alertname: 'DatabaseDown'
    target_match:
      alertname: 'UserLoginFailure'
    equal: ['instance']
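The inhibition check itself is simple: if an alert matching source_match is firing, suppress any alert matching target_match whose `equal` labels agree with it. A sketch of that logic, with DatabaseDown as the root cause per the slides:

```python
def is_inhibited(alert, firing, rules):
    """True when some rule's source alert is firing, `alert` matches the
    rule's target, and the `equal` labels agree between the two
    (a simplified sketch of Alertmanager's inhibitor)."""
    for rule in rules:
        if not all(alert.get(k) == v for k, v in rule["target_match"].items()):
            continue
        for src in firing:
            if not all(src.get(k) == v for k, v in rule["source_match"].items()):
                continue
            if all(src.get(l) == alert.get(l) for l in rule.get("equal", [])):
                return True
    return False

rules = [{"source_match": {"alertname": "DatabaseDown"},
          "target_match": {"alertname": "UserLoginFailure"},
          "equal": ["instance"]}]
firing = [{"alertname": "DatabaseDown", "instance": "db1"}]
```

The `equal` clause is what keeps a database outage on db1 from silencing login-failure alerts on an unrelated instance.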

Slide 130

Slide 130 text

#phptek @mheap

Slide 131

Slide 131 text

#phptek @mheap Ingester Grouper Deduplicator Silencer Throttler Notifier

Slide 132

Slide 132 text

#phptek @mheap Ingester Grouper Deduplicator Silencer Throttler Notifier

Slide 133

Slide 133 text

#phptek @mheap Ingester Grouper Deduplicator Silencer Throttler Notifier

Slide 134

Slide 134 text

#phptek @mheap Ingester Grouper Deduplicator Silencer Throttler Notifier

Slide 135

Slide 135 text

#phptek @mheap Ingester Grouper Deduplicator Silencer Throttler Notifier

Slide 136

Slide 136 text

#phptek @mheap Ingester Grouper Deduplicator Silencer Throttler Notifier

Slide 137

Slide 137 text

#phptek @mheap Ingester Grouper Deduplicator Silencer Throttler Notifier

Slide 138

Slide 138 text

#phptek @mheap Does it scale?

Slide 139

Slide 139 text

#phptek @mheap Yes.

Slide 140

Slide 140 text

#phptek @mheap
4.6M time series per server
72k samples ingested per second, per server
185 production Prometheus servers

Slide 141

Slide 141 text

#phptek @mheap Prometheus federates

Slide 142

Slide 142 text

#phptek @mheap Alertmanager gossips

Slide 143

Slide 143 text

#phptek @mheap So that’s Prometheus

Slide 144

Slide 144 text

#phptek @mheap So that’s Prometheus (and PromQL, Grafana and Alertmanager)

Slide 145

Slide 145 text

#phptek @mheap @MHEAP [email protected] HTTPS://JOIND.IN/TALK/845A7