Move over Graphite, Prometheus is here - php[tek]

We all agree metrics are important, and Graphite’s a great tool for capturing them. However, in the last few years, lots of great tools have appeared in the metrics space that blow Graphite out of the water, one of which is Prometheus from SoundCloud. Prometheus allows you to query any dimension of your data while still storing it in a highly efficient format.

Together, we’ll take a look at how to get started with Prometheus, including how to create dashboards with Grafana and alerts using Alertmanager. By the time you leave, you’ll understand how Prometheus works and will be itching to add it to your projects!

Michael Heap

May 31, 2018

Transcript

  1. #phptek @mheap
    Move over Graphite
    Prometheus is here

  2.
    Metrics?

  3.
    How much disk space are we using?
    What’s the average CPU utilisation?

  4.
    How many 500 errors in the last 5 minutes?
    Is our data processing rate better, worse, or the same as this time last week?
    How many concurrent users do we have?

  5.
    How many active phone calls are there?
    What’s the average call duration?
    How many calls have there been to 441234567890 today?

  8.
    call.ringing

  9.
    call.447700900000.ringing

  10.
    ded2585.call.447700900000.ringing

  11.
    ded2585.call.447700900000.ringing
    ded2585.call.447700900000.answered
    ded2585.call.447700900000.complete

  12.
    ded2585.call.*.placed

  13.
    *.call.*.placed

  14.
    *.call.447700900000.placed
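
The wildcard lookups above can be sketched in a few lines of Python. This is a minimal illustration, not Graphite's implementation: it assumes a `*` matches exactly one dot-separated segment, and every metric name other than the ones shown on the slides is made up for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical stored series; only the first name appears on the slides.
METRICS = [
    "ded2585.call.447700900000.placed",
    "ded2585.call.447700900001.placed",
    "a1b2c3d.call.447700900000.placed",
    "ded2585.call.447700900000.ringing",
]


def matches(pattern: str, metric: str) -> bool:
    """Compare pattern and metric segment by segment (split on dots)."""
    pattern_parts = pattern.split(".")
    metric_parts = metric.split(".")
    if len(pattern_parts) != len(metric_parts):
        return False
    return all(fnmatchcase(m, p) for p, m in zip(pattern_parts, metric_parts))


def select(pattern: str, metrics=METRICS) -> list:
    """Return every stored metric name that the pattern matches."""
    return [m for m in metrics if matches(pattern, m)]


if __name__ == "__main__":
    for pattern in (
        "ded2585.call.*.placed",
        "*.call.*.placed",
        "*.call.447700900000.placed",
    ):
        print(pattern, "->", select(pattern))
```

In this sketch each query has to know which position in the name holds which value, so the meaning of a segment lives in its position rather than in a label.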

  19.
    Hello, I’m Michael

  20.
    @mheap

  23.
    Open Source

  24.
    Mostly Go
    (A little Ruby)

  25.
    Cloud Native Computing Foundation

  28.
    How does it work?

  29.
    Pulls metrics
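
Pulling means the application only exposes its current values over HTTP, and the server fetches them on a schedule. Below is a rough sketch of both sides using only the Python standard library. The `active_calls` metric, its value, and the helper names are hypothetical, and a real service would normally use an official client library rather than hand-written output.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

ACTIVE_CALLS = 3  # hypothetical application state


def render_metrics() -> str:
    """Render the current state in the plain-text exposition format."""
    return (
        "# HELP active_calls Number of calls currently in progress.\n"
        "# TYPE active_calls gauge\n"
        f"active_calls {ACTIVE_CALLS}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    """The application side: answers GET /metrics and nothing else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging in the demo


# Opener that ignores any proxy settings, so the local demo stays local.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def scrape(url: str) -> str:
    """The collector side: one pull is just an HTTP GET."""
    with _opener.open(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def demo() -> str:
    """Start the endpoint on a free local port, scrape it once, shut down."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        return scrape(f"http://127.0.0.1:{port}/metrics")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

The application never needs to know where the collector lives; it only has to answer requests on `/metrics`.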

  30.
    Disk storage

  31.
    Efficient collection

  32.
    Consul integration

  33.
    prometheus.yml

  34.
    # HELP node_filesystem_free_bytes Filesystem free space in bytes.
    # TYPE node_filesystem_free_bytes gauge
    node_filesystem_free_bytes{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 4.9138515968e+10
    node_filesystem_free_bytes{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 3.62441240576e+11
    node_filesystem_free_bytes{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 4.36741931008e+11

  35.
    # HELP node_filesystem_free_bytes Filesystem free space in bytes.

  36.
    node_filesystem_free_bytes{
    device="/dev/disk1s1",
    fstype="apfs",
    mountpoint="/Volumes/Macintosh HD"
    }
    4.9138515968e+10
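
To make the three parts of that line concrete (metric name, labels, value), here is a small parser sketch. It assumes one sample per line with no trailing timestamp and does not unescape label values. `parse_sample` is an illustrative helper, not part of any Prometheus library.

```python
import re

# name{label="value",...} value   (the label block is optional)
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)$'
)
# A single key="value" pair; the value may contain escaped characters.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str):
    """Split one exposition-format sample into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


if __name__ == "__main__":
    line = (
        'node_filesystem_free_bytes{device="/dev/disk1s1",fstype="apfs",'
        'mountpoint="/Volumes/Macintosh HD"} 4.9138515968e+10'
    )
    name, labels, value = parse_sample(line)
    print(name)
    print(labels)
    print(value)
```

Because the labels come back as key/value pairs, a filter such as `labels["fstype"] == "apfs"` does not depend on where the label appears in the line.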

  38.
    node_filesystem_free_bytes
    billing_notifications_total
    process_cpu_seconds_total
    http_request_duration_seconds

  45.
    node_filesystem_free_bytes{
    device="/dev/disk1s1",
    fstype="apfs",
    mountpoint="/Volumes/Macintosh HD"
    }
    4.9138515968e+10

  48.
    # HELP go_gc_duration_seconds A summary of the GC invocation durations.
    # TYPE go_gc_duration_seconds summary
    go_gc_duration_seconds{quantile="0"} 0
    go_gc_duration_seconds{quantile="0.25"} 0
    go_gc_duration_seconds{quantile="0.5"} 0
    go_gc_duration_seconds{quantile="0.75"} 0
    go_gc_duration_seconds{quantile="1"} 0
    go_gc_duration_seconds_sum 0
    go_gc_duration_seconds_count 0
    # HELP go_goroutines Number of goroutines that currently exist.
    # TYPE go_goroutines gauge
    go_goroutines 6
    # HELP go_info Information about the Go environment.
    # TYPE go_info gauge
    go_info{version="go1.10"} 1
    # HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
    # TYPE go_memstats_alloc_bytes gauge
    go_memstats_alloc_bytes 827952
    # HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
    # TYPE go_memstats_alloc_bytes_total counter
    go_memstats_alloc_bytes_total 827952
    # HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
    # TYPE go_memstats_buck_hash_sys_bytes gauge
    go_memstats_buck_hash_sys_bytes 1.443286e+06
    # HELP go_memstats_frees_total Total number of frees.
    # TYPE go_memstats_frees_total counter
    go_memstats_frees_total 243
    # HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
    # TYPE go_memstats_gc_cpu_fraction gauge
    go_memstats_gc_cpu_fraction 0
    # HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
    # TYPE go_memstats_gc_sys_bytes gauge
    go_memstats_gc_sys_bytes 169984
    # HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
    # TYPE go_memstats_heap_alloc_bytes gauge
    go_memstats_heap_alloc_bytes 827952
    # HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
    # TYPE go_memstats_heap_idle_bytes gauge
    go_memstats_heap_idle_bytes 761856
    # HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
    # TYPE go_memstats_heap_inuse_bytes gauge
    go_memstats_heap_inuse_bytes 1.990656e+06
    # HELP go_memstats_heap_objects Number of allocated objects.
    # TYPE go_memstats_heap_objects gauge
    go_memstats_heap_objects 7710
    # HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
    # TYPE go_memstats_heap_released_bytes gauge
    go_memstats_heap_released_bytes 0
    # HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
    # TYPE go_memstats_heap_sys_bytes gauge
    go_memstats_heap_sys_bytes 2.752512e+06
    # HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
    # TYPE go_memstats_last_gc_time_seconds gauge
    go_memstats_last_gc_time_seconds 0
    # HELP go_memstats_lookups_total Total number of pointer lookups.
    # TYPE go_memstats_lookups_total counter
    go_memstats_lookups_total 5
    # HELP go_memstats_mallocs_total Total number of mallocs.
    # TYPE go_memstats_mallocs_total counter
    go_memstats_mallocs_total 7953
    # HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
    # TYPE go_memstats_mcache_inuse_bytes gauge
    go_memstats_mcache_inuse_bytes 6944
    # HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
    # TYPE go_memstats_mcache_sys_bytes gauge
    go_memstats_mcache_sys_bytes 16384
    # HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
    # TYPE go_memstats_mspan_inuse_bytes gauge
    go_memstats_mspan_inuse_bytes 30096
    # HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
    # TYPE go_memstats_mspan_sys_bytes gauge
    go_memstats_mspan_sys_bytes 32768
    # HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
    # TYPE go_memstats_next_gc_bytes gauge
    go_memstats_next_gc_bytes 4.473924e+06
    # HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
    # TYPE go_memstats_other_sys_bytes gauge
    go_memstats_other_sys_bytes 1.059618e+06
    # HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
    # TYPE go_memstats_stack_inuse_bytes gauge
    go_memstats_stack_inuse_bytes 393216
    # HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
    # TYPE go_memstats_stack_sys_bytes gauge
    go_memstats_stack_sys_bytes 393216
    # HELP go_memstats_sys_bytes Number of bytes obtained from system.
    # TYPE go_memstats_sys_bytes gauge
    go_memstats_sys_bytes 5.867768e+06
    # HELP go_threads Number of OS threads created.
    # TYPE go_threads gauge
    go_threads 7
    # HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
    # TYPE node_cpu_seconds_total counter
    node_cpu_seconds_total{cpu="0",mode="idle"} 537140.75
    node_cpu_seconds_total{cpu="0",mode="nice"} 0
    node_cpu_seconds_total{cpu="0",mode="system"} 202810.13
    node_cpu_seconds_total{cpu="0",mode="user"} 236956.35
    node_cpu_seconds_total{cpu="1",mode="idle"} 789924.55
    node_cpu_seconds_total{cpu="1",mode="nice"} 0
    node_cpu_seconds_total{cpu="1",mode="system"} 76430.46
    node_cpu_seconds_total{cpu="1",mode="user"} 110379.86
    node_cpu_seconds_total{cpu="2",mode="idle"} 521434.82
    node_cpu_seconds_total{cpu="2",mode="nice"} 0
    node_cpu_seconds_total{cpu="2",mode="system"} 206715.68
    node_cpu_seconds_total{cpu="2",mode="user"} 248584.66
    node_cpu_seconds_total{cpu="3",mode="idle"} 788754.35
    node_cpu_seconds_total{cpu="3",mode="nice"} 0
    node_cpu_seconds_total{cpu="3",mode="system"} 76188.77
    node_cpu_seconds_total{cpu="3",mode="user"} 111791.47
    # HELP node_disk_read_bytes_total The total number of bytes read successfully.
    # TYPE node_disk_read_bytes_total counter
    node_disk_read_bytes_total{device="disk0"} 6.22708862976e+11
    node_disk_read_bytes_total{device="disk3"} 1.12842752e+08
    # HELP node_disk_read_seconds_total The total number of seconds spent by all reads.
    # TYPE node_disk_read_seconds_total counter
    node_disk_read_seconds_total{device="disk0"} 22165.627411002
    node_disk_read_seconds_total{device="disk3"} 67.88703918
    # HELP node_disk_read_sectors_total The total number of sectors read successfully.
    # TYPE node_disk_read_sectors_total counter
    node_disk_read_sectors_total{device="disk0"} 4327.06494140625
    node_disk_read_sectors_total{device="disk3"} 7.34765625
    # HELP node_disk_reads_completed_total The total number of reads completed successfully.
    # TYPE node_disk_reads_completed_total counter
    node_disk_reads_completed_total{device="disk0"} 1.7723658e+07
    node_disk_reads_completed_total{device="disk3"} 3762
    # HELP node_disk_write_seconds_total This is the total number of seconds spent by all writes.
    # TYPE node_disk_write_seconds_total counter
    node_disk_write_seconds_total{device="disk0"} 8632.255762983
    node_disk_write_seconds_total{device="disk3"} 0
    # HELP node_disk_writes_completed_total The total number of writes completed successfully.
    # TYPE node_disk_writes_completed_total counter
    node_disk_writes_completed_total{device="disk0"} 1.9779856e+07
    node_disk_writes_completed_total{device="disk3"} 0
    # HELP node_disk_written_bytes_total The total number of bytes written successfully.
    # TYPE node_disk_written_bytes_total counter
    node_disk_written_bytes_total{device="disk0"} 6.94838308864e+11
    node_disk_written_bytes_total{device="disk3"} 0
    # HELP node_disk_written_sectors_total The total number of sectors written successfully.
    # TYPE node_disk_written_sectors_total counter
    node_disk_written_sectors_total{device="disk0"} 4829.06640625
    node_disk_written_sectors_total{device="disk3"} 0
    # HELP node_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which node_exporter was built.
    # TYPE node_exporter_build_info gauge
    node_exporter_build_info{branch="HEAD",goversion="go1.10",revision="002c1ca02917406cbecc457162e2bdb1f29c2f49",version="0.16.0-rc.0"} 1
    # HELP node_filesystem_avail_bytes Filesystem space available to non-root users in bytes.
    # TYPE node_filesystem_avail_bytes gauge
    node_filesystem_avail_bytes{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 4.7416078336e+10
    node_filesystem_avail_bytes{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 3.58532878336e+11
    node_filesystem_avail_bytes{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 3.58565429248e+11
    node_filesystem_avail_bytes{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 2.322432e+07
    node_filesystem_avail_bytes{device="map -hosts",fstype="autofs",mountpoint="/net"} 0
    node_filesystem_avail_bytes{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
    # HELP node_filesystem_device_error Whether an error occurred while getting statistics for the given device.
    # TYPE node_filesystem_device_error gauge
    node_filesystem_device_error{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 0
    node_filesystem_device_error{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 0
    node_filesystem_device_error{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 0
    node_filesystem_device_error{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 0
    node_filesystem_device_error{device="map -hosts",fstype="autofs",mountpoint="/net"} 0
    node_filesystem_device_error{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
    # HELP node_filesystem_files Filesystem total file nodes.
    # TYPE node_filesystem_files gauge
    node_filesystem_files{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 9.223372036854776e+18
    node_filesystem_files{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 9.223372036854776e+18
    node_filesystem_files{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 9.223372036854776e+18
    node_filesystem_files{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 4.294967279e+09
    node_filesystem_files{device="map -hosts",fstype="autofs",mountpoint="/net"} 0
    node_filesystem_files{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
    # HELP node_filesystem_files_free Filesystem total free file nodes.
    # TYPE node_filesystem_files_free gauge
    node_filesystem_files_free{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 9.223372036854352e+18
    node_filesystem_files_free{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 9.223372036853541e+18
    node_filesystem_files_free{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 9.223372036854776e+18
    node_filesystem_files_free{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 4.294964965e+09
    node_filesystem_files_free{device="map -hosts",fstype="autofs",mountpoint="/net"} 0
    node_filesystem_files_free{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
    # HELP node_filesystem_free_bytes Filesystem free space in bytes.
    # TYPE node_filesystem_free_bytes gauge
    node_filesystem_free_bytes{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 4.9138515968e+10
    node_filesystem_free_bytes{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 3.62441240576e+11
    node_filesystem_free_bytes{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 4.36741931008e+11
    node_filesystem_free_bytes{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 2.322432e+07
    node_filesystem_free_bytes{device="map -hosts",fstype="autofs",mountpoint="/net"} 0
    node_filesystem_free_bytes{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
    # HELP node_filesystem_readonly Filesystem read-only status.
    # TYPE node_filesystem_readonly gauge
    node_filesystem_readonly{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 0
    node_filesystem_readonly{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 0
    node_filesystem_readonly{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 0
    node_filesystem_readonly{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 1
    node_filesystem_readonly{device="map -hosts",fstype="autofs",mountpoint="/net"} 0
    node_filesystem_readonly{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
    # HELP node_filesystem_size_bytes Filesystem size in bytes.
    # TYPE node_filesystem_size_bytes gauge
    node_filesystem_size_bytes{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 5.9999997952e+10
    node_filesystem_size_bytes{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 4.3996317696e+11
    node_filesystem_size_bytes{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 4.3996317696e+11
    node_filesystem_size_bytes{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 1.3418496e+08
    node_filesystem_size_bytes{device="map -hosts",fstype="autofs",mountpoint="/net"} 0
    node_filesystem_size_bytes{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
    # HELP node_load1 1m load average.
    # TYPE node_load1 gauge
    node_load1 2.451171875
    # HELP node_load15 15m load average.
    # TYPE node_load15 gauge
    node_load15 2.7646484375
    # HELP node_load5 5m load average.
    # TYPE node_load5 gauge
    node_load5 2.6083984375
    # HELP node_memory_active_bytes_total Memory information field active_bytes_total.
    # TYPE node_memory_active_bytes_total gauge
    node_memory_active_bytes_total 3.251331072e+09
    # HELP node_memory_bytes_total Memory information field bytes_total.
    # TYPE node_memory_bytes_total gauge
    node_memory_bytes_total 1.7179869184e+10
    # HELP node_memory_free_bytes_total Memory information field free_bytes_total.
    # TYPE node_memory_free_bytes_total gauge
    node_memory_free_bytes_total 5.61926144e+08
    # HELP node_memory_inactive_bytes_total Memory information field inactive_bytes_total.
    # TYPE node_memory_inactive_bytes_total gauge
    node_memory_inactive_bytes_total 3.997949952e+09
    # HELP node_memory_swapped_in_pages_total Memory information field swapped_in_pages_total.
    # TYPE node_memory_swapped_in_pages_total gauge
    node_memory_swapped_in_pages_total 2.51926528e+09
    # HELP node_memory_swapped_out_pages_total Memory information field swapped_out_pages_total.
    # TYPE node_memory_swapped_out_pages_total gauge
    node_memory_swapped_out_pages_total 3.131211776e+09
    # HELP node_memory_wired_bytes_total Memory information field wired_bytes_total.
    # TYPE node_memory_wired_bytes_total gauge
    node_memory_wired_bytes_total 3.211726848e+09
    # HELP node_network_receive_bytes_total Network device statistic receive_bytes.
    # TYPE node_network_receive_bytes_total counter
    node_network_receive_bytes_total{device="XHC0"} 0
    node_network_receive_bytes_total{device="XHC1"} 0
    node_network_receive_bytes_total{device="XHC20"} 0
    node_network_receive_bytes_total{device="awdl0"} 5120
    node_network_receive_bytes_total{device="bridge0"} 0
    node_network_receive_bytes_total{device="en0"} 1.214772224e+09
    node_network_receive_bytes_total{device="en1"} 0
    node_network_receive_bytes_total{device="en2"} 0
    node_network_receive_bytes_total{device="en3"} 0
    node_network_receive_bytes_total{device="en4"} 0
    node_network_receive_bytes_total{device="en5"} 1.000448e+06
    node_network_receive_bytes_total{device="gif0"} 0
    node_network_receive_bytes_total{device="lo0"} 2.01657344e+09
    node_network_receive_bytes_total{device="p2p0"} 0
    node_network_receive_bytes_total{device="stf0"} 0
    node_network_receive_bytes_total{device="utun0"} 0
    node_network_receive_bytes_total{device="utun1"} 505856
    node_network_receive_bytes_total{device="utun2"} 23552
    node_network_receive_bytes_total{device="utun3"} 46080
    node_network_receive_bytes_total{device="utun4"} 0
    node_network_receive_bytes_total{device="utun5"} 0
    node_network_receive_bytes_total{device="utun6"} 0
    node_network_receive_bytes_total{device="vboxnet0"} 1.631232e+06
    # HELP node_network_receive_errs_total Network device statistic receive_errs.
    # TYPE node_network_receive_errs_total counter
    node_network_receive_errs_total{device="XHC0"} 0
    node_network_receive_errs_total{device="XHC1"} 0
    node_network_receive_errs_total{device="XHC20"} 0
    node_network_receive_errs_total{device="awdl0"} 0
    node_network_receive_errs_total{device="bridge0"} 0
    node_network_receive_errs_total{device="en0"} 0
    node_network_receive_errs_total{device="en1"} 0
    node_network_receive_errs_total{device="en2"} 0
    node_network_receive_errs_total{device="en3"} 0
    node_network_receive_errs_total{device="en4"} 0
    node_network_receive_errs_total{device="en5"} 0
    node_network_receive_errs_total{device="gif0"} 0
    node_network_receive_errs_total{device="lo0"} 0
    node_network_receive_errs_total{device="p2p0"} 0
    node_network_receive_errs_total{device="stf0"} 0
    node_network_receive_errs_total{device="utun0"} 0
    node_network_receive_errs_total{device="utun1"} 0
    node_network_receive_errs_total{device="utun2"} 0
    node_network_receive_errs_total{device="utun3"} 0
    node_network_receive_errs_total{device="utun4"} 0
    node_network_receive_errs_total{device="utun5"} 0
    node_network_receive_errs_total{device="utun6"} 0
    node_network_receive_errs_total{device="vboxnet0"} 0
    # HELP node_network_receive_multicast_total Network device statistic receive_multicast.
    # TYPE node_network_receive_multicast_total counter
    node_network_receive_multicast_total{device="XHC0"} 0
    node_network_receive_multicast_total{device="XHC1"} 0
    node_network_receive_multicast_total{device="XHC20"} 0
    node_network_receive_multicast_total{device="awdl0"} 33
    node_network_receive_multicast_total{device="bridge0"} 0
    node_network_receive_multicast_total{device="en0"} 5.331321e+06
    node_network_receive_multicast_total{device="en1"} 0
    node_network_receive_multicast_total{device="en2"} 0
    node_network_receive_multicast_total{device="en3"} 0
    node_network_receive_multicast_total{device="en4"} 0
    node_network_receive_multicast_total{device="en5"} 4
    node_network_receive_multicast_total{device="gif0"} 0
    node_network_receive_multicast_total{device="lo0"} 266605
    node_network_receive_multicast_total{device="p2p0"} 0
    node_network_receive_multicast_total{device="stf0"} 0
    node_network_receive_multicast_total{device="utun0"} 0
    node_network_receive_multicast_total{device="utun1"} 0
    node_network_receive_multicast_total{device="utun2"} 0
    node_network_receive_multicast_total{device="utun3"} 0
    node_network_receive_multicast_total{device="utun4"} 0
    node_network_receive_multicast_total{device="utun5"} 0
    node_network_receive_multicast_total{device="utun6"} 0
    node_network_receive_multicast_total{device="vboxnet0"} 98
    # HELP node_network_receive_packets_total Network device statistic receive_packets.
    # TYPE node_network_receive_packets_total counter
    node_network_receive_packets_total{device="XHC0"} 0
    node_network_receive_packets_total{device="XHC1"} 0
    node_network_receive_packets_total{device="XHC20"} 0
    node_network_receive_packets_total{device="awdl0"} 42
    node_network_receive_packets_total{device="bridge0"} 0
    node_network_receive_packets_total{device="en0"} 5.6394197e+07
    node_network_receive_packets_total{device="en1"} 0
    node_network_receive_packets_total{device="en2"} 0
    node_network_receive_packets_total{device="en3"} 0
    node_network_receive_packets_total{device="en4"} 0
    node_network_receive_packets_total{device="en5"} 4299
    node_network_receive_packets_total{device="gif0"} 0
    node_network_receive_packets_total{device="lo0"} 3.243677e+06
    node_network_receive_packets_total{device="p2p0"} 0
    node_network_receive_packets_total{device="stf0"} 0
    node_network_receive_packets_total{device="utun0"} 0
    node_network_receive_packets_total{device="utun1"} 3548
    node_network_receive_packets_total{device="utun2"} 168
    node_network_receive_packets_total{device="utun3"} 226
    node_network_receive_packets_total{device="utun4"} 0
    node_network_receive_packets_total{device="utun5"} 0
    node_network_receive_packets_total{device="utun6"} 0
    node_network_receive_packets_total{device="vboxnet0"} 1533
    # HELP node_network_transmit_bytes_total Network device statistic transmit_bytes.
    # TYPE node_network_transmit_bytes_total counter
    node_network_transmit_bytes_total{device="XHC0"} 0
    node_network_transmit_bytes_total{device="XHC1"} 0
    node_network_transmit_bytes_total{device="XHC20"} 0
    node_network_transmit_bytes_total{device="awdl0"} 1.50016e+06
    node_network_transmit_bytes_total{device="bridge0"} 0
    node_network_transmit_bytes_total{device="en0"} 2.575358976e+09
    node_network_transmit_bytes_total{device="en1"} 0
    node_network_transmit_bytes_total{device="en2"} 0
    node_network_transmit_bytes_total{device="en3"} 0
    node_network_transmit_bytes_total{device="en4"} 0
    node_network_transmit_bytes_total{device="en5"} 483328
    node_network_transmit_bytes_total{device="gif0"} 0
    node_network_transmit_bytes_total{device="lo0"} 2.01657344e+09
    node_network_transmit_bytes_total{device="p2p0"} 0
    node_network_transmit_bytes_total{device="stf0"} 0
    node_network_transmit_bytes_total{device="utun0"} 0
    node_network_transmit_bytes_total{device="utun1"} 493568
    node_network_transmit_bytes_total{device="utun2"} 23552
    node_network_transmit_bytes_total{device="utun3"} 46080
    node_network_transmit_bytes_total{device="utun4"} 0
    node_network_transmit_bytes_total{device="utun5"} 0
    node_network_transmit_bytes_total{device="utun6"} 0
    node_network_transmit_bytes_total{device="vboxnet0"} 1.695744e+06
    # HELP node_network_transmit_errs_total Network device statistic transmit_errs.
    # TYPE node_network_transmit_errs_total counter
    node_network_transmit_errs_total{device="XHC0"} 0
    node_network_transmit_errs_total{device="XHC1"} 0
    node_network_transmit_errs_total{device="XHC20"} 0
    node_network_transmit_errs_total{device="awdl0"} 0
    node_network_transmit_errs_total{device="bridge0"} 0
    node_network_transmit_errs_total{device="en0"} 0
    node_network_transmit_errs_total{device="en1"} 0
    node_network_transmit_errs_total{device="en2"} 0
    node_network_transmit_errs_total{device="en3"} 0
    node_network_transmit_errs_total{device="en4"} 0
    node_network_transmit_errs_total{device="en5"} 0
    node_network_transmit_errs_total{device="gif0"} 0
    node_network_transmit_errs_total{device="lo0"} 0
    node_network_transmit_errs_total{device="p2p0"} 0
    node_network_transmit_errs_total{device="stf0"} 0
    node_network_transmit_errs_total{device="utun0"} 0
    node_network_transmit_errs_total{device="utun1"} 0
    node_network_transmit_errs_total{device="utun2"} 0
    node_network_transmit_errs_total{device="utun3"} 0
    node_network_transmit_errs_total{device="utun4"} 0
    node_network_transmit_errs_total{device="utun5"} 0
    node_network_transmit_errs_total{device="utun6"} 0
    node_network_transmit_errs_total{device="vboxnet0"} 0
    # HELP node_network_transmit_multicast_total Network device statistic transmit_multicast.
    # TYPE node_network_transmit_multicast_total counter
    node_network_transmit_multicast_total{device="XHC0"} 0
    node_network_transmit_multicast_total{device="XHC1"} 0
    node_network_transmit_multicast_total{device="XHC20"} 0
    node_network_transmit_multicast_total{device="awdl0"} 0
    node_network_transmit_multicast_total{device="bridge0"} 0
    node_network_transmit_multicast_total{device="en0"} 0
    node_network_transmit_multicast_total{device="en1"} 0
    node_network_transmit_multicast_total{device="en2"} 0
    node_network_transmit_multicast_total{device="en3"} 0
    node_network_transmit_multicast_total{device="en4"} 0
    node_network_transmit_multicast_total{device="en5"} 0
    node_network_transmit_multicast_total{device="gif0"} 0
    node_network_transmit_multicast_total{device="lo0"} 0
    node_network_transmit_multicast_total{device="p2p0"} 0
    node_network_transmit_multicast_total{device="stf0"} 0
    node_network_transmit_multicast_total{device="utun0"} 0
    node_network_transmit_multicast_total{device="utun1"} 0
    node_network_transmit_multicast_total{device="utun2"} 0
    node_network_transmit_multicast_total{device="utun3"} 0
    node_network_transmit_multicast_total{device="utun4"} 0
    node_network_transmit_multicast_total{device="utun5"} 0
    node_network_transmit_multicast_total{device="utun6"} 0
    node_network_transmit_multicast_total{device="vboxnet0"} 0
    # HELP node_network_transmit_packets_total Network device statistic transmit_packets.
    # TYPE node_network_transmit_packets_total counter
    node_network_transmit_packets_total{device="XHC0"} 0
    node_network_transmit_packets_total{device="XHC1"} 0
    node_network_transmit_packets_total{device="XHC20"} 0
    node_network_transmit_packets_total{device="awdl0"} 6691
    node_network_transmit_packets_total{device="bridge0"} 1
    node_network_transmit_packets_total{device="en0"} 3.2582836e+07
    node_network_transmit_packets_total{device="en1"} 0
    node_network_transmit_packets_total{device="en2"} 0
    node_network_transmit_packets_total{device="en3"} 0
    node_network_transmit_packets_total{device="en4"} 0
    node_network_transmit_packets_total{device="en5"} 4145
    node_network_transmit_packets_total{device="gif0"} 0
    node_network_transmit_packets_total{device="lo0"} 3.243677e+06
    node_network_transmit_packets_total{device="p2p0"} 0
    node_network_transmit_packets_total{device="stf0"} 0
    node_network_transmit_packets_total{device="utun0"} 2
    node_network_transmit_packets_total{device="utun1"} 3236
    node_network_transmit_packets_total{device="utun2"} 160
    node_network_transmit_packets_total{device="utun3"} 223
    node_network_transmit_packets_total{device="utun4"} 2
    node_network_transmit_packets_total{device="utun5"} 2
    node_network_transmit_packets_total{device="utun6"} 2
    node_network_transmit_packets_total{device="vboxnet0"} 73766
    # HELP node_scrape_collector_duration_seconds node_exporter: Duration of a collector scrape.
    # TYPE node_scrape_collector_duration_seconds gauge
    node_scrape_collector_duration_seconds{collector="cpu"} 0.00013298
    node_scrape_collector_duration_seconds{collector="diskstats"} 0.000803364
    node_scrape_collector_duration_seconds{collector="filesystem"} 0.000119007
    node_scrape_collector_duration_seconds{collector="loadavg"} 2.3448e-05
    node_scrape_collector_duration_seconds{collector="meminfo"} 5.3036e-05
    node_scrape_collector_duration_seconds{collector="netdev"} 0.000338404
    node_scrape_collector_duration_seconds{collector="textfile"} 1.7727e-05
    node_scrape_collector_duration_seconds{collector="time"} 2.8571e-05
    # HELP node_scrape_collector_success node_exporter: Whether a collector succeeded.
    # TYPE node_scrape_collector_success gauge
    node_scrape_collector_success{collector="cpu"} 1
    node_scrape_collector_success{collector="diskstats"} 1
    node_scrape_collector_success{collector="filesystem"} 1
    node_scrape_collector_success{collector="loadavg"} 1
    node_scrape_collector_success{collector="meminfo"} 1
    node_scrape_collector_success{collector="netdev"} 1
    node_scrape_collector_success{collector="textfile"} 1
    node_scrape_collector_success{collector="time"} 1
    # HELP node_textfile_scrape_error 1 if there was an error opening or reading a file, 0 otherwise
    # TYPE node_textfile_scrape_error gauge
    node_textfile_scrape_error 0
    # HELP node_time_seconds System time in seconds since epoch (1970).
    # TYPE node_time_seconds gauge
    node_time_seconds 1.5210412225783854e+09
    # HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
    # TYPE promhttp_metric_handler_requests_in_flight gauge
    promhttp_metric_handler_requests_in_flight 1
    # HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
    # TYPE promhttp_metric_handler_requests_total counter
    promhttp_metric_handler_requests_total{code="200"} 0
    promhttp_metric_handler_requests_total{code="500"} 0
    promhttp_metric_handler_requests_total{code="503"} 0

  49. #phptek @mheap
    # HELP node_filesystem_readonly Filesystem read-only status.
    # TYPE node_filesystem_readonly gauge
    node_filesystem_readonly{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 0
    node_filesystem_readonly{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 0
    node_filesystem_readonly{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 0
    node_filesystem_readonly{device="map -hosts",fstype="autofs",mountpoint="/net"} 0
    node_filesystem_readonly{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
    # HELP node_filesystem_size_bytes Filesystem size in bytes.
    # TYPE node_filesystem_size_bytes gauge
    node_filesystem_size_bytes{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 5.9999997952e+10
    node_filesystem_size_bytes{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 4.3996317696e+11
    node_filesystem_size_bytes{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 4.3996317696e+11
    node_filesystem_size_bytes{device="map -hosts",fstype="autofs",mountpoint="/net"} 0
    node_filesystem_size_bytes{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
    # HELP node_load1 1m load average.
    # TYPE node_load1 gauge
    node_load1 2.451171875
    # HELP node_load15 15m load average.
    # TYPE node_load15 gauge
    node_load15 2.7646484375
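
Each sample line in this output has the same `name{labels} value` shape. A minimal Python sketch of reading one such line, assuming no trailing timestamp and no escaped characters that need unescaping; `parse_sample` is a hypothetical helper, not part of any Prometheus library:

```python
import re

# One sample line of the text exposition format: metric_name{label="value",...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Return (name, labels, value) for a sample line, or None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


name, labels, value = parse_sample(
    'node_filesystem_size_bytes{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 4.3996317696e+11'
)
```

The `# HELP` and `# TYPE` lines are skipped here; a real parser would use them to learn each metric's type.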

  50. #phptek @mheap
    Node Exporter?

  51. #phptek @mheap
    Exporter?

  52. #phptek @mheap
    Exposes metrics

  53. #phptek @mheap
    node_exporter
    Key Description
    arp Exposes ARP statistics from /proc/net/arp.
    cpu Exposes CPU statistics
    filesystem Exposes filesystem statistics, such as disk space used.
    ipvs Exposes IPVS status from /proc/net/ip_vs and stats from /proc/net/ip_vs_stats.
    netstat Exposes network statistics from /proc/net/netstat. This is the same information as netstat -s.
    uname Exposes system information as provided by the uname system call.

  54. #phptek @mheap
    mysqld_exporter
    Key Description
    perf_schema.tablelocks Collect metrics from performance_schema.table_lock_waits_summary_by_table
    info_schema.processlist Collect thread state counts from information_schema.processlist
    binlog_size Collect the current size of all registered binlog files
    auto_increment.columns Collect auto_increment columns and max values from information_schema

  55. #phptek @mheap
    haproxy_exporter
    Key Description
    current_queue Current number of queued requests assigned to this server
    current_sessions Current number of active sessions
    bytes_in_total Current total of incoming bytes
    connection_errors_total Total of connection errors

  56. #phptek @mheap
    memcached_exporter
    Key Description
    bytes_read Total number of bytes read by this server from network
    connections_total Total number of connections opened since the server started running
    items_evicted_total Total number of valid items removed from cache to free memory for new items
    commands_total Total number of all requests broken down by command (get, set, etc.) and status per slab

  57. #phptek @mheap
    Create your
    own metrics

  58. #phptek @mheap
    calls_received_total{
    network="o2",
    number="447700900000",
    type="mobile"
    }
    11

  64. #phptek @mheap
    Increment a counter
    Serve on /metrics
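
A minimal Python sketch of those two steps, using only the standard library. The `Counter` class is a hypothetical stand-in written for illustration, not the API of a real client library; it keeps one value per label set and renders the text format that Prometheus scrapes:

```python
class Counter:
    """A monotonically increasing counter with one value per label set."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._values = {}

    def inc(self, amount=1, **labels):
        if amount < 0:
            raise ValueError("counters can only go up")
        key = tuple(sorted(labels.items()))
        self._values[key] = self._values.get(key, 0) + amount

    def render(self):
        """Return the counter in the text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            label_text = ",".join(f'{k}="{v}"' for k, v in key)
            series = f"{self.name}{{{label_text}}}" if label_text else self.name
            lines.append(f"{series} {value}")
        return "\n".join(lines) + "\n"


calls = Counter("calls_received_total", "Calls received.")
calls.inc(11, network="o2", number="447700900000", type="mobile")
body = calls.render()
```

Returning `body` as `text/plain` from a `/metrics` route is then enough for a scrape to pick it up.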

  65. #phptek @mheap
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: nexmo_calls
        static_configs:
          - targets: ['localhost:3000']

  66. #phptek @mheap
    Increment a counter
    Serve on /metrics

  67. #phptek @mheap
    Hard in PHP

  68. #phptek @mheap
    Pushgateway
    The Prometheus Pushgateway exists to allow ephemeral and batch
    jobs to expose their metrics to Prometheus. Since these kinds of
    jobs may not exist long enough to be scraped, they can instead
    push their metrics to a Pushgateway. The Pushgateway then
    exposes these metrics to Prometheus.
    https://github.com/prometheus/pushgateway
    https://github.com/Lazyshot/prometheus-php
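
A hedged Python sketch of what a push looks like on the wire: an HTTP PUT of exposition-format text to `/metrics/job/<job>` on the gateway. The gateway address `http://localhost:9091` and the helper name `build_push_request` are assumptions for illustration; the request is built but not sent:

```python
import urllib.request
from urllib.parse import quote


def build_push_request(gateway, job, body):
    """Build (but do not send) a PUT that replaces all metrics pushed for `job`."""
    url = f"{gateway.rstrip('/')}/metrics/job/{quote(job, safe='')}"
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


payload = '# TYPE calls_received_total counter\ncalls_received_total{network="o2"} 11\n'
request = build_push_request("http://localhost:9091", "nexmo_calls", payload)
# urllib.request.urlopen(request) would send it to a running Pushgateway.
```

Prometheus then scrapes the gateway like any other target, which is why a short-lived PHP request can report metrics without serving `/metrics` itself.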

  69. #phptek @mheap
    Things to know

  70. #phptek @mheap
    > 5-10 labels is bad

  71. #phptek @mheap
    Secure /metrics

  72. #phptek @mheap
    Metric Types

  73. #phptek @mheap
    Counters
    calls_placed_total

  74. #phptek @mheap
    Gauges
    calls_active

  75. #phptek @mheap
    Histograms
    calls_duration

  76. #phptek @mheap
    Summaries
    calls_duration

  77. #phptek @mheap
    calls_duration_bucket{le="10",network="o2",number="447700900000",type="mobile"} 16
    calls_duration_bucket{le="30",network="o2",number="447700900000",type="mobile"} 63
    calls_duration_bucket{le="60",network="o2",number="447700900000",type="mobile"} 123
    calls_duration_bucket{le="120",network="o2",number="447700900000",type="mobile"} 253
    calls_duration_bucket{le="300",network="o2",number="447700900000",type="mobile"} 618
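
The bucket counts are cumulative: each `le` bucket includes every observation at or below its bound, so the numbers only ever grow down the list. A small Python sketch of that bookkeeping, with bounds borrowed from the slide plus an assumed `+Inf` bucket and made-up durations:

```python
import math

# Upper bounds in seconds; +Inf is an assumption, it is not shown on the slide.
BOUNDS = [10, 30, 60, 120, 300, math.inf]


def bucket_counts(durations, bounds=BOUNDS):
    """Count observations per bucket; each bucket holds everything <= its bound."""
    return {bound: sum(1 for d in durations if d <= bound) for bound in bounds}


# Six hypothetical call durations, in seconds.
counts = bucket_counts([5, 12, 45, 45, 200, 900])
```

The 900 second call only shows up in the `+Inf` bucket, which is why that bucket always equals the total observation count.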

  78. #phptek @mheap
    calls_duration{quantile="0.5"} 85
    calls_duration{quantile="0.9"} 123
    calls_duration{quantile="0.99"} 221
    calls_duration_sum 13130
    calls_duration_count 6

  79. #phptek @mheap
    Counters
    Use for counting events that happen (e.g. total number of requests) and query using rate()
    Gauges
    Use to instrument the current state of a metric (e.g. memory usage, jobs in queue)
    Histograms
    Use to sample observations in order to analyse distribution of a data set (e.g. request latency)
    Summaries
    Use for pre-calculated quantiles on client side, but be mindful of calculation cost and aggregation limitations

  80. #phptek @mheap
    Show me graphs

  81. #phptek @mheap

  82. #phptek @mheap
    PromQL

  83. #phptek @mheap
    calls_placed_total
    Element Value
    calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="BT",number="441234567890",type="landline"}  4
    calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="BT",number="442079460000",type="landline"}  8
    calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="BT",number="442079460000",type="landline"}  1
    calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="o2",number="447700900000",type="mobile"}  6
    calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="o2",number="447908249481",type="mobile"}  7

  84. #phptek @mheap
    calls_placed_total{number="441234567890"}
    Element Value
    calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="BT",number="441234567890",type="landline"}  4

  85. #phptek @mheap
    calls_placed_total{number="441234567890"}[3m]
    Element Value
    calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="BT",number="441234567890",type="landline"}
    3 @1521482766.23
    4 @1521482769.23
    12 @1521482772.229
    16 @1521482775.229
    21 @1521482778.23
    25 @1521482781.23
    27 @1521482784.229
    31 @1521482787.229
    35 @1521482790.229
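
A range vector like this is what `rate()` consumes. A simplified Python sketch of the idea, assuming the samples shown on this slide: total increase divided by elapsed seconds, with a drop in value treated as a counter reset. The real `rate()` also extrapolates to the edges of the window, which this hypothetical `simple_rate` skips:

```python
# (timestamp, value) pairs copied from the slide.
SLIDE_SAMPLES = [
    (1521482766.23, 3), (1521482769.23, 4), (1521482772.229, 12),
    (1521482775.229, 16), (1521482778.23, 21), (1521482781.23, 25),
    (1521482784.229, 27), (1521482787.229, 31), (1521482790.229, 35),
]


def simple_rate(samples):
    """Per-second increase across (timestamp, value) samples, oldest first."""
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A counter that goes down has restarted from zero.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


calls_per_second = simple_rate(SLIDE_SAMPLES)
```

Here the counter grows by 32 over roughly 24 seconds, so the result is a little over 1.3 calls per second.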

  86. #phptek @mheap
    calls_placed_total{number="441234567890"}[3m] offset 1w
    Element Value
    calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="BT",number="441234567890",type="landline"}
    2 @1521311766.23
    7 @1521311769.23
    18 @1523112772.229
    20 @1523112775.229
    27 @1523112778.23
    28 @1523112781.23
    30 @1523112784.229
    36 @1523112787.229
    39 @1523112790.229

  87. #phptek @mheap
    # Total number of calls regardless of any labels
    sum(calls_placed_total)
    # Total number of calls, broken down by the number label
    sum(calls_placed_total) by (number)
    # Total per-second rate over the last 5 minutes by number
    sum(rate(calls_placed_total[5m])) by (number)
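
What `sum(...) by (...)` does can be sketched in a few lines of Python: drop every label except the ones named, then add up the values that end up with the same remaining labels. The series below are a subset of the earlier `calls_placed_total` output, and `sum_by` is a hypothetical helper:

```python
from collections import defaultdict

# (labels, value) pairs; instance and job labels left out for brevity.
SERIES = [
    ({"network": "BT", "number": "441234567890", "type": "landline"}, 4),
    ({"network": "BT", "number": "442079460000", "type": "landline"}, 8),
    ({"network": "o2", "number": "447700900000", "type": "mobile"}, 6),
    ({"network": "o2", "number": "447908249481", "type": "mobile"}, 7),
]


def sum_by(series, *label_names):
    """Sum values grouped by the given label names (no names = one grand total)."""
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple(labels.get(name, "") for name in label_names)
        totals[key] += value
    return dict(totals)


grand_total = sum_by(SERIES)
per_network = sum_by(SERIES, "network")
```

With no label names the result is a single total, which mirrors `sum(calls_placed_total)`.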

  88. #phptek @mheap
    sum(calls_placed_total{network="EE", type="mobile"})

  89. #phptek @mheap
    sum(calls_placed_total{network=~"E.*", type="mobile"})

  90. #phptek @mheap
    sum(calls_placed_total{network!="EE", type="mobile"})
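
The three selectors differ only in the matcher operator: `=`, `=~` and `!=`. A rough Python sketch of how a label set is tested against such matchers. One detail worth knowing is that regex matchers are fully anchored, so `E.*` matches `EE` but not `THREE`. Prometheus uses RE2 syntax; Python's `re` is used here as an approximation, and `matches` is a hypothetical helper:

```python
import re


def matches(labels, matchers):
    """Check a label set against (name, operator, value) matchers."""
    for name, op, value in matchers:
        actual = labels.get(name, "")
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True


ee_mobile = {"network": "EE", "type": "mobile"}
o2_mobile = {"network": "o2", "type": "mobile"}
starts_with_e = [("network", "=~", "E.*"), ("type", "=", "mobile")]
not_ee = [("network", "!=", "EE"), ("type", "=", "mobile")]
```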

  91. #phptek @mheap
    rate(calls_duration_sum{network="EE"}[5m])
    /
    rate(calls_duration_count{network="EE"}[5m])

  92. #phptek @mheap
    histogram_quantile(
    0.95,
    calls_duration_bucket{number=~"[[number]]"}
    )
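
`histogram_quantile` estimates a quantile by finding the bucket the target rank falls into and interpolating linearly inside it. A simplified Python sketch using the earlier `calls_duration_bucket` counts, with an assumed `+Inf` bucket (not shown on the slide) equal to the largest count:

```python
import math

# Cumulative (upper_bound, count) pairs; the +Inf entry is an assumption.
BUCKETS = [(10, 16), (30, 63), (60, 123), (120, 253), (300, 618), (math.inf, 618)]


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative buckets sorted by upper bound."""
    rank = q * buckets[-1][1]
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            span = count - lower_count
            if math.isinf(upper_bound) or span == 0:
                # Cannot interpolate here: fall back to the last finite bound.
                return lower_bound
            fraction = (rank - lower_count) / span
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


p95 = histogram_quantile(0.95, BUCKETS)
```

For these counts the 95th percentile lands inside the 120 to 300 second bucket, so the estimate is only as precise as that bucket is narrow.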

  93. #phptek @mheap
    sum without (duration) (rate(calls_placed_total{number=~"[[number]]"}[3m]))

  94. #phptek @mheap
    predict_linear(calls_active[1h], 86400)
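
`predict_linear` fits a straight line through the samples in the range and extends it the given number of seconds into the future. A minimal least-squares sketch in Python, with made-up samples; as a simplification it extrapolates from the last sample rather than from the query evaluation time:

```python
def predict_linear(samples, seconds_ahead):
    """Fit value = slope * t + intercept and extrapolate past the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept


# Hypothetical gauge readings: 10, 16, 22 active calls at one-minute intervals.
readings = [(0, 10), (60, 16), (120, 22)]
in_one_day = predict_linear(readings, 86400)
```

The readings grow by 0.1 calls per second, so a day later the line sits at about 8662 active calls. A straight line is a blunt tool for a gauge like this, which is worth keeping in mind before alerting on it.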

  95. #phptek @mheap
    Grafana

  96. #phptek @mheap
    Gauges
    calls_active

  97. #phptek @mheap

  98. #phptek @mheap
    Counters
    calls_placed_total

  99. #phptek @mheap

  100. #phptek @mheap

  101. #phptek @mheap
    Histograms
    calls_duration

  102. #phptek @mheap

  103. #phptek @mheap

  104. #phptek @mheap

  105. #phptek @mheap
    Version 5

  106. #phptek @mheap
    Alertmanager

  107. #phptek @mheap
    Alertmanager
    rules

  108. #phptek @mheap
    alert: CallsMonitorDown
    expr: up{job="nexmo_calls"} == 0
    for: 5m
    labels:
      severity: critical
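
The `for: 5m` clause means the expression has to stay true for five minutes before the alert fires; until then it is pending. A small Python sketch of that state machine, assuming a fixed evaluation interval; `alert_state` is a hypothetical helper, not how Prometheus stores this internally:

```python
def alert_state(condition_history, interval_seconds, for_seconds):
    """Return 'inactive', 'pending' or 'firing' from evaluation results, oldest first."""
    true_streak = 0
    for result in condition_history:
        true_streak = true_streak + 1 if result else 0
    if true_streak == 0:
        return "inactive"
    # The first true evaluation starts the clock, so n true results span n - 1 intervals.
    held_for = (true_streak - 1) * interval_seconds
    return "firing" if held_for >= for_seconds else "pending"


# Evaluated every 60 seconds with `for: 5m` (300 seconds).
after_three_failures = alert_state([True, True, True], 60, 300)
after_six_failures = alert_state([True] * 6, 60, 300)
```

A single successful scrape in between resets the streak, which is what keeps a brief blip from paging anyone.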

  109. #phptek @mheap
    alert: LotsOfJobsInQueue
    expr: sum(jobs_in_queue) > 100
    for: 5m
    labels:
      severity: major

  110. #phptek @mheap
    alert: DiskFullInFourHours
    expr: predict_linear(node_filesystem_free_bytes{job="node"}[1h], 4 * 3600) < 0
    for: 5m
    labels:
      severity: major

  111. #phptek @mheap
    alert: HighCallsBeingPlacedOnLandline
    expr: rate(calls_placed_total{network=~".*", type="landline"}[1m]) > 10
    for: 5m
    labels:
      severity: critical
    annotations:
      description: 'Unusually high call count on {{ $labels.network }}'
      summary: 'High call count on {{ $labels.network }}'

  112. #phptek @mheap
    Alertmanager
    alerts

  113. #phptek @mheap
    [ smtp_from: ]
    [ slack_api_url: ]
    [ victorops_api_key: ]
    [ victorops_api_url: | default = "https://alert.victorops.com/integrations/generic/20131114/alert/" ]
    [ pagerduty_url: | default = "https://events.pagerduty.com/v2/enqueue" ]
    [ opsgenie_api_key: ]
    [ opsgenie_api_url: | default = "https://api.opsgenie.com/" ]
    [ hipchat_api_url: | default = "https://api.hipchat.com/" ]
    [ hipchat_auth_token: ]

  114. #phptek @mheap
    Alertmanager
    routes

  115. #phptek @mheap
    route:
      receiver: 'default-receiver'
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      group_by: [cluster, alertname]
      routes:
      - receiver: 'database-pager'
        group_wait: 10s
        match_re:
          service: mysql|cassandra
      - receiver: 'frontend-pager'
        group_by: [product, environment]
        match:
          team: frontend
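
The routing config is a tree: an alert enters at the top, goes to the first child route whose matchers fit its labels, and otherwise stays with the parent's receiver. A simplified Python sketch of that walk, using the receivers and matchers from this slide and assuming the two pager routes are children of the top-level route. It ignores `continue`, grouping and timing options, and `pick_receiver` is a hypothetical helper:

```python
import re

ROUTE = {
    "receiver": "default-receiver",
    "routes": [
        {"receiver": "database-pager", "match_re": {"service": "mysql|cassandra"}},
        {"receiver": "frontend-pager", "match": {"team": "frontend"}},
    ],
}


def route_matches(route, labels):
    """True if every exact and regex matcher on the route fits the alert's labels."""
    exact = all(labels.get(k) == v for k, v in route.get("match", {}).items())
    regex = all(
        re.fullmatch(pattern, labels.get(k, "")) is not None
        for k, pattern in route.get("match_re", {}).items()
    )
    return exact and regex


def pick_receiver(route, labels):
    """Depth-first walk: the first matching child wins, else the current receiver."""
    for child in route.get("routes", []):
        if route_matches(child, labels):
            return pick_receiver(child, labels)
    return route["receiver"]


mysql_alert = pick_receiver(ROUTE, {"alertname": "DatabaseDown", "service": "mysql"})
frontend_alert = pick_receiver(ROUTE, {"alertname": "SlowPage", "team": "frontend"})
other_alert = pick_receiver(ROUTE, {"alertname": "DiskFull", "team": "backend"})
```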

  119. #phptek @mheap
    receivers:
    - name: 'team-X-mails'
      email_configs:
      - to: '[email protected]'
    - name: 'team-X-pager'
      email_configs:
      - to: '[email protected]'
      pagerduty_configs:
      - routing_key:
    - name: 'team-Y-mails'
      email_configs:
      - to: '[email protected]'
    - name: 'team-Y-pager'
      pagerduty_configs:
      - routing_key:
    - name: 'team-DB-pager'
      pagerduty_configs:
      - routing_key:

  120. #phptek @mheap
    Alertmanager
    inhibits

  121. #phptek @mheap
    ! Database is down
    ! User login failure > 100
    ! Report generation failure > 15
    ! GET /healthcheck returned 500

  125. #phptek @mheap
    The DB is the
    root cause

  126. #phptek @mheap
    inhibit_rules:
    - source_match:
        alertname: 'DatabaseDown'
      target_match:
        alertname: 'UserLoginFailure'
      equal: ['instance']
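
In an inhibit rule the source is the alert that does the muting and the target is the alert that gets muted, provided both carry the same values for the `equal` labels. A Python sketch of that check, assuming the database alert is the source and the login failure is the target, in line with the database being the root cause; `is_inhibited` is a hypothetical helper:

```python
RULE = {
    "source_match": {"alertname": "DatabaseDown"},
    "target_match": {"alertname": "UserLoginFailure"},
    "equal": ["instance"],
}


def is_inhibited(alert, firing_alerts, rule):
    """True if `alert` is a target and a firing source shares its `equal` labels."""
    if any(alert.get(k) != v for k, v in rule["target_match"].items()):
        return False
    for other in firing_alerts:
        is_source = all(other.get(k) == v for k, v in rule["source_match"].items())
        same_labels = all(other.get(k) == alert.get(k) for k in rule["equal"])
        if is_source and same_labels:
            return True
    return False


db_down = {"alertname": "DatabaseDown", "instance": "db1"}
login_failure = {"alertname": "UserLoginFailure", "instance": "db1"}
login_muted = is_inhibited(login_failure, [db_down], RULE)
```

The database alert itself is never muted by this rule, so the one notification that still goes out is the one pointing at the root cause.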

  127. #phptek @mheap

  128. #phptek @mheap
    Ingester
    Grouper
    Deduplicator
    Silencer
    Throttler
    Notifier

  135. #phptek @mheap
    Does it scale?

  136. #phptek @mheap
    Yes.

  137. #phptek @mheap
    4.6M time series per server
    72k samples ingested per second, per server
    185 production Prometheus servers

  138. #phptek @mheap
    Prometheus
    federates

  139. #phptek @mheap
    Alertmanager
    gossips

  140. #phptek @mheap
    So that’s Prometheus

  141. #phptek @mheap
    So that’s Prometheus
    (and PromQL, Grafana and Alertmanager)

  142. #phptek @mheap
    @MHEAP
    [email protected]
    HTTPS://JOIND.IN/TALK/845A7
