Prometheus & Time-Series monitoring

Micah Hausler

August 04, 2016

Transcript

  1. There are some drawbacks

    • metric data in multiple formats
    • additional processes running on hosts
    • report failure is not a signal
  2. A single metric

    • 1 data point = 24 bytes
    • 1 min scrape interval
    • 12 hours of data
  3. A single metric

    • 1 data point = 24 bytes
    • 1 min scrape interval
    • 12 hours of data
    • 12h * 60m/h * 24b = 16.88 KB
  4. A single metric

    • 1 data point = 24 bytes
    • 1 min scrape interval
    • 12 hours of data
    • 12h * 60m/h * 24b = 16.88 KB
    • 17 * 2^30 = 17 GB
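The arithmetic on this slide can be checked with a short sketch. The per-series figures come straight from the slide; the series count of 2^20 is an assumption, one plausible reading of how roughly 17 KB per series becomes "17 * 2^30" bytes in total.

```python
# Figures taken from the slide.
BYTES_PER_SAMPLE = 24      # 1 data point = 24 bytes
SCRAPE_INTERVAL_MIN = 1    # 1 min scrape interval
RETENTION_HOURS = 12       # 12 hours of data


def bytes_per_series(retention_hours, interval_min, sample_bytes):
    """Storage needed for one time series over the retention window."""
    samples = retention_hours * 60 // interval_min
    return samples * sample_bytes


per_series = bytes_per_series(RETENTION_HOURS, SCRAPE_INTERVAL_MIN, BYTES_PER_SAMPLE)
per_series_kib = per_series / 1024

# ASSUMPTION (not stated on the slide): about 2**20 series in total.
SERIES_COUNT = 2**20
total_gib = per_series * SERIES_COUNT / 2**30

print(f"{per_series} bytes per series = {per_series_kib:.2f} KiB")
print(f"{total_gib:.2f} GiB for {SERIES_COUNT} series")
```

Under that assumption the total comes out just under 17 GiB, which matches the rounded figure on the slide.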
  5. Metric types

    • counter - a number that only increases
    • gauge - a number that changes dynamically, up or down
    • histogram - samples observations and counts them in buckets; useful for things like Apdex
    • summary - samples observations and calculates quantiles over a sliding time window
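The difference between the first two types can be illustrated with a minimal sketch. The `Counter` and `Gauge` classes below are hypothetical stand-ins written for this example, not the API of any Prometheus client library.

```python
class Counter:
    """Hypothetical sketch: a value that only ever increases."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Hypothetical sketch: a value that can move in either direction."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = float(value)

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount


# A counter suits totals such as requests served.
requests_total = Counter()
requests_total.inc()
requests_total.inc(2)

# A gauge suits current levels such as queue depth.
queue_depth = Gauge()
queue_depth.set(5)
queue_depth.dec(2)
```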
  6. Histogram

    # HELP state_store_seconds Registry write latency in seconds
    # TYPE state_store_seconds histogram
    state_store_seconds_bucket{le="0.005"} 0
    state_store_seconds_bucket{le="0.01"} 0
    state_store_seconds_bucket{le="0.025"} 0
    state_store_seconds_bucket{le="0.05"} 0
    state_store_seconds_bucket{le="0.1"} 0
    state_store_seconds_bucket{le="0.25"} 0
    state_store_seconds_bucket{le="0.5"} 0
    state_store_seconds_bucket{le="1"} 0
    state_store_seconds_bucket{le="2.5"} 0
    state_store_seconds_bucket{le="5"} 0
    state_store_seconds_bucket{le="10"} 0
    state_store_seconds_bucket{le="+Inf"} 0
    state_store_seconds_sum 0
    state_store_seconds_count 0
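The `le` ("less than or equal") buckets in this output are cumulative: each bucket counts every observation at or below its bound, so `+Inf` always equals the total count. A minimal sketch of that bookkeeping follows. The `Histogram` class is hypothetical, written for illustration only, and reuses the bucket bounds and metric name from the slide.

```python
import math

# Bucket upper bounds copied from the slide.
BOUNDS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, math.inf]


class Histogram:
    """Hypothetical sketch of a cumulative-bucket histogram."""

    def __init__(self, name, bounds):
        self.name = name
        self.bounds = list(bounds)
        self.bucket_counts = [0] * len(self.bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # Cumulative: increment every bucket whose bound is >= the value.
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.bucket_counts[i] += 1
        self.sum += value
        self.count += 1

    def render(self):
        """Return sample lines in the text layout shown on the slide."""
        lines = []
        for bound, n in zip(self.bounds, self.bucket_counts):
            le = "+Inf" if math.isinf(bound) else f"{bound:g}"
            lines.append(f'{self.name}_bucket{{le="{le}"}} {n}')
        lines.append(f"{self.name}_sum {self.sum:g}")
        lines.append(f"{self.name}_count {self.count}")
        return "\n".join(lines)


h = Histogram("state_store_seconds", BOUNDS)
for latency in (0.03, 0.2, 7):  # made-up latencies, in seconds
    h.observe(latency)
print(h.render())
```

With no observations, `render()` reproduces the all-zero sample lines shown on the slide.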
  7. Summary

    # HELP go_gc_duration_seconds A summary of the GC invocation durations.
    # TYPE go_gc_duration_seconds summary
    go_gc_duration_seconds{quantile="0"} 0.00019505300000000002
    go_gc_duration_seconds{quantile="0.25"} 0.000299606
    go_gc_duration_seconds{quantile="0.5"} 0.00039634
    go_gc_duration_seconds{quantile="0.75"} 0.000436071
    go_gc_duration_seconds{quantile="1"} 0.000758476
    go_gc_duration_seconds_sum 240.533560116
    go_gc_duration_seconds_count 660994
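A summary exposes precomputed quantiles alongside a running `_sum` and `_count`, so the mean is available by simple division. The sketch below reads the sample lines from the slide with a deliberately simplified parser. `parse_samples` is a hypothetical helper that assumes one `name value` pair per line, with no timestamps and no spaces inside label values.

```python
SUMMARY_TEXT = """\
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0.00019505300000000002
go_gc_duration_seconds{quantile="0.25"} 0.000299606
go_gc_duration_seconds{quantile="0.5"} 0.00039634
go_gc_duration_seconds{quantile="0.75"} 0.000436071
go_gc_duration_seconds{quantile="1"} 0.000758476
go_gc_duration_seconds_sum 240.533560116
go_gc_duration_seconds_count 660994
"""


def parse_samples(text):
    """Map each sample name (labels included) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, value = line.rsplit(" ", 1)
        samples[name] = float(value)
    return samples


samples = parse_samples(SUMMARY_TEXT)

# Mean GC pause = total seconds spent / number of GC invocations.
mean_gc_seconds = (
    samples["go_gc_duration_seconds_sum"] / samples["go_gc_duration_seconds_count"]
)
median_gc_seconds = samples['go_gc_duration_seconds{quantile="0.5"}']

print(f"mean:   {mean_gc_seconds * 1e6:.0f} microseconds")
print(f"median: {median_gc_seconds * 1e6:.0f} microseconds")
```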
  8. Bibliography

    Beyer, Betsy, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media, 2016. Print.
  9. Thanks!

    Micah Hausler @micahhausler

    Skuid is hiring
    • Site Reliability Engineers
    • DevOps Engineers
    • NodeJS Backend Engineers
    • Frontend Engineers
    • Quality Engineers
    • System Engineers