Prometheus: monitoring and alerting with an open source solution

#DevOpsTaipei2017
#DevOpsTaipei

Cheng-Lung Sung

September 08, 2017

Transcript

  1. Prometheus: monitoring and alerting with an open source solution

    Cheng-Lung Sung (clsung@)
  2. Outline (preface)

    • What is Prometheus?
    • What is metrics monitoring
    • Time series DB
    • Component introduction
    • Why choose Prometheus over the alternatives
    • How we adopt Prometheus (code examples in Go)
    • What metrics to measure?
    • Why 50%, 90%, 99%?
    • Not only Prometheus
    • Other open source solutions for DevOps to save the day
  3. Outline (actually)

    • The history of HTC Cloud Service Infrastructure
    • As a DevOps, what you will need
    • Why/How we adopt Prometheus (code examples in Go)
    • Metrics can talk, why 50%, 95%, 99%?
  4. About.Me/clsung

    • Manager of Product Development, HTC Health Care
    • Cloud Service Infrastructure (Golang, Python)
    • Mobile App Development (Golang, Java, Swift, Node.js)
    • Deep Learning Platform (Golang, Python, Node.js)
    • Open Source contributor
    • FreeBSD clsung@FreeBSD.org
    • Golang golang.org/AUTHORS
    • Plurk API www.plurk.com/API

    We’re hiring!
  5. Cloud Service Infrastructure Timeline (timeline figure; axis labels: 2013 Oct, Jan to Dec 2014, 2015)

    Re-join HTC Studio • CSI kickoff in Golang • Introduce Docker • Docker 1.0 Release • Integrate Jenkins with Docker • First App Official Launch • embrace GCP
  6. JIRA-Kanban “Always put the Dev board and the Ops board together = the full picture” -Ruddy

  7. “Any improvements made anywhere besides the bottleneck are an illusion”

    –Gene Kim, The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win
  8. Docker for binary on Android

  9. Go 1.3 vendor

  10. Go 1.3 vendor

  11. “Focus on product/process, not technology”

  12. None
  13. “Run Docker containers on Google Cloud Platform, powered by Kubernetes.”

    –Google Container Engine
  14. Cloud Service Infrastructure Timeline (timeline figure; axis labels: 2013 Oct, Jan to Dec 2014, 2015)

    Re-join HTC Studio • CSI kickoff in Golang • Introduce Docker • Docker 1.0 Release • Integrate Jenkins with Docker • First App Official Launch • Dockerized Everything • embrace GCP
  15. “It’s not the upfront capital that kills you, it’s the operations and maintenance on the back end”

    –Gene Kim, The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win
  16. Handcrafted deployment script

    • build / test docker images
    • update GCE services
    • health-check
    • autoscaler
    • instance-template / instance-group
    • push configuration
  17. Docker with EFK. Also deployed with autoscaler on custom metrics

  18. None
  19. DevOps! What will you need?

  20. “The three treasures of DevOps: Logging, Tracing, Monitoring” –Useful DevOps

  21. Tracing Appdash, Jaeger, LightStep, Zipkin… (Golang)

  22. System level metrics Prometheus + Grafana

  23. Application metrics Prometheus + Grafana

  24. Time Series Metrics

    • A time series is
    • a series of numeric data points of some particular metric over time
    • each consists of a metric plus one or more tags associated with this metric
    • A metric is
    • any particular piece of data to track over time
    • e.g. hits to an Apache hosted file

    http://opentsdb.net/docs/build/html/user_guide/query/timeseries.html

    HTC CSI Repo
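The definition on this slide (a metric plus one or more tags, with numeric data points over time) can be sketched as a small Go data structure. This is only an illustration: the type names `Point` and `Series`, the `Key` method, and the `apache_hits` metric with its tag values are hypothetical and do not come from the talk or from any particular time series database.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

// Point is one numeric sample of a metric at a point in time.
type Point struct {
	Timestamp time.Time
	Value     float64
}

// Series is a metric name plus its tags, with samples appended in time order.
// (Hypothetical type for illustration only.)
type Series struct {
	Metric string
	Tags   map[string]string
	Points []Point
}

// Key builds a canonical identifier from the metric name and the tags sorted
// by tag name, so the same metric+tags combination always yields the same key.
func (s Series) Key() string {
	names := make([]string, 0, len(s.Tags))
	for name := range s.Tags {
		names = append(names, name)
	}
	sort.Strings(names)
	pairs := make([]string, 0, len(names))
	for _, name := range names {
		pairs = append(pairs, name+"="+s.Tags[name])
	}
	return s.Metric + "{" + strings.Join(pairs, ",") + "}"
}

func main() {
	// Made-up example: hits to one Apache hosted file on one host.
	hits := Series{
		Metric: "apache_hits",
		Tags:   map[string]string{"host": "web01", "file": "/index.html"},
	}
	hits.Points = append(hits.Points, Point{Timestamp: time.Now(), Value: 1})
	fmt.Println(hits.Key(), len(hits.Points))
}
```

Sorting the tag names matters because Go map iteration order is not stable; without it, the same series could produce different keys.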
  25. “The Four Golden Signals: Latency, Traffic, Errors and Saturation”

    –Site Reliability Engineering: How Google Runs Production Systems
  26. DevOps vs SRE https://en.wikipedia.org/wiki/Site_reliability_engineering#DevOps_vs_SRE

  27. How to measure? https://honeycomb.io/blog/2017/01/instrumentation-the-first-four-things-you-measure/

  28. // Count is a go-restful filter that counts REST call statistics.

    // It counts the following: request count, request round-trip time, response success count
    // and rate, and response length.
    func (lf *CounterFilter) Count(req *restful.Request, resp *restful.Response, c *restful.FilterChain) {
        start := time.Now()
        target := lf.getTarget(req)

        // Record request duration
        defer ctr.Time(makeReqTime(strOverall), start, Alert|Avg)
        defer ctr.Time(makeReqTime(target), start)

        // Count request sizes
        ctr.Val(makeReqSize(strOverall), req.Request.ContentLength, Avg|Rate)
        ctr.Val(makeReqSize(target), req.Request.ContentLength, Avg|Rate)

        // Count request QPS
        ctr.Event(makeReqEvent(strOverall), 1, Alert|Rate)
        ctr.Event(makeReqEvent(target), 1)

        c.ProcessFilter(req, resp)

        // Count response codes
        statusCode := resp.StatusCode()
        ctr.Event(makeRespCode(strOverall, statusCode), 1)
        ctr.Event(makeRespCode(target, statusCode), 1)
  29. Why Prometheus?

  30. Don't Reinvent The Wheel, Unless You Plan on Learning More About Wheels
  31. Prometheus Architecture https://prometheus.io/

  32. Cloud Native Computing Foundation https://www.cncf.io/

  33. Integrations, Third-party libraries https://promcon.io/2016-berlin/talks/prometheus-design-and-philosophy/

  34. Prometheus Metric Types Counter Gauge Histogram Summary

  35. // Counter is a wrapper of metrics.*

    type Counter struct {
        ctr metrics.Counter
        err metrics.Counter
        dur metrics.Histogram
        dus metrics.Histogram
        gau metrics.Gauge
    }

    // NewCounter returns a new Counter
    func NewCounter(namespace, subsystem string) *Counter {
        …
        return &Counter{
            ctr: kitprometheus.NewCounter(ctr),
            err: kitprometheus.NewCounter(errCtr),
            dur: kitprometheus.NewHistogram(dur),
            dus: kitprometheus.NewSummary(dus),
            gau: kitprometheus.NewGauge(gau),
        }
    }

    // Duration records the duration in seconds
    func (c *Counter) Duration(function string, value float64) {
        c.dur.With("func", function).Observe(value)
        c.dus.With("func", function).Observe(value)
    }

    // Event records the function call event
    func (c *Counter) Event(function string, value int) {
        c.ctr.With("func", function).Add(float64(value))
    }

    // Err records the function call error
    func (c *Counter) Err(function string, value int) {
        c.err.With("func", function).Add(float64(value))
  36. // Record request duration

    defer ctr.Duration(makeReqTime(target), time.Now())

    // Expand the span for tracing
    span := utils.StartSpanFromContextWithSpan(ctx, target, opentracing.SpanFromContext(ctx))
    defer span.Finish()

    // Count request sizes
    ctr.Event(makeReqSize(target), req.Request.ContentLength)

    // Count request QPS
    ctr.Event(makeReqEvent(target), 1)

    c.ProcessFilter(req, resp)

    // Count response codes
    statusCode := resp.StatusCode()
    ctr.Event(makeRespCode(target, statusCode), 1)

    // StatusBadRequest is 400
    if statusCode >= http.StatusBadRequest {
        ctr.Err(makeRespErrorEvent(target), 1)
        log.Errorf("[http status] %s: %d (ms: %d)", target, statusCode, elapsedInMs)
  37. Why do 50%, 95%, 99% matter? https://www.dynatrace.com/blog/why-averages-suck-and-percentiles-are-great/

  38. Average: Taiwan's average annual salary is NT$547,000, while the gap between rich and poor reaches 12.6 times

  39. Averages don't tell stories https://www.dynatrace.com/blog/why-averages-suck-and-percentiles-are-great/ http://collider.com/the-flash-movie-ezra-miller-barry-allen/

  40. None
  41. None
  42. None
  43. 99%

  44. “Prevention is better than cure” –On Metrics Monitoring

  45. Thank you!