Prometheus: monitoring and alerting with an open source solution

#DevOpsTaipei2017
#DevOpsTaipei

Cheng-Lung Sung

September 08, 2017

Transcript

  1. 2.

    Outline (preface)
    • What is Prometheus?
    • What is metrics monitoring?
    • Time series DB
    • Component introduction
    • Why choose Prometheus over the alternatives?
    • How we adopt Prometheus (code examples in Go)
    • What metrics to measure?
    • Why 50%, 90%, 99%?
    • Not only Prometheus
    • Other open source solutions for DevOps to save the day
  2. 3.

    Outline (actually)
    • The history of HTC Cloud Service Infrastructure
    • What you will need as a DevOps engineer
    • Why/How we adopt Prometheus (code examples in Go)
    • Metrics can talk: why 50%, 95%, 99%?
  3. 4.

    About.Me/clsung
    • Manager of Product Development, HTC Health Care
        • Cloud Service Infrastructure (Golang, Python)
        • Mobile App Development (Golang, Java, Swift, Node.js)
        • Deep Learning Platform (Golang, Python, Node.js)
    • Open Source contributor
        • FreeBSD clsung@FreeBSD.org
        • Golang golang.org/AUTHORS
        • Plurk API www.plurk.com/API
    We’re hiring!
  4. 5.

    [Figure: Cloud Service Infrastructure Timeline, monthly axis from Oct 2013 through 2014 into 2015]
    Milestones shown:
    • Re-join HTC Studio
    • CSI kickoff in Golang
    • Introduce Docker
    • Docker 1.0 Release
    • Integrate Jenkins with Docker
    • First App Official Launch
    • embrace GCP
  5. 7.

    “Any improvements made anywhere besides the bottleneck are an illusion”
    –Gene Kim, The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win
  6. 12.
  7. 14.

    [Figure: Cloud Service Infrastructure Timeline, monthly axis from Oct 2013 through 2014 into 2015]
    Milestones shown:
    • Re-join HTC Studio
    • CSI kickoff in Golang
    • Introduce Docker
    • Docker 1.0 Release
    • Integrate Jenkins with Docker
    • First App Official Launch
    • Dockerized Everything
    • embrace GCP
  8. 15.

    “It’s not the upfront capital that kills you, it’s the operations and maintenance on the back end”
    –Gene Kim, The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win
  9. 16.

    Handcrafted deployment script
    • build / test docker images
    • update GCE services
    • health-check
    • autoscaler
    • instance-template / instance-group
    • push configuration
  10. 18.
  11. 24.

    Time Series Metrics
    • Time Series is
        • a series of numeric data points of some particular metric over time.
        • each consists of a metric plus one or more tags associated with this metric
    • Metric is
        • any particular piece of data to track over time
        • e.g. hits to an Apache hosted file
    http://opentsdb.net/docs/build/html/user_guide/query/timeseries.html
    HTC CSI Repo
  12. 25.

    “The Four Golden Signals: Latency, Traffic, Errors and Saturation”
    –Site Reliability Engineering: How Google Runs Production Systems
  13. 28.

    // Count is a go-restful filter that counts REST call statistics.
    // It counts the following: request count, request round-trip time, response success count
    // and rate, and response length.
    func (lf *CounterFilter) Count(req *restful.Request, resp *restful.Response, c *restful.FilterChain) {
        start := time.Now()
        target := lf.getTarget(req)

        // Record request duration
        defer ctr.Time(makeReqTime(strOverall), start, Alert|Avg)
        defer ctr.Time(makeReqTime(target), start)

        // Count request sizes
        ctr.Val(makeReqSize(strOverall), req.Request.ContentLength, Avg|Rate)
        ctr.Val(makeReqSize(target), req.Request.ContentLength, Avg|Rate)

        // Count request QPS
        ctr.Event(makeReqEvent(strOverall), 1, Alert|Rate)
        ctr.Event(makeReqEvent(target), 1)

        c.ProcessFilter(req, resp)

        // Count response codes
        statusCode := resp.StatusCode()
        ctr.Event(makeRespCode(strOverall, statusCode), 1)
        ctr.Event(makeRespCode(target, statusCode), 1)
  14. 35.

    // Counter is a wrapper of metrics.*
    type Counter struct {
        ctr metrics.Counter
        err metrics.Counter
        dur metrics.Histogram
        dus metrics.Histogram
        gau metrics.Gauge
    }

    // NewCounter return a new Counter
    func NewCounter(namespace, subsystem string) *Counter {
        …
        return &Counter{
            ctr: kitprometheus.NewCounter(ctr),
            err: kitprometheus.NewCounter(errCtr),
            dur: kitprometheus.NewHistogram(dur),
            dus: kitprometheus.NewSummary(dus),
            gau: kitprometheus.NewGauge(gau),
        }
    }

    // Duration records the duration in seconds
    func (c *Counter) Duration(function string, value float64) {
        c.dur.With("func", function).Observe(value)
        c.dus.With("func", function).Observe(value)
    }

    // Event records the function call event
    func (c *Counter) Event(function string, value int) {
        c.ctr.With("func", function).Add(float64(value))
    }

    // Err records the function call error
    func (c *Counter) Err(function string, value int) {
        c.err.With("func", function).Add(float64(value))
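The wrapper above feeds every duration into both a histogram and a summary. To make the histogram side more concrete, here is a minimal standard-library sketch of cumulative upper-bound buckets, in the spirit of Prometheus `le` buckets. It is a simplification, not the go-kit or Prometheus client implementation; the `Histogram` type, `NewHistogram`, and the bucket bounds are assumptions chosen for illustration.

```go
package main

import "fmt"

// Histogram counts observations into cumulative upper-bound buckets.
// Simplified illustration only; not the Prometheus client library.
type Histogram struct {
	Bounds []float64 // ascending upper bounds, in seconds
	Counts []uint64  // Counts[i] is the number of observations <= Bounds[i]
	Count  uint64    // total number of observations
	Sum    float64   // sum of all observed values
}

// NewHistogram creates a histogram with the given ascending bucket bounds.
func NewHistogram(bounds ...float64) *Histogram {
	return &Histogram{Bounds: bounds, Counts: make([]uint64, len(bounds))}
}

// Observe records one value in every bucket whose bound is >= the value.
func (h *Histogram) Observe(v float64) {
	for i, b := range h.Bounds {
		if v <= b {
			h.Counts[i]++
		}
	}
	h.Count++
	h.Sum += v
}

func main() {
	// Assumed bucket bounds: 100 ms, 500 ms, 1 s.
	h := NewHistogram(0.1, 0.5, 1)
	for _, v := range []float64{0.05, 0.2, 0.3, 2} {
		h.Observe(v)
	}
	fmt.Println(h.Counts, h.Count)
}
```

Because the buckets are cumulative, the share of requests faster than a bound is simply that bucket's count divided by the total count, which is what makes percentile estimates from bucket counts possible.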
  15. 36.

    // Record request duration
    defer ctr.Duration(makeReqTime(target), time.Now())

    // Expand the span for tracing
    span := utils.StartSpanFromContextWithSpan(ctx, target, opentracing.SpanFromContext(ctx))
    defer span.Finish()

    // Count request sizes
    ctr.Event(makeReqSize(target), req.Request.ContentLength)

    // Count request QPS
    ctr.Event(makeReqEvent(target), 1)

    c.ProcessFilter(req, resp)

    // Count response codes
    statusCode := resp.StatusCode()
    ctr.Event(makeRespCode(target, statusCode), 1)

    // StatusBadRequest is 400
    if statusCode >= http.StatusBadRequest {
        ctr.Err(makeRespErrorEvent(target), 1)
        log.Errorf("[http status] %s: %d (ms: %d)", target, statusCode, elapsedInMs)
  16. 40.
  17. 41.
  18. 42.
  19. 43.

    99%