Prometheus: monitoring and alerting with an open source solution

Cheng-Lung Sung (clsung@)

Outline (preface)
• What is Prometheus?
• What is metrics monitoring?
• Time series DB
• Component introduction
• Why choose Prometheus over the alternatives?
• How we adopt Prometheus (code examples in Go)
• What metrics to measure?
• Why 50%, 90%, 99%?
• Not only Prometheus
• Other open source solutions for DevOps to save the day

Outline (actually)
• The history of HTC Cloud Service Infrastructure
• As a DevOps, what you will need
• Why/How we adopt Prometheus (code examples in Go)
• Metrics can talk, why 50%, 95%, 99%?

About.Me/clsung
• Manager of Product Development, HTC Health Care
• Cloud Service Infrastructure (Golang, Python)
• Mobile App Development (Golang, Java, Swift, Node.js)
• Deep Learning Platform (Golang, Python, Node.js)
• Open Source contributor
• FreeBSD [email protected]
• Golang golang.org/AUTHORS
• Plurk API www.plurk.com/API
We’re hiring!

Cloud Service Infrastructure Timeline (2013 Oct, 2014, 2015)
• Re-join HTC Studio
• Docker 1.0 Release
• Integrate Jenkins with Docker
• CSI kickoff in Golang
• Introduce Docker
• First App Official Launch
• embrace GCP

JIRA-Kanban “You must put the Dev board and the Ops board together to see the full picture” -Ruddy

“Any improvements made anywhere besides the bottleneck are an illusion”
–Gene Kim, The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win

Docker for binary on Android

Go 1.3 vendor

“Focus on product/process, not technology”

“Run Docker containers on Google Cloud Platform, powered by Kubernetes.”
–Google Container Engine

Cloud Service Infrastructure Timeline (2013 Oct, 2014, 2015)
• Re-join HTC Studio
• CSI kickoff in Golang
• Introduce Docker
• Docker 1.0 Release
• Integrate Jenkins with Docker
• First App Official Launch
• Dockerized Everything
• embrace GCP

“It’s not the upfront capital that kills you, it’s the operations and maintenance on the back end”
–Gene Kim, The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win

Handcrafted deployment script
• build / test docker images
• update GCE services
• health-check
• autoscaler
• instance-template / instance-group
• push configuration

Docker with EFK Also deployed with autoscaler on custom metrics

DevOps! What will you need?

“The three treasures of DevOps: Logging, Tracing, Monitoring”
–Useful DevOps

Tracing Appdash, Jaeger, LightStep, Zipkin… (Golang)

System level metrics Prometheus + Grafana

Application metrics Prometheus + Grafana

Time Series Metrics
• Time Series is
  • a series of numeric data points of some particular metric over time.
  • each consists of a metric plus one or more tags associated with this metric
• Metric is
  • any particular piece of data to track over time
  • e.g. hits to an Apache hosted file
http://opentsdb.net/docs/build/html/user_guide/query/timeseries.html
HTC CSI Repo
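The "metric plus tags" definition can be pictured as a tiny in-memory store. This is only a sketch under stated assumptions: `Point`, `SeriesKey`, and `Store` are hypothetical names, the `http.hits` metric and its `host`/`file` tags are invented for illustration, and no real time series database stores data this simply.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Point is one numeric sample of a metric at a point in time.
type Point struct {
	UnixSeconds int64
	Value       float64
}

// SeriesKey builds a canonical identity for a time series from a metric
// name and its tags. Tag names are sorted so the same set of tags always
// yields the same key, regardless of map iteration order.
func SeriesKey(metric string, tags map[string]string) string {
	names := make([]string, 0, len(tags))
	for name := range tags {
		names = append(names, name)
	}
	sort.Strings(names)
	parts := make([]string, 0, len(names))
	for _, name := range names {
		parts = append(parts, name+"="+tags[name])
	}
	return metric + "{" + strings.Join(parts, ",") + "}"
}

// Store keeps the points of every series in memory, in insertion order.
type Store struct {
	series map[string][]Point
}

// NewStore returns an empty store.
func NewStore() *Store {
	return &Store{series: make(map[string][]Point)}
}

// Append adds one data point to the series identified by metric + tags.
func (s *Store) Append(metric string, tags map[string]string, p Point) {
	key := SeriesKey(metric, tags)
	s.series[key] = append(s.series[key], p)
}

func main() {
	st := NewStore()
	// Hypothetical metric and tag names, loosely following the
	// "hits to an Apache hosted file" example.
	index := map[string]string{"host": "web01", "file": "/index.html"}
	logo := map[string]string{"host": "web01", "file": "/logo.png"}
	st.Append("http.hits", index, Point{1500000000, 10})
	st.Append("http.hits", index, Point{1500000060, 17})
	st.Append("http.hits", logo, Point{1500000000, 3})

	fmt.Println(len(st.series), "series")
	key := SeriesKey("http.hits", index)
	fmt.Println(key, st.series[key])
}
```

The same metric name with a different tag value becomes a separate series, which is why adding a tag with many distinct values multiplies the number of series to store.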

“The Four Golden Signals: Latency, Traffic, Errors and Saturation”
–Site Reliability Engineering: How Google Runs Production Systems

DevOps vs SRE https://en.wikipedia.org/wiki/Site_reliability_engineering#DevOps_vs_SRE

How to measure? https://honeycomb.io/blog/2017/01/instrumentation-the-first-four-things-you-measure/

// Count is a go-restful filter that counts REST call statistics.
// It counts the following: request count, request round-trip time, response success count
// and rate, and response length.
func (lf *CounterFilter) Count(req *restful.Request, resp *restful.Response, c *restful.FilterChain) {
    start := time.Now()
    target := lf.getTarget(req)

    // Record request duration
    defer ctr.Time(makeReqTime(strOverall), start, Alert|Avg)
    defer ctr.Time(makeReqTime(target), start)

    // Count request sizes
    ctr.Val(makeReqSize(strOverall), req.Request.ContentLength, Avg|Rate)
    ctr.Val(makeReqSize(target), req.Request.ContentLength, Avg|Rate)

    // Count request QPS
    ctr.Event(makeReqEvent(strOverall), 1, Alert|Rate)
    ctr.Event(makeReqEvent(target), 1)

    c.ProcessFilter(req, resp)

    // Count response codes
    statusCode := resp.StatusCode()
    ctr.Event(makeRespCode(strOverall, statusCode), 1)
    ctr.Event(makeRespCode(target, statusCode), 1)
}

Why Prometheus?

Don't Reinvent The Wheel Unless You Plan on Learning More About Wheels

Prometheus Architecture https://prometheus.io/

Cloud Native Computing Foundation https://www.cncf.io/

Integrations, Third-party libraries https://promcon.io/2016-berlin/talks/prometheus-design-and-philosophy/

Prometheus Metric Types
• Counter
• Gauge
• Histogram
• Summary
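To make the difference between the types concrete, here is a standard-library-only sketch of the semantics of three of them. It is not the Prometheus client API: the `Counter`, `Gauge`, and `Histogram` types below are hypothetical stand-ins, and Summary (which computes quantiles on the client side) is left out to keep the example short.

```go
package main

import "fmt"

// Counter only ever goes up, e.g. the number of requests served.
type Counter struct{ v float64 }

// Add increases the counter. Negative deltas are rejected because a
// counter must be monotonic.
func (c *Counter) Add(delta float64) {
	if delta < 0 {
		panic("counter cannot decrease")
	}
	c.v += delta
}

// Gauge can go up and down, e.g. the number of in-flight requests.
type Gauge struct{ v float64 }

// Set replaces the current value.
func (g *Gauge) Set(v float64) { g.v = v }

// Add moves the value by delta, which may be negative.
func (g *Gauge) Add(delta float64) { g.v += delta }

// Histogram counts observations into cumulative buckets: counts[i] is the
// number of observations <= bounds[i]. It also tracks the sum and count.
type Histogram struct {
	bounds []float64 // bucket upper bounds, ascending
	counts []uint64
	sum    float64
	count  uint64
}

// NewHistogram returns a histogram with the given ascending upper bounds.
func NewHistogram(bounds ...float64) *Histogram {
	return &Histogram{bounds: bounds, counts: make([]uint64, len(bounds))}
}

// Observe records one value in every bucket whose bound is >= v.
func (h *Histogram) Observe(v float64) {
	for i, bound := range h.bounds {
		if v <= bound {
			h.counts[i]++
		}
	}
	h.sum += v
	h.count++
}

func main() {
	var served Counter
	var inFlight Gauge
	latency := NewHistogram(0.1, 0.5, 1) // seconds; bounds chosen arbitrarily

	for _, seconds := range []float64{0.05, 0.3, 2} {
		inFlight.Add(1)
		latency.Observe(seconds)
		served.Add(1)
		inFlight.Add(-1)
	}

	fmt.Println("served:", served.v)
	fmt.Println("in flight:", inFlight.v)
	fmt.Println("bucket counts:", latency.counts, "total:", latency.count)
}
```

The 2-second observation falls outside every bucket and only shows up in the total count, which is the role of the implicit `+Inf` bucket in Prometheus.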

// Counter is a wrapper of metrics.*
type Counter struct {
    ctr metrics.Counter
    err metrics.Counter
    dur metrics.Histogram
    dus metrics.Histogram
    gau metrics.Gauge
}

// NewCounter returns a new Counter
func NewCounter(namespace, subsystem string) *Counter {
    …
    return &Counter{
        ctr: kitprometheus.NewCounter(ctr),
        err: kitprometheus.NewCounter(errCtr),
        dur: kitprometheus.NewHistogram(dur),
        dus: kitprometheus.NewSummary(dus),
        gau: kitprometheus.NewGauge(gau),
    }
}

// Duration records the duration in seconds
func (c *Counter) Duration(function string, value float64) {
    c.dur.With("func", function).Observe(value)
    c.dus.With("func", function).Observe(value)
}

// Event records the function call event
func (c *Counter) Event(function string, value int) {
    c.ctr.With("func", function).Add(float64(value))
}

// Err records the function call error
func (c *Counter) Err(function string, value int) {
    c.err.With("func", function).Add(float64(value))
}

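// Record request duration in seconds. This assumes Duration takes a
// float64 number of seconds, as in the Counter wrapper. The elapsed time
// is computed inside the deferred closure so it is measured on return.
start := time.Now()
defer func() {
    ctr.Duration(makeReqTime(target), time.Since(start).Seconds())
}()

// Expand the span for tracing
span := utils.StartSpanFromContextWithSpan(ctx, target, opentracing.SpanFromContext(ctx))
defer span.Finish()

// Count request sizes (ContentLength is an int64; Event takes an int)
ctr.Event(makeReqSize(target), int(req.Request.ContentLength))

// Count request QPS
ctr.Event(makeReqEvent(target), 1)

c.ProcessFilter(req, resp)

// Count response codes
statusCode := resp.StatusCode()
ctr.Event(makeRespCode(target, statusCode), 1)

// StatusBadRequest is 400
if statusCode >= http.StatusBadRequest {
    ctr.Err(makeRespErrorEvent(target), 1)
    // elapsedInMs is not defined in this excerpt
    log.Errorf("[http status] %s: %d (ms: %d)", target, statusCode, elapsedInMs)
}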

Why do 50%, 95%, 99% matter? https://www.dynatrace.com/blog/why-averages-suck-and-percentiles-are-great/

Average: Taiwan's average annual salary is NT$547,000, while the wealth gap reaches 12.6 times

Averages don't tell stories https://www.dynatrace.com/blog/why-averages-suck-and-percentiles-are-great/ http://collider.com/the-flash-movie-ezra-miller-barry-allen/
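A small worked example shows why the average hides the tail. The latency numbers below are made up for illustration, and `Percentile` uses the simple nearest-rank method; monitoring systems typically estimate percentiles from histogram buckets instead, so treat this as a sketch of the idea only.

```go
package main

import (
	"fmt"
	"sort"
)

// Mean returns the arithmetic average of xs, or 0 for an empty slice.
func Mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// Percentile returns the p-th percentile (1 <= p <= 100) of xs using the
// nearest-rank method: the smallest sample such that at least p percent of
// all samples are less than or equal to it. It returns 0 for an empty slice.
func Percentile(xs []float64, p int) float64 {
	if len(xs) == 0 {
		return 0
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]float64(nil), xs...)
	sort.Float64s(sorted)
	// Integer ceiling of p*n/100 avoids floating-point rounding surprises.
	rank := (p*len(sorted) + 99) / 100
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	// Hypothetical request latencies in milliseconds:
	// 90 fast requests, 8 slower ones, and 2 very slow outliers.
	var latencies []float64
	for i := 0; i < 90; i++ {
		latencies = append(latencies, 10)
	}
	for i := 0; i < 8; i++ {
		latencies = append(latencies, 100)
	}
	for i := 0; i < 2; i++ {
		latencies = append(latencies, 5000)
	}

	fmt.Println("mean:", Mean(latencies))
	for _, p := range []int{50, 95, 99} {
		fmt.Printf("p%d: %v\n", p, Percentile(latencies, p))
	}
}
```

With this data the mean is 117 ms, a latency that no request actually experienced, while p50 is 10 ms, p95 is 100 ms, and p99 is 5000 ms. The three percentiles together describe the typical request, the slow requests, and the outliers.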

“Prevention is better than cure”
–On Metrics Monitoring

Thank you!
