Slide 1

Slide 1 text

Prometheus: monitoring and alerting with an open source solution Cheng-Lung Sung (clsung@)

Slide 2

Slide 2 text

Outline (preface) • What is Prometheus? • What is metrics monitoring? • Time series DB • Component introduction • Why choose Prometheus over the alternatives • How we adopt Prometheus (code examples in Go) • What metrics to measure? • Why 50%, 90%, 99%? • Not only Prometheus • Other open source solutions for DevOps to save the day

Slide 3

Slide 3 text

Outline (actually) • The history of HTC Cloud Service Infrastructure • As a DevOps engineer, what you will need • Why/How we adopt Prometheus (code examples in Go) • Metrics can talk: why 50%, 95%, 99%?

Slide 4

Slide 4 text

About.Me/clsung • Manager of Product Development, HTC Health Care • Cloud Service Infrastructure (Golang, Python) • Mobile App Development (Golang, Java, Swift, Node.js) • Deep Learning Platform (Golang, Python, Node.js) • Open Source contributor • FreeBSD [email protected] • Golang golang.org/AUTHORS • Plurk API www.plurk.com/API We’re hiring!

Slide 5

Slide 5 text

Cloud Service Infrastructure Timeline (Oct 2013 to 2015, monthly axis for 2014): Re-join HTC Studio • CSI kickoff in Golang • Introduce Docker • Docker 1.0 Release • Integrate Jenkins with Docker • First App Official Launch • embrace GCP

Slide 6

Slide 6 text

JIRA-Kanban “The Dev board + the Ops board must be put together = the full picture” -Ruddy

Slide 7

Slide 7 text

–Gene Kim, The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win “Any improvements made anywhere besides the bottleneck are an illusion”

Slide 8

Slide 8 text

Docker for binary on Android

Slide 9

Slide 9 text

Go 1.3 vendor

Slide 10

Slide 10 text

Go 1.3 vendor

Slide 11

Slide 11 text

“Focus on product/process, not technology”

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

–Google Container Engine “Run Docker containers on Google Cloud Platform, powered by Kubernetes.”

Slide 14

Slide 14 text

Cloud Service Infrastructure Timeline (Oct 2013 to 2015, monthly axis for 2014): Re-join HTC Studio • CSI kickoff in Golang • Introduce Docker • Docker 1.0 Release • Integrate Jenkins with Docker • First App Official Launch • Dockerized Everything • embrace GCP

Slide 15

Slide 15 text

–Gene Kim, The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win “It’s not the upfront capital that kills you, it’s the operations and maintenance on the back end”

Slide 16

Slide 16 text

Handcrafted deployment script • build / test docker images • update GCE services • health-check • autoscaler • instance-template / instance-group • push configuration

Slide 17

Slide 17 text

Docker with EFK • Also deployed with autoscaler on custom metrics

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

DevOps! What will you need?

Slide 20

Slide 20 text

–A useful DevOps “The three treasures of DevOps: Logging, Tracing, Monitoring”

Slide 21

Slide 21 text

Tracing Appdash, Jaeger, LightStep, Zipkin… (Golang)

Slide 22

Slide 22 text

System level metrics Prometheus + Grafana

Slide 23

Slide 23 text

Application metrics Prometheus + Grafana

Slide 24

Slide 24 text

Time Series Metrics • A time series is • a series of numeric data points of some particular metric over time • each series consists of a metric plus one or more tags associated with that metric • A metric is • any particular piece of data to track over time • e.g. hits to an Apache-hosted file http://opentsdb.net/docs/build/html/user_guide/query/timeseries.html HTC CSI Repo
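The definition above can be sketched in Go. This is only an illustration: the type names `Point` and `Series`, the metric name `apache_hits`, and the tag values are assumptions made for the example and come from neither the talk nor any library.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

// Point is one numeric observation at a moment in time.
type Point struct {
	At    time.Time
	Value float64
}

// Series is a metric name plus tags, with its data points over time.
// Both type names are illustrative, not taken from any library.
type Series struct {
	Metric string
	Tags   map[string]string
	Points []Point
}

// Key renders the series identity as metric{k="v",...} with sorted tag
// names, so the same metric and tags always map to the same series.
func (s Series) Key() string {
	names := make([]string, 0, len(s.Tags))
	for k := range s.Tags {
		names = append(names, k)
	}
	sort.Strings(names)
	pairs := make([]string, 0, len(names))
	for _, k := range names {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, s.Tags[k]))
	}
	return s.Metric + "{" + strings.Join(pairs, ",") + "}"
}

func main() {
	start := time.Date(2017, 1, 1, 0, 0, 0, 0, time.UTC)
	hits := Series{
		Metric: "apache_hits", // hypothetical metric name
		Tags:   map[string]string{"host": "web01", "file": "/index.html"},
	}
	// One data point per minute for the same metric and tag set.
	for i, v := range []float64{10, 12, 15} {
		at := start.Add(time.Duration(i) * time.Minute)
		hits.Points = append(hits.Points, Point{At: at, Value: v})
	}
	fmt.Println(hits.Key(), len(hits.Points))
}
```

Changing any tag value (for example a different `host`) yields a different key, and therefore a separate time series for the same metric.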

Slide 25

Slide 25 text

Site Reliability Engineering – How Google Runs Production Systems “The Four Golden Signals: Latency, Traffic, Errors and Saturation”

Slide 26

Slide 26 text

DevOps vs SRE https://en.wikipedia.org/wiki/Site_reliability_engineering#DevOps_vs_SRE

Slide 27

Slide 27 text

How to measure? https://honeycomb.io/blog/2017/01/instrumentation-the-first-four-things-you-measure/

Slide 28

Slide 28 text

// Count is a go-restful filter that counts REST call statistics.
// It counts the following: request count, request round-trip time, response success count
// and rate, and response length.
func (lf *CounterFilter) Count(req *restful.Request, resp *restful.Response, c *restful.FilterChain) {
	start := time.Now()
	target := lf.getTarget(req)

	// Record request duration
	defer ctr.Time(makeReqTime(strOverall), start, Alert|Avg)
	defer ctr.Time(makeReqTime(target), start)

	// Count request sizes
	ctr.Val(makeReqSize(strOverall), req.Request.ContentLength, Avg|Rate)
	ctr.Val(makeReqSize(target), req.Request.ContentLength, Avg|Rate)

	// Count request QPS
	ctr.Event(makeReqEvent(strOverall), 1, Alert|Rate)
	ctr.Event(makeReqEvent(target), 1)

	c.ProcessFilter(req, resp)

	// Count response codes
	statusCode := resp.StatusCode()
	ctr.Event(makeRespCode(strOverall, statusCode), 1)
	ctr.Event(makeRespCode(target, statusCode), 1)

Slide 29

Slide 29 text

Why Prometheus?

Slide 30

Slide 30 text

Don't Reinvent the Wheel Unless You Plan on Learning More About Wheels

Slide 31

Slide 31 text

Prometheus Architecture https://prometheus.io/

Slide 32

Slide 32 text

Cloud Native Computing Foundation https://www.cncf.io/

Slide 33

Slide 33 text

Integrations, Third-party libraries https://promcon.io/2016-berlin/talks/prometheus-design-and-philosophy/

Slide 34

Slide 34 text

Prometheus Metric Types • Counter • Gauge • Histogram • Summary

Slide 35

Slide 35 text

// Counter is a wrapper of metrics.*
type Counter struct {
	ctr metrics.Counter
	err metrics.Counter
	dur metrics.Histogram
	dus metrics.Histogram
	gau metrics.Gauge
}

// NewCounter returns a new Counter
func NewCounter(namespace, subsystem string) *Counter {
	…
	return &Counter{
		ctr: kitprometheus.NewCounter(ctr),
		err: kitprometheus.NewCounter(errCtr),
		dur: kitprometheus.NewHistogram(dur),
		dus: kitprometheus.NewSummary(dus),
		gau: kitprometheus.NewGauge(gau),
	}
}

// Duration records the duration in seconds
func (c *Counter) Duration(function string, value float64) {
	c.dur.With("func", function).Observe(value)
	c.dus.With("func", function).Observe(value)
}

// Event records the function call event
func (c *Counter) Event(function string, value int) {
	c.ctr.With("func", function).Add(float64(value))
}

// Err records the function call error
func (c *Counter) Err(function string, value int) {
	c.err.With("func", function).Add(float64(value))
}

Slide 36

Slide 36 text

// Record request duration in seconds; the closure defers the measurement
// until the handler returns
start := time.Now()
defer func() { ctr.Duration(makeReqTime(target), time.Since(start).Seconds()) }()

// Expand the span for tracing
span := utils.StartSpanFromContextWithSpan(ctx, target, opentracing.SpanFromContext(ctx))
defer span.Finish()

// Count request sizes
ctr.Event(makeReqSize(target), int(req.Request.ContentLength))

// Count request QPS
ctr.Event(makeReqEvent(target), 1)

c.ProcessFilter(req, resp)

// Count response codes
statusCode := resp.StatusCode()
ctr.Event(makeRespCode(target, statusCode), 1)

// StatusBadRequest is 400
if statusCode >= http.StatusBadRequest {
	ctr.Err(makeRespErrorEvent(target), 1)
	elapsedInMs := time.Since(start).Milliseconds()
	log.Errorf("[http status] %s: %d (ms: %d)", target, statusCode, elapsedInMs)
}

Slide 37

Slide 37 text

Why do 50%, 95%, 99% matter? https://www.dynatrace.com/blog/why-averages-suck-and-percentiles-are-great/

Slide 38

Slide 38 text

Average: Taiwan's average annual salary is NT$547,000, while the gap between rich and poor reaches 12.6 times

Slide 39

Slide 39 text

Averages don't tell stories https://www.dynatrace.com/blog/why-averages-suck-and-percentiles-are-great/ http://collider.com/the-flash-movie-ezra-miller-barry-allen/

Slide 40

Slide 40 text

No content

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

99%

Slide 44

Slide 44 text

–On Metrics Monitoring “Prevention is better than cure”

Slide 45

Slide 45 text

Thank you!