Slide 1

Observability Deep Dive

Slide 2

What is Observability? It's about being able to ask questions of your systems and get answers based on the telemetry they already produce. If a service needs to be reconfigured or modified before it can answer your questions, you haven't achieved observability yet.

Slide 3

Can you understand whatever internal state the system has gotten itself into, just by inspecting and interrogating its output? Even if (especially if) you have never seen it happen before?

Slide 4

Principles of Observability

Well-defined SLIs. Availability, latency, throughput, and error rate are common (and excellent) service-level indicators; they should really matter to customers or service end users.

Powerful telemetry data. Logs, events, spans, metrics - whatever format works for you - need to be emitted by services and collected for analysis and monitoring in a cost-effective way.
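
To make this concrete, here is a minimal sketch of emitting the raw measurements behind a latency SLI, using the OpenCensus Go stats API that appears later in this deck. The measure name, view, and bucket bounds are invented for illustration:

package main

import (
    "context"
    "log"

    "go.opencensus.io/stats"
    "go.opencensus.io/stats/view"
)

// mLatency is a hypothetical request-latency measure backing a latency SLI.
var mLatency = stats.Float64("request_latency", "End-to-end request latency", stats.UnitMilliseconds)

func main() {
    // A view aggregates raw measurements into a distribution from which a
    // latency SLI (e.g. "99% of requests under 300ms") can be read.
    v := &view.View{
        Name:        "request_latency_distribution",
        Measure:     mLatency,
        Description: "Distribution of request latencies",
        Aggregation: view.Distribution(50, 100, 300, 1000), // bucket bounds in ms, chosen for illustration
    }
    if err := view.Register(v); err != nil {
        log.Fatalf("Failed to register view: %v", err)
    }

    // Record one measurement; in a real service this happens per request.
    stats.Record(context.Background(), mLatency.M(42.0))
}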

Slide 5

How to make observable systems?
- Auto instrumentation
- Infrastructure
- Culture
- Learning from outages

Slide 6

Auto Instrumentation: http client

package main

import (
    "net/http"

    "go.opencensus.io/plugin/ochttp"
)

func main() {
    client := &http.Client{Transport: new(ochttp.Transport)}

    // Use the client
    _ = client
}

Slide 7

package main

import (
    "context"
    "io"
    "io/ioutil"
    "log"
    "net/http"

    "go.opencensus.io/plugin/ochttp"
)

func main() {
    ctx := context.Background()
    // In other usages, the context would have been passed down after starting some traces.
    req, _ := http.NewRequest("GET", "https://opencensus.io/", nil)

    // It is imperative that req.WithContext is used to
    // propagate context and use it in the request.
    req = req.WithContext(ctx)

    client := &http.Client{Transport: new(ochttp.Transport)}
    res, err := client.Do(req)
    if err != nil {
        log.Fatalf("Failed to make the request: %v", err)
    }

    // Consume the body and close it.
    io.Copy(ioutil.Discard, res.Body)
    _ = res.Body.Close()
}

Slide 8

Auto Instrumentation: http server

package main

import (
    "log"
    "math/rand"
    "net/http"
    "time"

    "go.opencensus.io/plugin/ochttp"
    "go.opencensus.io/stats/view"
)

// usersHandler is a stand-in handler (its body was not shown on the slide);
// it simulates variable request latency.
var usersHandler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
    w.Write([]byte("users"))
})

func main() {
    if err := view.Register(ochttp.DefaultServerViews...); err != nil {
        log.Fatalf("Failed to register server views for HTTP metrics: %v", err)
    }
    http.Handle("/users", ochttp.WithRouteTag(usersHandler, "/users"))
    log.Fatal(http.ListenAndServe("localhost:8080", &ochttp.Handler{}))
}

Slide 9

Auto Instrumentation: sql queries

package main

import (
    "database/sql"
    "log"

    "contrib.go.opencensus.io/integrations/ocsql"
)

func main() {
    var driverName string // For example "mysql", "sqlite3" etc.

    // First step is to register the driver and
    // then reuse that driver name while invoking sql.Open.
    n, err := ocsql.Register(driverName)
    if err != nil {
        log.Fatalf("Failed to register the ocsql driver: %v", err)
    }

    db, err := sql.Open(n, "resource.db")
    if err != nil {
        log.Fatalf("Failed to open the SQL database: %v", err)
    }
    defer db.Close()

    ocsql.RegisterAllViews()
}
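
A usage sketch for the wrapped driver: once registered, queries inherit the span from the context they are given. The sqlite3 driver, query, and span name below are assumptions for illustration:

package main

import (
    "context"
    "database/sql"
    "log"

    "contrib.go.opencensus.io/integrations/ocsql"
    "go.opencensus.io/trace"

    _ "github.com/mattn/go-sqlite3" // assumed underlying driver for this sketch
)

func main() {
    // Wrap the real driver; WithAllTraceOptions records every query event.
    n, err := ocsql.Register("sqlite3", ocsql.WithAllTraceOptions())
    if err != nil {
        log.Fatalf("Failed to register the ocsql driver: %v", err)
    }
    db, err := sql.Open(n, "resource.db")
    if err != nil {
        log.Fatalf("Failed to open the SQL database: %v", err)
    }
    defer db.Close()

    // The query picks up the span from the context, so it shows up
    // as a child span of "list-users" in the trace.
    ctx, span := trace.StartSpan(context.Background(), "list-users")
    defer span.End()
    if _, err := db.ExecContext(ctx, "CREATE TABLE IF NOT EXISTS users (id INTEGER)"); err != nil {
        log.Printf("query failed: %v", err)
    }
}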

Slide 10

Examining the metrics
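
One way to get the registered views in front of you for examination, sketched here with the OpenCensus Prometheus exporter (the namespace and port are arbitrary choices):

package main

import (
    "log"
    "net/http"

    "contrib.go.opencensus.io/exporter/prometheus"
    "go.opencensus.io/stats/view"
)

func main() {
    // Export all registered views (e.g. ochttp.DefaultServerViews)
    // in Prometheus format.
    pe, err := prometheus.NewExporter(prometheus.Options{Namespace: "demo"})
    if err != nil {
        log.Fatalf("Failed to create the Prometheus exporter: %v", err)
    }
    view.RegisterExporter(pe)

    // The exporter doubles as an http.Handler serving /metrics.
    mux := http.NewServeMux()
    mux.Handle("/metrics", pe)
    log.Fatal(http.ListenAndServe("localhost:9090", mux))
}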

Slide 11

Examining the traces
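
Similarly, spans need an exporter before traces can be examined; a minimal sketch using the OpenCensus Jaeger exporter (endpoint and service name are placeholders):

package main

import (
    "log"

    "contrib.go.opencensus.io/exporter/jaeger"
    "go.opencensus.io/trace"
)

func main() {
    // Ship finished spans to a Jaeger collector.
    je, err := jaeger.NewExporter(jaeger.Options{
        CollectorEndpoint: "http://localhost:14268/api/traces",
        Process:           jaeger.Process{ServiceName: "demo-service"},
    })
    if err != nil {
        log.Fatalf("Failed to create the Jaeger exporter: %v", err)
    }
    trace.RegisterExporter(je)
}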

Slide 12

Use Case: Service Instrumentation

Slide 13

Asking basic questions

Slide 14

Instrumenting service inputs and outputs

Slide 15

Use Case: Database Instrumentation

Slide 16

Network topology in k8s

Slide 17

Observability Infrastructure

Slide 18

OC/OTel Collector Features: field manipulation

processors:
  attributes/example:
    actions:
      - key: account_password
        action: delete
      - key: request_id
        from_attribute: x.request.id
        action: update

Slide 19

OC/OTel Collector Features: smart sampling
- Head-based: sampled at the beginning of the trace
- Tail-based: sampled at the end of the trace
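
Head-based sampling can also be configured in the SDK, before data ever reaches the collector; tail-based sampling instead requires the collector to buffer complete traces before deciding. A minimal OpenCensus Go sketch, where the 1% rate is an arbitrary assumption:

package main

import (
    "context"

    "go.opencensus.io/trace"
)

func main() {
    // Head-based: the keep/drop decision is made once, when the root
    // span starts. Sample roughly 1% of traces.
    trace.ApplyConfig(trace.Config{
        DefaultSampler: trace.ProbabilitySampler(0.01),
    })

    // Child spans inherit the decision made at the root of the trace.
    _, span := trace.StartSpan(context.Background(), "example")
    defer span.End()
}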

Slide 20

OpenCensus vs OpenTelemetry

Slide 21

No content

Slide 22

Wrapping up
- Auto instrumentation is the way to go
- Service troubleshooting became damn hard in containers
- You now need to operate Observability services
- Observability means better software operations, less downtime, and a better product for end users

Slide 23

Only Production is Production

Incident: Elevated database "connection refused" errors.
Hypothesis: The service receives traffic before istio-proxy is ready.
Telemetry: Container logs, traces, and metrics from both the service and istio-proxy containers, plus database slow query logs.

Slide 24

Question: Did the service have a traffic spike?
Answer: Yes, the available metrics show a traffic spike for the service.

Slide 25

Question: Did a pod autoscaling event kick in?
Answer: Likely yes; the kube_deployment_status_replicas_available metric shows that more pods are running.

Slide 26

Question: Did the fpm container receive traffic while istio-proxy wasn't ready?
Answer: No. Timestamped logs from the istio-proxy, nginx, and fpm containers, together with k8s events, show that fpm didn't get any incoming traffic before istio-proxy was ready.

Slide 27

Question: Why does the fpm process hang with SIGSEGV (signal 11) without sending any output?

[06-Dec-2019 10:02:20] NOTICE: fpm is running, pid 163
[06-Dec-2019 10:02:20] NOTICE: ready to handle connections
(errors start here)
[06-Dec-2019 10:02:47] WARNING: [pool www] child 169 exited on signal 11 (SIGSEGV) after 27.274999 seconds from start
[06-Dec-2019 10:02:47] NOTICE: [pool www] child 183 started
(errors stop here)

Answer: Zero answers <- Opportunity to increase Observability

Slide 28

Only Production is Production

Incident: Elevated DNS timeout errors.
Hypothesis: DNS in k8s is broken.
Telemetry: Container logs, traces, and metrics.

Slide 29

Question: What is the 99th-percentile DNS resolution latency per service, measured at the client side?
Answer: Zero answers <- Opportunity to increase Observability

Slide 30

package main

import (
    "net/http"
    "net/http/httptrace"

    "github.com/opentracing/opentracing-go"
    "github.com/opentracing/opentracing-go/log"
)

// net/http/httptrace exposes hooks for each phase of an outgoing request,
// including DNS resolution:
//
//    type ClientTrace struct {
//        // DNSStart is called when a DNS lookup begins.
//        DNSStart func(DNSStartInfo)
//        // DNSDone is called when a DNS lookup ends.
//        DNSDone func(DNSDoneInfo)
//        ...
//    }

// NewClientTrace returns an httptrace.ClientTrace whose callbacks log
// DNS events on the given span.
func NewClientTrace(span opentracing.Span) *httptrace.ClientTrace {
    trace := &clientTrace{span: span}
    return &httptrace.ClientTrace{
        DNSStart: trace.dnsStart,
        DNSDone:  trace.dnsDone,
    }
}

// clientTrace holds a reference to the Span and provides methods used as ClientTrace callbacks.
type clientTrace struct {
    span opentracing.Span
}

func (h *clientTrace) dnsStart(info httptrace.DNSStartInfo) {
    h.span.LogFields(
        log.String("event", "DNS start"),
        log.Object("host", info.Host),
    )
}

func (h *clientTrace) dnsDone(httptrace.DNSDoneInfo) {
    h.span.LogFields(log.String("event", "DNS done"))
    // Measure the time it took from start to done.
}

// main is a minimal usage sketch (not on the original slide): attach the
// ClientTrace to a request's context so the callbacks fire during the call.
func main() {
    span := opentracing.StartSpan("fetch")
    defer span.Finish()

    req, _ := http.NewRequest("GET", "https://opencensus.io/", nil)
    ctx := httptrace.WithClientTrace(req.Context(), NewClientTrace(span))
    req = req.WithContext(ctx)

    if res, err := http.DefaultClient.Do(req); err == nil {
        res.Body.Close()
    }
}

Slide 31

The hard way

Slide 32

Question: What is the packet drop rate per node?
Answer: Zero answers <- Opportunity to increase Observability

Slide 33

Question: Are nodes running out of resources (CPU, memory, etc.)?
Answer: Yes, the available metrics show CPU and memory usage on each node, broken down per service.

Slide 34

Wrapping up
- Observability helps to validate hypotheses
- When shit hits the fan, kernel Observability FTW
- No answers, no Observability

Slide 35

Culture
- Enable engineering teams to own their stuff in production without fear.
- Fewer guidelines, more automation.
- Education efforts toward a mindset shift, not just docs.
- Increase Observability by learning from service outages.

Slide 36

Thank you