Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Observability Deep Dive

Rafael Jesus
December 10, 2019

Transcript

  1. What is Observability? It's about being able to ask questions

    of your systems and get answers based on the existing telemetry they produce. If a service needs to be re-configured or modified to answer your questions, you haven't achieved observability yet.
  2. Can you understand whatever internal state the system has gotten

    itself into? Just by inspecting and interrogating its output? Even if (especially if) you have never seen it happen before?
  3. Principles of Observability

    Well-defined SLIs. Availability, latency, throughput, and error rate are common (and excellent) service-level indicators; they should really matter for customers or service end users.
    Powerful Telemetry Data. Logs, events, spans, metrics - whatever format works for you - need to be emitted by services, and collected for analysis and monitoring in a cost-effective way.
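    To make the telemetry side concrete in Go (the language of the deck's examples), here is a minimal sketch of recording a latency SLI and a per-status request count with OpenCensus stats and views. The measure name, bucket bounds and the "status" tag are illustrative assumptions, not something shown in the talk:

    package main

    import (
        "context"
        "log"
        "time"

        "go.opencensus.io/stats"
        "go.opencensus.io/stats/view"
        "go.opencensus.io/tag"
    )

    // Hypothetical measure and tag for a request-latency / error-rate SLI.
    var (
        latencyMs = stats.Float64("request_latency", "End-to-end request latency", stats.UnitMilliseconds)
        statusKey = tag.MustNewKey("status")
    )

    func main() {
        // Views aggregate raw measurements into the SLI time series.
        if err := view.Register(
            &view.View{
                Name:        "request_latency_distribution",
                Description: "Latency distribution of handled requests",
                Measure:     latencyMs,
                TagKeys:     []tag.Key{statusKey},
                Aggregation: view.Distribution(5, 10, 25, 50, 100, 250, 500, 1000),
            },
            &view.View{
                Name:        "request_count",
                Description: "Count of handled requests by status",
                Measure:     latencyMs,
                TagKeys:     []tag.Key{statusKey},
                Aggregation: view.Count(),
            },
        ); err != nil {
            log.Fatalf("Failed to register views: %v", err)
        }

        // Record one fake request so the example is self-contained.
        start := time.Now()
        // ... handle the request ...
        _ = stats.RecordWithTags(context.Background(),
            []tag.Mutator{tag.Upsert(statusKey, "ok")},
            latencyMs.M(float64(time.Since(start).Milliseconds())))
    }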
  4. Auto Instrumentation: http client

    package main

    import (
        "net/http"

        "go.opencensus.io/plugin/ochttp"
    )

    func main() {
        client := &http.Client{Transport: new(ochttp.Transport)}
        // Use the client
        _ = client
    }
  5. package main import ( "context" "io" "io/ioutil" "log" "net/http" "go.opencensus.io/plugin/ochttp"

    ) func main() { ctx := context.Background() // In other usages, the context would have been passed down after starting some traces. req, _ := http.NewRequest("GET", "https://opencensus.io/", nil) // It is imperative that req.WithContext is used to // propagate context and use it in the request. req = req.WithContext(ctx) client := &http.Client{Transport: new(ochttp.Transport)} res, err := client.Do(req) if err != nil { log.Fatalf("Failed to make the request: %v", err) } // Consume the body and close it. io.Copy(ioutil.Discard, res.Body) _ = res.Body.Close() }
  6. Auto Instrumentation: http server

    package main

    import (
        "log"
        "math/rand"
        "net/http"
        "time"

        "go.opencensus.io/plugin/ochttp"
        "go.opencensus.io/stats/view"
    )

    // usersHandler is not shown on the slide; a minimal stand-in handler
    // that simulates some work before replying.
    var usersHandler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
        w.Write([]byte("users"))
    })

    func main() {
        if err := view.Register(ochttp.DefaultServerViews...); err != nil {
            log.Fatalf("Failed to register server views for HTTP metrics: %v", err)
        }
        http.Handle("/users", ochttp.WithRouteTag(usersHandler, "/users"))
        log.Fatal(http.ListenAndServe("localhost:8080", &ochttp.Handler{}))
    }
  7. Auto Instrumentation: sql queries

    package main

    import (
        "database/sql"
        "log"

        "contrib.go.opencensus.io/integrations/ocsql"
    )

    func main() {
        var driverName string // For example "mysql", "sqlite3" etc.

        // First step is to register the driver and
        // then reuse that driver name while invoking sql.Open
        n, err := ocsql.Register(driverName)
        if err != nil {
            log.Fatalf("Failed to register the ocsql driver: %v", err)
        }
        db, err := sql.Open(n, "resource.db")
        if err != nil {
            log.Fatalf("Failed to open the SQL database: %v", err)
        }
        defer db.Close()

        ocsql.RegisterAllViews()
    }
  8. OC/Otel Collector Features: field manipulation

    processors:
      attributes/example:
        actions:
          - key: account_password
            action: delete
          - key: request_id
            from_attribute: x.request.id
            action: update
  9. OC/Otel Collector Features: smart sampling

    Head-based: sampled at the beginning of the trace
    Tail-based: sampled at the end of the trace
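    Tail-based sampling lives in the collector, but head-based sampling is configured in the SDK where the root span starts. A minimal sketch with the OpenCensus Go tracer; the 1% rate and the span name are illustrative assumptions:

    package main

    import (
        "context"

        "go.opencensus.io/trace"
    )

    func main() {
        // Head-based: the decision is made once, when the root span starts,
        // and applies to the whole trace. The 1% rate here is illustrative.
        trace.ApplyConfig(trace.Config{DefaultSampler: trace.ProbabilitySampler(0.01)})

        // Individual spans can still override the default sampler, e.g. to
        // always keep traces for an especially interesting operation.
        ctx, span := trace.StartSpan(context.Background(), "checkout",
            trace.WithSampler(trace.AlwaysSample()))
        defer span.End()
        _ = ctx
    }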
  10. Wrapping up

    Auto instrumentation is the way to go. Service troubleshooting became damn hard in containers. You now need to operate Observability services yourself. Observability is for better software operations, less downtime, and a better product for end users.
  11. Only Production is Production

    Incident: Elevated database "connection refused" errors
    Hypothesis: Service receives traffic before istio-proxy is ready
    Telemetry: container logs, traces and metrics from both the service and istio-proxy containers, plus database slow query logs.
  12. Question: Did the service have a spike in traffic?

    Answer: Yes, the available metrics show a traffic spike for the service.
  13. Question: Did a pod autoscaling event kick in?

    Answer: Likely yes, the kube_deployment_status_replicas_available metric shows more pods are running.
  14. Question: Did the fpm container receive traffic while istio-proxy wasn't ready?

    Answer: No. The timestamped logs of the istio-proxy, nginx and fpm containers, together with the k8s events, show that fpm didn't get any incoming traffic before istio-proxy was ready.
  15. Question: Why does the fpm process hang with SIGSEGV (signal 11) without sending any output?

    [06-Dec-2019 10:02:20] NOTICE: fpm is running, pid 163
    [06-Dec-2019 10:02:20] NOTICE: ready to handle connections
    (errors start here)
    [06-Dec-2019 10:02:47] WARNING: [pool www] child 169 exited on signal 11 (SIGSEGV) after 27.274999 seconds from start
    [06-Dec-2019 10:02:47] NOTICE: [pool www] child 183 started
    (errors stop here)

    Answer: Zero answers <- Opportunity to increase Observability
  16. Only Production is Production

    Incident: Elevated DNS timeout errors
    Hypothesis: DNS in k8s is broken
    Telemetry: container logs, traces and metrics.
  17. Question: What is the 99th percentile DNS resolution latency per service, measured at the client side?

    Answer: Zero answers <- Opportunity to increase Observability
  18. package main

    import (
        "net/http/httptrace"

        "github.com/opentracing/opentracing-go"
        "github.com/opentracing/opentracing-go/log"
    )

    // The relevant hooks of net/http/httptrace.ClientTrace:
    //
    //   // DNSStart is called when a DNS lookup begins.
    //   DNSStart func(DNSStartInfo)
    //   // DNSDone is called when a DNS lookup ends.
    //   DNSDone func(DNSDoneInfo)

    func NewClientTrace(span opentracing.Span) *httptrace.ClientTrace {
        trace := &clientTrace{span: span}
        return &httptrace.ClientTrace{
            DNSStart: trace.dnsStart,
            DNSDone:  trace.dnsDone,
        }
    }

    // clientTrace holds a reference to the Span and provides methods used as ClientTrace callbacks
    type clientTrace struct {
        span opentracing.Span
    }

    func (h *clientTrace) dnsStart(info httptrace.DNSStartInfo) {
        h.span.LogFields(
            log.String("event", "DNS start"),
            log.Object("host", info.Host),
        )
    }

    func (h *clientTrace) dnsDone(httptrace.DNSDoneInfo) {
        h.span.LogFields(log.String("event", "DNS done"))
        // measure the time it took from the start to done.
    }
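    A sketch of how the hooks above might be wired into an outgoing request with httptrace.WithClientTrace, building on NewClientTrace from the slide; the tracer setup, span name and URL are assumptions, not part of the slide:

    package main

    import (
        "context"
        "log"
        "net/http"
        "net/http/httptrace"

        "github.com/opentracing/opentracing-go"
    )

    func main() {
        // Assumes a concrete tracer (e.g. Jaeger) was registered as the global
        // tracer elsewhere; with no tracer, the spans go to the no-op tracer.
        span, ctx := opentracing.StartSpanFromContext(context.Background(), "fetch-homepage")
        defer span.Finish()

        req, _ := http.NewRequest("GET", "https://opencensus.io/", nil)
        // Attach the hooks from the slide above so DNS start/done events
        // (and thus DNS resolution latency) are logged on the client span.
        req = req.WithContext(httptrace.WithClientTrace(ctx, NewClientTrace(span)))

        res, err := http.DefaultClient.Do(req)
        if err != nil {
            log.Fatalf("Failed to make the request: %v", err)
        }
        res.Body.Close()
    }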
  19. Question: What is the packet drop rate per node?

    Answer: Zero answers <- Opportunity to increase Observability
  20. Question: Are nodes running out of resources (CPU, memory, etc.)?

    Answer: Yes, the available metrics show CPU and memory usage on the node, broken down per service.
  21. Wrapping up

    Observability helps to validate hypotheses. When shit hits the fan, kernel Observability FTW. No answers, no Observability.
  22. Culture

    Enable engineering teams to own their stuff in production without fear. Fewer guidelines, more automation. Education efforts towards a mindset shift, not just docs. Increase Observability from service outages.