Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Observability Deep Dive

Rafael Jesus
December 10, 2019

Transcript

  1. What is Observability? It's about being able to ask questions

    of your systems and get answers based on the existing telemetry they produce. If a service needs to be re-configured or modified to answer your questions, you haven't achieved observability yet.
  2. Can you understand whatever internal state the system has gotten

    itself into? Just by inspecting and interrogating its output? Even if (especially if) you have never seen it happen before?
  3. Principles of Observability

    Well-defined SLIs. Availability, latency, throughput, and error rate are common (and excellent) service-level indicators; they should really matter for customers or service end users.
    Powerful Telemetry Data. Logs, events, spans, metrics - whatever format works for you - need to be emitted by services, and collected for analysis and monitoring in a cost-effective way.
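    To make the telemetry side concrete in Go (the language of the deck's examples), here is a minimal sketch of recording a latency SLI and a per-status request count with OpenCensus stats and views. The measure name, bucket bounds and the "status" tag are illustrative assumptions, not something shown in the talk:

    package main

    import (
        "context"
        "log"
        "time"

        "go.opencensus.io/stats"
        "go.opencensus.io/stats/view"
        "go.opencensus.io/tag"
    )

    // Hypothetical measure and tag for a request-latency / error-rate SLI.
    var (
        latencyMs = stats.Float64("request_latency", "End-to-end request latency", stats.UnitMilliseconds)
        statusKey = tag.MustNewKey("status")
    )

    func main() {
        // Views aggregate raw measurements into the SLI time series.
        if err := view.Register(
            &view.View{
                Name:        "request_latency_distribution",
                Description: "Latency distribution of handled requests",
                Measure:     latencyMs,
                TagKeys:     []tag.Key{statusKey},
                Aggregation: view.Distribution(5, 10, 25, 50, 100, 250, 500, 1000),
            },
            &view.View{
                Name:        "request_count",
                Description: "Count of handled requests by status",
                Measure:     latencyMs,
                TagKeys:     []tag.Key{statusKey},
                Aggregation: view.Count(),
            },
        ); err != nil {
            log.Fatalf("Failed to register views: %v", err)
        }

        // Record one fake request so the example is self-contained.
        start := time.Now()
        // ... handle the request ...
        _ = stats.RecordWithTags(context.Background(),
            []tag.Mutator{tag.Upsert(statusKey, "ok")},
            latencyMs.M(float64(time.Since(start).Milliseconds())))
    }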
  4. Auto Instrumentation: http client

    package main

    import (
        "net/http"

        "go.opencensus.io/plugin/ochttp"
    )

    func main() {
        client := &http.Client{Transport: new(ochttp.Transport)}
        // Use the client
        _ = client
    }
  5. package main import ( "context" "io" "io/ioutil" "log" "net/http" "go.opencensus.io/plugin/ochttp"

    ) func main() { ctx := context.Background() // In other usages, the context would have been passed down after starting some traces. req, _ := http.NewRequest("GET", "https://opencensus.io/", nil) // It is imperative that req.WithContext is used to // propagate context and use it in the request. req = req.WithContext(ctx) client := &http.Client{Transport: new(ochttp.Transport)} res, err := client.Do(req) if err != nil { log.Fatalf("Failed to make the request: %v", err) } // Consume the body and close it. io.Copy(ioutil.Discard, res.Body) _ = res.Body.Close() }
  6. Auto Instrumentation: http server

    package main

    import (
        "log"
        "math/rand"
        "net/http"
        "time"

        "go.opencensus.io/plugin/ochttp"
        "go.opencensus.io/stats/view"
    )

    // usersHandler is not shown on the slide; a minimal stand-in handler
    // that simulates some work before replying.
    var usersHandler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
        w.Write([]byte("users"))
    })

    func main() {
        if err := view.Register(ochttp.DefaultServerViews...); err != nil {
            log.Fatalf("Failed to register server views for HTTP metrics: %v", err)
        }
        http.Handle("/users", ochttp.WithRouteTag(usersHandler, "/users"))
        log.Fatal(http.ListenAndServe("localhost:8080", &ochttp.Handler{}))
    }
  7. Auto Instrumentation: sql queries

    package main

    import (
        "database/sql"
        "log"

        "contrib.go.opencensus.io/integrations/ocsql"
    )

    func main() {
        var driverName string // For example "mysql", "sqlite3" etc.

        // First step is to register the driver and
        // then reuse that driver name while invoking sql.Open
        n, err := ocsql.Register(driverName)
        if err != nil {
            log.Fatalf("Failed to register the ocsql driver: %v", err)
        }
        db, err := sql.Open(n, "resource.db")
        if err != nil {
            log.Fatalf("Failed to open the SQL database: %v", err)
        }
        defer db.Close()

        ocsql.RegisterAllViews()
    }
  8. OC/Otel Collector Features: field manipulation

    processors:
      attributes/example:
        actions:
          - key: account_password
            action: delete
          - key: request_id
            from_attribute: x.request.id
            action: update
  9. OC/Otel Collector Features: smart sampling

    Head-based: sampled at the beginning of the trace
    Tail-based: sampled at the end of the trace
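    Tail-based sampling lives in the collector, but head-based sampling is configured in the SDK where the root span starts. A minimal sketch with the OpenCensus Go tracer; the 1% rate and the span name are illustrative assumptions:

    package main

    import (
        "context"

        "go.opencensus.io/trace"
    )

    func main() {
        // Head-based: the decision is made once, when the root span starts,
        // and applies to the whole trace. The 1% rate here is illustrative.
        trace.ApplyConfig(trace.Config{DefaultSampler: trace.ProbabilitySampler(0.01)})

        // Individual spans can still override the default sampler, e.g. to
        // always keep traces for an especially interesting operation.
        ctx, span := trace.StartSpan(context.Background(), "checkout",
            trace.WithSampler(trace.AlwaysSample()))
        defer span.End()
        _ = ctx
    }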
  10. Wrapping up

    Auto instrumentation is the way to go. Service troubleshooting became damn hard in containers. You now need to operate Observability services yourself. Observability is for better software operations, less downtime, and a better product for end users.
  11. Only Production is Production

    Incident: Elevated database "connection refused" errors
    Hypothesis: Service receives traffic before istio-proxy is ready
    Telemetry: container logs, traces and metrics from both the service and istio-proxy containers, plus database slow query logs.
  12. Question: Did the service have a spike in traffic?

    Answer: Yes, the available metrics show a traffic spike for the service.
  13. Question: Did a pod autoscaling event kick in?

    Answer: Likely yes, the kube_deployment_status_replicas_available metric shows more pods are running.
  14. Question: Did the fpm container receive traffic while istio-proxy wasn't ready?

    Answer: No. The timestamped logs of the istio-proxy, nginx and fpm containers, together with the k8s events, show that fpm didn't get any incoming traffic before istio-proxy was ready.
  15. Question: Why does the fpm process hang with SIGSEGV (signal 11) without sending any output?

    [06-Dec-2019 10:02:20] NOTICE: fpm is running, pid 163
    [06-Dec-2019 10:02:20] NOTICE: ready to handle connections
    (errors start here)
    [06-Dec-2019 10:02:47] WARNING: [pool www] child 169 exited on signal 11 (SIGSEGV) after 27.274999 seconds from start
    [06-Dec-2019 10:02:47] NOTICE: [pool www] child 183 started
    (errors stop here)

    Answer: Zero answers <- Opportunity to increase Observability
  16. Only Production is Production

    Incident: Elevated DNS timeout errors
    Hypothesis: DNS in k8s is broken
    Telemetry: container logs, traces and metrics.
  17. Question: What is the 99th percentile DNS resolution latency per service, measured at the client side?

    Answer: Zero answers <- Opportunity to increase Observability
  18. package main

    import (
        "net/http/httptrace"

        "github.com/opentracing/opentracing-go"
        "github.com/opentracing/opentracing-go/log"
    )

    // The relevant hooks of net/http/httptrace.ClientTrace:
    //
    //   // DNSStart is called when a DNS lookup begins.
    //   DNSStart func(DNSStartInfo)
    //   // DNSDone is called when a DNS lookup ends.
    //   DNSDone func(DNSDoneInfo)

    func NewClientTrace(span opentracing.Span) *httptrace.ClientTrace {
        trace := &clientTrace{span: span}
        return &httptrace.ClientTrace{
            DNSStart: trace.dnsStart,
            DNSDone:  trace.dnsDone,
        }
    }

    // clientTrace holds a reference to the Span and provides methods used as ClientTrace callbacks
    type clientTrace struct {
        span opentracing.Span
    }

    func (h *clientTrace) dnsStart(info httptrace.DNSStartInfo) {
        h.span.LogFields(
            log.String("event", "DNS start"),
            log.Object("host", info.Host),
        )
    }

    func (h *clientTrace) dnsDone(httptrace.DNSDoneInfo) {
        h.span.LogFields(log.String("event", "DNS done"))
        // measure the time it took from the start to done.
    }
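    A sketch of how the hooks above might be wired into an outgoing request with httptrace.WithClientTrace, building on NewClientTrace from the slide; the tracer setup, span name and URL are assumptions, not part of the slide:

    package main

    import (
        "context"
        "log"
        "net/http"
        "net/http/httptrace"

        "github.com/opentracing/opentracing-go"
    )

    func main() {
        // Assumes a concrete tracer (e.g. Jaeger) was registered as the global
        // tracer elsewhere; with no tracer, the spans go to the no-op tracer.
        span, ctx := opentracing.StartSpanFromContext(context.Background(), "fetch-homepage")
        defer span.Finish()

        req, _ := http.NewRequest("GET", "https://opencensus.io/", nil)
        // Attach the hooks from the slide above so DNS start/done events
        // (and thus DNS resolution latency) are logged on the client span.
        req = req.WithContext(httptrace.WithClientTrace(ctx, NewClientTrace(span)))

        res, err := http.DefaultClient.Do(req)
        if err != nil {
            log.Fatalf("Failed to make the request: %v", err)
        }
        res.Body.Close()
    }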
  19. Question: What is the packet drop rate per node?

    Answer: Zero answers <- Opportunity to increase Observability
  20. Question: Are nodes running out of resources (CPU, memory, etc.)?

    Answer: Yes, the available metrics show CPU and memory usage on the node, broken down per service.
  21. Wrapping up

    Observability helps to validate hypotheses. When shit hits the fan, kernel Observability FTW. No answers, no Observability.
  22. Culture

    Enable engineering teams to own their stuff in production without fear. Fewer guidelines, more automation. Education efforts towards a mindset shift, not just docs. Increase Observability from service outages.