
ScaleConf Cape Town - Debug like a pro on Kubernetes

Golang applications are a great fit for containers: you can build a single binary, package it in a tiny Docker image, and ship it to your Kubernetes cluster. A successful production environment requires stability and simplicity. It needs to be easy to troubleshoot, and operators need to be able to collect all the information developers will need to fix a bug. During this talk, Gianluca will share what InfluxData is doing to allow developers and system administrators to work together: understanding problems running live at scale on Kubernetes, and escalating them to software engineers using logs, delve, gdb, core dumps, and traces to replicate and fix issues.

Gianluca Arbezzano

March 08, 2019


  1. ~ @gianarb - https://gianarb.it ~ Debug like a pro on Kubernetes

    ScaleConf - Cape Town - 2019
  2. Gianluca Arbezzano, Site Reliability Engineer @InfluxData

    http://gianarb.it • @gianarb
    What I like:
    • I make dirty hacks that look awesome
    • I grow my vegetables
    • Travel for fun and work
  3. 1. Yo n ! Your team knows and uses Docker for local development and testing.

    2. Kub te ! Everyone talks about Kubernetes.
    3. Hir ! You don't know why, but you hired a DevOps engineer who kind of knows k8s.
    4. Ex i m ! You are moving everything and everyone to Kubernetes.
  4. You need a good book

    1. Short
    2. Driven by experience
    3. Practical
    4. Easy
  5. K8s as code: From YAML to code (golang)

    1. You can use Golang autocomplete as documentation and as a reference for every Kubernetes resource.
    2. You feel less like a YAML engineer (great feeling, btw).
    3. Code is better than YAML! You can reuse it, compile it, and embed it in other projects.
  6. K8s as code: From YAML to code (golang)

    A tiny CLI to do the migration to golang
    Some manual refactoring
  7. K8s as code: From YAML to code (golang)

    A tiny CLI to do the migration to golang
    Some manual refactoring
    • We continue to improve our CI to validate that the YAML and Go files are the same, and that the resources in Kubernetes match the Go files.
    • Maybe we will be able to remove the YAML at some point.
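
The CI check mentioned on this slide could look roughly like the following sketch. It is not the tooling InfluxData uses. It assumes both sides have already been rendered to JSON (the Go standard library has no YAML parser), and `SameResource` is a hypothetical helper name.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// SameResource reports whether two JSON documents describe the same resource,
// ignoring key order and whitespace.
func SameResource(a, b []byte) (bool, error) {
	var left, right interface{}
	if err := json.Unmarshal(a, &left); err != nil {
		return false, fmt.Errorf("decode first document: %w", err)
	}
	if err := json.Unmarshal(b, &right); err != nil {
		return false, fmt.Errorf("decode second document: %w", err)
	}
	return reflect.DeepEqual(left, right), nil
}

func main() {
	// Hypothetical inputs: one rendered from Go code, one read from the repository.
	fromGo := []byte(`{"kind":"Service","metadata":{"name":"api"}}`)
	fromFile := []byte(`{
	  "metadata": {"name": "api"},
	  "kind": "Service"
	}`)
	same, err := SameResource(fromGo, fromFile)
	if err != nil {
		panic(err)
	}
	fmt.Println("in sync:", same)
}
```

A CI job would fail the build when the comparison returns false, which keeps the two representations from drifting apart while both still exist.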
  8. GitOps

    Your Git repository is the entry point for all your code changes. Infrastructure is "as code", so the place where you make it happen should be Git.
  9. Examples

    • Everything has an API because you should USE it TO make something good! (cURL is good, but you can make something better.)
    • Some of our tools:
      ◦ A Backup and Restore Operator for Persistent Volumes
      ◦ A service that creates isolated environments at runtime, so devs can test and product people have a safe environment to demo and try things. We also use it for integration and smoke tests.
      ◦ A tool that replicates an environment locally on minikube, installing and configuring all the dependencies
  10. We need processes and tools that give us the ability to take a real-time picture of our system
  11. Instrumentation code is a first-class citizen in your codebase: OpenCensus

    • Open source project sponsored by Google
    • It is a spec plus a set of libraries in different languages to instrument your application
    • It collects metrics, traces and events
  12. OpenCensus

    Common interface to collect stats and traces from your app
    Different exporters to persist your data
  13. # HELP http_requests_total The total number of HTTP requests.
    # TYPE http_requests_total counter
    http_requests_total{method="post",code="200"} 1027 1395066363000
    http_requests_total{method="post",code="400"} 3 1395066363000

    # Escaping in label values:
    msdos_file_access_time_seconds{path="C:\\DIR\\FILE.TXT",error="Cannot find file:\n\"FILE.TXT\""} 1.458255915e9

    # Minimalistic line:
    metric_without_timestamp_and_labels 12.47

    # A weird metric from before the epoch:
    something_weird{problem="division by zero"} +Inf -3982045

    # A histogram, which has a pretty complex representation in the text format:
    # HELP http_request_duration_seconds A histogram of the request duration.
    # TYPE http_request_duration_seconds histogram
    http_request_duration_seconds_bucket{le="0.05"} 24054
    http_request_duration_seconds_bucket{le="0.1"} 33444
    http_request_duration_seconds_bucket{le="0.2"} 100392
    http_request_duration_seconds_bucket{le="0.5"} 129389
    http_request_duration_seconds_bucket{le="1"} 133988
    http_request_duration_seconds_bucket{le="+Inf"} 144320
    http_request_duration_seconds_sum 53423
    http_request_duration_seconds_count 144320
  14. https://github.com/prometheus/prom2json/blob/master/prom2json.go#L123

    func FetchMetricFamilies(url string, ch chan<- *dto.MetricFamily, certificate string, key string, skipServerCertCheck bool) error {
        defer close(ch)
        var transport *http.Transport
        if certificate != "" && key != "" {
            cert, err := tls.LoadX509KeyPair(certificate, key)
            if err != nil {
                return err
            }
            tlsConfig := &tls.Config{
                Certificates:       []tls.Certificate{cert},
                InsecureSkipVerify: skipServerCertCheck,
            }
            tlsConfig.BuildNameToCertificate()
            transport = &http.Transport{TLSClientConfig: tlsConfig}
        } else {
            transport = &http.Transport{
                TLSClientConfig: &tls.Config{InsecureSkipVerify: skipServerCertCheck},
            }
        }
        client := &http.Client{Transport: transport}
        return decodeContent(client, url, ch)
    }
  15. More to read

    • OpenMetrics: https://github.com/OpenObservability/OpenMetrics
    • OpenMetrics mailing list: https://groups.google.com/forum/#!forum/openmetrics
    • WIP branch for Python library: https://github.com/prometheus/client_python/tree/openmetrics
    • Thanks RICHARD for your work! I got some slides from here: https://promcon.io/2018-munich/slides/openmetrics-transforming-the-prometheus-exposition-format-into-a-global-standard.pdf
  16. © 2017 InfluxData. All rights reserved. Typical problems with logs

    • Which library do I need to use?
    • Every library has a different format
    • Every language exposes a different format
  17. Tracing is not something new

    • There are vendors
    • Every vendor has their own format
  18. [Diagram: a parent span and its child spans, each with logs, linked by span context / baggage]

    • Spans - Basic unit of timing and causality. Can be tagged with key/value pairs.
    • Logs - Structured data recorded on a span.
    • Span Context - Serializable format for linking spans across network boundaries. Carries baggage, such as request and client IDs.
    • Tracers - Anything that plugs into the OpenTracing API to record information.
    • Zipkin, Jaeger, LightStep, others
    • Also metrics (Prometheus) and logging
  19. OpenTracing API

    [Diagram labels: microservice process, main(), application logic, µ-service frameworks, Lambda functions, RPC & control-flow frameworks, existing instrumentation; tracing infrastructure: Instana, Jaeger]
  20. Opentracing: Configure the GlobalTracer (https://github.com/opentracing/opentracing-go)

    import "github.com/opentracing/opentracing-go"
    import ".../some_tracing_impl"

    func main() {
        opentracing.SetGlobalTracer(
            // tracing impl specific:
            some_tracing_impl.New(...),
        )
        ...
    }
  21. Opentracing: Create a Span from the Context (https://github.com/opentracing/opentracing-go)

    func xyz(ctx context.Context, ...) {
        ...
        span, ctx := opentracing.StartSpanFromContext(ctx, "op_name")
        defer span.Finish()
        span.LogFields(
            log.String("event", "soft error"),
            log.String("type", "cache timeout"),
            log.Int("waited.millis", 1500))
        ...
    }
  22. Opentracing: Create a Child Span (https://github.com/opentracing/opentracing-go)

    func xyz(parentSpan opentracing.Span, ...) {
        ...
        sp := opentracing.StartSpan(
            "operation_name",
            opentracing.ChildOf(parentSpan.Context()))
        defer sp.Finish()
        ...
    }
  23. Golang and Kubernetes: pprof

    • It is the native Golang profiler
    • You can use it via the `go tool pprof` command
    • `import "runtime/pprof"` writes runtime profiling data
    • `import "net/http/pprof"` serves runtime profiling data over HTTP
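
For the `runtime/pprof` side, a minimal sketch of capturing a profile from inside the process might look like this. `captureProfile` is a hypothetical helper; `pprof.Lookup` and `Profile.WriteTo` are the standard library calls it wraps.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// captureProfile writes the named runtime profile (for example "heap" or
// "goroutine") into memory and returns the raw bytes.
func captureProfile(name string) ([]byte, error) {
	p := pprof.Lookup(name)
	if p == nil {
		return nil, fmt.Errorf("unknown profile %q", name)
	}
	var buf bytes.Buffer
	// debug=0 selects the gzip-compressed protobuf format that
	// `go tool pprof` reads.
	if err := p.WriteTo(&buf, 0); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	for _, name := range []string{"goroutine", "heap"} {
		data, err := captureProfile(name)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s profile: %d bytes\n", name, len(data))
	}
}
```

Writing the bytes to a file gives something that can be opened later with `go tool pprof`, which is useful when the process cannot expose an HTTP endpoint.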
  24. Golang and Kubernetes: pprof

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof"
    )

    func main() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }
  25. $ go tool pprof http://localhost:6060/debug/pprof/heap
    Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
    Saved profile in /home/gianarb/pprof/pprof.main.alloc_objects.alloc_space.inuse_objects.inuse_space.009.pb.gz
    File: main
    Type: inuse_space
    Time: Oct 15, 2018 at 4:22pm (CEST)
    Entering interactive mode (type "help" for commands, "o" for options)
    (pprof) png
    Generating report in profile001.png
  26. Golang and Kubernetes: pprof

    https://github.com/influxdata/influxdb/blob/4cbdc197b8117fee648d62e2e5be75c6575352f0/services/httpd/pprof.go

    // handleProfiles determines which profile to return to the requester.
    func (h *Handler) handleProfiles(w http.ResponseWriter, r *http.Request) {
        switch r.URL.Path {
        case "/debug/pprof/cmdline":
            httppprof.Cmdline(w, r)
        case "/debug/pprof/profile":
            httppprof.Profile(w, r)
        case "/debug/pprof/symbol":
            httppprof.Symbol(w, r)
        case "/debug/pprof/all":
            h.archiveProfilesAndQueries(w, r)
        default:
            httppprof.Index(w, r)
        }
    }
  27. Golang and Kubernetes: pprof

    https://github.com/influxdata/influxdb/blob/4cbdc197b8117fee648d62e2e5be75c6575352f0/services/httpd/pprof.go

    // archiveProfilesAndQueries collects the following profiles:
    //  - goroutine profile
    //  - heap profile
    //  - blocking profile
    //  - mutex profile
    //  - (optionally) CPU profile
    //
    // It also collects the following query results:
    //
    //  - SHOW SHARDS
    //  - SHOW STATS
    //  - SHOW DIAGNOSTICS
    //
    // All information is added to a tar archive and then compressed, before being
    // returned to the requester as an archive file.
  28. gz := gzip.NewWriter(&resp)
    tw := tar.NewWriter(gz)

    // Collect and write out profiles.
    for _, profile := range allProfs {
        if profile.Name == "cpu" {
            if err := pprof.StartCPUProfile(&buf); err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
            sleep(w, time.Duration(profile.Debug)*time.Second)
            pprof.StopCPUProfile()
        } else {
            prof := pprof.Lookup(profile.Name)
            if prof == nil {
                http.Error(w, "unable to find profile "+profile.Name, 500)
                return
            }
            if err := prof.WriteTo(&buf, int(profile.Debug)); err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
        }
    }
  29. Golang and Kubernetes: pprof

    https://github.com/influxdata/influxdb/blob/4cbdc197b8117fee648d62e2e5be75c6575352f0/services/httpd/pprof.go

    // Collect and write out the queries.
    var allQueries = []struct {
        name string
        fn   func() ([]*models.Row, error)
    }{
        {"shards", h.showShards},
        {"stats", h.showStats},
        {"diagnostics", h.showDiagnostics},
    }

    tabW := tabwriter.NewWriter(&buf, 8, 8, 1, '\t', 0)
    for _, query := range allQueries {
        rows, err := query.fn()
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
        }
  30. func (h *Handler) showDiagnostics() ([]*models.Row, error) {
        diags, err := h.Monitor.Diagnostics()
        if err != nil {
            return nil, err
        }

        // Get a sorted list of diagnostics keys.
        sortedKeys := make([]string, 0, len(diags))
        for k := range diags {
            sortedKeys = append(sortedKeys, k)
        }
        sort.Strings(sortedKeys)

        rows := make([]*models.Row, 0, len(diags))
        for _, k := range sortedKeys {
            row := &models.Row{Name: k}
            row.Columns = diags[k].Columns
            row.Values = diags[k].Rows
            rows = append(rows, row)
        }
        return rows, nil
    }
  31. That’s not totally true. You can expose the debugger endpoint.

    It works, but it is not suitable for production workloads.
  32. Summary

    • Get your hands dirty
    • Bring developers into the ops loop
    • Write tools!
    • Observability: events, metrics, logs, traces
    • Instrumentation code is a first-class citizen in our app
    • Tracing
  33. Credits and articles to read

    • https://gianarb.it
    • https://www.influxdata.com/blog/monitoring-kubernetes-architecture/
    • https://www.honeycomb.io/observability/
    • https://www.weave.works/technologies/gitops/