ScaleConf Cape Town - Debug like a pro on Kubernetes

Golang applications are a good fit for containers: you can build a single binary, package it in a tiny Docker image, and ship it to your Kubernetes cluster. A successful production environment requires stability and simplicity. It needs to be easy to troubleshoot, and operators need to be able to gather all the information developers will need to fix a bug. During this talk, Gianluca will share what InfluxData is doing to allow developers and system administrators to work together: understanding problems running live at scale on Kubernetes, and escalating them to software engineers using logs, Delve, gdb, core dumps, and traces to replicate and fix issues.

Gianluca Arbezzano

March 08, 2019

Transcript

  1. ~ @gianarb - https://gianarb.it ~ Debug like a pro on

    Kubernetes ScaleConf - Cape Town - 2019
  2. ~ @gianarb - https://gianarb.it ~ Gianluca Arbezzano Site Reliability Engineer

    @InfluxData • http://gianarb.it • @gianarb What I like: • I make dirty hacks that look awesome • I grow my vegetables • Travel for fun and work
  3. 1. Your team knows and uses Docker for local development and testing.
     2. Everyone speaks about Kubernetes.
     3. You don’t know why, but you hired a DevOps engineer who kind of knows k8s.
     4. You are moving everything and everyone to Kubernetes.
  4. Inspired by a true story

  5. None
  6. You need a good book 1. Short 2. Driven by

    experiences 3. Practical 4. Easy
  7. We need to get our hands dirty

  8. Spin up a cluster that you can break Bring developers

    in the loop
  9. Deploy CI on Kubernetes Bring developers in the loop

  10. Run your code in prod Bring developers in the loop

  11. Don’t be scared and write your own tools!

  12. K8s as code: From YAML to code (golang) 1. You can use Golang autocomplete as documentation and as a reference for every Kubernetes resource 2. You feel less like a YAML engineer (great feeling btw) 3. Code is better than YAML! You can reuse it, compile it, embed it in other projects.
  13. K8s as code: From YAML to code (golang) Tiny cli

    to make the migration to golang Some manual refactoring
  14. K8s as code: From YAML to code (golang) Tiny cli to make the migration to golang Some manual refactoring • Continue to improve our CI to validate that the YAML and Go files are the same, and that the resources in Kubernetes match the Go files. • Maybe we will be able to remove the YAML at some point.
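The "resources as Go code" idea can be sketched with nothing but the standard library. The types below (`ObjectMeta`, `Container`, `DeploymentSpec`, `Deployment`) and the `NewDeployment` and `Render` helpers are hypothetical, heavily reduced stand-ins invented for this illustration. They are not the real Kubernetes API types and the output is not a complete manifest. The point is only that a typed constructor can be reused, compiled, and checked by the editor in a way a YAML file cannot.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ObjectMeta, Container, DeploymentSpec and Deployment are hypothetical,
// reduced stand-ins for Kubernetes resources; the real API types carry
// many more fields and a different nesting.
type ObjectMeta struct {
	Name   string            `json:"name"`
	Labels map[string]string `json:"labels,omitempty"`
}

type Container struct {
	Name  string `json:"name"`
	Image string `json:"image"`
}

type DeploymentSpec struct {
	Replicas   int32       `json:"replicas"`
	Containers []Container `json:"containers"`
}

type Deployment struct {
	APIVersion string         `json:"apiVersion"`
	Kind       string         `json:"kind"`
	Metadata   ObjectMeta     `json:"metadata"`
	Spec       DeploymentSpec `json:"spec"`
}

// NewDeployment is the reusable part: every service built through it gets
// the same kind, API version, and labelling convention.
func NewDeployment(name, image string, replicas int32) Deployment {
	return Deployment{
		APIVersion: "apps/v1",
		Kind:       "Deployment",
		Metadata:   ObjectMeta{Name: name, Labels: map[string]string{"app": name}},
		Spec: DeploymentSpec{
			Replicas:   replicas,
			Containers: []Container{{Name: name, Image: image}},
		},
	}
}

// Render serializes the resource so it can be diffed against a YAML or
// JSON file in CI.
func Render(d Deployment) (string, error) {
	b, err := json.MarshalIndent(d, "", "  ")
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := Render(NewDeployment("api", "example/api:1.0", 3))
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

A CI check along the lines the slide describes could render the Go definition and compare it with the checked-in manifest, failing the build when they drift.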
  15. GitOps Your Git repository is the entrypoint for all your

    code changes. Infrastructure is ‘as code’, so the place where you make it happen should be Git.
  16. Examples
    • Everything has an API because you should use it to make something good! (cURL is good, but you can make something better.)
    • Some of our tools:
      ◦ Backup and Restore Operator for Persistent Volumes
      ◦ A service that creates isolated environments at runtime, so devs can test and product people have a safe environment to demo and try things. We also use it for the integration and smoke tests.
      ◦ A tool that replicates an environment locally on minikube, installing and configuring all the dependencies
  17. Instrumentation and Observability

  18. We need to have processes and tools that give us

    the ability to take a real time picture of our system
  19. Observability Events, metrics, logs and traces

  20. Observability It is all about how we collect and aggregate

    the data
  21. Normal state vs Current state

  22. Bring developers in the loop You need knowledgeable devs to

    drive the team
  23. Instrumentation code is a first-class citizen in your codebase: OpenCensus • Open Source project sponsored by Google • It is a spec plus a set of libraries in different languages to instrument your application • It collects metrics, traces and events.
  24. OpenCensus Common Interface to collect stats and traces from your

    app Different exporters to persist your data
  25. gianarb.it ~ @gianarb

        # HELP http_requests_total The total number of HTTP requests.
        # TYPE http_requests_total counter
        http_requests_total{method="post",code="200"} 1027 1395066363000
        http_requests_total{method="post",code="400"} 3 1395066363000

        # Escaping in label values:
        msdos_file_access_time_seconds{path="C:\\DIR\\FILE.TXT",error="Cannot find file:\n\"FILE.TXT\""} 1.458255915e9

        # Minimalistic line:
        metric_without_timestamp_and_labels 12.47

        # A weird metric from before the epoch:
        something_weird{problem="division by zero"} +Inf -3982045

        # A histogram, which has a pretty complex representation in the text format:
        # HELP http_request_duration_seconds A histogram of the request duration.
        # TYPE http_request_duration_seconds histogram
        http_request_duration_seconds_bucket{le="0.05"} 24054
        http_request_duration_seconds_bucket{le="0.1"} 33444
        http_request_duration_seconds_bucket{le="0.2"} 100392
        http_request_duration_seconds_bucket{le="0.5"} 129389
        http_request_duration_seconds_bucket{le="1"} 133988
        http_request_duration_seconds_bucket{le="+Inf"} 144320
        http_request_duration_seconds_sum 53423
        http_request_duration_seconds_count 144320
  26. OpenMetrics v2 Prometheus exposition format
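To make the text format above concrete, here is a minimal sketch that renders one counter family by hand, using only the standard library. `Sample` and `FormatCounter` are hypothetical names made up for this example. A real service would normally rely on a client library rather than formatting lines itself, and timestamps are left out for brevity.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// Sample is a hypothetical holder for one labelled counter value.
type Sample struct {
	Labels map[string]string
	Value  float64
}

// labelEscaper applies the three escapes the text format defines for
// label values: backslash, double quote, and line feed.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// FormatCounter renders a counter family: HELP and TYPE comment lines,
// then one line per sample. Labels are sorted so the output is stable.
func FormatCounter(name, help string, samples []Sample) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", name, help)
	fmt.Fprintf(&b, "# TYPE %s counter\n", name)
	for _, s := range samples {
		keys := make([]string, 0, len(s.Labels))
		for k := range s.Labels {
			keys = append(keys, k)
		}
		sort.Strings(keys)
		pairs := make([]string, 0, len(keys))
		for _, k := range keys {
			pairs = append(pairs, k+`="`+labelEscaper.Replace(s.Labels[k])+`"`)
		}
		b.WriteString(name)
		if len(pairs) > 0 {
			b.WriteString("{" + strings.Join(pairs, ",") + "}")
		}
		b.WriteString(" " + strconv.FormatFloat(s.Value, 'g', -1, 64) + "\n")
	}
	return b.String()
}

func main() {
	fmt.Print(FormatCounter("http_requests_total", "The total number of HTTP requests.", []Sample{
		{Labels: map[string]string{"method": "post", "code": "200"}, Value: 1027},
		{Labels: map[string]string{"method": "post", "code": "400"}, Value: 3},
	}))
}
```

Because the format is plain text with a handful of rules, it is easy to inspect with cURL while debugging, which is part of why it spread beyond Prometheus.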

  27. gianarb.it ~ @gianarb

  28. https://github.com/prometheus/prom2json/blob/master/prom2json.go#L123

        func FetchMetricFamilies(url string, ch chan<- *dto.MetricFamily, certificate string, key string, skipServerCertCheck bool) error {
            defer close(ch)
            var transport *http.Transport
            if certificate != "" && key != "" {
                cert, err := tls.LoadX509KeyPair(certificate, key)
                if err != nil {
                    return err
                }
                tlsConfig := &tls.Config{
                    Certificates:       []tls.Certificate{cert},
                    InsecureSkipVerify: skipServerCertCheck,
                }
                tlsConfig.BuildNameToCertificate()
                transport = &http.Transport{TLSClientConfig: tlsConfig}
            } else {
                transport = &http.Transport{
                    TLSClientConfig: &tls.Config{InsecureSkipVerify: skipServerCertCheck},
                }
            }
            client := &http.Client{Transport: transport}
            return decodeContent(client, url, ch)
        }
  29. More to read
    • OpenMetrics: https://github.com/OpenObservability/OpenMetrics
    • OpenMetrics mailing list: https://groups.google.com/forum/#!forum/openmetrics
    • WIP branch for Python library: https://github.com/prometheus/client_python/tree/openmetrics
    • Thanks RICHARD for your work! I got some slides from here: https://promcon.io/2018-munich/slides/openmetrics-transforming-the-prometheus-exposition-format-into-a-global-standard.pdf
  30. @gianarb - gianluca@influxdb.com

  31. @gianarb - gianluca@influxdb.com

  32. How do you “tell stories” about concurrent systems?

  33. None
  34. OpenTracing

  35. gianarb.it ~ @gianarb OpenTracing

  36. I was just waiting for a new standard! cit. Troll

  37. © 2017 InfluxData. All rights reserved. Typical problems with logs • Which library do I need to use? • Every library has a different format • Every language exposes a different format
  38. Tracing is not something new • There are vendors • Every vendor has their own format
  39. [Diagram labels: Parent Span, Child Spans, logs, Span Context / Baggage]
    • Spans - Basic unit of timing and causality. Can be tagged with key/value pairs.
    • Logs - Structured data recorded on a span.
    • Span Context - Serializable format for linking spans across network boundaries. Carries baggage, such as request and client IDs.
    • Tracers - Anything that plugs into the OpenTracing API to record information.
    • Zipkin, Jaeger, LightStep, others
    • Also metrics (Prometheus) and logging
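The vocabulary on this slide can be sketched as a small data structure. The `Span` type and its methods below are hypothetical and written only for illustration. They are not the OpenTracing API, which is what real instrumentation would use. The sketch shows the two properties the slide names: timing (each span records how long it took) and causality (each span points at its parent).

```go
package main

import (
	"fmt"
	"time"
)

// Span is a hypothetical, simplified span, not the OpenTracing type.
type Span struct {
	Operation string
	Parent    *Span             // causality: nil for the root span
	Tags      map[string]string // key/value pairs describing the span
	Logs      []string          // events recorded while the span was open
	Duration  time.Duration     // timing: set by Finish
	start     time.Time
}

// StartSpan opens a span; pass nil as parent to start a new trace.
func StartSpan(operation string, parent *Span) *Span {
	return &Span{
		Operation: operation,
		Parent:    parent,
		Tags:      map[string]string{},
		start:     time.Now(),
	}
}

// SetTag attaches a key/value pair to the span.
func (s *Span) SetTag(key, value string) { s.Tags[key] = value }

// Log records an event on the span.
func (s *Span) Log(event string) { s.Logs = append(s.Logs, event) }

// Finish closes the span and records how long it was open.
func (s *Span) Finish() { s.Duration = time.Since(s.start) }

// Depth counts how many ancestors the span has.
func (s *Span) Depth() int {
	depth := 0
	for p := s.Parent; p != nil; p = p.Parent {
		depth++
	}
	return depth
}

func main() {
	root := StartSpan("http_request", nil)
	root.SetTag("client.id", "42")

	child := StartSpan("cache_lookup", root)
	child.Log("cache timeout")
	child.Finish()
	root.Finish()

	// Print the tree, indenting each span by its depth.
	for _, s := range []*Span{root, child} {
		fmt.Printf("%*s%s took %s\n", s.Depth()*2, "", s.Operation, s.Duration)
	}
}
```

What this sketch leaves out is the span context: the serializable part that crosses process boundaries so a child span in another service can still name its parent.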
  40. OpenTracing API [Diagram labels: application logic, µ-service frameworks, Lambda functions, RPC & control-flow frameworks, existing instrumentation, tracing infrastructure, main(), Instana, Jaeger, microservice process]
  41. Opentracing: Configure the GlobalTracer https://github.com/opentracing/opentracing-go

        import "github.com/opentracing/opentracing-go"
        import ".../some_tracing_impl"

        func main() {
            opentracing.SetGlobalTracer(
                // tracing impl specific:
                some_tracing_impl.New(...),
            )
            ...
        }
  42. Opentracing: Create a Span from the Context https://github.com/opentracing/opentracing-go

        func xyz(ctx context.Context, ...) {
            ...
            span, ctx := opentracing.StartSpanFromContext(ctx, "op_name")
            defer span.Finish()
            span.LogFields(
                log.String("event", "soft error"),
                log.String("type", "cache timeout"),
                log.Int("waited.millis", 1500))
            ...
        }
  43. Opentracing: Create a Child Span https://github.com/opentracing/opentracing-go

        func xyz(parentSpan opentracing.Span, ...) {
            ...
            sp := opentracing.StartSpan(
                "operation_name",
                opentracing.ChildOf(parentSpan.Context()))
            defer sp.Finish()
            ...
        }
  44. Golang and Kubernetes: pprof • It is the Golang native profiler • You can use it via the `go tool pprof` command • `import "runtime/pprof"` writes runtime profiling data • `import "net/http/pprof"` serves runtime profiling data via an HTTP server
  45. Golang and Kubernetes: pprof

        package main

        import (
            "log"
            "net/http"
            _ "net/http/pprof"
        )

        func main() {
            log.Println(http.ListenAndServe("localhost:6060", nil))
        }
  46. Fetching a heap profile over HTTP:

        $ go tool pprof http://localhost:6060/debug/pprof/heap
        Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
        Saved profile in /home/gianarb/pprof/pprof.main.alloc_objects.alloc_space.inuse_objects.inuse_space.009.pb.gz
        File: main
        Type: inuse_space
        Time: Oct 15, 2018 at 4:22pm (CEST)
        Entering interactive mode (type "help" for commands, "o" for options)
        (pprof) png
        Generating report in profile001.png
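The slides show the HTTP route through `net/http/pprof`. The other option listed, `runtime/pprof`, writes the same data without a server, which can help for batch jobs or for dumping a profile on a signal. Below is a minimal sketch. `HeapProfile` is a hypothetical helper name, while `pprof.WriteHeapProfile` and `runtime.GC` are standard library calls.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// HeapProfile returns the current heap profile in the gzip-compressed
// protobuf format that `go tool pprof` reads.
func HeapProfile() ([]byte, error) {
	// Run a collection first so the profile reflects up-to-date statistics.
	runtime.GC()

	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := HeapProfile()
	if err != nil {
		panic(err)
	}
	// In a real program the bytes would be written to a file or uploaded
	// somewhere an engineer can fetch them.
	fmt.Printf("heap profile: %d bytes\n", len(data))
}
```

Saved to a file, the result can be opened with `go tool pprof` followed by the file path, the same way as the profile fetched over HTTP above.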
  47. None
  48. Golang and Kubernetes: pprof https://github.com/influxdata/influxdb/blob/4cbdc197b8117fee648d62e2e5be75c6575352f0/services/httpd/pprof.go

        // handleProfiles determines which profile to return to the requester.
        func (h *Handler) handleProfiles(w http.ResponseWriter, r *http.Request) {
            switch r.URL.Path {
            case "/debug/pprof/cmdline":
                httppprof.Cmdline(w, r)
            case "/debug/pprof/profile":
                httppprof.Profile(w, r)
            case "/debug/pprof/symbol":
                httppprof.Symbol(w, r)
            case "/debug/pprof/all":
                h.archiveProfilesAndQueries(w, r)
            default:
                httppprof.Index(w, r)
            }
        }
  49. Golang and Kubernetes: pprof https://github.com/influxdata/influxdb/blob/4cbdc197b8117fee648d62e2e5be75c6575352f0/services/httpd/pprof.go

        // archiveProfilesAndQueries collects the following profiles:
        //  - goroutine profile
        //  - heap profile
        //  - blocking profile
        //  - mutex profile
        //  - (optionally) CPU profile
        //
        // It also collects the following query results:
        //
        //  - SHOW SHARDS
        //  - SHOW STATS
        //  - SHOW DIAGNOSTICS
        //
        // All information is added to a tar archive and then compressed, before being
        // returned to the requester as an archive file.
  50. Golang and Kubernetes: pprof https://github.com/influxdata/influxdb/blob/4cbdc197b8117fee648d62e2e5be75c6575352f0/services/httpd/pprof.go

        var allProfs = []*prof{
            {Name: "goroutine", Debug: 1},
            {Name: "block", Debug: 1},
            {Name: "mutex", Debug: 1},
            {Name: "heap", Debug: 1},
        }
  51. Collecting the profiles:

        gz := gzip.NewWriter(&resp)
        tw := tar.NewWriter(gz)

        // Collect and write out profiles.
        for _, profile := range allProfs {
            if profile.Name == "cpu" {
                if err := pprof.StartCPUProfile(&buf); err != nil {
                    http.Error(w, err.Error(), http.StatusInternalServerError)
                    return
                }
                sleep(w, time.Duration(profile.Debug)*time.Second)
                pprof.StopCPUProfile()
            } else {
                prof := pprof.Lookup(profile.Name)
                if prof == nil {
                    http.Error(w, "unable to find profile "+profile.Name, 500)
                    return
                }
                if err := prof.WriteTo(&buf, int(profile.Debug)); err != nil {
                    http.Error(w, err.Error(), http.StatusInternalServerError)
                    return
                }
            }
        }
  52. Golang and Kubernetes: pprof https://github.com/influxdata/influxdb/blob/4cbdc197b8117fee648d62e2e5be75c6575352f0/services/httpd/pprof.go

        // Collect and write out the queries.
        var allQueries = []struct {
            name string
            fn   func() ([]*models.Row, error)
        }{
            {"shards", h.showShards},
            {"stats", h.showStats},
            {"diagnostics", h.showDiagnostics},
        }

        tabW := tabwriter.NewWriter(&buf, 8, 8, 1, '\t', 0)
        for _, query := range allQueries {
            rows, err := query.fn()
            if err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
            }
  53. Collecting the diagnostics:

        func (h *Handler) showDiagnostics() ([]*models.Row, error) {
            diags, err := h.Monitor.Diagnostics()
            if err != nil {
                return nil, err
            }

            // Get a sorted list of diagnostics keys.
            sortedKeys := make([]string, 0, len(diags))
            for k := range diags {
                sortedKeys = append(sortedKeys, k)
            }
            sort.Strings(sortedKeys)

            rows := make([]*models.Row, 0, len(diags))
            for _, k := range sortedKeys {
                row := &models.Row{Name: k}
                row.Columns = diags[k].Columns
                row.Values = diags[k].Rows
                rows = append(rows, row)
            }
            return rows, nil
        }
  54. Back to tracing - the cost of a retry

  55. Distributed Tracing - the cost of a retry

  56. Tracing is a debugging tool

  57. None
  58. Remote Debugging in Kubernetes Delve, GDB

  59. Remote Debugging In Kubernetes DOESN’T APPLY!

  60. That’s not totally true. You can expose the debugger endpoint. It works. It doesn’t apply to production workloads.
  61. Summary • Get your hands dirty • Bring developers in the ops loop • Write tools! • Observability: events, metrics, logs, traces • Instrumentation code is a first-class citizen in our app • Tracing
  62. Credits and articles to read • https://gianarb.it • https://www.influxdata.com/blog/monitoring-kubernetes-architecture/ •

    https://www.honeycomb.io/observability/ • https://www.weave.works/technologies/gitops/
  63. ~ @gianarb - https://gianarb.it ~ Thanks @gianarb