Slide 1

Slide 1 text

~ @gianarb - https://gianarb.it ~ Debug like a pro on Kubernetes ScaleConf - Cape Town - 2019

Slide 2

Slide 2 text

Gianluca Arbezzano
Site Reliability Engineer @InfluxData
● http://gianarb.it
● @gianarb
What I like:
● I make dirty hacks that look awesome
● I grow my vegetables
● Travel for fun and work

Slide 3

Slide 3 text

1. Yo n ! Your team knows and uses Docker for local development and testing.
2. Kub te ! Everyone talks about Kubernetes.
3. Hir ! You don’t know why, but you hired a DevOps engineer who kind of knows k8s.
4. Ex i m ! You are moving everything and everyone to Kubernetes.

Slide 4

Slide 4 text

Inspired by a true story

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

You need a good book 1. Short 2. Driven by experiences 3. Practical 4. Easy

Slide 7

Slide 7 text

We need to get our hands dirty

Slide 8

Slide 8 text

Spin up a cluster that you can break Bring developers in the loop

Slide 9

Slide 9 text

Deploy CI on Kubernetes Bring developers in the loop

Slide 10

Slide 10 text

Run your code in prod Bring developers in the loop

Slide 11

Slide 11 text

Don’t be scared and write your own tools!

Slide 12

Slide 12 text

K8s as code: From YAML to code (golang)
1. You can use Golang autocomplete as documentation and as a reference for every Kubernetes resource
2. You feel less like a YAML engineer (great feeling btw)
3. Code is better than YAML! You can reuse it, compile it, embed it in other projects.

Slide 13

Slide 13 text

K8s as code: From YAML to code (golang) Tiny cli to make the migration to golang Some manual refactoring

Slide 14

Slide 14 text

K8s as code: From YAML to code (golang)
Tiny cli to make the migration to golang
Some manual refactoring
● Continue to improve our CI to validate that the YAML and Go files are the same, and that the resources in Kubernetes match the Go files.
● Maybe we will be able to remove the YAML at some point.
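
A minimal sketch of what "resources as Go code" can look like. The structs below are hypothetical, heavily simplified stand-ins for the real Kubernetes API types (those live in k8s.io/api), and `newPod` is an illustrative helper, not the speaker's tool. The point is only that a resource becomes a typed value you can build with a function and reuse.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical, simplified stand-ins for the real Kubernetes API types.
// Only a handful of fields are modelled here.
type Metadata struct {
	Name   string            `json:"name"`
	Labels map[string]string `json:"labels,omitempty"`
}

type Container struct {
	Name  string `json:"name"`
	Image string `json:"image"`
}

type PodSpec struct {
	Containers []Container `json:"containers"`
}

type Pod struct {
	APIVersion string   `json:"apiVersion"`
	Kind       string   `json:"kind"`
	Metadata   Metadata `json:"metadata"`
	Spec       PodSpec  `json:"spec"`
}

// newPod is ordinary Go code, so it can be reused, unit tested, and
// compiled into other tools instead of copy-pasting YAML.
func newPod(name, image string) Pod {
	return Pod{
		APIVersion: "v1",
		Kind:       "Pod",
		Metadata:   Metadata{Name: name, Labels: map[string]string{"app": name}},
		Spec:       PodSpec{Containers: []Container{{Name: name, Image: image}}},
	}
}

func main() {
	// Print the resource as a JSON manifest.
	out, err := json.MarshalIndent(newPod("web", "nginx:1.15"), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

With the real API types the compiler and editor autocomplete play the role of the reference documentation mentioned on the slide.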

Slide 15

Slide 15 text

GitOps Your Git repository is the entrypoint for all your code changes. Infrastructure is ‘as code’, so the place where you make it happen should be Git.

Slide 16

Slide 16 text

Examples
● Everything has an API because you should USE it TO make something good! (cURL is good but you can make something better)
● Some of our tools:
○ Backup and Restore Operator for Persistent Volumes
○ A service that creates isolated environments at runtime, so devs can test and product people have a safe place to demo and try things. We also use it for integration and smoke tests.
○ A tool that replicates an environment locally on minikube, installing and configuring all the dependencies

Slide 17

Slide 17 text

Instrumentation and Observability

Slide 18

Slide 18 text

We need to have processes and tools that give us the ability to take a real time picture of our system

Slide 19

Slide 19 text

Observability Events, metrics, logs and traces

Slide 20

Slide 20 text

Observability It is all about how we collect and aggregate the data

Slide 21

Slide 21 text

Normal state vs Current state

Slide 22

Slide 22 text

Bring developers in the loop You need knowledgeable devs to drive the team

Slide 23

Slide 23 text

Instrumentation code is a first-class citizen in your codebase: OpenCensus
● Open source project sponsored by Google
● It is a spec plus a set of libraries in different languages to instrument your application
● It collects metrics, traces and events.

Slide 24

Slide 24 text

OpenCensus Common Interface to collect stats and traces from your app Different exporters to persist your data
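
The "common interface, different exporters" idea can be sketched in a few lines. This is not the OpenCensus API: `Exporter`, `Recorder`, and both exporter types are hypothetical names invented for illustration. Application code only talks to the recorder, and where the data ends up is decided by which exporters are registered.

```go
package main

import (
	"fmt"
	"strings"
)

// Exporter is a hypothetical interface (not the OpenCensus one):
// anything that can persist a measurement somewhere.
type Exporter interface {
	Export(name string, value float64)
}

// stdoutExporter prints every measurement.
type stdoutExporter struct{}

func (stdoutExporter) Export(name string, value float64) {
	fmt.Printf("%s=%g\n", name, value)
}

// memoryExporter keeps measurements in memory, e.g. for tests.
type memoryExporter struct{ lines []string }

func (m *memoryExporter) Export(name string, value float64) {
	m.lines = append(m.lines, fmt.Sprintf("%s=%g", name, value))
}

// Recorder is the single interface the application instruments against.
type Recorder struct{ exporters []Exporter }

// Register adds a backend without touching the instrumented code.
func (r *Recorder) Register(e Exporter) { r.exporters = append(r.exporters, e) }

// Record fans a measurement out to every registered exporter.
func (r *Recorder) Record(name string, value float64) {
	for _, e := range r.exporters {
		e.Export(name, value)
	}
}

func main() {
	mem := &memoryExporter{}
	r := &Recorder{}
	r.Register(stdoutExporter{})
	r.Register(mem)
	r.Record("http_requests_total", 1)
	fmt.Println(strings.Join(mem.lines, ","))
}
```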

Slide 25

Slide 25 text

# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027 1395066363000
http_requests_total{method="post",code="400"} 3 1395066363000

# Escaping in label values:
msdos_file_access_time_seconds{path="C:\\DIR\\FILE.TXT",error="Cannot find file:\n\"FILE.TXT\""} 1.458255915e9

# Minimalistic line:
metric_without_timestamp_and_labels 12.47

# A weird metric from before the epoch:
something_weird{problem="division by zero"} +Inf -3982045

# A histogram, which has a pretty complex representation in the text format:
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
http_request_duration_seconds_bucket{le="0.2"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320

Slide 26

Slide 26 text

OpenMetrics v2 Prometheus exposition format
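
To make the text format concrete, here is a small sketch that renders one counter family as `# HELP`, `# TYPE`, and one sample line per label set. `renderCounter` is a hypothetical helper written for this example, not part of any Prometheus client library, and it assumes the label strings are already escaped.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renderCounter writes a single counter family in the Prometheus text
// exposition format. samples maps a pre-formatted, already escaped label
// string (e.g. `method="post",code="200"`) to the counter value.
func renderCounter(name, help string, samples map[string]float64) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", name, help)
	fmt.Fprintf(&b, "# TYPE %s counter\n", name)

	// Sort the label sets so the output is deterministic.
	keys := make([]string, 0, len(samples))
	for k := range samples {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	for _, k := range keys {
		fmt.Fprintf(&b, "%s{%s} %g\n", name, k, samples[k])
	}
	return b.String()
}

func main() {
	fmt.Print(renderCounter(
		"http_requests_total",
		"The total number of HTTP requests.",
		map[string]float64{
			`method="post",code="200"`: 1027,
			`method="post",code="400"`: 3,
		},
	))
}
```

In a real service you would normally let a client library produce this output; the sketch only shows how little structure the format needs.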

Slide 27

Slide 27 text

gianarb.it ~ @gianarb

Slide 28

Slide 28 text

func FetchMetricFamilies(url string, ch chan<- *dto.MetricFamily, certificate string, key string, skipServerCertCheck bool) error {
	defer close(ch)
	var transport *http.Transport
	if certificate != "" && key != "" {
		cert, err := tls.LoadX509KeyPair(certificate, key)
		if err != nil {
			return err
		}
		tlsConfig := &tls.Config{
			Certificates:       []tls.Certificate{cert},
			InsecureSkipVerify: skipServerCertCheck,
		}
		tlsConfig.BuildNameToCertificate()
		transport = &http.Transport{TLSClientConfig: tlsConfig}
	} else {
		transport = &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: skipServerCertCheck},
		}
	}
	client := &http.Client{Transport: transport}
	return decodeContent(client, url, ch)
}
https://github.com/prometheus/prom2json/blob/master/prom2json.go#L123

Slide 29

Slide 29 text

More to read
● OpenMetrics: https://github.com/OpenObservability/OpenMetrics
● OpenMetrics mailing list: https://groups.google.com/forum/#!forum/openmetrics
● WIP branch for Python library: https://github.com/prometheus/client_python/tree/openmetrics
● Thanks RICHARD for your work! I got some slides from here: https://promcon.io/2018-munich/slides/openmetrics-transforming-the-prometheus-exposition-format-into-a-global-standard.pdf

Slide 30

Slide 30 text

@gianarb - [email protected]

Slide 31

Slide 31 text

@gianarb - [email protected]

Slide 32

Slide 32 text

How do you “tell stories” about concurrent systems?

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

OpenTracing

Slide 35

Slide 35 text

OpenTracing

Slide 36

Slide 36 text

I was just waiting for a new standard! cit. Troll

Slide 37

Slide 37 text

© 2017 InfluxData. All rights reserved.
Typical problems with logs
● Which library do I need to use?
● Every library has a different format
● Every language exposes a different format

Slide 38

Slide 38 text

Tracing is not something new
● There are vendors
● Every vendor has their own format

Slide 39

Slide 39 text

[Diagram: a parent span and its child spans, each carrying logs, linked through the span context / baggage]
● Spans - Basic unit of timing and causality. Can be tagged with key/value pairs.
● Logs - Structured data recorded on a span.
● Span Context - Serializable format for linking spans across network boundaries. Carries baggage, such as request and client IDs.
● Tracers - Anything that plugs into the OpenTracing API to record information.
● Zipkin, Jaeger, LightStep, others
● Also metrics (Prometheus) and logging
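
The vocabulary above can be modelled with a couple of plain structs. This is a toy model for illustration only, not the OpenTracing API: `Span`, `SpanContext`, and `startSpan` are hypothetical names, and the ID generation is deliberately naive (a global counter, not safe for concurrent use).

```go
package main

import (
	"fmt"
	"time"
)

// SpanContext is the part of a span that crosses process boundaries:
// identifiers plus baggage (key/value pairs such as a request ID).
type SpanContext struct {
	TraceID string
	SpanID  int
	Baggage map[string]string
}

// Span is the basic unit of timing and causality.
type Span struct {
	Operation string
	Context   SpanContext
	ParentID  int               // 0 means "no parent" (root span)
	Tags      map[string]string // key/value pairs describing the span
	Logs      []string          // structured events recorded on the span
	Start     time.Time
	End       time.Time
}

// nextID is a naive ID source for the example; not goroutine-safe.
var nextID int

// startSpan creates a root span when parent is nil, otherwise a child
// that shares the parent's trace ID and inherits a copy of its baggage.
func startSpan(op string, parent *Span) *Span {
	nextID++
	s := &Span{Operation: op, Tags: map[string]string{}, Start: time.Now()}
	s.Context = SpanContext{
		TraceID: fmt.Sprintf("trace-%d", nextID),
		SpanID:  nextID,
		Baggage: map[string]string{},
	}
	if parent != nil {
		s.ParentID = parent.Context.SpanID
		s.Context.TraceID = parent.Context.TraceID
		for k, v := range parent.Context.Baggage {
			s.Context.Baggage[k] = v
		}
	}
	return s
}

// Finish records the end time of the span.
func (s *Span) Finish() { s.End = time.Now() }

func main() {
	root := startSpan("handle_request", nil)
	root.Context.Baggage["request_id"] = "42"

	child := startSpan("query_db", root)
	child.Logs = append(child.Logs, "event=cache timeout")
	child.Finish()
	root.Finish()

	fmt.Println(child.Context.TraceID == root.Context.TraceID, child.Context.Baggage["request_id"])
}
```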

Slide 40

Slide 40 text

[Diagram: inside a microservice process, main(), application logic, µ-service frameworks, Lambda functions, RPC & control-flow frameworks and existing instrumentation all talk to the OpenTracing API, which connects to tracing infrastructure such as Instana and Jaeger]

Slide 41

Slide 41 text

OpenTracing: Configure the GlobalTracer
import "github.com/opentracing/opentracing-go"
import ".../some_tracing_impl"

func main() {
	opentracing.SetGlobalTracer(
		// tracing impl specific:
		some_tracing_impl.New(...),
	)
	...
}
https://github.com/opentracing/opentracing-go

Slide 42

Slide 42 text

OpenTracing: Create a Span from the Context
func xyz(ctx context.Context, ...) {
	...
	span, ctx := opentracing.StartSpanFromContext(ctx, "op_name")
	defer span.Finish()
	span.LogFields(
		log.String("event", "soft error"),
		log.String("type", "cache timeout"),
		log.Int("waited.millis", 1500))
	...
}
https://github.com/opentracing/opentracing-go

Slide 43

Slide 43 text

OpenTracing: Create a Child Span
func xyz(parentSpan opentracing.Span, ...) {
	...
	sp := opentracing.StartSpan(
		"operation_name",
		opentracing.ChildOf(parentSpan.Context()))
	defer sp.Finish()
	...
}
https://github.com/opentracing/opentracing-go

Slide 44

Slide 44 text

Golang and Kubernetes: pprof
● It is the Golang native profiler
● You can use it via the `go tool pprof` command
● `import "runtime/pprof"` writes runtime profiling data
● `import "net/http/pprof"` serves runtime profiling data over HTTP
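
As a small sketch of the `runtime/pprof` side, the program below writes a heap profile into a buffer instead of serving it over HTTP. `heapProfile` is an illustrative helper name; a real tool would more likely write to a file that `go tool pprof` can open.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// heapProfile captures the current heap profile in memory.
// The bytes are in the gzip-compressed format that `go tool pprof` reads.
func heapProfile() ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := heapProfile()
	if err != nil {
		panic(err)
	}
	fmt.Printf("heap profile: %d bytes\n", len(data))
}
```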

Slide 45

Slide 45 text

Golang and Kubernetes: pprof
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof"
)

func main() {
	log.Println(http.ListenAndServe("localhost:6060", nil))
}

Slide 46

Slide 46 text

$ go tool pprof http://localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
Saved profile in /home/gianarb/pprof/pprof.main.alloc_objects.alloc_space.inuse_objects.inuse_space.009.pb.gz
File: main
Type: inuse_space
Time: Oct 15, 2018 at 4:22pm (CEST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) png
Generating report in profile001.png

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

Golang and Kubernetes: pprof
https://github.com/influxdata/influxdb/blob/4cbdc197b8117fee648d62e2e5be75c6575352f0/services/httpd/pprof.go
// handleProfiles determines which profile to return to the requester.
func (h *Handler) handleProfiles(w http.ResponseWriter, r *http.Request) {
	switch r.URL.Path {
	case "/debug/pprof/cmdline":
		httppprof.Cmdline(w, r)
	case "/debug/pprof/profile":
		httppprof.Profile(w, r)
	case "/debug/pprof/symbol":
		httppprof.Symbol(w, r)
	case "/debug/pprof/all":
		h.archiveProfilesAndQueries(w, r)
	default:
		httppprof.Index(w, r)
	}
}

Slide 49

Slide 49 text

Golang and Kubernetes: pprof
https://github.com/influxdata/influxdb/blob/4cbdc197b8117fee648d62e2e5be75c6575352f0/services/httpd/pprof.go
// archiveProfilesAndQueries collects the following profiles:
//  - goroutine profile
//  - heap profile
//  - blocking profile
//  - mutex profile
//  - (optionally) CPU profile
//
// It also collects the following query results:
//
//  - SHOW SHARDS
//  - SHOW STATS
//  - SHOW DIAGNOSTICS
//
// All information is added to a tar archive and then compressed, before being
// returned to the requester as an archive file.
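
The idea described in that comment, bundling several profiles into one compressed archive, can be sketched with the standard library alone. This is not the InfluxDB implementation: `archiveProfiles` and the `.txt` entry names are assumptions made for this example, and the query results are left out.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
	"runtime/pprof"
)

// archiveProfiles looks up each named runtime profile, writes it in the
// human-readable debug=1 form, and packs everything into a tar.gz archive.
func archiveProfiles(names []string) ([]byte, error) {
	var out bytes.Buffer
	gz := gzip.NewWriter(&out)
	tw := tar.NewWriter(gz)

	for _, name := range names {
		p := pprof.Lookup(name)
		if p == nil {
			return nil, fmt.Errorf("unable to find profile %q", name)
		}

		// Buffer the profile first: the tar header needs the size up front.
		var buf bytes.Buffer
		if err := p.WriteTo(&buf, 1); err != nil {
			return nil, err
		}

		hdr := &tar.Header{Name: name + ".txt", Mode: 0600, Size: int64(buf.Len())}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write(buf.Bytes()); err != nil {
			return nil, err
		}
	}

	// Close the tar writer before the gzip writer so both flush in order.
	if err := tw.Close(); err != nil {
		return nil, err
	}
	if err := gz.Close(); err != nil {
		return nil, err
	}
	return out.Bytes(), nil
}

func main() {
	data, err := archiveProfiles([]string{"goroutine", "heap", "block", "mutex"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("archive: %d bytes\n", len(data))
}
```

A single archive is convenient in Kubernetes because one request against the pod gives you everything you need to attach to a bug report.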

Slide 50

Slide 50 text

Golang and Kubernetes: pprof
https://github.com/influxdata/influxdb/blob/4cbdc197b8117fee648d62e2e5be75c6575352f0/services/httpd/pprof.go
var allProfs = []*prof{
	{Name: "goroutine", Debug: 1},
	{Name: "block", Debug: 1},
	{Name: "mutex", Debug: 1},
	{Name: "heap", Debug: 1},
}

Slide 51

Slide 51 text

gz := gzip.NewWriter(&resp)
tw := tar.NewWriter(gz)

// Collect and write out profiles.
for _, profile := range allProfs {
	if profile.Name == "cpu" {
		if err := pprof.StartCPUProfile(&buf); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		sleep(w, time.Duration(profile.Debug)*time.Second)
		pprof.StopCPUProfile()
	} else {
		prof := pprof.Lookup(profile.Name)
		if prof == nil {
			http.Error(w, "unable to find profile "+profile.Name, 500)
			return
		}
		if err := prof.WriteTo(&buf, int(profile.Debug)); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
	}
}

Slide 52

Slide 52 text

Golang and Kubernetes: pprof
https://github.com/influxdata/influxdb/blob/4cbdc197b8117fee648d62e2e5be75c6575352f0/services/httpd/pprof.go
// Collect and write out the queries.
var allQueries = []struct {
	name string
	fn   func() ([]*models.Row, error)
}{
	{"shards", h.showShards},
	{"stats", h.showStats},
	{"diagnostics", h.showDiagnostics},
}

tabW := tabwriter.NewWriter(&buf, 8, 8, 1, '\t', 0)
for _, query := range allQueries {
	rows, err := query.fn()
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}

Slide 53

Slide 53 text

func (h *Handler) showDiagnostics() ([]*models.Row, error) {
	diags, err := h.Monitor.Diagnostics()
	if err != nil {
		return nil, err
	}

	// Get a sorted list of diagnostics keys.
	sortedKeys := make([]string, 0, len(diags))
	for k := range diags {
		sortedKeys = append(sortedKeys, k)
	}
	sort.Strings(sortedKeys)

	rows := make([]*models.Row, 0, len(diags))
	for _, k := range sortedKeys {
		row := &models.Row{Name: k}
		row.Columns = diags[k].Columns
		row.Values = diags[k].Rows
		rows = append(rows, row)
	}
	return rows, nil
}

Slide 54

Slide 54 text

Back to tracing - the cost of a retry

Slide 55

Slide 55 text

Distributed Tracing - the cost of a retry

Slide 56

Slide 56 text

Tracing is a debugging tool

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

Remote Debugging in Kubernetes Delve, GDB

Slide 59

Slide 59 text

Remote Debugging In Kubernetes DOESN’T APPLY!

Slide 60

Slide 60 text

That’s not totally true. You can expose the debugger endpoint, and it works. It just doesn’t apply to production workloads.

Slide 61

Slide 61 text

Summary
● Get your hands dirty
● Bring developers into the ops loop
● Write tools!
● Observability: events, metrics, logs, traces
● Instrumentation code is a first-class citizen in our app
● Tracing

Slide 62

Slide 62 text

Credits and articles to read ● https://gianarb.it ● https://www.influxdata.com/blog/monitoring-kubernetes-architecture/ ● https://www.honeycomb.io/observability/ ● https://www.weave.works/technologies/gitops/

Slide 63

Slide 63 text

Thanks @gianarb