ScaleConf Cape Town - Debug like a pro on Kubernetes

Go applications are a natural fit for containers: you can build a single binary, package it in a tiny Docker image, and ship it to your Kubernetes cluster. A successful production environment requires stability and simplicity; it needs to be easy to troubleshoot, and operators need to be able to collect all the information developers will need to fix a bug. During this talk, Gianluca will share what InfluxData is doing to allow developers and system administrators to work together, understand problems running live at scale on Kubernetes, and escalate them to software engineers using logs, Delve, gdb, core dumps, and traces to replicate and fix issues.

Gianluca Arbezzano

March 08, 2019

Transcript

  1. ~ @gianarb - https://gianarb.it ~
    Debug like a pro on
    Kubernetes
    ScaleConf - Cape Town - 2019

  2. Gianluca Arbezzano
    Site Reliability Engineer @InfluxData
    ● http://gianarb.it
    ● @gianarb
    What I like:
    ● I make dirty hacks that look awesome
    ● I grow my vegetables
    ● Travel for fun and work

  3. 1. Yo n !
    Your team knows and uses Docker for local development and testing.
    2. Kubernetes!
    Everyone talks about Kubernetes.
    3. Hir !
    You don’t know why, but you hired a DevOps engineer who kind of knows k8s.
    4. Ex i m !
    You are moving everything and everyone to Kubernetes.

  4. Inspired by a true story

  6. You need a good book
    1. Short
    2. Driven by experiences
    3. Practical
    4. Easy

  7. We need to get our hands dirty

  8. Spin up a cluster that you
    can break
    Bring developers into the loop

  9. Deploy CI on Kubernetes
    Bring developers into the loop

  10. Run your code in prod
    Bring developers into the loop

  11. Don’t be scared and write your
    own tools!

  12. K8s as code: From YAML to code (golang)
    1. You can use Go autocompletion as documentation and as a reference for every Kubernetes resource
    2. You feel less like a YAML engineer (great feeling btw)
    3. Code is better than YAML! You can reuse it, compile it, embed it in other projects.
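A minimal sketch of the idea, under a loud assumption: `Metadata`, `Spec`, and `Deployment` below are simplified, hypothetical stand-ins declared by hand so the example has no dependencies. A real project would import the typed structs from `k8s.io/api` instead. The point is only that the resource becomes a compiled value you can build in a function, test, and reuse.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Metadata, Spec and Deployment are simplified, hypothetical stand-ins for
// the typed structs that k8s.io/api provides; they are not the real API types.
type Metadata struct {
	Name   string            `json:"name"`
	Labels map[string]string `json:"labels,omitempty"`
}

type Spec struct {
	Replicas int    `json:"replicas"`
	Image    string `json:"image"`
}

type Deployment struct {
	APIVersion string   `json:"apiVersion"`
	Kind       string   `json:"kind"`
	Metadata   Metadata `json:"metadata"`
	Spec       Spec     `json:"spec"`
}

// NewDeployment builds a Deployment value; because it is plain Go it can be
// reused, unit tested, and embedded in other tools.
func NewDeployment(name, image string, replicas int) Deployment {
	return Deployment{
		APIVersion: "apps/v1",
		Kind:       "Deployment",
		Metadata:   Metadata{Name: name, Labels: map[string]string{"app": name}},
		Spec:       Spec{Replicas: replicas, Image: image},
	}
}

// Render serializes the resource to indented JSON. The shape is illustrative
// only and is not a complete, valid Kubernetes Deployment.
func Render(d Deployment) (string, error) {
	out, err := json.MarshalIndent(d, "", "  ")
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	manifest, err := Render(NewDeployment("web", "nginx:1.15", 3))
	if err != nil {
		panic(err)
	}
	fmt.Println(manifest)
}
```

With the real types, the editor autocompletes every field, which is the "documentation" benefit the slide refers to.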

  13. K8s as code: From YAML to code (golang)
    Tiny CLI to make the migration to golang
    Some manual refactoring

  14. K8s as code: From YAML to code (golang)
    Tiny CLI to make the migration to golang
    Some manual refactoring
    ● We continue to improve our CI to validate that the YAML and Go files are the same,
    and that the resources in Kubernetes match the Go files.
    ● Maybe we will be able to remove the YAML at some point.
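One way such a CI check could work, sketched under assumptions: both sides have already been rendered to JSON (the YAML-to-JSON conversion needs a YAML library and is not shown), and `SameResource` is a hypothetical helper name, not the tool used at InfluxData.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// SameResource reports whether two JSON documents describe the same
// structure, ignoring key order and whitespace.
func SameResource(a, b []byte) (bool, error) {
	var left, right interface{}
	if err := json.Unmarshal(a, &left); err != nil {
		return false, fmt.Errorf("first document: %w", err)
	}
	if err := json.Unmarshal(b, &right); err != nil {
		return false, fmt.Errorf("second document: %w", err)
	}
	return reflect.DeepEqual(left, right), nil
}

func main() {
	// Hypothetical inputs: one rendered from the YAML file, one from Go code.
	fromYAML := []byte(`{"kind":"Deployment","spec":{"replicas":3}}`)
	fromGo := []byte(`{"spec": {"replicas": 3}, "kind": "Deployment"}`)
	same, err := SameResource(fromYAML, fromGo)
	if err != nil {
		panic(err)
	}
	fmt.Println("in sync:", same)
}
```

A CI job could fail the build whenever the check returns false, which keeps the two representations from drifting while both still exist.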

  15. GitOps
    Your Git repository is the entrypoint for all your code changes.
    Infrastructure is ‘as code’, so the place where you make it happen should be Git.

  16. Examples
    ● Everything has an API because you should use it to build something good!
    (cURL is good but you can make something better)
    ● Some of our tools:
    ○ Backup and Restore Operator for Persistent Volumes
    ○ A service that creates isolated environments at runtime, so devs can test and product
    people have a safe place to demo and try things. We also use it for integration and smoke
    tests.
    ○ A tool that replicates an environment locally on minikube, installing and configuring all
    the dependencies
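As a hedged illustration of "use the API instead of cURL", the sketch below lists pod names through the Kubernetes REST path `/api/v1/namespaces/{namespace}/pods`. To stay runnable without a cluster it talks to a fake server from `net/http/httptest`. `ListPodNames`, `newFakeAPIServer`, and the trimmed-down `podList` type are hypothetical; a real tool would more likely use `client-go` with proper authentication.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// podList keeps only the fields this example reads from a PodList response.
type podList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
	} `json:"items"`
}

// ListPodNames (hypothetical helper) fetches the pods of a namespace from the
// API server at baseURL and returns their names.
func ListPodNames(client *http.Client, baseURL, namespace string) ([]string, error) {
	resp, err := client.Get(baseURL + "/api/v1/namespaces/" + namespace + "/pods")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	var list podList
	if err := json.NewDecoder(resp.Body).Decode(&list); err != nil {
		return nil, err
	}
	names := make([]string, 0, len(list.Items))
	for _, item := range list.Items {
		names = append(names, item.Metadata.Name)
	}
	return names, nil
}

// newFakeAPIServer returns a test server that answers one pod-list request
// with canned data, standing in for a real API server.
func newFakeAPIServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/v1/namespaces/default/pods", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"items":[{"metadata":{"name":"influxdb-0"}},{"metadata":{"name":"web-1"}}]}`)
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newFakeAPIServer()
	defer srv.Close()
	names, err := ListPodNames(srv.Client(), srv.URL, "default")
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
}
```

Once the call is a function, it can be composed into larger tools such as the backup operator or the environment service mentioned above.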

  17. Instrumentation and
    Observability

  18. We need to have processes and tools that give us
    the ability to take a real-time picture of our system

  19. Observability
    Events, metrics, logs and traces

  20. Observability
    It is all about how we collect and aggregate the data

  21. Normal state vs Current state

  22. Bring developers into the loop
    You need knowledgeable devs to drive the team

  23. Instrumentation code is a first-class citizen in your
    codebase: OpenCensus
    ● Open source project sponsored by Google
    ● It is a spec plus a set of libraries in different languages to instrument your
    application
    ● It collects metrics, traces and events.

  24. OpenCensus
    Common interface to collect stats and traces from your app
    Different exporters to persist your data
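The "common interface, pluggable exporters" design can be sketched in plain Go. This is not the OpenCensus API: `Measurement`, `Exporter`, `Recorder`, and `MemoryExporter` are hypothetical names chosen only to show the shape of the idea.

```go
package main

import "fmt"

// Measurement is one recorded data point.
type Measurement struct {
	Name  string
	Value float64
}

// Exporter persists measurements somewhere (a database, stdout, a vendor).
type Exporter interface {
	Export(m Measurement) error
}

// Recorder is the single interface application code talks to; it fans each
// measurement out to every registered exporter.
type Recorder struct {
	exporters []Exporter
}

// Register adds an exporter that will receive all future measurements.
func (r *Recorder) Register(e Exporter) { r.exporters = append(r.exporters, e) }

// Record sends one measurement to all exporters, stopping at the first error.
func (r *Recorder) Record(name string, value float64) error {
	for _, e := range r.exporters {
		if err := e.Export(Measurement{Name: name, Value: value}); err != nil {
			return fmt.Errorf("export %q: %w", name, err)
		}
	}
	return nil
}

// MemoryExporter keeps measurements in memory, which is handy for tests.
type MemoryExporter struct {
	Seen []Measurement
}

// Export appends the measurement to the in-memory slice.
func (m *MemoryExporter) Export(x Measurement) error {
	m.Seen = append(m.Seen, x)
	return nil
}

func main() {
	mem := &MemoryExporter{}
	rec := &Recorder{}
	rec.Register(mem)
	if err := rec.Record("http_requests", 1); err != nil {
		panic(err)
	}
	fmt.Println(len(mem.Seen), mem.Seen[0].Name)
}
```

The application only depends on `Recorder`, so swapping the storage backend means registering a different exporter rather than touching instrumented code.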

  25. # HELP http_requests_total The total number of HTTP requests.
    # TYPE http_requests_total counter
    http_requests_total{method="post",code="200"} 1027 1395066363000
    http_requests_total{method="post",code="400"} 3 1395066363000
    # Escaping in label values:
    msdos_file_access_time_seconds{path="C:\\DIR\\FILE.TXT",error="Cannot find file:\n\"FILE.TXT\""} 1.458255915e9
    # Minimalistic line:
    metric_without_timestamp_and_labels 12.47
    # A weird metric from before the epoch:
    something_weird{problem="division by zero"} +Inf -3982045
    # A histogram, which has a pretty complex representation in the text format:
    # HELP http_request_duration_seconds A histogram of the request duration.
    # TYPE http_request_duration_seconds histogram
    http_request_duration_seconds_bucket{le="0.05"} 24054
    http_request_duration_seconds_bucket{le="0.1"} 33444
    http_request_duration_seconds_bucket{le="0.2"} 100392
    http_request_duration_seconds_bucket{le="0.5"} 129389
    http_request_duration_seconds_bucket{le="1"} 133988
    http_request_duration_seconds_bucket{le="+Inf"} 144320
    http_request_duration_seconds_sum 53423
    http_request_duration_seconds_count 144320

  26. OpenMetrics
    v2 Prometheus exposition format

  28. https://github.com/prometheus/prom2json/blob/master/prom2json.go#L123
    // FetchMetricFamilies sends the metric families fetched from url to ch and
    // closes ch when done (dto is github.com/prometheus/client_model/go).
    func FetchMetricFamilies(
        url string, ch chan<- *dto.MetricFamily,
        certificate string, key string,
        skipServerCertCheck bool,
    ) error {
        defer close(ch)
        var transport *http.Transport
        if certificate != "" && key != "" {
            // A client certificate was supplied: present it to the server.
            cert, err := tls.LoadX509KeyPair(certificate, key)
            if err != nil {
                return err
            }
            tlsConfig := &tls.Config{
                Certificates:       []tls.Certificate{cert},
                InsecureSkipVerify: skipServerCertCheck,
            }
            tlsConfig.BuildNameToCertificate()
            transport = &http.Transport{TLSClientConfig: tlsConfig}
        } else {
            // No client certificate: only control server verification.
            transport = &http.Transport{
                TLSClientConfig: &tls.Config{InsecureSkipVerify: skipServerCertCheck},
            }
        }
        client := &http.Client{Transport: transport}
        return decodeContent(client, url, ch)
    }

  29. More to read
    ● OpenMetrics: https://github.com/OpenObservability/OpenMetrics
    ● OpenMetrics mailing list: https://groups.google.com/forum/#!forum/openmetrics
    ● WIP branch for Python library:
    https://github.com/prometheus/client_python/tree/openmetrics
    ● Thanks, Richard, for your work! I took some slides from here:
    https://promcon.io/2018-munich/slides/openmetrics-transforming-the-prometheus-exposition-format-into-a-global-standard.pdf

  32. How do you “tell stories” about
    concurrent systems?

  34. OpenTracing

  35. OpenTracing

  36. I was just waiting for a new
    standard!
    cit. Troll

  37. © 2017 InfluxData. All rights reserved.
    Typical problems with logs
    ● Which library do I need to use?
    ● Every library has a different format
    ● Every language exposes a different format

  38. Tracing is not something new
    ● There are vendors
    ● Every vendor has their own format

  39. [Diagram: a parent span with child spans; logs are attached to spans, and the span context / baggage links parent and children]
    ● Spans - Basic unit of timing and causality. Can be tagged with
    key/value pairs.
    ● Logs - Structured data recorded on a span.
    ● Span Context - Serializable format for linking spans across network
    boundaries. Carries baggage, such as request and client IDs.
    ● Tracers - Anything that plugs into the OpenTracing API to record
    information.
    ○ Zipkin, Jaeger, LightStep, others
    ○ Also metrics (Prometheus) and logging

  40. OpenTracing API
    [Diagram labels: microservice process; application logic; µ-service frameworks; Lambda functions; RPC & control-flow frameworks; existing instrumentation; OpenTracing API; main(); tracing infrastructure (Instana, Jaeger)]

  41. OpenTracing: Configure the GlobalTracer
    https://github.com/opentracing/opentracing-go
    import "github.com/opentracing/opentracing-go"
    import ".../some_tracing_impl"
    func main() {
        // Register the tracer once at startup; instrumented code then
        // reaches it through the opentracing package.
        opentracing.SetGlobalTracer(
            // tracing impl specific:
            some_tracing_impl.New(...),
        )
        ...
    }

  42. OpenTracing: Create a Span from the Context
    https://github.com/opentracing/opentracing-go
    // log is "github.com/opentracing/opentracing-go/log"
    func xyz(ctx context.Context, ...) {
        ...
        // Start a span as a child of the span carried by ctx, if any.
        span, ctx := opentracing.StartSpanFromContext(ctx, "op_name")
        defer span.Finish()
        // Attach structured log fields to the span.
        span.LogFields(
            log.String("event", "soft error"),
            log.String("type", "cache timeout"),
            log.Int("waited.millis", 1500))
        ...
    }

  43. OpenTracing: Create a Child Span
    https://github.com/opentracing/opentracing-go
    func xyz(parentSpan opentracing.Span, ...) {
        ...
        // Link the new span to its parent explicitly through the
        // parent's span context.
        sp := opentracing.StartSpan(
            "operation_name",
            opentracing.ChildOf(parentSpan.Context()))
        defer sp.Finish()
        ...
    }

  44. Golang and Kubernetes: pprof
    ● It is the Golang native profiler
    ● You can use it via the `go tool pprof` command
    ● `import "runtime/pprof"` writes runtime profiling data
    ● `import "net/http/pprof"` serves runtime profiling data over HTTP

  45. Golang and Kubernetes: pprof
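    package main
    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers the /debug/pprof/ handlers on http.DefaultServeMux
    )
    func main() {
        // A nil handler means http.DefaultServeMux, where pprof registered itself.
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }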

  46. $ go tool pprof http://localhost:6060/debug/pprof/heap
    Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
    Saved profile in /home/gianarb/pprof/pprof.main.alloc_objects.alloc_space.inuse_objects.inuse_space.009.pb.gz
    File: main
    Type: inuse_space
    Time: Oct 15, 2018 at 4:22pm (CEST)
    Entering interactive mode (type "help" for commands, "o" for options)
    (pprof) png
    Generating report in profile001.png

  48. Golang and Kubernetes: pprof
    https://github.com/influxdata/influxdb/blob/4cbdc197b8117fee648d62e2e5be75c6575352f0/services/httpd/pprof.go
    // handleProfiles determines which profile to return to the requester.
    func (h *Handler) handleProfiles(w http.ResponseWriter, r *http.Request) {
        switch r.URL.Path {
        case "/debug/pprof/cmdline":
            httppprof.Cmdline(w, r)
        case "/debug/pprof/profile":
            httppprof.Profile(w, r)
        case "/debug/pprof/symbol":
            httppprof.Symbol(w, r)
        case "/debug/pprof/all":
            h.archiveProfilesAndQueries(w, r)
        default:
            httppprof.Index(w, r)
        }
    }

  49. Golang and Kubernetes: pprof
    https://github.com/influxdata/influxdb/blob/4cbdc197b8117fee648d62e2e5be75c6575352f0/services/httpd/pprof.go
    // archiveProfilesAndQueries collects the following profiles:
    // - goroutine profile
    // - heap profile
    // - blocking profile
    // - mutex profile
    // - (optionally) CPU profile
    //
    // It also collects the following query results:
    //
    // - SHOW SHARDS
    // - SHOW STATS
    // - SHOW DIAGNOSTICS
    //
    // All information is added to a tar archive and then compressed, before being
    // returned to the requester as an archive file.

  50. Golang and Kubernetes: pprof
    https://github.com/influxdata/influxdb/blob/4cbdc197b8117fee648d62e2e5be75c6575352f0/services/httpd/pprof.go
    var allProfs = []*prof{
        {Name: "goroutine", Debug: 1},
        {Name: "block", Debug: 1},
        {Name: "mutex", Debug: 1},
        {Name: "heap", Debug: 1},
    }

  51. gz := gzip.NewWriter(&resp)
    tw := tar.NewWriter(gz)
    // Collect and write out profiles.
    for _, profile := range allProfs {
        if profile.Name == "cpu" {
            // The CPU profile is sampled for profile.Debug seconds.
            if err := pprof.StartCPUProfile(&buf); err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
            sleep(w, time.Duration(profile.Debug)*time.Second)
            pprof.StopCPUProfile()
        } else {
            // Every other profile is a snapshot looked up by name.
            prof := pprof.Lookup(profile.Name)
            if prof == nil {
                http.Error(w, "unable to find profile "+profile.Name, 500)
                return
            }
            if err := prof.WriteTo(&buf, int(profile.Debug)); err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
        }
    }

  52. Golang and Kubernetes: pprof
    https://github.com/influxdata/influxdb/blob/4cbdc197b8117fee648d62e2e5be75c6575352f0/services/httpd/pprof.go
    // Collect and write out the queries.
    var allQueries = []struct {
        name string
        fn   func() ([]*models.Row, error)
    }{
        {"shards", h.showShards},
        {"stats", h.showStats},
        {"diagnostics", h.showDiagnostics},
    }
    tabW := tabwriter.NewWriter(&buf, 8, 8, 1, '\t', 0)
    for _, query := range allQueries {
        rows, err := query.fn()
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
        }

  53. func (h *Handler) showDiagnostics() ([]*models.Row, error) {
        diags, err := h.Monitor.Diagnostics()
        if err != nil {
            return nil, err
        }
        // Get a sorted list of diagnostics keys.
        sortedKeys := make([]string, 0, len(diags))
        for k := range diags {
            sortedKeys = append(sortedKeys, k)
        }
        sort.Strings(sortedKeys)
        rows := make([]*models.Row, 0, len(diags))
        for _, k := range sortedKeys {
            row := &models.Row{Name: k}
            row.Columns = diags[k].Columns
            row.Values = diags[k].Rows
            rows = append(rows, row)
        }
        return rows, nil
    }

  54. Back to tracing - the cost of a retry

  55. Distributed Tracing - the cost of a retry

  56. Tracing is a debugging tool

  58. Remote Debugging
    in Kubernetes
    Delve, GDB

  59. Remote Debugging
    In Kubernetes
    DOESN’T APPLY!

  60. That’s not totally true. You can expose the
    debugger endpoint, and it works, but it doesn’t apply to
    production workloads.

  61. Summary
    ● Get your hands dirty
    ● Bring developers into the ops loop
    ● Write tools!
    ● Observability: events, metrics, logs, traces
    ● Instrumentation code is a first-class citizen in our app
    ● Tracing

  62. Credits and articles to read
    ● https://gianarb.it
    ● https://www.influxdata.com/blog/monitoring-kubernetes-architecture/
    ● https://www.honeycomb.io/observability/
    ● https://www.weave.works/technologies/gitops/

  63. Thanks
    @gianarb
