GoGetCommunity - Continuous Profiling of Go Applications

I use profiles to better describe post-mortems and to enrich observability and monitoring signals with concrete information from the binary itself. They are the perfect bridge between ops and developers: when somebody reaches out asking why an application eats all that memory, I can translate the question into a function that I can check out in my editor. I often find myself looking for outages that happened in the past, because cloud providers and Kubernetes increased my resiliency budget: the application gets restarted when it reaches a certain threshold and the system keeps running, but that leak is still a problem that has to be fixed. Having profiles well organized and easy to retrieve is a valuable source of information, and you never know when you will need them. That's why continuous profiling is more important today than ever. I use Profefe to collect and store profiles from all my applications continuously. It is an open-source project that exposes a friendly API and an interface to the concrete storage of your preference, like Badger, S3, Minio, and counting. I will describe how the project works, how I use it with Kubernetes, and how I analyze the collected profiles.

Gianluca Arbezzano

May 21, 2020

Transcript

  1. Continuous Profiling on Go Applications
    Gianluca Arbezzano / @gianarb


  2. [Image slide]

  3. $ go tool pprof http://localhost:14271/debug/pprof/allocs?debug=1
    Fetching profile over HTTP from http://localhost:14271/debug/pprof/allocs?debug=1
    Saved profile in /home/gianarb/pprof/pprof.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz
    Type: inuse_space
    Entering interactive mode (type "help" for commands, "o" for options)
    (pprof) text
    Showing nodes accounting for 1056.92kB, 100% of 1056.92kB total
    Showing top 10 nodes out of 21
    flat flat% sum% cum cum%
    544.67kB 51.53% 51.53% 544.67kB 51.53% github.com/jaegertracing/jaeger/vendor/google.golang.org/grpc/internal/transport.newBufWriter
    512.25kB 48.47% 100% 512.25kB 48.47% time.startTimer
    0 0% 100% 512.25kB 48.47% github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer
    0 0% 100% 512.25kB 48.47% github.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2
    0 0% 100% 512.25kB 48.47% github.com/jaegertracing/jaeger/cmd/agent/app/reporter.(*MetricsReporter).EmitBatch
    0 0% 100% 512.25kB 48.47% github.com/jaegertracing/jaeger/cmd/agent/app/reporter/grpc.(*Reporter).EmitBatch
    0 0% 100% 512.25kB 48.47% github.com/jaegertracing/jaeger/cmd/agent/app/reporter/grpc.(*Reporter).send
    0 0% 100% 512.25kB 48.47% github.com/jaegertracing/jaeger/proto-gen/api_v2.(*collectorServiceClient).PostSpans
    0 0% 100% 512.25kB 48.47% github.com/jaegertracing/jaeger/thrift-gen/jaeger.(*AgentProcessor).Process
    0 0% 100% 512.25kB 48.47% github.com/jaegertracing/jaeger/thrift-gen/jaeger.(*agentProcessorEmitBatch).Process

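The profile above is pulled from the pprof HTTP endpoints that the target (here a Jaeger agent) already exposes. For a Go service that does not expose them yet, a minimal sketch looks like this; the dedicated localhost:6060 listener is an assumption for illustration, not something the deck prescribes:

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
    )

    func main() {
        // Serve the pprof endpoints on a dedicated port so that tooling
        // (go tool pprof, a profile collector, ...) can pull profiles at any time.
        go func() {
            log.Println(http.ListenAndServe("localhost:6060", nil))
        }()

        // ... the rest of the application ...
        select {}
    }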

  4. [Image slide]

  5. [Image slide]

  6. About Me

    Gianluca Arbezzano
    • Work for Packet, Sr. Staff Software Engineer
    • www.gianarb.it / @gianarb
    What I like:
    • I make dirty hacks that look awesome
    • I grow my vegetables
    • Travel for fun and work


  7. [Image slide]

  8. Applications Make Trouble in Production


  9. How Developers Extract Profiles From Production


  10. The common way is by bothering whoever knows best where the applications run and what their IPs are, and asking them for the profiles …


  11. Usually they have better things to do than babysitting SWEs


  12. It is not a SWE’s fault.
    They do not have a good way to retrieve what they need to be effective at their work.


  13. You never know when you will need a profile, and for what or from
    where.


  14. Let’s summarize the issues
    • Developers are usually the profile stakeholders
    • Production is not always a comfortable place to interact with
    • You do not know when you will need a profile; it can be from 2 weeks ago
    • Cloud and Kubernetes increased the amount of noise: a lot more binaries,
    and they go up and down continuously. Containers that OOM get restarted
    transparently, so there is a lot of post-mortem analysis going on


  15. We have the same issue in a different but similar use case


  16. Are you ready to know
    a possible solution?
    Spoiler Alert: it is part of the title


  17. Metrics/Logs
    They are continuously collected and stored in a centralized place.


  18. Follow me
    [Diagram: several APP instances are scraped by a collector, which stores the profiles in a repo exposed through an API]


  19. profefe
    github.com/profefe/profefe


  20. github.com/profefe


  21. github.com/profefe


  22. The pull-based solution was easier to implement for us:
    • Too many applications to re-instrument with the SDK
    • Our services already expose the pprof HTTP handlers by default
    A minimal scrape-and-push sketch follows below.

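To make the pull-based flow concrete, here is a minimal sketch (not from the deck) of a scraper that pulls a heap profile from a service's pprof handler and pushes the raw payload to a profefe collector. The /api/0/profiles query parameters are an assumption modeled on the merge URL shown on slide 31; check the profefe documentation for the exact write API.

    package main

    import (
        "fmt"
        "io"
        "log"
        "net/http"
    )

    // scrapeAndPush pulls one profile from the target's pprof handler and
    // forwards it to profefe. For simplicity the same name ("heap") is used
    // both as the pprof path and the profefe profile type; a CPU profile
    // would need a different path (/debug/pprof/profile).
    func scrapeAndPush(target, profefeAddr, service, profileType string) error {
        resp, err := http.Get(fmt.Sprintf("%s/debug/pprof/%s", target, profileType))
        if err != nil {
            return err
        }
        defer resp.Body.Close()

        // Assumed collector endpoint: POST /api/0/profiles?service=...&type=...
        url := fmt.Sprintf("%s/api/0/profiles?service=%s&type=%s", profefeAddr, service, profileType)
        pushResp, err := http.Post(url, "application/octet-stream", resp.Body)
        if err != nil {
            return err
        }
        defer pushResp.Body.Close()
        if pushResp.StatusCode >= 300 {
            body, _ := io.ReadAll(pushResp.Body)
            return fmt.Errorf("profefe returned %s: %s", pushResp.Status, body)
        }
        return nil
    }

    func main() {
        if err := scrapeAndPush("http://localhost:6060", "http://localhost:10100", "auth", "heap"); err != nil {
            log.Fatal(err)
        }
    }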

  23. [Diagram: many more APP instances than on the previous slide]


  24. How can I automate all this?


  25. I need an API…
    But everything has an API!


  26. Requirements
    • Retrieve the list of candidates (pods, EC2 instances, containers) and their IPs
    • Filter them if needed (scalability via partitioning): you can use labels for this purpose
    • Override or configure the gathering as needed (override the pprof port or path,
    or add more labels to the profile, such as the Go runtime version)
    A minimal sketch of the first step follows below.

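Here is a minimal client-go sketch (not from the deck) of the first requirement, listing candidate pods and their IPs; the in-cluster config, the empty namespace, and the team=platform label selector are illustrative assumptions, and kube-profefe has its own conventions for this.

    package main

    import (
        "context"
        "fmt"
        "log"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    )

    func main() {
        // Use the in-cluster service account; a kubeconfig would work as well.
        cfg, err := rest.InClusterConfig()
        if err != nil {
            log.Fatal(err)
        }
        client, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            log.Fatal(err)
        }

        // List pods across all namespaces, partitioned with a (hypothetical) label selector.
        pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
            LabelSelector: "team=platform",
        })
        if err != nil {
            log.Fatal(err)
        }
        for _, pod := range pods.Items {
            // pod.Status.PodIP is the address the scraper would target.
            fmt.Printf("%s/%s -> %s\n", pod.Namespace, pod.Name, pod.Status.PodIP)
        }
    }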

  27. I have made one for Kubernetes
    github.com/profefe/kube-profefe


  28. [Image slide]

  29. [Image slide]

  30. Developers can take what they need from the Profefe API


  31. Cool things: Merge profile
    go tool pprof 'http://repo.pprof.cluster.local:10100/api/0/profiles/merge?
    service=auth&type=cpu&from=2019-05-30T11:49:00&to=2019-05-30T12:49:00&labels=version=1.0.0'
    Fetching profile over HTTP from http://localhost:10100/api/0/profiles...
    Type: cpu
    Entering interactive mode (type "help" for commands, "o" for options)
    (pprof) top
    Showing nodes accounting for 43080ms, 99.15% of 43450ms total
    Dropped 53 nodes (cum <= 217.25ms)
    Showing top 10 nodes out of 12
    flat flat% sum% cum cum%
    42220ms 97.17% 97.17% 42220ms 97.17% main.load
    860ms 1.98% 99.15% 860ms 1.98% runtime.nanotime
    0 0% 99.15% 21050ms 48.45% main.bar
    0 0% 99.15% 21170ms 48.72% main.baz


  32. 150 Pods * 6 = 900 pprof/hour


  33. Analyze pprof Profiles
    • Easy correlation with other metrics such as mem/cpu usage
    • All those profiles contain useful information
    • Cross-service utilization for performance optimization
    • Give me the top 10 CPU-intensive functions in the whole system
    • Building bridges between dev and ops


  34. Analytics Pipeline
    [Diagram: profiles are stored to object storage, a CreateObject event triggers processing, and samples are pushed as time-series data]

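The deck shows no code for this pipeline, but a minimal sketch of the "push samples as time-series data" step could parse a stored profile with github.com/google/pprof/profile (linked on the next slide) and aggregate sample values per function. The input file name and the print format are placeholders for whatever storage and time-series sink the pipeline actually uses.

    package main

    import (
        "fmt"
        "log"
        "os"

        "github.com/google/pprof/profile"
    )

    func main() {
        // A profile previously collected and stored by the pipeline.
        f, err := os.Open("heap.pb.gz")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        p, err := profile.Parse(f)
        if err != nil {
            log.Fatal(err)
        }

        // Aggregate the first sample value (e.g. alloc_objects for a heap profile)
        // per leaf function, ready to be emitted as time-series points.
        totals := map[string]int64{}
        for _, s := range p.Sample {
            if len(s.Location) == 0 || len(s.Value) == 0 {
                continue
            }
            name := "unknown"
            if lines := s.Location[0].Line; len(lines) > 0 && lines[0].Function != nil {
                name = lines[0].Function.Name
            }
            totals[name] += s.Value[0]
        }
        for fn, v := range totals {
            fmt.Printf("profile_sample{type=%q,function=%q} %d\n", p.SampleType[0].Type, fn, v)
        }
    }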

  35. THANKS
    • https://github.com/profefe/profefe
    • https://ai.google/research/pubs/pub36575
    • https://jvns.ca/blog/2017/09/24/profiling-go-with-pprof/
    • https://github.com/google/pprof
    • https://gianarb.it
