Slide 1

Slide 1 text

󰗞 Say hello on Twitter: @felixge Felix Geisendörfer Go Profiling From Bottom Up GopherCon Russia 2021

Slide 2

Slide 2 text

󰗞 Say hello on Twitter: @felixge Go Profiler Notes - https://github.com/DataDog/go-profiler-notes - Research and documentation for anything Go profiling related. - Feedback welcome!

Slide 3

Slide 3 text

󰗞 Say hello on Twitter: @felixge Company Duration What Datadog ~3 month Continuous Go Profiling Product Apple ~6.5 years Secret Manufacturing Stuff using PostgreSQL and Go Thomson Reuters ~1 year XMPP Chat Server in Go Transloadit ~4 years Co-Founder, File Uploading / Encoding, Node.js, AWS, etc. My trail of destruction work history + A lot of open source: https://github.com/felixge (Node.js core contributor, many small Go projects, etc.)

Slide 4

Slide 4 text

󰗞 Say hello on Twitter: @felixge How the Profiling Sausage is Made Not Depicted: Profiling, Sausage

Slide 5

Slide 5 text

󰗞 Say hello on Twitter: @felixge Go ships with great profiling tools, but it’s easy to get confused and misunderstand the data.

Slide 6

Slide 6 text

󰗞 Say hello on Twitter: @felixge “There is a lot of data, yet very little information on what any of it actually means” FlameGraph View go tool pprof CPU Profile Allocation Profile Goroutine Profile Mutex Profile Block Profile gopclntab DWARF Frame Pointers Stop-The-World Pauses CPU Overhead Sampling Bias pprof Format SIGPROF Memory Overhead TSC Spin Locks profiler labels waitsince waitreason gopc Skid Stack Layout cgo Graph View Created by Uwe Kils (iceberg) and User:Wiska Bodo (sky) https://de.wikipedia.org/wiki/Datei:Iceberg.jpg Documentation

Slide 7

Slide 7 text

󰗞 Say hello on Twitter: @felixge Go Profiling from Bottom Up

Slide 8

Slide 8 text

󰗞 Say hello on Twitter: @felixge Assembly Program Counters Instructions

Slide 9

Slide 9 text

󰗞 Say hello on Twitter: @felixge Assembly - Computer like Assembly. It’s easy for them to execute. - Humans don’t like Assembly. It’s difficult for them to read and write.

Slide 10

Slide 10 text

󰗞 Say hello on Twitter: @felixge - Humans like source code. It’s relatively easy for them to read and write. - Computers don’t like source code. They think humans are weird. Source code

Slide 11

Slide 11 text

󰗞 Say hello on Twitter: @felixge That’s why we need Go

Slide 12

Slide 12 text

󰗞 Say hello on Twitter: @felixge Go translates between source code and assembly go build main.go

Slide 13

Slide 13 text

󰗞 Say hello on Twitter: @felixge Humans are not very good writing source code.

Slide 14

Slide 14 text

󰗞 Say hello on Twitter: @felixge Stack Traces

Slide 15

Slide 15 text

󰗞 Say hello on Twitter: @felixge Go translates between assembly and source code panic()

Slide 16

Slide 16 text

󰗞 Say hello on Twitter: @felixge Assembly to source code translation information go build main.go

Slide 17

Slide 17 text

󰗞 Say hello on Twitter: @felixge gopclntab: function lookup table

Slide 18

Slide 18 text

󰗞 Say hello on Twitter: @felixge gopclntab: per-function information

Slide 19

Slide 19 text

󰗞 Say hello on Twitter: @felixge gopclntab: main idea func entry pc main.foo 0x110d6c8 main.bar 0x110d63e main.main 0x110cd08 ... ...

Slide 20

Slide 20 text

󰗞 Say hello on Twitter: @felixge Running a Go Binary

Slide 21

Slide 21 text

󰗞 Say hello on Twitter: @felixge

Slide 22

Slide 22 text

󰗞 Say hello on Twitter: @felixge Stack Traces Unwinding & Symbolization

Slide 23

Slide 23 text

󰗞 Say hello on Twitter: @felixge Unwinding - gopclntab: Used by Go runtime (including all profilers) - Frame Pointers: Used by Linux perf - DWARF: Used by Debuggers, Linux Perf

Slide 24

Slide 24 text

󰗞 Say hello on Twitter: @felixge - Supports unwinding & symbolization - O(N) with regards to number of stack frames - gentraceback() has all the magic - Complex and “slow”: ~1-2 µsec for stack depth 16 - Very little documentation - Works without frame pointers (has “virtual frame pointer tables”) - Can be a significant contributor to Go binary size - Public APIs - unwind: runtime.Callers() - symbolization: runtime.CallersFrames() - both: runtime.Stack() - debug/gosym package gocplntab

Slide 25

Slide 25 text

󰗞 Say hello on Twitter: @felixge - O(N) with regards to number of stack frames - Supports only unwinding, no symbolization - Enabled for all 64bit Go binaries - Allows fast & simple unwinding (~50x faster than gopclntab) - Doesn’t handle inlined functions. - Add 2% overhead to regular program execution. - No -fomit-frame-pointer in Go (IMO that’s a good thing) - Pushed by callee -> Race condition for interrupts (CPU profiling) - runtime: use frame pointers for callers #16638 Frame Pointers

Slide 26

Slide 26 text

󰗞 Say hello on Twitter: @felixge - Supports unwinding & symbolization - Very very complex, but standardized and documented - Rejected for unwinding stacks in the Linux Kernel (only ORCs are allowed there) - Go emits DWARF v4 symbols by default - Rob Pike: proposal: by default, do not write out DWARF #26074. - Would enable faster builds and smaller binaries - A bit of devil’s bargain (breaks linux perf and debuggers) - Debate on whether distinction between DWARF and the devil makes sense is still ongoing ... 🍿 DWARF

Slide 27

Slide 27 text

󰗞 Say hello on Twitter: @felixge Stack Traces 📚 More Information: https://github.com/DataDog/go-profiler-notes/blob/main/stack-traces.md 󰚦 Michael Pratt from the Go team for helping me with gopclntab questions and review.

Slide 28

Slide 28 text

󰗞 Say hello on Twitter: @felixge How can we store stack traces? pprof format

Slide 29

Slide 29 text

󰗞 Say hello on Twitter: @felixge - protocol buffer format, stored using gzip compression. - Contains program counters and symbols (func, file names, etc.). - Go profilers generally output data using the pprof format. pprof format

Slide 30

Slide 30 text

󰗞 Say hello on Twitter: @felixge pprof format - Just a fancy way encoding a frequency table of stack traces: - 󰗲 to FlameGraph OG Brendan Gregg for keeping it real with his folded text format: stack trace count main;foo 5 main;foo;bar 3 main;foobar 4

Slide 31

Slide 31 text

󰗞 Say hello on Twitter: @felixge - For a simple view, use pprofutils: pprof format https://github.com/felixge/pprofutils

Slide 32

Slide 32 text

󰗞 Say hello on Twitter: @felixge - For a more details use go tool pprof -raw pprof format

Slide 33

Slide 33 text

󰗞 Say hello on Twitter: @felixge - For all the gory details use protoc pprof format

Slide 34

Slide 34 text

󰗞 Say hello on Twitter: @felixge pprof format 📚 More Information: https://github.com/DataDog/go-profiler-notes/blob/main/pprof.md

Slide 35

Slide 35 text

󰗞 Say hello on Twitter: @felixge Humans write slow code

Slide 36

Slide 36 text

󰗞 Say hello on Twitter: @felixge Tracing & Profiling

Slide 37

Slide 37 text

󰗞 Say hello on Twitter: @felixge Tracing: Capture all Events Hey, can you write down everything you’re doing? Ugh, that will slow me down a lot, but sure! timestamp event 1 Start X 2 Stop X 2 Start Y 5 Stop Y 5 Start X 6 Stop X ... ... Event Log Blue Gopher Pink Gopher Example: Go Tracer

Slide 38

Slide 38 text

󰗞 Say hello on Twitter: @felixge Go Tracer - Captures all kinds of interesting events in the Go runtime. Various ways for enabling it, e.g. go test -trace=trace.out - Only available in firehose mode, 40-50% overhead. - A significant amount of this is capturing stack traces. - go tool trace can be used to analyze data.

Slide 39

Slide 39 text

󰗞 Say hello on Twitter: @felixge This is slowing me down a little, but low frequency makes it okay. Profiling: Observe by Sampling Hey, what are you doing? Working on X Hey, what are you doing? Working on Y Hey, what are you doing? Working on X event count X 2 Y 1 Frequency Table Blue Gopher Example Go Profiles: CPU, Goroutine Pink Gopher

Slide 40

Slide 40 text

󰗞 Say hello on Twitter: @felixge CPU Profiler - Use setitimer(2) system call to interrupt and stop Go program 100 times per second. - Interrupt happens via SIGPROF signals delivered to “random” threads. Interrupted thread takes stack trace. - Minor Issues: Signal Deliver Bias, Skid, Overhead (a few %), etc. - pprof profile looks like this: stack trace samples/count cpu/nanoseconds main;foo 5 50000000 main;foo;bar 3 30000000 main;foobar 4 40000000

Slide 41

Slide 41 text

󰗞 Say hello on Twitter: @felixge - SIGPROF signals can make certain workloads go 2x faster 🤪. Could be frequency scaling or scheduler issue. CPU Profiler

Slide 42

Slide 42 text

󰗞 Say hello on Twitter: @felixge CPU Profiler 🚧 More research coming soon: https://github.com/DataDog/go-profiler-notes

Slide 43

Slide 43 text

󰗞 Say hello on Twitter: @felixge Goroutine Profiler - Returns a list of all goroutines and their current stack trace - Stops the world (!): ~2-10 µsec / goroutine pause. Possibly problematic when you have > 10-100k goroutines. - fgprof uses it for wall clock profiling, but has scalability issues. - pprof profile looks like this: stack trace goroutine/count main;foo 5 main;foo;bar 3 main;foobar 4

Slide 44

Slide 44 text

󰗞 Say hello on Twitter: @felixge Goroutine Profiler - Different APIs offer different additional details 🙁

Slide 45

Slide 45 text

󰗞 Say hello on Twitter: @felixge Goroutine Profile 📚 More Information: https://github.com/DataDog/go-profiler-notes/blob/main/goroutine.md

Slide 46

Slide 46 text

󰗞 Say hello on Twitter: @felixge Tracing + Profiling: Summarizing a Subset of Events Hey, can you capture every 10th thing you do and summarize it? Sure, I can do that pretty fast. Pink Gopher event count X 2 Y 1 Frequency Table Examples Go Profiles: Heap, Mutex, Block, ThreadCreate Blue Gopher

Slide 47

Slide 47 text

󰗞 Say hello on Twitter: @felixge Heap Profiler Pseudo Code

Slide 48

Slide 48 text

󰗞 Say hello on Twitter: @felixge Heap Profiler - MemProfileRate can be used to control the sampling, default aims at 1 sample per 512KiB of allocs (uses poisson sampling). - inuse = allocs - frees stack trace alloc_objects/count alloc_space/bytes inuse_objects/count inuse_space/bytes main;foo 5 120 2 48 main;foo;bar 3 768 0 0 main;foobar 4 512 1 128

Slide 49

Slide 49 text

󰗞 Say hello on Twitter: @felixge Heap Profiler 🚧 More research coming soon: https://github.com/DataDog/go-profiler-notes

Slide 50

Slide 50 text

󰗞 Say hello on Twitter: @felixge Block Profiler Pseudo Code

Slide 51

Slide 51 text

󰗞 Say hello on Twitter: @felixge Block Profiler - Integrated into channel and mutex codes paths that might “park” a goroutine that has to wait for something, e.g. chansend() while nobody is reading. - Sampling and recording happens after the blocking event is over and the goroutine is scheduled again. - Does NOT track blocked syscalls, time.Sleep(), cgo, spinlocks, etc. stack trace contentions/count delay/nanoseconds main;foo;runtime.chansend1 22820 867549417 main;foo;bar;runtime.chanrecv1 22748 453510869 main;foobar;sync.(*Mutex).Lock 795 5351086

Slide 52

Slide 52 text

󰗞 Say hello on Twitter: @felixge Block Profiler - runtime.SetBlockProfileRate() controls sampling rate. - My research indicates 10000 is a safe default for production.

Slide 53

Slide 53 text

󰗞 Say hello on Twitter: @felixge Block Profiler - Sampling bias favors infrequent long over frequent short events - Submitted patch: runtime/pprof: fix block profile bias

Slide 54

Slide 54 text

󰗞 Say hello on Twitter: @felixge Block Profiler 📚 More Information: https://github.com/DataDog/go-profiler-notes/blob/main/block.md

Slide 55

Slide 55 text

󰗞 Say hello on Twitter: @felixge - Similar to to block profile, but focused on mutexes. - Built into mutex Unlock() code paths (unlike block profile). - Sampling and recording happens when Unlock() unblocks another goroutine that was blocked. - Use runtime.SetMutexProfileFraction() to tune sampling rate. - No sampling bias. Mutex Profiler stack trace contentions/count delay/nanoseconds main;foo;sync.(*Mutex).Unlock 22820 867549417 main;foo;bar;sync.(*Mutex).Unlock 22748 453510869 main;foobar;sync.(*Mutex).Unlock 795 5351086

Slide 56

Slide 56 text

󰗞 Say hello on Twitter: @felixge Mutex Profiler 🚧 More research coming soon: https://github.com/DataDog/go-profiler-notes

Slide 57

Slide 57 text

󰗞 Say hello on Twitter: @felixge - could not get it to work, see: runtime: threadcreate profile is broken #6104 - in theory should capture stack traces that cause new OS threads to be created ThreadCreate Profiler stack trace threadcreate/count main;foo 2 main;foo;bar 1 main;foobar 3

Slide 58

Slide 58 text

󰗞 Say hello on Twitter: @felixge Recap - For profiling, stack traces are load bearing subsystem. - Go uses various techniques: Tracing, Profiling, Hybrid - Understanding how the data is collected is important to do safe profiling in production and interpret the resulting data correctly.

Slide 59

Slide 59 text

󰗞 Say hello on Twitter: @felixge Thank You! Research to be continued ...

Slide 60

Slide 60 text

󰗞 Say hello on Twitter: @felixge Image Credits - Original Gopher by Renee French - BBQ Gopher by @RuixRicardo - Black & White Gophers by Olga Shalakhina - Bottom up Gopher & King Kong Gopher by Egon Elbre - GopherNotes Gopher by Marcus Olsson - Lots of Gophers by Ashley McNamara Questions? Watch https://github.com/DataDog/go-profiler-notes for more research