Slide 1

Slide 1 text

profile-guided code analysis 2022 / @quasilyte 1

Slide 2

Slide 2 text

Agenda ● Some facts about Go CPU profiles and profiling ● Go profiles parsing ● Custom profiles data aggregation ● Heatmaps intro and why we need them ● Structural code search with heatmap filters ● Profile-guided performance static analysis (PGO) ● Some pprof insights 2

Slide 3

Slide 3 text

Why does this talk exists? 3 A complex, big system We want to make it faster. But sometimes there are no obvious “bottlenecks”. A system as a whole is slow. There are hundreds of small performance issues.

Slide 4

Slide 4 text

Let’s state our goal clearly 4 We’re interested in making the entire system faster We don’t want to change the code in “cold” paths I don’t care about FooBar benchmark running 100 times faster Our motto is: less code changes => more performance impact

Slide 5

Slide 5 text

We should not optimize blindly 5 You need to know which parts of your program are executed. You also need some extra info, like timings. The CPU profiles provide us this and more.

Slide 6

Slide 6 text

CPU profiling in Go 6

Slide 7

Slide 7 text

CPU profiling facts ● Interruption-based (SIGPROF on unix) ● Sample-based (runtime/pprof records 100 samples/sec) ● Writes the output in pprof format (profile.proto) 7

Slide 8

Slide 8 text

What makes a good CPU profile ● Collected for a long time (longer than a few seconds!) ● Collected under an interesting and realistic load ● Aligned with your task* It’s also nice to have several CPU profiles, collected in different configurations. (*) If you’re optimizing a single function, CPU profiles from benchmarks are OK. Otherwise they’re not a good fit. 8

Slide 9

Slide 9 text

Why CPU profiles from benchmarks are bad? They can make irrelevant code look “hot”. They do not show the entire system execution patterns. Merging CPU profiles from all benchmarks doesn’t help. They’re not aligned with our goals. 9

Slide 10

Slide 10 text

profile.proto structure 10

Slide 11

Slide 11 text

Parsing profile.proto files The pprof/profile Go library allows you to parse CPU profiles produced by Go. github.com/google/pprof/profile 11

Slide 12

Slide 12 text

Samples layout 12 Sample Location Location Location Line Line Line Line Line Line Profile Sample

Slide 13

Slide 13 text

Visiting profile samples: a straightforward approach 13 for _, sample := range p.Samples { for _, loc := range sample.Location { for _, l := range loc.Line { sampleValue := sample.Value[1] // time/ns funcName := l.Function.Name filename := l.Function.Filename line := l.Line println(filename, line, funcName, sampleValue) } } }

Slide 14

Slide 14 text

Visiting profile samples: a smarter approach 14 var stack []profile.Line for _, sample := range p.Samples { stack = stack[:0] // reuse memory for every stack for _, loc := range sample.Location { stack = append(stack, loc.Line...) } for i, l := range stack { // handle l } }

Slide 15

Slide 15 text

Visiting profile samples: a smarter approach 15 var stack []profile.Line for _, sample := range p.Samples { stack = stack[:0] for _, loc := range sample.Location { stack = append(stack, loc.Line...) } } stack[0] “self” sample, current function

Slide 16

Slide 16 text

Visiting profile samples: a smarter approach 16 var stack []profile.Line for _, sample := range p.Samples { stack = stack[:0] for _, loc := range sample.Location { stack = append(stack, loc.Line...) } } stack[1:] Callers stack

Slide 17

Slide 17 text

Function.Name parsing 17 * somefunc * runtime.mallocgc * github.com/foo/pkg.Bar * github.com/foo/pkg.Bar.func1 * github.com/foo/pkg.(*Bar).Method

Slide 18

Slide 18 text

Function.Name parsing 18 * somefunc * runtime.mallocgc * github.com/foo/pkg.Bar * github.com/foo/pkg.Bar.func1 * github.com/foo/pkg.(*Bar).Method Some symbols are ambiguous! * Could be a Bar method named “func1” * Could be a lambda “func1” inside “Bar” function Use https://github.com/quasilyte/pprofutil

Slide 19

Slide 19 text

Aggregation: calculating “flat” Map file:line:func keys to sampleValue, but use only stack[0] records. 19

Slide 20

Slide 20 text

Aggregation: calculating “cumulative” (e.g. “cum”) Map file:line:func keys to sampleValue, using all records from stack (all lines). 20

Slide 21

Slide 21 text

Runtime-cumulative aggregation 21

Slide 22

Slide 22 text

Does this look familiar to you? (pprof) top 5 9.80% runtime.findObject 8.40% runtime.scanobject 3.47% runtime.mallocgc 3.42% runtime.heapBitsSetType 2.87% runtime.markBits.isMarked 22

Slide 23

Slide 23 text

So, the Go runtime is slow? All top nodes are showing that most of the time is spent in the runtime. Does this mean that our app itself is fast, but Go runtime is a bottleneck? 23

Slide 24

Slide 24 text

copyBytes definition 24 func copyBytes(b []byte) []byte { dst := make([]byte, len(b)) copy(dst, b) return dst }

Slide 25

Slide 25 text

Benchmarking copyBytes 25 func BenchmarkCopyBytes(b *testing.B) { dst := make([]byte, 2022) for i := 0; i < b.N; i++ { copyBytes(dst) } } go test -bench=. -cpuprofile=cpu.out

Slide 26

Slide 26 text

Where is copyBytes? (pprof) top 5 37.56% runtime.mallocgc 12.16% runtime.memclrNoHeapPointers 9.35% runtime.memmove 8.47% runtime.scanobject 6.42% runtime.scanblock 26

Slide 27

Slide 27 text

User-code example 27 func copyBytes(b []byte) []byte { dst := make([]byte, len(b)) copy(dst, b) return dst } ● mallocgc (allocating a slice) ● memclrNoHeapPointers (memory zeroing) ● memmove (copying memory)

Slide 28

Slide 28 text

Aggregation: runtime-cumulative aggregation Map file:line:func keys to sampleValue, like in a normal flat or cumulative schemes, but for every runtime stack[0] add this value to a first non-runtime caller. 28

Slide 29

Slide 29 text

Runtime-cumulative aggregation results ./runtime-cumulative cpu.out 6.25s example.copyBytes 40ms example.BenchmarkCopyBytes 20ms testing.(*B).runN 29

Slide 30

Slide 30 text

So, the Go runtime is slow? All top nodes are showing that most of the time is spent in the runtime. Does this mean that our app itself is fast, but Go runtime is a bottleneck? No, we just need to aggregate the CPU data correctly. 30

Slide 31

Slide 31 text

Heatmaps (and text editors) 31

Slide 32

Slide 32 text

Idea: display how “hot” the source code line is It would be great to see the performance-sensitive parts of our code right in our text editor. We’ll use heat levels to categories the hotness. 32

Slide 33

Slide 33 text

Heatmaps: building a line-oriented index from a CPU profile We can build a simple index that aggregates all samples from the CPU profile and splits them into categories (heat levels). Then we can tell what is the heat level of the given source code line. 33

Slide 34

Slide 34 text

Building a heatmap 34 file.go:100 0.1s file.go:100 0.3s file.go:120 0.1s file.go:120 0.1s file.go:100 0.4s file.go:130 0.2s file.go:140 0.1s file.go:145 0.1s file.go:150 0.3s file.go:165 0.2s file.go:170 0.2s Take all samples for a file

Slide 35

Slide 35 text

Building a heatmap 35 file.go:100 0.8s file.go:120 0.2s file.go:130 0.2s file.go:140 0.1s file.go:145 0.1s file.go:150 0.3s file.go:165 0.2s file.go:170 0.2s Combine sample values for the same lines

Slide 36

Slide 36 text

Building a heatmap 36 file.go:100 0.8s file.go:150 0.3s file.go:120 0.2s file.go:130 0.2s file.go:165 0.2s file.go:170 0.2s file.go:140 0.1s file.go:145 0.1s Sort samples by their value

Slide 37

Slide 37 text

Building a heatmap 37 file.go:100 0.8s | L=5 file.go:150 0.3s | L=5 file.go:120 0.2s | L=4 file.go:130 0.2s | L=3 file.go:165 0.2s | L=3 file.go:170 0.2s | L=2 file.go:140 0.1s | L=1 file.go:145 0.1s | L=1 Divide them into categories (heat levels)

Slide 38

Slide 38 text

Building a heatmap 38 file.go:100 0.8s | L=5 file.go:150 0.3s | L=4 file.go:120 0.2s | L=3 file.go:130 0.2s | L=2 file.go:165 0.2s | L=0 file.go:170 0.2s | L=0 file.go:140 0.1s | L=0 file.go:145 0.1s | L=0 A threshold can control the % of samples we’re using (top%) Let’s use threshold=0.5

Slide 39

Slide 39 text

https://github.com/quasilyte/vscode-perf-heatmap 39

Slide 40

Slide 40 text

perf-heatmap library I created a library that can be used to build a heatmap index from profile.proto CPU profiles. It’s used in all profile-guided tools presented today. https://github.com/quasilyte/perf-heatmap 40

Slide 41

Slide 41 text

perf-heatmap index properties ● Fast line (or line range) querying ● Relatively compact, low memory usage ● Has both flat and cumulative values ● Only one simple config option: threshold value ● Reliable symbol mapping* (*) It can match the location even if its absolute path differs in profile and local machine. 41

Slide 42

Slide 42 text

gogrep + heatmap 42

Slide 43

Slide 43 text

Structural code search ● Intellij IDE – structural code search and replace (SSR) ● go-ruleguard static analyzer (gogrep lib) ● go-critic static analyzer (gogrep lib) ● gocorpus queries (gogrep lib) ● gogrep code search tool (gogrep lib) Try gogrep – it’s simple and very useful. 43

Slide 44

Slide 44 text

Let’s try looking for some patterns! 44 reflect.TypeOf($x).Size() This pattern finds all reflect.TypeOf() calls that are followed by a Size() call on its result. $x is a wildcard, it’ll match any expression.

Slide 45

Slide 45 text

Let’s try looking for some patterns! 45 $ gogrep . 'reflect.TypeOf($x).Size()' src/foo.go:20: strSize := int(reflect.TypeOf("").Size() src/lib/bar.go:43: return reflect.TypeOf(pair).Size(), nil … + 15 matches Should we rewrite all these 17 cases?

Slide 46

Slide 46 text

gogrep + heatmap filter 46 $ gogrep . --heatmap cpu.out 'reflect.TypeOf($x).Size()' '$$.IsHot()' --heatmap cpu.out A CPU-profile that will be used to build a heatmap

Slide 47

Slide 47 text

gogrep + heatmap filter 47 $ gogrep . --heatmap cpu.out 'reflect.TypeOf($x).Size()' '$$.IsHot()' $$.IsHot() A filter expression. $$ references the entire match, like $0. IsHot() applies a heatmap filter.

Slide 48

Slide 48 text

48 Only one match now! reflect.TypeOf(value.Elem().Interface()).Size()

Slide 49

Slide 49 text

49 Only one match now! reflect.TypeOf(value.Elem().Interface()).Size() => value.Elem().Type().Size() 5 times faster (-80%), 0 allocations

Slide 50

Slide 50 text

50 Hard to find using CPU profiles! reflect.TypeOf(value.Elem().Interface()).Size() Calls involved: reflect.Value.Elem() reflect.Value.Interface() reflect.TypeOf() reflect.Type.Size() + all functions called from them

Slide 51

Slide 51 text

Useful resources: structural code search ● gogrep intro: RU, EN (old CLI interface) ● Profile-guided code search articles: RU, EN ● quasilyte.dev/gocorpus Go corpus with gogrep queries https://github.com/quasilyte/gogrep 51

Slide 52

Slide 52 text

ruleguard + heatmap = ? 52

Slide 53

Slide 53 text

What is go-ruleguard? 53 gogrep patterns advanced filters + = go-ruleguard

Slide 54

Slide 54 text

What is go-perfguard? 54 go-ruleguard heatmap + = go-perfguard

Slide 55

Slide 55 text

perfguard: profile-guided Go optimizer ● Works on the source code level ● Has two main modes: lint and optimize ● Finds performance issues in Go code ● Most issues reported have autofixes 55

Slide 56

Slide 56 text

perfguard: lint mode ● Doesn’t require a CPU profile ● Can be used on CI to avoid slow code being merged ● Less precise and powerful than optimize mode 56

Slide 57

Slide 57 text

perfguard: optimize mode ● Requires a CPU profile ● Finds only real issues ● Contains checks that are impossible in lint mode It’s better than an ordinary static analyzer because it uses a CPU profile to get the information about the actual program execution patterns. 57

Slide 58

Slide 58 text

Running perfguard on our code 58 $ perfguard optimize --heatmap cpu.out ./... --heatmap cpu.out A CPU profile we collected for this application

Slide 59

Slide 59 text

Running perfguard on our code 59 $ perfguard optimize --heatmap cpu.out ./... ./... Analyzing all packages, recursively

Slide 60

Slide 60 text

Analyzing the results If you get some output, be sure to try out the --fix option to autofix the issues found. It’s possible to go deeper and analyze the dependencies. 60

Slide 61

Slide 61 text

Usually, your code has some dependencies… 61 A complex, big system xrouter fasthttp jwt protobuf zap

Slide 62

Slide 62 text

And some of your bottlenecks may live inside them… 62 A complex, big system xrouter fasthttp jwt protobuf zap 14% 12% 3% 21% 8%

Slide 63

Slide 63 text

Some of them are 3rd party -> hard to fix 63 A complex, big system fasthttp jwt protobuf zap 12% 3% 21% 8%

Slide 64

Slide 64 text

But some of them can be under your control 64 A complex, big system xrouter 14%

Slide 65

Slide 65 text

Let’s grab the dependencies for the analysis Execute this: $ go mod vendor Now we should have a “vendor” folder that contains the sources of all our dependencies. You can undo this after we finish with analysis. 65

Slide 66

Slide 66 text

Running perfguard on dependencies 66 $ perfguard optimize --heatmap cpu.out ./vendor/... --heatmap cpu.out The same CPU profile we collected and used on our own source code

Slide 67

Slide 67 text

Running perfguard on dependencies 67 $ perfguard optimize --heatmap cpu.out ./vendor/... ./vendor/... Running the analysis on our dependencies

Slide 68

Slide 68 text

Analyzing the results vendor/jwt/jwt.go:20: []byte(s)... => s... vendor/b/b.go:15: bytes.Buffer => strings.Builder vendor/xrouter/xrouter.go: compiling regexp on hot path vendor/c/c.go:50: allocating const error, use global var 68

Slide 69

Slide 69 text

Analyzing the results vendor/jwt/jwt.go:20: []byte(s)... => s... vendor/b/b.go:15: bytes.Buffer => strings.Builder vendor/xrouter/xrouter.go: compiling regexp on hot path vendor/c/c.go:50: allocating const error, use global var 69

Slide 70

Slide 70 text

Now let’s get the xrouter sources $ git clone github.com/quasilyte/xrouter $ cd xrouter 70

Slide 71

Slide 71 text

Running perfguard on xrouter 71 $ perfguard optimize --heatmap cpu.out ./... --heatmap cpu.out The very same CPU profile again

Slide 72

Slide 72 text

Running perfguard on xrouter 72 $ perfguard optimize --heatmap cpu.out ./... ./... Note: running on the xrouter “own” sources

Slide 73

Slide 73 text

xrouter results vendor/xrouter/xrouter.go: compiling regexp on hot path Only relevant results. Can also use --fix option to apply autofixes. 73

Slide 74

Slide 74 text

Why bother with “go mod vendor”? Maybe it’s possible for perfguard to figure out where to find the relevant sources in the Go mod pkg dir, but vendoring the sources is easier for now. This could change in the future, but for now you’ll need to make the dependencies code easily accessible. 74

Slide 75

Slide 75 text

What perfguard can fix? 75

Slide 76

Slide 76 text

Suggesting one types over the others bytes.Buffer => strings.Builder If local bytes.Buffer is used using the API also covered with strings.Builder and the final result is constructed with String() method, using strings.Builder will result in 1 less allocation. 76

Slide 77

Slide 77 text

Alternative API suggestions io.WriteString(w, fmt.Sprintf("%s:%d", f, l)) => fmt.Fprintf(w, "%s:%d", f, l) 77

Slide 78

Slide 78 text

Removing redundant data copies b = append(b, []byte(s)...) => b = append(b, s...) copy(b, []byte(s)) => copy(b, s) re.Match([]byte(s)) => re.MatchString(s) w.Write([]byte(s)) => w.WriteString(s) And many more transformations that remove excessive data conversions. 78

Slide 79

Slide 79 text

Condition reordering f() && isOK => isOk && f() Putting cheaper expressions first in the condition is almost always a win: you can avoid doing unnecessary calls due to a short-circuit nature of logical operators. 79

Slide 80

Slide 80 text

map[T]bool -> map[T]struct{} When local map[T]bool is used as a set, perfguard can automatically rewrite it to map[T]struct{}, updating all relevant code parts that use it. 80

Slide 81

Slide 81 text

Useful resources: ruleguard & perfguard ● ruleguard intro: RU, EN ● ruleguard by example tour ● ruleguard comparison with Semgrep and CodeQL https://github.com/quasilyte/go-perfguard https://github.com/quasilyte/go-ruleguard 81

Slide 82

Slide 82 text

Final thoughts ● Heatmaps do not replace pprof/flamegraphs/etc ● Filtering results depend on the CPU profile quality ● perfguard can’t magically solve all of your problems 82

Slide 83

Slide 83 text

profile-guided code analysis 2022 / @quasilyte 83