Slide 1

Allocator Wrestling
Eben Freeman @_emfree_

Slide 2

Hi everyone! I'm Eben.
currently: building things at honeycomb.io
these slides: speakerdeck.com/emfree/allocator-wrestling

Slide 4

Why this talk
- Go is a managed-memory language
- The runtime does a lot of sophisticated work on behalf of you, the programmer
- And yet, dynamic memory allocation is not free
- A program's allocation patterns can substantially affect its performance!

Slide 5

A motivating example
Storage / query service at the day job (Honeycomb):
- goal: <5 second query latency
- up to billions of rows per query
- flexible queries, on-the-fly schema changes

Slide 6

A motivating example
A few rounds of just memory-efficiency optimization: 2-3x speedup

Slide 8

Why this talk
- A program's allocation patterns can substantially affect its performance!
- However, those patterns are often syntactically opaque.
- Equipped with understanding and the right set of tools, we can spend our time more wisely.
- Moreover, the runtime's internals are inherently interesting!

Slide 9

Outline
I. To tame the beast, you must first understand its mind
   How do the allocator and garbage collector work?
II. Bring your binoculars
   Tools for understanding allocation patterns
III. Delicious treats for the ravenous allocator
   Strategies for improving memory efficiency

Slide 10

A Caveat
This is a practitioner's perspective. The Go runtime authors are much smarter than me, maybe even smarter than you! And they're always cooking up new stuff.
[Photo caption: the Go runtime team discusses the design of the garbage collector.]

Slide 11

I. Allocator Internals

Slide 12

Memory Layout
Depending on their lifetimes, objects are allocated either on stacks or on the heap.

func f() *int {
    // ...
    x := 22
    y := 44
    // ...
    return &y
}

Here x can live in f's stack frame, but y's address is returned, so y escapes and must be allocated on the heap.

Slide 17

Memory Layout
Design goals for an allocator:
- Efficiently satisfy allocations of a given size, but avoid fragmentation: allocate like-sized objects in blocks
- Avoid locking in the common case: maintain local caches
- Efficiently reclaim freeable memory: use bitmaps for metadata, run GC concurrently

Slide 19

Memory Layout
The heap is divided into two levels of structure: arenas and spans.
Arenas are big chunks of aligned memory. On amd64, each arena is 64MB, so 4 million arenas (2^48 bytes of address space / 2^26 bytes per arena = 2^22 arenas) cover the address space, and we keep track of them in a big global array (mheap.arenas).

Slide 20

So if a stranger on the street hands us a (valid) pointer, we can easily find its heap metadata:

heapArena := mheap_.arenas[ptr / arenaSize]
span := heapArena.spans[(ptr % arenaSize) / pageSize]

Slide 21

What's a span?
Managing the heap at arena granularity isn't practical, so heap objects live in spans. Small objects (<=32KB) live in spans of a fixed size class.

type span struct {
    startAddr uintptr
    npages    uintptr
    spanclass spanClass
    // allocated/free bitmap
    allocBits *gcBits
    // ...
}

Slide 22

What's a span?
Managing the heap at arena granularity isn't practical, so heap objects live in spans. Small objects (<=32KB) live in spans of a fixed size class.
- There are ~70 size classes
- Their spans are 8KB-64KB
- So we can compactly allocate small objects with at most a few MB overhead
For example, a 100-byte allocation gets rounded up to the nearest size class (112 bytes), wasting at most a few bytes per object.

Slide 23

Memory Layout
Each P has an mcache holding a span of each size class. Ideally, allocations can be satisfied directly out of the mcache.

Slide 24

Memory Layout
To allocate, we find the first free object in our cached mspan, then return its address.

type mspan struct {
    startAddr  uintptr
    freeIndex  uintptr // first possibly free slot
    allocCache uint64  // used/free bitmap cache
    // ...
}

Slide 26

Memory Layout
This means that "most" memory allocations are fast and require no locking!
1. Find a cached span with the right size (mcache.mspan[sizeClass]) (if there's none, get a new span and cache it)
2. Find the next free object in the span
3. If necessary, update the heap bitmap (so the garbage collector knows which fields are pointers)
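
The steps above in Go-flavored pseudocode. This is a sketch only, not the runtime's actual code; the helper names here (getMCache, sizeToClass, refillSpan, setHeapBitmap) are invented for illustration, though the real logic lives in runtime.mallocgc:

// Schematic small-object fast path (hypothetical helpers).
func alloc(size uintptr) unsafe.Pointer {
    c := getMCache()                // per-P cache, so no locking
    class := sizeToClass(size)      // round up to a size class
    span := c.spans[class]
    obj := span.nextFree()          // consult the allocCache bitmap
    if obj == nil {
        span = refillSpan(c, class) // slow path: fetch and cache a new span
        obj = span.nextFree()
    }
    setHeapBitmap(obj)              // record which words hold pointers
    return obj
}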

Slide 28

Memory Layout
What have we got so far?
✓ Efficiently satisfy allocations of a given size, but avoid fragmentation: allocate like-sized objects in blocks
✓ Avoid locking in the common case: maintain local caches
? Efficiently reclaim free memory
What about garbage collection?

Slide 29

Garbage collection
We have to find and reclaim objects once they're no longer referenced.

func checksum(filename string) uint64 {
    data := read(filename)
    // compute checksum
}

func read(filename string) []byte {
    // open and read file
}

Slide 33

Garbage collection
We have to find and reclaim objects once they're no longer referenced. Go uses a tricolor concurrent mark-sweep garbage collector. GC is divided into (roughly) two phases:
MARK: find reachable (live) objects (this is where the action happens)
SWEEP: free unreachable objects

Slide 36

Garbage collection
In the mark phase, objects are white, grey, or black. Initially, all objects are white.
- We start by marking goroutine stacks and globals.
- When we reach an object, we mark it grey.
- When an object's referents are all marked, we mark it black.
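
Putting the three rules together, the mark phase behaves roughly like a worklist algorithm. A schematic sketch (pseudocode with invented helpers, not the runtime's implementation, which lives in runtime/mgcmark.go and runs concurrently across workers):

// Schematic tricolor marking.
grey := rootObjects() // goroutine stacks and globals
for !grey.empty() {
    obj := grey.pop()
    for _, ref := range pointersIn(obj) { // found via the arena's pointer bitmap
        if isWhite(ref) {
            markGrey(ref)
            grey.push(ref)
        }
    }
    markBlack(obj) // all referents are now marked
}
// Anything still white is unreachable.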

Slide 38

Garbage collection At the end, objects are either white or black. White objects can then be swept and freed.

Slide 40

Garbage collection
Questions:
- How do we know what an object's referents are?
- How do we actually mark an object?
Use bitmaps for metadata!

Slide 41

Garbage collection
Say we have something like:

type Row struct {
    index int
    data  []uint64
}

How does the garbage collector know what other objects it points to? I.e., which of its fields are pointers?

Slide 42

Garbage collection Remember that this heap object is actually inside an arena! The arena's bitmap tells us which of its words are pointers.

Slide 43

Garbage collection
Similarly, mark state is kept in a span's gcMark bits.

type span struct {
    startAddr uintptr
    // ...
    // allocated/free bitmap
    allocBits *gcBits
    // mark state bitmap
    gcMarkBits *gcBits
    // ...
}

Slide 47

Garbage collection
Once we're done marking, unmarked bits correspond to free slots!

type span struct {
    startAddr uintptr
    // ...
    // allocated/free bitmap
    allocBits *gcBits
    // mark state bitmap
    gcMarkBits *gcBits
    // ...
}
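
Concretely, this makes sweeping a span very cheap: the mark bitmap simply becomes the new allocation bitmap. A simplified sketch of the idea (the real version is in the runtime's span-sweeping code; the details here are approximate):

// After marking: marked == live == allocated, unmarked == free.
span.allocBits = span.gcMarkBits
span.gcMarkBits = newZeroedBits(span.nelems) // fresh bitmap for the next cycle
span.freeIndex = 0                           // next allocation scans from the start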

Slide 49

Garbage collection
The garbage collector is concurrent . . . with a twist. If we're not careful, the application can do sneaky stuff to thwart the garbage collector.

type S struct {
    p *int
}

func f(s *S) *int {
    r := s.p
    s.p = nil
    return r
}

Slide 53

Garbage collection The garbage collector is concurrent . . . with a twist. If we're not careful, the application can do sneaky stuff to thwart the garbage collector. Now we have a live pointer to memory that the garbage collector can free!

Slide 54

Garbage collection
To avoid this peril, the compiler turns pointer writes (*ptr = val) into potential calls into the write barrier; very roughly:

if writeBarrier.enabled {
    shade(*ptr)
    if current stack is grey {
        shade(val)
    }
    *ptr = val
} else {
    *ptr = val
}

Shading *ptr's old value keeps it reachable, so in f above the pointer moved into r can't be collected just because s.p was overwritten.

Slide 56

Garbage collection
While the garbage collector is marking:
- the write barrier is on
- marking consumes resources:
  - background marking
  - GC assist

Slide 57

Garbage collection
During marking, 25% of GOMAXPROCS are dedicated to background marking. But a rapidly allocating goroutine can outrun it.
[Trace screenshot: dedicated GC worker and MARK ASSIST]

Slide 58

Garbage collection
So during marking, a goroutine gets charged for each allocation. If it's in debt, it has to do mark work before continuing.

func mallocgc(size uintptr, ...) unsafe.Pointer {
    // ...
    assistG.gcAssistBytes -= int64(size)
    if assistG.gcAssistBytes < 0 {
        // This goroutine is in debt. Assist the GC to
        // correct this before allocating. This must happen
        // before disabling preemption.
        gcAssistAlloc(assistG)
    }
    // ...
}

Slide 59

Garbage collection
In summary:
- The runtime allocates data in spans to avoid fragmentation
- Local caches speed up allocation, but the allocator still has to do some bookkeeping
- GC is concurrent, but write barriers and mark assists can slow a program
- GC work is proportional to scannable heap

Slide 60

II. Tools

Slide 64

Question
It seems like dynamic memory allocation has some cost. Does this mean that reducing allocations will improve performance?
Well, it depends. The builtin memory profiler can tell us where we're allocating, but doesn't answer the causal question "will reducing allocations make a difference?"
Three tools to start with:
- crude experimenting
- sampling profiling with pprof
- go tool trace

Slide 65

Crude experimenting We think that the allocator and garbage collector have some overhead, but we're not sure how much. Well . . .

Slide 67

Crude experimenting
Turn off the garbage collector or the allocator with runtime flags:
GOGC=off : Disables garbage collector
GODEBUG=sbrk=1 : Replaces entire allocator with simple persistent allocator
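
For example, to compare a benchmark with the collector and then the allocator disabled (the benchmark name here is hypothetical; the environment variables are real):

go test -bench=BenchmarkQuery .
GOGC=off go test -bench=BenchmarkQuery .
GODEBUG=sbrk=1 go test -bench=BenchmarkQuery .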

Slide 68

Crude experimenting
~30% speedup on some benchmarks:
[benchmark comparison chart]

Slide 69

Crude experimenting This might seem kind of stupid, but it's a cheap way to establish expectations. If we see speedup, that's a hint that we can optimize!

Slide 70

Crude experimenting
Problems:
- not viable in production: need synthetic benchmarks
- persistent allocator isn't free either, so this doesn't fully reflect allocation cost

Slide 71

Profiling
A pprof CPU profile can often show time spent in runtime.mallocgc.
Tips:
- Use the flamegraph viewer in the pprof web UI
- If pprof isn't enabled in your binary, you can use Linux perf too
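
If pprof isn't wired up yet, the standard net/http/pprof package registers the handlers for you; a minimal sketch:

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    // ... rest of the program ...
}

Then grab and view a 30-second CPU profile with:

go tool pprof -http=:8080 http://localhost:6060/debug/pprof/profile?seconds=30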

Slide 72

Profiling
[CPU flamegraph: time spent refilling span caches and in GC assist]

Slide 73

Profiling
Problems:
- Program might not be CPU-bound
- Allocation might not be on critical path

Slide 74

go tool trace
The execution tracer might be the best tool at our disposal to understand the impact of allocating. The execution tracer captures very granular runtime events over a short time window:

curl localhost:6060/debug/pprof/trace?seconds=5 > trace.out

which you can visualize in a web UI:

go tool trace trace.out
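
If your program doesn't expose the HTTP endpoint, the runtime/trace package can write a trace directly; a minimal sketch:

import (
    "log"
    "os"
    "runtime/trace"
)

func main() {
    f, err := os.Create("trace.out")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    if err := trace.Start(f); err != nil {
        log.Fatal(err)
    }
    defer trace.Stop()
    // ... the work you want to trace ...
}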

Slide 75

go tool trace However, it can be a bit dense.

Slide 76

go tool trace Remember, top-level GC doesn't mean the program is blocked, but what happens within GC is interesting!

Slide 78

go tool trace
If you're motivated, a CL exists that parses traces to generate minimum mutator utilization curves: https://golang.org/cl/60790

Slide 79

go tool trace Minimum mutator utilization: over a sliding time window (1ms, 10ms, etc.), what was the minimum amount of resources available to mutators (goroutines doing work)?

Slide 80

go tool trace
This is terrific (if you ask me) -- you can see if a production service is GC-bound.
[MMU curves: one service where utilization is < 75% :( and one where utilization is ~ 100% :)]

Slide 81

In Summary
Together, benchmarks with the allocator off, CPU profiles, and execution traces give us a sense of:
- whether allocation / GC are affecting performance
- which call sites are spending a lot of time allocating
- how throughput changes during GC

Slide 82

III. What can we change?

Slide 83

If we've concluded that allocations are a source of inefficiency, what can we do?
- Limit pointers
- Allocate in batches
- Try to recycle objects

Slide 84

But first: what about tuning GOGC?
- Absolutely helps with throughput! However . . .
- If we want to optimize for throughput, GOGC doesn't express the real goal: "use all available memory, but no more"
- Live heap size is generally (but not always) small
- High GOGC makes avoiding OOMs harder
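
For reference, GOGC can also be set programmatically via runtime/debug; 400 here is just an illustrative value:

import "runtime/debug"

// Equivalent to GOGC=400: let the heap grow to 5x the live set
// before the next collection. Returns the previous setting.
old := debug.SetGCPercent(400)
_ = old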

Slide 88

Limit pointers
Sometimes, spurious heap allocations are easily avoided!

func (c *ColumnManager) ReadRows() {
    // ...
    for !abort { // tight loop
        tsRecord := tsReader.read()
        ts := time.Unix(0, tsRecord.Timestamp).UTC() // spurious heap allocation
        if compareTimestamps(&ts, query.Start, query.End) { // gratuitous pointer
            // ...
        }
    }
}

Slide 89

Limit pointers
The Go compiler can be enticed to tell you why a variable is heap-allocated:

go build -gcflags="-m -m"

but its output is a bit unwieldy. https://github.com/loov/view-annotated-file helps digest it.
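
For the ReadRows example above, the escape-analysis output looks roughly like this (illustrative only; the file name, positions, and exact message wording vary by source layout and Go version):

$ go build -gcflags='-m' .
./rows.go:42:3: moved to heap: ts
./rows.go:43:24: &ts escapes to heap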

Slide 90

Limit pointers

for !abort {
    tsRecord := tsReader.read()
    ts := time.Unix(0, tsRecord.Timestamp).UTC()
    // . . .
}

The fix: replace the time.Time (var ts time.Time) with a raw integer timestamp (var tsNanos uint64), avoiding the per-iteration heap allocation.

Slide 91

Limit pointers
Not allocating structs with inner pointers helps the garbage collector too!

func BenchmarkTimeAlloc(b *testing.B) {
    var x []time.Time
    for n := 0; n < b.N; n++ {
        x = make([]time.Time, 1024)
    }
    test.Check(b, len(x) == 1024)
}

func BenchmarkIntAlloc(b *testing.B) {
    var x []int64
    for n := 0; n < b.N; n++ {
        x = make([]int64, 1024)
    }
    test.Check(b, len(x) == 1024)
}

BenchmarkTimeAlloc-4    8880 ns/op
BenchmarkIntAlloc-4     1540 ns/op

Slide 93

Limit pointers
Why this discrepancy?

BenchmarkTimeAlloc-4    8880 ns/op
BenchmarkIntAlloc-4     1540 ns/op

type Time struct {
    wall uint64
    ext  int64
    loc  *Location
}

Sneaky time.Time conceals nefarious pointer!

Slide 94

Slab allocation Is it better to do one horse-sized allocation or 100 duck-sized allocations?

Slide 95

Slab allocation Although smaller allocs make better use of the mcache, larger allocs are faster on a per-byte basis.

Slide 96

Slab allocation
Even though the fast path in the allocator is very optimized, we still need to do some work on every allocation:
- prevent ourselves from being preempted
- check if we need to assist GC
- compute the next free slot in the mcache
- set heap bitmap bits
- etc.

Slide 97

Slab allocation
In some cases, we can amortize that overhead by doing fewer, bigger allocs:

// Allocate individual []interface{}s out of a big buffer
type SlicePool struct {
    bigBuf []interface{}
}

func (s *SlicePool) GetSlice(size int) []interface{} {
    if size >= len(s.bigBuf) {
        s.bigBuf = make([]interface{}, blockSize)
    }
    res := s.bigBuf[:size]
    s.bigBuf = s.bigBuf[size:]
    return res
}
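
Usage is then just slicing off what you need. A sketch, assuming blockSize is a package-level constant and requested sizes stay below it:

const blockSize = 1024

var slicePool SlicePool
row := slicePool.GetSlice(3) // carved from bigBuf: no per-call allocation
row[0], row[1], row[2] = "a", 1, true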

Slide 99

Slab allocation
In some cases, we can amortize that overhead by doing fewer, bigger allocs. The danger is that any live reference will keep the whole slab alive! Also, these aren't safe for concurrent use: best for a few heavily-allocating goroutines.

buf := slicePool.GetSlice(24)
w.Write(buf)
if filters.Match(buf) {
    results <- buf // generous subslice grants immortality to surrounding memory!
}

Slide 100

Recycle allocations
Optimization strategies:
✓ Avoid: limit pointers, don't do dumb stuff
✓ Amortize: do fewer, larger allocations
? Reuse: recycle allocated memory

Slide 101

Recycle allocations
Storage engine architecture: two phases, mediated by channels.

user query: COUNT, P95(duration) GROUP BY user_id WHERE duration > 0 ORDER BY COUNT DESC LIMIT 10

Slide 102

Recycle allocations
Storage engine architecture: two phases, mediated by channels.
- simple, easy to reason about
- maximizes available parallelism
- generates tons of garbage: data passed over channels, format not known in advance

Slide 104

Recycle allocations
Optimization: explicitly recycle allocated blocks of memory.

var buf RowBuffer
select {
case buf = <-recycled:
default:
    buf = NewRowBuffer()
}
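
The other half of the scheme hands buffers back once a query finishes. A sketch (assuming recycled is a buffered channel shared by both phases) that never blocks the worker:

select {
case recycled <- buf:
    // parked for reuse by the next query
default:
    // recycle channel is full; let the GC reclaim buf
}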

Slide 105

Recycle allocations
A more sophisticated version: sync.Pool (sketch below)
- maintains slices of recycled objects, sharded by CPU (with runtime support)
- allows lock-free get/put in the fast path
- caveat: cleared on every GC
Danger:
- must be very careful to zero or overwrite recycled memory
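
A minimal sync.Pool sketch. RowBuffer is from the previous slide; its Reset method is assumed here, to illustrate the zeroing caveat:

var bufPool = sync.Pool{
    New: func() interface{} { return new(RowBuffer) },
}

buf := bufPool.Get().(*RowBuffer)
buf.Reset() // danger: overwrite recycled contents before reuse
// ... fill and drain buf ...
bufPool.Put(buf)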

Slide 106

In Review Your priorities and your mileage may vary!

Slide 107

In Review
- The allocator and garbage collector are pretty ingenious!
- Single allocations are fast but not free
- The garbage collector can stop individual goroutines, even without STW
- GC work depends on pointer density
- Bring your toolbox:
  - benchmark with GC off
  - use CPU profiler to find hot allocations
  - use execution tracer to understand GC pattern
  - use escape analyzer to understand why allocations happen

Slide 108

Thank you!
Special credit to Ian Wilkes and many others at Honeycomb: the true masterminds.
Suggested further reading:
- "Allocation efficiency in high-performance Go services", Achille Roussel / Rick Branson: segment.com/blog/allocation-efficiency-in-high-performance-go-services/
- "Go 1.5 concurrent garbage collector pacing", Austin Clements: golang.org/s/go15gcpacing
- "So You Wanna Go Fast", Tyler Treat: bravenewgeek.com/so-you-wanna-go-fast/
@_emfree_

Slide 109

A Caveat
- Optimizing for latency makes a lot of sense! Reliably low-latency garbage collection is essential for many use cases: chat, streaming, lockservers, low-latency high-fanout services, etc.
- For throughput-oriented use cases, Go's pauseless garbage collector may not be theoretically optimal. (But it might not be an existential burden either.)