Allocator Wrestling

Eben Freeman
August 28, 2018

Transcript

  1. Allocator Wrestling
    Eben Freeman
    @_emfree_

  2. Hi everyone! I'm Eben
    currently: building things at honeycomb.io
    these slides:
    speakerdeck.com/emfree/allocator-wrestling

  3. Why this talk
    - Go is a managed-memory language
    - The runtime does a lot of sophisticated work on behalf of you, the programmer

  4. Why this talk
    - Go is a managed-memory language
    - The runtime does a lot of sophisticated work on behalf of you, the programmer
    - And yet, dynamic memory allocation is not free
    - A program's allocation patterns can substantially affect its performance!

  5. A motivating example
    Storage / query service at the day job
    (Honeycomb):
    - goal: <5 second query latency
    - up to billions of rows per query
    - flexible queries,
    on-the-fly schema changes

  6. A motivating example
    A few rounds of just memory-efficiency optimization: 2-3x speedup

  7. Why this talk
    - A program's allocation patterns can substantially affect its performance!
    - However, those patterns are often syntactically opaque.
    - Equipped with understanding and the right set of tools,
    we can spend our time more wisely.

  8. Why this talk
    - A program's allocation patterns can substantially affect its performance!
    - However, those patterns are often syntactically opaque.
    - Equipped with understanding and the right set of tools,
    we can spend our time more wisely.
    - Moreover, the runtime's internals are inherently interesting!

  9. Outline
    I. To tame the beast, you must first understand its mind
    How do the allocator and garbage collector work?
    II. Bring your binoculars
    Tools for understanding allocation patterns
    III. Delicious treats for the ravenous allocator
    Strategies for improving memory efficiency

  10. A Caveat
    This is a practitioner's perspective.
    The Go runtime authors are much smarter than me, maybe even smarter than you!
    And they're always cooking up new stuff.
    (photo: the Go runtime team discusses the design of the garbage collector)

  11. I. Allocator Internals

  12. Memory Layout
    func f() *int {
        // ...
        x := 22 // x can live in f's stack frame
        y := 44 // y's address escapes below, so y must go on the heap
        // ...
        return &y
    }
    Depending on their lifetimes, objects are allocated either on stacks or on the heap.

  14. Memory Layout
    Design goals for an allocator:
    - Efficiently satisfy allocations of a given size, but avoid fragmentation
    - Avoid locking in the common case
    - Efficiently reclaim freeable memory

  15. Memory Layout
    Design goals for an allocator:
    - Efficiently satisfy allocations of a given size, but avoid fragmentation
    allocate like-sized objects in blocks
    - Avoid locking in the common case
    - Efficiently reclaim freeable memory

  16. Memory Layout
    Design goals for an allocator:
    - Efficiently satisfy allocations of a given size, but avoid fragmentation
    allocate like-sized objects in blocks
    - Avoid locking in the common case
    maintain local caches
    - Efficiently reclaim freeable memory

  17. Memory Layout
    Design goals for an allocator:
    - Efficiently satisfy allocations of a given size, but avoid fragmentation
    allocate like-sized objects in blocks
    - Avoid locking in the common case
    maintain local caches
    - Efficiently reclaim freeable memory
    use bitmaps for metadata, run GC concurrently

  18. Memory Layout
    The heap is divided into two levels of structure: arenas and spans.

  19. Memory Layout
    The heap is divided into two levels of structure: arenas and spans.
    Arenas are big chunks of aligned memory.
    On amd64, each arena is 64MB, so 4 million arenas cover the address space,
    and we keep track of them in a big global array (mheap.arenas)

  20. So if a stranger on the street hands us a (valid) pointer,
    we can easily find its heap metadata!
    heapArena := mheap_.arenas[ptr / arenaSize]
    span := heapArena.spans[(ptr % arenaSize) / pageSize]

  21. What's a span?
    Managing the heap at arena granularity isn't practical, so heap objects live in spans.
    Small objects (<=32KB) live in spans of a fixed size class.
    type span struct {
        startAddr uintptr
        npages    uintptr
        spanclass spanClass
        // allocated/free bitmap
        allocBits *gcBits
        // ...
    }

  22. What's a span?
    Managing the heap at arena granularity isn't practical, so heap objects live in spans.
    Small objects (<=32KB) live in spans of a fixed size class.
    - There are ~70 size classes
    - Their spans are 8KB-64KB
    - So we can compactly allocate small objects
    with at most a few MB overhead

  23. Memory Layout
    Each P has an mcache holding a span of each size class.
    Ideally, allocations can be satisfied directly out of the mcache.
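    Roughly (a simplified sketch; the real type in runtime/mcache.go has more
    fields, and numSpanClasses covers pointer-free and pointer-carrying
    variants of each size class):
    type mcache struct {
        // one cached span per span class, ready to allocate from;
        // owned by a single P, so no locking is needed
        alloc [numSpanClasses]*mspan
        // (tiny-allocator state, stack caches, etc. omitted)
    }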

  24. Memory Layout
    To allocate, we find the first free object in our cached mspan, then return its address.
    type mspan struct {
        startAddr  uintptr
        freeIndex  uintptr // first possibly free slot
        allocCache uint64  // used/free bitmap cache
        // ...
    }

  26. Memory Layout
    This means that "most" memory allocations are fast:
    1. Find a cached span with the right size (mcache.mspan[sizeClass])
    (if there's none, get a new span, cache it)
    2. Find the next free object in the span
    3. If necessary, update the heap bitmap
    (so the garbage collector knows which fields are pointers)
    and require no locking!
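    Pulling those steps together, the fast path looks roughly like this
    (a sketch with made-up helper names, not the real mallocgc):
    func mallocFast(size uintptr, hasPointers bool) unsafe.Pointer {
        c := getMCache()                    // this P's cache: no lock
        span := c.alloc[sizeClassFor(size)] // 1. cached span of the right size
        slot := span.nextFreeIndex()        // 2. scan the allocCache bitmap
        if slot == span.nelems {            //    span is full:
            span = c.refill(size)           //    slow path, get a new span
            slot = span.nextFreeIndex()
        }
        p := unsafe.Pointer(span.base() + slot*span.elemsize)
        if hasPointers {
            writeHeapBitmap(p, size)        // 3. record pointer fields for GC
        }
        return p
    }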

  27. Memory Layout
    What have we got so far?
    ✓ Efficiently satisfy allocations of a given size, but avoid fragmentation
    allocate like-sized objects in blocks
    ✓ Avoid locking in the common case
    maintain local caches
    ? Efficiently reclaim free memory

  28. Memory Layout
    What have we got so far?
    ✓ Efficiently satisfy allocations of a given size, but avoid fragmentation
    allocate like-sized objects in blocks
    ✓ Avoid locking in the common case
    maintain local caches
    ? Efficiently reclaim free memory
    What about garbage collection?

  29. Garbage collection
    We have to find and reclaim objects
    once they're no longer referenced.
    func checksum(filename string) uint64 {
        data := read(filename)
        // compute checksum of data
    }
    func read(filename string) []byte {
        // open and read file
    }

  31. Garbage collection
    We have to find and reclaim objects
    once they're no longer referenced.

  32. Garbage collection
    We have to find and reclaim objects
    once they're no longer referenced.
    Go uses a tricolor concurrent mark-sweep
    garbage collector.

  33. Garbage collection
    We have to find and reclaim objects
    once they're no longer referenced.
    Go uses a tricolor concurrent mark-sweep
    garbage collector.
    GC is divided into (roughly) two phases:
    MARK: find reachable (live) objects
    (this is where the action happens)
    SWEEP: free unreachable objects

  34. Garbage collection
    In the mark phase,
    objects are white, grey, or black.
    Initially, all objects are white.
    We start by marking goroutine stacks
    and globals.

  35. Garbage collection
    In the mark phase,
    objects are white, grey, or black.
    Initially, all objects are white.
    We start by marking goroutine stacks
    and globals.
    When we reach an object, we mark it grey.

  36. Garbage collection
    In the mark phase,
    objects are white, grey, or black.
    Initially, all objects are white.
    We start by marking goroutine stacks
    and globals.
    When we reach an object, we mark it grey.
    When an object's referents are all marked,
    we mark it black.

  38. Garbage collection
    At the end, objects are either white or black.
    White objects can then be swept and freed.
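    Schematically, marking is a worklist algorithm (a sketch; the real code
    lives in runtime/mgcmark.go, and workQueue/pointersIn are made up here):
    // grey = marked but not yet scanned; black = marked and scanned
    for !workQueue.empty() {
        obj := workQueue.pop()           // take a grey object
        for _, ref := range pointersIn(obj) {
            if !marked(ref) {
                setMarkBit(ref)          // white -> grey
                workQueue.push(ref)
            }
        }
        // obj is now black: all of its referents are marked
    }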

  39. Garbage collection
    Questions:
    - How do we know what an object's
    referents are?
    - How do we actually mark an object?

  40. Garbage collection
    Questions:
    - How do we know what an object's
    referents are?
    - How do we actually mark an object?
    Use bitmaps for metadata!

  41. Garbage collection
    Say we have something like:
    type Row struct {
        index int
        data  []uint64
    }
    How does the garbage collector know what
    other objects it points to?
    I.e., which of its fields are pointers?

  42. Garbage collection
    Remember that this heap object is actually inside an arena!
    The arena's bitmap tells us which of its words are pointers.
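    For the Row struct above, for instance, the object spans four words on
    amd64 (a slice header is a pointer, a length, and a capacity), and only
    one of them is a pointer (illustrative layout):
    // word 0: index     (scalar)   bitmap bit: 0
    // word 1: data ptr  (pointer)  bitmap bit: 1
    // word 2: data len  (scalar)   bitmap bit: 0
    // word 3: data cap  (scalar)   bitmap bit: 0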

  43. Garbage collection
    Similarly, mark state is kept in a span's gcMarkBits.
    type span struct {
        startAddr uintptr
        // ...
        // allocated/free bitmap
        allocBits *gcBits
        // mark state bitmap
        gcMarkBits *gcBits
        // ...
    }

  47. Garbage collection
    Once we're done marking, unmarked bits correspond to free slots!
    (At sweep time, a span's gcMarkBits can simply become its new allocBits.)
    type span struct {
        startAddr uintptr
        // ...
        // allocated/free bitmap
        allocBits *gcBits
        // mark state bitmap
        gcMarkBits *gcBits
        // ...
    }

  49. Garbage collection
    The garbage collector is concurrent . . . with a twist.
    If we're not careful, the application can do sneaky stuff to thwart the garbage
    collector.
    type S struct {
        p *int
    }
    func f(s *S) *int {
        r := s.p  // grab the pointer...
        s.p = nil // ...then hide it from the collector
        return r
    }

  53. Garbage collection
    The garbage collector is concurrent . . . with a twist.
    If we're not careful, the application can do sneaky stuff to thwart the garbage
    collector.
    Now we have a live pointer to memory that the garbage collector can free!

  54. Garbage collection
    To avoid this peril, the compiler turns pointer writes into potential calls into the
    write barrier; very roughly:
    if writeBarrier.enabled {
        shade(*ptr)
        if current stack is grey {
            shade(val)
        }
    }
    *ptr = val

  56. Garbage collection
    While the garbage collector is marking:
    - the write barrier is on
    - marking consumes resources:
      - background marking
      - GC assist

  57. Garbage collection
    During marking, 25% of GOMAXPROCS are dedicated to background marking.
    But a rapidly allocating goroutine can outrun it.
    (trace screenshot: a dedicated GC worker running alongside a goroutine
    forced into MARK ASSIST)

  58. Garbage collection
    So during marking, a goroutine gets charged for each allocation.
    If it's in debt, it has to do mark work before continuing.
    func mallocgc(size uintptr, ...) unsafe.Pointer {
        // ...
        assistG.gcAssistBytes -= int64(size)
        if assistG.gcAssistBytes < 0 {
            // This goroutine is in debt. Assist the GC to
            // correct this before allocating. This must happen
            // before disabling preemption.
            gcAssistAlloc(assistG)
        }
        // ...

  59. Garbage collection
    In summary:
    - The runtime allocates data in spans to avoid fragmentation
    - Local caches speed up allocation,
    but the allocator still has to do some bookkeeping
    - GC is concurrent, but write barriers and mark assists can slow a program
    - GC work is proportional to scannable heap

  60. II. Tools

  61. Question
    It seems like dynamic memory allocation has some cost.
    Does this mean that reducing allocations will improve performance?

  62. Question
    It seems like dynamic memory allocation has some cost.
    Does this mean that reducing allocations will improve performance?
    Well, it depends.

  63. Question
    It seems like dynamic memory allocation has some cost.
    Does this mean that reducing allocations will improve performance?
    Well, it depends.
    The builtin memory profiler can tell us where we're allocating, but doesn't answer
    the causal question "will reducing allocations make a difference?"

  64. Question
    It seems like dynamic memory allocation has some cost.
    Does this mean that reducing allocations will improve performance?
    Well, it depends.
    The builtin memory profiler can tell us where we're allocating, but doesn't answer
    the causal question "will reducing allocations make a difference?"
    Three tools to start with:
    - crude experimenting
    - sampling profiling with pprof
    - go tool trace

  65. Crude experimenting
    We think that the allocator and garbage collector have some overhead,
    but we're not sure how much.
    Well . . .

  67. Crude experimenting
    Turn off the garbage collector or the allocator with runtime flags:
    GOGC=off : Disables garbage collector
    GODEBUG=sbrk=1 : Replaces entire allocator with simple persistent allocator
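    For example, to compare benchmark runs with each setting (the package
    path here is a placeholder):
    GOGC=off go test -bench=. ./storage/...
    GODEBUG=sbrk=1 go test -bench=. ./storage/...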

  68. Crude experimenting
    ~30% speedup on some benchmarks:

  69. Crude experimenting
    This might seem kind of stupid, but it's a cheap way to establish expectations.
    If we see speedup, that's a hint that we can optimize!

  70. Crude experimenting
    This might seem kind of stupid, but it's a cheap way to establish expectations.
    If we see speedup, that's a hint that we can optimize!
    Problems:
    - not viable in production: need synthetic benchmarks
    - persistent allocator isn't free either, so this doesn't fully reflect allocation cost

  71. Profiling
    A pprof CPU profile can often show time spent in runtime.mallocgc
    Tips:
    - Use the flamegraph viewer in the pprof web UI
    - If pprof isn't enabled in your binary, you can use Linux perf too
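    For example, to grab a 30-second CPU profile and open the web UI
    (assuming a net/http/pprof endpoint is listening on :6060):
    go tool pprof -http=:8080 http://localhost:6060/debug/pprof/profile?seconds=30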

  72. Profiling
    (CPU flamegraph: much of runtime.mallocgc's time goes to refilling span
    caches and to GC assist)

  73. Profiling
    Problems:
    - Program might not be CPU-bound
    - Allocation might not be on critical path

  74. go tool trace
    The execution tracer might be the best tool at our disposal to understand the impact
    of allocating.
    The execution tracer captures very granular runtime events over a short time
    window:
    curl localhost:6060/debug/pprof/trace?seconds=5 > trace.out
    Which you can visualize in a web UI
    go tool trace trace.out
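    (This assumes your binary imports the profiling handlers, which is what
    registers the /debug/pprof/trace endpoint:)
    import _ "net/http/pprof"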

  75. go tool trace
    However, it can be a bit dense.

  76. go tool trace
    Remember, top-level GC doesn't mean the program is blocked,
    but what happens within GC is interesting!

  77. go tool trace
    Remember, top-level GC doesn't mean the program is blocked,
    but what happens within GC is interesting!
    (trace screenshot: within a GC, a dedicated GC worker plus goroutines in
    MARK ASSIST)

  78. go tool trace
    If you're motivated, a CL exists that parses traces to generate
    minimum mutator utilization curves
    https://golang.org/cl/60790

  79. go tool trace
    Minimum mutator utilization: over a sliding time window (1ms, 10ms, etc.),
    what was the minimum amount of resources available to mutators (goroutines
    doing work)?

  80. go tool trace
    This is terrific (if you ask me) -- you can see if a production service is GC-bound
    (MMU plots for two services: one whose minimum utilization is < 75% :(,
    one whose utilization is ~ 100% :))

  81. In Summary
    Together, benchmarks with the allocator off, CPU profiles, and execution traces give
    us a sense of:
    - whether allocation / GC are affecting performance
    - which call sites are spending a lot of time allocating
    - how throughput changes during GC.

  82. III. What can we change?

  83. If we've concluded that allocations are a source of inefficiency, what can we do?
    - Limit pointers
    - Allocate in batches
    - Try to recycle objects

  84. But first
    What about tuning GOGC?
    - Absolutely helps with throughput! However . . .
    - If we want to optimize for throughput, GOGC doesn't express the real goal:
      "use all available memory, but no more"
    - Live heap size is generally (but not always) small
    - High GOGC makes avoiding OOMs harder
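    (For reference: GOGC=100, the default, starts a collection when the heap
    reaches roughly twice the live set; higher values trade memory for fewer
    GC cycles. A hypothetical invocation:)
    GOGC=400 ./query-engine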

  85. Limit pointers
    Sometimes, spurious heap allocations are easily avoided!
    func (c *ColumnManager) ReadRows() {
        // ...
        for !abort { // tight loop
            tsRecord := tsReader.read()
            // gratuitous pointer -> spurious heap allocation:
            ts := time.Unix(0, tsRecord.Timestamp).UTC()
            if compareTimestamps(&ts, query.Start, query.End) {
                // ...
            }
        }
    }

  89. Limit pointers
    The Go compiler can be enticed to tell you why a variable is heap-allocated:
    go build -gcflags="-m -m"
    but its output is a bit unwieldy.
    https://github.com/loov/view-annotated-file helps digest it:
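    The compiler's diagnostics look like this (illustrative lines, not from a
    real build of this code):
    ./rows.go:117:9: moved to heap: ts
    ./rows.go:118:22: &ts escapes to heap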

  90. Limit pointers
    for !abort {
        tsRecord := tsReader.read()
        ts := time.Unix(0, tsRecord.Timestamp).UTC()
        // . . .
    }
    The fix: keep timestamps as plain integers
    (var tsNanos uint64) rather than constructing
    a time.Time (var ts time.Time) in the loop.

  91. Limit pointers
    Not allocating structs with inner pointers helps the garbage collector too!
    func BenchmarkTimeAlloc(b *testing.B) {
        var x []time.Time
        for n := 0; n < b.N; n++ {
            x = make([]time.Time, 1024)
        }
        test.Check(b, len(x) == 1024)
    }
    func BenchmarkIntAlloc(b *testing.B) {
        var x []int64
        for n := 0; n < b.N; n++ {
            x = make([]int64, 1024)
        }
        test.Check(b, len(x) == 1024)
    }
    BenchmarkTimeAlloc-4   8880 ns/op
    BenchmarkIntAlloc-4    1540 ns/op

  92. Limit pointers
    Why this discrepancy?
    BenchmarkTimeAlloc-4   8880 ns/op
    BenchmarkIntAlloc-4    1540 ns/op

  93. Limit pointers
    Why this discrepancy?
    BenchmarkTimeAlloc-4   8880 ns/op
    BenchmarkIntAlloc-4    1540 ns/op
    type Time struct {
        wall uint64
        ext  int64
        loc  *Location
    }
    Sneaky time.Time conceals nefarious pointer!

  94. Slab allocation
    Is it better to do one horse-sized allocation or 100 duck-sized allocations?

  95. Slab allocation
    Although smaller allocs make better use of the mcache, larger allocs are faster on a
    per-byte basis.

  96. Slab allocation
    Even though the fast path in the allocator is very optimized, we still need to do some
    work on every allocation:
    - prevent ourselves from being preempted
    - check if we need to assist GC
    - compute the next free slot in the mcache
    - set heap bitmap bits
    - etc.

  97. Slab allocation
    In some cases, we can amortize that overhead by doing fewer, bigger allocs:
    // Allocate individual []interface{}s out of a big buffer
    type SlicePool struct {
        bigBuf []interface{}
    }
    func (s *SlicePool) GetSlice(size int) []interface{} {
        // refill when the buffer can't satisfy this request
        // (assumes size <= blockSize)
        if size > len(s.bigBuf) {
            s.bigBuf = make([]interface{}, blockSize)
        }
        res := s.bigBuf[:size]
        s.bigBuf = s.bigBuf[size:]
        return res
    }

  98. Slab allocation
    In some cases, we can amortize that overhead by doing fewer, bigger allocs.
    The danger is that any live reference will keep the whole slab alive!
    buf := slicePool.GetSlice(24)
    w.Write(buf)
    if filters.Match(buf) {
        results <- buf
    }
    Generous subslice grants immortality to surrounding memory!

  99. Slab allocation
    In some cases, we can amortize that overhead by doing fewer, bigger allocs.
    The danger is that any live reference will keep the whole slab alive!
    Also, these aren't safe for concurrent use: best for a few heavily-allocating
    goroutines.

  100. Recycle allocations
    Optimization strategies:
    ✓ Avoid
    limit pointers, don't do dumb stuff
    ✓ Amortize
    do fewer, larger allocations
    ? Reuse
    recycle allocated memory

  101. Recycle allocations
    Storage engine architecture: two phases, mediated by channels
    user query:
    COUNT, P95(duration)
    GROUP BY user_id
    WHERE duration > 0
    ORDER BY COUNT DESC
    LIMIT 10

  102. Recycle allocations
    Storage engine architecture: two phases, mediated by channels
    - simple, easy to reason about
    - maximizes available parallelism
    - generates tons of garbage:
      - data passed over channels
      - format not known in advance

  103. Recycle allocations
    Optimization: explicitly recycle allocated blocks of memory

  104. Recycle allocations
    Optimization: explicitly recycle allocated blocks of memory
    var buf RowBuffer
    select {
    case buf = <-recycled:
    default:
        buf = NewRowBuffer()
    }
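    The consuming side hands buffers back with a non-blocking send (a sketch;
    if the recycle channel is full, the buffer is simply dropped and left for
    the GC):
    select {
    case recycled <- buf:
    default:
    }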

  105. Recycle allocations
    A more sophisticated version: sync.Pool
    - maintains slices of recycled objects, sharded by CPU (with runtime support)
    - allows lock-free get/put in the fast path
    - caveat: cleared on every GC
    Danger:
    - must be very careful to zero or overwrite recycled memory
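    A minimal usage sketch (RowBuffer and its Reset method are assumed):
    var bufPool = sync.Pool{
        New: func() interface{} { return new(RowBuffer) },
    }
    buf := bufPool.Get().(*RowBuffer)
    buf.Reset() // zero/overwrite recycled memory before use!
    // ... use buf ...
    bufPool.Put(buf)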

  106. In Review
    Your priorities and your mileage may vary!

  107. In Review
    - The allocator and garbage collector are pretty ingenious!
    - Single allocations are fast but not free
    - The garbage collector can stop individual goroutines, even without STW
    - GC work depends on pointer density
    - Bring your toolbox:
    - benchmark with GC off
    - use CPU profiler to find hot allocations
    - use execution tracer to understand GC pattern
    - use escape analyzer to understand why allocations happen

  108. Thank you!
    Special credit to Ian Wilkes and many others at
    Honeycomb: the true masterminds
    Suggested further reading:
    Allocation efficiency in high-performance Go services
    Achille Roussel / Rick Branson
    segment.com/blog/allocation-efficiency-in-high-performance-go-services/
    Go 1.5 concurrent garbage collector pacing
    Austin Clements
    golang.org/s/go15gcpacing
    So You Wanna Go Fast
    Tyler Treat
    bravenewgeek.com/so-you-wanna-go-fast/
    @_emfree_

  109. A Caveat
    - Optimizing for latency makes a lot of sense!
      Reliably low-latency garbage collection is essential for many use cases:
      chat, streaming, lockservers, low-latency high-fanout services, etc.
    - For throughput-oriented use cases, Go's pauseless garbage collector may not
      be theoretically optimal.
      (But it might not be an existential burden either.)
