So You Wanna Go Fast?

Tyler Treat
September 29, 2017

Go's simplicity and concurrency model make it an appealing choice for backend systems, but how does it fare for latency-sensitive applications? In this talk, we explore the other side of the coin by providing some tips on writing high-performance Go and lessons learned in the process. We do a deep dive on low-level performance optimizations in order to make Go a more compelling option in the world of systems programming, but we also consider the trade-offs involved.

Transcript

  1. So You Wanna Go Fast?
    Strange Loop 2017 @tyler_treat

  2. @tyler_treat

  3. @tyler_treat
    Make your code faster with this one weird trick

  4. @tyler_treat
    Make your code faster with this one weird trick

  5. @tyler_treat
    So You Wanna Subvert Go?

  6. @tyler_treat
    Spoiler Alert: Go is not a systems language…

  7. @tyler_treat
    but that doesn’t mean you can’t build internet-scale systems with it.

  8. @tyler_treat

  9. @tyler_treat
    This is a talk about how to write terrible Go code.

  10. @tyler_treat

  11. @tyler_treat
    Because this is a talk about trade-offs.

  12. @tyler_treat
    Tyler Treat
    - Messaging Nerd @ Apcera
    - Working on nats.io
    - Distributed systems
    - bravenewgeek.com

  13. @tyler_treat

  14. @tyler_treat
    Why does this talk matter?

  15. @tyler_treat
    The compiler isn’t magic.

  16. @tyler_treat
    The compiler isn’t magic.

  17. @tyler_treat
    You have to be mindful of performance when it matters.

  18. @tyler_treat
    Where bad things hide

  19. @tyler_treat
    Where bad things hide
    Where we’re usually looking

  20. @tyler_treat
    Tire fires at scale

  21. @tyler_treat

  22. @tyler_treat

  23. @tyler_treat

  24. @tyler_treat
    Overview
    - Measuring performance
    - Language features
    - Memory management
    - Concurrency and multi-core

  25. @tyler_treat
    Overview
    - Measuring performance
    - Language features
    - Memory management
    - Concurrency and multi-core

  26. @tyler_treat
    Disclaimer: Don’t blindly apply optimizations presented.

  27. @tyler_treat
    tl;dr of this talk is “IT DEPENDS!”

  28. @tyler_treat
    Measure
    Optimize

  29. @tyler_treat
    Measurement Techniques
    - pprof
      - memory
      - cpu
      - blocking
    - GODEBUG
      - gctrace
      - schedtrace
      - allocfreetrace
    - Benchmarking
      - Code-level: testing.B
      - System-level: HdrHistogram (https://github.com/codahale/hdrhistogram), bench (https://github.com/tylertreat/bench)
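
As a minimal sketch of the code-level benchmarking listed on the slide above, the program below drives two functions through the testing package. The functions joinNaive and joinBuilder and the helper bench are hypothetical names chosen for illustration; they are not from the talk. testing.Benchmark is used so the sketch runs as a plain program; under "go test -bench" the same loop would live in a func BenchmarkXxx(b *testing.B).

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// joinNaive builds a string by repeated concatenation.
func joinNaive(parts []string) string {
	s := ""
	for _, p := range parts {
		s += p
	}
	return s
}

// joinBuilder builds the same string with a strings.Builder.
func joinBuilder(parts []string) string {
	var sb strings.Builder
	for _, p := range parts {
		sb.WriteString(p)
	}
	return sb.String()
}

// bench runs fn under the standard benchmark driver, which picks b.N.
func bench(fn func([]string) string, parts []string) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			fn(parts)
		}
	})
}

func main() {
	parts := make([]string, 100)
	for i := range parts {
		parts[i] = "x"
	}
	// Timings vary by machine and Go release, so none are shown here.
	r := bench(joinNaive, parts)
	fmt.Println("naive:  ", r.String(), r.MemString())
	r = bench(joinBuilder, parts)
	fmt.Println("builder:", r.String(), r.MemString())
}
```

The numbers only mean something relative to each other on the same machine, which fits the "measure, then optimize" framing of the talk.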

  30. @tyler_treat

  31. @tyler_treat
    The only way to get good at something is to be really fucking bad at it for a long time.

  32. @tyler_treat
    Benchmarking… a great way to rattle the Hacker News fart chamber.

  33. @tyler_treat
    Overview
    - Measuring performance
    - Language features
    - Memory management
    - Concurrency and multi-core

  34. @tyler_treat
    channels

  35. @tyler_treat
    “Instead of explicitly using locks to mediate access
    to shared data, Go encourages the use of channels
    to pass references to data between goroutines.”
    https://blog.golang.org/share-memory-by-communicating

  36. @tyler_treat

  37. @tyler_treat
    USE CHANNELS TO COORDINATE, NOT SYNCHRONIZE.

  38. @tyler_treat

  39. @tyler_treat

  40. @tyler_treat
    defer

  41. @tyler_treat

  42. @tyler_treat
    Is defer still slow?
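
The question on the slide above can be checked on your own toolchain with a small sketch like the one below. The functions incDefer, incExplicit and measure are hypothetical names for illustration. The answer depends on the Go release: the talk dates from 2017, and Go 1.14 later reduced the cost of most defers considerably, so no timings are quoted here.

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

var (
	mu     sync.Mutex
	shared int
)

// incDefer releases the lock with defer.
func incDefer() {
	mu.Lock()
	defer mu.Unlock()
	shared++
}

// incExplicit releases the lock with a direct call.
func incExplicit() {
	mu.Lock()
	shared++
	mu.Unlock()
}

// measure times fn with the standard benchmark driver.
func measure(fn func()) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			fn()
		}
	})
}

func main() {
	fmt.Println("defer:   ", measure(incDefer))
	fmt.Println("explicit:", measure(incExplicit))
}
```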

  43. @tyler_treat

  44. @tyler_treat
    The Secret Life of interface{}

  45. @tyler_treat
    type Stringer interface {
        String() string
    }
    https://research.swtch.com/interfaces

  46. @tyler_treat
    type Stringer interface {
        String() string
    }

    type Binary uint64
    https://research.swtch.com/interfaces

  47. @tyler_treat
    type Stringer interface {
        String() string
    }

    type Binary uint64

    b := Binary(200)
    [Diagram: b = 200]
    https://research.swtch.com/interfaces

  48. @tyler_treat
    type Stringer interface {
        String() string
    }

    type Binary uint64

    func (i Binary) String() string {
        return strconv.FormatUint(uint64(i), 2)
    }

    b := Binary(200)
    [Diagram: b = 200]
    https://research.swtch.com/interfaces

  49. @tyler_treat
    type Stringer interface {
        String() string
    }

    s := Stringer(b)
    [Diagram: the Stringer value s has two words, tab and data]
    https://research.swtch.com/interfaces

  50. @tyler_treat
    type Stringer interface {
        String() string
    }

    s := Stringer(b)
    [Diagram: the Stringer value s has two words, tab and data; tab points to itable(Stringer, Binary), whose type field is type(Binary) and whose fun[0] is (*Binary).String]
    https://research.swtch.com/interfaces

  51. @tyler_treat
    type Stringer interface {
        String() string
    }

    s := Stringer(b)
    [Diagram: as on the previous slide, with data now pointing to a Binary holding 200]
    https://research.swtch.com/interfaces
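
The fragments on slides 45 to 51 can be assembled into a runnable program. The sketch below adds only a main function and comments to the slide code; the type assertion at the end is an illustrative addition, not something from the slides.

```go
package main

import (
	"fmt"
	"strconv"
)

type Stringer interface {
	String() string
}

type Binary uint64

// String formats the value in base 2.
func (i Binary) String() string {
	return strconv.FormatUint(uint64(i), 2)
}

func main() {
	b := Binary(200)

	// Converting to an interface value stores the dynamic type information
	// (tab) and a copy of the value (data).
	s := Stringer(b)

	// Changing b afterwards does not affect the copy held by s.
	b = 0

	// The call is dispatched through the interface's method table.
	fmt.Println(s.String()) // prints 11001000

	// A type assertion recovers the concrete value.
	if v, ok := s.(Binary); ok {
		fmt.Println(uint64(v)) // prints 200
	}
}
```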

  52. @tyler_treat

  53. @tyler_treat
    So what?

  54. @tyler_treat

  55. @tyler_treat

  56. @tyler_treat
    Sorting 100M Interfaces

  57. @tyler_treat
    Sorting 100M Interfaces

  58. @tyler_treat
    Sorting 100M Structs

  59. @tyler_treat
    Sorting 100M Structs

  60. @tyler_treat
    $ go test -bench=. -gcflags="-m"

  61. @tyler_treat
    $ go test -bench=. -gcflags="-m"

  62. @tyler_treat

  63. @tyler_treat
    $ go test -bench=. -gcflags="-l"

  64. @tyler_treat
    Struct, No Inlining
    Interface, No Inlining

  65. @tyler_treat
    Struct, No Inlining
    Interface, No Inlining

  66. @tyler_treat
    Struct, No Inlining
    Interface, No Inlining

  67. @tyler_treat

  68. @tyler_treat
    x.(*T) inlined

  69. @tyler_treat
    x.(*T) inlined
    SSA backend & remaining type conversions inlined

  70. @tyler_treat

  71. @tyler_treat

  72. @tyler_treat
    Struct
    Interface

  73. @tyler_treat
    Struct
    Interface

  74. @tyler_treat

  75. @tyler_treat
    $ go test -bench=. -gcflags="-S"

  76. @tyler_treat
    $ go test -bench=. -gcflags="-S"

  77. @tyler_treat
    $ go test -bench=. -gcflags="-S"

  78. @tyler_treat
    Key Insight: If performance matters, write type-specific code.

  79. @tyler_treat
    Overview
    - Measuring performance
    - Language features
    - Memory management
    - Concurrency and multi-core

  80. @tyler_treat
    []byte to string conversions

  81. @tyler_treat

  82. @tyler_treat

  83. @tyler_treat

  84. @tyler_treat
    What’s going on here?

  85. @tyler_treat

  86. @tyler_treat
    memory allocation

  87. @tyler_treat

  88. @tyler_treat
    How is sync.Pool so fast?

  89. @tyler_treat
    Per-CPU storage!
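
For readers who have not used sync.Pool, a minimal usage sketch follows. The names bufPool and render are hypothetical and the example is not taken from the talk; it only shows the Get, Reset, Put pattern that lets hot paths reuse buffers instead of allocating new ones.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers. New is called only when the pool has
// nothing to offer.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// render builds a greeting using a pooled buffer.
func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold old contents
	defer bufPool.Put(buf)

	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String() // String copies, so returning buf to the pool is safe
}

func main() {
	fmt.Println(render("gopher")) // prints hello, gopher
}
```

Pooled objects can be dropped by the garbage collector at any time, so a pool suits scratch space, not state that must survive.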

  90. @tyler_treat
    https://golang.org/src/sync/pool.go

  91. @tyler_treat
    https://golang.org/src/sync/pool.go

  92. @tyler_treat

  93. @tyler_treat
    Overview
    - Measuring performance
    - Language features
    - Memory management
    - Concurrency and multi-core

  94. @tyler_treat
    “We generally don’t want sync/atomic to be used
    at all… Experience has shown us again and again
    that very very few people are capable of writing
    correct code that uses atomic operations…”
    - Ian Lance Taylor

  95. @tyler_treat

  96. @tyler_treat
    Fast Topic Matching
    Subscribers, Messages
    http://bravenewgeek.com/fast-topic-matching/

  97. @tyler_treat
    Fast Topic Matching
    Subscribers, Messages
    http://bravenewgeek.com/fast-topic-matching/

  98. @tyler_treat
    Fast Topic Matching

  99. @tyler_treat
    Fast Topic Matching

  100. @tyler_treat

  101. @tyler_treat
    Fast Topic Matching

  102. @tyler_treat
    Concurrent
    80,000 inserts
    80,000 lookups

  103. @tyler_treat
    Ctrie

  104. @tyler_treat
    Ctrie
    1. Assign a generation, G1, to each I-node (empty struct).
    [Diagram: I-nodes labeled G1]

  105. @tyler_treat
    Ctrie
    1. Assign a generation, G1, to each I-node (empty struct).
    2. Add new node by copying I-node with updated branch and generation, then GCAS, i.e. atomically:
       - compare I-nodes to detect tree mutations.
       - compare root generations to detect snapshots.
    [Diagram: I-nodes labeled G1 and G2]

  106. @tyler_treat

  107. @tyler_treat

  108. @tyler_treat
    The Go race detector doesn’t protect you from doing dumb stuff.

  109. @tyler_treat

  110. @tyler_treat

  111. @tyler_treat

  112. @tyler_treat
    Side note: unsafe is, in fact, unsafe.

  113. @tyler_treat
    “Packages that import unsafe may depend on internal
    properties of the Go implementation. We reserve the
    right to make changes to the implementation that may
    break such programs.”
    https://golang.org/doc/go1compat

  114. @tyler_treat

  115. @tyler_treat
    Key Insight: Struct layout can make a big difference.

  116. @tyler_treat
    Mechanical Sympathy

  117. @tyler_treat
    https://github.com/Workiva/go-datastructures/blob/master/queue/ring.go

  118. @tyler_treat

  119. @tyler_treat

  120. @tyler_treat

  121. @tyler_treat
    https://golang.org/src/sync/rwmutex.go

  122. @tyler_treat
    https://golang.org/src/sync/rwmutex.go

  123-126. @tyler_treat
    [Diagram, built up over four slides: readers running on one, two, three, then four CPUs, all sharing a single RWMutex]

  127-132. @tyler_treat
    [Diagram, repeated over six slides: readers and writers on four CPUs, all sharing a single RWMutex]

  133-139. @tyler_treat
    [Diagram, repeated over seven slides: the same readers and writers on four CPUs, now with four RWMutexes]

  140. @tyler_treat
    [Diagram: a large number of CPUs, each with its own readers and writers]

  141. @tyler_treat

  142. @tyler_treat
    How to create CPU->RWMutex mapping?

  143. @tyler_treat
    https://github.com/jonhoo/drwmutex/blob/master/cpu_amd64.s

  144. @tyler_treat
    /proc/cpuinfo

  145. @tyler_treat

  146. @tyler_treat
    [Diagram: memory holding RWMutex1 (24 bytes)]

  147. @tyler_treat
    [Diagram: memory holding RWMutex1 and RWMutex2 (24 bytes each)]

  148. @tyler_treat
    [Diagram: memory holding RWMutex1, RWMutex2 and RWMutex3 (24 bytes each)]

  149. @tyler_treat
    [Diagram: memory holding RWMutex1, RWMutex2, RWMutex3 … RWMutexN (24 bytes each)]

  150. @tyler_treat
    [Diagram: memory holding RWMutex1, RWMutex2, RWMutex3 … RWMutexN (24 bytes each) against a 64-byte cache line]

  151. @tyler_treat
    [Diagram: memory holding RWMutex1, RWMutex2, RWMutex3 … RWMutexN (24 bytes each) against a 64-byte cache line]
    Cache rules everything around me

  152. @tyler_treat
    https://github.com/jonhoo/drwmutex/blob/master/drwmutex.go

  153. @tyler_treat
    https://github.com/jonhoo/drwmutex/blob/master/drwmutex.go

  154. @tyler_treat
    [Diagram: memory holding RWMutex1 (24 bytes) followed by padding … up to 64 bytes (cache line size)]
    Cache rules everything around me

  155. @tyler_treat

  156. @tyler_treat

  157. @tyler_treat

  158. @tyler_treat
    Go makes concurrency easy enough to be dangerous.

  159. @tyler_treat
    Conclusions

  160. @tyler_treat
    1. The standard library provides general solutions (and they’re generally what you should use).

  161. @tyler_treat
    2. Seemingly small, idiomatic decisions can have profound performance implications.

  162. @tyler_treat
    3. The Go toolchain has lots of tools for analyzing your code; learn them.

  163. @tyler_treat
    4. Go’s compiler and runtime continue to improve.

  164. @tyler_treat
    5. Performance profile can change dramatically between releases.

  165. @tyler_treat
    6. Relying on assumptions can be fatal.

  166. @tyler_treat
    7. Code is marginal, architecture is material.

  167. @tyler_treat
    8. Peeking behind the curtains can pay dividends.

  168. @tyler_treat
    9. Above all, optimize for the right trade-off.

  169. @tyler_treat
    Thanks!