So You Wanna Go Fast?

Tyler Treat
September 29, 2017

Go's simplicity and concurrency model make it an appealing choice for backend systems, but how does it fare for latency-sensitive applications? In this talk, we explore the other side of the coin by providing some tips on writing high-performance Go and lessons learned in the process. We do a deep dive on low-level performance optimizations in order to make Go a more compelling option in the world of systems programming, but we also consider the trade-offs involved.

Transcript

  1. So You Wanna Go Fast? Strange Loop 2017 @tyler_treat

  2. @tyler_treat @tyler_treat

  3. @tyler_treat Make your code faster with this one weird trick

  4. @tyler_treat Make your code faster with this one weird trick

  5. @tyler_treat So You Wanna Subvert Go?

  6. @tyler_treat Spoiler Alert: Go is not a systems language…

  7. @tyler_treat but that doesn’t mean you can’t build internet-scale systems with it.

  8. @tyler_treat

  9. @tyler_treat This is a talk about how to write terrible Go code.

  10. @tyler_treat @tyler_treat

  11. @tyler_treat Because this is a talk about trade-offs.

  12. @tyler_treat Tyler Treat
      - Messaging Nerd @ Apcera
      - Working on nats.io
      - Distributed systems
      - bravenewgeek.com

  13. @tyler_treat @tyler_treat

  14. @tyler_treat Why does this talk matter?

  15. @tyler_treat The compiler isn’t magic.

  16. @tyler_treat The compiler isn’t magic.

  17. @tyler_treat You have to be mindful of performance when it matters.

  18. @tyler_treat @tyler_treat Where bad things hide

  19. @tyler_treat @tyler_treat Where bad things hide Where we’re usually looking

  20. @tyler_treat Tire fires at scale @tyler_treat

  21. @tyler_treat @tyler_treat @tyler_treat

  22. @tyler_treat @tyler_treat @tyler_treat

  23. @tyler_treat @tyler_treat @tyler_treat

  24. @tyler_treat Overview
      - Measuring performance
      - Language features
      - Memory management
      - Concurrency and multi-core

  25. @tyler_treat Overview
      - Measuring performance
      - Language features
      - Memory management
      - Concurrency and multi-core

  26. @tyler_treat Disclaimer: Don’t blindly apply optimizations presented.

  27. @tyler_treat tl;dr of this talk is “IT DEPENDS!”

  28. @tyler_treat Measure Optimize

  29. @tyler_treat Measurement Techniques
      - pprof
        - memory
        - cpu
        - blocking
      - GODEBUG
        - gctrace
        - schedtrace
        - allocfreetrace
      - Benchmarking
        - Code-level: testing.B
        - System-level: HdrHistogram (https://github.com/codahale/hdrhistogram), bench (https://github.com/tylertreat/bench)

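As a minimal sketch of the code-level benchmarking mentioned on this slide, the program below drives the standard `testing.B` harness through `testing.Benchmark`, so it runs without `go test`. The functions `joinPlus` and `joinBuilder` are hypothetical stand-ins for whatever code is being measured; they are not from the talk.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// joinPlus is a hypothetical function under test: it concatenates with +=.
func joinPlus(parts []string) string {
	s := ""
	for _, p := range parts {
		s += p
	}
	return s
}

// joinBuilder is a hypothetical alternative that uses strings.Builder.
func joinBuilder(parts []string) string {
	var b strings.Builder
	for _, p := range parts {
		b.WriteString(p)
	}
	return b.String()
}

// measure runs f under the standard benchmark harness and reports
// time per operation and allocations per operation.
func measure(f func([]string) string) testing.BenchmarkResult {
	parts := make([]string, 100)
	for i := range parts {
		parts[i] = "x"
	}
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			f(parts)
		}
	})
}

func main() {
	r := measure(joinPlus)
	fmt.Println("plus:   ", r.String(), r.MemString())
	r = measure(joinBuilder)
	fmt.Println("builder:", r.String(), r.MemString())
}
```

The numbers depend on the machine and the Go release, so treat them as a starting point for profiling rather than a conclusion.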
  30. @tyler_treat @tyler_treat

  31. @tyler_treat The only way to get good at something is to be really fucking bad at it for a long time.

  32. @tyler_treat Benchmarking… a great way to rattle the Hacker News fart chamber.

  33. @tyler_treat Overview
      - Measuring performance
      - Language features
      - Memory management
      - Concurrency and multi-core

  34. @tyler_treat channels

  35. @tyler_treat “Instead of explicitly using locks to mediate access to shared data, Go encourages the use of channels to pass references to data between goroutines.” https://blog.golang.org/share-memory-by-communicating

  36. @tyler_treat @tyler_treat

  37. @tyler_treat @tyler_treat USE CHANNELS TO COORDINATE, NOT SYNCHRONIZE.

  38. @tyler_treat @tyler_treat

  39. @tyler_treat @tyler_treat

  40. @tyler_treat defer

  41. @tyler_treat @tyler_treat

  42. @tyler_treat Is defer still slow?
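The transcript does not record the answer, and it can change with the Go release, so the question is best settled by measuring on your own toolchain. The sketch below compares a deferred unlock with an explicit one; the `counter` type and its methods are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

type counter struct {
	mu sync.Mutex
	n  int
}

// incDefer releases the lock with defer.
func (c *counter) incDefer() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n++
}

// incExplicit releases the lock with a direct call.
func (c *counter) incExplicit() {
	c.mu.Lock()
	c.n++
	c.mu.Unlock()
}

// measure runs op b.N times under the standard benchmark harness.
func measure(op func()) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			op()
		}
	})
}

func main() {
	var a, b counter
	fmt.Println("defer:   ", measure(a.incDefer))
	fmt.Println("explicit:", measure(b.incExplicit))
}
```

If the difference is within noise on your release, the deferred form is usually the safer choice because the unlock cannot be skipped by an early return.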

  43. @tyler_treat @tyler_treat

  44. @tyler_treat The Secret Life of interface{}

  45. @tyler_treat type Stringer interface { String() string } https://research.swtch.com/interfaces

  46. @tyler_treat type Stringer interface { String() string }; type Binary uint64 https://research.swtch.com/interfaces

  47. @tyler_treat type Stringer interface { String() string }; type Binary uint64; b := Binary(200) [diagram: b holds 200] https://research.swtch.com/interfaces

  48. @tyler_treat type Stringer interface { String() string }; type Binary uint64; func (i Binary) String() string { return strconv.FormatUint(uint64(i), 2) }; b := Binary(200) [diagram: b holds 200] https://research.swtch.com/interfaces

  49. @tyler_treat type Stringer interface { String() string }; s := Stringer(b) [diagram: Stringer value with tab and data words] https://research.swtch.com/interfaces

  50. @tyler_treat type Stringer interface { String() string }; s := Stringer(b) [diagram: Stringer value with tab and data words; tab points to itable(Stringer, Binary), which holds type = type(Binary) and fun[0] = (*Binary).String] https://research.swtch.com/interfaces

  51. @tyler_treat type Stringer interface { String() string }; s := Stringer(b) [diagram: as on the previous slide, with data pointing to a Binary holding 200] https://research.swtch.com/interfaces

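The fragments on these slides assemble into the following runnable program. The `main` function is added here for illustration; the type and method definitions are as shown on the slides.

```go
package main

import (
	"fmt"
	"strconv"
)

// Stringer is satisfied by any type with a String method.
type Stringer interface {
	String() string
}

// Binary is a uint64 that formats itself in base 2.
type Binary uint64

func (i Binary) String() string {
	return strconv.FormatUint(uint64(i), 2)
}

func main() {
	b := Binary(200)
	// Converting b to Stringer builds the two-word interface value from
	// the slides: a table pointer (tab) and a data pointer.
	s := Stringer(b)
	// The call goes through the method table rather than being bound
	// statically to Binary.String.
	fmt.Println(s.String()) // prints 11001000
}
```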
  52. @tyler_treat

  53. @tyler_treat So what?

  54. @tyler_treat @tyler_treat

  55. @tyler_treat @tyler_treat

  56. @tyler_treat @tyler_treat Sorting 100M Interfaces

  57. @tyler_treat @tyler_treat Sorting 100M Interfaces

  58. @tyler_treat @tyler_treat Sorting 100M Structs

  59. @tyler_treat @tyler_treat Sorting 100M Structs

  60. @tyler_treat $ go test -bench=. -gcflags="-m"

  61. @tyler_treat $ go test -bench=. -gcflags="-m"

  62. @tyler_treat @tyler_treat

  63. @tyler_treat $ go test -bench=. -gcflags="-l"

  64. @tyler_treat @tyler_treat [chart: Struct, no inlining vs. Interface, no inlining]

  65. @tyler_treat @tyler_treat [chart: Struct, no inlining vs. Interface, no inlining]

  66. @tyler_treat @tyler_treat [chart: Struct, no inlining vs. Interface, no inlining]

  67. @tyler_treat @tyler_treat

  68. @tyler_treat @tyler_treat [chart annotation: x.(*T) inlined]

  69. @tyler_treat @tyler_treat [chart annotations: x.(*T) inlined; SSA backend & remaining type conversions inlined]

  70. @tyler_treat @tyler_treat

  71. @tyler_treat

  72. @tyler_treat @tyler_treat Struct Interface

  73. @tyler_treat @tyler_treat Struct Interface

  74. @tyler_treat @tyler_treat

  75. @tyler_treat $ go test -bench=. -gcflags="-S"

  76. @tyler_treat $ go test -bench=. -gcflags="-S"

  77. @tyler_treat $ go test -bench=. -gcflags="-S"

  78. @tyler_treat Key Insight: If performance matters, write type-specific code.

  79. @tyler_treat Overview
      - Measuring performance
      - Language features
      - Memory management
      - Concurrency and multi-core

  80. @tyler_treat []byte to string conversions
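The slides that follow are images, so here is a hedged sketch of the usual concern with these conversions: strings are immutable, so `string(b)` has to copy the bytes, and when the result escapes that copy is a heap allocation. The `sink` variable and `allocsForConversion` helper are hypothetical names used only for this illustration.

```go
package main

import (
	"fmt"
	"testing"
)

// sink is package-level so the converted string escapes to the heap.
var sink string

// allocsForConversion reports the average number of allocations made
// by one []byte to string conversion of b.
func allocsForConversion(b []byte) float64 {
	return testing.AllocsPerRun(1000, func() {
		sink = string(b)
	})
}

func main() {
	b := []byte("a message payload that is comfortably longer than a few machine words")
	s := string(b)
	b[0] = 'A' // mutating the slice must not change the string

	fmt.Println(s[0] == 'a') // prints true: s holds its own copy
	fmt.Println(allocsForConversion(b))
}
```

The compiler can avoid the copy in some cases, so the allocation count is worth checking with `-gcflags="-m"` or a benchmark before restructuring code around it.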

  81. @tyler_treat

  82. @tyler_treat @tyler_treat

  83. @tyler_treat @tyler_treat

  84. @tyler_treat What’s going on here?

  85. @tyler_treat @tyler_treat

  86. @tyler_treat memory allocation

  87. @tyler_treat @tyler_treat

  88. @tyler_treat How is sync.Pool so fast?

  89. @tyler_treat Per-CPU storage!
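A minimal usage sketch of `sync.Pool`, assuming a hypothetical `render` function that needs a scratch buffer on every call. The pool lets calls reuse buffers instead of allocating a fresh one each time.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers. New is called only when the pool
// has nothing to offer.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// render is a hypothetical hot-path function that borrows a buffer,
// uses it, and returns it to the pool.
func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold old contents
	defer bufPool.Put(buf)

	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String() // String copies, so the result outlives the buffer
}

func main() {
	fmt.Println(render("gopher")) // prints hello, gopher
}
```

Pooled objects can be dropped by the runtime at any time, so the pool suits scratch space and not state that must persist.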

  90. @tyler_treat @tyler_treat https://golang.org/src/sync/pool.go

  91. @tyler_treat @tyler_treat https://golang.org/src/sync/pool.go

  92. @tyler_treat @tyler_treat

  93. @tyler_treat Overview
      - Measuring performance
      - Language features
      - Memory management
      - Concurrency and multi-core

  94. @tyler_treat “We generally don’t want sync/atomic to be used at all…Experience has shown us again and again that very very few people are capable of writing correct code that uses atomic operations…” - Ian Lance Taylor

  95. @tyler_treat

  96. @tyler_treat @tyler_treat Subscribers Messages Fast Topic Matching http://bravenewgeek.com/fast-topic-matching/

  97. @tyler_treat @tyler_treat Subscribers Messages Fast Topic Matching http://bravenewgeek.com/fast-topic-matching/

  98. @tyler_treat @tyler_treat Fast Topic Matching

  99. @tyler_treat @tyler_treat Fast Topic Matching

  100. @tyler_treat @tyler_treat

  101. @tyler_treat @tyler_treat Fast Topic Matching

  102. @tyler_treat @tyler_treat [chart: Concurrent, 80,000 inserts, 80,000 lookups]

  103. @tyler_treat @tyler_treat Ctrie

  104. @tyler_treat @tyler_treat Ctrie [diagram labels: G1, G1]
       1. Assign a generation, G1, to each I-node (empty struct).

  105. @tyler_treat @tyler_treat Ctrie [diagram labels: G2, G1]
       1. Assign a generation, G1, to each I-node (empty struct).
       2. Add new node by copying I-node with updated branch and generation, then GCAS, i.e. atomically:
          - compare I-nodes to detect tree mutations.
          - compare root generations to detect snapshots.

  106. @tyler_treat @tyler_treat

  107. @tyler_treat @tyler_treat

  108. @tyler_treat The Go race detector doesn’t protect you from doing dumb stuff.

  109. @tyler_treat @tyler_treat

  110. @tyler_treat @tyler_treat

  111. @tyler_treat @tyler_treat

  112. @tyler_treat Side note: unsafe is, in fact, unsafe.

  113. @tyler_treat “Packages that import unsafe may depend on internal properties of the Go implementation. We reserve the right to make changes to the implementation that may break such programs.” https://golang.org/doc/go1compat

  114. @tyler_treat

  115. @tyler_treat Key Insight: Struct layout can make a big difference.

  116. @tyler_treat @tyler_treat Mechanical Sympathy

  117. @tyler_treat https://github.com/Workiva/go-datastructures/blob/master/queue/ring.go @tyler_treat

  118. @tyler_treat @tyler_treat

  119. @tyler_treat @tyler_treat

  120. @tyler_treat @tyler_treat

  121. @tyler_treat @tyler_treat https://golang.org/src/sync/rwmutex.go

  122. @tyler_treat @tyler_treat https://golang.org/src/sync/rwmutex.go

  123. @tyler_treat [diagram: one CPU whose readers share a single RWMutex]

  124. @tyler_treat [diagram: two CPUs whose readers share a single RWMutex]

  125. @tyler_treat [diagram: three CPUs whose readers share a single RWMutex]

  126. @tyler_treat [diagram: four CPUs whose readers share a single RWMutex]

  127. @tyler_treat [diagram: four CPUs with readers and writers sharing a single RWMutex]

  128. @tyler_treat [diagram: four CPUs with readers and writers sharing a single RWMutex]

  129. @tyler_treat [diagram: four CPUs with readers and writers sharing a single RWMutex]

  130. @tyler_treat [diagram: four CPUs with readers and writers sharing a single RWMutex]

  131. @tyler_treat [diagram: four CPUs with readers and writers sharing a single RWMutex]

  132. @tyler_treat [diagram: four CPUs with readers and writers sharing a single RWMutex]

  133. @tyler_treat [diagram: four CPUs with readers and writers, and four RWMutexes]

  134. @tyler_treat [diagram: four CPUs with readers and writers, and four RWMutexes]

  135. @tyler_treat [diagram: four CPUs with readers and writers, and four RWMutexes]

  136. @tyler_treat [diagram: four CPUs with readers and writers, and four RWMutexes]

  137. @tyler_treat [diagram: four CPUs with readers and writers, and four RWMutexes]

  138. @tyler_treat [diagram: four CPUs with readers and writers, and four RWMutexes]

  139. @tyler_treat [diagram: four CPUs with readers and writers, and four RWMutexes]

  140. @tyler_treat [diagram: many CPUs, each with readers and writers]

  141. @tyler_treat @tyler_treat

  142. @tyler_treat How to create CPU->RWMutex mapping?

  143. @tyler_treat @tyler_treat https://github.com/jonhoo/drwmutex/blob/master/cpu_amd64.s

  144. @tyler_treat /proc/cpuinfo

  145. @tyler_treat @tyler_treat

  146. @tyler_treat [memory diagram: RWMutex1, 24 bytes]

  147. @tyler_treat [memory diagram: RWMutex1, RWMutex2, 24 bytes each]

  148. @tyler_treat [memory diagram: RWMutex1, RWMutex2, RWMutex3, 24 bytes each]

  149. @tyler_treat [memory diagram: RWMutex1, RWMutex2, RWMutex3, …, RWMutexN, 24 bytes each]

  150. @tyler_treat [memory diagram: RWMutex1, RWMutex2, RWMutex3, …, RWMutexN, 24 bytes each; cache line size is 64 bytes]

  151. @tyler_treat [memory diagram: RWMutex1, RWMutex2, RWMutex3, …, RWMutexN, 24 bytes each; cache line size is 64 bytes] Cache rules everything around me

  152. @tyler_treat https://github.com/jonhoo/drwmutex/blob/master/drwmutex.go @tyler_treat

  153. @tyler_treat https://github.com/jonhoo/drwmutex/blob/master/drwmutex.go @tyler_treat

  154. @tyler_treat [memory diagram: RWMutex1 (24 bytes) followed by padding up to 64 bytes (cache line size)] Cache rules everything around me

  155. @tyler_treat @tyler_treat

  156. @tyler_treat @tyler_treat

  157. @tyler_treat @tyler_treat

  158. @tyler_treat Go makes concurrency easy enough to be dangerous.

  159. @tyler_treat Conclusions

  160. @tyler_treat 1. The standard library provides general solutions (and they’re generally what you should use).

  161. @tyler_treat 2. Seemingly small, idiomatic decisions can have profound performance implications.

  162. @tyler_treat 3. The Go toolchain has lots of tools for analyzing your code; learn them.

  163. @tyler_treat 4. Go’s compiler and runtime continue to improve.

  164. @tyler_treat 5. Performance profile can change dramatically between releases.

  165. @tyler_treat 6. Relying on assumptions can be fatal.

  166. @tyler_treat 7. Code is marginal, architecture is material.

  167. @tyler_treat 8. Peeking behind the curtains can pay dividends.

  168. @tyler_treat 9. Above all, optimize for the right trade-off.

  169. @tyler_treat Thanks!