Tackling contention: the monsters inside the `sync.Locker`

Go is all about parallelism and concurrency, but they don’t come for free. This talk is about measuring the price of contention and reducing it.

Roberto (Rob) Clapis

May 30, 2019

Transcript

  1. Who am I

    Roberto Clapis (@empijei). Security Enhancement for the Web @ Google. Gopher since 2015. Strong interest in parallel programming since 2009.
  2. The problem

    Go makes it easy to write parallel and concurrent code. Parallel and concurrent code might cause contention. Contention might cause performance loss.
  3. The tools: pprof

    pprof is very powerful and has many potential profiles. Just run your benchmarks with -cpuprofile=cpu.out, then run go tool pprof <exec.test> cpu.out and type "web". Profiles can also be collected live, with minimal overhead, by importing "net/http/pprof".
  4. pprof with "cpuprofile"

    for i := 0; i < count; i++ {
        go func() {
            for j := 0; j < 100000; j++ {
                mu.Lock()
                a++
                mu.Unlock()
            }
        }()
    }
  5. pprof with "mutexprofile"

    ROUTINE ============= one.Contend.func1 in main.go
    1.02ms (flat, cum) 4.13% of Total
         .   64: for j := 0; j < 100000; j++ {
    1.02ms   65:     mu.Lock()
         .   66:     a++
         .   67:     mu.Unlock()
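A mutex profile like the one above can also be collected from inside a program. The following is a minimal sketch, assuming every contention event is sampled (a fraction of 1, chosen here only for the demo); `contend` and `mutexProfile` are hypothetical helper names, not part of the talk.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
)

// contend increments a shared counter from several goroutines,
// taking the mutex once per increment, as in the slide's loop.
func contend(goroutines, iters int) int {
	var (
		mu sync.Mutex
		wg sync.WaitGroup
		a  int
	)
	wg.Add(goroutines)
	for i := 0; i < goroutines; i++ {
		go func() {
			defer wg.Done()
			for j := 0; j < iters; j++ {
				mu.Lock()
				a++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return a
}

// mutexProfile returns the current mutex profile in pprof's
// compressed protobuf format (debug level 0).
func mutexProfile() ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.Lookup("mutex").WriteTo(&buf, 0); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	// Sample every contention event. A fraction of 1 is an assumption
	// suited to a demo; it adds overhead in a real service.
	runtime.SetMutexProfileFraction(1)
	defer runtime.SetMutexProfileFraction(0)

	fmt.Println("counter:", contend(8, 100000))

	prof, err := mutexProfile()
	if err != nil {
		fmt.Println("profile error:", err)
		return
	}
	fmt.Println("profile bytes:", len(prof))
}
```

Writing the returned bytes to a file should give something `go tool pprof` can open, in the same way as a profile produced by `go test -mutexprofile`.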
  6. The tools: trace

    Like profiles, traces can be captured live by importing "net/http/pprof". Traces can also be generated with go test -trace. go tool trace opens a browser (/trace only works with Chrome). The blocking profile is still pprof.
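Besides the two routes named on the slide, the standard `runtime/trace` package can record a trace around a chosen piece of code. This is a minimal sketch under that assumption; `captureTrace` and the output file name `trace.out` are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime/trace"
	"sync"
)

// captureTrace runs work while the execution tracer is active and
// returns the raw trace data.
func captureTrace(work func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return nil, err
	}
	work()
	trace.Stop() // flushes the remaining trace data into buf
	return buf.Bytes(), nil
}

func main() {
	data, err := captureTrace(func() {
		var (
			mu sync.Mutex
			wg sync.WaitGroup
			a  int
		)
		wg.Add(4)
		for i := 0; i < 4; i++ {
			go func() {
				defer wg.Done()
				for j := 0; j < 10000; j++ {
					mu.Lock()
					a++
					mu.Unlock()
				}
			}()
		}
		wg.Wait()
	})
	if err != nil {
		fmt.Println("trace error:", err)
		return
	}
	// The file name is an arbitrary choice for this sketch.
	if err := os.WriteFile("trace.out", data, 0o644); err != nil {
		fmt.Println("write error:", err)
		return
	}
	fmt.Println("wrote", len(data), "bytes; inspect with: go tool trace trace.out")
}
```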
  7. Beware of Heisenberg's uncertainty principle

    If you try to measure contention too precisely, the measurements will influence the runtime and change the values.
  8. Warning

    Before you make any change to your code based on what follows, you should measure. Never optimize for contention reduction unless you know it is your bottleneck. When you measure, beware of sample size: small samples will give wrong results.
  9. Fix it

    • Reduce contention by changing the algorithm
    • Reduce contention by changing primitives

    As engineers do
  10. Channels

    Channels are mutex-protected structs. Channels are blocking, and if unbuffered every single operation will block. If many send/receive operations are executed in parallel they will fight for the lock.

    // runtime/chan.go
    type hchan struct {
        // Lots of unexported fields
        lock mutex
    }
  11. Sharding: decentralize state

    If jobs take more or less the same time and the queue is stable, contention can be reduced and bandwidth increased by splitting work.

    [Diagram: work split across two channels]
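The sharding idea can be sketched as one channel per producer/consumer pair, so that no two pairs compete for the same channel lock. This is an illustration under assumptions, not the speaker's code: `processSharded`, the strided job split, and the squaring "work" are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// processSharded splits jobs over `shards` independent channels. Each
// shard has its own producer and its own consumer, so the pairs never
// touch each other's channel. It returns the sum of the squared jobs.
func processSharded(jobs []int, shards int) int {
	chans := make([]chan int, shards)
	partial := make([]int, shards) // one result slot per consumer
	var wg sync.WaitGroup
	wg.Add(shards)
	for s := 0; s < shards; s++ {
		chans[s] = make(chan int, 64)

		// Producer for shard s: takes every shards-th job.
		go func(s int) {
			for i := s; i < len(jobs); i += shards {
				chans[s] <- jobs[i]
			}
			close(chans[s])
		}(s)

		// Consumer for shard s: accumulates locally, publishes once.
		go func(s int) {
			defer wg.Done()
			sum := 0
			for j := range chans[s] {
				sum += j * j
			}
			partial[s] = sum
		}(s)
	}
	wg.Wait()

	total := 0
	for _, p := range partial {
		total += p
	}
	return total
}

func main() {
	jobs := make([]int, 1000)
	for i := range jobs {
		jobs[i] = i + 1
	}
	fmt.Println("total:", processSharded(jobs, 4))
}
```

The strided split only balances well when jobs cost roughly the same, which is the precondition the slide states.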
  12. Batching: reduce the number of accesses to shared state

    If latency is not critical and eventual consistency is enough, contention can be reduced and bandwidth increased by sending batches of work over the channel.

    [Diagram: batches sent over a single channel]
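A minimal sketch of batching over a channel, assuming integer jobs and a positive batch size; `sumBatched` is a hypothetical name. The channel is touched once per batch instead of once per item, at the cost of items waiting until their batch fills.

```go
package main

import "fmt"

// sumBatched sends the numbers 1..n over a channel in slices of up to
// batchSize items and returns their sum. batchSize is assumed to be >= 1.
func sumBatched(n, batchSize int) int {
	ch := make(chan []int)
	go func() {
		defer close(ch)
		batch := make([]int, 0, batchSize)
		for i := 1; i <= n; i++ {
			batch = append(batch, i)
			if len(batch) == batchSize {
				ch <- batch
				// Allocate a fresh slice: the receiver now owns the old one.
				batch = make([]int, 0, batchSize)
			}
		}
		if len(batch) > 0 {
			ch <- batch // flush the final partial batch
		}
	}()

	total := 0
	for batch := range ch {
		for _, v := range batch {
			total += v
		}
	}
	return total
}

func main() {
	fmt.Println("sum:", sumBatched(1000, 64))
}
```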
  13. Mutexes

    Mutex: faster than channels by a constant factor, but you trade orchestration, readability and writability for the extra speed. That said, Go mutexes are fair and fast. If a mutex becomes contended the same principles apply:
    ◦ Sharding (beware of Zipf's law)
    ◦ Batching (beware of inconsistency)
    ◦ Shorten the critical section to what is needed

    RWMutex: for a very high read/write ratio
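Sharding applied to a mutex-protected map might look like the sketch below. The type `ShardedCounter`, the shard count of 16, and the FNV-1a hash are assumptions made for the example. As the Zipf warning suggests, a single very hot key still lands on one shard, so sharding does not help in that case.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// numShards is an arbitrary choice for this sketch.
const numShards = 16

type shard struct {
	mu sync.Mutex
	m  map[string]int
}

// ShardedCounter spreads keys over several independently locked maps.
type ShardedCounter struct {
	shards [numShards]shard
}

func NewShardedCounter() *ShardedCounter {
	c := &ShardedCounter{}
	for i := range c.shards {
		c.shards[i].m = make(map[string]int)
	}
	return c
}

// shardFor picks a shard by hashing the key with FNV-1a.
func (c *ShardedCounter) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &c.shards[h.Sum32()%numShards]
}

// Inc holds the shard lock only for the map update itself.
func (c *ShardedCounter) Inc(key string) {
	s := c.shardFor(key)
	s.mu.Lock()
	s.m[key]++
	s.mu.Unlock()
}

func (c *ShardedCounter) Get(key string) int {
	s := c.shardFor(key)
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.m[key]
}

func main() {
	c := NewShardedCounter()
	var wg sync.WaitGroup
	wg.Add(8)
	for g := 0; g < 8; g++ {
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				c.Inc(fmt.Sprintf("key-%d", j%100))
			}
		}()
	}
	wg.Wait()
	fmt.Println("key-0:", c.Get("key-0"))
}
```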
  14. Mutexes are fast

    func (m *Mutex) Lock() {
        if atomic.CompareAndSwapInt32(&m.state, 0, mutexLocked) {
            return
        }
        // Handle contended case and starvation
    }

    func (m *Mutex) Unlock() {
        new := atomic.AddInt32(&m.state, -mutexLocked)
        // Handle starvation
    }
  15. End of the safe zone

    This is where we abandon hope. In some cases the previous suggestions might not be enough.
  16. Intuition will not help much

    func ContendIntType() {
        var wg sync.WaitGroup
        wg.Add(cores)
        c := make([]Type, cores)
        for i := 0; i < cores; i++ {
            go func(i int) {
                defer wg.Done()
                for j := 0; j < ops; j++ {
                    c[i] += Type(j)
                }
            }(i)
        }
        wg.Wait()
    }

    • Uint8
    • Uint32
    • Uint64

    Which is faster?
  17. When addressing lock contention is not enough

    In the ideal case lock contention is the only contention that exists, but computers are real machines, not ideal machines. Enter cache contention. Sharing state across cores requires going through layers of memory.

    [Diagram: four cores, each with its own L1 cache; two L2 caches; one L3 cache; main memory]
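One common way to probe cache contention is to compare per-goroutine counters packed next to each other with counters padded to a full cache line. The sketch below assumes a 64-byte line, which varies by CPU; `paddedCounter`, `countPacked` and `countPadded` are hypothetical names. Timings depend on hardware and compiler, so no result is claimed here.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cacheLine is an assumed cache line size in bytes; it varies by CPU.
const cacheLine = 64

// paddedCounter fills a whole assumed cache line so that neighbouring
// counters in a slice do not share one.
type paddedCounter struct {
	n uint64
	_ [cacheLine - 8]byte
}

// countPacked gives each worker an adjacent uint64 slot.
func countPacked(workers, ops int) uint64 {
	c := make([]uint64, workers)
	var wg sync.WaitGroup
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func(i int) {
			defer wg.Done()
			for j := 0; j < ops; j++ {
				c[i]++
			}
		}(i)
	}
	wg.Wait()
	var total uint64
	for _, v := range c {
		total += v
	}
	return total
}

// countPadded gives each worker a slot padded to the assumed line size.
func countPadded(workers, ops int) uint64 {
	c := make([]paddedCounter, workers)
	var wg sync.WaitGroup
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func(i int) {
			defer wg.Done()
			for j := 0; j < ops; j++ {
				c[i].n++
			}
		}(i)
	}
	wg.Wait()
	var total uint64
	for i := range c {
		total += c[i].n
	}
	return total
}

func main() {
	const workers, ops = 4, 10000000

	start := time.Now()
	countPacked(workers, ops)
	fmt.Println("packed:", time.Since(start))

	start = time.Now()
	countPadded(workers, ops)
	fmt.Println("padded:", time.Since(start))
}
```

Each goroutine writes only its own slot, so there is no data race in either variant; any difference in timing comes from how the slots map onto cache lines.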
  18. Real world "Reads"

    Acquiring a read lock on an RWMutex requires writing a counter. Even atomic adds are executed sequentially in the hardware. This is counterintuitive, but modern hardware works this way.

    [Diagram: four cores, each with its own L1 cache; two L2 caches; one L3 cache; main memory]
  19. Cache sharing: sync.Map

    sync.Map is functionally equivalent to a map guarded by an RWMutex, and it is cache friendly. The downside is that it gives fewer guarantees and fewer methods and loses type safety, so you should only use it if necessary. It is strongly advised never to use it directly but to write type-safe wrappers for it.

    Package sync
    type Map
    func (m *Map) Delete(key interface{})
    func (m *Map) Load(key interface{}) (val interface{}, ok bool)
    func (m *Map) Range(f func(key, value interface{}) bool)
    func (m *Map) Store(key, value interface{})
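A type-safe wrapper of the kind recommended above could look like this sketch. `StringIntMap` and its key/value types are hypothetical; only the four `sync.Map` methods listed on the slide are used.

```go
package main

import (
	"fmt"
	"sync"
)

// StringIntMap is a hypothetical wrapper that restricts a sync.Map to
// string keys and int values, so callers never see interface{}.
type StringIntMap struct {
	m sync.Map
}

func (s *StringIntMap) Store(key string, value int) {
	s.m.Store(key, value)
}

func (s *StringIntMap) Load(key string) (int, bool) {
	v, ok := s.m.Load(key)
	if !ok {
		return 0, false
	}
	// The assertion cannot fail: Store is the only way values get in.
	return v.(int), true
}

func (s *StringIntMap) Delete(key string) {
	s.m.Delete(key)
}

// Range calls f for each entry until f returns false.
func (s *StringIntMap) Range(f func(key string, value int) bool) {
	s.m.Range(func(k, v interface{}) bool {
		return f(k.(string), v.(int))
	})
}

func main() {
	var m StringIntMap
	m.Store("gophers", 42)
	if v, ok := m.Load("gophers"); ok {
		fmt.Println("gophers:", v)
	}
}
```

Keeping the `sync.Map` field unexported is what preserves the type guarantee: no code outside the wrapper can store a value of another type.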
  20. If a map is not what you need

    Sometimes you need something other than a map that you rarely write but frequently read. In those rare cases you can use the atomic package. If you deal with objects you'll also need to use unsafe.
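For the rarely-written, frequently-read case, `sync/atomic` also provides `atomic.Value`, which can hold a pointer to an immutable snapshot without reaching for `unsafe`. The sketch below assumes that pattern; `Config` and `ConfigHolder` are hypothetical names, and the stored `*Config` must never be mutated after it is published.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Config is a hypothetical read-mostly settings object.
type Config struct {
	Timeout int
	Hosts   []string
}

// ConfigHolder publishes immutable *Config snapshots. Readers never
// block; writers replace the whole snapshot.
type ConfigHolder struct {
	v atomic.Value // always holds a *Config once set
}

// Get returns the current snapshot, or nil if none was stored yet.
func (h *ConfigHolder) Get() *Config {
	c, _ := h.v.Load().(*Config)
	return c
}

// Set publishes a new snapshot. The caller must not modify c afterwards.
func (h *ConfigHolder) Set(c *Config) {
	h.v.Store(c)
}

func main() {
	var h ConfigHolder
	h.Set(&Config{Timeout: 5, Hosts: []string{"a.example", "b.example"}})
	fmt.Println("timeout:", h.Get().Timeout)
}
```

This only fits when a whole-object swap is acceptable. An update that depends on the previous value still needs a compare-and-swap loop or a lock on the writer side.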
  21. What could possibly go wrong?

    The race detector will probably not detect your races. It is very hard to reason about the code. Very few people can. Even the experts on these topics have a hard time debugging this kind of code.
  22. If you really need to

    Read:

    v := atomic.Load(addr)

    Write:

    for {
        oldValue := atomic.Load(addr)
        newValue := change(oldValue)
        if atomic.CompareAndSwap(addr, oldValue, newValue) {
            break
        }
    }

    Warning: do not just use Store or you'll get a nasty data race. Also be aware of the ABA problem.
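The read and write pattern above, spelled out with the typed functions that `sync/atomic` actually exports for `int64`. This is a sketch: `update` and `runConcurrent` are hypothetical helpers, and `change` is assumed to be a pure function because it may run more than once per update.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// update applies change to *addr with a compare-and-swap retry loop and
// returns the value it managed to store. change may be called several
// times if other goroutines win the race, so it must have no side effects.
func update(addr *int64, change func(int64) int64) int64 {
	for {
		old := atomic.LoadInt64(addr)
		next := change(old)
		if atomic.CompareAndSwapInt64(addr, old, next) {
			return next
		}
	}
}

// runConcurrent increments a shared value from several goroutines using
// update and returns the final value.
func runConcurrent(workers, ops int) int64 {
	var (
		x  int64
		wg sync.WaitGroup
	)
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer wg.Done()
			for j := 0; j < ops; j++ {
				update(&x, func(v int64) int64 { return v + 1 })
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&x)
}

func main() {
	fmt.Println("final:", runConcurrent(8, 1000))
}
```

Plain integers like these are not exposed to the ABA problem in a harmful way, since equal values are interchangeable. The problem the slide warns about shows up when the swapped value is a pointer or index that can be reused.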
  23. A bit of history

    • sync.WaitGroup was rewritten from mutexes to atomics in July 2011. This introduced a bug that caused some random-looking memory corruption. The bug was fixed in April 2014.
    • A bug in the parallel GC was introduced in September 2011 and fixed only in January 2014.

    runtime.parfordo(work.sweepfor);
    bufferList[m->helpgc].busy = 0;
    if(runtime.xadd(&work.ndone, +1) == work.nproc-1)
        runtime.notewakeup(&work.alldone);