
Tackling contention: the monsters inside the `sync.Locker`

Go is all about parallelism and concurrency, but they don’t come for free. This talk is about measuring the price of contention and how to reduce it.

Roberto (Rob) Clapis

May 30, 2019

Transcript

  1. Who am I
    Security Enhancement for the Web @ Google
    Gopher since 2015
    Strong interest in parallel programming since 2009
    Roberto Clapis (@empijei)
  2. The problem
    Go makes it easy to write parallel and concurrent code.
    Parallel and concurrent code might cause contention.
    Contention might cause performance loss.
  3. The tools: pprof
    pprof is very powerful and supports many kinds of profiles.
    Just run your benchmarks with -cpuprofile=cpu.out, then run
    go tool pprof <exec.test> cpu.out and type "web".
    Profiles can also be collected live, with minimal overhead, via "net/http/pprof".
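    As a sketch of that workflow: a benchmark like the following (the package and function names contend and BenchmarkContend are illustrative, not from the talk) reproduces the contended counter and can be profiled with go test -bench=. -cpuprofile=cpu.out followed by go tool pprof contend.test cpu.out.

        package contend

        import (
            "sync"
            "testing"
        )

        // BenchmarkContend hammers a single mutex from parallel
        // goroutines so the lock shows up in the CPU profile.
        func BenchmarkContend(b *testing.B) {
            var mu sync.Mutex
            var a int
            b.RunParallel(func(pb *testing.PB) {
                for pb.Next() {
                    mu.Lock()
                    a++
                    mu.Unlock()
                }
            })
            _ = a
        }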
  4. pprof with "cpuprofile"
        for i := 0; i < count; i++ {
            go func() {
                for j := 0; j < 100000; j++ {
                    mu.Lock()
                    a++
                    mu.Unlock()
                }
            }()
        }
  5. pprof with "mutexprofile"
        ROUTINE ============= one.Contend.func1 in main.go
        1.02ms (flat, cum) 4.13% of Total
             .   64:    for j := 0; j < 100000; j++ {
        1.02ms   65:        mu.Lock()
             .   66:        a++
             .   67:        mu.Unlock()
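    For live collection, the mutex profile has to be switched on first; a minimal sketch, assuming the standard net/http/pprof endpoints (the sampling fraction of 5 is an arbitrary choice):

        package main

        import (
            "net/http"
            _ "net/http/pprof" // registers the /debug/pprof/ handlers
            "runtime"
        )

        func main() {
            // Sample roughly 1 in every 5 mutex contention events.
            runtime.SetMutexProfileFraction(5)
            // The profile is then served at /debug/pprof/mutex.
            http.ListenAndServe("localhost:6060", nil)
        }

    In tests, go test -bench=. -mutexprofile=mutex.out produces the same kind of profile.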
  6. The tools: trace
    Like profiles, traces can be captured live by importing "net/http/pprof".
    Traces can also be generated with go test -trace=trace.out.
    go tool trace opens a browser (the /trace view only works in Chrome).
    The blocking profile is still pprof.
  7. Beware of Heisenberg's uncertainty principle
    If you try to measure contention too precisely, the measurement itself will influence the runtime and change the values.
  8. Warning
    Before you make any change to your code based on what follows, measure.
    Never optimize for contention reduction unless you know it is your bottleneck.
    When you measure, beware of sample size: small samples will give wrong results.
  9. Fix it
    As engineers do:
    • Reduce contention by changing the algorithm
    • Reduce contention by changing primitives
  10. Channels
    Channels are mutex-protected structs. Channels are blocking, and if unbuffered every single operation will block.
    If many send/receive operations are executed in parallel they will fight for the lock.
        // runtime/chan.go
        type hchan struct {
            // Lots of unexported fields
            lock mutex
        }
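    A minimal sketch of that contended pattern (the worker and iteration counts are arbitrary): every send below takes the same hchan lock.

        package main

        import "sync"

        func main() {
            const workers = 8
            c := make(chan int) // unbuffered: every operation blocks

            var wg sync.WaitGroup
            wg.Add(workers)
            for i := 0; i < workers; i++ {
                go func() {
                    defer wg.Done()
                    for j := 0; j < 100000; j++ {
                        c <- j // all senders fight for the channel lock
                    }
                }()
            }
            go func() {
                wg.Wait()
                close(c)
            }()

            for v := range c {
                _ = v
            }
        }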
  11. Sharding: decentralize state
    If jobs take more or less the same time, and if the queue is stable, contention can be reduced and bandwidth increased by splitting work across several channels.
    [diagram: one stream of work split across multiple channels]
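    A sketch of sharding under those assumptions (the shard count, buffer size and modulo routing key are illustrative choices):

        package main

        import "sync"

        func main() {
            const shards = 4
            chans := make([]chan int, shards)
            var wg sync.WaitGroup

            for i := range chans {
                chans[i] = make(chan int, 128)
                wg.Add(1)
                go func(c chan int) { // one consumer per shard
                    defer wg.Done()
                    for v := range c {
                        _ = v // process the job
                    }
                }(chans[i])
            }

            for job := 0; job < 1000; job++ {
                chans[job%shards] <- job // senders contend only within a shard
            }
            for _, c := range chans {
                close(c)
            }
            wg.Wait()
        }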
  12. Batching: reduce the number of accesses to shared state
    If latency is not critical and eventual consistency is enough, contention can be reduced and bandwidth increased by sending batches of work over the channel.
    [diagram: batches of jobs sent over a single channel]
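    A sketch of batching under those assumptions (the batch size of 100 is arbitrary):

        package main

        func main() {
            c := make(chan []int, 8)

            go func() {
                batch := make([]int, 0, 100)
                for job := 0; job < 1000; job++ {
                    batch = append(batch, job)
                    if len(batch) == cap(batch) {
                        c <- batch // one channel operation per 100 jobs
                        batch = make([]int, 0, 100)
                    }
                }
                if len(batch) > 0 {
                    c <- batch // flush the last partial batch
                }
                close(c)
            }()

            for batch := range c {
                for _, v := range batch {
                    _ = v // process each job
                }
            }
        }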
  13. Mutexes
    Mutex: faster than channels by a constant factor, but trades orchestration, readability and writability for the extra speed. That said, Go mutexes are fair and fast.
    If a mutex becomes contended the same principles apply:
    ◦ Sharding (beware of Zipf's law: skewed keys keep some shards hot)
    ◦ Batching (beware of inconsistency)
    ◦ Shorten the critical section to what is strictly needed
    RWMutex: for very high read/write ratios
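    A sketch of sharding applied to a mutex-protected counter (the shard count and routing key are illustrative; note that adjacent shards can still share cache lines, which slide 17 gets to):

        package main

        import "sync"

        type shard struct {
            mu sync.Mutex
            n  int64
        }

        // ShardedCounter splits one contended counter across several
        // locks so that writers rarely collide.
        type ShardedCounter struct {
            shards [16]shard
        }

        func (c *ShardedCounter) Add(key int) {
            s := &c.shards[key%len(c.shards)]
            s.mu.Lock()
            s.n++ // keep the critical section as short as possible
            s.mu.Unlock()
        }

        func (c *ShardedCounter) Total() int64 {
            var total int64
            for i := range c.shards {
                c.shards[i].mu.Lock()
                total += c.shards[i].n
                c.shards[i].mu.Unlock()
            }
            return total
        }

        func main() {
            var c ShardedCounter
            for i := 0; i < 1000; i++ {
                c.Add(i)
            }
            _ = c.Total()
        }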
  14. Mutexes are fast
        func (m *Mutex) Lock() {
            if atomic.CompareAndSwapInt32(&m.state, 0, mutexLocked) {
                return
            }
            // Handle contended case and starvation
        }

        func (m *Mutex) Unlock() {
            new := atomic.AddInt32(&m.state, -mutexLocked)
            // Handle starvation
        }
  15. End of the safe zone
    This is where we abandon hope: in some cases the previous suggestions might not be enough.
  16. Intuition will not help much
        func ContendIntType() {
            var wg sync.WaitGroup
            wg.Add(cores)
            c := make([]Type, cores)
            for i := 0; i < cores; i++ {
                go func(i int) {
                    defer wg.Done()
                    for j := 0; j < ops; j++ {
                        c[i] += Type(j)
                    }
                }(i)
            }
            wg.Wait()
        }
    • Uint8
    • Uint32
    • Uint64
    Which is faster?
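    The counterintuitive answer is governed by false sharing: neighboring slice elements live on the same cache line, so cores invalidate each other's caches even though no goroutine shares a variable. A sketch of the usual mitigation, assuming a 64-byte cache line (the common case, but hardware-dependent):

        package main

        import "sync"

        // padded gives each counter its own 64-byte cache line so
        // cores stop invalidating each other's caches.
        type padded struct {
            n uint64
            _ [56]byte // pad the struct out to 64 bytes
        }

        func contendPadded(cores, ops int) {
            var wg sync.WaitGroup
            wg.Add(cores)
            c := make([]padded, cores)
            for i := 0; i < cores; i++ {
                go func(i int) {
                    defer wg.Done()
                    for j := 0; j < ops; j++ {
                        c[i].n += uint64(j) // touches only this goroutine's line
                    }
                }(i)
            }
            wg.Wait()
        }

        func main() {
            contendPadded(4, 1000000)
        }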
  17. When addressing lock contention is not enough
    In the ideal case lock contention is the only contention that exists, but computers are real machines, not ideal ones. Enter cache contention.
    Sharing state across cores requires going through layers of memory.
    [diagram: memory hierarchy — main memory, shared L3 cache, per-pair L2 caches, per-core L1 caches, four cores]
  18. Real-world "reads"
    Acquiring a read lock on an RWMutex requires writing a counter. Even atomic adds are executed sequentially by the hardware.
    This is counterintuitive, but modern hardware works this way.
    [diagram: memory hierarchy, as on the previous slide]
  19. Cache sharing: sync.Map
    sync.Map is functionally equivalent to a map guarded by an RWMutex, and it is cache friendly. The downside is that it gives fewer guarantees, has fewer methods and loses type safety, so use it only if necessary.
    It is strongly advised to never use it directly but to write type-safe wrappers for it.
        // Package sync
        type Map
            func (m *Map) Delete(key interface{})
            func (m *Map) Load(key interface{}) (val interface{}, ok bool)
            func (m *Map) Range(f func(key, value interface{}) bool)
            func (m *Map) Store(key, value interface{})
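    A minimal sketch of such a type-safe wrapper (the string-to-int specialization and the name Counters are arbitrary examples):

        package main

        import "sync"

        // Counters wraps sync.Map so callers never touch interface{} directly.
        type Counters struct {
            m sync.Map
        }

        func (c *Counters) Store(key string, value int) {
            c.m.Store(key, value)
        }

        func (c *Counters) Load(key string) (int, bool) {
            v, ok := c.m.Load(key)
            if !ok {
                return 0, false
            }
            return v.(int), true
        }

        func main() {
            var c Counters
            c.Store("requests", 42)
            n, ok := c.Load("requests")
            _, _ = n, ok
        }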
  20. If a map is not what you need
    Sometimes you need something other than a map: something you rarely write but frequently read. In those rare cases you can use the atomic package.
    If you deal with objects you'll also need unsafe.
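    A sketch of that read-mostly pattern with atomic pointer loads (the config type is made up for illustration; writers must publish a fresh object, never mutate one that readers may already hold):

        package main

        import (
            "sync/atomic"
            "unsafe"
        )

        // config is an illustrative read-mostly object.
        type config struct {
            limit int
        }

        var current unsafe.Pointer // holds a *config; touch only via atomics

        // load is the frequent, cheap read path.
        func load() *config {
            return (*config)(atomic.LoadPointer(&current))
        }

        // store is the rare write path. For updates that depend on the
        // old value, use the CompareAndSwap loop shown two slides later.
        func store(c *config) {
            atomic.StorePointer(&current, unsafe.Pointer(c))
        }

        func main() {
            store(&config{limit: 10})
            _ = load().limit
        }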
  21. What could possibly go wrong?
    The race detector will probably not detect your races.
    It is very hard to reason about this kind of code, and very few people can. Even the experts on these topics have a hard time debugging it.
  22. If you really need to
    Read:
        v := atomic.LoadInt64(addr)
    Write:
        for {
            oldValue := atomic.LoadInt64(addr)
            newValue := change(oldValue)
            if atomic.CompareAndSwapInt64(addr, oldValue, newValue) {
                break
            }
        }
    (The slide's untyped atomic.Load/CompareAndSwap are shorthand; the package provides typed variants such as the Int64 ones above.)
    Warning: do not just use Store or you'll get a nasty data race.
    Also be aware of the ABA problem.
  23. A bit of history
    • sync.WaitGroup was rewritten from mutexes to atomics in Jul 2011. This introduced a bug that caused some random-looking memory corruptions. The bug was fixed in Apr 2014.
    • A bug in the parallel GC was introduced in Sep 2011 and fixed only in Jan 2014:
        runtime.parfordo(work.sweepfor);
        bufferList[m->helpgc].busy = 0;
        if(runtime.xadd(&work.ndone, +1) == work.nproc-1)
            runtime.notewakeup(&work.alldone);