
The tricky truth about parallel execution and modern hardware

Concurrency and parallelism in Ruby are becoming ever more important. Machines ship with more and more cores, and parallelization is often the way to speed things up these days.

At a hardware level, this parallel world is not always a nice and simple place to live. As Ruby implementations get faster and hardware more parallel, these details will matter for you as a Ruby developer too.

Want to know what the pitfalls of double-checked locking are? No idea what out-of-order execution means? How CPU cache effects can lead to obscure crashes? What this thing called a memory barrier is? How false sharing can cause performance issues?

Come listen if you want to learn the nitty-gritty details that can affect your Ruby applications in the future.

Dirkjan Bussink

November 09, 2013

Transcript

  1. The tricky truth about
    parallel execution
    and modern hardware
    Dirkjan Bussink
    @dbussink


  4. Causality


  5. a = 1
    b = ""


  6. CPU1         CPU2
    a = 1        x = b
    b = ""       y = a

    initially: a = 0, b = 0

  7. CPU1         CPU2
    a = 1        x = b
    b = ""       y = a

    result: x = "", y = 1

  8. CPU1         CPU2
    a = 1        x = b
    b = ""       y = a

    result: x = 0, y = 0

  9. CPU1         CPU2
    a = 1        x = b
    b = ""       y = a

    result: x = 0, y = 1

  10. ?


  11. x = ""
    y = 0
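Under sequential consistency this outcome is impossible: no interleaving of the two instruction streams that preserves each CPU's own program order produces it. A small Ruby sketch (my own illustration, not part of the deck) enumerates every such interleaving to check:

```ruby
# Model each CPU's program as an ordered list of operations on shared
# memory, enumerate every interleaving that preserves program order,
# and collect the possible (x, y) results.
cpu1 = [->(m) { m[:a] = 1 }, ->(m) { m[:b] = "" }]
cpu2 = [->(m) { m[:x] = m[:b] }, ->(m) { m[:y] = m[:a] }]

def interleavings(xs, ys)
  return [ys] if xs.empty?
  return [xs] if ys.empty?
  interleavings(xs.drop(1), ys).map { |rest| [xs.first] + rest } +
    interleavings(xs, ys.drop(1)).map { |rest| [ys.first] + rest }
end

outcomes = interleavings(cpu1, cpu2).map do |schedule|
  mem = { a: 0, b: 0 }
  schedule.each { |op| op.call(mem) }
  [mem[:x], mem[:y]]
end.uniq

p outcomes  # x = "" with y = 0 is never among them
```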

  12. Wat?


  13. Compiler
    optimization


  14. a = 1
    b = ""


  15. b = ""
    a = 1


  16. CPU1         CPU2
    b = ""       x = b
    a = 1        y = a

    result: x = "", y = 0

  17. Out of order
    execution


  18. 8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations
    Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A


  19. store a, 1
    store b, ""
    x = load b
    y = load a

    x == ""
    y == 1

    x = load b
    y = load a
    store a, 1
    store b, ""

    x == 0
    y == 0

    store a, 1
    x = load b
    y = load a
    store b, ""

    x == 0
    y == 1
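On MRI you are unlikely to observe a reordered outcome from Ruby code, since the GVL serializes execution, but a litmus-test harness along these lines (my own sketch, not from the deck) shows the shape of such experiments:

```ruby
# Run the load/store litmus test many times and tally the (x, y) outcomes.
# On MRI the GVL means only sequentially consistent results are expected;
# a truly parallel runtime on weakly ordered hardware could show more.
results = Hash.new(0)

1_000.times do
  a = b = 0
  x = y = nil

  t1 = Thread.new { a = 1; b = "" }
  t2 = Thread.new { x = b; y = a }
  [t1, t2].each(&:join)

  results[[x, y]] += 1
end

results.each { |outcome, count| puts "#{outcome.inspect} seen #{count} times" }
```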

  20. Not all architectures
    are created equal


  21. ARMv7


  22. On ARMv7, stores may also be reordered with other stores:

    store b, ""
    x = load b
    y = load a
    store a, 1

    x == ""
    y == 0

  23. CPU caches


  24. Memory is slow


  25. L1    4 cycles
    L2    10 cycles
    L3    40 - 75 cycles
    RAM   60ns - 100ns (hundreds of cycles)

  26. Caching is hard…


  27. Store buffer


  28. CPU1         CPU2
    a = 1        b = ""
    x = b        y = a

    initially: a = 0, b = 0

  29. CPU1         CPU2
    a = 1        b = ""
    x = b        y = a

    result: x = 0, y = 1

  30. CPU1         CPU2
    a = 1        b = ""
    x = b        y = a

    result: x = "", y = 0

  31. CPU1         CPU2
    a = 1        b = ""
    x = b        y = a

    result: x = "", y = 1

  32. 8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations
    Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A


  33. CPU1         CPU2
    a = 1        b = ""
    x = b        y = a

    result: x = 0, y = 0

  34. Fixing it!


  35. Memory barriers


  36. __asm__ __volatile__ ("mfence" ::: "memory");


  37. sfence
    lfence
    mfence

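Plain Ruby offers no way to emit these fences directly; the VM issues them for you inside its synchronization primitives. At the Ruby level, a Mutex provides the equivalent ordering guarantee: writes made before the unlock are visible, in order, to whichever thread locks the Mutex next. A minimal sketch of that (variable names are my own):

```ruby
# A Mutex stands in for explicit fences at the Ruby level: stores made
# before the unlock are guaranteed visible, in order, to a thread that
# acquires the same Mutex afterwards.
lock = Mutex.new
a = 0
b = 0

writer = Thread.new do
  lock.synchronize do
    a = 1
    b = ""
  end
end
writer.join

lock.synchronize do
  # Having acquired the lock, we must observe both stores.
  puts "a = #{a}, b = #{b.inspect}"
end
```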

  38. Double-checked locking


  39. class Foo
      @mutex = Mutex.new

      def initialize
        @bar = "Bar"
      end

      def self.instance
        unless @instance
          @mutex.synchronize do
            unless @instance
              @instance = Foo.new
            end
          end
        end
        @instance
      end
    end

  40. class Foo
      @mutex = Mutex.new

      def initialize
        @bar = "Bar"
      end

      def self.instance
        unless @instance
          @mutex.synchronize do
            instance = Foo.new
            # Insert compiler barrier
            @instance = instance
          end
        end
        @instance
      end
    end

    Explicit synchronization
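Rather than hand-rolling double-checked locking, Ruby's standard library also ships a Singleton module that handles lazy, thread-safe instantiation for you. A small sketch of that alternative (my own, not from the deck):

```ruby
require "singleton"

# The stdlib Singleton module makes Foo.instance lazy and thread safe,
# so no hand-rolled double-checked locking is needed.
class Foo
  include Singleton

  attr_reader :bar

  def initialize
    @bar = "Bar"
  end
end

# Even when many threads race for the instance, they all get the same one.
instances = 10.times.map { Thread.new { Foo.instance } }.map(&:value)
puts instances.uniq.length  # all threads share one instance
```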

  41. Ruby?


  42. False sharing


  43. CPU Cache


  44. class Foo
      attr_accessor :a
    end

    f = Foo.new

    CPU1:
    i = 0
    while i < 100000
      f.a = 0
      i += 1
    end

    CPU2:
    i = 0
    while i < 100000
      f.a = 1
      i += 1
    end

  45. L1    4 cycles
    L2    10 cycles
    L3    40 - 75 cycles  (shared across cores)
    RAM   60ns - 100ns (hundreds of cycles)

  46. class Foo
      attr_accessor :a
      attr_accessor :b
    end

    f = Foo.new

    CPU1:
    i = 0
    while i < 100000
      f.b = 0
      i += 1
    end

    CPU2:
    i = 0
    while i < 100000
      f.a = 1
      i += 1
    end

  47. Cache lines


  48. class Foo
      attr_accessor :a
      ...
      attr_accessor :k
    end

    f = Foo.new

    CPU1:
    i = 0
    while i < 100000
      f.k = 0
      i += 1
    end

    CPU2:
    i = 0
    while i < 100000
      f.a = 1
      i += 1
    end

  49.                    real
    single thread      4.252730
    actual sharing    19.963792
    false sharing     19.803237
    no false sharing   4.617507
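The numbers above presumably come from a benchmark along these lines. The harness below is my own reconstruction, not the original script; note that on MRI the GVL keeps the two writers from running in parallel, so the false-sharing penalty only shows up on runtimes with real parallelism such as Rubinius or JRuby:

```ruby
require "benchmark"

# Two threads hammer adjacent instance variables of the same object.
# On hardware with true parallelism, both ivars landing on one cache
# line causes the cores to fight over it (false sharing).
class Foo
  attr_accessor :a, :b
end

ITERATIONS = 100_000

def hammer(obj, writer)
  i = 0
  while i < ITERATIONS
    writer.call(obj)
    i += 1
  end
end

f = Foo.new
time = Benchmark.realtime do
  t1 = Thread.new { hammer(f, ->(o) { o.a = 1 }) }
  t2 = Thread.new { hammer(f, ->(o) { o.b = 0 }) }
  [t1, t2].each(&:join)
end

puts format("two writers, adjacent ivars: %.6fs", time)
```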

  50. Ruby


  51. What is thread safe code?


  52. Future


  53. Ostrich strategy


  54. Memory model


  55. Better APIs


  57. Fin
