The tricky truth about parallel execution and modern hardware

Slide 1

Slide 1 text

The tricky truth about parallel execution and modern hardware
Dirkjan Bussink @dbussink

Slide 4

Slide 4 text

Causality

Slide 5

Slide 5 text

a = 1
b = ""

Slide 6

Slide 6 text

Initial state: a = 0  b = 0

CPU1        CPU2
a = 1       x = b
b = ""      y = a

Slide 7

Slide 7 text

CPU1        CPU2
a = 1       x = b
b = ""      y = a

Result: x = ""  y = 1

Slide 8

Slide 8 text

CPU1        CPU2
a = 1       x = b
b = ""      y = a

Result: x = 0  y = 0

Slide 9

Slide 9 text

CPU1        CPU2
a = 1       x = b
b = ""      y = a

Result: x = 0  y = 1
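The outcomes on the preceding slides can be checked mechanically. What follows is a minimal sketch, not taken from the talk: it models each CPU as a list of lambdas, assumes every operation is atomic and that each CPU's operations stay in program order, and enumerates every way of merging the two programs. The helper names `interleavings` and `outcomes` are made up for this illustration.

```ruby
# Each operation is a lambda that mutates a shared state hash.
cpu1 = [
  ->(s) { s[:a] = 1 },
  ->(s) { s[:b] = "" }
]
cpu2 = [
  ->(s) { s[:x] = s[:b] },
  ->(s) { s[:y] = s[:a] }
]

# Yield every merge of the two programs that keeps each CPU's own order.
def interleavings(left, right, prefix = [], &block)
  return block.call(prefix) if left.empty? && right.empty?

  interleavings(left.drop(1), right, prefix + [left.first], &block) unless left.empty?
  interleavings(left, right.drop(1), prefix + [right.first], &block) unless right.empty?
end

# Run every schedule from the initial state and collect the distinct [x, y] pairs.
def outcomes(cpu1, cpu2)
  results = []
  interleavings(cpu1, cpu2) do |schedule|
    state = { a: 0, b: 0 }
    schedule.each { |op| op.call(state) }
    results << [state[:x], state[:y]]
  end
  results.uniq
end

observed = outcomes(cpu1, cpu2)
observed.each { |x, y| puts "x = #{x.inspect} y = #{y.inspect}" }
```

Under these assumptions only three distinct results appear, and x = "" together with y = 0 is not one of them. The sketch says nothing about what a real compiler or CPU is allowed to do.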

Slide 11

Slide 11 text

x = ""  y = 0

Slide 12

Slide 12 text

Wat?

Slide 13

Slide 13 text

Compiler optimization

Slide 14

Slide 14 text

a = 1
b = ""

Slide 15

Slide 15 text

b = ""
a = 1

Slide 16

Slide 16 text

CPU1 (stores reordered)    CPU2
b = ""                     x = b
a = 1                      y = a

Result: x = ""  y = 0

Slide 17

Slide 17 text

Out of order execution

Slide 18

Slide 18 text

8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A

Slide 19

Slide 19 text

store a, 1
store b, ""
x = load b
y = load a
Result: x == ""  y == 1

x = load b
y = load a
store a, 1
store b, ""
Result: x == 0  y == 0

store a, 1
x = load b
y = load a
store b, ""
Result: x == 0  y == 1

Slide 20

Slide 20 text

Not all architectures are created equal

Slide 21

Slide 21 text

ARMv7

Slide 22

Slide 22 text

store b, ""
x = load b
y = load a
store a, 1
Result: x == ""  y == 0

Slide 23

Slide 23 text

CPU caches

Slide 24

Slide 24 text

Memory is slow

Slide 25

Slide 25 text

L1     4 cycles
L2     10 cycles
L3     40 - 75 cycles
RAM    60ns - 100ns (hundreds of cycles)

Slide 26

Slide 26 text

Caching is hard…

Slide 27

Slide 27 text

Store buffer

Slide 28

Slide 28 text

Initial state: a = 0  b = 0

CPU1        CPU2
a = 1       b = ""
x = b       y = a

Slide 29

Slide 29 text

CPU1        CPU2
a = 1       b = ""
x = b       y = a

Result: x = 0  y = 1

Slide 30

Slide 30 text

CPU1        CPU2
a = 1       b = ""
x = b       y = a

Result: x = ""  y = 0

Slide 31

Slide 31 text

CPU1        CPU2
a = 1       b = ""
x = b       y = a

Result: x = ""  y = 1

Slide 32

Slide 32 text

8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A

Slide 33

Slide 33 text

CPU1        CPU2
a = 1       b = ""
x = b       y = a

Result: x = 0  y = 0
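One way to see how x = 0 and y = 0 can both be observed is a toy model of a store buffer. The sketch below is an assumption-laden simplification, not a description of any real CPU: the `ToyCPU` class, its methods and the `run` helper are hypothetical names, each CPU gets a private FIFO buffer in front of one shared memory hash, and a store stays invisible to the other CPU until the buffer is drained.

```ruby
# Toy model: every CPU owns a FIFO store buffer in front of shared memory.
class ToyCPU
  def initialize(memory)
    @memory = memory
    @store_buffer = [] # pending [address, value] pairs, oldest first
  end

  # A store only lands in the local buffer; other CPUs cannot see it yet.
  def store(address, value)
    @store_buffer << [address, value]
  end

  # A load checks the local buffer first (newest entry wins), then memory.
  def load(address)
    pending = @store_buffer.reverse.find { |addr, _| addr == address }
    pending ? pending.last : @memory[address]
  end

  # Drain the buffer so the pending stores become visible in shared memory.
  def drain
    @store_buffer.each { |addr, value| @memory[addr] = value }
    @store_buffer.clear
  end
end

# Run the two-CPU program, optionally draining both buffers before the loads.
def run(drain_before_loads:)
  memory = { a: 0, b: 0 }
  cpu1 = ToyCPU.new(memory)
  cpu2 = ToyCPU.new(memory)

  cpu1.store(:a, 1)  # CPU1: a = 1
  cpu2.store(:b, "") # CPU2: b = ""
  if drain_before_loads
    cpu1.drain
    cpu2.drain
  end
  x = cpu1.load(:b)  # CPU1: x = b
  y = cpu2.load(:a)  # CPU2: y = a
  [x, y]
end

buffered = run(drain_before_loads: false)
drained = run(drain_before_loads: true)
puts "stores still buffered: x = #{buffered[0].inspect} y = #{buffered[1].inspect}"
puts "buffers drained first: x = #{drained[0].inspect} y = #{drained[1].inspect}"
```

In this model each load runs while the other CPU's store is still sitting in its buffer, so both loads see the initial values. Draining the buffers before the loads makes the stores visible and the result becomes x = "" and y = 1.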

Slide 34

Slide 34 text

Fixing it!

Slide 35

Slide 35 text

Memory barriers

Slide 36

Slide 36 text

__asm__ __volatile__ ("mfence" ::: "memory");

Slide 37

Slide 37 text

sfence lfence mfence

Slide 38

Slide 38 text

Double-checked locking

Slide 39

Slide 39 text

class Foo
  @mutex = Mutex.new

  def initialize
    @bar = "Bar"
  end

  def self.instance
    unless @instance
      @mutex.synchronize do
        unless @instance
          @instance = Foo.new
        end
      end
    end
    @instance
  end
end

Slide 40

Slide 40 text

class Foo
  @mutex = Mutex.new

  def initialize
    @bar = "Bar"
  end

  def self.instance
    unless @instance
      @mutex.synchronize do
        unless @instance
          instance = Foo.new
          # Insert compiler barrier
          @instance = instance
        end
      end
    end
    @instance
  end
end

Explicit synchronization
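To try the pattern out, here is a small runnable sketch that calls `Foo.instance` from several threads. The `@created` counter and the `created` reader are additions for this illustration only. It checks that exactly one object is built; it does not prove that the unsynchronized first read of `@instance` is safe on every Ruby implementation, since that depends on memory-ordering guarantees the language does not formally specify.

```ruby
class Foo
  @mutex = Mutex.new
  @instance = nil
  @created = 0 # illustration only: counts how many times Foo.new ran

  class << self
    attr_reader :created
  end

  def initialize
    @bar = "Bar"
  end

  def self.instance
    unless @instance            # first check, without the lock
      @mutex.synchronize do
        unless @instance        # second check, with the lock held
          @created += 1
          @instance = Foo.new
        end
      end
    end
    @instance
  end
end

# Eight threads race to fetch the singleton; Thread#value joins each one.
threads = Array.new(8) { Thread.new { Foo.instance } }
instances = threads.map(&:value)

puts "distinct instances: #{instances.map(&:object_id).uniq.size}"
puts "constructor calls:  #{Foo.created}"
```

A simpler and more conservative variant is to take the mutex on every call and drop the outer check, at the cost of locking on the fast path.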

Slide 41

Slide 41 text

Ruby?

Slide 42

Slide 42 text

False sharing

Slide 43

Slide 43 text

CPU Cache

Slide 44

Slide 44 text

class Foo
  attr_accessor :a
end

f = Foo.new

# CPU1
i = 0
while i < 100000
  f.a = 0
  i += 1
end

# CPU2
i = 0
while i < 100000
  f.a = 1
  i += 1
end

Slide 45

Slide 45 text

L1     4 cycles
L2     10 cycles
L3     40 - 75 cycles
RAM    60ns - 100ns (hundreds of cycles)

Shared across cores

Slide 46

Slide 46 text

class Foo
  attr_accessor :a
  attr_accessor :b
end

f = Foo.new

# CPU1
i = 0
while i < 100000
  f.b = 0
  i += 1
end

# CPU2
i = 0
while i < 100000
  f.a = 1
  i += 1
end

Slide 47

Slide 47 text

Cache lines

Slide 48

Slide 48 text

class Foo
  attr_accessor :a
  ...
  attr_accessor :k
end

f = Foo.new

# CPU1
i = 0
while i < 100000
  f.k = 0
  i += 1
end

# CPU2
i = 0
while i < 100000
  f.a = 1
  i += 1
end

Slide 49

Slide 49 text

                        real
single thread       4.252730
actual sharing     19.963792
false sharing      19.803237
no false sharing    4.617507
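A benchmark along these lines could be put together with Ruby's standard `Benchmark` module. This is a sketch under stated assumptions rather than the script behind the numbers above: the helpers `writer` and `in_threads` are hypothetical, the eleven attributes stand in for the elided `a ... k` list, and whether `a` and `k` really end up on different cache lines depends on how the Ruby implementation lays out objects. On an implementation with a global interpreter lock, such as MRI, the threads do not run in parallel, so the effect will not show.

```ruby
require "benchmark"

# Eleven accessors, standing in for the a ... k object on the slides.
class Foo
  attr_accessor :a, :b, :c, :d, :e, :f, :g, :h, :i, :j, :k
end

# Build a lambda that performs the given write `iterations` times.
def writer(iterations, &write)
  lambda do
    i = 0
    while i < iterations
      write.call
      i += 1
    end
  end
end

# Run each writer on its own thread and wait for all of them to finish.
def in_threads(*writers)
  writers.map { |w| Thread.new(&w) }.each(&:join)
end

iterations = 100_000
foo = Foo.new

times = Benchmark.bm(17) do |bm|
  bm.report("single thread") do
    writer(iterations) { foo.a = 0 }.call
    writer(iterations) { foo.a = 1 }.call
  end
  bm.report("actual sharing") do
    in_threads(writer(iterations) { foo.a = 0 }, writer(iterations) { foo.a = 1 })
  end
  bm.report("false sharing") do
    in_threads(writer(iterations) { foo.b = 0 }, writer(iterations) { foo.a = 1 })
  end
  bm.report("no false sharing") do
    in_threads(writer(iterations) { foo.k = 0 }, writer(iterations) { foo.a = 1 })
  end
end
```

`Benchmark.bm` prints one row per report, and the `real` column is the wall-clock time to compare across the four cases. Absolute numbers will differ from the slide depending on hardware, Ruby implementation and iteration count.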