The tricky truth about parallel execution and modern hardware

Concurrency and parallelism in Ruby will become more and more important. Machines are multi-core now, and parallelization is often the way to speed things up these days.

At a hardware level, this parallel world is not always a nice and simple place to live. As Ruby implementations get faster and hardware more parallel, these details will matter for you as a Ruby developer too.

Want to know what the pitfalls of double-checked locking are? No idea what out-of-order execution means? How CPU cache effects can lead to obscure crashes? What this thing called a memory barrier is? How false sharing can cause performance issues?

Come listen if you want to know about the nitty-gritty details that can affect your Ruby application in the future.

Dirkjan Bussink

November 09, 2013

Transcript

  1. The tricky truth about parallel execution and modern hardware. Dirkjan Bussink, @dbussink
  2. None
  3. None
  4. Causality

  5. a = 1 b = ""

  6. CPU1: a = 1; b = ""
     CPU2: x = b; y = a
     Initially: a = 0, b = 0
  7. CPU1: a = 1; b = ""
     CPU2: x = b; y = a
     Result: x = "", y = 1
  8. CPU1: a = 1; b = ""
     CPU2: x = b; y = a
     Result: x = 0, y = 0
  9. CPU1: a = 1; b = ""
     CPU2: x = b; y = a
     Result: x = 0, y = 1
  10. ?

  11. x y x y x y x = "" y = 0
  12. Wat?

  13. Compiler optimization

  14. a = 1 b = ""

  15. b = "" a = 1

  16. CPU1: b = ""; a = 1
      CPU2: x = b; y = a
      Result: x = "", y = 0
  17. Out of order execution

  18. 8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations
      Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A
  19. store a, 1; store b, ""; x = load b; y = load a => x == "", y == 1
      x = load b; y = load a; store a, 1; store b, "" => x == 0, y == 0
      store a, 1; x = load b; y = load a; store b, "" => x == 0, y == 1
  20. Not all architectures are created equal

  21. ARMv7

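  22. store a, 1; store b, ""; x = load b; y = load a => x == "", y == 1
      x = load b; y = load a; store a, 1; store b, "" => x == 0, y == 0
      store a, 1; x = load b; y = load a; store b, "" => x == 0, y == 1
      store b, ""; x = load b; y = load a; store a, 1 => x == "", y == 0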
  23. CPU caches

  24. Memory is slow

  25. L1: 4 cycles
      L2: 10 cycles
      L3: 40 - 75 cycles
      RAM: 60ns - 100ns (hundreds of cycles)
  26. Caching is hard…

  27. Store buffer

  28. CPU1: a = 1; x = b
      CPU2: b = ""; y = a
      Initially: a = 0, b = 0
  29. CPU1: a = 1; x = b
      CPU2: b = ""; y = a
      Result: x = 0, y = 1
  30. CPU1: a = 1; x = b
      CPU2: b = ""; y = a
      Result: x = "", y = 0
  31. CPU1: a = 1; x = b
      CPU2: b = ""; y = a
      Result: x = "", y = 1
  32. 8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations
      Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A
  33. CPU1: a = 1; x = b
      CPU2: b = ""; y = a
      Result: x = 0, y = 0
  34. Fixing it!

  35. Memory barriers

  36. __asm__ __volatile__ ("mfence" ::: "memory");

  37. sfence lfence mfence

  38. Double check locking

  39. class Foo
        @mutex = Mutex.new

        def initialize
          @bar = "Bar"
        end

        def self.instance
          unless @instance
            @mutex.synchronize do
              unless @instance
                @instance = Foo.new
              end
            end
          end
          @instance
        end
      end
  40. Explicit synchronization

      class Foo
        @mutex = Mutex.new

        def initialize
          @bar = "Bar"
        end

        def self.instance
          unless @instance
            @mutex.synchronize do
              instance = Foo.new
              # Insert compiler barrier
              @instance = instance
            end
          end
          @instance
        end
      end
  41. Ruby?

  42. False sharing

  43. CPU Cache

  44. class Foo
        attr_accessor :a
      end

      f = Foo.new

      # CPU1
      i = 0
      while i < 100000
        f.a = 0
        i += 1
      end

      # CPU2
      i = 0
      while i < 100000
        f.a = 1
        i += 1
      end
  45. L1: 4 cycles
      L2: 10 cycles
      L3: 40 - 75 cycles
      RAM: 60ns - 100ns (hundreds of cycles)
      Shared across cores
  46. class Foo
        attr_accessor :a
        attr_accessor :b
      end

      f = Foo.new

      # CPU1
      i = 0
      while i < 100000
        f.b = 0
        i += 1
      end

      # CPU2
      i = 0
      while i < 100000
        f.a = 1
        i += 1
      end
  47. Cache lines

  48. class Foo
        attr_accessor :a
        ...
        attr_accessor :k
      end

      f = Foo.new

      # CPU1
      i = 0
      while i < 100000
        f.k = 0
        i += 1
      end

      # CPU2
      i = 0
      while i < 100000
        f.a = 1
        i += 1
      end
  49.                     real
      single thread       4.252730
      actual sharing      19.963792
      false sharing       19.803237
      no false sharing    4.617507
  50. Ruby

  51. What is thread safe code?

  52. Future

  53. Ostrich strategy

  54. Memory model

  55. Better APIs

  56. None
  57. Fin
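
The "Causality" slides (5 to 11) ask which values of x and y CPU2 can observe. As a rough sketch, the question can be answered by brute force for an idealized machine that runs instructions one at a time in program order. The helper names below (`interleavings`, `outcomes`) are made up for this illustration and are not from the talk.

```ruby
# Toy model: enumerate every interleaving of two tiny "programs" that keeps
# each program's own order, and collect the (x, y) pairs CPU2 ends up with.
CPU1 = [
  ->(m) { m[:a] = 1 },     # a = 1
  ->(m) { m[:b] = "" }     # b = ""
].freeze

CPU2 = [
  ->(m) { m[:x] = m[:b] }, # x = b
  ->(m) { m[:y] = m[:a] }  # y = a
].freeze

# Yields every merge of the two step lists that preserves per-list order.
def interleavings(left, right, prefix = [], &block)
  return block.call(prefix + left + right) if left.empty? || right.empty?

  interleavings(left.drop(1), right, prefix + [left.first], &block)
  interleavings(left, right.drop(1), prefix + [right.first], &block)
end

# Runs every schedule against fresh memory and returns the distinct results.
def outcomes(cpu1, cpu2)
  results = []
  interleavings(cpu1, cpu2) do |schedule|
    memory = { a: 0, b: 0 } # initial values from the slides
    schedule.each { |step| step.call(memory) }
    results << [memory[:x], memory[:y]]
  end
  results.uniq
end

OUTCOMES = outcomes(CPU1, CPU2)
OUTCOMES.each { |x, y| puts "x = #{x.inspect}, y = #{y.inspect}" }

# Swapping CPU1's two stores, as a compiler might, adds a new outcome.
REORDERED = outcomes(CPU1.reverse, CPU2)
puts "reordered stores allow x = \"\", y = 0: #{REORDERED.include?(["", 0])}"
```

In this model the program-order version never produces x = "" with y = 0, which is the outcome the "Wat?" slide questions. It only shows up once the two stores on CPU1 are swapped, matching the "Compiler optimization" slides.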
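
The "Store buffer" slides (27 to 33) show both loads returning 0. One way to picture this is a toy model in which every CPU parks its stores in a private buffer before they reach shared memory. `ToyCpu` and `run_litmus` are hypothetical names for this sketch; it simulates a single fixed schedule and is not a description of how any real CPU is implemented.

```ruby
# Toy model of per-CPU store buffers (an illustration, not real hardware).
class ToyCpu
  def initialize(memory)
    @memory = memory
    @store_buffer = {}
  end

  # A store is parked in the private buffer; other CPUs cannot see it yet.
  def store(name, value)
    @store_buffer[name] = value
  end

  # A load sees this CPU's own pending stores first, then shared memory.
  def load(name)
    @store_buffer.fetch(name) { @memory[name] }
  end

  # Drain the buffer to shared memory, which is what a full fence forces.
  def fence
    @memory.update(@store_buffer)
    @store_buffer.clear
  end
end

# CPU1: a = 1; x = b      CPU2: b = ""; y = a
def run_litmus(with_fence:)
  memory = { a: 0, b: 0 }
  cpu1 = ToyCpu.new(memory)
  cpu2 = ToyCpu.new(memory)

  cpu1.store(:a, 1)
  cpu2.store(:b, "")
  if with_fence
    cpu1.fence
    cpu2.fence
  end
  x = cpu1.load(:b)
  y = cpu2.load(:a)
  [x, y]
end

p run_litmus(with_fence: false) # both stores are still buffered
p run_litmus(with_fence: true)  # both stores are visible before the loads
```

Without the fence, both loads run while the stores are still buffered, so x and y both come back as 0. With the fence between each store and the following load, this schedule yields x = "" and y = 1, which is the role `mfence` plays on the "Memory barriers" slides.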
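
For the double check locking example on slides 39 and 40, a conservative alternative is to skip the unlocked fast path and always take the mutex. This is a minimal sketch, assuming that acquiring and releasing a `Mutex` orders the memory operations around it (as lock implementations do in practice; Ruby itself does not specify a memory model). The `attr_reader` and the thread count are additions for the demonstration.

```ruby
class Foo
  @mutex = Mutex.new

  attr_reader :bar

  def initialize
    @bar = "Bar"
  end

  # Every caller goes through the lock, so no thread reads @instance
  # outside the critical section. Slower than the double-checked version,
  # but it does not depend on how stores are ordered outside the lock.
  def self.instance
    @mutex.synchronize { @instance ||= new }
  end
end

# Eight threads race to create the singleton; all should get the same object.
INSTANCES = Array.new(8) { Thread.new { Foo.instance } }.map(&:value)
puts INSTANCES.map(&:object_id).uniq.size
```

The trade-off is a lock acquisition on every call. Whether that cost matters depends on how hot the call site is, and it is the price paid for not reasoning about reordering at all.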