
The tricky truth about parallel execution and modern hardware

Concurrency and parallelism in Ruby are becoming ever more important. Machines ship with more and more cores, and parallelization is often the way to speed things up these days.

At a hardware level, this parallel world is not always a nice and simple place to live. As Ruby implementations get faster and hardware more parallel, these details will matter for you as a Ruby developer too.

Want to know what the pitfalls of double-checked locking are? No idea what out-of-order execution means? How CPU cache effects can lead to obscure crashes? What this thing called a memory barrier is? How false sharing can cause performance issues?

Come listen if you want to learn the nitty-gritty details that can affect your Ruby applications in the future.

Dirkjan Bussink

November 09, 2013

Transcript

  1. The tricky truth about
    parallel execution
    and modern hardware
    Dirkjan Bussink
    @dbussink


  4. Causality


  5. a = 1
    b = ""


  6. CPU1         CPU2
    a = 1        x = b
    b = ""       y = a

    initially: a = 0, b = 0

  7. CPU1         CPU2
    a = 1        x = b
    b = ""       y = a

    result: x = "", y = 1

  8. CPU1         CPU2
    a = 1        x = b
    b = ""       y = a

    result: x = 0, y = 0

  9. CPU1         CPU2
    a = 1        x = b
    b = ""       y = a

    result: x = 0, y = 1

  10. ?


  11. x = ""
    y = 0
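Under sequential consistency this outcome is impossible: no interleaving of the two instruction streams that preserves each CPU's own program order produces it. A small Ruby sketch (my own illustration, not part of the deck) enumerates every such interleaving to check:

```ruby
# Model each CPU's program as an ordered list of operations on shared
# memory, enumerate every interleaving that preserves program order,
# and collect the possible (x, y) results.
cpu1 = [->(m) { m[:a] = 1 }, ->(m) { m[:b] = "" }]
cpu2 = [->(m) { m[:x] = m[:b] }, ->(m) { m[:y] = m[:a] }]

def interleavings(xs, ys)
  return [ys] if xs.empty?
  return [xs] if ys.empty?
  interleavings(xs.drop(1), ys).map { |rest| [xs.first] + rest } +
    interleavings(xs, ys.drop(1)).map { |rest| [ys.first] + rest }
end

outcomes = interleavings(cpu1, cpu2).map do |schedule|
  mem = { a: 0, b: 0 }
  schedule.each { |op| op.call(mem) }
  [mem[:x], mem[:y]]
end.uniq

p outcomes  # x = "" with y = 0 is never among them
```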

  12. Wat?


  13. Compiler
    optimization


  14. a = 1
    b = ""


  15. b = ""
    a = 1


  16. CPU1         CPU2
    b = ""       x = b
    a = 1        y = a

    result: x = "", y = 0

  17. Out of order
    execution


  18. 8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations
    Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A


  19. store a, 1
    store b, ""
    x = load b
    y = load a

    x == ""
    y == 1

    x = load b
    y = load a
    store a, 1
    store b, ""

    x == 0
    y == 0

    store a, 1
    x = load b
    y = load a
    store b, ""

    x == 0
    y == 1
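On MRI you are unlikely to observe a reordered outcome from Ruby code, since the GVL serializes execution, but a litmus-test harness along these lines (my own sketch, not from the deck) shows the shape of such experiments:

```ruby
# Run the load/store litmus test many times and tally the (x, y) outcomes.
# On MRI the GVL means only sequentially consistent results are expected;
# a truly parallel runtime on weakly ordered hardware could show more.
results = Hash.new(0)

1_000.times do
  a = b = 0
  x = y = nil

  t1 = Thread.new { a = 1; b = "" }
  t2 = Thread.new { x = b; y = a }
  [t1, t2].each(&:join)

  results[[x, y]] += 1
end

results.each { |outcome, count| puts "#{outcome.inspect} seen #{count} times" }
```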

  20. Not all architectures
    are created equal


  21. ARMv7


  22. On ARMv7, stores may also be reordered with other stores:

    store b, ""
    x = load b
    y = load a
    store a, 1

    x == ""
    y == 0

  23. CPU caches


  24. Memory is slow


  25. L1    4 cycles
    L2    10 cycles
    L3    40 - 75 cycles
    RAM   60ns - 100ns (hundreds of cycles)

  26. Caching is hard…


  27. Store buffer


  28. CPU1         CPU2
    a = 1        b = ""
    x = b        y = a

    initially: a = 0, b = 0

  29. CPU1         CPU2
    a = 1        b = ""
    x = b        y = a

    result: x = 0, y = 1

  30. CPU1         CPU2
    a = 1        b = ""
    x = b        y = a

    result: x = "", y = 0

  31. CPU1         CPU2
    a = 1        b = ""
    x = b        y = a

    result: x = "", y = 1

  32. 8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations
    Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A


  33. CPU1         CPU2
    a = 1        b = ""
    x = b        y = a

    result: x = 0, y = 0

  34. Fixing it!


  35. Memory barriers


  36. __asm__ __volatile__ ("mfence" ::: "memory");


  37. sfence
    lfence
    mfence

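Plain Ruby offers no way to emit these fences directly; the VM issues them for you inside its synchronization primitives. At the Ruby level, a Mutex provides the equivalent ordering guarantee: writes made before the unlock are visible, in order, to whichever thread locks the Mutex next. A minimal sketch of that (variable names are my own):

```ruby
# A Mutex stands in for explicit fences at the Ruby level: stores made
# before the unlock are guaranteed visible, in order, to a thread that
# acquires the same Mutex afterwards.
lock = Mutex.new
a = 0
b = 0

writer = Thread.new do
  lock.synchronize do
    a = 1
    b = ""
  end
end
writer.join

lock.synchronize do
  # Having acquired the lock, we must observe both stores.
  puts "a = #{a}, b = #{b.inspect}"
end
```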

  38. Double-checked locking


  39. class Foo
      @mutex = Mutex.new

      def initialize
        @bar = "Bar"
      end

      def self.instance
        unless @instance
          @mutex.synchronize do
            unless @instance
              @instance = Foo.new
            end
          end
        end
        @instance
      end
    end

  40. class Foo
      @mutex = Mutex.new

      def initialize
        @bar = "Bar"
      end

      def self.instance
        unless @instance
          @mutex.synchronize do
            instance = Foo.new
            # Insert compiler barrier
            @instance = instance
          end
        end
        @instance
      end
    end

    Explicit synchronization
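Rather than hand-rolling double-checked locking, Ruby's standard library also ships a Singleton module that handles lazy, thread-safe instantiation for you. A small sketch of that alternative (my own, not from the deck):

```ruby
require "singleton"

# The stdlib Singleton module makes Foo.instance lazy and thread safe,
# so no hand-rolled double-checked locking is needed.
class Foo
  include Singleton

  attr_reader :bar

  def initialize
    @bar = "Bar"
  end
end

# Even when many threads race for the instance, they all get the same one.
instances = 10.times.map { Thread.new { Foo.instance } }.map(&:value)
puts instances.uniq.length  # all threads share one instance
```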

  41. Ruby?


  42. False sharing


  43. CPU Cache


  44. class Foo
      attr_accessor :a
    end

    f = Foo.new

    CPU1:
    i = 0
    while i < 100000
      f.a = 0
      i += 1
    end

    CPU2:
    i = 0
    while i < 100000
      f.a = 1
      i += 1
    end

  45. L1    4 cycles
    L2    10 cycles
    L3    40 - 75 cycles  (shared across cores)
    RAM   60ns - 100ns (hundreds of cycles)

  46. class Foo
      attr_accessor :a
      attr_accessor :b
    end

    f = Foo.new

    CPU1:
    i = 0
    while i < 100000
      f.b = 0
      i += 1
    end

    CPU2:
    i = 0
    while i < 100000
      f.a = 1
      i += 1
    end

  47. Cache lines


  48. class Foo
      attr_accessor :a
      ...
      attr_accessor :k
    end

    f = Foo.new

    CPU1:
    i = 0
    while i < 100000
      f.k = 0
      i += 1
    end

    CPU2:
    i = 0
    while i < 100000
      f.a = 1
      i += 1
    end

  49.                    real
    single thread      4.252730
    actual sharing    19.963792
    false sharing     19.803237
    no false sharing   4.617507
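The numbers above presumably come from a benchmark along these lines. The harness below is my own reconstruction, not the original script; note that on MRI the GVL keeps the two writers from running in parallel, so the false-sharing penalty only shows up on runtimes with real parallelism such as Rubinius or JRuby:

```ruby
require "benchmark"

# Two threads hammer adjacent instance variables of the same object.
# On hardware with true parallelism, both ivars landing on one cache
# line causes the cores to fight over it (false sharing).
class Foo
  attr_accessor :a, :b
end

ITERATIONS = 100_000

def hammer(obj, writer)
  i = 0
  while i < ITERATIONS
    writer.call(obj)
    i += 1
  end
end

f = Foo.new
time = Benchmark.realtime do
  t1 = Thread.new { hammer(f, ->(o) { o.a = 1 }) }
  t2 = Thread.new { hammer(f, ->(o) { o.b = 0 }) }
  [t1, t2].each(&:join)
end

puts format("two writers, adjacent ivars: %.6fs", time)
```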

  50. Ruby


  51. What is thread safe code?


  52. Future


  53. Ostrich strategy


  54. Memory model


  55. Better APIs


  57. Fin
