Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Atomic programming

Felix Chern
January 05, 2017
190

Atomic programming

Felix Chern

January 05, 2017
Tweet

Transcript

  1. Atomic and Lock Free
    Programming
    Felix Chern

    View Slide

  2. Atomic and Sanity Free
    Programming
    Felix Chern
    The most torturing
    experience
    I ever had!

    View Slide

  3. About Me
    • Felix Chern
    • Google: Cloud networking
    • OpenX: Big data team tech
    lead
    • SupplyFrame: Built Hadoop
    pipeline
    • http://idryman.org

    View Slide

  4. Disclaimer: this talk is
    not in behave of google!
    Just a personal side project
    ;)

    View Slide

  5. Before we begin

    View Slide

  6. This is not a
    framework
    enthusiast talk

    View Slide

  7. More like an adventure

    View Slide

  8. Become a
    treasure hunter
    and pass the tests!

    View Slide

  9. What is atomic?

    View Slide

  10. You might think atomic is…
    yet another data type that is concurrent friendly
    Reference count, program counter; how hard can it be?

    View Slide

  11. Actually

    View Slide

  12. View Slide

  13. Problems with C11/C++11
    atomic API
    • Memory (re)ordering
    • Subtle memory model
    • Compiler bugs
    • Performance penalties

    View Slide

  14. 1

    View Slide

  15. Compiler optimizes
    stuff

    View Slide

  16. int a;
    if (b > 0) {
    a = 1;
    } else {
    a = 0;
    }
    int a;
    a = 0;
    if (b > 0) {
    a = 1;
    }
    Can be optimized as

    View Slide

  17. What has been correct
    for sequential program,

    View Slide

  18. may be wrong for
    concurrent program!

    View Slide

  19. View Slide

  20. non atomic
    store x
    non atomic
    store y
    Atomic
    store a
    Atomic
    load a
    non atomic
    load x
    non atomic
    load y
    Time
    T1 T2
    Store on x and y
    “happens before”
    load on x and y

    View Slide

  21. non atomic
    store x
    non atomic
    store y
    Atomic
    store a
    Time
    T1
    memory order release:
    A store operation. No (reads or) writes
    in the current thread can be reordered
    after this store.
    // a init to 1
    x = 1;
    y = 2;
    a.store(0,
    memory_order_release);

    View Slide

  22. non atomic
    store x
    non atomic
    store y
    Atomic
    store a
    Time
    T1
    memory order release:
    A store operation. No (reads or) writes
    in the current thread can be reordered
    after this store.

    View Slide

  23. Atomic
    load a
    non atomic
    load x
    non atomic
    load y
    Time
    T2
    while (a.load
    (memory_order_acquire))
    { /* spin lock */ }
    // read x, y
    memory order acquire:
    A load operation. No reads (or writes)
    in the current thread can be reordered
    before this load.

    View Slide

  24. Atomic
    load a
    non atomic
    load x
    non atomic
    load y
    Time
    T2
    memory order acquire:
    A load operation. No reads (or writes)
    in the current thread can be reordered
    before this load.

    View Slide

  25. non atomic
    store x
    non atomic
    store y
    Atomic
    store a
    Atomic
    load a
    non atomic
    load x
    non atomic
    load y
    Time
    T1 T2
    Store on x and y
    “happens before”
    load on x and y

    View Slide

  26. non atomic
    store x
    Atomic
    store a
    Atomic
    test and load a
    Time
    T1 T2
    Atomic
    test and load a
    non atomic
    store x
    Atomic
    store a
    critical section
    critical section

    View Slide

  27. non atomic
    store x
    Atomic
    store a
    T1
    Atomic
    test and load a
    using namespace std;
    atomic_flag a = ATOMIC_FLAG_INIT;
    while (a.test_and_set(
    memory_order_acquire
    ))
    ; // spin lock
    x++;
    a.clear(memory_order_release);
    Alternatively:
    • atomic_fetch_and/atomic_fetch_or
    • atomic_exchange/atomic_store
    • atomic_compare_exchange_weak/strong

    View Slide

  28. Memory model
    • memory_order_relaxed: there are no synchronization or ordering
    constraints, only atomicity is required of this operation
    • memory_order_acquire: A load operation. No reads (or writes) in the
    current thread can be reordered before this load.
    • memory_order_release: A store operation. No (reads or) writes in the
    current thread can be reordered after this store.
    • memory_order_acq_rel: A read-modify-write operation. This is both
    acquire and release.
    • memory_order_seq_cst: Any operation with this memory order is both
    an acquire operation and a release operation, plus a single total order
    exists in which all threads observe all modifications in the same order

    View Slide

  29. 2

    View Slide

  30. subtle semantics

    View Slide

  31. non atomic
    store x
    Atomic
    store a
    ok?
    Atomic
    test and load a
    while (a.test_and_set(
    memory_order_acquire
    ))
    ; // spin lock
    x++;
    a.clear(memory_order_release);
    memory order acquire:
    A load operation. No reads in the
    current thread can be reordered before
    this load.
    ok?

    View Slide

  32. non atomic
    store x
    Atomic
    store a
    ok?
    Atomic
    test and load a
    while (a.test_and_set(
    memory_order_acquire
    ))
    ; // spin lock
    x++;
    a.clear(memory_order_release);
    memory order acquire:
    A load operation. No reads or writes in
    the current thread can be reordered
    before this load.
    ok?
    Added in Oct, 2016

    View Slide

  33. View Slide

  34. https://godbolt.org/g/atB98G
    Reordering seems fine in compiler implementation.

    View Slide

  35. Reference count
    // inlinable
    void retain(Object* obj) {
    obj->refcnt.fetch_add(1, memory_order_acquire);
    }
    memory order acquire:
    A load operation. No reads or writes in
    the current thread can be reordered
    before this load.
    Both a load and a store!

    View Slide

  36. View Slide

  37. What should we do?

    View Slide

  38. atomic checklist
    • Check your code on gcc.godbolt.org
    • Get familiar with the assembly of the targeted
    platform
    • Unit test for data race

    What seems correct on godbolt may not be the
    case on your platform :(

    View Slide

  39. View Slide

  40. 3

    View Slide

  41. How about the
    performance?

    View Slide

  42. https://godbolt.org/g/9Bb6yt
    • No system call
    • No thread fence (?)
    • Only xchg, mov
    • Should be fast, right?

    View Slide

  43. Spin lock has terrible performance!
    Latency

    View Slide

  44. View Slide

  45. View Slide

  46. Why?
    • xchg src, dest

    If one of the operands is a memory address, then the
    operation has an implicit LOCK prefix, that is, the
    exchange operation is atomic.
    • LOCK

    Causes the processor's LOCK# signal to be asserted
    during execution of the accompanying instruction. In
    a multiprocessor environment, the LOCK# signal
    insures that the processor has exclusive use of any
    shared memory while the signal is asserted.

    View Slide

  47. 4

    View Slide

  48. Optimize the spin
    Avoid calling LOCK# in the while loop

    View Slide

  49. https://godbolt.org/g/G0SfcB
    The red highlight does the job!

    View Slide

  50. Performance on different platforms may differ.

    View Slide

  51. Can we do better?

    View Slide

  52. • lock.load(memory_order_relaxed)
    • *reinterpret_cast(lock) // C++

    *(int*)lock // C
    • *reinterpret_cast(lock) // C++

    *(volatile int*)lock // C

    View Slide

  53. https://godbolt.org/g/WkyOHY

    View Slide

  54. Infinite loop LOL

    View Slide

  55. View Slide

  56. View Slide

  57. https://godbolt.org/g/wR6XZg
    volatile int works

    View Slide

  58. Turns out..
    • On x86
    • *(volatile int*) lock is same as

    lock.load(memory_order_relaxed)
    • memory order relaxed, acquire, acq_rel,
    seq_cst doesn’t matter here
    • But only on X86!
    • Unless you compile and test, you know nothing!

    View Slide

  59. It’s depressing.
    I know

    View Slide

  60. • Acquire-Release semantic is subtle
    • Need to double check compiled result
    • Performance penalty can be huge! (20x times
    slower)
    • Usually slower than pthread mutex
    • LOCK# synchronizes all memory access.

    View Slide

  61. View Slide

  62. View Slide

  63. 5

    View Slide

  64. BUT!!!

    View Slide

  65. Atomic API is more
    flexible compare to
    pthread mutex locks

    View Slide

  66. Build new logic with
    atomic
    • Build concurrent data structures/algorithms

    boost, Facebook folly, golang, etc.
    • Also make use of thread local variables
    • Create new semantic pthread doesn’t provide

    View Slide

  67. View Slide

  68. 6

    View Slide

  69. A mini concurrent queue
    • max_size init to 3
    • enqueue:

    acquire read lock

    if (size < max_size)

    enqueue object into the queue

    release read lock, return

    else

    release read lock and acquire write lock

    max_size = max_size * 2

    release write lock, acquire read lock

    enqueue object

    release read lock
    Not thread safe

    View Slide

  70. A mini concurrent queue
    • Problem: when releasing the read lock, the state
    is not safe
    • Two threads increasing the size (ok)
    • One thread increasing the size, the other does
    the opposite (bad)
    • One thread free the resource, the other inserting
    object (crash!)

    View Slide

  71. Introducing punch card
    • check_in (like acquire read lock)
    • check_out (like release read lock)
    • book_critical

    exclude new check in

    if already booked, failed to book
    • enter_critical

    wait until all check_in checked out then enter
    • exit_critical

    View Slide

  72. punch card read
    • check_in
    • check_out

    View Slide

  73. punch card write
    • check_in
    • if (!book_critical)

    check_out and return
    • while (!enter_critical) {}

    // spin until others check out
    • // do critical stuff, like switch state
    • exit_critical
    • check_out

    View Slide

  74. State A State B
    State C
    check in check out
    check in
    check out
    temporal state
    excluding check in
    critical section

    View Slide

  75. Way better than pthread rwlock!

    View Slide

  76. https://github.com/dryman/atomic_patterns/blob/master/op_atomic.h
    • check_in: pcard += 1 (when >= 0)
    • check_out: pcard -= 1
    • book_critical: Turn MSB to 1 (when > 0)

    i.e. pcard = INT_MIN + pcard
    • enter_critical: Spin until pcard == INT_MIN + 1

    pcard = INT_MIN
    • exit_critical: pcard = 1

    View Slide

  77. 7

    View Slide

  78. Atomic applications
    • Databases
    • RDBMS
    • NoSQL
    • Memory manager, allocator, garbage collector
    • Concurrent programming framework/language
    • Software transactional memory (STM)
    • Golang, Erlang, NodeJS(?)

    View Slide

  79. 8

    View Slide

  80. Beyond atomic
    • Thread local variable (C11/C++11)

    static __thread int x;
    • GCC transactional memory (experimental)

    __transaction_atomic { if (a > b) b++; }
    • Hardware transactional memory

    -mrtm (Restricted Transactional Memory)

    #include
    if ((status = _xbegin ()) == _XBEGIN_STARTED) {
    ... transaction code...
    _xend ();
    } else {
    ... non transactional fallback path...
    }

    View Slide

  81. References
    • CPP Atomic Operations Library
    • Preshing on programming blog posts on atomic
    • The Art of Multiprocessor Programming
    • C++ Concurrency In Action
    • GCC Transactional Memory
    • GCC X86 hard ware transaction reference
    • X86 references: http://x86.renejeschke.de
    • Compile code online: http://gcc.godbolt.org
    • https://github.com/dryman/atomic_patterns

    View Slide

  82. One more thing

    View Slide

  83. OPIC

    Object Persistence In C
    • https://github.com/dryman/opic
    • Bring data structures to big data
    • O(1) deserialization (mmap)
    • Target for high throughput (big data), but also low
    latency applications (this is why I entered atomic
    programming)
    • Version 3 is still under development

    (branch OPIC-33)

    View Slide

  84. Thank you!

    View Slide