Atomic programming

Cba7f423f9e0ee3a0be1ca18978a6684?s=47 Felix Chern
January 05, 2017
180

Atomic programming

Cba7f423f9e0ee3a0be1ca18978a6684?s=128

Felix Chern

January 05, 2017
Tweet

Transcript

  1. Atomic and Lock Free Programming Felix Chern

  2. Atomic and Sanity Free Programming Felix Chern The most torturing

    experience I ever had!
  3. About Me • Felix Chern • Google: Cloud networking •

    OpenX: Big data team tech lead • SupplyFrame: Built Hadoop pipeline • http://idryman.org
  4. Disclaimer: this talk is not in behave of google! Just

    a personal side project ;)
  5. Before we begin

  6. This is not a framework enthusiast talk

  7. More like an adventure

  8. Become a treasure hunter and pass the tests!

  9. What is atomic?

  10. You might think atomic is… yet another data type that

    is concurrent friendly Reference count, program counter; how hard can it be?
  11. Actually

  12. None
  13. Problems with C11/C++11 atomic API • Memory (re)ordering • Subtle

    memory model • Compiler bugs • Performance penalties
  14. 1

  15. Compiler optimizes stuff

  16. int a; if (b > 0) { a = 1;

    } else { a = 0; } int a; a = 0; if (b > 0) { a = 1; } Can be optimized as
  17. What has been correct for sequential program,

  18. may be wrong for concurrent program!

  19. None
  20. non atomic store x non atomic store y Atomic store

    a Atomic load a non atomic load x non atomic load y Time T1 T2 Store on x and y “happens before” load on x and y
  21. non atomic store x non atomic store y Atomic store

    a Time T1 memory order release: A store operation. No (reads or) writes in the current thread can be reordered after this store. // a init to 1 x = 1; y = 2; a.store(0, memory_order_release);
  22. non atomic store x non atomic store y Atomic store

    a Time T1 memory order release: A store operation. No (reads or) writes in the current thread can be reordered after this store.
  23. Atomic load a non atomic load x non atomic load

    y Time T2 while (a.load (memory_order_acquire)) { /* spin lock */ } // read x, y memory order acquire: A load operation. No reads (or writes) in the current thread can be reordered before this load.
  24. Atomic load a non atomic load x non atomic load

    y Time T2 memory order acquire: A load operation. No reads (or writes) in the current thread can be reordered before this load.
  25. non atomic store x non atomic store y Atomic store

    a Atomic load a non atomic load x non atomic load y Time T1 T2 Store on x and y “happens before” load on x and y
  26. non atomic store x Atomic store a Atomic test and

    load a Time T1 T2 Atomic test and load a non atomic store x Atomic store a critical section critical section
  27. non atomic store x Atomic store a T1 Atomic test

    and load a using namespace std; atomic_flag a = ATOMIC_FLAG_INIT; while (a.test_and_set( memory_order_acquire )) ; // spin lock x++; a.clear(memory_order_release); Alternatively: • atomic_fetch_and/atomic_fetch_or • atomic_exchange/atomic_store • atomic_compare_exchange_weak/strong
  28. Memory model • memory_order_relaxed: there are no synchronization or ordering

    constraints, only atomicity is required of this operation • memory_order_acquire: A load operation. No reads (or writes) in the current thread can be reordered before this load. • memory_order_release: A store operation. No (reads or) writes in the current thread can be reordered after this store. • memory_order_acq_rel: A read-modify-write operation. This is both acquire and release. • memory_order_seq_cst: Any operation with this memory order is both an acquire operation and a release operation, plus a single total order exists in which all threads observe all modifications in the same order
  29. 2

  30. subtle semantics

  31. non atomic store x Atomic store a ok? Atomic test

    and load a while (a.test_and_set( memory_order_acquire )) ; // spin lock x++; a.clear(memory_order_release); memory order acquire: A load operation. No reads in the current thread can be reordered before this load. ok?
  32. non atomic store x Atomic store a ok? Atomic test

    and load a while (a.test_and_set( memory_order_acquire )) ; // spin lock x++; a.clear(memory_order_release); memory order acquire: A load operation. No reads or writes in the current thread can be reordered before this load. ok? Added in Oct, 2016
  33. None
  34. https://godbolt.org/g/atB98G Reordering seems fine in compiler implementation.

  35. Reference count // inlinable void retain(Object* obj) { obj->refcnt.fetch_add(1, memory_order_acquire);

    } memory order acquire: A load operation. No reads or writes in the current thread can be reordered before this load. Both a load and a store!
  36. None
  37. What should we do?

  38. atomic checklist • Check your code on gcc.godbolt.org • Get

    familiar with the assembly of the targeted platform • Unit test for data race
 What seems correct on godbolt may not be the case on your platform :(
  39. None
  40. 3

  41. How about the performance?

  42. https://godbolt.org/g/9Bb6yt • No system call • No thread fence (?)

    • Only xchg, mov • Should be fast, right?
  43. Spin lock has terrible performance! Latency

  44. None
  45. None
  46. Why? • xchg src, dest
 If one of the operands

    is a memory address, then the operation has an implicit LOCK prefix, that is, the exchange operation is atomic. • LOCK
 Causes the processor's LOCK# signal to be asserted during execution of the accompanying instruction. In a multiprocessor environment, the LOCK# signal insures that the processor has exclusive use of any shared memory while the signal is asserted.
  47. 4

  48. Optimize the spin Avoid calling LOCK# in the while loop

  49. https://godbolt.org/g/G0SfcB The red highlight does the job!

  50. Performance on different platforms may differ.

  51. Can we do better?

  52. • lock.load(memory_order_relaxed) • *reinterpret_cast<int*>(lock) // C++
 *(int*)lock // C •

    *reinterpret_cast<volatile int*>(lock) // C++
 *(volatile int*)lock // C
  53. https://godbolt.org/g/WkyOHY

  54. Infinite loop LOL

  55. None
  56. None
  57. https://godbolt.org/g/wR6XZg volatile int works

  58. Turns out.. • On x86 • *(volatile int*) lock is

    same as
 lock.load(memory_order_relaxed) • memory order relaxed, acquire, acq_rel, seq_cst doesn’t matter here • But only on X86! • Unless you compile and test, you know nothing!
  59. It’s depressing. I know

  60. • Acquire-Release semantic is subtle • Need to double check

    compiled result • Performance penalty can be huge! (20x times slower) • Usually slower than pthread mutex • LOCK# synchronizes all memory access.
  61. None
  62. None
  63. 5

  64. BUT!!!

  65. Atomic API is more flexible compare to pthread mutex locks

  66. Build new logic with atomic • Build concurrent data structures/algorithms


    boost, Facebook folly, golang, etc. • Also make use of thread local variables • Create new semantic pthread doesn’t provide
  67. None
  68. 6

  69. A mini concurrent queue • max_size init to 3 •

    enqueue:
 acquire read lock
 if (size < max_size)
 enqueue object into the queue
 release read lock, return
 else
 release read lock and acquire write lock
 max_size = max_size * 2
 release write lock, acquire read lock
 enqueue object
 release read lock Not thread safe
  70. A mini concurrent queue • Problem: when releasing the read

    lock, the state is not safe • Two threads increasing the size (ok) • One thread increasing the size, the other does the opposite (bad) • One thread free the resource, the other inserting object (crash!)
  71. Introducing punch card • check_in (like acquire read lock) •

    check_out (like release read lock) • book_critical
 exclude new check in
 if already booked, failed to book • enter_critical
 wait until all check_in checked out then enter • exit_critical
  72. punch card read • check_in • check_out

  73. punch card write • check_in • if (!book_critical)
 check_out and

    return • while (!enter_critical) {}
 // spin until others check out • // do critical stuff, like switch state • exit_critical • check_out
  74. State A State B State C check in check out

    check in check out temporal state excluding check in critical section
  75. Way better than pthread rwlock!

  76. https://github.com/dryman/atomic_patterns/blob/master/op_atomic.h • check_in: pcard += 1 (when >= 0) •

    check_out: pcard -= 1 • book_critical: Turn MSB to 1 (when > 0)
 i.e. pcard = INT_MIN + pcard • enter_critical: Spin until pcard == INT_MIN + 1
 pcard = INT_MIN • exit_critical: pcard = 1
  77. 7

  78. Atomic applications • Databases • RDBMS • NoSQL • Memory

    manager, allocator, garbage collector • Concurrent programming framework/language • Software transactional memory (STM) • Golang, Erlang, NodeJS(?)
  79. 8

  80. Beyond atomic • Thread local variable (C11/C++11)
 static __thread int

    x; • GCC transactional memory (experimental)
 __transaction_atomic { if (a > b) b++; } • Hardware transactional memory
 -mrtm (Restricted Transactional Memory)
 #include <immintrin.h> if ((status = _xbegin ()) == _XBEGIN_STARTED) { ... transaction code... _xend (); } else { ... non transactional fallback path... }
  81. References • CPP Atomic Operations Library • Preshing on programming

    blog posts on atomic • The Art of Multiprocessor Programming • C++ Concurrency In Action • GCC Transactional Memory • GCC X86 hard ware transaction reference • X86 references: http://x86.renejeschke.de • Compile code online: http://gcc.godbolt.org • https://github.com/dryman/atomic_patterns
  82. One more thing

  83. OPIC
 Object Persistence In C • https://github.com/dryman/opic • Bring data

    structures to big data • O(1) deserialization (mmap) • Target for high throughput (big data), but also low latency applications (this is why I entered atomic programming) • Version 3 is still under development
 (branch OPIC-33)
  84. Thank you!