Atomic programming

Atomic and Lock Free Programming Felix Chern

Atomic and Sanity Free Programming Felix Chern The most torturing
experience I ever had!

About Me
• Felix Chern
• Google: Cloud networking
• OpenX: Big data team tech lead
• SupplyFrame: Built Hadoop pipeline
• http://idryman.org

Disclaimer: this talk is not on behalf of Google! Just a personal side project ;)

Before we begin

This is not a framework enthusiast talk

More like an adventure

Become a treasure hunter and pass the tests!

What is atomic?

You might think atomic is…
yet another data type that is concurrency-friendly
Reference count, program counter; how hard can it be?

Actually

Problems with C11/C++11 atomic API
• Memory (re)ordering
• Subtle memory model
• Compiler bugs
• Performance penalties

Compiler optimizes stuff

  int a;
  if (b > 0) {
    a = 1;
  } else {
    a = 0;
  }
Can be optimized as
  int a;
  a = 0;
  if (b > 0) {
    a = 1;
  }

What is correct for a sequential program may be wrong for a concurrent program!

[Diagram: timelines for T1 and T2. T1: non-atomic store x, non-atomic store y, atomic store a. T2: atomic load a, non-atomic load x, non-atomic load y.]
The stores to x and y “happen before” the loads of x and y.

T1, memory order release: A store operation. No (reads or) writes in the current thread can be reordered after this store.
  // a init to 1
  x = 1;
  y = 2;
  a.store(0, memory_order_release);


T2, memory order acquire: A load operation. No reads (or writes) in the current thread can be reordered before this load.
  while (a.load(memory_order_acquire)) {
    /* spin lock */
  }
  // read x, y


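The release/acquire pairing above can be sketched as a small two-thread program. This is only an illustration under assumptions: the function names `producer`, `consumer` and `run_message_passing` are hypothetical, and `x`, `y`, `a` mirror the names on the slides.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Plain (non-atomic) payload, published through the atomic flag `a`.
int x = 0;
int y = 0;
std::atomic<int> a{1};  // a init to 1, as on the slide

// T1: write the payload, then publish it with a release store.
void producer() {
    x = 1;
    y = 2;
    a.store(0, std::memory_order_release);
}

// T2: spin until the flag is cleared, then read the payload.
int consumer() {
    while (a.load(std::memory_order_acquire)) {
        // spin
    }
    // The acquire load saw the release store, so x and y are visible here.
    return x + y;
}

// Hypothetical driver: runs T1 and T2 once and returns what T2 read.
int run_message_passing() {
    x = 0;
    y = 0;
    a.store(1);
    int sum = 0;
    std::thread t2([&sum] { sum = consumer(); });
    std::thread t1(producer);
    t1.join();
    t2.join();
    assert(sum == 3);
    return sum;
}
```

Replacing both orders with `memory_order_relaxed` would keep `a` atomic but would no longer order the plain accesses to `x` and `y`, which is exactly the data race the slides warn about.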

[Diagram: timelines for T1 and T2. Each thread runs a critical section: atomic test and load a, non-atomic store x, atomic store a.]

T1: atomic test and load a, non-atomic store x, atomic store a
  using namespace std;
  atomic_flag a = ATOMIC_FLAG_INIT;
  while (a.test_and_set(memory_order_acquire))
    ; // spin lock
  x++;
  a.clear(memory_order_release);
Alternatively:
• atomic_fetch_and/atomic_fetch_or
• atomic_exchange/atomic_store
• atomic_compare_exchange_weak/strong
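A complete version of the `atomic_flag` spin lock, exercised by several threads, might look like the sketch below. The names `lock_flag`, `counter`, `locked_increment` and `run_spin_lock_demo` are made up for this example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;
long counter = 0;  // plain variable, protected by lock_flag

// Increment the shared counter `times` times under the spin lock.
void locked_increment(int times) {
    for (int i = 0; i < times; ++i) {
        while (lock_flag.test_and_set(std::memory_order_acquire)) {
            // spin until the previous holder clears the flag
        }
        ++counter;  // critical section
        lock_flag.clear(std::memory_order_release);
    }
}

// Hypothetical driver: returns the final counter value.
long run_spin_lock_demo(int threads, int times) {
    counter = 0;
    std::vector<std::thread> pool;
    for (int i = 0; i < threads; ++i) {
        pool.emplace_back(locked_increment, times);
    }
    for (auto& t : pool) {
        t.join();
    }
    assert(counter == static_cast<long>(threads) * times);
    return counter;
}
```

Without the lock, the plain `++counter` would be a data race and increments could be lost.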

Memory model
• memory_order_relaxed: there are no synchronization or ordering constraints, only atomicity is required of this operation
• memory_order_acquire: A load operation. No reads (or writes) in the current thread can be reordered before this load.
• memory_order_release: A store operation. No (reads or) writes in the current thread can be reordered after this store.
• memory_order_acq_rel: A read-modify-write operation. This is both acquire and release.
• memory_order_seq_cst: Any operation with this memory order is both an acquire operation and a release operation, plus a single total order exists in which all threads observe all modifications in the same order

subtle semantics

[Diagram: atomic test and load a, then non-atomic store x (ok?), then atomic store a.]
  while (a.test_and_set(memory_order_acquire))
    ; // spin lock
  x++;
  a.clear(memory_order_release);
memory order acquire: A load operation. No reads in the current thread can be reordered before this load. ok?

[Diagram: atomic test and load a, then non-atomic store x (ok?), then atomic store a.]
  while (a.test_and_set(memory_order_acquire))
    ; // spin lock
  x++;
  a.clear(memory_order_release);
memory order acquire: A load operation. No reads or writes in the current thread can be reordered before this load. ok?
(“or writes”: added in Oct, 2016)

https://godbolt.org/g/atB98G
Reordering seems fine in compiler implementation.

Reference count
  // inlinable
  void retain(Object* obj) {
    obj->refcnt.fetch_add(1, memory_order_acquire);
  }
memory order acquire: A load operation. No reads or writes in the current thread can be reordered before this load.
fetch_add is both a load and a store!
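Because `fetch_add` is a read-modify-write, an acquire order only constrains its load half. A widely used convention for reference counts, shown here as a sketch and not necessarily what the talk recommends, is a relaxed increment and an acquire-release decrement. The `Object` layout and the `release` function are assumptions for this example.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical object with an intrusive reference count.
struct Object {
    std::atomic<int> refcnt{1};  // the creator holds the first reference
};

// Taking a new reference needs atomicity only: the caller already holds
// a reference, so nothing has to be ordered around the increment.
void retain(Object* obj) {
    obj->refcnt.fetch_add(1, std::memory_order_relaxed);
}

// Dropping a reference. The release half publishes this thread's writes
// to the object; the acquire half makes every other thread's earlier
// writes visible to the thread that ends up deleting it.
// Returns true when the caller dropped the last reference.
bool release(Object* obj) {
    if (obj->refcnt.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        delete obj;
        return true;
    }
    return false;
}
```

As the checklist in this talk suggests, it is still worth inspecting the generated assembly for the target platform.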

What should we do?

atomic checklist
• Check your code on gcc.godbolt.org
• Get familiar with the assembly of the targeted platform
• Unit test for data races
What seems correct on godbolt may not be the case on your platform :(

How about the performance?

https://godbolt.org/g/9Bb6yt
• No system call
• No thread fence (?)
• Only xchg, mov
• Should be fast, right?

Spin lock has terrible performance!
[Chart: latency]

Why?
• xchg src, dest
  If one of the operands is a memory address, then the operation has an implicit LOCK prefix, that is, the exchange operation is atomic.
• LOCK
  Causes the processor's LOCK# signal to be asserted during execution of the accompanying instruction. In a multiprocessor environment, the LOCK# signal ensures that the processor has exclusive use of any shared memory while the signal is asserted.

Optimize the spin
Avoid asserting LOCK# inside the while loop

https://godbolt.org/g/G0SfcB The red highlight does the job!
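One common way to keep locked instructions out of the spin loop is a test-and-test-and-set lock: spin on a plain load and attempt the exchange only once the lock looks free. The sketch below is an assumption about the general idea, not a copy of the linked code; `lock_word`, `ttas_lock`, `ttas_unlock` and `run_ttas_demo` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic<int> lock_word{0};  // 0 = free, 1 = held
long shared_counter = 0;        // plain variable, protected by lock_word

void ttas_lock() {
    for (;;) {
        // Spin on a plain load; on x86 this is an ordinary mov.
        while (lock_word.load(std::memory_order_relaxed) != 0) {
            // wait until the lock looks free
        }
        // Only now try the exchange (xchg with an implicit LOCK on x86).
        if (lock_word.exchange(1, std::memory_order_acquire) == 0) {
            return;
        }
    }
}

void ttas_unlock() {
    lock_word.store(0, std::memory_order_release);
}

// Hypothetical driver: returns the final counter value.
long run_ttas_demo(int threads, int times) {
    shared_counter = 0;
    std::vector<std::thread> pool;
    for (int i = 0; i < threads; ++i) {
        pool.emplace_back([times] {
            for (int j = 0; j < times; ++j) {
                ttas_lock();
                ++shared_counter;  // critical section
                ttas_unlock();
            }
        });
    }
    for (auto& t : pool) {
        t.join();
    }
    assert(shared_counter == static_cast<long>(threads) * times);
    return shared_counter;
}
```

Whether this is actually faster depends on the platform and on contention, so it should be measured rather than assumed.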

Performance on different platforms may differ.

Can we do better?

• lock.load(memory_order_relaxed)
• *reinterpret_cast<int*>(lock) // C++
  *(int*)lock // C
• *reinterpret_cast<volatile int*>(lock) // C++
  *(volatile int*)lock // C

https://godbolt.org/g/WkyOHY

Infinite loop LOL

https://godbolt.org/g/wR6XZg volatile int works

Turns out...
• On x86, *(volatile int*)lock is the same as lock.load(memory_order_relaxed)
• memory order relaxed, acquire, acq_rel, seq_cst don't matter here
• But only on x86!
• Unless you compile and test, you know nothing!

It’s depressing. I know

• Acquire-release semantics are subtle
• Need to double check the compiled result
• Performance penalty can be huge! (20x slower)
• Usually slower than pthread mutex
• LOCK# synchronizes all memory access.

BUT!!!

Atomic API is more flexible compared to pthread mutex locks

Build new logic with atomic
• Build concurrent data structures/algorithms: boost, Facebook folly, golang, etc.
• Also make use of thread local variables
• Create new semantics that pthread doesn't provide

A mini concurrent queue
• max_size init to 3
• enqueue:
    acquire read lock
    if (size < max_size)
      enqueue object into the queue
      release read lock, return
    else
      release read lock and acquire write lock
      max_size = max_size * 2
      release write lock, acquire read lock
      enqueue object
      release read lock
Not thread safe

A mini concurrent queue
• Problem: when releasing the read lock, the state is not safe
• Two threads increasing the size (ok)
• One thread increasing the size, the other doing the opposite (bad)
• One thread freeing the resource, the other inserting an object (crash!)

Introducing punch card
• check_in (like acquire read lock)
• check_out (like release read lock)
• book_critical
    exclude new check in
    if already booked, failed to book
• enter_critical
    wait until all check_in checked out then enter
• exit_critical

punch card read
• check_in
• check_out

punch card write
• check_in
• if (!book_critical)
    check_out and return
• while (!enter_critical) {}
    // spin until others check out
• // do critical stuff, like switch state
• exit_critical
• check_out

[Diagram: State A, State B, State C. Threads check in and check out within each state; between states there is a temporal state excluding check in, followed by the critical section.]

Way better than pthread rwlock!

https://github.com/dryman/atomic_patterns/blob/master/op_atomic.h
• check_in: pcard += 1 (when >= 0)
• check_out: pcard -= 1
• book_critical: Turn MSB to 1 (when > 0)
    i.e. pcard = INT_MIN + pcard
• enter_critical: Spin until pcard == INT_MIN + 1
    pcard = INT_MIN
• exit_critical: pcard = 1
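The bullet points above can be turned into a short C++ sketch. This is written from the slide's description only and is not the code in op_atomic.h: the `PunchCard` struct, the blocking behaviour of `check_in`, the chosen memory orders, and the helpers `try_write` and `run_punch_card_demo` are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <climits>
#include <thread>
#include <vector>

struct PunchCard {
    std::atomic<int> pcard{0};

    // pcard += 1 (when >= 0). Assumed to spin while a writer has booked.
    void check_in() {
        int v = pcard.load(std::memory_order_relaxed);
        for (;;) {
            if (v < 0) { v = pcard.load(std::memory_order_relaxed); continue; }
            if (pcard.compare_exchange_weak(v, v + 1, std::memory_order_acquire,
                                            std::memory_order_relaxed)) return;
        }
    }

    // pcard -= 1
    void check_out() { pcard.fetch_sub(1, std::memory_order_release); }

    // Turn MSB to 1 (when > 0); fails if somebody else already booked.
    bool book_critical() {
        int v = pcard.load(std::memory_order_relaxed);
        while (v > 0) {
            if (pcard.compare_exchange_weak(v, INT_MIN + v, std::memory_order_acq_rel,
                                            std::memory_order_relaxed)) return true;
        }
        return false;
    }

    // Succeeds once the booking thread is the only one still checked in.
    bool enter_critical() {
        int expected = INT_MIN + 1;
        return pcard.compare_exchange_strong(expected, INT_MIN, std::memory_order_acquire,
                                             std::memory_order_relaxed);
    }

    // pcard = 1: the writer is again an ordinary checked-in thread.
    void exit_critical() { pcard.store(1, std::memory_order_release); }
};

PunchCard card;
long state = 0;               // plain data, changed only in the critical section
std::atomic<long> booked{0};  // number of writes that won the booking

// The "punch card write" sequence from the slides.
bool try_write() {
    card.check_in();
    if (!card.book_critical()) { card.check_out(); return false; }
    while (!card.enter_critical()) { /* spin until others check out */ }
    ++state;  // do critical stuff
    card.exit_critical();
    card.check_out();
    return true;
}

// Hypothetical driver: true if every successful booking changed state once.
bool run_punch_card_demo(int threads, int rounds) {
    state = 0;
    booked.store(0);
    std::vector<std::thread> pool;
    for (int i = 0; i < threads; ++i) {
        pool.emplace_back([rounds] {
            for (int r = 0; r < rounds; ++r) {
                if (try_write()) booked.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& t : pool) t.join();
    return state == booked.load() && card.pcard.load() == 0;
}
```

A thread that loses the booking simply checks out and returns, which is the behaviour a plain read-write lock does not offer: the loser learns that someone else is already switching the state.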

Atomic applications
• Databases
• RDBMS
• NoSQL
• Memory manager, allocator, garbage collector
• Concurrent programming framework/language
• Software transactional memory (STM)
• Golang, Erlang, NodeJS(?)

Beyond atomic
• Thread local variable (C11/C++11)
    static __thread int x;  // GCC extension; C11 spells it _Thread_local, C++11 thread_local
• GCC transactional memory (experimental)
    __transaction_atomic { if (a > b) b++; }
• Hardware transactional memory: -mrtm (Restricted Transactional Memory)
    #include <immintrin.h>
    if ((status = _xbegin ()) == _XBEGIN_STARTED) {
      ... transaction code...
      _xend ();
    } else {
      ... non transactional fallback path...
    }
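As a small illustration of the first item, the sketch below uses the standard C++11 `thread_local` keyword instead of `__thread`. The function name `bump_in_new_thread` is hypothetical.

```cpp
#include <cassert>
#include <thread>

thread_local int x = 0;  // every thread gets its own independent copy

// Increment x inside a new thread and report the value that thread saw.
int bump_in_new_thread(int times) {
    int seen = 0;
    std::thread t([&seen, times] {
        for (int i = 0; i < times; ++i) {
            ++x;  // touches only this thread's copy, no atomics needed
        }
        seen = x;
    });
    t.join();
    return seen;
}
```

The calling thread's `x` is untouched by the increments, which is why thread-local counters are a cheap way to avoid contended atomics.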

References
• CPP Atomic Operations Library
• Preshing on Programming blog posts on atomic
• The Art of Multiprocessor Programming
• C++ Concurrency In Action
• GCC Transactional Memory
• GCC x86 hardware transaction reference
• x86 references: http://x86.renejeschke.de
• Compile code online: http://gcc.godbolt.org
• https://github.com/dryman/atomic_patterns

One more thing

OPIC: Object Persistence In C
• https://github.com/dryman/opic
• Bring data structures to big data
• O(1) deserialization (mmap)
• Target for high throughput (big data), but also low latency applications (this is why I entered atomic programming)
• Version 3 is still under development (branch OPIC-33)

Thank you!
