Slide 1

Slide 1 text

Atomic and Lock Free Programming Felix Chern

Slide 2

Slide 2 text

Atomic and Sanity Free Programming Felix Chern The most torturing experience I ever had!

Slide 3

Slide 3 text

About Me • Felix Chern • Google: Cloud networking • OpenX: Big data team tech lead • SupplyFrame: Built Hadoop pipeline • http://idryman.org

Slide 4

Slide 4 text

Disclaimer: this talk is not on behalf of Google! Just a personal side project ;)

Slide 5

Slide 5 text

Before we begin

Slide 6

Slide 6 text

This is not a framework enthusiast talk

Slide 7

Slide 7 text

More like an adventure

Slide 8

Slide 8 text

Become a treasure hunter and pass the tests!

Slide 9

Slide 9 text

What is atomic?

Slide 10

Slide 10 text

You might think atomic is… yet another concurrency-friendly data type. Reference count, program counter; how hard can it be?

Slide 11

Slide 11 text

Actually

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

Problems with C11/C++11 atomic API • Memory (re)ordering • Subtle memory model • Compiler bugs • Performance penalties

Slide 14

Slide 14 text

1

Slide 15

Slide 15 text

Compiler optimizes stuff

Slide 16

Slide 16 text

int a; if (b > 0) { a = 1; } else { a = 0; }
Can be optimized as
int a; a = 0; if (b > 0) { a = 1; }

Slide 17

Slide 17 text

What is correct for a sequential program,

Slide 18

Slide 18 text

may be wrong for a concurrent program!

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

T1 (over time): non-atomic store x, non-atomic store y, atomic store a
T2 (over time): atomic load a, non-atomic load x, non-atomic load y
The stores to x and y “happen before” the loads of x and y.

Slide 21

Slide 21 text

T1 (over time): non-atomic store x, non-atomic store y, atomic store a
memory order release: A store operation. No (reads or) writes in the current thread can be reordered after this store.
// a init to 1
x = 1;
y = 2;
a.store(0, memory_order_release);

Slide 22

Slide 22 text

T1 (over time): non-atomic store x, non-atomic store y, atomic store a
memory order release: A store operation. No (reads or) writes in the current thread can be reordered after this store.

Slide 23

Slide 23 text

T2 (over time): atomic load a, non-atomic load x, non-atomic load y
while (a.load(memory_order_acquire)) { /* spin lock */ }
// read x, y
memory order acquire: A load operation. No reads (or writes) in the current thread can be reordered before this load.

Slide 24

Slide 24 text

T2 (over time): atomic load a, non-atomic load x, non-atomic load y
memory order acquire: A load operation. No reads (or writes) in the current thread can be reordered before this load.

Slide 25

Slide 25 text

T1 (over time): non-atomic store x, non-atomic store y, atomic store a
T2 (over time): atomic load a, non-atomic load x, non-atomic load y
The stores to x and y “happen before” the loads of x and y.
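The release/acquire pairing from the last few slides can be put together as one small program. This is a minimal sketch, not code from the talk: x, y, and a follow the slides, while producer, consumer, and run_message_passing are hypothetical names added for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int x = 0;              // plain (non-atomic) data
int y = 0;              // plain (non-atomic) data
std::atomic<int> a{1};  // a init to 1, as on the slide

// T1: plain stores first, then a release store that publishes them.
void producer() {
    x = 1;
    y = 2;
    a.store(0, std::memory_order_release);
}

// T2: spin on an acquire load, then read the plain data.
int consumer() {
    while (a.load(std::memory_order_acquire)) { /* spin */ }
    // The acquire load that saw 0 synchronizes with the release store,
    // so the stores to x and y happen before these loads.
    return x + y;
}

// Hypothetical driver: runs T1 and T2 once, returns what T2 observed.
int run_message_passing() {
    x = 0;
    y = 0;
    a.store(1);
    int seen = 0;
    std::thread t2([&] { seen = consumer(); });
    std::thread t1(producer);
    t1.join();
    t2.join();
    assert(seen == 3);
    return seen;
}
```

If both memory orders were replaced with memory_order_relaxed, the program would contain a data race on x and y, even if it happened to work on x86.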

Slide 26

Slide 26 text

T1 (over time): atomic test and load a, non-atomic store x (critical section), atomic store a
T2 (over time): atomic test and load a, non-atomic store x (critical section), atomic store a

Slide 27

Slide 27 text

T1 (over time): atomic test and load a, non-atomic store x, atomic store a
using namespace std;
atomic_flag a = ATOMIC_FLAG_INIT;
while (a.test_and_set(memory_order_acquire))
    ; // spin lock
x++;
a.clear(memory_order_release);
Alternatively:
• atomic_fetch_and/atomic_fetch_or
• atomic_exchange/atomic_store
• atomic_compare_exchange_weak/strong
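As a runnable version of the spin lock on this slide, here is a minimal sketch with several threads incrementing one shared counter. The names lock_flag, shared_counter, locked_increment, and run_spin_lock_demo are hypothetical and not from the talk.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;
long shared_counter = 0;  // plain variable, protected by lock_flag

void locked_increment(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        // test_and_set returns the previous value: true means someone holds the lock.
        while (lock_flag.test_and_set(std::memory_order_acquire)) { /* spin */ }
        ++shared_counter;                            // critical section
        lock_flag.clear(std::memory_order_release);  // unlock
    }
}

// Hypothetical driver: returns the final counter value.
long run_spin_lock_demo(int threads, int iterations) {
    shared_counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back(locked_increment, iterations);
    }
    for (auto& th : pool) {
        th.join();
    }
    assert(shared_counter == static_cast<long>(threads) * iterations);
    return shared_counter;
}
```

The acquire on test_and_set keeps the increment from moving above the lock, and the release on clear keeps it from moving below the unlock.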

Slide 28

Slide 28 text

Memory model
• memory_order_relaxed: there are no synchronization or ordering constraints, only atomicity is required of this operation
• memory_order_acquire: A load operation. No reads (or writes) in the current thread can be reordered before this load.
• memory_order_release: A store operation. No (reads or) writes in the current thread can be reordered after this store.
• memory_order_acq_rel: A read-modify-write operation. This is both acquire and release.
• memory_order_seq_cst: Any operation with this memory order is both an acquire operation and a release operation, plus a single total order exists in which all threads observe all modifications in the same order
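To make the weakest entry concrete: a statistics counter is a case where only atomicity is needed, so memory_order_relaxed is enough. A minimal sketch, with hypothetical names hits, count_hits, and run_relaxed_counter:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic<long> hits{0};

void count_hits(int n) {
    for (int i = 0; i < n; ++i) {
        // Atomic increment; imposes no ordering on surrounding reads or writes.
        hits.fetch_add(1, std::memory_order_relaxed);
    }
}

// Hypothetical driver: returns the total once every thread has finished.
long run_relaxed_counter(int threads, int n) {
    hits.store(0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back(count_hits, n);
    }
    for (auto& th : pool) {
        th.join();  // join() is what makes the final read see every increment
    }
    long total = hits.load(std::memory_order_relaxed);
    assert(total == static_cast<long>(threads) * n);
    return total;
}
```

No increment is lost, but a relaxed counter must not be used to signal that other data is ready; that needs the release/acquire pair.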

Slide 29

Slide 29 text

2

Slide 30

Slide 30 text

subtle semantics

Slide 31

Slide 31 text

Atomic test and load a, non-atomic store x, atomic store a: ok?
while (a.test_and_set(memory_order_acquire))
    ; // spin lock
x++;
a.clear(memory_order_release);
memory order acquire: A load operation. No reads in the current thread can be reordered before this load. ok?

Slide 32

Slide 32 text

Atomic test and load a, non-atomic store x, atomic store a: ok?
while (a.test_and_set(memory_order_acquire))
    ; // spin lock
x++;
a.clear(memory_order_release);
memory order acquire: A load operation. No reads or writes in the current thread can be reordered before this load. ok?
("or writes" was added in Oct, 2016)

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

https://godbolt.org/g/atB98G Reordering seems fine in compiler implementation.

Slide 35

Slide 35 text

Reference count
// inlinable
void retain(Object* obj) {
    obj->refcnt.fetch_add(1, memory_order_acquire);
}
memory order acquire: A load operation. No reads or writes in the current thread can be reordered before this load.
fetch_add is both a load and a store!
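For comparison, here is a sketch of one commonly used reference-counting convention. It is not the slide's code: it uses a relaxed increment and an acq_rel decrement, and the Object layout, the destroyed hook, release, and run_refcount_demo are all hypothetical names for illustration.

```cpp
#include <atomic>
#include <cassert>

struct Object {
    std::atomic<int> refcnt{1};
    bool* destroyed = nullptr;  // hypothetical hook so the demo can observe deletion
    ~Object() {
        if (destroyed) *destroyed = true;
    }
};

// Taking another reference needs atomicity only: the caller already
// holds a valid reference, so nothing has to be ordered around it.
void retain(Object* obj) {
    obj->refcnt.fetch_add(1, std::memory_order_relaxed);
}

// Dropping a reference: the release half publishes this thread's writes to
// the object, the acquire half lets the deleting thread see everyone's writes.
void release(Object* obj) {
    if (obj->refcnt.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        delete obj;
    }
}

// Hypothetical driver: true if the object outlives the first release
// and is destroyed by the second.
bool run_refcount_demo() {
    bool destroyed = false;
    Object* obj = new Object;
    obj->destroyed = &destroyed;
    retain(obj);   // count: 2
    release(obj);  // count: 1
    bool alive_after_first = !destroyed;
    release(obj);  // count: 0, object deleted
    return alive_after_first && destroyed;
}
```

Whether the relaxed increment is appropriate depends on how references are handed between threads, so treat the memory orders here as an assumption to verify on the target platform.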

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

What should we do?

Slide 38

Slide 38 text

atomic checklist • Check your code on gcc.godbolt.org • Get familiar with the assembly of the targeted platform • Unit test for data race
 What seems correct on godbolt may not be the case on your platform :(

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

3

Slide 41

Slide 41 text

How about the performance?

Slide 42

Slide 42 text

https://godbolt.org/g/9Bb6yt
• No system call
• No thread fence (?)
• Only xchg, mov
• Should be fast, right?

Slide 43

Slide 43 text

Spin lock has terrible performance! Latency

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

Why?
• xchg src, dest
 If one of the operands is a memory address, then the operation has an implicit LOCK prefix, that is, the exchange operation is atomic.
• LOCK
 Causes the processor's LOCK# signal to be asserted during execution of the accompanying instruction. In a multiprocessor environment, the LOCK# signal ensures that the processor has exclusive use of any shared memory while the signal is asserted.

Slide 47

Slide 47 text

4

Slide 48

Slide 48 text

Optimize the spin
Avoid calling LOCK# in the while loop

Slide 49

Slide 49 text

https://godbolt.org/g/G0SfcB The red highlight does the job!
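One way to keep the locked instruction out of the wait loop is the pattern usually called test-and-test-and-set: attempt the exchange once, then wait using plain loads until the lock looks free. The sketch below assumes this is the idea behind the linked example; it is not copied from it, and lock_word, guarded, ttas_lock, ttas_unlock, and run_ttas_demo are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic<int> lock_word{0};  // 0 = free, 1 = held
long guarded = 0;               // plain variable, protected by lock_word

void ttas_lock() {
    for (;;) {
        // The exchange is the expensive part (xchg with a memory operand on x86).
        if (lock_word.exchange(1, std::memory_order_acquire) == 0) {
            return;  // previous value was 0: the lock is now ours
        }
        // Lock is held: wait with ordinary loads, no locked instruction.
        while (lock_word.load(std::memory_order_relaxed) != 0) { /* spin */ }
    }
}

void ttas_unlock() {
    lock_word.store(0, std::memory_order_release);
}

// Hypothetical driver: returns the final value of the guarded counter.
long run_ttas_demo(int threads, int iterations) {
    guarded = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([iterations] {
            for (int i = 0; i < iterations; ++i) {
                ttas_lock();
                ++guarded;
                ttas_unlock();
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    assert(guarded == static_cast<long>(threads) * iterations);
    return guarded;
}
```

The inner loop only reads, so waiting threads do not keep issuing locked operations on the lock word while it is held.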

Slide 50

Slide 50 text

Performance on different platforms may differ.

Slide 51

Slide 51 text

Can we do better?

Slide 52

Slide 52 text

• lock.load(memory_order_relaxed)
• *reinterpret_cast<int*>(lock) // C++
 *(int*)lock // C
• *reinterpret_cast<volatile int*>(lock) // C++
 *(volatile int*)lock // C

Slide 53

Slide 53 text

https://godbolt.org/g/WkyOHY

Slide 54

Slide 54 text

Infinite loop LOL

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

No content

Slide 57

Slide 57 text

https://godbolt.org/g/wR6XZg volatile int works

Slide 58

Slide 58 text

Turns out..
• On x86
• *(volatile int*) lock is the same as
 lock.load(memory_order_relaxed)
• memory order relaxed, acquire, acq_rel, seq_cst don’t matter here
• But only on x86!
• Unless you compile and test, you know nothing!

Slide 59

Slide 59 text

It’s depressing. I know

Slide 60

Slide 60 text

• Acquire-release semantics are subtle
• Need to double-check the compiled result
• Performance penalty can be huge! (20x slower)
• Usually slower than pthread mutex
• LOCK# synchronizes all memory access.

Slide 61

Slide 61 text

No content

Slide 62

Slide 62 text

No content

Slide 63

Slide 63 text

5

Slide 64

Slide 64 text

BUT!!!

Slide 65

Slide 65 text

Atomic API is more flexible compared to pthread mutex locks

Slide 66

Slide 66 text

Build new logic with atomic
• Build concurrent data structures/algorithms
 boost, Facebook folly, golang, etc.
• Also make use of thread local variables
• Create new semantics pthread doesn’t provide

Slide 67

Slide 67 text

No content

Slide 68

Slide 68 text

6

Slide 69

Slide 69 text

A mini concurrent queue (not thread safe)
• max_size init to 3
• enqueue:
 acquire read lock
 if (size < max_size)
   enqueue object into the queue
   release read lock, return
 else
   release read lock and acquire write lock
   max_size = max_size * 2
   release write lock, acquire read lock
   enqueue object
   release read lock

Slide 70

Slide 70 text

A mini concurrent queue
• Problem: when releasing the read lock, the state is not safe
• Two threads increasing the size (ok)
• One thread increasing the size, the other doing the opposite (bad)
• One thread freeing the resource, the other inserting an object (crash!)

Slide 71

Slide 71 text

Introducing punch card
• check_in (like acquire read lock)
• check_out (like release read lock)
• book_critical
 excludes new check-ins
 fails if already booked
• enter_critical
 wait until every other check_in has checked out, then enter
• exit_critical

Slide 72

Slide 72 text

punch card read • check_in • check_out

Slide 73

Slide 73 text

punch card write
• check_in
• if (!book_critical)
 check_out and return
• while (!enter_critical) {}
 // spin until others check out
• // do critical stuff, like switch state
• exit_critical
• check_out

Slide 74

Slide 74 text

State A State B State C check in check out check in check out temporal state excluding check in critical section

Slide 75

Slide 75 text

Way better than pthread rwlock!

Slide 76

Slide 76 text

https://github.com/dryman/atomic_patterns/blob/master/op_atomic.h
• check_in: pcard += 1 (when >= 0)
• check_out: pcard -= 1
• book_critical: Turn MSB to 1 (when > 0)
 i.e. pcard = INT_MIN + pcard
• enter_critical: Spin until pcard == INT_MIN + 1
 pcard = INT_MIN
• exit_critical: pcard = 1

Slide 77

Slide 77 text

7

Slide 78

Slide 78 text

Atomic applications • Databases • RDBMS • NoSQL • Memory manager, allocator, garbage collector • Concurrent programming framework/language • Software transactional memory (STM) • Golang, Erlang, NodeJS(?)

Slide 79

Slide 79 text

8

Slide 80

Slide 80 text

Beyond atomic
• Thread local variable (C11/C++11)
 static __thread int x;
• GCC transactional memory (experimental)
 __transaction_atomic { if (a > b) b++; }
• Hardware transactional memory
 -mrtm (Restricted Transactional Memory)
 #include <immintrin.h>
 if ((status = _xbegin ()) == _XBEGIN_STARTED) {
   ... transaction code...
   _xend ();
 } else {
   ... non transactional fallback path...
 }
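For the thread local bullet, note that __thread is the GCC spelling; the standard keywords are thread_local in C++11 and _Thread_local in C11. A minimal sketch using the C++11 keyword, with hypothetical names tls_counter and bump_in_new_thread:

```cpp
#include <cassert>
#include <thread>

thread_local int tls_counter = 0;  // every thread gets its own copy, starting at 0

// Starts a fresh thread, bumps that thread's copy, and returns its final value.
int bump_in_new_thread(int times) {
    int result = 0;
    std::thread t([&result, times] {
        for (int i = 0; i < times; ++i) {
            ++tls_counter;  // touches only this thread's copy, no atomics needed
        }
        result = tls_counter;
    });
    t.join();
    assert(result == times);
    return result;
}
```

Because each thread owns its copy, no synchronization is needed until the per-thread values have to be combined.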

Slide 81

Slide 81 text

References
• CPP Atomic Operations Library
• Preshing on Programming blog posts on atomics
• The Art of Multiprocessor Programming
• C++ Concurrency In Action
• GCC Transactional Memory
• GCC x86 hardware transactional memory reference
• X86 references: http://x86.renejeschke.de
• Compile code online: http://gcc.godbolt.org
• https://github.com/dryman/atomic_patterns

Slide 82

Slide 82 text

One more thing

Slide 83

Slide 83 text

OPIC
 Object Persistence In C • https://github.com/dryman/opic • Bring data structures to big data • O(1) deserialization (mmap) • Target for high throughput (big data), but also low latency applications (this is why I entered atomic programming) • Version 3 is still under development
 (branch OPIC-33)

Slide 84

Slide 84 text

Thank you!