Slide 1

Nonblocking Algorithms and Scalable Multicore Programming by Samy Al Bahra, backtrace.io Papers We Love, Too — 21 May 2015 — San Francisco, CA

Slide 2

Who am I?
• Devon H. O’Dell, @dhobsd
• Software Engineer, @Fastly
• Performance and debugging nut
• Zappa fan

Slide 3

Overview
• Introduction
• Environment
  • Cache coherency / Memory ordering / Atomics / NUMA
• Non-blocking algorithms
  • Guarantees / Contention / Linearization / SMR
• Conclusion

Slide 5

Why Do I Love This Paper?
• Hard problem
• Under-represented area
• Inspires shift in thinking
• Enables further research

Slide 6

Our Environment
• Cache coherence protocols
• Memory ordering
• Atomic instructions
• NUMA topologies

Slide 7

Environment
• Cache coherence protocols
• Memory ordering
• Atomic primitives
• NUMA topologies

Slide 8

“Understanding the mechanisms that provide coherency guarantees on multiprocessors is a prerequisite to understanding contention on such systems.”

Slide 9

MESIF Cache Coherency Protocol State Machine
Understanding how processors optimize memory access is crucial. Understanding how cache traffic propagates between processors is crucial.
Diagram: Marek Fiser, http://www.texample.net/tikz/examples/mesif/ (CC BY 2.5)

Slide 10

“[T]he cache line is the granularity at which coherency is maintained on a cache-coherent multiprocessor system. This is also the unit of contention.”

Slide 11

Cache Lines

Slide 12

#define N_THR 8

static volatile struct {
	uint64_t value;
} counters[N_THR];

static volatile bool done;	/* set by the main thread to stop workers */

void *
thread(void *thridp)
{
	uint64_t thrid = *(uint64_t *)thridp;

	while (done == false) {
		counters[thrid].value++;
	}

	return NULL;
}

Slide 13

#define N_THR 8

static volatile struct {
	uint64_t value;
	char pad[64 - sizeof (uint64_t)];	/* pad to a full 64-byte cache line */
} counters[N_THR];

static volatile bool done;	/* set by the main thread to stop workers */

void *
thread(void *thridp)
{
	uint64_t thrid = *(uint64_t *)thridp;

	while (done == false) {
		counters[thrid].value++;
	}

	return NULL;
}
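An alternative sketch of the same padding idea, assuming a C11 compiler and a 64-byte cache line: `_Alignas` makes the compiler insert the padding rather than computing it by hand.

```c
#include <stdalign.h>
#include <stdint.h>

#define N_THR 8

/*
 * Each counter is aligned to its own 64-byte cache line, so the
 * struct is padded out to 64 bytes and increments by different
 * threads never contend for the same line. The 64-byte line size
 * is an assumption (true of current x86 parts).
 */
static volatile struct padded_counter {
	alignas(64) uint64_t value;
} counters[N_THR];
```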

Slide 14

Our Environment
• Cache coherence protocols
• Memory ordering
• Atomic instructions
• NUMA topologies

Slide 15

Memory Ordering
• Total Store Order (TSO) - x86, x86-64, SPARC
• Partial Store Order (PSO) - SPARC
• Relaxed Memory Ordering (RMO) - Power, ARM

Slide 16

volatile int ready = 0;

void
produce(void)
{
	message = message_new();
	message->value = 5;
	message_send(message);
	ready = 1;
}

void
consume(void)
{
	while (ready == 0)
		;
	message = message_receive();
	result = operation(&message->value);
}

Slide 17

Memory Barriers
• AKA memory fences
• Required for portability across architectures
• Impacts correctness of concurrent memory access

Slide 18

volatile int ready = 0;

void
produce(void)
{
	message = message_new();
	message->value = 5;
	message_send(message);
	memory_barrier();
	ready = 1;
}

void
consume(void)
{
	while (ready == 0)
		;
	memory_barrier();
	message = message_receive();
	result = operation(&message->value);
}

Slide 19

volatile int ready = 0;

void
produce(void)
{
	message = message_new();
	message->value = 5;
	message_send(message);
	store_barrier();
	ready = 1;
}

void
consume(void)
{
	while (ready == 0)
		;
	load_barrier();
	message = message_receive();
	result = operation(&message->value);
}
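Under C11 atomics the explicit barriers collapse into memory orders on the flag itself. A minimal sketch, assuming an in-place `msg` object in place of the slide's hypothetical message_send/message_receive API:

```c
#include <stdatomic.h>

struct message {
	int value;
};

static struct message msg;	/* stands in for the message channel */
static atomic_int ready;

void
produce(void)
{
	msg.value = 5;
	/* Release: the store to msg.value is visible before ready reads 1. */
	atomic_store_explicit(&ready, 1, memory_order_release);
}

int
consume(void)
{
	/* Acquire: once ready == 1 is observed, msg.value == 5 is visible. */
	while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
		;
	return msg.value;
}
```

On TSO machines the acquire/release orders compile to plain loads and stores; on RMO machines they emit the needed fences, which is the portability point the slide is making.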

Slide 20

Our Environment
• Cache coherence protocols
• Memory ordering
• Atomic primitives
• NUMA topologies

Slide 21

Atomic Primitives
• They’re not cheap, even without contention
• Languages with strong memory guarantees may increase cost
• Compilers providing builtins may increase cost

Slide 22

Atomic Primitives
• Read-Modify-Write (RMW)
• Compare-and-Swap (CAS)
• Load-Linked / Store-Conditional (LL/SC)

Slide 23

atomic bool
compare_and_swap(uint64_t *val, uint64_t *cmp, uint64_t update)
{
	if (*val == *cmp) {
		*val = update;
		return true;
	}

	return false;
}
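C11 exposes this primitive directly; a minimal sketch wrapping `atomic_compare_exchange_strong`, which, like the pointer-to-expected form above, also writes the value it actually observed back into `*cmp` on failure:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static bool
compare_and_swap(_Atomic uint64_t *val, uint64_t *cmp, uint64_t update)
{
	/*
	 * Atomically: if (*val == *cmp) { *val = update; return true; }
	 * On failure, *cmp is overwritten with the value actually seen,
	 * which is exactly what a CAS retry loop wants to reload.
	 */
	return atomic_compare_exchange_strong(val, cmp, update);
}
```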

Slide 24

atomic uint64_t
load_linked(uint64_t *v)
{
	return *v;
}

atomic bool
store_conditional(uint64_t *val, uint64_t update)
{
	if (*val == *magic_val_at_time_of_load_linked_call) {
		*val = update;
		return true;
	}

	return false;
}

Slide 25

Our Environment
• Cache coherence protocols
• Memory ordering
• Atomic instructions
• NUMA topologies

Slide 26

NUMA Topologies
• Non-Uniform Memory Access: memory placement matters
• Accessing remote memory is slow
• Locking on remote memory can cause starvation and livelocks

Slide 27

NUMA Topologies
• Fair locks help with starvation, but are sensitive to preemption
• “Cohort locks” are NUMA-friendly

Slide 28

Non-blocking Algorithms
• Guarantees
• Contention
• Linearization
• SMR

Slide 29

Non-blocking Algorithms
• Guarantees
• Contention
• Linearization
• SMR

Slide 30

Non-blocking Guarantees
• Wait-freedom
• Lock-freedom
• Obstruction-freedom

Slide 31

static uint64_t counter;
static spinlock_t lock = SPINLOCK_INITIALIZER;

void
counter_inc(void)
{
	uint64_t t;

	spinlock_lock(&lock);
	t = counter;
	t = t + 1;
	counter = t;
	spinlock_unlock(&lock);
}

Slide 32

Blocking Algorithms
[Diagram: threads Th0 … ThN queue behind a single lock; only the holder performs its read-modify-write while the rest wait.]

Slide 33

static uint64_t counter;

void
counter_inc(void)
{
	uint64_t t, update;

	do {
		t = counter;
		update = t + 1;
	} while (!compare_and_swap(&counter, &t, update));
}

Slide 34

Lock-Free Algorithms
[Diagram: threads Th0 … ThN each attempt an atomic operation; under contention some retry, but at least one thread always completes.]

Slide 35

static uint64_t counter;

void
counter_inc(void)
{
	atomic_inc(&counter);
}
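A runnable sketch of the same idea using C11 atomics and POSIX threads; `atomic_fetch_add` stands in for the slide's `atomic_inc`, and every thread finishes each increment in a bounded number of steps regardless of contention:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

#define N_THR 4
#define N_INC 100000

static _Atomic uint64_t counter;

static void
counter_inc(void)
{
	/* A single wait-free read-modify-write; no retry loop. */
	atomic_fetch_add(&counter, 1);
}

static void *
thread(void *unused)
{
	(void)unused;
	for (int i = 0; i < N_INC; i++)
		counter_inc();
	return NULL;
}

uint64_t
run(void)
{
	pthread_t thr[N_THR];

	for (int i = 0; i < N_THR; i++)
		pthread_create(&thr[i], NULL, thread, NULL);
	for (int i = 0; i < N_THR; i++)
		pthread_join(thr[i], NULL);

	return atomic_load(&counter);
}
```

No increment is ever lost: the final count is exactly N_THR × N_INC, with neither a lock nor a CAS retry loop.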

Slide 36

Wait-Free Algorithms
[Diagram: threads Th0 … ThN each complete their atomic operation in a bounded number of steps, with no retries.]

Slide 37

Non-blocking Algorithms
• Guarantees
• Contention
• Linearization
• SMR

Slide 38

“Contention [is] a product of the effective arrival rate of requests to a shared resource, [which] leads to a queuing effect that directly impacts the responsiveness of a system.”

Slide 39

Awful Person

Slide 40

Locks Create Queues
• Threads become queue elements
• Composability
• Fairness

Slide 41

Locks Create Queues
• Completion guarantees
• Latency of critical section impacts system throughput

Slide 42

Little’s law: L = λW
• Applies to stable systems
• Capacity (L)
• Throughput (λ)
• Latency (W)
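A worked instance (the numbers are illustrative, not from the talk): a critical section held or waited on for W = 2 ms, requested at λ = 1000 acquisitions/second, keeps L = λW = 2 threads at the lock on average; halving critical-section latency halves the queue.

```c
/* Little's law for a stable system: L = lambda * W. */
double
littles_law(double lambda, double w)
{
	return lambda * w;
}
```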

Slide 43

Uncontended Workload

Slide 44

Contention: Spinlock vs. Lock-Free Stack

Slide 45

Non-blocking Algorithms
• Guarantees
• Contention
• Linearization
• SMR

Slide 46

Linearizing
• Need to prove correctness
• “Linearization points”
• State space explosion

Slide 47

“What may seem like memory-model minutiae may actually be a violation in the correctness of your program.”

Slide 48

Non-blocking Algorithms
• Guarantees
• Contention
• Linearization
• SMR

Slide 49

Safe Memory Reclamation
• Decouple liveness from visibility
• Unmanaged languages need additional help

Slide 50

Safe Memory Reclamation
• Reference counts are a common mechanism
• Difficult to use without locks
• Perform poorly with many long-lived, frequently accessed objects
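A sketch of the standard lock-free refcount release pattern (`object_destroy` is a hypothetical destructor, not from the talk): the thread that drops the count to zero frees the object. The difficulty the slide points at is that a reader must already hold a reference before it can safely take another, which is hard to arrange without a lock around the lookup.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct object {
	_Atomic unsigned refcnt;
};

static void
object_destroy(struct object *o)	/* hypothetical destructor */
{
	(void)o;			/* free resources here */
}

/* Returns true if this call released the last reference. */
bool
object_release(struct object *o)
{
	/*
	 * fetch_sub returns the previous value, so exactly one thread,
	 * the one that moves the count from 1 to 0, destroys the object.
	 */
	if (atomic_fetch_sub(&o->refcnt, 1) == 1) {
		object_destroy(o);
		return true;
	}
	return false;
}
```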

Slide 51

Safe Memory Reclamation
• Epoch-Based Reclamation (EBR)
• Hazard Pointers
• Quiescent State-Based Reclamation (QSBR)
• Proxy Collection

Slide 52

Conclusions
• Crucial to understand the environment
• Understanding the workload allows optimization

Slide 53

Conclusions
• Traditional lock-based synchronization techniques create queues and make no progress guarantees

Slide 54

References
• Implementations at concurrencykit.org
• Great book: https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html

Slide 55

Conclusions
• Special thanks to Inés Sombra
• Thanks to Samy Al Bahra, Tyler McMullen, Nathan Taylor, Grant Zhang

Slide 56

Thank you! Questions?