Non-Blocking Algorithms and Scalable Multicore Programming

Presented at Papers We Love, Too (SF). https://youtu.be/FXuHxVIMIhw?t=15m00s

Apart from being a fantastic primer on designing portable concurrent data structures, this well-sourced paper provides a number of captivating ideas on thinking about, modeling, and solving concurrency and contention problems.

Devon H. O'Dell

May 21, 2015

Transcript

  1. Nonblocking Algorithms and Scalable Multicore Programming by Samy Al Bahra,

    backtrace.io Papers We Love, Too — 21 May 2015 — San Francisco, CA
  2. Who am I? • Devon H. O’Dell, @dhobsd • Software

    Engineer, @Fastly • Performance and debugging nut • Zappa fan
  3. Overview • Introduction • Environment • Cache coherency / Memory

    ordering / Atomics / NUMA • Non-blocking algorithms • Guarantees / Contention / Linearization / SMR • Conclusion
  4. [image-only slide]
  5. Why Do I Love This Paper? • Hard problem •

    Under-represented area • Inspires shift in thinking • Enables further research
  6. Our Environment • Cache coherence protocols • Memory ordering •

    Atomic instructions • NUMA topologies
  7. Environment • Cache coherence protocols • Memory ordering • Atomic

    primitives • NUMA topologies
  8. “Understanding the mechanisms that provide coherency guarantees on multiprocessors is

    a prerequisite to understanding contention on such systems.”
  9. MESIF Cache Coherency Protocol State Machine. Understanding how processors
     optimize memory access is crucial. Understanding how cache traffic propagates
     between processors is crucial. [State-machine diagram by Marek Fiser,
     http://www.texample.net/tikz/examples/mesif/, CC BY 2.5]
  10. “[T]he cache line is the granularity at which coherency is

    maintained on a cache-coherent multiprocessor system. This is also the unit of contention.”
  11. Cache Lines

  12. #define N_THR 8

      static volatile bool done;   /* set when the benchmark should stop */

      static volatile struct {
          uint64_t value;
      } counters[N_THR];

      void *
      thread(void *thridp)
      {
          uint64_t thrid = *(uint64_t *)thridp;

          /* Adjacent counters share cache lines: the per-thread
           * increments contend with one another (false sharing). */
          while (done == false)
              counters[thrid].value++;

          return NULL;
      }
  13. #define N_THR 8

      static volatile bool done;   /* set when the benchmark should stop */

      static volatile struct {
          uint64_t value;
          char pad[64 - sizeof (uint64_t)];   /* pad to a full 64-byte cache line */
      } counters[N_THR];

      void *
      thread(void *thridp)
      {
          uint64_t thrid = *(uint64_t *)thridp;

          /* Each counter now owns its cache line; the increments no longer contend. */
          while (done == false)
              counters[thrid].value++;

          return NULL;
      }
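      [Note, not on the slide: with C11 the same isolation can be expressed
      with an alignment specifier instead of manual padding. A minimal sketch,
      assuming a 64-byte cache line, which holds on most current x86 parts:]

      #define N_THR 8

      static volatile struct {
          _Alignas(64) uint64_t value;   /* struct size and alignment become 64 */
      } counters[N_THR];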
  14. Our Environment • Cache coherence protocols • Memory ordering •

    Atomic instructions • NUMA topologies
  15. Memory Ordering • Total Store Order (TSO) - x86, x86-64,

    SPARC • Partial Store Order (PSO) - SPARC • Relaxed Memory Ordering (RMO) - Power, ARM
  16. volatile int ready = 0;

      void
      produce(void)
      {
          message = message_new();
          message->value = 5;
          message_send(message);
          ready = 1;   /* no barrier: may become visible before the message stores */
      }

      void
      consume(void)
      {
          while (ready == 0)
              ;
          message = message_receive();
          result = operation(&message->value);
      }
  17. Memory Barriers • AKA memory fences • Required for portability

    across architectures • Impacts correctness of concurrent memory access
  18. volatile int ready = 0;

      void
      produce(void)
      {
          message = message_new();
          message->value = 5;
          message_send(message);
          memory_barrier();   /* full barrier: message stores ordered before the flag */
          ready = 1;
      }

      void
      consume(void)
      {
          while (ready == 0)
              ;
          memory_barrier();   /* full barrier: message loads ordered after the flag */
          message = message_receive();
          result = operation(&message->value);
      }
  19. volatile int ready = 0;

      void
      produce(void)
      {
          message = message_new();
          message->value = 5;
          message_send(message);
          store_barrier();   /* store-store: publish the message before the flag */
          ready = 1;
      }

      void
      consume(void)
      {
          while (ready == 0)
              ;
          load_barrier();   /* load-load: read the message only after the flag */
          message = message_receive();
          result = operation(&message->value);
      }
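      [Note, not on the slides: C11 expresses this pairing as release/acquire
      ordering on the flag. A minimal self-contained sketch, with a simplified
      struct message standing in for the slides' message-passing helpers:]

      #include <stdatomic.h>

      struct message { int value; };

      static struct message msg;   /* payload published by produce() */
      static _Atomic int ready;

      void
      produce(void)
      {
          msg.value = 5;
          /* Release store: all prior writes become visible to any thread
           * that later observes ready == 1 with an acquire load. */
          atomic_store_explicit(&ready, 1, memory_order_release);
      }

      int
      consume(void)
      {
          /* Acquire load: once ready == 1 is seen, msg.value = 5 is seen
           * too, on TSO and RMO machines alike. */
          while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
              ;
          return msg.value;
      }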
  20. Our Environment • Cache coherence protocols • Memory ordering •

    Atomic primitives • NUMA topologies
  21. Atomic Primitives • They’re not cheap, even without contention •

    Languages with strong memory guarantees may increase cost • Compilers providing builtins may increase cost
  22. Atomic Primitives • Read-Modify-Write (RMW) • Compare-and-Swap (CAS) •
      Load-Linked / Store-Conditional (LL/SC)
  23. /* "atomic" is pseudocode: the body executes as one indivisible step. */
      atomic bool
      compare_and_swap(uint64_t *val, uint64_t *cmp, uint64_t update)
      {
          if (*val == *cmp) {
              *val = update;
              return true;
          }

          return false;
      }
  24. atomic uint64_t
      load_linked(uint64_t *v)
      {
          return *v;
      }

      atomic bool
      store_conditional(uint64_t *val, uint64_t update)
      {
          /* Succeeds only if *val is unchanged since the paired load_linked. */
          if (*val == *magic_val_at_time_of_load_linked_call) {
              *val = update;
              return true;
          }

          return false;
      }
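      [Note, not on the slides: LL/SC is expressive enough to build CAS. A
      sketch in the slides' pseudocode style, reusing their load_linked and
      store_conditional primitives; store_conditional may fail spuriously,
      so the loop retries:]

      bool
      compare_and_swap(uint64_t *val, uint64_t cmp, uint64_t update)
      {
          do {
              if (load_linked(val) != cmp)
                  return false;
          } while (!store_conditional(val, update));

          return true;
      }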
  25. Our Environment • Cache coherence protocols • Memory ordering •

    Atomic instructions • NUMA topologies
  26. NUMA Topologies • Non-Uniform Memory Access: memory placement matters •

    Accessing remote memory is slow • Locking on remote memory can cause starvation and livelocks
  27. NUMA Topologies • Fair locks help with starvation, but are sensitive to
      preemption • “Cohort locks” are NUMA-friendly
  28. Non-blocking Algorithms • Guarantees • Contention • Linearization • SMR

  29. Non-blocking Algorithms • Guarantees • Contention • Linearization • SMR

  30. Non-blocking Guarantees • Wait-freedom • Lock-freedom • Obstruction-freedom

  31. static uint64_t counter;
      static spinlock_t lock = SPINLOCK_INITIALIZER;

      void
      counter_inc(void)
      {
          uint64_t t;

          spinlock_lock(&lock);
          t = counter;
          t = t + 1;
          counter = t;
          spinlock_unlock(&lock);
      }
  32. Blocking Algorithms [diagram: threads Th0…ThN queue on a lock around a
      single atomic read-modify-write]

  33. static uint64_t counter;

      void
      counter_inc(void)
      {
          uint64_t t, update;

          do {
              t = counter;
              update = t + 1;
          } while (!compare_and_swap(&counter, &t, update));
      }
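      [Note, not on the slides: the same lock-free counter with C11 atomics.
      On failure, atomic_compare_exchange_weak reloads t with the current
      value, so every retry works with fresh state:]

      #include <stdatomic.h>
      #include <stdint.h>

      static _Atomic uint64_t counter;

      void
      counter_inc(void)
      {
          uint64_t t = atomic_load(&counter);

          while (!atomic_compare_exchange_weak(&counter, &t, t + 1))
              ;   /* t now holds the value that beat us; retry */
      }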
  34. Lock-Free Algorithms [diagram: threads Th0…ThN attempt the atomic
      concurrently; some retry while others complete]
  35. static uint64_t counter;

      void
      counter_inc(void)
      {
          atomic_inc(&counter);
      }
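      [Note, not on the slides: atomic_inc is pseudocode; C11 spells it
      atomic_fetch_add. Whether this is truly wait-free depends on the
      target: x86 compiles it to a single LOCK XADD, while LL/SC machines
      may emit a retry loop.]

      #include <stdatomic.h>
      #include <stdint.h>

      static _Atomic uint64_t counter;

      void
      counter_inc(void)
      {
          atomic_fetch_add(&counter, 1);
      }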

  36. Wait-Free Algorithms [diagram: every thread's atomic operation completes;
      none retry]
  37. Non-blocking Algorithms • Guarantees • Contention • Linearization • SMR

  38. “Contention [is] a product of the effective arrival rate of

    requests to a shared resource, [which] leads to a queuing effect that directly impacts the responsiveness of a system.”
  39. Awful Person

  40. Locks Create Queues • Threads become queue elements • Composability

    • Fairness
  41. Locks Create Queues • Completion guarantees • Latency of critical

    section impacts system throughput
  42. Little’s law: L = λW • Applies to stable systems

    • Capacity (L) • Throughput (λ) • Latency (W)
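      [Illustrative arithmetic, not from the slides: if threads arrive at a
      lock at λ = 1,000,000 acquisitions per second and each spends W = 5 µs
      in the lock, then on average L = λW = 5 threads occupy the queue.
      Halving the critical-section latency halves the queue.]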
  43. Uncontended Workload

  44. Contention: Spinlock vs. Lock-Free Stack

  45. Non-blocking Algorithms • Guarantees • Contention • Linearization • SMR

  46. Linearizing • Need to prove correctness • “Linearization points” •

    State space explosion
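      [Note, not on the slides: as a concrete example, in the CAS-based
      counter of slide 33 the linearization point of counter_inc is the
      successful compare_and_swap; to the rest of the system, the increment
      appears to take effect at exactly that instant.]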
  47. “What may seem like memory-model minutiae may actually be a

    violation in the correctness of your program.”
  48. Non-blocking Algorithms • Guarantees • Contention • Linearization • SMR

  49. Safe Memory Reclamation • Decouple liveness from visibility • Unmanaged
      languages need additional help
  50. Safe Memory Reclamation • Reference counts are a common mechanism •
      They are difficult to use without locks • They perform poorly with many
      long-lived, frequently accessed objects
  51. Safe Memory Reclamation • Epoch-Based Reclamation (EBR) • Hazard
      Pointers • Quiescent-State-Based Reclamation (QSBR; a sketch follows
      below) • Proxy Collection
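      [Note, not on the slides: a minimal QSBR sketch. The grace-period wait
      is synchronous here for brevity; production designs, such as those in
      Concurrency Kit, defer reclamation instead of spinning, and N_THR and
      the epoch layout are assumptions made for illustration:]

      #include <stdatomic.h>
      #include <stdlib.h>

      #define N_THR 8

      static _Atomic unsigned long global_epoch = 1;
      static _Atomic unsigned long local_epoch[N_THR];

      /* Readers announce a quiescent state whenever they hold no
       * references to shared objects (e.g. between operations). */
      void
      quiescent(int tid)
      {
          atomic_store(&local_epoch[tid], atomic_load(&global_epoch));
      }

      /* Advance the epoch, then wait until every thread has passed through
       * a quiescent state in the new epoch; after that, no reader can still
       * hold a reference to anything retired before the advance. */
      static void
      synchronize(void)
      {
          unsigned long e = atomic_fetch_add(&global_epoch, 1) + 1;

          for (int i = 0; i < N_THR; i++)
              while (atomic_load(&local_epoch[i]) < e)
                  ;   /* spin; real implementations defer instead */
      }

      void
      retire(void *p)
      {
          synchronize();
          free(p);
      }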
  52. Conclusions • Crucial to understand the environment • Understanding the
      workload allows optimization

  53. Conclusions • Traditional lock-based synchronization techniques create
      queues and make no progress guarantees
  54. References • Implementations at concurrencykit.org • Great book:
      https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html
  55. Conclusions • Special thanks to Inés Sombra • Thanks to Samy Al Bahra,
      Tyler McMullen, Nathan Taylor, Grant Zhang
  56. Thank you! Questions?