Non-Blocking Algorithms and Scalable Multicore Programming

Presented at Papers We Love, Too (SF). https://youtu.be/FXuHxVIMIhw?t=15m00s

Apart from being a fantastic primer on designing portable concurrent data structures, this well-sourced paper provides a number of captivating ideas on thinking about, modeling, and solving concurrency and contention problems.

Devon H. O'Dell

May 21, 2015

Transcript

  1. Nonblocking Algorithms and Scalable Multicore Programming by Samy Al Bahra, backtrace.io
     Papers We Love, Too — 21 May 2015 — San Francisco, CA
  2. Who am I?
     • Devon H. O’Dell, @dhobsd
     • Software Engineer, @Fastly
     • Performance and debugging nut
     • Zappa fan
  3. Overview
     • Introduction
     • Environment
       • Cache coherency / Memory ordering / Atomics / NUMA
     • Non-blocking algorithms
       • Guarantees / Contention / Linearization / SMR
     • Conclusion
  4. Why Do I Love This Paper?
     • Hard problem
     • Under-represented area
     • Inspires shift in thinking
     • Enables further research
  5. “Understanding the mechanisms that provide coherency guarantees on multiprocessors is a prerequisite to understanding contention on such systems.”
  6. MESIF Cache Coherency Protocol State Machine
     Understanding how processors optimize memory access is crucial. Understanding how cache traffic propagates between processors is crucial.
     [State machine diagram by Marek Fiser, http://www.texample.net/tikz/examples/mesif/, CC BY 2.5]
  7. “[T]he cache line is the granularity at which coherency is maintained on a cache-coherent multiprocessor system. This is also the unit of contention.”
  8. #define N_THR 8

     /* One counter per thread, but adjacent counters share cache lines, so
        increments from different threads still contend (false sharing).
        exit is a stop flag assumed to be declared elsewhere. */
     static volatile struct {
         uint64_t value;
     } counters[N_THR];

     void *
     thread(void *thridp)
     {
         uint64_t thrid = *(uint64_t *)thridp;

         while (exit == false) {
             counters[thrid].value++;
         }
         return NULL;
     }
  9. #define N_THR 8

     /* Same as before, but each counter is padded out to 64 bytes so it
        occupies its own cache line and threads no longer contend. */
     static volatile struct {
         uint64_t value;
         char pad[64 - sizeof (uint64_t)];
     } counters[N_THR];

     void *
     thread(void *thridp)
     {
         uint64_t thrid = *(uint64_t *)thridp;

         while (exit == false) {
             counters[thrid].value++;
         }
         return NULL;
     }
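
     As an aside not on the slide: C11 can express the same per-line isolation with alignas instead of a manual pad array. This is a minimal sketch and, like the slide, assumes 64-byte cache lines.

         #include <stdalign.h>
         #include <stdint.h>

         #define N_THR 8

         /* alignas(64) forces each array element onto its own 64-byte cache
            line, so sizeof(counters[0]) == 64 just like the hand-padded
            version. */
         static volatile struct {
             alignas(64) uint64_t value;
         } counters[N_THR];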
  10. Memory Ordering
      • Total Store Order (TSO) - x86, x86-64, SPARC
      • Partial Store Order (PSO) - SPARC
      • Relaxed Memory Ordering (RMO) - Power, ARM
  11. /* Producer/consumer flag with no barriers: on weaker memory models the
         stores to the message may be reordered past the store to ready. */
      volatile int ready = 0;

      void
      produce(void)
      {
          message = message_new();
          message->value = 5;
          message_send(message);
          ready = 1;
      }

      void
      consume(void)
      {
          while (ready == 0)
              ;
          message = message_receive();
          result = operation(&message->value);
      }
  12. Memory Barriers
      • AKA memory fences
      • Required for portability across architectures
      • Impact correctness of concurrent memory access
  13. volatile int ready = 0;

      void
      produce(void)
      {
          message = message_new();
          message->value = 5;
          message_send(message);
          memory_barrier();   /* full barrier: message stores complete before the flag store */
          ready = 1;
      }

      void
      consume(void)
      {
          while (ready == 0)
              ;
          memory_barrier();   /* full barrier: do not read the message before seeing the flag */
          message = message_receive();
          result = operation(&message->value);
      }
  14. volatile int ready = 0;

      void
      produce(void)
      {
          message = message_new();
          message->value = 5;
          message_send(message);
          store_barrier();    /* cheaper: only orders prior stores before the flag store */
          ready = 1;
      }

      void
      consume(void)
      {
          while (ready == 0)
              ;
          load_barrier();     /* only orders the flag load before the message loads */
          message = message_receive();
          result = operation(&message->value);
      }
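
      As an aside not on the slides, the same producer/consumer pairing can be written with C11 <stdatomic.h>, where release/acquire ordering on the flag takes the place of the explicit barriers. This is a minimal sketch; message, result, and the message_* helpers are the slide's hypothetical ones.

          #include <stdatomic.h>

          static _Atomic int ready = 0;

          void
          produce(void)
          {
              message = message_new();
              message->value = 5;
              message_send(message);
              /* release store: earlier stores are visible before ready reads as 1 */
              atomic_store_explicit(&ready, 1, memory_order_release);
          }

          void
          consume(void)
          {
              /* acquire load: later loads cannot be reordered before the flag read */
              while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
                  ;
              message = message_receive();
              result = operation(&message->value);
          }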
  15. Atomic Primitives
      • They’re not cheap, even without contention
      • Languages with strong memory guarantees may increase cost
      • Compilers providing builtins may increase cost
  16. atomic bool
      compare_and_swap(uint64_t *val, uint64_t *cmp, uint64_t update)
      {
          /* pseudocode: "atomic" means the hardware executes the whole body
             as one indivisible step */
          if (*val == *cmp) {
              *val = update;
              return true;
          }
          return false;
      }
  17. atomic uint64_t
      load_linked(uint64_t *v)
      {
          return *v;
      }

      atomic bool
      store_conditional(uint64_t *val, uint64_t update)
      {
          /* pseudocode: succeeds only if *val has not been written since the
             matching load_linked */
          if (*val == *magic_val_at_time_of_load_linked_call) {
              *val = update;
              return true;
          }
          return false;
      }
  18. NUMA Topologies
      • Non-Uniform Memory Access: memory placement matters
      • Accessing remote memory is slow
      • Locking on remote memory can cause starvation and livelocks
  19. NUMA Topologies
      • Fair locks help with starvation, but are sensitive to preemption
      • “Cohort locks” are NUMA-friendly
  20. static uint64_t counter;
      static spinlock_t lock = SPINLOCK_INITIALIZER;

      void
      counter_inc(void)
      {
          uint64_t t;

          spinlock_lock(&lock);
          t = counter;
          t = t + 1;
          counter = t;
          spinlock_unlock(&lock);
      }
  21. static uint64_t counter;

      void
      counter_inc(void)
      {
          uint64_t t, update;

          do {
              t = counter;
              update = t + 1;
              /* retry until no other thread updated counter between the
                 read and the CAS */
          } while (!compare_and_swap(&counter, &t, update));
      }
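
      As an aside not on the slides, the same lock-free counter can be written with C11 <stdatomic.h>, using atomic_compare_exchange_weak (or, more simply, atomic_fetch_add). A minimal sketch:

          #include <stdatomic.h>
          #include <stdint.h>

          static _Atomic uint64_t counter;

          void
          counter_inc(void)
          {
              uint64_t t = atomic_load(&counter);

              /* the weak CAS may fail spuriously (e.g. on LL/SC machines); on
                 failure it reloads the current value into t, so we just retry */
              while (!atomic_compare_exchange_weak(&counter, &t, t + 1))
                  ;
          }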
  22. Lock-Free Algorithms
      [Diagram: threads Th0 … ThN race on an atomic operation; losing threads retry while at least one thread always completes]
  23. “Contention [is] a product of the effective arrival rate of requests to a shared resource, [which] leads to a queuing effect that directly impacts the responsiveness of a system.”
  24. Little’s law: L = λW
      • Applies to stable systems
      • Capacity (L)
      • Throughput (λ)
      • Latency (W)
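
      A quick worked example (numbers illustrative, not from the talk): if requests arrive at a shared resource at λ = 10,000 per second and each holds it for W = 0.5 ms, then on average L = λW = 10,000 × 0.0005 s = 5 requests are queued at or using the resource; halving the hold time halves the queue depth at the same throughput.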
  25. “What may seem like memory-model minutiae may actually be a violation in the correctness of your program.”
  26. Safe Memory Reclamation
      • Refcounts are a common mechanism
      • Difficult to use without locks
      • Perform poorly with many long-lived, frequently accessed objects
  27. Safe Memory Reclamation
      • Epoch-Based Reclamation (EBR)
      • Hazard Pointers
      • Quiescent State-Based Reclamation (QSBR)
      • Proxy Collection
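
      As an aside not from the slides, a minimal C11 sketch of the hazard-pointer idea may help make safe memory reclamation concrete. Names such as hazard_acquire and retire are illustrative, and production schemes batch retired nodes on a list and rescan later rather than spinning.

          #include <stdatomic.h>
          #include <stdlib.h>

          #define MAX_THREADS 8

          /* one published hazard pointer per thread */
          static _Atomic(void *) hazard[MAX_THREADS];

          /* Reader: publish the pointer about to be dereferenced, then verify
             the shared location still holds it; otherwise retry. */
          void *
          hazard_acquire(int tid, _Atomic(void *) *src)
          {
              void *p;

              do {
                  p = atomic_load(src);
                  atomic_store(&hazard[tid], p);
              } while (atomic_load(src) != p);
              return p;
          }

          void
          hazard_release(int tid)
          {
              atomic_store(&hazard[tid], NULL);
          }

          /* Writer: after unlinking a node, free it only once no thread still
             advertises it in a hazard slot. */
          void
          retire(void *node)
          {
              for (;;) {
                  int busy = 0;

                  for (int i = 0; i < MAX_THREADS; i++) {
                      if (atomic_load(&hazard[i]) == node)
                          busy = 1;
                  }
                  if (!busy)
                      break;
              }
              free(node);
          }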
  28. Conclusions
      • Special thanks to Inés Sombra
      • Thanks to Samy Al Bahra, Tyler McMullen, Nathan Taylor, Grant Zhang