Non-Blocking Algorithms and Scalable Multicore Programming

Presented at Papers We Love, Too (SF). https://youtu.be/FXuHxVIMIhw?t=15m00s

Apart from being a fantastic primer on designing portable concurrent data structures, this well-sourced paper provides a number of captivating ideas on thinking about, modeling, and solving concurrency and contention problems.

Devon H. O'Dell

May 21, 2015

Transcript

  1. Nonblocking Algorithms and Scalable Multicore Programming by Samy Al Bahra,

    backtrace.io Papers We Love, Too — 21 May 2015 — San Francisco, CA
  2. Who am I? • Devon H. O’Dell, @dhobsd • Software

    Engineer, @Fastly • Performance and debugging nut • Zappa fan
  3. Overview • Introduction • Environment • Cache coherency / Memory

    ordering / Atomics / NUMA • Non-blocking algorithms • Guarantees / Contention / Linearization / SMR • Conclusion
  4. [image-only slide]
  5. Why Do I Love This Paper? • Hard problem •

    Under-represented area • Inspires shift in thinking • Enables further research
  6. Our Environment • Cache coherence protocols • Memory ordering •

    Atomic instructions • NUMA topologies
  7. Environment • Cache coherence protocols • Memory ordering • Atomic

    primitives • NUMA topologies
  8. “Understanding the mechanisms that provide coherency guarantees on multiprocessors is

    a prerequisite to understanding contention on such systems.”
  9. MESIF Cache Coherency Protocol State Machine. Understanding how processors
     optimize memory access is crucial. Understanding how cache traffic propagates
     between processors is crucial. [State-machine diagram by Marek Fiser,
     http://www.texample.net/tikz/examples/mesif/, CC BY 2.5]
  10. “[T]he cache line is the granularity at which coherency is

    maintained on a cache-coherent multiprocessor system. This is also the unit of contention.”
  11. Cache Lines

  12. #define N_THR 8

      static volatile bool done;   /* set when the benchmark should stop */

      static volatile struct {
          uint64_t value;
      } counters[N_THR];

      void *
      thread(void *thridp)
      {
          uint64_t thrid = *(uint64_t *)thridp;

          /* Adjacent counters share cache lines: the per-thread
           * increments contend with one another (false sharing). */
          while (done == false)
              counters[thrid].value++;

          return NULL;
      }
  13. #define N_THR 8

      static volatile bool done;   /* set when the benchmark should stop */

      static volatile struct {
          uint64_t value;
          char pad[64 - sizeof (uint64_t)];   /* pad to a full 64-byte cache line */
      } counters[N_THR];

      void *
      thread(void *thridp)
      {
          uint64_t thrid = *(uint64_t *)thridp;

          /* Each counter now owns its cache line; the increments no longer contend. */
          while (done == false)
              counters[thrid].value++;

          return NULL;
      }
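      [Note, not on the slide: with C11 the same isolation can be expressed
      with an alignment specifier instead of manual padding. A minimal sketch,
      assuming a 64-byte cache line, which holds on most current x86 parts:]

      #define N_THR 8

      static volatile struct {
          _Alignas(64) uint64_t value;   /* struct size and alignment become 64 */
      } counters[N_THR];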
  14. Our Environment • Cache coherence protocols • Memory ordering •

    Atomic instructions • NUMA topologies
  15. Memory Ordering • Total Store Order (TSO) - x86, x86-64,

    SPARC • Partial Store Order (PSO) - SPARC • Relaxed Memory Ordering (RMO) - Power, ARM
  16. volatile int ready = 0;

      void
      produce(void)
      {
          message = message_new();
          message->value = 5;
          message_send(message);
          ready = 1;   /* no barrier: may become visible before the message stores */
      }

      void
      consume(void)
      {
          while (ready == 0)
              ;
          message = message_receive();
          result = operation(&message->value);
      }
  17. Memory Barriers • AKA memory fences • Required for portability

    across architectures • Impacts correctness of concurrent memory access
  18. volatile int ready = 0;

      void
      produce(void)
      {
          message = message_new();
          message->value = 5;
          message_send(message);
          memory_barrier();   /* full barrier: message stores ordered before the flag */
          ready = 1;
      }

      void
      consume(void)
      {
          while (ready == 0)
              ;
          memory_barrier();   /* full barrier: message loads ordered after the flag */
          message = message_receive();
          result = operation(&message->value);
      }
  19. volatile int ready = 0;

      void
      produce(void)
      {
          message = message_new();
          message->value = 5;
          message_send(message);
          store_barrier();   /* store-store: publish the message before the flag */
          ready = 1;
      }

      void
      consume(void)
      {
          while (ready == 0)
              ;
          load_barrier();   /* load-load: read the message only after the flag */
          message = message_receive();
          result = operation(&message->value);
      }
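      [Note, not on the slides: C11 expresses this pairing as release/acquire
      ordering on the flag. A minimal self-contained sketch, with a simplified
      struct message standing in for the slides' message-passing helpers:]

      #include <stdatomic.h>

      struct message { int value; };

      static struct message msg;   /* payload published by produce() */
      static _Atomic int ready;

      void
      produce(void)
      {
          msg.value = 5;
          /* Release store: all prior writes become visible to any thread
           * that later observes ready == 1 with an acquire load. */
          atomic_store_explicit(&ready, 1, memory_order_release);
      }

      int
      consume(void)
      {
          /* Acquire load: once ready == 1 is seen, msg.value = 5 is seen
           * too, on TSO and RMO machines alike. */
          while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
              ;
          return msg.value;
      }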
  20. Our Environment • Cache coherence protocols • Memory ordering •

    Atomic primitives • NUMA topologies
  21. Atomic Primitives • They’re not cheap, even without contention •

    Languages with strong memory guarantees may increase cost • Compilers providing builtins may increase cost
  22. Atomic Primitives • Read-Modify-Write (RMW) • Compare-and-Swap (CAS) •
      Load-Linked / Store-Conditional (LL/SC)
  23. /* "atomic" is pseudocode: the body executes as one indivisible step. */
      atomic bool
      compare_and_swap(uint64_t *val, uint64_t *cmp, uint64_t update)
      {
          if (*val == *cmp) {
              *val = update;
              return true;
          }

          return false;
      }
  24. atomic uint64_t
      load_linked(uint64_t *v)
      {
          return *v;
      }

      atomic bool
      store_conditional(uint64_t *val, uint64_t update)
      {
          /* Succeeds only if *val is unchanged since the paired load_linked. */
          if (*val == *magic_val_at_time_of_load_linked_call) {
              *val = update;
              return true;
          }

          return false;
      }
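      [Note, not on the slides: LL/SC is expressive enough to build CAS. A
      sketch in the slides' pseudocode style, reusing their load_linked and
      store_conditional primitives; store_conditional may fail spuriously,
      so the loop retries:]

      bool
      compare_and_swap(uint64_t *val, uint64_t cmp, uint64_t update)
      {
          do {
              if (load_linked(val) != cmp)
                  return false;
          } while (!store_conditional(val, update));

          return true;
      }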
  25. Our Environment • Cache coherence protocols • Memory ordering •

    Atomic instructions • NUMA topologies
  26. NUMA Topologies • Non-Uniform Memory Access: memory placement matters •

    Accessing remote memory is slow • Locking on remote memory can cause starvation and livelocks
  27. NUMA Topologies • Fair locks help with starvation, but are sensitive to
      preemption • “Cohort locks” are NUMA-friendly
  28. Non-blocking Algorithms • Guarantees • Contention • Linearization • SMR

  29. Non-blocking Algorithms • Guarantees • Contention • Linearization • SMR

  30. Non-blocking Guarantees • Wait-freedom • Lock-freedom • Obstruction-freedom

  31. static uint64_t counter;
      static spinlock_t lock = SPINLOCK_INITIALIZER;

      void
      counter_inc(void)
      {
          uint64_t t;

          spinlock_lock(&lock);
          t = counter;
          t = t + 1;
          counter = t;
          spinlock_unlock(&lock);
      }
  32. Blocking Algorithms [diagram: threads Th0…ThN queue on a lock around a
      single atomic read-modify-write]

  33. static uint64_t counter;

      void
      counter_inc(void)
      {
          uint64_t t, update;

          do {
              t = counter;
              update = t + 1;
          } while (!compare_and_swap(&counter, &t, update));
      }
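      [Note, not on the slides: the same lock-free counter with C11 atomics.
      On failure, atomic_compare_exchange_weak reloads t with the current
      value, so every retry works with fresh state:]

      #include <stdatomic.h>
      #include <stdint.h>

      static _Atomic uint64_t counter;

      void
      counter_inc(void)
      {
          uint64_t t = atomic_load(&counter);

          while (!atomic_compare_exchange_weak(&counter, &t, t + 1))
              ;   /* t now holds the value that beat us; retry */
      }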
  34. Lock-Free Algorithms [diagram: threads Th0…ThN attempt the atomic
      concurrently; some retry while others complete]
  35. static uint64_t counter;

      void
      counter_inc(void)
      {
          atomic_inc(&counter);
      }
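      [Note, not on the slides: atomic_inc is pseudocode; C11 spells it
      atomic_fetch_add. Whether this is truly wait-free depends on the
      target: x86 compiles it to a single LOCK XADD, while LL/SC machines
      may emit a retry loop.]

      #include <stdatomic.h>
      #include <stdint.h>

      static _Atomic uint64_t counter;

      void
      counter_inc(void)
      {
          atomic_fetch_add(&counter, 1);
      }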

  36. Wait-Free Algorithms [diagram: every thread's atomic operation completes;
      none retry]
  37. Non-blocking Algorithms • Guarantees • Contention • Linearization • SMR

  38. “Contention [is] a product of the effective arrival rate of

    requests to a shared resource, [which] leads to a queuing effect that directly impacts the responsiveness of a system.”
  39. Awful Person

  40. Locks Create Queues • Threads become queue elements • Composability

    • Fairness
  41. Locks Create Queues • Completion guarantees • Latency of critical

    section impacts system throughput
  42. Little’s law: L = λW • Applies to stable systems

    • Capacity (L) • Throughput (λ) • Latency (W)
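      [Illustrative arithmetic, not from the slides: if threads arrive at a
      lock at λ = 1,000,000 acquisitions per second and each spends W = 5 µs
      in the lock, then on average L = λW = 5 threads occupy the queue.
      Halving the critical-section latency halves the queue.]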
  43. Uncontended Workload

  44. Contention: Spinlock vs. Lock-Free Stack

  45. Non-blocking Algorithms • Guarantees • Contention • Linearization • SMR

  46. Linearizing • Need to prove correctness • “Linearization points” •

    State space explosion
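      [Note, not on the slides: as a concrete example, in the CAS-based
      counter of slide 33 the linearization point of counter_inc is the
      successful compare_and_swap; to the rest of the system, the increment
      appears to take effect at exactly that instant.]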
  47. “What may seem like memory-model minutiae may actually be a

    violation in the correctness of your program.”
  48. Non-blocking Algorithms • Guarantees • Contention • Linearization • SMR

  49. Safe Memory Reclamation • Decouple liveness from visibility • Unmanaged
      languages need additional help
  50. Safe Memory Reclamation • Reference counts are a common mechanism •
      They are difficult to use without locks • They perform poorly with many
      long-lived, frequently accessed objects
  51. Safe Memory Reclamation • Epoch-Based Reclamation (EBR) • Hazard
      Pointers • Quiescent-State-Based Reclamation (QSBR; a sketch follows
      below) • Proxy Collection
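      [Note, not on the slides: a minimal QSBR sketch. The grace-period wait
      is synchronous here for brevity; production designs, such as those in
      Concurrency Kit, defer reclamation instead of spinning, and N_THR and
      the epoch layout are assumptions made for illustration:]

      #include <stdatomic.h>
      #include <stdlib.h>

      #define N_THR 8

      static _Atomic unsigned long global_epoch = 1;
      static _Atomic unsigned long local_epoch[N_THR];

      /* Readers announce a quiescent state whenever they hold no
       * references to shared objects (e.g. between operations). */
      void
      quiescent(int tid)
      {
          atomic_store(&local_epoch[tid], atomic_load(&global_epoch));
      }

      /* Advance the epoch, then wait until every thread has passed through
       * a quiescent state in the new epoch; after that, no reader can still
       * hold a reference to anything retired before the advance. */
      static void
      synchronize(void)
      {
          unsigned long e = atomic_fetch_add(&global_epoch, 1) + 1;

          for (int i = 0; i < N_THR; i++)
              while (atomic_load(&local_epoch[i]) < e)
                  ;   /* spin; real implementations defer instead */
      }

      void
      retire(void *p)
      {
          synchronize();
          free(p);
      }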
  52. Conclusions • Crucial to understand the environment • Understanding the
      workload allows optimization

  53. Conclusions • Traditional lock-based synchronization techniques create
      queues and make no progress guarantees
  54. References • Implementations at concurrencykit.org • Great book:
      https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html
  55. Conclusions • Special thanks to Inés Sombra • Thanks to Samy Al Bahra,
      Tyler McMullen, Nathan Taylor, Grant Zhang
  56. Thank you! Questions?