
Samy Al Bahra on Making Lockless Synchronization Fast


Multicore systems are ubiquitous, but modern concurrent programming techniques still do not see widespread adoption. Most concurrent software developed in low-level languages still relies on error-prone and unscalable memory management techniques for correctness, despite the introduction of superior methods over 30 years ago. Safe memory reclamation allows for performant and robust memory management that is also suitable for advanced concurrent programming techniques such as non-blocking synchronization. If properly used, safe memory reclamation techniques improve performance and simplicity without the complexity of full-blown garbage collection.

This paper provides a terrific overview of common safe memory reclamation mechanisms and then explores their performance implications. In this talk, I will do the same but with stronger emphasis on the introductory aspects of safe memory reclamation and contrast with a refreshed performance analysis.

Papers_We_Love

May 26, 2015


Transcript

  1. PWLNY SAMY AL BAHRA

    Co-founder of Backtrace. Building high performance debugging platform for native applications. http://backtrace.io @0xCD03

    Founder of Concurrency Kit. A concurrent memory model for C99 and arsenal of tools for high performance synchronization. http://concurrencykit.org @0xF390

    Previously at AppNexus, Message Systems and GWU HPCL
  2. PWLNY 1 INTRODUCTION As multiprocessors become mainstream, multithreaded applications will

    become more common, increasing the need for efficient coordination of concurrent accesses to shared data structures.
  3. PWLNY Concurrent Data Structures

    An employee directory data structure. Special precautions are necessary to guarantee consistency of data structure.

        #define EMPLOYEE_MAX 8

        struct employee {
            const char *name;
            unsigned long long number;
        };

        struct directory {
            struct employee *employee[EMPLOYEE_MAX];
        };

        bool employee_add(struct directory *, const char *, unsigned long long);
        void employee_delete(struct directory *, const char *);
        unsigned long long employee_number_get(struct directory *, const char *);
  4. PWLNY Concurrent Data Structures

    Special precautions are necessary to guarantee consistency of data structure.

        unsigned long long
        employee_number_get(struct directory *d, const char *n)
        {
            size_t i;

            for (i = 0; i < EMPLOYEE_MAX; i++) {
                if (d->employee[i] == NULL)
                    continue;
                if (strcmp(d->employee[i]->name, n) != 0)
                    continue;
                return d->employee[i]->number;
            }

            return 0;
        }

        void
        employee_delete(struct directory *d, const char *n)
        {
            size_t i;

            for (i = 0; i < EMPLOYEE_MAX; i++) {
                if (d->employee[i] == NULL)
                    continue;
                if (strcmp(d->employee[i]->name, n) != 0)
                    continue;
                free(d->employee[i]);
                d->employee[i] = NULL;
                return;
            }

            return;
        }
  5. PWLNY Concurrent Data Structures A read-reclaim race occurs if an

    object is destroyed while there are references or accesses to it. strcmp(em->name, … number = em->number free(em) Time T0 T1 T2
  6. PWLNY Concurrent Data Structures Special precautions are necessary to guarantee

    consistency of data structure. • Atomicity • Reordering by compiler and processor Blocking synchronization primitives can be used to provide consistency guarantees at the cost of forward progress guarantees.
  7. PWLNY Concurrent Data Structures

        #define EMPLOYEE_MAX 8

        struct employee {
            const char *name;
            unsigned long long number;
        };

        struct directory {
            struct employee *employee[EMPLOYEE_MAX];
            rwlock_t rwlock;
        };

        bool employee_add(struct directory *, const char *, unsigned long long);
        void employee_delete(struct directory *, const char *);
        unsigned long long employee_number_get(struct directory *, const char *);

    [Diagram: rwlock]
  8. PWLNY Concurrent Data Structures

    The rwlock_t object provides correctness at cost of forward progress.

        unsigned long long
        employee_number_get(struct directory *d, const char *n)
        {
            struct employee *em;
            unsigned long long number; /* Matches the return type. */
            size_t i;

            rwlock_read_lock(&d->rwlock);
            for (i = 0; i < EMPLOYEE_MAX; i++) {
                em = d->employee[i];
                if (em == NULL)
                    continue;
                if (strcmp(em->name, n) != 0)
                    continue;
                number = em->number;
                rwlock_read_unlock(&d->rwlock);
                return number;
            }

            rwlock_read_unlock(&d->rwlock);
            return 0;
        }

    [Diagram: “Samy” 2912984911, rwlock]
  9. PWLNY Concurrent Data Structures

    The rwlock_t object provides correctness at cost of forward progress.

        bool
        employee_add(struct directory *d, const char *n, unsigned long long number)
        {
            struct employee *em;
            size_t i;

            rwlock_write_lock(&d->rwlock);
            for (i = 0; i < EMPLOYEE_MAX; i++) {
                if (d->employee[i] != NULL)
                    continue;
                em = xmalloc(sizeof *em);
                em->name = n;
                em->number = number;
                d->employee[i] = em;
                rwlock_write_unlock(&d->rwlock);
                return true;
            }

            rwlock_write_unlock(&d->rwlock);
            return false;
        }

    [Diagram: “Samy” 2912984911, rwlock]
  10. PWLNY Concurrent Data Structures

    The rwlock_t object provides correctness at cost of forward progress.

        void
        employee_delete(struct directory *d, const char *n)
        {
            struct employee *em;
            size_t i;

            rwlock_write_lock(&d->rwlock);
            for (i = 0; i < EMPLOYEE_MAX; i++) {
                if (d->employee[i] == NULL)
                    continue;
                if (strcmp(d->employee[i]->name, n) != 0)
                    continue;
                em = d->employee[i];
                d->employee[i] = NULL;
                rwlock_write_unlock(&d->rwlock);
                free(em);
                return;
            }

            rwlock_write_unlock(&d->rwlock);
            return;
        }

    [Diagram: “Samy” 2912984911, rwlock]
  11. PWLNY Concurrent Data Structures

    If reachability and liveness are coupled, you also protect against a read-reclaim race.

        void
        employee_delete(struct directory *d, const char *n)
        {
            struct employee *em;
            size_t i;

            rwlock_write_lock(&d->rwlock);
            for (i = 0; i < EMPLOYEE_MAX; i++) {
                if (d->employee[i] == NULL)
                    continue;
                if (strcmp(d->employee[i]->name, n) != 0)
                    continue;
                em = d->employee[i];
                d->employee[i] = NULL;
                rwlock_write_unlock(&d->rwlock);
                free(em);
                return;
            }

            rwlock_write_unlock(&d->rwlock);
            return;
        }

    [Diagram: “Samy” 2912984911, rwlock]
  12. PWLNY Concurrent Data Structures If reachability and liveness are coupled,

    you also protect against a read-reclaim race. strcmp(em->name, … number = em->number employee_delete waits on readers Time T0 T1 T2 employee_delete destroys object
  13. PWLNY Concurrent Data Structures

    Decoupling is sometimes necessary, but requires a safe memory reclamation scheme to guarantee that an object cannot be physically destroyed if there are active references to it.

        static struct employee *
        employee_number_get(struct directory *d, const char *n, ck_brlock_reader_t *reader)
        {
            …
            ck_brlock_read_lock(&d->brlock, reader);
            for (i = 0; i < EMPLOYEE_MAX; i++) {
                em = d->employee[i];
                if (em == NULL)
                    continue;
                if (strcmp(em->name, n) != 0)
                    continue;
                ck_pr_inc_uint(&em->ref);
                ck_brlock_read_unlock(reader);
                return em;
            }

            ck_brlock_read_unlock(reader);
            …
  14. PWLNY Concurrent Data Structures

    Decoupling is sometimes necessary, but requires a safe memory reclamation scheme to guarantee that an object cannot be physically destroyed if there are active references to it.

        static void
        employee_delref(struct employee *em)
        {
            bool z;

            ck_pr_dec_uint_zero(&em->ref, &z);
            if (z == true)
                free(em);

            return;
        }
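The reference-counting idea on this slide can be sketched without Concurrency Kit. The version below is a minimal illustration using C11 atomics in place of the `ck_pr` calls; `employee_new` and `employee_addref` are hypothetical helpers that do not appear in the talk, and `employee_delref` returns a flag (the slide's version returns nothing) only so that the behavior is easy to check.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct employee {
    const char *name;
    unsigned long long number;
    atomic_uint ref;    /* Number of outstanding references. */
};

/* Hypothetical helper: allocate an employee holding one reference (the directory's). */
static struct employee *
employee_new(const char *name, unsigned long long number)
{
    struct employee *em = malloc(sizeof *em);

    if (em == NULL)
        return NULL;

    em->name = name;
    em->number = number;
    atomic_init(&em->ref, 1);
    return em;
}

/* Hypothetical helper: take another reference. The caller must already
 * hold a reference or hold the lock that keeps the object reachable. */
static void
employee_addref(struct employee *em)
{
    atomic_fetch_add_explicit(&em->ref, 1, memory_order_relaxed);
}

/* Drop a reference. Returns true if this call physically destroyed the object. */
static bool
employee_delref(struct employee *em)
{
    unsigned int prev;

    prev = atomic_fetch_sub_explicit(&em->ref, 1, memory_order_acq_rel);
    assert(prev > 0);   /* Dropping a reference nobody holds is a bug. */

    if (prev == 1) {
        free(em);
        return true;
    }

    return false;
}
```

In this sketch the logical delete (unlinking from the directory) and the physical delete (the `free` in `employee_delref`) are separate events: whichever thread drops the last reference performs the physical delete.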
  15. PWLNY Concurrent Data Structures

    Decoupling is sometimes necessary, but requires a safe memory reclamation scheme to guarantee that an object cannot be physically destroyed if there are active references to it.

        static void
        employee_delete(struct directory *d, const char *n)
        {
            …
            ck_brlock_write_lock(&d->brlock);
            for (i = 0; i < EMPLOYEE_MAX; i++) {
                if (d->employee[i] == NULL)
                    continue;
                if (strcmp(d->employee[i]->name, n) != 0)
                    continue;
                em = d->employee[i];
                d->employee[i] = NULL;
                ck_brlock_write_unlock(&d->brlock);
                employee_delref(em);
                return;
            }

            ck_brlock_write_unlock(&d->brlock);
            …
  16. PWLNY Concurrent Data Structures Decoupling is sometimes necessary, but requires

    a safe memory reclamation scheme to guarantee that an object cannot be physically destroyed if there are active references to it. get get logical delete T0 T1 T2 logical delete physical delete active reference active reference
  17. PWLNY Concurrent Data Structures

    Traditional locking requires expensive atomic operations, such as compare-and-swap (CAS), even when locks are uncontended. Locking is also susceptible to priority inversion, convoying, deadlock, and blocking due to thread failure [3, 10], leading researchers to pursue non-blocking (or lock-free) synchronization [6, 12, 13, 14, 16, 29]. In some cases, lock-free approaches can bring performance benefits [25].

    […] reference counting [5, 29] has high overhead in the base case and scales poorly with data-structure size.
  18. PWLNY Concurrent Data Structures

    […] concurrently-readable synchronization […] uses locks for updates but not for reads.

        static bool
        employee_add(struct directory *d, const char *n, unsigned long long number)
        {
            struct employee *em;
            size_t i;

            ck_rwlock_write_lock(&d->rwlock);
            for (i = 0; i < EMPLOYEE_MAX; i++) {
                if (d->employee[i] != NULL)
                    continue;
                em = malloc(sizeof *em);
                em->name = n;
                em->number = number;
                ck_pr_fence_store();
                ck_pr_store_ptr(&d->employee[i], em);
                ck_rwlock_write_unlock(&d->rwlock);
                return true;
            }

            ck_rwlock_write_unlock(&d->rwlock);
            return false;
        }

        static unsigned long long
        employee_number_get(struct directory *d, const char *n)
        {
            struct employee *em;
            unsigned long long number; /* Matches the return type. */
            size_t i;

            for (i = 0; i < EMPLOYEE_MAX; i++) {
                em = ck_pr_load_ptr(&d->employee[i]);
                if (em == NULL)
                    continue;
                ck_pr_fence_load_depends();
                if (strcmp(em->name, n) != 0)
                    continue;
                number = em->number;
                return number;
            }

            return 0;
        }
  19. PWLNY Concurrent Data Structures

    […] concurrently-readable synchronization […] uses locks for updates but not for reads.

        static void
        employee_delete(struct directory *d, const char *n)
        {
            struct employee *em;
            size_t i;

            ck_rwlock_write_lock(&d->rwlock);
            for (i = 0; i < EMPLOYEE_MAX; i++) {
                if (d->employee[i] == NULL)
                    continue;
                if (strcmp(d->employee[i]->name, n) != 0)
                    continue;
                em = d->employee[i];
                ck_pr_store_ptr(&d->employee[i], NULL);
                ck_rwlock_write_unlock(&d->rwlock);
                /* XXX: When is it safe to free em? */
                return;
            }

            ck_rwlock_write_unlock(&d->rwlock);
            return;
        }

        static unsigned long long
        employee_number_get(struct directory *d, const char *n)
        {
            struct employee *em;
            unsigned long long number; /* Matches the return type. */
            size_t i;

            for (i = 0; i < EMPLOYEE_MAX; i++) {
                em = ck_pr_load_ptr(&d->employee[i]);
                if (em == NULL)
                    continue;
                ck_pr_fence_load_depends();
                if (strcmp(em->name, n) != 0)
                    continue;
                number = em->number;
                return number;
            }

            return 0;
        }
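The publication pattern on these two slides (initialize the object, fence, then store the pointer; readers load the pointer and only then dereference it) can be approximated with C11 atomics. The sketch below is an assumption-laden stand-in, not the Concurrency Kit code: a release store replaces `ck_pr_fence_store` plus `ck_pr_store_ptr`, an acquire load replaces `ck_pr_load_ptr` plus `ck_pr_fence_load_depends`, and the write lock is left out on the assumption that the caller serializes writers. Deletion is omitted because, as the slide's XXX comment notes, it needs a reclamation scheme.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define EMPLOYEE_MAX 8

struct employee {
    const char *name;
    unsigned long long number;
};

struct directory {
    _Atomic(struct employee *) employee[EMPLOYEE_MAX];
};

/* Writer side. Assumes writers are serialized by the caller
 * (the slides use a write lock for this). */
static bool
employee_add(struct directory *d, const char *n, unsigned long long number)
{
    assert(d != NULL && n != NULL);

    for (size_t i = 0; i < EMPLOYEE_MAX; i++) {
        struct employee *em;

        if (atomic_load_explicit(&d->employee[i], memory_order_relaxed) != NULL)
            continue;

        em = malloc(sizeof *em);
        if (em == NULL)
            return false;

        em->name = n;
        em->number = number;

        /* Release: the initialized fields become visible no later than the pointer. */
        atomic_store_explicit(&d->employee[i], em, memory_order_release);
        return true;
    }

    return false;
}

/* Reader side: no lock is taken. */
static unsigned long long
employee_number_get(struct directory *d, const char *n)
{
    assert(d != NULL && n != NULL);

    for (size_t i = 0; i < EMPLOYEE_MAX; i++) {
        /* Acquire pairs with the release store in employee_add. */
        struct employee *em = atomic_load_explicit(&d->employee[i],
            memory_order_acquire);

        if (em == NULL || strcmp(em->name, n) != 0)
            continue;

        return em->number;
    }

    return 0;
}
```

Acquire is stronger than the dependency ordering the slide relies on; it is used here because it is the simplest ordering that is clearly sufficient in portable C11.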
  20. PWLNY EXPERIMENT

    Workload
    • Uniform read-mostly workload
    • Single writer attempts pessimistic add operation at fixed frequency
    • Readers attempt to get the number of the first employee

    Environment
    • 12 cores across 2 sockets
    • Intel Xeon E5-2630L at 2.40 GHz
    • Linux 2.6.32

    Machine (64GB) NUMANode L#0 (P#0 32GB) Socket L#0 + L3 L#0 (15MB) L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
  21. PWLNY Concurrent Data Structures A read-reclaim race occurs if an

    object is destroyed while there are references or accesses to it. strcmp(em->name, … number = em->number free(em) Time T0 T1 T2
  22. PWLNY 2 MEMORY RECLAMATION SCHEMES

    This section briefly reviews the reclamation schemes we consider: quiescent-state-based reclamation (QSBR) [22, 2], epoch-based reclamation (EBR) [6], hazard-pointer-based reclamation (HPBR) [23, 24], and reference counting [29, 26]. We provide an overview of each scheme to help the reader understand our work; further details are available in the papers cited.

    Reference Counting, Pass The Buck, Hazard Pointers, QSBR, EBR, Proxy Collection, Passive Serialization
  23. PWLNY HAZARD POINTERS An algorithm using HPBR must identify all

    hazardous references — references to shared nodes that may have been removed by other threads or that are vulnerable to the ABA problem [24].
  24. PWLNY HAZARD POINTERS

    An algorithm using HPBR must identify all hazardous references — references to shared nodes that may have been removed by other threads or that are vulnerable to the ABA problem [24].

        employee_number_get:
        […]
        for (i = 0; i < EMPLOYEE_MAX; i++) {
            do {
                em = ck_pr_load_ptr(&d->employee[i]);
                if (em == NULL)
                    break; /* Empty slot: nothing to protect. */
                ck_hp_set_fence(reader, 0, em);
            } while (ck_pr_load_ptr(&d->employee[i]) != em);

            /*
             * Skip empty slots here. A continue inside the do-while only
             * jumps to its condition, so a NULL em could otherwise reach
             * the strcmp below.
             */
            if (em == NULL)
                continue;
            if (strcmp(em->name, n) != 0)
                continue;
            return em;
        }
        […]
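The reader-side protocol on this slide (load the pointer, publish it as hazardous, then re-check that the slot still holds it) can be isolated into one small function. The sketch below uses C11 atomics rather than `ck_hp`; `hazard_slot` and `hazard_protect` are hypothetical names, and a single global slot stands in for the per-thread hazard record.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical: the single hazard pointer slot of one reader thread. */
static _Atomic(void *) hazard_slot;

/*
 * Load *src and publish the loaded value as hazardous. Retry until the
 * source still holds the value that was published, which means the
 * publication happened before any unlink of that value could complete.
 */
static void *
hazard_protect(_Atomic(void *) *src)
{
    void *p;

    assert(src != NULL);

    do {
        p = atomic_load(src);

        /* Sequentially consistent store: ordered before the re-validating load. */
        atomic_store(&hazard_slot, p);
    } while (atomic_load(src) != p);

    /* NULL means the source was empty, so nothing is protected. */
    return p;
}
```

The sequentially consistent store plays the role of the strong read-side barrier that hazard pointers require; weakening it would let the re-validation load be satisfied before the publication is visible to a reclaimer.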
  25. PWLNY HAZARD POINTERS

        employee_number_delete
        […]
        ck_pr_store_ptr(slot, NULL);    /* Logical Delete */
        ck_pr_fence_memory();
        defer(em);
        […]

        defer:
            add_to_list(deferrals, em);
            if (length(deferrals) > R)
                reclaim;

        reclaim:
            for candidate in deferrals {
                bool found = false;

                for hazard in threads {
                    if (hazard == candidate) {
                        found = true;
                        break;
                    }
                }

                if (found == false) {
                    free(candidate);    /* Physical Delete */
                    remove(deferrals, candidate);
                }
            }
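The defer and reclaim pseudo-code above can be turned into a small runnable sketch. Everything named `hp_*` below is hypothetical and not the `ck_hp` API; the sketch assumes a fixed number of reader slots, a single writer that owns the deferral list, and a fixed-capacity list that triggers a reclaim pass when full (the slide uses a threshold R instead).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define HP_THREADS   4  /* Assumed fixed number of reader slots. */
#define HP_DEFER_MAX 8  /* Capacity of the deferral list in this sketch. */

/* One published hazard pointer per reader thread. */
static _Atomic(void *) hp_hazard[HP_THREADS];

/* Writer-private list of logically deleted objects awaiting physical deletion. */
static void *hp_deferred[HP_DEFER_MAX];
static size_t hp_n_deferred;

/* Reader: publish (or clear, with NULL) a hazardous reference. */
static void
hp_set(unsigned int thread, void *pointer)
{
    assert(thread < HP_THREADS);
    atomic_store(&hp_hazard[thread], pointer);
}

/* Writer: free every deferred object no reader protects. Returns the number freed. */
static size_t
hp_reclaim(void)
{
    size_t kept = 0, freed = 0;

    for (size_t i = 0; i < hp_n_deferred; i++) {
        void *candidate = hp_deferred[i];
        int found = 0;

        for (size_t t = 0; t < HP_THREADS; t++) {
            if (atomic_load(&hp_hazard[t]) == candidate) {
                found = 1;
                break;
            }
        }

        if (found) {
            hp_deferred[kept++] = candidate;    /* Still protected: retry later. */
        } else {
            free(candidate);                    /* Physical delete. */
            freed++;
        }
    }

    hp_n_deferred = kept;
    return freed;
}

/* Writer: defer an already unlinked object. Returns -1 if the list is
 * full and a reclaim pass could not make room. */
static int
hp_defer(void *pointer)
{
    if (hp_n_deferred == HP_DEFER_MAX && hp_reclaim() == 0)
        return -1;

    hp_deferred[hp_n_deferred++] = pointer;
    return 0;
}
```

Because the deferral list has a fixed capacity and each reader can protect only a bounded number of objects, the amount of unreclaimed memory stays bounded, which is the property the talk credits to hazard pointers.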
  26. PWLNY HAZARD POINTERS employee_number_delete […] ck_pr_store_ptr(slot, NULL); ck_pr_fence_memory(); defer(em); […]

    defer: add_to_list(deferrals, em); if (length(deferrals) > R) reclaim; employee_number_get: […] for (i = 0; i < EMPLOYEE_MAX; i++) { do { em = ck_pr_load_ptr(&d->employe if (em == NULL) continue; ck_hp_set_fence(reader, 0, em); } while (ck_pr_load_ptr(&d->employee[i] […] If a hazard pointer is set, it is guaranteed to be visible to any subsequent reclamation operations.
  27. PWLNY HAZARD POINTERS

    • Outside-in references require storage space for every hazardous reference.
    • Requires a strong barrier on the read-side.
    • Provable bound on memory usage.
    • Read-side amortization possible with a proxy object.
    • In Hart’s implementations, deferral is amortized across 100 objects.

    Implementation available in Concurrency Kit (ck_hp).
  28. PWLNY BLOCKING SCHEMES • Read-side critical sections smr_read_lock(); <protected section>

    smr_read_unlock(); • Explicit Reclamation smr_synchronize(); smr_read_begin(); <protected section> smr_read_end();
  29. PWLNY QUIESCENT-STATE-BASED RECLAMATION

    A quiescent state for thread T is a state in which T holds no references to shared nodes; hence, a grace period for QSBR is any interval of time during which all threads pass through at least one quiescent state.

    [Figure: timeline for threads T0, T1 and T2 with events labeled logical delete, synchronize, read, q, destroy, and grace periods G0 and G1.]
  30. PWLNY QUIESCENT-STATE-BASED RECLAMATION employee_number_delete […] ck_pr_store_ptr(slot, NULL); qsbr_synchronize(); free(em); […]

    Writer Readers […] for (;;) { em = employee_get(…); do_stuff(em); quiesce(); } […] Time ck_pr_store_ptr(slot, NULL); em = employee_get(…); qsbr_synchronize(); do_stuff(em); quiesce(); em = employee_get(…); do_stuff(em); quiesce(); em = employee_get(…); do_stuff(em); em = employee_get(…); do_stuff(em); free(em); T0 T1 T2
  31. PWLNY QUIESCENT-STATE-BASED RECLAMATION

    Writers

        static void
        qsbr_synchronize(void)
        {
            int i;
            uint64_t goal;

            ck_pr_fence_memory();
            goal = ck_pr_faa_64(&global.value, 1) + 1;

            for (i = 0; i < n_reader; i++) {
                uint64_t *c = &threads.readers[i].counter.value;

                while (ck_pr_load_64(c) < goal)
                    ck_pr_stall();
            }

            return;
        }

    Readers

        static void
        qsbr_quiesce(struct thread *th)
        {
            uint64_t v;

            ck_pr_fence_memory();
            v = ck_pr_load_64(&global.value);
            ck_pr_store_64(&th->counter.value, v);
            ck_pr_fence_memory();
            return;
        }

        static void
        qsbr_read_lock(struct thread *th)
        {
            ck_pr_barrier(); /* Compiler barrier. */
            return;
        }

        static void
        qsbr_read_unlock(struct thread *th)
        {
            ck_pr_barrier(); /* Compiler barrier. */
            return;
        }
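The counter scheme on this slide can be exercised in a single thread if the blocking wait is split into a start step and a poll step. The sketch below does that with C11 atomics; `qsbr_start` and `qsbr_poll` are hypothetical names (the slide has one spinning `qsbr_synchronize`), the reader count is assumed fixed, and default sequentially consistent operations stand in for the explicit fences.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QSBR_READERS 2  /* Assumed fixed reader count. */

static atomic_uint_fast64_t qsbr_global;
static atomic_uint_fast64_t qsbr_reader[QSBR_READERS];

/* Reader: announce a quiescent state by copying the global counter. */
static void
qsbr_quiesce(unsigned int reader)
{
    assert(reader < QSBR_READERS);
    atomic_store(&qsbr_reader[reader], atomic_load(&qsbr_global));
}

/* Writer: begin a grace period and return the value every reader must reach. */
static uint64_t
qsbr_start(void)
{
    return atomic_fetch_add(&qsbr_global, 1) + 1;
}

/* Writer: true once every reader has quiesced since qsbr_start returned goal. */
static bool
qsbr_poll(uint64_t goal)
{
    for (unsigned int i = 0; i < QSBR_READERS; i++) {
        if (atomic_load(&qsbr_reader[i]) < goal)
            return false;
    }

    return true;
}
```

A blocking synchronize in the style of the slide would call `qsbr_start` once and then spin on `qsbr_poll`; only after it returns true is it safe to free objects unlinked before the start.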
  32. PWLNY QUIESCENT-STATE-BASED RECLAMATION

    • Practically zero overhead read-side synchronization.
    • Readers must be augmented to mark quiescent points.
    • Reclamation blocks until a quiescent point has passed.
    • Hart’s implementation is very similar to epoch reclamation in that it relies on 3 retirement lists, but threads are effectively always in a critical section.
    • Reclamation occurs at quiescent points if counter value is the same.

    Implementation available in Userspace RCU (urcu-qsbr).
  33. PWLNY EPOCH-BASED RECLAMATION Fraser’s EBR [6] follows QSBR in using

    grace periods, but uses epochs in place of QSBR’s quiescent states. Each thread executes in one of three logical epochs, and may lag at most one epoch behind the global epoch. Each thread atomically sets a per-thread flag upon entry into a critical region …
  34. PWLNY EPOCH-BASED RECLAMATION

        CK_CC_INLINE static void
        ck_epoch_begin(ck_epoch_t *epoch, ck_epoch_record_t *record)
        {
            unsigned int g_epoch = ck_pr_load_uint(&epoch->epoch);

            ck_pr_store_uint(&record->epoch, g_epoch);
            ck_pr_fas_uint(&record->active, 1);
            ck_pr_fence_memory();
            return;
        }

        CK_CC_INLINE static void
        ck_epoch_end(ck_epoch_t *global, ck_epoch_record_t *record)
        {
            (void)global;

            ck_pr_fence_memory(); /* Release suffices. */
            ck_pr_store_uint(&record->active, 0);
            return;
        }
  35. PWLNY EPOCH-BASED RECLAMATION

        void
        ck_epoch_synchronize(struct ck_epoch *global, struct ck_epoch_record *record)
        {
            struct ck_epoch_record *cr;
            unsigned int delta, epoch, goal, i;
            bool active;

            delta = epoch = ck_pr_load_uint(&global->epoch);
            goal = epoch + CK_EPOCH_GRACE;

            ck_pr_fence_memory();

            for (i = 0, cr = NULL; i < CK_EPOCH_GRACE - 1; cr = NULL, i++) {
                while (cr = ck_epoch_scan(global, cr, delta, &active), cr != NULL) {
                    unsigned int e_d; /* Declaration missing from the slide excerpt. */

                    e_d = ck_pr_load_uint(&global->epoch);
                    if (e_d != delta) {
                        delta = e_d;
                        goto reload;
                    }
                }

                if (active == false)
                    break;

                if (ck_pr_cas_uint_value(&global->epoch, delta, delta + 1, &delta) == true) {
                    delta = delta + 1;
                    continue;
                }

        reload:
                if (delta >= goal)
                    break;
            }
  36. PWLNY EPOCH-BASED RECLAMATION

    • Atomic operation or barrier required on entry.
    • On relaxed ordering architectures, a release fence is required on exit.
    • Application-agnostic and doesn’t require explicit quiescent points.
    • Reclamation blocks until a quiescent point has passed.
    • Competitive with amortization (NEBR) and deferred callbacks.
    • In Hart’s implementation, entry into a protected section forces reclamation attempt every 100 critical sections.

    Implementation available in Concurrency Kit (ck_epoch).
  37. PWLNY SUMMARY

    • Non-Blocking Schemes
      • Hazard Pointers
        • Requires a fence on read-side for every hazardous reference but has strong bound on growth.
    • Blocking Schemes
      • Quiescent State-Based Reclamation (QSBR)
        • Quiescent states are used to implement grace period detection.
      • Epoch-Based Reclamation (EBR)
        • Grace period detection relies on conditionally increased epoch counter, does not rely on quiescence.
  38. PWLNY 3 Reclamation Performance Factors

    We categorize factors which can affect reclamation scheme performance; we vary these factors in Section 5.

    • Memory Consistency
      • Required complexity of the various safe memory reclamation mechanisms.
    • Data Structures and Workloads
      • Different mechanisms excel in different workloads.
    • Threads and Scheduling
      • High levels of preemption will do bad things.
    • Memory Constraints
      • Deferral may violate memory requirements.
  39. PWLNY Operating Environment

                XServe                      IBM Power                  Dell *
        CPUs    2x2GHz PPC G5               8x1.45GHz POWER4+          6x2x2.4GHz Xeon E5-2630L
        Kernel  Linux 2.6.8-1.ydl.7g5-smp   Linux 2.6.13 (kernel.org)  Linux 2.6.32-504.1.3.el6
        Fence   78ns                        76ns                       ~14ns
        CAS     52ns                        59ns                       ~8ns
        Lock    231ns                       243ns                      ~10ns

    * Not from original study
  40. PWLNY Test Program

    Figure 8. Per-thread test pseudo-code.

    • Benchmark starts N threads and then stops them after the test duration.
    • Average execution time per operation = test duration / total number of operations
    • CPU time = execution time / min(|threads|,|processors|)
    • Workload varied in reclamation-agnostic manner.
    • Free list for memory allocation.
    • Data structures used
      • Ordered lock-free list
      • MCS FIFO Queue
      • Hash table
  41. PWLNY Base Costs

    • Measure best-case execution times with no contention and minimal jitter.
    • Dominated by complexity of fast path.
  42. PWLNY Consequences

    Figure 18. Lock-free versus concurrently-readable algorithms — hash tables with load factor 1, four threads, 8 CPUs.