Samy Al Bahra on Making Lockless Synchronization Fast

PWLNY SAMY AL BAHRA
Co-founder of Backtrace: building a high-performance debugging platform for native applications. http://backtrace.io @0xCD03
Founder of Concurrency Kit: a concurrent memory model for C99 and an arsenal of tools for high-performance synchronization. http://concurrencykit.org @0xF390
Previously at AppNexus, Message Systems and GWU HPCL.

PWLNY OVERVIEW

PWLNY 1 INTRODUCTION As multiprocessors become mainstream, multithreaded applications will
become more common, increasing the need for efficient coordination of concurrent accesses to shared data structures.

PWLNY Concurrent Data Structures
A data structure that is shared by cooperating processes.
[Figure: processes P0, P1 and P2 sharing one data structure.]

PWLNY Concurrent Data Structures
An employee directory data structure. Special precautions are necessary to guarantee consistency of data structure.

    #define EMPLOYEE_MAX 8

    struct employee {
        const char *name;
        unsigned long long number;
    };

    struct directory {
        struct employee *employee[EMPLOYEE_MAX];
    };

    bool employee_add(struct directory *, const char *, unsigned long long);
    void employee_delete(struct directory *, const char *);
    unsigned long long employee_number_get(struct directory *, const char *);

PWLNY Concurrent Data Structures
Special precautions are necessary to guarantee consistency of data structure.

    unsigned long long
    employee_number_get(struct directory *d, const char *n)
    {
        size_t i;

        for (i = 0; i < EMPLOYEE_MAX; i++) {
            if (d->employee[i] == NULL)
                continue;
            if (strcmp(d->employee[i]->name, n) != 0)
                continue;
            return d->employee[i]->number;
        }
        return 0;
    }

    void
    employee_delete(struct directory *d, const char *n)
    {
        size_t i;

        for (i = 0; i < EMPLOYEE_MAX; i++) {
            if (d->employee[i] == NULL)
                continue;
            if (strcmp(d->employee[i]->name, n) != 0)
                continue;
            free(d->employee[i]);
            d->employee[i] = NULL;
            return;
        }
        return;
    }

PWLNY Concurrent Data Structures
A read-reclaim race occurs if an object is destroyed while there are references or accesses to it.
[Timeline over threads T0, T1, T2: a reader executes strcmp(em->name, …) and number = em->number while another thread calls free(em).]

PWLNY Concurrent Data Structures
Special precautions are necessary to guarantee consistency of data structure.
• Atomicity
• Reordering by compiler and processor
Blocking synchronization primitives can be used to provide consistency guarantees at the cost of forward progress guarantees.

PWLNY Concurrent Data Structures
The directory now carries an rwlock.

    #define EMPLOYEE_MAX 8

    struct employee {
        const char *name;
        unsigned long long number;
    };

    struct directory {
        struct employee *employee[EMPLOYEE_MAX];
        rwlock_t rwlock;
    };

    bool employee_add(struct directory *, const char *, unsigned long long);
    void employee_delete(struct directory *, const char *);
    unsigned long long employee_number_get(struct directory *, const char *);

PWLNY Concurrent Data Structures
The rwlock_t object provides correctness at cost of forward progress.
[Figure: directory entry “Samy” 2912984911 guarded by the rwlock.]

    unsigned long long
    employee_number_get(struct directory *d, const char *n)
    {
        struct employee *em;
        unsigned long long number;   /* matches the return type */
        size_t i;

        rwlock_read_lock(&d->rwlock);
        for (i = 0; i < EMPLOYEE_MAX; i++) {
            em = d->employee[i];
            if (em == NULL)
                continue;
            if (strcmp(em->name, n) != 0)
                continue;
            /* Copy the value out while the read lock is still held. */
            number = em->number;
            rwlock_read_unlock(&d->rwlock);
            return number;
        }
        rwlock_read_unlock(&d->rwlock);
        return 0;
    }

PWLNY Concurrent Data Structures
The rwlock_t object provides correctness at cost of forward progress.
[Figure: directory entry “Samy” 2912984911 guarded by the rwlock.]

    bool
    employee_add(struct directory *d, const char *n, unsigned long long number)
    {
        struct employee *em;
        size_t i;

        rwlock_write_lock(&d->rwlock);
        for (i = 0; i < EMPLOYEE_MAX; i++) {
            if (d->employee[i] != NULL)
                continue;
            em = xmalloc(sizeof *em);
            em->name = n;
            em->number = number;
            d->employee[i] = em;
            rwlock_write_unlock(&d->rwlock);
            return true;
        }
        rwlock_write_unlock(&d->rwlock);
        return false;
    }

PWLNY Concurrent Data Structures
The rwlock_t object provides correctness at cost of forward progress.
[Figure: directory entry “Samy” 2912984911 guarded by the rwlock.]

    void
    employee_delete(struct directory *d, const char *n)
    {
        struct employee *em;
        size_t i;

        rwlock_write_lock(&d->rwlock);
        for (i = 0; i < EMPLOYEE_MAX; i++) {
            if (d->employee[i] == NULL)
                continue;
            if (strcmp(d->employee[i]->name, n) != 0)
                continue;
            em = d->employee[i];
            d->employee[i] = NULL;
            rwlock_write_unlock(&d->rwlock);
            free(em);
            return;
        }
        rwlock_write_unlock(&d->rwlock);
        return;
    }

PWLNY Concurrent Data Structures
If reachability and liveness are coupled, you also protect against a read-reclaim race.
This slide repeats the rwlock-based employee_delete from the previous slide: the entry is unlinked under the write lock and freed after rwlock_write_unlock.

PWLNY Concurrent Data Structures
If reachability and liveness are coupled, you also protect against a read-reclaim race.
[Timeline over threads T0, T1, T2: employee_delete waits on the readers executing strcmp(em->name, …) and number = em->number, and only then destroys the object.]

PWLNY Concurrent Data Structures
Decoupling is sometimes necessary, but requires a safe memory reclamation scheme to guarantee that an object cannot be physically destroyed if there are active references to it.

    static struct employee *
    employee_number_get(struct directory *d, const char *n, ck_brlock_reader_t *reader)
    {
        …
        ck_brlock_read_lock(&d->brlock, reader);
        for (i = 0; i < EMPLOYEE_MAX; i++) {
            em = d->employee[i];
            if (em == NULL)
                continue;
            if (strcmp(em->name, n) != 0)
                continue;
            /* Take a reference before leaving the read-side section. */
            ck_pr_inc_uint(&em->ref);
            ck_brlock_read_unlock(reader);
            return em;
        }
        ck_brlock_read_unlock(reader);
        …

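PWLNY Concurrent Data Structures
Decoupling is sometimes necessary, but requires a safe memory reclamation scheme to guarantee that an object cannot be physically destroyed if there are active references to it.

    static void
    employee_delref(struct employee *em)
    {
        bool z;

        /* z is set to true when the decrement brings the count to zero. */
        ck_pr_dec_uint_zero(&em->ref, &z);
        if (z == true)
            free(em);
        return;
    }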

PWLNY Concurrent Data Structures
Decoupling is sometimes necessary, but requires a safe memory reclamation scheme to guarantee that an object cannot be physically destroyed if there are active references to it.

    static void
    employee_delete(struct directory *d, const char *n)
    {
        …
        ck_brlock_write_lock(&d->brlock);
        for (i = 0; i < EMPLOYEE_MAX; i++) {
            if (d->employee[i] == NULL)
                continue;
            if (strcmp(d->employee[i]->name, n) != 0)
                continue;
            em = d->employee[i];
            d->employee[i] = NULL;              /* logical delete */
            ck_brlock_write_unlock(&d->brlock);
            employee_delref(em);                /* physical delete once unreferenced */
            return;
        }
        ck_brlock_write_unlock(&d->brlock);
        …

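PWLNY Concurrent Data Structures
Decoupling is sometimes necessary, but requires a safe memory reclamation scheme to guarantee that an object cannot be physically destroyed if there are active references to it.
[Timeline over threads T0, T1, T2: two get operations hold active references across the logical delete; the physical delete happens only after those references are dropped.]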
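The split between logical and physical deletion can be sketched with C11 atomics. This is a minimal illustration, not Concurrency Kit code: object_new, object_get and object_put are hypothetical names, and the sketch assumes the count starts at 1 to represent the container's own reference.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct object {
    atomic_uint ref;   /* assumption: starts at 1, the container's reference */
    int payload;
};

static struct object *
object_new(int payload)
{
    struct object *o = malloc(sizeof *o);

    if (o == NULL)
        return NULL;
    atomic_init(&o->ref, 1);
    o->payload = payload;
    return o;
}

/*
 * Take an extra reference. The caller must already know the object is
 * alive, for example because it still holds the read lock that made
 * the object reachable.
 */
static void
object_get(struct object *o)
{
    assert(o != NULL);
    atomic_fetch_add_explicit(&o->ref, 1, memory_order_relaxed);
}

/*
 * Drop a reference. Returns true when this call dropped the last
 * reference and therefore performed the physical delete.
 */
static bool
object_put(struct object *o)
{
    assert(o != NULL);
    if (atomic_fetch_sub_explicit(&o->ref, 1, memory_order_acq_rel) == 1) {
        free(o);
        return true;
    }
    return false;
}
```

In this sketch the deleter unlinks the object and calls object_put once for the container's reference. If a reader still holds a reference, that call returns false and the reader's own object_put frees the memory later.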

PWLNY Concurrent Data Structures
Traditional locking requires expensive atomic operations, such as compare-and-swap (CAS), even when locks are uncontended. Locking is also susceptible to priority inversion, convoying, deadlock, and blocking due to thread failure [3, 10], leading researchers to pursue non-blocking (or lock-free) synchronization [6, 12, 13, 14, 16, 29]. In some cases, lock-free approaches can bring performance benefits [25].
[…] reference counting [5, 29] has high overhead in the base case and scales poorly with data-structure size.

PWLNY Concurrent Data Structures
[…] concurrently-readable synchronization […] uses locks for updates but not for reads.

    static bool
    employee_add(struct directory *d, const char *n, unsigned long long number)
    {
        struct employee *em;
        size_t i;

        ck_rwlock_write_lock(&d->rwlock);
        for (i = 0; i < EMPLOYEE_MAX; i++) {
            if (d->employee[i] != NULL)
                continue;
            em = malloc(sizeof *em);
            em->name = n;
            em->number = number;
            /* Order the initializing stores before the publishing store. */
            ck_pr_fence_store();
            ck_pr_store_ptr(&d->employee[i], em);
            ck_rwlock_write_unlock(&d->rwlock);
            return true;
        }
        ck_rwlock_write_unlock(&d->rwlock);
        return false;
    }

    static unsigned long long
    employee_number_get(struct directory *d, const char *n)
    {
        struct employee *em;
        unsigned long long number;   /* matches the return type */
        size_t i;

        for (i = 0; i < EMPLOYEE_MAX; i++) {
            em = ck_pr_load_ptr(&d->employee[i]);
            if (em == NULL)
                continue;
            /* Order the dependent loads of em's fields after the pointer load. */
            ck_pr_fence_load_depends();
            if (strcmp(em->name, n) != 0)
                continue;
            number = em->number;
            return number;
        }
        return 0;
    }
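For readers more familiar with C11 than with the ck_pr interface, the publication pattern in employee_add and employee_number_get can be approximated with release and acquire operations. This is a sketch under assumptions: slot, publish and lookup are hypothetical names, a single slot stands in for the directory array, and acquire ordering is used on the read side even though it is stronger than the dependency ordering the slide relies on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct employee {
    const char *name;
    unsigned long long number;
};

/* Hypothetical single slot standing in for d->employee[i]. */
static _Atomic(struct employee *) slot;

/*
 * Writer: initialize every field, then publish the pointer. The release
 * store keeps the field writes from being reordered after publication.
 */
static void
publish(struct employee *em, const char *name, unsigned long long number)
{
    assert(em != NULL);
    em->name = name;
    em->number = number;
    atomic_store_explicit(&slot, em, memory_order_release);
}

/*
 * Reader: no lock is taken. The acquire load guarantees that a non-NULL
 * pointer refers to a fully initialized employee.
 */
static unsigned long long
lookup(void)
{
    struct employee *em = atomic_load_explicit(&slot, memory_order_acquire);

    return em == NULL ? 0 : em->number;
}
```

The sketch covers publication only. It says nothing about when a removed employee may be freed, which is the question the following slides take up.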

PWLNY Concurrent Data Structures
[…] concurrently-readable synchronization […] uses locks for updates but not for reads.
The lock-free employee_number_get is unchanged from the previous slide.

    static void
    employee_delete(struct directory *d, const char *n)
    {
        struct employee *em;
        size_t i;

        ck_rwlock_write_lock(&d->rwlock);
        for (i = 0; i < EMPLOYEE_MAX; i++) {
            if (d->employee[i] == NULL)
                continue;
            if (strcmp(d->employee[i]->name, n) != 0)
                continue;
            em = d->employee[i];
            ck_pr_store_ptr(&d->employee[i], NULL);
            ck_rwlock_write_unlock(&d->rwlock);
            /* XXX: When is it safe to free em? */
            return;
        }
        ck_rwlock_write_unlock(&d->rwlock);
        return;
    }

PWLNY EXPERIMENT
Workload
• Uniform read-mostly workload
• Single writer attempts pessimistic add operation at fixed frequency
• Readers attempt to get the number of the first employee
Environment
• 12 cores across 2 sockets
• Intel Xeon E5-2630L at 2.40 GHz
• Linux 2.6.32

    Machine (64GB)
      NUMANode L#0 (P#0 32GB)
        Socket L#0 + L3 L#0 (15MB)
          L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)

PWLNY Read Latency No updates

PWLNY Read Latency Single writer

PWLNY Write Latency Single writer

PWLNY Concurrent Data Structures A read-reclaim race occurs if an
object is destroyed while there are references or accesses to it. strcmp(em->name, … number = em->number free(em) Time T0 T1 T2

PWLNY 2 MEMORY RECLAMATION SCHEMES
This section briefly reviews the reclamation schemes we consider: quiescent-state-based reclamation (QSBR) [22, 2], epoch-based reclamation (EBR) [6], hazard-pointer-based reclamation (HPBR) [23, 24], and reference counting [29, 26]. We provide an overview of each scheme to help the reader understand our work; further details are available in the papers cited.
Schemes:
• Reference Counting
• Pass The Buck
• Hazard Pointers
• QSBR
• EBR
• Proxy Collection
• Passive Serialization

PWLNY NON-BLOCKING SCHEMES

PWLNY HAZARD POINTERS An algorithm using HPBR must identify all
hazardous references — references to shared nodes that may have been removed by other threads or that are vulnerable to the ABA problem [24].

PWLNY HAZARD POINTERS
employee_number_get:

    […]
    for (i = 0; i < EMPLOYEE_MAX; i++) {
        do {
            em = ck_pr_load_ptr(&d->employee[i]);
            /* Publish the hazard pointer, then re-check that the slot is unchanged. */
            ck_hp_set_fence(reader, 0, em);
        } while (ck_pr_load_ptr(&d->employee[i]) != em);

        /*
         * The NULL check sits after the loop: a continue inside do/while
         * jumps to the loop condition, not to the next slot, and would let
         * a NULL em reach strcmp.
         */
        if (em == NULL)
            continue;
        if (strcmp(em->name, n) != 0)
            continue;
        return em;
    }
    […]

PWLNY HAZARD POINTERS
Logical delete (employee_number_delete):

    […]
    ck_pr_store_ptr(slot, NULL);
    ck_pr_fence_memory();
    defer(em);
    […]

Physical delete (pseudo-code):

    defer:
        add_to_list(deferrals, em);
        if (length(deferrals) > R)
            reclaim;

    reclaim:
        for candidate in deferrals {
            bool found = false;
            for hazard in threads {
                if (hazard == candidate) {
                    found = true;
                    break;
                }
            }
            if (found == false) {
                free(candidate);
                remove(deferrals, candidate);
            }
        }

PWLNY HAZARD POINTERS
If a hazard pointer is set, it is guaranteed to be visible to any subsequent reclamation operations.
This slide shows the employee_number_delete and defer code next to the employee_number_get loop from the preceding slides (the loop is clipped on the slide). The full fence after ck_pr_store_ptr(slot, NULL) on the delete side pairs with ck_hp_set_fence on the read side.

PWLNY HAZARD POINTERS
• Outside-in references require storage space for every hazardous reference.
• Requires a strong barrier on the read-side.
• Provable bound on memory usage.
• Read-side amortization possible with a proxy object.
• In Hart’s implementation, deferral is amortized across 100 objects.
Implementation available in Concurrency Kit (ck_hp).
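The protect, defer and reclaim steps described for hazard pointers can be put together in a compact C11 sketch. It is an illustration under assumptions and not the ck_hp implementation: all names (hp_protect, hp_clear, hp_retire, hp_scan) and both limits are hypothetical, each reader owns exactly one hazard slot, and a single writer owns the deferral list.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define HP_THREADS 4   /* hypothetical: one hazard slot per reader thread */
#define HP_RETIRED 8   /* hypothetical: capacity of the deferral list */

static _Atomic(void *) hazard[HP_THREADS];
static void *retired[HP_RETIRED];   /* owned by the single writer */
static size_t n_retired;

/*
 * Reader: publish the pointer as hazardous, then re-read the slot. The
 * sequentially consistent store acts as the strong read-side barrier.
 */
static void *
hp_protect(size_t tid, _Atomic(void *) *slot)
{
    void *p;

    assert(tid < HP_THREADS);
    do {
        p = atomic_load(slot);
        atomic_store(&hazard[tid], p);
    } while (atomic_load(slot) != p);
    return p;
}

static void
hp_clear(size_t tid)
{
    assert(tid < HP_THREADS);
    atomic_store(&hazard[tid], NULL);
}

/* Writer: defer an already unlinked pointer. Returns false if the list is full. */
static bool
hp_retire(void *p)
{
    if (n_retired == HP_RETIRED)
        return false;
    retired[n_retired++] = p;
    return true;
}

/* Writer: free every deferred pointer that no reader has marked hazardous. */
static size_t
hp_scan(void)
{
    size_t freed = 0, kept = 0;

    for (size_t i = 0; i < n_retired; i++) {
        bool found = false;

        for (size_t t = 0; t < HP_THREADS; t++) {
            if (atomic_load(&hazard[t]) == retired[i]) {
                found = true;
                break;
            }
        }
        if (found) {
            retired[kept++] = retired[i];   /* still referenced: keep deferring */
        } else {
            free(retired[i]);
            freed++;
        }
    }
    n_retired = kept;
    return freed;
}
```

Because the deferral list has a fixed capacity and every reader can pin at most one object, the number of unreclaimed objects is bounded, which is the memory bound the slide mentions.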

PWLNY BLOCKING SCHEMES

PWLNY BLOCKING SCHEMES
• Read-side critical sections

    smr_read_lock();
    <protected section>
    smr_read_unlock();

• Explicit Reclamation

    smr_synchronize();

    smr_read_begin();
    <protected section>
    smr_read_end();

PWLNY QUIESCENT-STATE-BASED RECLAMATION
A quiescent state for thread T is a state in which T holds no references to shared nodes; hence, a grace period for QSBR is any interval of time during which all threads pass through at least one quiescent state.
[Figure: timeline of threads T0, T1, T2 showing reads, quiescent states (q), a logical delete, synchronize calls, grace periods G0 and G1, and the final destroy.]

PWLNY QUIESCENT-STATE-BASED RECLAMATION
Writer (employee_number_delete):

    […]
    ck_pr_store_ptr(slot, NULL);
    qsbr_synchronize();
    free(em);
    […]

Readers:

    […]
    for (;;) {
        em = employee_get(…);
        do_stuff(em);
        quiesce();
    }
    […]

[Timeline over threads T0, T1, T2: the writer executes ck_pr_store_ptr(slot, NULL) and then qsbr_synchronize() while the readers loop through employee_get(…), do_stuff(em) and quiesce(); free(em) runs after the readers have quiesced.]

PWLNY QUIESCENT-STATE-BASED RECLAMATION
Writers:

    static void
    qsbr_synchronize(void)
    {
        int i;
        uint64_t goal;

        ck_pr_fence_memory();
        /* Open a new grace period; every reader must catch up to goal. */
        goal = ck_pr_faa_64(&global.value, 1) + 1;
        for (i = 0; i < n_reader; i++) {
            uint64_t *c = &threads.readers[i].counter.value;

            while (ck_pr_load_64(c) < goal)
                ck_pr_stall();
        }
        return;
    }

Readers:

    static void
    qsbr_quiesce(struct thread *th)
    {
        uint64_t v;

        ck_pr_fence_memory();
        /* Announce that this thread has observed the current counter. */
        v = ck_pr_load_64(&global.value);
        ck_pr_store_64(&th->counter.value, v);
        ck_pr_fence_memory();
        return;
    }

    static void
    qsbr_read_lock(struct thread *th)
    {
        ck_pr_barrier(); /* Compiler barrier. */
        return;
    }

    static void
    qsbr_read_unlock(struct thread *th)
    {
        ck_pr_barrier(); /* Compiler barrier. */
        return;
    }

PWLNY QUIESCENT-STATE-BASED RECLAMATION
• Practically zero overhead read-side synchronization.
• Readers must be augmented to mark quiescent points.
• Reclamation blocks until a quiescent point has passed.
• Hart’s implementation is very similar to epoch reclamation in that it relies on 3 retirement lists, but threads are effectively always in a critical section.
• Reclamation occurs at quiescent points if counter value is the same.
Implementation available in Userspace RCU (urcu-qsbr).
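Grace-period detection by counter comparison can also be written as a non-blocking check, which makes the definition easy to step through in a single thread. The sketch below is an assumption-laden variant, not the urcu-qsbr or slide implementation: N_READERS, qsbr_begin_grace and qsbr_grace_elapsed are hypothetical names, and the default sequentially consistent atomics stand in for explicit fences.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define N_READERS 2   /* hypothetical fixed set of registered readers */

static atomic_uint_fast64_t global_counter;
static atomic_uint_fast64_t reader_counter[N_READERS];

/* Writer: open a grace period and return the value readers must reach. */
static uint_fast64_t
qsbr_begin_grace(void)
{
    return atomic_fetch_add(&global_counter, 1) + 1;
}

/* Reader: called at a point where no shared references are held. */
static void
qsbr_quiesce(size_t r)
{
    assert(r < N_READERS);
    atomic_store(&reader_counter[r], atomic_load(&global_counter));
}

/*
 * Writer: the grace period has elapsed once every reader has passed
 * through a quiescent state after qsbr_begin_grace() returned goal.
 */
static bool
qsbr_grace_elapsed(uint_fast64_t goal)
{
    for (size_t r = 0; r < N_READERS; r++) {
        if (atomic_load(&reader_counter[r]) < goal)
            return false;
    }
    return true;
}
```

A blocking synchronize is then a loop that spins until qsbr_grace_elapsed returns true. The sketch also shows the main cost of the scheme: one registered reader that never calls qsbr_quiesce holds back reclamation indefinitely.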

PWLNY EPOCH-BASED RECLAMATION Fraser’s EBR [6] follows QSBR in using
grace periods, but uses epochs in place of QSBR’s quiescent states. Each thread executes in one of three logical epochs, and may lag at most one epoch behind the global epoch. Each thread atomically sets a per-thread flag upon entry into a critical region …
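The three logical epochs can be pictured as three limbo buckets indexed by epoch modulo 3. The following is a single-threaded model of that bookkeeping only, written as a conservative sketch: it is not Fraser's or ck_epoch's code, every name in it is hypothetical, and it omits the atomic operations and fences a concurrent implementation needs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define EPOCHS    3   /* three logical epochs, as in the description above */
#define LIMBO_MAX 8   /* hypothetical per-bucket capacity */

struct reader {
    bool active;          /* inside a critical region */
    unsigned int epoch;   /* global epoch observed on entry */
};

static unsigned int global_epoch;
static void *limbo[EPOCHS][LIMBO_MAX];
static size_t limbo_len[EPOCHS];

/* Defer an unlinked object into the bucket of the current epoch. */
static bool
ebr_retire(void *p)
{
    size_t b = global_epoch % EPOCHS;

    if (limbo_len[b] == LIMBO_MAX)
        return false;
    limbo[b][limbo_len[b]++] = p;
    return true;
}

/*
 * Advance the global epoch only if no active reader lags behind it.
 * On success, the bucket that the new epoch is about to reuse is freed:
 * its objects were retired three epochs earlier.
 */
static bool
ebr_try_advance(const struct reader *readers, size_t n, size_t *freed)
{
    size_t b;

    assert(freed != NULL);
    *freed = 0;
    for (size_t i = 0; i < n; i++) {
        if (readers[i].active && readers[i].epoch != global_epoch)
            return false;
    }

    global_epoch++;
    b = global_epoch % EPOCHS;
    for (size_t i = 0; i < limbo_len[b]; i++)
        free(limbo[b][i]);
    *freed = limbo_len[b];
    limbo_len[b] = 0;
    return true;
}
```

In this model an object retired in epoch 0 is freed when the epoch reaches 3, and an active reader that is still in an older epoch blocks every advance, which is why a stalled reader delays reclamation.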

PWLNY EPOCH-BASED RECLAMATION

PWLNY EPOCH-BASED RECLAMATION

    CK_CC_INLINE static void
    ck_epoch_begin(ck_epoch_t *epoch, ck_epoch_record_t *record)
    {
        unsigned int g_epoch = ck_pr_load_uint(&epoch->epoch);

        ck_pr_store_uint(&record->epoch, g_epoch);
        ck_pr_fas_uint(&record->active, 1);
        ck_pr_fence_memory();
        return;
    }

    CK_CC_INLINE static void
    ck_epoch_end(ck_epoch_t *global, ck_epoch_record_t *record)
    {
        (void)global;

        ck_pr_fence_memory(); /* Release suffices. */
        ck_pr_store_uint(&record->active, 0);
        return;
    }

PWLNY EPOCH-BASED RECLAMATION

    void
    ck_epoch_synchronize(struct ck_epoch *global, struct ck_epoch_record *record)
    {
        struct ck_epoch_record *cr;
        unsigned int delta, e_d, epoch, goal, i;   /* e_d was used undeclared on the slide */
        bool active;

        delta = epoch = ck_pr_load_uint(&global->epoch);
        goal = epoch + CK_EPOCH_GRACE;
        ck_pr_fence_memory();

        for (i = 0, cr = NULL; i < CK_EPOCH_GRACE - 1; cr = NULL, i++) {
            /* Wait for every active record to observe the current epoch. */
            while (cr = ck_epoch_scan(global, cr, delta, &active), cr != NULL) {
                e_d = ck_pr_load_uint(&global->epoch);
                if (e_d != delta) {
                    /* Another thread advanced the epoch for us. */
                    delta = e_d;
                    goto reload;
                }
            }

            if (active == false)
                break;

            if (ck_pr_cas_uint_value(&global->epoch, delta, delta + 1, &delta) == true) {
                delta = delta + 1;
                continue;
            }

    reload:
            if (delta >= goal)
                break;
        }
        […]

PWLNY EPOCH-BASED RECLAMATION
• Atomic operation or barrier required on entry.
• On relaxed ordering architectures, a release fence is required on exit.
• Application-agnostic and doesn’t require explicit quiescent points.
• Reclamation blocks until a quiescent point has passed.
• Competitive with amortization (NEBR) and deferred callbacks.
• In Hart’s implementation, entry into a protected section forces reclamation attempt every 100 critical sections.
Implementation available in Concurrency Kit (ck_epoch).

PWLNY SUMMARY
• Non-Blocking Schemes
  • Hazard Pointers
    • Requires a fence on read-side for every hazardous reference but has strong bound on growth.
• Blocking Schemes
  • Quiescent State-Based Reclamation (QSBR)
    • Quiescent states are used to implement grace period detection.
  • Epoch-Based Reclamation (EBR)
    • Grace period detection relies on conditionally increased epoch counter, does not rely on quiescence.

PWLNY Read Latency No updates

PWLNY 3 Reclamation Performance Factors
We categorize factors which can affect reclamation scheme performance; we vary these factors in Section 5.
• Memory Consistency
  • Required complexity of the various safe memory reclamation mechanisms.
• Data Structures and Workloads
  • Different mechanisms excel in different workloads.
• Threads and Scheduling
  • High levels of preemption will do bad things.
• Memory Constraints
  • Deferral may violate memory requirements.

PWLNY Operating Environment

              XServe                       IBM Power                    Dell *
    CPUs      2x2GHz PPC G5                8x1.45GHz POWER4+            6x2x2.4GHz Xeon E5-2630L
    Kernel    Linux 2.6.8-1.ydl.7g5-smp    Linux 2.6.13 (kernel.org)    Linux 2.6.32-504.1.3.el6
    Fence     78ns                         76ns                         ~14ns
    CAS       52ns                         59ns                         ~8ns
    Lock      231ns                        243ns                        ~10ns

* Not from original study

PWLNY Test Program
Figure 8. Per-thread test pseudo-code.
• Benchmark starts N threads and then stops them after the test duration.
• Average execution time per operation = test duration / total number of operations
• CPU time = execution time / min(|threads|, |processors|)
• Workload varied in reclamation-agnostic manner.
• Free list for memory allocation.
• Data structures used
  • Ordered lock-free list
  • MCS FIFO Queue
  • Hash table

PWLNY Base Costs
• Measure best-case execution times with no contention and minimal jitter.
• Dominated by complexity of fast path.

PWLNY Scalability with Traversal Length Read-only workload with increasing list
length.

PWLNY Scalability with Threads

PWLNY Scalability with Threads
Figure 13. Effect of adding threads: lock-free queue, 8 CPUs.

PWLNY Scalability with Threads

PWLNY Consequences

PWLNY Consequences
Figure 18. Lock-free versus concurrently-readable algorithms: hash tables with load factor 1, four threads, 8 CPUs.

PWLNY Summary

http://backtrace.io/careers I’m @0xF390 - expect updates soon!
