

Racing to Win: Using Race Conditions in Correct Concurrent Software

If you've ever worked on concurrent or parallel systems, race conditions have invariably plagued your existence. They are difficult to identify, debug, and nearly impossible to test repeatably. While race conditions intuitively seem bad, it turns out there are cases in which we can use them to our advantage! In this talk, we'll discuss a number of ways that race conditions -- and correctly detecting them -- are used in improving throughput and reducing latency in high-performance systems.

We begin this exploration with a brief discussion of the various types of locks, non-blocking algorithms, and the benefits thereof. We'll look at a naive test-and-set spinlock and show how introducing a race condition on reads significantly improves lock acquisition throughput. From here, we'll investigate non-blocking algorithms and how they incorporate detection of race events to ensure correct, deterministic, and bounded behavior by analyzing a durable, lock-free memory allocator written in C using the Concurrency Kit library.

Videos of this talk are available at:
* Strangeloop 2015 https://www.youtube.com/watch?v=3LcNHxBJw2Q
* OSCon EU 2015 https://www.youtube.com/watch?v=jmSiMCENcVY

Devon H. O'Dell

September 26, 2015


Transcript

  1. Racing to Win: Utilizing race conditions to build correct concurrent software
     Devon H. O'Dell | Engineer @Fastly | @dhobsd | [email protected] | https://9vx.org/
  2. [Diagram: processes A, B, and C, each with stack/heap/text, on a cache node]
  3. [Same diagram, with the μslab allocator inside the cache node]
  4. Allocator Protocol
     • A request to allocate receives a response containing an object
     • A request to free receives a response when the supplied object is freed
     • Allocate must not return an already-allocated object
     • Free must not release an unallocated object
  5. Sequential History [timeline: A(allocate request) … A(allocate response),
     A(free request) … A(free response), B(allocate request) … B(allocate response)]
  6. [Same timeline]
  7. obj *allocate() {
       obj *h = freelist->head;
       freelist->head = h->next;
       return h;
     }

     void free(obj *o) {
       o->next = freelist->head;
       freelist->head = o;
     }
  8. obj *allocate() {
       lock(&global_mutex);
       obj *h = freelist->head;
       freelist->head = h->next;
       unlock(&global_mutex);
       return h;
     }

     void free(obj *o) {
       lock(&global_mutex);
       o->next = freelist->head;
       freelist->head = o;
       unlock(&global_mutex);
     }
  9. Atomic Test and Set Lock (flowchart): snapshot the current lock state while
     updating the state to Locked. Was the snapshot Locked? Yes: retry. No: done.
  10. typedef int spinlock;
      #define LOCKED 1
      #define UNLOCKED 0

      void lock(spinlock *m) {
        while (atomic_tas(m, LOCKED) == LOCKED)
          stall();
      }

      void unlock(spinlock *m) {
        atomic_store(m, UNLOCKED);
      }

      Many code examples derived from Concurrency Kit http://concurrencykit.org
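The slide's spinlock can be made runnable with C11 atomics. This is a sketch under assumptions: the slides' `atomic_tas` and `stall` are taken to correspond to C11's `atomic_exchange` and a busy-wait, and names like `spin_lock` and `run_locked_counter` are illustrative, not Concurrency Kit's actual API.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <pthread.h>

typedef atomic_int spinlock;
#define LOCKED 1
#define UNLOCKED 0

void spin_lock(spinlock *m) {
    /* atomic_exchange returns the previous value: if it was already
       LOCKED, another thread holds the lock, so keep spinning. */
    while (atomic_exchange(m, LOCKED) == LOCKED)
        ; /* stall() */
}

void spin_unlock(spinlock *m) {
    atomic_store(m, UNLOCKED);
}

/* Demo: nthreads threads each do `iters` guarded increments; if the
   lock is correct, the final count is exactly nthreads * iters. */
static spinlock g_lock = UNLOCKED;
static long g_counter;

static void *worker(void *arg) {
    long iters = (long)(intptr_t)arg;
    for (long i = 0; i < iters; i++) {
        spin_lock(&g_lock);
        g_counter++;
        spin_unlock(&g_lock);
    }
    return NULL;
}

long run_locked_counter(int nthreads, long iters) {
    pthread_t tids[16];
    g_counter = 0;
    for (int i = 0; i < nthreads && i < 16; i++)
        pthread_create(&tids[i], NULL, worker, (void *)(intptr_t)iters);
    for (int i = 0; i < nthreads && i < 16; i++)
        pthread_join(tids[i], NULL);
    return g_counter;
}
```

Every waiter hammers the lock word with atomic exchanges, which is exactly the cache-line contention the later slides measure.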
  11. [Timeline: A's TAS request/response and lock request/response alongside
      B's Store and unlock request/response; TAS & Store can't be reordered]
  12. [Same timeline: the other operations can be reordered]
  13. [Same timeline: the other operations can be reordered]
  14. void lock(spinlock *m) {
        while (atomic_tas(m, LOCKED) == LOCKED)
          stall();
      }

      void unlock(spinlock *m) {
        atomic_store(m, UNLOCKED);
      }
  15. obj *allocate() {
        lock(&global_lock);
        obj *h = freelist->head;
        freelist->head = h->next;
        unlock(&global_lock);
        return h;
      }

      void free(obj *o) {
        lock(&global_lock);
        o->next = freelist->head;
        freelist->head = o;
        unlock(&global_lock);
      }
  16. void lock(spinlock *m) {
        while (atomic_tas(m, LOCKED) == LOCKED)
          stall();
      }

      void unlock(spinlock *m) {
        atomic_store(m, UNLOCKED);
      }
  17. [Chart: Test and Set spinlock performance; locked alloc/free pairs in 10s
      vs. threads 1-16; 1 thread: 52,720,687]
  18. [Same chart; under contention throughput collapses to 2,876,615]
  19. [Same chart, full 1-16 thread curve]
  20. [Same chart on a log scale, 10 to 100,000,000]
  21. typedef int spinlock;
      #define LOCKED 1
      #define UNLOCKED 0

      void lock(spinlock *m) {
        while (atomic_tas(m, LOCKED) == LOCKED)
          stall();
      }
  22. Test and Test and Set

      void lock(spinlock *m) {
        while (atomic_tas(m, LOCKED) == LOCKED)
          while (*m == LOCKED)
            stall();
      }
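The test-and-test-and-set lock above is the talk's first deliberate race: the inner loop spins on a plain read that races with the owner's unlock, yet the race is benign because ownership is still decided only by the atomic exchange in the outer loop. A C11 sketch, assuming the slides' `atomic_tas` maps to `atomic_exchange` (the `ttas_` names are illustrative):

```c
#include <stdatomic.h>

typedef atomic_int spinlock;
#define LOCKED 1
#define UNLOCKED 0

void ttas_lock(spinlock *m) {
    while (atomic_exchange(m, LOCKED) == LOCKED)
        /* Spin locally on a (relaxed) load until the lock looks free.
           This read races with unlock on purpose: readers keep the
           cache line in shared state instead of bouncing it with
           exchanges, and a stale read only costs one extra retry. */
        while (atomic_load_explicit(m, memory_order_relaxed) == LOCKED)
            ; /* stall() */
}

void ttas_unlock(spinlock *m) {
    atomic_store(m, UNLOCKED);
}
```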
  23. [Chart: spinlock performance on a log scale, Test and Set vs. T&T&S,
      threads 1-16]
  24. TAS + Backoff

      void lock(spinlock *m) {
        uint64_t backoff = 0, exp = 0;
        while (atomic_tas(m, LOCKED) == LOCKED) {
          for (uint64_t b = 0; b < backoff; b++)
            stall();
          backoff = (1ULL << exp++);
        }
      }
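A runnable C11 version of the slide's exponential backoff lock. One addition not on the slide: the exponent is capped (`MAX_EXP` is an assumption) so the wait between retries cannot grow without bound; the other names (`backoff_lock`, the `atomic_exchange` mapping) are likewise illustrative.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef atomic_int spinlock;
#define LOCKED 1
#define UNLOCKED 0
#define MAX_EXP 16 /* cap on the backoff exponent: an assumption */

void backoff_lock(spinlock *m) {
    uint64_t backoff = 0, exp = 0;
    while (atomic_exchange(m, LOCKED) == LOCKED) {
        /* Wait roughly 2^exp iterations before retrying, so threads
           that collided spread their next attempts apart in time. */
        for (volatile uint64_t b = 0; b < backoff; b++)
            ; /* stall() */
        backoff = (1ULL << exp);
        if (exp < MAX_EXP)
            exp++;
    }
}

void backoff_unlock(spinlock *m) {
    atomic_store(m, UNLOCKED);
}
```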
  25. [Chart: spinlock performance, Test and Set vs. T&T&S vs. TAS + EB,
      threads 1-16]
  26. spinlock global_lock = UNLOCKED

      void lock(spinlock *m) {
        while (atomic_tas(m, LOCKED) == LOCKED)
          stall();
      }

      [shown twice: two threads contending for global_lock]
  27. [Same slide; global_lock = UNLOCKED]
  28. [Same slide; global_lock = LOCKED]
  29. [Same slide; global_lock = LOCKED]
  30. [Same slide; global_lock = LOCKED]
  31. A function is lock-free if at all times at least one thread is guaranteed
      to be making progress. (Herlihy & Shavit)
  32. obj *allocate() {
        /* TODO linearize */
        obj *h = freelist->head;
        freelist->head = h->next;
        return h;
      }

      void free(obj *o) {
        /* TODO linearize */
        o->next = freelist->head;
        freelist->head = o;
      }
  33. Compare-And-Swap (diagram): compare the old value with the contents of the
      destination address; if equal, copy the new value into the destination and
      return true; if not equal, return false.
  34. [Same diagram, annotated: the entire operation is Atomic]
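The slides' three-argument `cas(address, old, new)` maps naturally onto C11's `atomic_compare_exchange_strong`. The wrapper below is a sketch of that mapping, not Concurrency Kit's actual interface:

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct obj { struct obj *next; } obj;

/* Returns true and stores new_val at *address iff *address still
   equals old; otherwise leaves *address unchanged and returns false.
   The whole compare-and-copy is one atomic operation. */
bool cas(obj *_Atomic *address, obj *old, obj *new_val) {
    /* C11 writes the observed value back into 'expected' on failure;
       a local copy keeps the caller's 'old' untouched, matching the
       slides' semantics of simply returning false. */
    obj *expected = old;
    return atomic_compare_exchange_strong(address, &expected, new_val);
}
```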
  35. obj *allocate(slab *s) {
        obj *a, *b;
        do {
          a = s->head;
          b = a->next;
        } while (!cas(&s->head, a, b));
        return a;
      }

      [diagram: slab head → A → B → …]
  36. [Same code; diagram: slab head → A → B → …]
  37. [Same code; diagram: slab head → A → B → …]
  38. [Same code; diagram: A detached, slab head → B → …]
  39. [Same code; CAS operands highlighted: a, &s->head, b]
  40. [Same code; diagram: slab head → Z; CAS operands a, &s->head, b]
  41. [Same code; diagram: slab head → Z → A → B; the snapshot in a is stale]
  42. [Same code; diagram: slab head → B → …; the CAS succeeds on retry]
  43. void free(slab *s, obj *o) {
        do {
          obj *t = s->head;
          o->next = t;
        } while (!cas(&s->head, t, o));
      }

      [diagram: slab head → B → …]
  44. [Same code; diagram: slab head → A → B → …]
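The allocate/free pair from these slides can be sketched as a runnable C11 lock-free freelist. `free` is renamed `slab_free` here to avoid shadowing stdlib `free()`, and `slab_allocate`/`atomic_compare_exchange_weak` are assumed mappings of the slides' `allocate`/`cas`. As the next slides show, this naive version is still vulnerable to the ABA problem.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct obj { struct obj *next; } obj;
typedef struct slab { obj *_Atomic head; } slab;

obj *slab_allocate(slab *s) {
    obj *a, *b;
    do {
        a = atomic_load(&s->head);
        b = a->next;
        /* The CAS succeeds only if head is still 'a'; otherwise some
           other thread won the race and we retry with a fresh snapshot. */
    } while (!atomic_compare_exchange_weak(&s->head, &a, b));
    return a;
}

void slab_free(slab *s, obj *o) {
    obj *t;
    do {
        t = atomic_load(&s->head);
        o->next = t;
    } while (!atomic_compare_exchange_weak(&s->head, &t, o));
}
```

The race on `head` is detected rather than prevented: losing the CAS is the signal that another thread interleaved, and the loop simply retries.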
  45. obj *allocate(slab *s) {
        obj *a, *b;
        do {
          a = s->head;
          b = a->next;
        } while (!cas(&s->head, a, b));
        return a;
      }

      [diagram: slab head → A → B → C]
  46. [Same code; diagram: slab head → A → B → C]
  47. [Same code; one thread begins some_object = allocate(&shared_slab)]
  48. [Same code; A detached: slab head → B → C]
  49. [Same code; a second thread begins another_obj = allocate(&shared_slab)]
  50. [Same code; the second thread snapshots a = A, b = B]
  51. [Same code; the first thread calls free(some_object)]
  52. [Same code; A is pushed back: slab head → A again, stale snapshot in hand]
  53. [Same code; the second thread's CAS succeeds anyway: slab head → B]
  54. [Same code; diagram shows B twice: the list is corrupted]
  55. The ABA Problem: "A reference about to be modified by a CAS changes from a
      to b and back to a again. As a result, the CAS succeeds even though its
      effect on the data structure has changed and no longer has the desired
      effect." (Herlihy & Shavit, p. 235)
  56. [diagram: slab head, generation 166 → A → B → …]

      obj *allocate(slab *s) {
        obj *a, *b;
        do {
          a = s->head;
          b = a->next;
        } while (!cas(&s->head, a, b));
        return a;
      }
  57. obj *allocate(slab *s) {
        slab orig, update;
        do {
          orig.gen = s->gen;
          orig.head = s->head;
          update.gen = orig.gen + 1;
          update.head = orig.head->next;
        } while (!dcas(s, &orig, &update));
        return orig.head;
      }

      [diagram: slab head, generation 166 → A → B → …]
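The slide's `dcas` requires a double-width CAS, which C11 does not expose portably. A common alternative, sketched here under illustrative names (the pool size, `pack`, and the `gen_` functions are all assumptions, not the talk's allocator), packs the generation counter and an array index into a single 64-bit word so that one ordinary CAS covers both:

```c
#include <stdatomic.h>
#include <stdint.h>

#define NOBJ 64

typedef struct { int32_t next; } obj;  /* freelist linked by index */
static obj pool[NOBJ];
static _Atomic uint64_t slab_head;     /* (generation << 32) | index */

static uint64_t pack(uint32_t gen, int32_t idx) {
    return ((uint64_t)gen << 32) | (uint32_t)idx;
}

void slab_init(void) {
    for (int32_t i = 0; i < NOBJ - 1; i++)
        pool[i].next = i + 1;
    pool[NOBJ - 1].next = -1;          /* -1 marks end of freelist */
    atomic_store(&slab_head, pack(0, 0));
}

int32_t gen_allocate(void) {
    uint64_t orig, update;
    int32_t idx;
    do {
        orig = atomic_load(&slab_head);
        idx = (int32_t)(uint32_t)orig;
        if (idx < 0)
            return -1;                 /* empty */
        /* Bump the generation so a concurrent free/alloc of the same
           index can't make a stale CAS succeed (the ABA problem). */
        update = pack((uint32_t)(orig >> 32) + 1, pool[idx].next);
    } while (!atomic_compare_exchange_weak(&slab_head, &orig, update));
    return idx;
}

void gen_free(int32_t idx) {
    uint64_t orig, update;
    do {
        orig = atomic_load(&slab_head);
        pool[idx].next = (int32_t)(uint32_t)orig;
        update = pack((uint32_t)(orig >> 32) + 1, idx);
    } while (!atomic_compare_exchange_weak(&slab_head, &orig, update));
}
```

The trade-off versus a true double-width CAS is a smaller address space (indices instead of pointers) in exchange for portability.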
  58. void free(slab *s, obj *o) {
        do {
          obj *t = s->head;
          o->next = t;
        } while (!cas(&s->head, t, o));
      }
  59. [Chart: allocator throughput, alloc/free pairs in 10s vs. threads 1-16,
      comparing TAS, T&T&S, TAS + EB, Concurrent Allocator, and pthread]
  60. [Same chart on a log scale, 10 to 100,000,000]
  61. Further Reading
      • "Is Parallel Programming Hard, And, If So, What Can You Do About It?",
        created and edited by Paul McKenney,
        https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html
      • "Nonblocking Algorithms and Scalable Multicore Programming", Samy Al Bahra,
        https://queue.acm.org/detail.cfm?id=2492433
      • "What Every Programmer Should Know About Memory", Ulrich Drepper,
        http://www.akkadia.org/drepper/cpumemory.pdf
  62. Further Reading (cont.)
      • "The C++ Memory Model Meets High-Update-Rate Data Structures", Paul McKenney,
        http://www.rdrop.com/~paulmck/RCU/C++Updates.2014.09.11a.pdf,
        https://www.youtube.com/watch?v=1Q-RH2tiyt0
      • "Obstruction-Free Algorithms can be Practically Wait-Free": Fich, Luchangco,
        Moir, Shavit, 2005, http://people.csail.mit.edu/shanir/publications/DISC2005.pdf
      • "Are Lock-Free Concurrent Algorithms Practically Wait-Free?": Alistarh,
        Censor-Hillel, Shavit, 2013, http://arxiv.org/abs/1311.3200
  63. Image Credits
      • I assume reproduction rights to all images under fair use; this slide is
        for reference purposes and fair attribution.
      • Mario, Mario Kart, and other related franchises are registered trademarks
        of Nintendo and its associates. I am in no way affiliated with or endorsed
        by Nintendo.
      • Mario Kart 8 screenshot from eBash video game centers at
        http://ebash.com/wp-content/uploads/2015/02/mariokart.jpg
      • Lakitu stop light animation from IGN at
        http://31.media.tumblr.com/bd374359bd39369cdbf25755d4b2e570/tumblr_mfj2dnGBTL1rfjowdo1_500.gif
      • Brittney Griner blocking from Bleacher Report, but high-res source seems
        to have disappeared :(