Racing to Win: Using Race Conditions in Correct Concurrent Software

If you've ever worked on concurrent or parallel systems, race conditions have invariably plagued your existence. They are difficult to identify and debug, and nearly impossible to test repeatably. While race conditions intuitively seem bad, it turns out there are cases in which we can use them to our advantage! In this talk, we'll discuss a number of ways that race conditions -- and correctly detecting them -- are used in improving throughput and reducing latency in high-performance systems.

We begin this exploration with a brief discussion of the various types of locks, non-blocking algorithms, and the benefits thereof. We'll look at a naive test-and-set spinlock and show how introducing a race condition on reads significantly improves lock acquisition throughput. From here, we'll investigate non-blocking algorithms and how they incorporate detection of race events to ensure correct, deterministic, and bounded behavior by analyzing a durable, lock-free memory allocator written in C using the Concurrency Kit library.

Videos of this talk are available at:
* Strange Loop 2015 https://www.youtube.com/watch?v=3LcNHxBJw2Q
* OSCON EU 2015 https://www.youtube.com/watch?v=jmSiMCENcVY


Devon H. O'Dell

September 26, 2015

Transcript

  1. Racing to Win: Utilizing race conditions to build correct concurrent software
     Devon H. O'Dell | Engineer @Fastly | @dhobsd | devon.odell@gmail.com | https://9vx.org/
  2. Devon H. O'Dell, @dhobsd • Performance and debugging nut • Zappa fan
  3. None
  4. None
  5. None
  6. [Diagram] Processes A, B, C (each with stack, heap, text) and a cache node
  7. [Diagram] A cache node with processes A, B, C (each with stack, heap, text) and μslab
  8. Slab allocator

  9. [Diagram] A slab containing objects
  10. [Diagram] Allocation: alloc() calls take objects from the slab
  11. [Diagram] Freeing: free() calls return objects to the slab
  12. [Diagram] The slab's objects
  13. Allocator Protocol •A request to allocate receives a response containing

    an object •A request to free receives a response when the supplied object is freed •Allocate must not return allocated object •Free must not release unallocated object
  14. Execution Histories [timeline: A(allocate request) → A(allocate response); B(allocate request) → B(allocate response)]
  15. Protocol Violation! [timeline: the same operations, but with responses that violate the allocator protocol]
  16. [Timeline: A(allocation request), B(allocation request), A(allocation response), B(allocation response): the two calls overlap]
  17. https://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf 1990

  18. Sequential History [timeline: { A(allocate request) A(allocate response) } { A(free request) A(free response) } { B(allocate request) B(allocate response) }]
  19. [The same sequential history, next animation step]
  20. obj *allocate() {
        obj *h = freelist->head;
        freelist->head = h->next;
        return h;
      }

      void free(obj *o) {
        o->next = freelist->head;
        freelist->head = o;
      }
  21. obj *allocate() {
        lock(&global_mutex);
        obj *h = freelist->head;
        freelist->head = h->next;
        unlock(&global_mutex);
        return h;
      }

      void free(obj *o) {
        lock(&global_mutex);
        o->next = freelist->head;
        freelist->head = o;
        unlock(&global_mutex);
      }
  22. Atomic Test and Set: Lock [flowchart: snapshot current lock state; update state to Locked; was the snapshot Locked? Yes: retry. No: done]
  23. Atomic Test and Set: Unlock [flowchart: set state to Unlocked]
  24. typedef int spinlock;
      #define LOCKED 1
      #define UNLOCKED 0

      void lock(spinlock *m) {
        while (atomic_tas(m, LOCKED) == LOCKED)
          stall();
      }

      void unlock(spinlock *m) {
        atomic_store(m, UNLOCKED);
      }

      Many code examples derived from Concurrency Kit: http://concurrencykit.org
  25. void lock(spinlock *m) {
        while (atomic_tas(m, LOCKED) == LOCKED)
          stall();
      }

      A(TAS request) A(TAS response) { }
  26. [Timeline: { A(TAS request) A(TAS response) } nested inside A(lock request) / A(lock response): TAS is embedded in Lock]
  27. [Timeline: A's TAS nested in lock(); B's { B(Store request) B(Store response) } nested in unlock(): TAS & Store can't be reordered]
  28. All execution histories ⊇ all sequentially-consistent execution histories ⊇ all ??? execution histories
  29. All execution histories ⊇ all sequentially-consistent execution histories ⊇ all linearizable execution histories
  30. http://dl.acm.org/citation.cfm?id=176576 1994

  31. Linearizability • Easier to use in formal verification • Applies to individual objects • Composable
  32. [Timeline: A's TAS nested in lock(); B's Store nested in unlock(): others can be reordered]
  33. [The same timeline, next animation step]
  34. void lock(spinlock *m) {
        while (atomic_tas(m, LOCKED) == LOCKED)
          stall();
      }

      void unlock(spinlock *m) {
        atomic_store(m, UNLOCKED);
      }
  35. http://dl.acm.org/citation.cfm?id=69624.357207 1983

  36. obj *allocate() {
        lock(&global_lock);
        obj *h = freelist->head;
        freelist->head = h->next;
        unlock(&global_lock);
        return h;
      }

      void free(obj *o) {
        lock(&global_lock);
        o->next = freelist->head;
        freelist->head = o;
        unlock(&global_lock);
      }
  37. void lock(spinlock *m) {
        while (atomic_tas(m, LOCKED) == LOCKED)
          stall();
      }

      void unlock(spinlock *m) {
        atomic_store(m, UNLOCKED);
      }
  38. [Chart] Test and Set Spinlock Performance: locked alloc/free pairs in 10s vs. threads 1–16; 52,720,687 highlighted
  39. [Chart] The same, with 2,876,615 also highlighted
  40. [Chart] Test and Set Spinlock Performance, linear scale (10,000,000–60,000,000)
  41. [Chart] Test and Set Spinlock Performance, log scale (10–100,000,000)
  42. typedef int spinlock;
      #define LOCKED 1
      #define UNLOCKED 0

      void lock(spinlock *m) {
        while (atomic_tas(m, LOCKED) == LOCKED)
          stall();
      }
  43. Test and Test and Set:

      void lock(spinlock *m) {
        while (atomic_tas(m, LOCKED) == LOCKED)
          while (*m == LOCKED)
            stall();
      }
  44. [Chart] Spinlock Performance, log scale: Test and Set vs. T&T&S
  45. TAS + Backoff:

      void lock(spinlock *m) {
        uint64_t backoff = 0, exp = 0;
        while (atomic_tas(m, LOCKED) == LOCKED) {
          for (uint64_t b = 0; b < backoff; b++)
            stall();
          backoff = (1ULL << exp++);
        }
      }
  46. [Chart] Spinlock Performance: Test and Set vs. T&T&S vs. TAS + EB
  47. [Animation, slides 47–51] Two threads each run the same lock(); global_lock = UNLOCKED

      void lock(spinlock *m) {
        while (atomic_tas(m, LOCKED) == LOCKED)
          stall();
      }

  48. [global_lock = UNLOCKED]
  49. [global_lock = LOCKED]
  50. [global_lock = LOCKED]
  51. [global_lock = LOCKED: a thread spinning on a held lock makes no progress]
  52. A function is lock-free if at all times at least

    one thread is guaranteed to be making progress. (Herlihy & Shavit)
  53. obj *allocate() {
        /* TODO linearize */
        obj *h = freelist->head;
        freelist->head = h->next;
        return h;
      }

      void free(obj *o) {
        /* TODO linearize */
        o->next = freelist->head;
        freelist->head = o;
      }
  54. Non-Blocking Algorithms

  55. Compare and Swap

  56. [Flowchart] Compare-And-Swap: compare the old value with *destination address
  57. [Flowchart] If they differ (≠), return false
  58. [Flowchart] If they are equal (=), copy the new value to *destination address and return true
  59. [Flowchart] ...and all of it happens atomically
  60. obj *allocate(slab *s) {
        obj *a, *b;
        do {
          a = s->head;
          b = a->next;
        } while (!cas(&s->head, a, b));
        return a;
      }

      [Diagram: slab head → A → B → …]
  61.–67. [Animation] allocate() as on slide 60, stepping through one call: a = s->head picks up A; b = a->next picks up B; cas(&s->head, a, b) compares a against the current head. If another thread has meanwhile changed the head (to Z), the compare fails and the do/while retries; when the compare succeeds, the head is swung to B and A is returned. [Final state: slab head → B → …]
  68. void free(slab *s, obj *o) {
        do {
          obj *t = s->head;
          o->next = t;
        } while (!cas(&s->head, t, o));
      }

      [Diagram: slab head → B → …]

  69. [Diagram: slab head → A → B → …; the same free() code]
  70. [Diagram: slab head → A → B → C]
  71.–80. [Animation] The ABA race, with allocate() as on slide 60 and slab head → A → B → C. One thread begins allocate(), reads a = A and b = B, and stalls before its CAS. Meanwhile another thread runs some_object = allocate(&shared_slab) (taking A), another_obj = allocate(&shared_slab) (taking B), and free(some_object) (pushing A back). The stalled thread's CAS now compares A against A, succeeds, and installs B as the head even though B is still allocated. [Final state: slab head → B, with B also held by the other thread]
  81. The ABA Problem “A reference about to be modified by

    a CAS changes from a to b and back to a again. As a result, the CAS succeeds even though its effect on the data structure has changed and no longer has the desired effect.” —Herlihy & Shavit, p. 235
  82. [Diagram: slab head → A → B → …; generation 166]

      obj *allocate(slab *s) {
        obj *a, *b;
        do {
          a = s->head;
          b = a->next;
        } while (!cas(&s->head, a, b));
        return a;
      }
  83. obj *allocate(slab *s) {
        slab orig, update;
        do {
          orig.gen = s->gen;
          orig.head = s->head;
          update.gen = orig.gen + 1;
          update.head = orig.head->next;
        } while (!dcas(s, &orig, &update));
        return orig.head;
      }

      [Diagram: slab head → A → B → …; generation 166]
  84. void free(slab *s, obj *o) {
        do {
          obj *t = s->head;
          o->next = t;
        } while (!cas(&s->head, t, o));
      }
  85. [Chart] Allocator Throughput: alloc/free pairs in 10s vs. threads 1–16, for TAS, T&T&S, TAS + EB, Concurrent Allocator, and pthread
  86. [Chart] The same, log scale
  87.–90. [Charts] CPU Cycles (rdtscp)
  91. Takeaways

  92. uslab.io

  93. None
  94. Thanks @dhobsd | https://9vx.org | http://uslab.io Come see us at

    the Fastly booth!
  95. Further Reading
      • "Is Parallel Programming Hard, And, If So, What Can You Do About It?", created and edited by Paul McKenney, https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html
      • "Nonblocking Algorithms and Scalable Multicore Programming", Samy Al Bahra, https://queue.acm.org/detail.cfm?id=2492433
      • "What Every Programmer Should Know About Memory", Ulrich Drepper, http://www.akkadia.org/drepper/cpumemory.pdf
  96. Further Reading
      • "The C++ Memory Model Meets High-Update-Rate Data Structures", Paul McKenney, http://www.rdrop.com/~paulmck/RCU/C++Updates.2014.09.11a.pdf, https://www.youtube.com/watch?v=1Q-RH2tiyt0
      • "Obstruction-Free Algorithms can be Practically Wait-Free", Fich, Luchangco, Moir, Shavit, 2005, http://people.csail.mit.edu/shanir/publications/DISC2005.pdf
      • "Are Lock-Free Concurrent Algorithms Practically Wait-Free?", Alistarh, Censor-Hillel, Shavit, 2013, http://arxiv.org/abs/1311.3200
  97. Further Reading
      • "Lock-Free By Example", Tony Van Eerd, https://www.youtube.com/watch?v=Xf35TLFKiO8
      • Concurrency Kit: http://concurrencykit.org
      • µSlab: http://uslab.io
  98. Image Credits
      • I assume reproduction rights to all images under fair use; this slide is for reference purposes and fair attribution.
      • Mario, Mario Kart, and other related franchises are registered trademarks of Nintendo and its associates. I am in no way affiliated with or endorsed by Nintendo.
      • Mario Kart 8 screenshot from eBash video game centers at http://ebash.com/wp-content/uploads/2015/02/mariokart.jpg
      • Lakitu stop light animation from IGN at http://31.media.tumblr.com/bd374359bd39369cdbf25755d4b2e570/tumblr_mfj2dnGBTL1rfjowdo1_500.gif
      • Brittney Griner blocking from Bleacher Report, but high-res source seems to have disappeared :(