Racing to Win: Using Race Conditions in Correct Concurrent Software

If you've ever worked on concurrent or parallel systems, race conditions have invariably plagued your existence. They are difficult to identify and debug, and nearly impossible to test repeatably. While race conditions intuitively seem bad, it turns out there are cases in which we can use them to our advantage! In this talk, we'll discuss a number of ways that race conditions -- and correctly detecting them -- are used in improving throughput and reducing latency in high-performance systems.

We begin this exploration with a brief discussion of the various types of locks, non-blocking algorithms, and the benefits thereof. We'll look at a naive test-and-set spinlock and show how introducing a race condition on reads significantly improves lock acquisition throughput. From here, we'll investigate non-blocking algorithms and how they incorporate detection of race events to ensure correct, deterministic, and bounded behavior by analyzing a durable, lock-free memory allocator written in C using the Concurrency Kit library.

Videos of this talk are available at:
* Strange Loop 2015 https://www.youtube.com/watch?v=3LcNHxBJw2Q
* OSCON EU 2015 https://www.youtube.com/watch?v=jmSiMCENcVY


Devon H. O'Dell

September 26, 2015

Transcript

  1. Racing to Win: Utilizing race conditions to build correct concurrent software
     Devon H. O'Dell | Engineer @Fastly | @dhobsd | devon.odell@gmail.com | https://9vx.org/
  2. Devon H. O'Dell, @dhobsd • Performance and debugging nut • Zappa fan
  3. None
  4. None
  5. None
  6. [Diagram] Processes A, B, C (each with stack, heap, text) and a cache node
  7. [Diagram] A cache node with processes A, B, C (each with stack, heap, text) and μslab
  8. Slab allocator

  9. [Diagram] A slab containing objects
  10. [Diagram] Allocation: alloc() calls take objects from the slab
  11. [Diagram] Freeing: free() calls return objects to the slab
  12. [Diagram] The slab's objects
  13. Allocator Protocol •A request to allocate receives a response containing

    an object •A request to free receives a response when the supplied object is freed •Allocate must not return allocated object •Free must not release unallocated object
  14. Execution Histories [timeline: A(allocate request) → A(allocate response); B(allocate request) → B(allocate response)]
  15. Protocol Violation! [timeline: the same operations, but with responses that violate the allocator protocol]
  16. [Timeline: A(allocation request), B(allocation request), A(allocation response), B(allocation response): the two calls overlap]
  17. https://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf 1990

  18. Sequential History [timeline: { A(allocate request) A(allocate response) } { A(free request) A(free response) } { B(allocate request) B(allocate response) }]
  19. [The same sequential history, next animation step]
  20. obj *allocate() {
        obj *h = freelist->head;
        freelist->head = h->next;
        return h;
      }

      void free(obj *o) {
        o->next = freelist->head;
        freelist->head = o;
      }
  21. obj *allocate() {
        lock(&global_mutex);
        obj *h = freelist->head;
        freelist->head = h->next;
        unlock(&global_mutex);
        return h;
      }

      void free(obj *o) {
        lock(&global_mutex);
        o->next = freelist->head;
        freelist->head = o;
        unlock(&global_mutex);
      }
  22. Atomic Test and Set: Lock [flowchart: snapshot current lock state; update state to Locked; was the snapshot Locked? Yes: retry. No: done]
  23. Atomic Test and Set: Unlock [flowchart: set state to Unlocked]
  24. typedef int spinlock;
      #define LOCKED 1
      #define UNLOCKED 0

      void lock(spinlock *m) {
        while (atomic_tas(m, LOCKED) == LOCKED)
          stall();
      }

      void unlock(spinlock *m) {
        atomic_store(m, UNLOCKED);
      }

      Many code examples derived from Concurrency Kit: http://concurrencykit.org
  25. void lock(spinlock *m) {
        while (atomic_tas(m, LOCKED) == LOCKED)
          stall();
      }

      A(TAS request) A(TAS response) { }
  26. [Timeline: { A(TAS request) A(TAS response) } nested inside A(lock request) / A(lock response): TAS is embedded in Lock]
  27. [Timeline: A's TAS nested in lock(); B's { B(Store request) B(Store response) } nested in unlock(): TAS & Store can't be reordered]
  28. All execution histories ⊇ all sequentially-consistent execution histories ⊇ all ??? execution histories
  29. All execution histories ⊇ all sequentially-consistent execution histories ⊇ all linearizable execution histories
  30. http://dl.acm.org/citation.cfm?id=176576 1994

  31. Linearizability • Easier to use in formal verification • Applies to individual objects • Composable
  32. [Timeline: A's TAS nested in lock(); B's Store nested in unlock(): others can be reordered]
  33. [The same timeline, next animation step]
  34. void lock(spinlock *m) {
        while (atomic_tas(m, LOCKED) == LOCKED)
          stall();
      }

      void unlock(spinlock *m) {
        atomic_store(m, UNLOCKED);
      }
  35. http://dl.acm.org/citation.cfm?id=69624.357207 1983

  36. obj *allocate() {
        lock(&global_lock);
        obj *h = freelist->head;
        freelist->head = h->next;
        unlock(&global_lock);
        return h;
      }

      void free(obj *o) {
        lock(&global_lock);
        o->next = freelist->head;
        freelist->head = o;
        unlock(&global_lock);
      }
  37. void lock(spinlock *m) {
        while (atomic_tas(m, LOCKED) == LOCKED)
          stall();
      }

      void unlock(spinlock *m) {
        atomic_store(m, UNLOCKED);
      }
  38. [Chart] Test and Set Spinlock Performance: locked alloc/free pairs in 10s vs. threads 1–16; 52,720,687 highlighted
  39. [Chart] The same, with 2,876,615 also highlighted
  40. [Chart] Test and Set Spinlock Performance, linear scale (10,000,000–60,000,000)
  41. [Chart] Test and Set Spinlock Performance, log scale (10–100,000,000)
  42. typedef int spinlock;
      #define LOCKED 1
      #define UNLOCKED 0

      void lock(spinlock *m) {
        while (atomic_tas(m, LOCKED) == LOCKED)
          stall();
      }
  43. Test and Test and Set:

      void lock(spinlock *m) {
        while (atomic_tas(m, LOCKED) == LOCKED)
          while (*m == LOCKED)
            stall();
      }
  44. [Chart] Spinlock Performance, log scale: Test and Set vs. T&T&S
  45. TAS + Backoff:

      void lock(spinlock *m) {
        uint64_t backoff = 0, exp = 0;
        while (atomic_tas(m, LOCKED) == LOCKED) {
          for (uint64_t b = 0; b < backoff; b++)
            stall();
          backoff = (1ULL << exp++);
        }
      }
  46. [Chart] Spinlock Performance: Test and Set vs. T&T&S vs. TAS + EB
  47. [Animation, slides 47–51] Two threads each run the same lock(); global_lock = UNLOCKED

      void lock(spinlock *m) {
        while (atomic_tas(m, LOCKED) == LOCKED)
          stall();
      }

  48. [global_lock = UNLOCKED]
  49. [global_lock = LOCKED]
  50. [global_lock = LOCKED]
  51. [global_lock = LOCKED: a thread spinning on a held lock makes no progress]
  52. A function is lock-free if at all times at least

    one thread is guaranteed to be making progress. (Herlihy & Shavit)
  53. obj *allocate() {
        /* TODO linearize */
        obj *h = freelist->head;
        freelist->head = h->next;
        return h;
      }

      void free(obj *o) {
        /* TODO linearize */
        o->next = freelist->head;
        freelist->head = o;
      }
  54. Non-Blocking Algorithms

  55. Compare and Swap

  56. [Flowchart] Compare-And-Swap: compare the old value with *destination address
  57. [Flowchart] If they differ (≠), return false
  58. [Flowchart] If they are equal (=), copy the new value to *destination address and return true
  59. [Flowchart] ...and all of it happens atomically
  60. obj *allocate(slab *s) {
        obj *a, *b;
        do {
          a = s->head;
          b = a->next;
        } while (!cas(&s->head, a, b));
        return a;
      }

      [Diagram: slab head → A → B → …]
  61.–67. [Animation] allocate() as on slide 60, stepping through one call: a = s->head picks up A; b = a->next picks up B; cas(&s->head, a, b) compares a against the current head. If another thread has meanwhile changed the head (to Z), the compare fails and the do/while retries; when the compare succeeds, the head is swung to B and A is returned. [Final state: slab head → B → …]
  68. void free(slab *s, obj *o) {
        do {
          obj *t = s->head;
          o->next = t;
        } while (!cas(&s->head, t, o));
      }

      [Diagram: slab head → B → …]

  69. [Diagram: slab head → A → B → …; the same free() code]
  70. [Diagram: slab head → A → B → C]
  71.–80. [Animation] The ABA race, with allocate() as on slide 60 and slab head → A → B → C. One thread begins allocate(), reads a = A and b = B, and stalls before its CAS. Meanwhile another thread runs some_object = allocate(&shared_slab) (taking A), another_obj = allocate(&shared_slab) (taking B), and free(some_object) (pushing A back). The stalled thread's CAS now compares A against A, succeeds, and installs B as the head even though B is still allocated. [Final state: slab head → B, with B also held by the other thread]
  81. The ABA Problem “A reference about to be modified by

    a CAS changes from a to b and back to a again. As a result, the CAS succeeds even though its effect on the data structure has changed and no longer has the desired effect.” —Herlihy & Shavit, p. 235
  82. [Diagram: slab head → A → B → …; generation 166]

      obj *allocate(slab *s) {
        obj *a, *b;
        do {
          a = s->head;
          b = a->next;
        } while (!cas(&s->head, a, b));
        return a;
      }
  83. obj *allocate(slab *s) {
        slab orig, update;
        do {
          orig.gen = s->gen;
          orig.head = s->head;
          update.gen = orig.gen + 1;
          update.head = orig.head->next;
        } while (!dcas(s, &orig, &update));
        return orig.head;
      }

      [Diagram: slab head → A → B → …; generation 166]
  84. void free(slab *s, obj *o) {
        do {
          obj *t = s->head;
          o->next = t;
        } while (!cas(&s->head, t, o));
      }
  85. [Chart] Allocator Throughput: alloc/free pairs in 10s vs. threads 1–16, for TAS, T&T&S, TAS + EB, Concurrent Allocator, and pthread
  86. [Chart] The same, log scale
  87.–90. [Charts] CPU Cycles (rdtscp)
  91. Takeaways

  92. uslab.io

  93. None
  94. Thanks @dhobsd | https://9vx.org | http://uslab.io Come see us at

    the Fastly booth!
  95. Further Reading
      • "Is Parallel Programming Hard, And, If So, What Can You Do About It?", created and edited by Paul McKenney, https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html
      • "Nonblocking Algorithms and Scalable Multicore Programming", Samy Al Bahra, https://queue.acm.org/detail.cfm?id=2492433
      • "What Every Programmer Should Know About Memory", Ulrich Drepper, http://www.akkadia.org/drepper/cpumemory.pdf
  96. Further Reading
      • "The C++ Memory Model Meets High-Update-Rate Data Structures", Paul McKenney, http://www.rdrop.com/~paulmck/RCU/C++Updates.2014.09.11a.pdf, https://www.youtube.com/watch?v=1Q-RH2tiyt0
      • "Obstruction-Free Algorithms can be Practically Wait-Free", Fich, Luchangco, Moir, Shavit, 2005, http://people.csail.mit.edu/shanir/publications/DISC2005.pdf
      • "Are Lock-Free Concurrent Algorithms Practically Wait-Free?", Alistarh, Censor-Hillel, Shavit, 2013, http://arxiv.org/abs/1311.3200
  97. Further Reading
      • "Lock-Free By Example", Tony Van Eerd, https://www.youtube.com/watch?v=Xf35TLFKiO8
      • Concurrency Kit: http://concurrencykit.org
      • µSlab: http://uslab.io
  98. Image Credits
      • I assume reproduction rights to all images under fair use; this slide is for reference purposes and fair attribution.
      • Mario, Mario Kart, and other related franchises are registered trademarks of Nintendo and its associates. I am in no way affiliated with or endorsed by Nintendo.
      • Mario Kart 8 screenshot from eBash video game centers at http://ebash.com/wp-content/uploads/2015/02/mariokart.jpg
      • Lakitu stop light animation from IGN at http://31.media.tumblr.com/bd374359bd39369cdbf25755d4b2e570/tumblr_mfj2dnGBTL1rfjowdo1_500.gif
      • Brittney Griner blocking from Bleacher Report, but high-res source seems to have disappeared :(