Racing to Win: Using Race Conditions in Correct Concurrent Software

If you've ever worked on concurrent or parallel systems, race conditions have invariably plagued your existence. They are difficult to identify and debug, and nearly impossible to test repeatably. While race conditions intuitively seem bad, it turns out there are cases in which we can use them to our advantage! In this talk, we'll discuss a number of ways that race conditions -- and correctly detecting them -- are used to improve throughput and reduce latency in high-performance systems.

We begin this exploration with a brief discussion of the various types of locks, non-blocking algorithms, and the benefits thereof. We'll look at a naive test-and-set spinlock and show how introducing a race condition on reads significantly improves lock acquisition throughput. From here, we'll investigate non-blocking algorithms and how they incorporate detection of race events to ensure correct, deterministic, and bounded behavior by analyzing a durable, lock-free memory allocator written in C using the Concurrency Kit library.

Videos of this talk are available at:
* Strange Loop 2015 https://www.youtube.com/watch?v=3LcNHxBJw2Q
* OSCON EU 2015 https://www.youtube.com/watch?v=jmSiMCENcVY

Devon H. O'Dell

September 26, 2015

Transcript

  1. Utilizing race conditions to build correct concurrent software
    Racing to Win
    Devon H. O’Dell | Engineer @Fastly | @dhobsd | [email protected] | https://9vx.org/

  2. •Devon H. O’Dell, @dhobsd
    •Performance and debugging nut
    •Zappa fan

  6. Diagram: a cache node running processes A, B, and C, each with its own stack, heap, and text segments.

  7. Diagram: the same cache node, with μslab added alongside processes A, B, and C.

  8. Slab allocator

  9. Slab (diagram): a slab containing a row of objects.

  10. Allocation (diagram): three alloc() calls each take an object from the slab.

  11. Freeing (diagram): three free( ) calls return objects to the slab.

  12. Diagram: a row of objects in the slab.
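
For readers following along in code, a minimal sketch of a slab laid out as a fixed array of objects threaded onto a freelist is below. The structure names, object count, and payload size are illustrative assumptions; µslab's actual layout is not shown in the deck.

    #include <stddef.h>
    #include <stdio.h>

    #define SLAB_OBJECTS 11              /* illustrative count */

    typedef struct obj {
        struct obj *next;                /* freelist link */
        unsigned char data[48];          /* illustrative payload */
    } obj;

    typedef struct slab {
        obj *head;                       /* first free object */
        obj objects[SLAB_OBJECTS];
    } slab;

    /* Thread every object onto the freelist. */
    static void slab_init(slab *s) {
        for (size_t i = 0; i + 1 < SLAB_OBJECTS; i++)
            s->objects[i].next = &s->objects[i + 1];
        s->objects[SLAB_OBJECTS - 1].next = NULL;
        s->head = &s->objects[0];
    }

    int main(void) {
        static slab s;
        slab_init(&s);
        printf("first free object: %p\n", (void *)s.head);
        return 0;
    }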

  13. Allocator Protocol
    •A request to allocate receives a response containing an object
    •A request to free receives a response when the supplied object is freed
    •Allocate must not return an already-allocated object
    •Free must not release an unallocated object
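
As a single-threaded illustration of these rules, here is a hedged sketch that mirrors the freelist on slide 20 and encodes the two "must not" clauses as assertions. The allocated flag and the names allocate/release are illustration devices, not the talk's API.

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct obj { struct obj *next; bool allocated; } obj;

    static obj pool[3] = { { &pool[1], false }, { &pool[2], false }, { NULL, false } };
    static obj *freelist = &pool[0];

    /* A request to allocate receives an object, and never an already-allocated one. */
    static obj *allocate(void) {
        obj *h = freelist;
        assert(h != NULL && !h->allocated);
        freelist = h->next;
        h->allocated = true;
        return h;
    }

    /* A request to free completes only for an object that is currently allocated. */
    static void release(obj *o) {
        assert(o->allocated);
        o->allocated = false;
        o->next = freelist;
        freelist = o;
    }

    int main(void) {
        obj *a = allocate();
        release(a);
        return 0;
    }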

  14. Execution Histories
    Time
    A(allocate response)
    A(allocate request)
    B(allocate request)
    B(allocate response)

  15. Protocol Violation!
    Time
    A(allocate response)
    A(allocate request)
    B(allocate request)
    B(allocate response)

  16. A(allocation request)
    B(allocation request)
    A(allocation response)
    B(allocation response)
    Time

  17. https://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf
    1990

  18. Sequential History
    Time
    A(allocate request)
    A(allocate response)
    { }
    A(free request)
    A(free response)
    { }
    B(allocate request)
    B(allocate response)
    { }

  19. Sequential History
    Time
    A(allocate request)
    A(allocate response)
    { }
    A(free request)
    A(free response)
    { }
    B(allocate request)
    B(allocate response)
    { }

  20. obj *allocate() {


    obj *h = freelist->head;
    freelist->head = h->next;

    return h;

    }
    void free(obj *o) {
    o->next = freelist->head;
    freelist->head = o;
    }

  21. obj *allocate() {

    lock(&global_mutex);

    obj *h = freelist->head;
    freelist->head = h->next;

    unlock(&global_mutex);
    return h;

    }
    void free(obj *o) {
    lock(&global_mutex);
    o->next = freelist->head;
    freelist->head = o;
    unlock(&global_mutex);
    }

  22. Test and Set Lock (flowchart): atomically snapshot the current lock state and update it to locked; if the snapshot was unlocked, the lock is acquired and we're done; if it was locked, retry.

  23. Test and Set Unlock (flowchart): atomically set the lock state to unlocked.

  24. typedef int spinlock;
    #define LOCKED 1
    #define UNLOCKED 0
    void lock(spinlock *m) {
    while (atomic_tas(m, LOCKED) == LOCKED)
    stall();
    }
    void unlock(spinlock *m) {
    atomic_store(m, UNLOCKED);
    }
    Many code examples
    derived from Concurrency Kit
    http://concurrencykit.org
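
For comparison, the same naive test-and-set lock can be written with nothing but standard C11 <stdatomic.h>; this is a sketch of that mapping (an assumption, not the Concurrency Kit code the slides derive from). atomic_flag's test-and-set is exactly the TAS primitive the flowchart on slide 22 describes.

    #include <stdatomic.h>

    typedef atomic_flag spinlock;        /* initialize with ATOMIC_FLAG_INIT (unlocked) */

    void lock(spinlock *m) {
        /* Every spin iteration is a full atomic read-modify-write. */
        while (atomic_flag_test_and_set_explicit(m, memory_order_acquire))
            ;
    }

    void unlock(spinlock *m) {
        atomic_flag_clear_explicit(m, memory_order_release);
    }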

  25. void lock(spinlock *m) {

    while (atomic_tas(m, LOCKED) == LOCKED)
    stall();

    }
    A(TAS request)
    A(TAS response)
    { }

  26. A(TAS request)
    A(TAS response)
    { }
    A(lock request)
    A(lock response)
    Time
    TAS is embedded in Lock

  27. A(TAS request)
    A(TAS response)
    { }
    A(lock request)
    A(lock response)
    Time
    TAS & Store can’t be reordered
    B(unlock request)
    B(unlock response)
    B(Store request)
    B(Store response)
    { }

  28. All execution histories
    All sequentially-consistent
    execution histories
    All ??? execution histories


  29. All execution histories
    All sequentially-consistent
    execution histories
    All linearizable execution
    histories


  30. http://dl.acm.org/citation.cfm?id=176576
    1994

  31. Linearizability
    •Easier to use in formal verification
    •Applies to individual objects
    •Composable

  32. A(TAS request)
    A(TAS response)
    { }
    A(lock request)
    A(lock response)
    Time
    Others can be reordered
    B(unlock request)
    B(unlock response)
    B(Store request)
    B(Store response)
    { }

  33. A(TAS request)
    A(TAS response)
    { }
    A(lock request)
    A(lock response)
    Time
    Others can be reordered
    B(unlock request)
    B(unlock response)
    B(Store request)
    B(Store response)
    { }

  34. void lock(spinlock *m) {

    while (atomic_tas(m, LOCKED) == LOCKED)
    stall();

    }
    void unlock(spinlock *m) {

    atomic_store(m, UNLOCKED);

    }

  35. http://dl.acm.org/citation.cfm?id=69624.357207
    1983

  36. obj *allocate() {

    lock(&global_lock);

    obj *h = freelist->head;
    freelist->head = h->next;

    unlock(&global_lock);
    return h;

    }
    void free(obj *o) {
    lock(&global_lock);
    o->next = freelist->head;
    freelist->head = o;
    unlock(&global_lock);
    }

  37. void lock(spinlock *m) {

    while (atomic_tas(m, LOCKED) == LOCKED)
    stall();

    }
    void unlock(spinlock *m) {

    atomic_store(m, UNLOCKED);

    }
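
The next slides plot throughput of this locked alloc/free pair against thread count. As a rough idea of how such numbers could be gathered, here is a hedged pthread harness sketch; the thread count, duration, and counters are assumptions, and the commented-out line marks where the locked alloc/free pair under test would go.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NTHREADS 4
    #define SECONDS  10

    static atomic_bool running = true;
    static atomic_uint_fast64_t total;

    static void *worker(void *arg) {
        (void)arg;
        uint64_t ops = 0;
        while (atomic_load_explicit(&running, memory_order_relaxed)) {
            /* lock(&global_lock); obj *o = allocate(); free(o); unlock(&global_lock); */
            ops++;
        }
        atomic_fetch_add(&total, ops);
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        sleep(SECONDS);
        atomic_store(&running, false);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("%llu alloc/free pairs in %d seconds\n",
               (unsigned long long)atomic_load(&total), SECONDS);
        return 0;
    }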

  38. Spinlock Performance
    Chart: Test and Set spinlock; y-axis: locked alloc/free pairs (10s), 10,000,000 to 60,000,000; x-axis: threads, 1 to 16; labeled value: 52,720,687.

  39. Spinlock Performance
    Chart: Test and Set spinlock; y-axis: locked alloc/free pairs (10s), 10,000,000 to 60,000,000; x-axis: threads, 1 to 16; labeled values: 52,720,687 and 2,876,615.

  40. Spinlock Performance
    Chart: Test and Set spinlock; y-axis: locked alloc/free pairs (10s), 10,000,000 to 60,000,000; x-axis: threads, 1 to 16.

  41. Spinlock Performance
    Chart: Test and Set spinlock; y-axis (log scale): locked alloc/free pairs (10s), 10 to 100,000,000; x-axis: threads, 1 to 16.

  42. typedef int spinlock;

    #define LOCKED 1

    #define UNLOCKED 0


    void lock(spinlock *m) {

    while (atomic_tas(m, LOCKED) == LOCKED)
    stall();

    }

  43. void lock(spinlock *m) {
    while (atomic_tas(m, LOCKED) == LOCKED)
    while (*m == LOCKED)
    stall();
    }
    Test and Test and Set
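
The racy inner read is the point: waiting threads spin on an ordinary load of their cached copy instead of issuing atomic read-modify-writes on every iteration. A portable C11 sketch of the same idea (assuming a C11 mapping rather than the Concurrency Kit primitives on the slide):

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef atomic_bool spinlock;        /* false == unlocked */

    void lock(spinlock *m) {
        for (;;) {
            /* Benign race: spin on a plain load until the lock looks free... */
            while (atomic_load_explicit(m, memory_order_relaxed))
                ;
            /* ...then attempt the expensive atomic exchange. */
            if (!atomic_exchange_explicit(m, true, memory_order_acquire))
                return;
        }
    }

    void unlock(spinlock *m) {
        atomic_store_explicit(m, false, memory_order_release);
    }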

  44. Spinlock Performance
    Chart: Test and Set vs. T&T&S; y-axis (log scale): locked alloc/free pairs (10s), 10 to 100,000,000; x-axis: threads, 1 to 16.

  45. TAS + Backoff
    void lock(spinlock *m) {
    uint64_t backoff = 0, exp = 0;
    while (atomic_tas(m, LOCKED) == LOCKED) {
    for (uint64_t b = 0; b < backoff; b++)
    stall();
    backoff = (1ULL << exp++);
    }
    }
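
One detail worth noting: the window above doubles without bound. A common variant caps it; the sketch below reuses the slide's spinlock, atomic_tas, LOCKED, and stall definitions and adds a ceiling, which is purely an assumption (Concurrency Kit ships a similar exponential-backoff helper in ck_backoff.h).

    #include <stdint.h>

    #define BACKOFF_CEILING (1ULL << 16)     /* assumed cap; tune per platform */

    void lock(spinlock *m) {
        uint64_t backoff = 1;
        while (atomic_tas(m, LOCKED) == LOCKED) {
            for (uint64_t b = 0; b < backoff; b++)
                stall();
            if (backoff < BACKOFF_CEILING)
                backoff <<= 1;               /* double the window, but stop at the cap */
        }
    }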

  46. Spinlock Performance
    Chart: Test and Set vs. T&T&S vs. TAS + EB; y-axis: locked alloc/free pairs (10s), 10,000,000 to 60,000,000; x-axis: threads, 1 to 16.

  47. void lock(spinlock *m) {

    while (atomic_tas(m, LOCKED) == LOCKED)
    stall();

    }
    void lock(spinlock *m) {

    while (atomic_tas(m, LOCKED) == LOCKED)
    stall();

    }
    spinlock global_lock = UNLOCKED

  48. void lock(spinlock *m) {

    while (atomic_tas(m, LOCKED) == LOCKED)
    stall();

    }
    void lock(spinlock *m) {

    while (atomic_tas(m, LOCKED) == LOCKED)
    stall();

    }
    spinlock global_lock = UNLOCKED

  49. void lock(spinlock *m) {

    while (atomic_tas(m, LOCKED) == LOCKED)
    stall();

    }
    void lock(spinlock *m) {

    while (atomic_tas(m, LOCKED) == LOCKED)
    stall();

    }
    spinlock global_lock = LOCKED

  50. void lock(spinlock *m) {

    while (atomic_tas(m, LOCKED) == LOCKED)
    stall();

    }
    void lock(spinlock *m) {

    while (atomic_tas(m, LOCKED) == LOCKED)
    stall();

    }
    spinlock global_lock = LOCKED

  51. void lock(spinlock *m) {

    while (atomic_tas(m, LOCKED) == LOCKED)
    stall();

    }
    void lock(spinlock *m) {

    while (atomic_tas(m, LOCKED) == LOCKED)
    stall();

    }
    spinlock global_lock = LOCKED

  52. A function is lock-free if at all times at least one thread is guaranteed to be making progress.
    (Herlihy & Shavit)

  53. obj *allocate() {

    /* TODO linearize */

    obj *h = freelist->head;
    freelist->head = h->next;

    return h;

    }
    void free(obj *o) {
    /* TODO linearize */
    o->next = freelist->head;
    freelist->head = o;
    }

  54. Non-Blocking
    Algorithms

  55. Compare
    and Swap

  56. Compare-And-Swap (diagram): compare an old value against the contents of a destination address.

  57. Compare-And-Swap (diagram): if the old value does not match the contents of the destination address, return false.

  58. Compare-And-Swap (diagram): if the old value matches the contents of the destination address, copy the new value into the destination and return true; otherwise return false.

  59. Compare-And-Swap (diagram): the entire compare-and-copy step executes atomically.
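
In C this operation is available directly. Below is a sketch of a cas(target, expected, desired) helper like the one the following slides use, expressed with C11 atomics; the mapping is an assumption (the talk's code derives from Concurrency Kit, which exposes the same operation through its ck_pr compare-and-swap primitives).

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Stores desired into *target and returns true iff *target equaled expected. */
    static inline bool cas(_Atomic(void *) *target, void *expected, void *desired) {
        return atomic_compare_exchange_strong_explicit(
            target, &expected, desired,
            memory_order_acq_rel, memory_order_acquire);
    }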

  60. obj *allocate(slab *s) {

    obj *a, *b;

    do {

    a = s->head;

    b = a->next;

    } while (!cas(&s->head, a, b));

    return a;

    }
    slab head A B …
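
Note that this loop assumes the slab is never empty: it dereferences a->next unconditionally. A variant with an explicit empty check, reusing the slide's types and cas helper, might look like the following; returning NULL on exhaustion is an assumption.

    obj *allocate(slab *s) {
        obj *a, *b;
        do {
            a = s->head;
            if (a == NULL)
                return NULL;         /* slab exhausted: nothing to hand out */
            b = a->next;
        } while (!cas(&s->head, a, b));
        return a;
    }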

  61. slab head A B …
    obj *allocate(slab *s) {

    obj *a, *b;

    do {

    a = s->head;

    b = a->next;

    } while (!cas(&s->head, a, b));

    return a;

    }

  62. obj *allocate(slab *s) {

    obj *a, *b;

    do {

    a = s->head;

    b = a->next;

    } while (!cas(&s->head, a, b));

    return a;

    }
    A B …
    slab head

  63. B …
    slab head
    A
    obj *allocate(slab *s) {

    obj *a, *b;

    do {

    a = s->head;

    b = a->next;

    } while (!cas(&s->head, a, b));

    return a;

    }

  64. obj *allocate(slab *s) {

    obj *a, *b;

    do {

    a = s->head;

    b = a->next;

    } while (!cas(&s->head, a, b));

    return a;

    }
    Diagram: slab head → A → B → …; cas(&s->head, a, b) compares a against s->head and installs b on a match.

  65. Diagram: slab head now points to Z, so the cas(&s->head, a, b) comparison against a fails.
    obj *allocate(slab *s) {

    obj *a, *b;

    do {

    a = s->head;

    b = a->next;

    } while (!cas(&s->head, a, b));

    return a;

    }

  66. Diagram: slab head → Z → A → B; the loop reloads a and b and retries cas(&s->head, a, b).
    obj *allocate(slab *s) {

    obj *a, *b;

    do {

    a = s->head;

    b = a->next;

    } while (!cas(&s->head, a, b));

    return a;

    }

  67. Diagram: slab head → B → … after a successful cas(&s->head, a, b).
    obj *allocate(slab *s) {

    obj *a, *b;

    do {

    a = s->head;

    b = a->next;

    } while (!cas(&s->head, a, b));

    return a;

    }

  68. void free(slab *s, obj *o) {
    obj *t;
    do {
    t = s->head;
    o->next = t;
    } while (!cas(&s->head, t, o));
    }
    B …
    slab head

  69. slab head A B …
    void free(slab *s, obj *o) {
    obj *t;
    do {
    t = s->head;
    o->next = t;
    } while (!cas(&s->head, t, o));
    }
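
Taken together, the allocate loop from slide 60 and this free form a Treiber-style lock-free freelist. A minimal runnable sketch of the pair using C11 atomics is below; the C11 mapping and the tiny driver are assumptions, and this version still has the ABA problem the next slides walk through.

    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdio.h>

    typedef struct obj { struct obj *next; } obj;
    typedef struct slab { _Atomic(obj *) head; } slab;

    /* Pop the head object; mirrors slide 60. */
    static obj *allocate(slab *s) {
        obj *a, *b;
        do {
            a = atomic_load(&s->head);
            b = a->next;
        } while (!atomic_compare_exchange_weak(&s->head, &a, b));
        return a;
    }

    /* Push an object back; mirrors slides 68 and 69. */
    static void free_obj(slab *s, obj *o) {
        obj *t;
        do {
            t = atomic_load(&s->head);
            o->next = t;
        } while (!atomic_compare_exchange_weak(&s->head, &t, o));
    }

    int main(void) {
        static obj pool[3];
        slab s;
        pool[0].next = &pool[1];
        pool[1].next = &pool[2];
        pool[2].next = NULL;
        atomic_init(&s.head, &pool[0]);

        obj *x = allocate(&s);
        free_obj(&s, x);
        printf("head restored to pool[0]: %d\n", atomic_load(&s.head) == &pool[0]);
        return 0;
    }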

  70. A B C
    slab head

  71. obj *allocate(slab *s) {

    obj *a, *b;

    do {

    a = s->head;

    b = a->next;

    } while (!cas(&s->head, a, b));

    return a;

    }
    A B C
    slab head

  72. A B C
    slab head
    obj *allocate(slab *s) {

    obj *a, *b;

    do {

    a = s->head;

    b = a->next;

    } while (!cas(&s->head, a, b));

    return a;

    }

  73. A B C
    some_object = allocate(&shared_slab);
    slab head
    obj *allocate(slab *s) {

    obj *a, *b;

    do {

    a = s->head;

    b = a->next;

    } while (!cas(&s->head, a, b));

    return a;

    }

  74. B C
    A
    slab head
    some_object = allocate(&shared_slab);
    obj *allocate(slab *s) {

    obj *a, *b;

    do {

    a = s->head;

    b = a->next;

    } while (!cas(&s->head, a, b));

    return a;

    }

  75. B C
    another_obj = allocate(&shared_slab);
    A
    slab head
    some_object = allocate(&shared_slab);
    obj *allocate(slab *s) {

    obj *a, *b;

    do {

    a = s->head;

    b = a->next;

    } while (!cas(&s->head, a, b));

    return a;

    }

  76. C
    A
    B
    slab head
    some_object = allocate(&shared_slab);
    another_obj = allocate(&shared_slab);
    obj *allocate(slab *s) {

    obj *a, *b;

    do {

    a = s->head;

    b = a->next;

    } while (!cas(&s->head, a, b));

    return a;

    }

  77. some_object = allocate(&shared_slab);
    free(some_object);
    B
    C
    A
    slab head
    another_obj = allocate(&shared_slab);
    obj *allocate(slab *s) {

    obj *a, *b;

    do {

    a = s->head;

    b = a->next;

    } while (!cas(&s->head, a, b));

    return a;

    }

  78. B
    A C
    slab head
    some_object = allocate(&shared_slab);
    free(some_object);
    another_obj = allocate(&shared_slab);
    obj *allocate(slab *s) {

    obj *a, *b;

    do {

    a = s->head;

    b = a->next;

    } while (!cas(&s->head, a, b));

    return a;

    }

  79. B
    C
    slab head
    some_object = allocate(&shared_slab);
    free(some_object);
    another_obj = allocate(&shared_slab);
    A
    obj *allocate(slab *s) {

    obj *a, *b;

    do {

    a = s->head;

    b = a->next;

    } while (!cas(&s->head, a, b));

    return a;

    }

  80. B
    B C
    slab head
    some_object = allocate(&shared_slab);
    free(some_object);
    another_obj = allocate(&shared_slab);
    A
    obj *allocate(slab *s) {

    obj *a, *b;

    do {

    a = s->head;

    b = a->next;

    } while (!cas(&s->head, a, b));

    return a;

    }

  81. The ABA Problem
    “A reference about to be modified by a CAS changes
    from a to b and back to a again. As a result, the CAS
    succeeds even though its effect on the data structure
    has changed and no longer has the desired effect.”
    —Herlihy & Shavit, p. 235

  82. A B …
    slab head
    166
    obj *allocate(slab *s) {

    obj *a, *b;

    do {

    a = s->head;

    b = a->next;

    } while (!cas(&s->head, a, b));

    return a;

    }

  83. obj *allocate(slab *s) {

    slab orig, update;

    do {

    orig.gen = s->gen;

    orig.head = s->head;

    update.gen = orig.gen + 1;

    update.head = orig.head->next;

    } while (!dcas(s, &orig, &update));

    return orig.head;

    }
    A B …
    slab head
    166
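
The dcas above compares and swaps the {generation, head} pair in one shot, so a recycled head pointer no longer matches once the generation has been bumped. One way such a double-width CAS could be realized is sketched below using the GCC/Clang __atomic_compare_exchange builtin; this mapping is an assumption (on x86-64 it requires cmpxchg16b, e.g. -mcx16, and Concurrency Kit provides double-width CAS primitives for the same purpose).

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct obj { struct obj *next; } obj;

    /* Generation counter packed next to the head pointer; bumping it on every
     * pop defeats the ABA scenario shown on the previous slides. */
    typedef struct slab {
        uintptr_t gen;
        obj *head;
    } __attribute__((aligned(16))) slab;

    static inline bool dcas(slab *target, slab *expected, slab *desired) {
        return __atomic_compare_exchange(target, expected, desired,
                                         false, __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
    }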

  84. void free(slab *s, obj *o) {
    obj *t;
    do {
    t = s->head;
    o->next = t;
    } while (!cas(&s->head, t, o));
    }

  85. Allocator Throughput
    Chart: TAS, T&T&S, TAS + EB, Concurrent Allocator, pthread; y-axis: alloc/free pairs (10s), 10,000,000 to 60,000,000; x-axis: threads, 1 to 16.

  86. Allocator Throughput
    Chart: TAS, T&T&S, TAS + EB, Concurrent Allocator, pthread; y-axis (log scale): alloc/free pairs (10s), 10 to 100,000,000; x-axis: threads, 1 to 16.

  87. CPU Cycles (rdtscp)
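
The next few slides are charts labeled "CPU Cycles (rdtscp)". As a rough sketch of how per-operation cycle counts can be sampled with rdtscp on x86-64 (a general technique, not necessarily the talk's harness):

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>

    /* rdtscp reads the time-stamp counter after prior instructions have
     * completed; the aux output (core/socket ID) is ignored here. */
    static inline uint64_t cycles(void) {
        unsigned int aux;
        return __rdtscp(&aux);
    }

    int main(void) {
        uint64_t start = cycles();
        /* ... the operation under test, e.g. one allocate()/free() pair ... */
        uint64_t end = cycles();
        printf("%llu cycles\n", (unsigned long long)(end - start));
        return 0;
    }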

  88. CPU Cycles (rdtscp)

  89. CPU Cycles (rdtscp)

  90. CPU Cycles (rdtscp)

  91. Takeaways

  92. uslab.io

  94. Thanks
    @dhobsd | https://9vx.org | http://uslab.io
    Come see us at the Fastly booth!

  95. Further Reading
    •“Is Parallel Programming Hard, And, If So, What Can You Do About It?”, created and edited by Paul McKenney, https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html
    •“Nonblocking Algorithms and Scalable Multicore Programming”, Samy Al Bahra, https://queue.acm.org/detail.cfm?id=2492433
    •“What Every Programmer Should Know About Memory”, Ulrich Drepper, http://www.akkadia.org/drepper/cpumemory.pdf

  96. Further Reading
    •“The C++ Memory Model Meets High-Update-Rate Data Structures”, Paul McKenney, http://www.rdrop.com/~paulmck/RCU/C++Updates.2014.09.11a.pdf, https://www.youtube.com/watch?v=1Q-RH2tiyt0
    •“Obstruction-Free Algorithms can be Practically Wait-Free”, Fich, Luchangco, Moir, Shavit, 2005, http://people.csail.mit.edu/shanir/publications/DISC2005.pdf
    •“Are Lock-Free Concurrent Algorithms Practically Wait-Free?”, Alistarh, Censor-Hillel, Shavit, 2013, http://arxiv.org/abs/1311.3200

  97. Further Reading
    •“Lock-Free By Example”, Tony Van Eerd, https://www.youtube.com/watch?v=Xf35TLFKiO8
    •Concurrency Kit: http://concurrencykit.org
    •µSlab: http://uslab.io

  98. Image Credits
    • I assume reproduction rights to all images under fair use; this slide is for reference purposes and fair attribution.
    • Mario, Mario Kart, and other related franchises are registered trademarks of Nintendo and its associates. I am in no way affiliated with or endorsed by Nintendo.
    • Mario Kart 8 screenshot from eBash video game centers at http://ebash.com/wp-content/uploads/2015/02/mariokart.jpg
    • Lakitu stop light animation from IGN at http://31.media.tumblr.com/bd374359bd39369cdbf25755d4b2e570/tumblr_mfj2dnGBTL1rfjowdo1_500.gif
    • Brittney Griner blocking from Bleacher Report, but high-res source seems to have disappeared :(
