Recon at PLDI 2011

A talk I gave at PLDI 2011 about Recon, a new technique and tool we made for debugging concurrent programs.

Brandon Lucia

May 28, 2012

Transcript

  1. Recon: Finding and Understanding Concurrency Errors with Reconstructed Execution Fragments

    Brandon Lucia, Benjamin P. Wood, Luis Ceze
    University of Washington, Department of Computer Science and Engineering
    Monday, May 28, 2012
  2. Concurrent Programming is a lot like...

  6. Concurrency Errors

    Thread 1: Ready = true; SObj = new p()
    Thread 2: if Ready == true: myObj = SObj; Q.nq(myObj)
    Thread 3: o = Q.dq; o->use()
    Initially, Ready is false, SObj is an invalid pointer.
    Intended Invariant: If Ready is true, SObj is a valid pointer.
  7. Concurrency Errors

    Thread 1: Ready = true; SObj = new p()
    Thread 2: if Ready == true: myObj = SObj; Q.nq(myObj)
    Thread 3: o = Q.dq; o->use()
    Initially, Ready is false, SObj is an invalid pointer.
    Buggy Behavior: Ready can be true when SObj is invalid.
  8. Concurrency Errors

    Thread 1: Ready = true; SObj = new p()
    Thread 2: if Ready == true: myObj = SObj; Q.nq(myObj)
    Several days later...
    Thread 3: o = Q.dq; o->use()
    Initially, Ready is false, SObj is an invalid pointer.
    Buggy Behavior: Ready can be true when SObj is invalid.
  9. Concurrency Errors

    Thread 1: Ready = true; SObj = new p()
    Thread 2: if Ready == true: myObj = SObj; Q.nq(myObj)
    Several days later...
    Thread 3: o = Q.dq; o->use()
    Symptom of bug is much later than cause and in a different thread!
  10. Tools to Make Bugs Happen

    Thread 1: Ready = true; SObj = new p()
    Thread 2: if Ready == true: myObj = SObj; Q.nq(myObj)
    Several days later...
    Thread 3: o = Q.dq; o->use()
    Bug-exposing test tools report buggy execution schedules.
    Too much information!
  11. Tools to Show What Happened

    Thread 1: Ready = true; SObj = new p()
    Thread 2: if Ready == true: myObj = SObj; Q.nq(myObj)
    Thread 3: o = Q.dq; o->use()
    Bug isolation tools guide programmers to a code point or two.
    Too little information!
  12. Tools to Show What Happened

    Thread 1: Ready = true; SObj = new p()
    Thread 2: if Ready == true: myObj = SObj; Q.nq(myObj)
    Several days later...
    Thread 3: o = Q.dq; o->use()
    Bug isolation tools guide programmers to a code point or two.
    Too little information!
  13. Programmers need focused information about bugs’ causes.

    Thread 1: Ready = true; SObj = new p()
    Thread 2: if Ready == true: myObj = SObj; Q.nq(myObj)
    Thread 3: o = Q.dq; o->use()
  14. Programmers need focused information about bugs’ causes.

    Thread 1: Ready = true; SObj = new p()
    Thread 2: if Ready == true: myObj = SObj; Q.nq(myObj)
    Several days later...
    Thread 3: o = Q.dq; o->use()
  15. Focusing on the Right Stuff

    Thread 1: Ready = true; SObj = new p()
    Thread 2: if Ready == true: myObj = SObj; Q.nq(myObj)
    Several days later...
    Thread 3: o = Q.dq; o->use()
  16. Focusing on the Right Stuff

    Thread 1: Ready = true; SObj = new p()
    Thread 2: if Ready == true: myObj = SObj
    Thread 3: (empty)
  17. Focusing on the Right Stuff

    Ready = true; SObj = new p()
    if Ready == true: myObj = SObj
  19. Focusing on the Right Stuff

    Ready = true; SObj = new p()
    if Ready == true: myObj = SObj
    (Diagram label: communication)
  20. Execution Reconstructions

    Ready = true; SObj = new p()
    if Ready == true: myObj = SObj
    (Diagram labels: communication, Reconstruction)
    A reconstruction is a focused subset of the execution schedule around the root cause of a bug.
  21. Recon Workflow

    Crashes when I run it!
  22. Recon Workflow

    Crashes when I run it!
    Collect communication graphs from many executions
  23. Recon Workflow

    Crashes when I run it!
    Collect communication graphs from many executions
    Label graphs as buggy or non-buggy
  24. Recon Workflow

    Crashes when I run it!
    Collect communication graphs from many executions
    Label graphs as buggy or non-buggy
    Build and aggregate reconstructions
  25. Recon Workflow

    Crashes when I run it!
    Collect communication graphs from many executions
    Label graphs as buggy or non-buggy
    Build and aggregate reconstructions
    Rank reconstructions and report (1, 2, 3)
  26. Outline

    Section 1: Building Communication Graphs
    Section 2: Building Reconstructions
    Section 3: Ranking Reconstructions
    Section 4: Evaluating Recon
  27. Section 1: Building Communication Graphs

  28. Communication Graphs

    Thread 1: Ready = true; SObj = new p()
    Thread 2: if Ready == true: myObj = SObj
    Nodes are static instructions.
    Edges are inter-thread communication via shared memory.
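The node and edge definitions on this slide can be illustrated with a small sketch. It is not Recon's implementation: the `MemOp` trace record, the `build_communication_graph` helper, and the last-writer bookkeeping are assumptions made for illustration, and the trace is a hand-written version of the buggy interleaving from the running example.

```python
from collections import namedtuple

# Hypothetical trace record: thread id, static instruction, "R" or "W", address.
MemOp = namedtuple("MemOp", "thread instr kind addr")

def build_communication_graph(trace):
    """Return edges (source instr, sink instr): the sink accesses an address
    whose most recent write came from a different thread."""
    last_writer = {}  # addr -> (thread, instr) of the most recent write
    edges = set()
    for op in trace:
        prev = last_writer.get(op.addr)
        if prev is not None and prev[0] != op.thread:
            edges.add((prev[1], op.instr))
        if op.kind == "W":
            last_writer[op.addr] = (op.thread, op.instr)
    return edges

# Buggy interleaving: Thread 2 reads SObj before Thread 1 has written it,
# so only the accesses to Ready communicate between threads.
trace = [
    MemOp(1, "Ready = true", "W", "Ready"),
    MemOp(2, "if Ready == true", "R", "Ready"),
    MemOp(2, "myObj = SObj", "R", "SObj"),
    MemOp(1, "SObj = new p()", "W", "SObj"),
]
edges = build_communication_graph(trace)
print(edges)
```

In this sketch the graph holds a single edge, from `Ready = true` to `if Ready == true`; the read of `SObj` has no remote writer yet, which is exactly the situation the intended invariant forbids.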
  29. Context-Aware Communication Graphs [MICRO ’09]

    Ready = true; SObj = new p()
    if Ready == true: myObj = SObj
    (Diagram labels, communication contexts: Rem Wr, Rem Wr, Loc Rd, Loc Wr, Rem Rd, Rem Rd)
    Communication context is a short history of recent communication events.
    Nodes are instances of instructions within their context.
  30. Timestamped Context-Aware Communication Graphs

    Ready = true; SObj = new p()
    if Ready == true: myObj = SObj
    (Diagram labels, communication contexts: Rem Wr, Rem Wr, Loc Rd, Loc Wr, Rem Rd, Rem Rd)
    (Node timestamps: T=5, T=7, T=15, T=16)
    Timestamps encode ordering of non-communicating nodes.
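One way to picture context and timestamps together is the sketch below. It rests on several assumptions that the slides do not state: a context length of 3, a global counter as the clock, event names `LocRd`, `LocWr`, `RemRd`, `RemWr` assigned as shown in the comments, and the hypothetical helper name `build_timestamped_graph`.

```python
from collections import defaultdict, deque
from itertools import count

CONTEXT_LEN = 3  # assumed; the slides only say the history is "short"

def build_timestamped_graph(trace, context_len=CONTEXT_LEN):
    """trace: iterable of (thread, instr, kind, addr), kind is "R" or "W".
    Returns (timestamps, edges). A node is (instr, context), where context
    is the thread's recent communication events at the time instr ran."""
    clock = count(1)
    context = defaultdict(lambda: deque(maxlen=context_len))
    last_writer = {}  # addr -> (thread, node) of the most recent write
    timestamps = {}   # node -> time of its most recent execution
    edges = set()
    for thread, instr, kind, addr in trace:
        node = (instr, tuple(context[thread]))
        timestamps[node] = next(clock)
        prev = last_writer.get(addr)
        if prev is not None and prev[0] != thread:
            edges.add((prev[1], node))
            # Assumed naming: the accessing thread logs a local event,
            # the earlier writer's thread logs the matching remote event.
            context[thread].append("LocRd" if kind == "R" else "LocWr")
            context[prev[0]].append("RemRd" if kind == "R" else "RemWr")
        if kind == "W":
            last_writer[addr] = (thread, node)
    return timestamps, edges

trace = [
    (1, "Ready = true", "W", "Ready"),
    (2, "if Ready == true", "R", "Ready"),
    (2, "myObj = SObj", "R", "SObj"),
    (1, "SObj = new p()", "W", "SObj"),
]
timestamps, edges = build_timestamped_graph(trace)
for node, t in sorted(timestamps.items(), key=lambda item: item[1]):
    print(t, node)
```

Even though `myObj = SObj` and `SObj = new p()` share no edge here, their timestamps still record that the read ran first, which is the ordering information the slide attributes to timestamps.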
  31. Section 2: Building Reconstructions

  32. Reconstructions

    A reconstruction is built around a single communication event from a single execution.
    (Diagram labels: Source, Sink)
  33. Reconstructions

    A reconstruction is a time-ordered sequence of memory operations.
    (Diagram labels: Time, Memory Operation)
  34. Reconstructions

    The regions of a reconstruction are computed using graph timestamps.
    (Diagram labels: Prefix, Body, Suffix)
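The prefix, body, and suffix regions could be computed from timestamps roughly as follows. This is a sketch under assumptions: the `window` cap on prefix and suffix length, the `build_reconstruction` name, and the pairing of the timestamps 5, 7, 15, 16 with particular statements are all illustrative and not taken from the talk.

```python
def build_reconstruction(timestamps, source, sink, window=2):
    """timestamps: dict mapping node -> timestamp from one execution.
    Prefix: up to `window` nodes before the source.
    Body: nodes strictly between source and sink.
    Suffix: up to `window` nodes after the sink."""
    t_src, t_snk = timestamps[source], timestamps[sink]
    ordered = sorted(timestamps, key=timestamps.get)
    return {
        "prefix": [n for n in ordered if timestamps[n] < t_src][-window:],
        "source": source,
        "body": [n for n in ordered if t_src < timestamps[n] < t_snk],
        "sink": sink,
        "suffix": [n for n in ordered if timestamps[n] > t_snk][:window],
    }

# Assumed timestamps for the running example's buggy interleaving.
timestamps = {
    "Ready = true": 5,
    "if Ready == true": 7,
    "myObj = SObj": 15,
    "SObj = new p()": 16,
}
rec = build_reconstruction(timestamps, "Ready = true", "if Ready == true")
print(rec["suffix"])
```

With these assumed timestamps the suffix lists `myObj = SObj` before `SObj = new p()`, so the reconstruction around the `Ready` edge exposes the out-of-order initialization.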
  35. Multiple Buggy Executions

    Behavior differs across runs. Same edge, different reconstructions.
    Idea: Focus on common behavior by combining multiple executions.
  36. Multiple Buggy Executions

    (Diagram labels: Buggy Execution #1, Buggy Execution #2)
    Behavior differs across runs. Same edge, different reconstructions.
    Idea: Focus on common behavior by combining multiple executions.
  37. Aggregate Reconstructions

    (Diagram: two reconstructions combined into one; node labels 50%, 50%, 50%, 50%, 100%)
  38. Aggregate Reconstructions

    (Diagram: Buggy Execution #1 + Buggy Execution #2 = Aggregate Reconstruction; node labels 50%, 50%, 50%, 50%, 100%)
  39. Aggregate Reconstructions

    (Node labels: 50%, 50%, 50%, 50%, 100%)
    Aggregation focuses on typical behavior and deemphasizes rare behavior.
  40. Aggregate Reconstructions

    (Diagram label: Aggregate Reconstruction; node labels 50%, 50%, 50%, 50%, 100%)
    Aggregation focuses on typical behavior and deemphasizes rare behavior.
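The percentages on these slides can be read as the fraction of buggy executions in which a node shows up in a given region. A minimal sketch of that reading follows; the dictionary layout, the `aggregate` helper, and the node names `X`, `Y`, `A`, `B`, `C` are hypothetical.

```python
from collections import Counter

REGIONS = ("prefix", "body", "suffix")

def aggregate(reconstructions):
    """Combine reconstructions of the same edge from several executions.
    Returns {region: {node: fraction of executions with node in region}}."""
    total = len(reconstructions)
    result = {}
    for region in REGIONS:
        counts = Counter()
        for rec in reconstructions:
            counts.update(set(rec[region]))  # count each node once per run
        result[region] = {node: c / total for node, c in counts.items()}
    return result

# Two buggy executions that agree on node A only.
run1 = {"prefix": ["X"], "body": [], "suffix": ["A", "B"]}
run2 = {"prefix": ["Y"], "body": [], "suffix": ["A", "C"]}
agg = aggregate([run1, run2])
print(agg)
```

Here `A` ends up at 100% and the other four nodes at 50% each, mirroring the labels in the slide's diagram.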
  41. Recon Workflow

    Crashes when I run it!
    Collect communication graphs from many executions
    Label graphs as buggy or non-buggy
    Build and aggregate reconstructions
    Rank reconstructions and report (1, 2, 3)
  43. Section 3: Ranking Reconstructions

  44. Ranking Reconstructions

    (Node labels: 50%, 50%, 50%, 50%, 100%)
    Each reconstruction is described by a vector of numeric features.
    Statistical inference on feature vectors ranks reconstructions.
    (Diagram labels: feature vector [B C R], BUG!)
  45. Feature Definitions

    Feature vector: [Buggy Frequency Ratio, Context Variation Ratio, Reconstruction Consistency]
  46. Buggy Frequency Ratio

    (Diagram labels: buggy, buggy, non, non)
    Buggy Frequency Ratio is large if edge occurs often in buggy runs, and rarely in non-buggy runs.
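A ratio with this behavior could be computed as below. The slide gives only the qualitative description, so the exact form is an assumption: the fraction of buggy runs containing the edge divided by the fraction of non-buggy runs containing it, with an assumed floor on the denominator to avoid dividing by zero. Graphs are modeled as plain sets of edge names.

```python
def buggy_frequency_ratio(edge, buggy_graphs, nonbuggy_graphs):
    """Large when `edge` is common in buggy runs and rare in non-buggy runs.
    Each graph is a set of edges collected from one execution."""
    frac_buggy = sum(edge in g for g in buggy_graphs) / len(buggy_graphs)
    frac_non = sum(edge in g for g in nonbuggy_graphs) / len(nonbuggy_graphs)
    # Assumed smoothing: treat "never seen" as rarer than one in N+1 runs.
    floor = 1.0 / (len(nonbuggy_graphs) + 1)
    return frac_buggy / max(frac_non, floor)

buggy = [{"e1", "e2"}, {"e1"}]
nonbuggy = [{"e2"}, {"e2"}]
ratio_e1 = buggy_frequency_ratio("e1", buggy, nonbuggy)
ratio_e2 = buggy_frequency_ratio("e2", buggy, nonbuggy)
print(ratio_e1, ratio_e2)
```

Edge `e1` appears in every buggy run and no non-buggy run, so it scores well above `e2`, which appears in every non-buggy run.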
  48. Reconstruction Consistency

    (Node labels: 50%, 50%, 50%, 50%, 100%)
    Reconstruction Consistency is high if the behavior is typical in buggy executions.
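One plausible way to turn that description into a number is to average the per-node percentages of an aggregate reconstruction. The averaging rule, the nested-dictionary layout, and the `reconstruction_consistency` name are assumptions; the slide does not give a formula.

```python
def reconstruction_consistency(aggregate):
    """aggregate: {region: {node: fraction of buggy runs containing node}}.
    Returns the mean fraction over all nodes (0.0 if there are none)."""
    values = [f for region in aggregate.values() for f in region.values()]
    return sum(values) / len(values) if values else 0.0

# The slide's example labels: four nodes at 50% and one at 100%.
mixed = {"prefix": {"X": 0.5, "Y": 0.5}, "suffix": {"A": 1.0, "B": 0.5, "C": 0.5}}
# A hypothetical aggregate in which every buggy run behaved the same way.
typical = {"prefix": {"X": 1.0}, "suffix": {"A": 1.0}}
print(reconstruction_consistency(mixed), reconstruction_consistency(typical))
```

Under this assumed rule the mixed aggregate scores lower than the one whose nodes appear in every buggy run.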
  49. Ranking by Feature

    (Diagram labels: A BUG!, Not a BUG!)
    By design: higher feature values mean more likely buggy.
    A reconstruction’s rank is a linear combination of its features.
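Ranking by a linear combination of features can be sketched as a weighted sum followed by a sort. The equal weights, the feature values, and the names `score`, `rank`, `edge X`, and `edge Y` are made up for illustration; only the feature letters B, C, and R come from the slides.

```python
FEATURES = ("B", "C", "R")  # feature names as abbreviated on the slides

def score(features, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of a reconstruction's feature values (weights assumed)."""
    return sum(w * features[name] for w, name in zip(weights, FEATURES))

def rank(reconstructions, weights=(1.0, 1.0, 1.0)):
    """Sort reconstructions so the most likely buggy one comes first."""
    return sorted(reconstructions,
                  key=lambda r: score(r["features"], weights),
                  reverse=True)

candidates = [
    {"name": "edge X", "features": {"B": 0.5, "C": 1.0, "R": 0.6}},
    {"name": "edge Y", "features": {"B": 3.0, "C": 1.5, "R": 1.0}},
]
ranked = rank(candidates)
print([r["name"] for r in ranked])
```

Because every feature is designed so that larger means more likely buggy, sorting by the sum in descending order puts the strongest candidate at rank 1.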
  50. Section 4: Evaluating Recon

  51. Evaluating Recon’s Precision

    [Chart: Rank of Bug’s Reconstruction (axis 0 to 10), using 25 non-buggy runs and 25 buggy runs. Benchmarks (C/C++ and Java): logandswp, circlist, textreflow, jsstrlen, apache, mysql, pbzip2, aget, stringbuffer, vector, weblech.]
  52. Evaluating Recon’s Precision

    [Chart: Rank of Bug’s Reconstruction (axis 0 to 10), using 25 non-buggy runs and 25 or 15 buggy runs. Benchmarks (C/C++ and Java): logandswp, circlist, textreflow, jsstrlen, apache, mysql, pbzip2, aget, stringbuffer, vector, weblech.]
  53. Evaluating Recon’s Precision

    [Chart: Rank of Bug’s Reconstruction (axis 0 to 10), using 25 non-buggy runs and 25, 15, or 5 buggy runs; one value is labeled 34. Benchmarks (C/C++ and Java): logandswp, circlist, textreflow, jsstrlen, apache, mysql, pbzip2, aget, stringbuffer, vector, weblech.]
  54. Evaluating Recon’s Performance

    [Chart: Slowdown (x), axis 0 to 25, with one off-scale value labeled 79x. Benchmarks (C/C++ and Java): apache, mysql, pbzip2, aget, PARSEC, weblech, DaCapo, Java Grande.]
  55. Evaluating Recon’s Performance

    [Chart: Slowdown (x), axis 0 to 25, with one off-scale value labeled 79x. Benchmarks (C/C++ and Java): apache, mysql, pbzip2, aget, PARSEC, weblech, DaCapo, Java Grande. Total Graph Collection Time labels: 27:32, 07:08, 1:51:56, 59:41, 13:36.]
  56. Recon

    Ready = true; SObj = new p()
    if Ready == true: myObj = SObj
    (Diagram label: BUG!)
    Recon reconstructs execution fragments to help programmers understand their bugs.
    Recon uses statistical inference to identify reconstructions useful to understanding bugs.
  57. Try it out! http://cs.washington.edu/homes/blucia0a/recon.html

  58. Reconstructions Signal-to-Noise

    [Chart: Fraction of Reconstruction Relevant to Bug (axis 0 to 100). Benchmarks (C/C++ and Java): logandswp, circlist, textreflow, jsstrlen, apache, mysql, pbzip2, aget, stringbuffer, vector, weblech. Annotations: Awesome!, Lame!]