Concurrency Bugs Make Bad Things Happen
• Concurrency bugs manifest under certain instruction interleavings
• Interleavings happen nondeterministically
• We need good tools to help find concurrency bugs
* http://www.availabilitydigest.com/private/0203/northeast_blackout.pdf
Monday, May 28, 2012

Debugging With Communication Graphs From 10,000’
1. Collect communication graphs, and label each as Buggy or Correct
2. Identify edges that appear in Buggy graphs but not in Correct graphs
3. Inspect the code involved in Buggy-only edges
[Figure: communication graphs for the example code "blkOut = 0" and "while (blkOut)"]

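The three steps above amount to a set difference over graph edges. The following is a minimal sketch, assuming each graph is represented as a Python set of (writer instruction, reader instruction) pairs; the function name and the instruction addresses are hypothetical and chosen only for illustration.

```python
def buggy_only_edges(buggy_graphs, correct_graphs):
    """Return edges seen in at least one buggy graph and in no correct graph."""
    in_buggy = set().union(*buggy_graphs)
    in_correct = set().union(*correct_graphs)
    return in_buggy - in_correct


# Hypothetical edges: (writer instruction address, reader instruction address).
correct_runs = [{("0x10", "0x20"), ("0x14", "0x20")}]
buggy_runs = [{("0x10", "0x20"), ("0x18", "0x20")}]

suspects = buggy_only_edges(buggy_runs, correct_runs)
print(suspects)
```

The code a developer would inspect first is whatever sits at the two endpoints of each edge in `suspects`.
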
System Design Requirements
• Graphs must encode enough information to identify buggy communication
• Graph collection must be cheap
• Debugging must be simple

A More Interesting Example
Thread 1:
    str = getStr();
    len = getLen();
Thread 2:
    int l = len;
    string s = str;
A multi-variable atomicity violation can result in reads of inconsistent str and len.

Communication Alone Is Insufficient
Thread 1:
    str = getStr();
    len = getLen();
Thread 2:
    int l = len;
    string s = str;
[Figure: the Buggy (✗) and Correct (✓) communication graphs for this code]
There is no edge in the Buggy graph that isn’t in the Correct graph!

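The claim can be checked with a small simulation. This is a sketch under stated assumptions: a trace is a list of (thread, instruction, operation, variable) tuples, an edge links the last writer of a variable to a reader in a different thread, and the labels A, B, C, D are hypothetical names for the four statements of the example.

```python
def communication_graph(trace):
    """Build a plain communication graph: a set of (writer inst, reader inst)."""
    last_writer = {}  # variable -> (thread, instruction) of the most recent write
    edges = set()
    for thread, inst, op, var in trace:
        if op == "wr":
            last_writer[var] = (thread, inst)
        elif var in last_writer and last_writer[var][0] != thread:
            edges.add((last_writer[var][1], inst))
    return edges


# A: str = getStr();  B: len = getLen();  C: int l = len;  D: string s = str;
W = [("T1", "A", "wr", "str"), ("T1", "B", "wr", "len")]
R = [("T2", "C", "rd", "len"), ("T2", "D", "rd", "str")]

# Correct run: both reads happen after both writes of the second update.
correct = communication_graph(W + W + R)
# Buggy run: the reads land between the two writes of the second update.
buggy = communication_graph(W + [W[0]] + R + [W[1]])

print(buggy - correct)
```

Both runs produce the same two edges, B to C and A to D, so the difference is empty even though the buggy run reads a new str with an old len.
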
Adding Context to Graphs
• Communication graphs do not encode the relative ordering of communications
• The writes to str and len should not be interleaved with the reads, so the reads should be ordered before or after both writes
• Communication context is a short history of preceding communication events added to each node
• Context encodes ordering among communication events, enabling more general bug detection

Context-Aware Communication Graphs
Thread 1:
    str = getStr();
    len = getLen();
Thread 2:
    int l = len;
    string s = str;
[Figure: the communication graph for this code, with each node annotated with context events drawn from Loc Wr, Loc Rd, Rem Wr, and Rem Rd]

Context-Aware Communication Graphs
[Figure: context-aware graphs for Correct (✓) and Buggy (✗) runs; nodes are annotated with contexts made up of Loc Wr, Loc Rd, Rem Wr, and Rem Rd events]
One edge is unique to the buggy context-aware graph.

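A rough simulation shows how context can separate the two runs of the str/len example. It is a simplified sketch, not the hardware's exact event rules: it assumes a context length of two events, treats every write and every cross-thread read as a communication event, and uses the hypothetical labels A, B, C, D for the four statements.

```python
from collections import defaultdict

CTX_LEN = 2  # assumed context length, kept short for readability


def context_aware_graph(trace):
    """Build edges ((writer inst, writer ctx), (reader inst, reader ctx))."""
    ctx = defaultdict(tuple)  # thread -> its most recent communication events
    last_write = {}           # variable -> (thread, inst, ctx at the write)
    edges = set()
    threads = {t for t, _, _, _ in trace}

    def record(actor, kind):
        # The acting thread sees a local event; all others see a remote one.
        for t in threads:
            event = ("Loc" if t == actor else "Rem") + kind
            ctx[t] = (ctx[t] + (event,))[-CTX_LEN:]

    for thread, inst, op, var in trace:
        if op == "wr":
            last_write[var] = (thread, inst, ctx[thread])
            record(thread, "Wr")
        elif var in last_write and last_write[var][0] != thread:
            _, w_inst, w_ctx = last_write[var]
            edges.add(((w_inst, w_ctx), (inst, ctx[thread])))
            record(thread, "Rd")
    return edges


# A: str = getStr();  B: len = getLen();  C: int l = len;  D: string s = str;
W = [("T1", "A", "wr", "str"), ("T1", "B", "wr", "len")]
R = [("T2", "C", "rd", "len"), ("T2", "D", "rd", "str")]

correct = context_aware_graph(W + R + W + R)
buggy = context_aware_graph(W + R + [W[0]] + R + [W[1]])

print(buggy - correct)
```

In this sketch the only buggy-only edge ends at C with reader context (LocRd, RemWr): the read of len was preceded by a single remote write instead of two, which is exactly the interleaving the plain graph could not express.
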
Architectural Support
• Processor: a Context Register holds recent communication events as 2-bit communication event codes; each new event is shifted in as the oldest is shifted out
• Cache: each cache line carries meta-data holding the context of the last write and the instruction address of the last writer

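The context register behaves like a fixed-width shift register. The sketch below models it in software; the particular 2-bit code assigned to each event and the history length of five events are assumptions made for illustration, not values taken from the slides.

```python
# Assumed 2-bit encoding of the four communication events.
EVENT_CODES = {"LocRd": 0b00, "LocWr": 0b01, "RemRd": 0b10, "RemWr": 0b11}
CONTEXT_EVENTS = 5                      # assumed number of events kept
MASK = (1 << (2 * CONTEXT_EVENTS)) - 1  # keeps only the newest events


def shift_in(context, event):
    """Shift the oldest event out and the new 2-bit event code in."""
    return ((context << 2) | EVENT_CODES[event]) & MASK


reg = 0
for ev in ["LocWr", "RemRd", "RemWr"]:
    reg = shift_in(reg, ev)
print(format(reg, "010b"))
```

One consequence of this encoding is that an all-zero register cannot be told apart from a history of code-0 events, so a real design would need to pick its reset value with that in mind.
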
Architectural Support
[Figure: Processors 1 through N, each with a cache ($), connected to a Communication Table that is handled by a Software Runtime]
• As communication is observed, it is recorded in the Communication Table
• The table is organized as a queue whose entries hold the source and destination PC and context
• When the Communication Table fills, the system traps to a software runtime, which saves the table's contents to memory or disk

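The table's fill-and-trap behavior can be modeled as a bounded queue with a callback. This is only a software sketch: the class name, the capacity, and the callback standing in for the trap to the software runtime are all hypothetical, and the entries are illustrative values.

```python
class CommunicationTable:
    """Bounded queue of (source PC, source ctx, destination PC, destination ctx)."""

    def __init__(self, capacity, on_full):
        self.capacity = capacity
        self.on_full = on_full  # stands in for the trap to the software runtime
        self.entries = []

    def record(self, src_pc, src_ctx, dst_pc, dst_ctx):
        self.entries.append((src_pc, src_ctx, dst_pc, dst_ctx))
        if len(self.entries) == self.capacity:
            self.on_full(self.entries)  # runtime saves the contents
            self.entries = []           # table starts filling again


saved = []  # stands in for memory or disk
table = CommunicationTable(capacity=2, on_full=saved.extend)
table.record(0xB, "LocWr", 0xC, "RemWr,RemWr")
table.record(0xA, "", 0xD, "RemWr,LocRd")
table.record(0xB, "LocWr", 0xC, "RemWr,RemWr")
print(len(saved), len(table.entries))
```

Draining only when the table is full keeps the common case, recording one entry, free of software involvement.
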
Architectural Support: Example
One processor executes:
    0xA: str = getStr();
    0xB: len = getLen();
Another processor executes:
    0xC: int l = len;
    0xD: string s = str;
[Figure: step-by-step animation on Processors 1 and 2 of the Communication Table being filled. Table columns: Producer (Inst. Addr, Context) and Consumer (Inst. Addr, Context). A read reply (Rd Rep) carries 0xB / Loc Wr, and the final frame shows an entry with 0xB / Loc Wr and 0xC / Rem Wr, Rem Wr.]

Labeled Graph Debugging
• Starting with a bug report or buggy behavior, collect graphs from many runs, labeling each as buggy (✗) or correct (✓)
• Find edges that appear in any buggy graph and in no correct graph
• Rank the resulting edges, giving high rank to:
    • Rare communication events
    • Communication in a rare context

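The ranking step could be sketched as follows. The slides do not give a ranking formula, so the sort key here is an assumption: suspects are ordered first by how many runs contain the same instruction pair regardless of context, then by how many runs contain the exact context-aware edge. Edges are assumed to have the form ((writer inst, writer ctx), (reader inst, reader ctx)), and all values are hypothetical.

```python
from collections import Counter


def rank_suspects(buggy_graphs, correct_graphs):
    """Rank buggy-only edges, rarest instruction pair and rarest context first."""
    runs = buggy_graphs + correct_graphs
    edge_freq = Counter(e for g in runs for e in g)
    pair_freq = Counter(p for g in runs for p in {(s[0], d[0]) for s, d in g})
    suspects = set().union(*buggy_graphs) - set().union(*correct_graphs)
    return sorted(
        suspects,
        key=lambda e: (pair_freq[(e[0][0], e[1][0])], edge_freq[e], e),
    )


common = (("B", "LocWr"), ("C", "RemWr"))        # appears in every run
rare_context = (("B", "LocWr"), ("C", "LocRd"))  # usual pair, unusual context
rare_pair = (("X", "LocWr"), ("Y", "RemWr"))     # pair seen in one run only

correct_runs = [{common}, {common}]
buggy_runs = [{common, rare_context, rare_pair}, {common, rare_context}]

ranked = rank_suspects(buggy_runs, correct_runs)
print(ranked)
```

With these inputs the edge whose instruction pair is itself rare comes first, followed by the edge that joins a common pair in a rare context.
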
Anomaly-Based Detection
The Bugs-As-Anomalies Hypothesis: programs usually work correctly, hence bugs are anomalies. By looking for anomalies, we are apt to find bugs.
• Likely bugs are low-frequency communication events
• Fully automatic detection: no labeling required!
[Figure: communication events plotted by frequency]

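Without labels, detection reduces to counting how often each edge appears across runs and reporting the rarest ones. The sketch below assumes each graph is a set of edges and uses a hypothetical cutoff (edges present in at most 10% of runs); neither the cutoff nor the edge values come from the slides.

```python
from collections import Counter


def rare_edges(graphs, max_fraction=0.1):
    """Return edges present in at most max_fraction of runs, rarest first."""
    freq = Counter(e for g in graphs for e in g)
    cutoff = max_fraction * len(graphs)
    rare = [e for e in freq if freq[e] <= cutoff]
    return sorted(rare, key=lambda e: (freq[e], e))


# Hypothetical data: 19 ordinary runs and one run with an extra edge.
usual = {("A", "D"), ("B", "C")}
graphs = [set(usual) for _ in range(19)] + [usual | {("B", "E")}]

anomalies = rare_edges(graphs)
print(anomalies)
```

No run needs to be marked buggy or correct; the extra edge stands out purely because it appears in one run out of twenty.
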
Bug Detection Capability
[Figure: bar chart of the average number of inspections required to find a known bug (axis from 0 to 10), comparing Labeled Graph Debugging with Anomaly-Based Detection. Benchmarks: BankAcct, CircularList, Log & Sweep, Multi-Order, Moz-jsStr, Moz-jsInterp, Moz-macNetIO, Moz-TxtFrame, MySQL-IDInit, MySQL-BinLog, Apache-LogSz, PBZip2-Order, AGet-MultVa, grouped as Full Applications, Bug Kernels, and Synthetic Bugs. Bars that exceed the axis are labeled 34.0, 19.2, 12.0, 80.2, and 14.5.]

Conclusions
• Bugaboo: general concurrency bug detection
• Context-aware communication graphs make general detection possible
• Architectural support makes graph collection efficient
• Our results show that Bugaboo efficiently guides developers to concurrency bugs

Lots More in our Paper!
• Post-deployment uses
• A software-only implementation
• Debugging case study
• Detailed graph characterization
• Sensitivity analysis of detection and collection