
Systematic Architecture Level Fault Diagnosis Using Statistical Techniques

Fabian Keller
November 11, 2014

In the past, various spectrum-based fault localization (SBFL) algorithms have been developed to pinpoint a fault location given a set of failing and passing test executions. Most of these algorithms use similarity coefficients and have only been evaluated on established benchmark programs such as the Siemens set or the space program from the Software-artifact Infrastructure Repository. Moreover, SBFL has not yet been adopted by developers in practice. This study evaluates the feasibility of applying SBFL to a real-world project, namely AspectJ. From an initial set of 110 manually classified faulty versions, a maximum of seven bugs can be found after examining the 1000 most suspicious lines produced by various SBFL techniques. To explain this result, the influence of program size is examined using different metrics and evaluations. In general, program size has a slight influence on some metrics, but it is not the primary explanation for the results. The results seem to originate from the metrics currently used throughout the research community to assess SBFL performance. The study showcases the limitations of SBFL with the help of different performance metrics and the insights gained during manual classification. Moreover, additional performance metrics that are better suited to evaluating fault localization performance are proposed.

Transcript

  1. Estimated Costs 2012 as reported by Britton et al. [2013]
  2. Agenda: 1. Automated Fault Diagnosis, 2. State of the Art, 3. Case Study: AspectJ, 4. Evaluation, 5. Conclusions
  3. Agenda: 1. Automated Fault Diagnosis, 2. State of the Art, 3. Case Study: AspectJ, 4. Evaluation, 5. Conclusions
  4. Fault Diagnosis: what is the current practice? Goal: pinpoint one or more failures. Commonly used techniques: System.out.println(), symbolic debugging, static/dynamic slicing. There is room for improvement!
  5. Automated Fault Diagnosis: is it possible? Program spectrum (one row per test; 1 = block executed, Error = 1 means the test failed):
             B1  B2  B3  B4  B5  Error
     Test1    1   0   0   0   0    0
     Test2    1   1   0   0   0    0
     Test3    1   1   1   1   1    0
     Test4    1   1   1   1   1    0
     Test5    1   1   1   1   1    1
     Test6    1   1   1   0   1    0
     By intuition, a block is more suspicious if it is involved in failing test cases and not involved in passing test cases.
  6. Ranking Metrics: it is possible. With IF/NF denoting the number of failing test cases that do/do not involve a block, and IP/NP the same for passing test cases:
     Tarantula = (IF / (IF + NF)) / (IF / (IF + NF) + IP / (IP + NP))
     Jaccard   = IF / (IF + NF + IP)
     Ochiai    = IF / sqrt((IF + NF) * (IF + IP))
     Applied to the spectrum from the previous slide (see also the sketch after the transcript):
               B1    B2    B3    B4    B5
     Tarantula 0.50  0.56  0.63  0.71  0.63
     Jaccard   0.17  0.20  0.25  0.33  0.25
     Ochiai    0.41  0.45  0.50  0.58  0.50
     Ranking: 1. B4, 2. B3 and B5, 3. B2, 4. B1
  7. Agenda: 1. Automated Fault Diagnosis, 2. State of the Art, 3. Case Study: AspectJ, 4. Evaluation, 5. Conclusions
  8. Commonly Used Data and its limiting factors: the Software-artifact Infrastructure Repository (Siemens set and space program).
     Program        Faulty versions   LOC    Test cases   Description
     print_tokens    7                 478    4130        Lexical analyzer
     print_tokens2   10                399    4115        Lexical analyzer
     replace         32                512    5542        Pattern recognition
     schedule        9                 292    2650        Priority scheduler
     schedule2       10                301    2710        Priority scheduler
     tcas            41                141    1608        Altitude separation
     tot_info        23                440    1052        Information measure
     space           38                6218   13585       Array definition language
  9. Performance Metrics: how can fault localization performance be evaluated?
     • Wasted Effort (WE): the number of elements inspected in vain before the fault is reached. Example ranking: L4, L3, L2, L7, L6, L1, L5, L9, L10, L8; wasted effort for the prominent bug: 2 (or 20%).
     • Proportion of Bugs Localized (PBL): the percentage of bugs localized with WE < p%.
     • Hit@X: the number of bugs localized after inspecting X elements.
     (A small sketch of WE and Hit@X follows the transcript.)
  10. Agenda: 1. Automated Fault Diagnosis, 2. State of the Art, 3. Case Study: AspectJ, 4. Evaluation, 5. Conclusions
  11. AspectJ – Lines of Code: nearly doubled in the examined time span.
  12. AspectJ – Commits: active development, with mostly 50+ commits per month.
  13. AspectJ – Bugs: nearly 2500 bugs reported in the examined time span.
  14. AspectJ – Data: less than 40% of the investigated bugs are applicable for SBFL.
                        AspectJ  AJDT   Sum
      All bugs           1544     886   2430
      Bugs in iBugs       285      65    350
      Classified bugs      99      11    110
      Applicable bugs      41       1     42
      Involved bugs        20       1     21
      What happened?
  15. Bug 36234: workarounds cannot be used as an evaluation oracle. Bug report: "Getting an out of memory error when compiling with Ajc 1.1 RC1 […]" (pre-fix and post-fix code shown on the slide).
  16. Bug 61411: platform-specific bugs are mostly not present in test suites. Bug report: "[…] highlights a problem that I've seen using ajdoc.bat on Windows […]" (pre-fix and post-fix code shown on the slide).
  17. Bug 151182: synchronization bugs are mostly not present in test suites. Bug report: "[…] recompiled the aspect using 1.5.2 and tried to run it […], but it fails with a NullPointerException. […]" (pre-fix and post-fix code shown on the slide).
  18. Agenda: 1. Automated Fault Diagnosis, 2. State of the Art, 3. Case Study: AspectJ, 4. Evaluation, 5. Conclusions
  19. Research Questions
      • RQ1: How does the program size influence fault localization performance?
      • RQ2: How many bugs can be found when examining a fixed number of ranked elements?
      • RQ3: How does the program size influence the suspiciousness scores produced by different ranking metrics?
      • RQ4: Are the fault localization performance metrics currently used by the research community valid?
  20. RQ1: Program Size vs. SBFL Performance? Multiple ranked elements are mapped to the same suspiciousness score.
  21. RQ4: Are the Performance Metrics Valid? On average, no bugs can be found in the first 100 lines.
  22. RQ4: Are the Performance Metrics Valid? With luck, 33% of all bugs can be found in the first 1000 lines.
  23. Agenda: 1. Automated Fault Diagnosis, 2. State of the Art, 3. Case Study: AspectJ, 4. Evaluation, 5. Conclusions
  24. Conclusions: there is still some work to be done.
      • Bugs need more context to be fully understood.
      • Current metrics cannot be applied to large projects.
      • SBFL is not feasible for large projects.
      • The new metrics are a starting point for future work.
  25. RQ2: Examining a fixed number of ranked elements: more than 100 files must be inspected to find 50% of all bugs.
  26. RQ3: Program Size vs. Suspiciousness: the mean suspiciousness drops for larger programs.
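
As a companion to slides 5 and 6, here is a minimal, self-contained sketch of how the three ranking metrics could be computed for the toy spectrum shown there. It is an illustration only, not the STARDUST implementation; the class name is made up, and the toy data avoids the degenerate cases (e.g. blocks never executed by any test) that a real implementation would have to guard against.

    // Computes Tarantula, Jaccard, and Ochiai suspiciousness scores for the
    // toy program spectrum from slides 5 and 6 (five blocks, six tests).
    public class SbflRankingSketch {

        public static void main(String[] args) {
            // Rows = Test1..Test6, columns = B1..B5; 1 = block executed in that test.
            int[][] spectrum = {
                {1, 0, 0, 0, 0},
                {1, 1, 0, 0, 0},
                {1, 1, 1, 1, 1},
                {1, 1, 1, 1, 1},
                {1, 1, 1, 1, 1},
                {1, 1, 1, 0, 1},
            };
            // 1 = test failed; only Test5 fails in the example.
            int[] error = {0, 0, 0, 0, 1, 0};

            for (int b = 0; b < spectrum[0].length; b++) {
                // IF/NF: failing tests that do/do not involve block b,
                // IP/NP: passing tests that do/do not involve block b.
                int IF = 0, NF = 0, IP = 0, NP = 0;
                for (int t = 0; t < spectrum.length; t++) {
                    boolean involved = spectrum[t][b] == 1;
                    boolean failed = error[t] == 1;
                    if (failed) { if (involved) IF++; else NF++; }
                    else        { if (involved) IP++; else NP++; }
                }
                double failRate = (double) IF / (IF + NF);
                double passRate = (double) IP / (IP + NP);
                double tarantula = failRate / (failRate + passRate);
                double jaccard = (double) IF / (IF + NF + IP);
                double ochiai = IF / Math.sqrt((double) (IF + NF) * (IF + IP));
                System.out.printf("B%d  Tarantula=%.2f  Jaccard=%.2f  Ochiai=%.2f%n",
                        b + 1, tarantula, jaccard, ochiai);
            }
        }
    }

Sorting the blocks by any of the three scores in descending order reproduces the ranking from slide 6: B4 first, then B3 and B5 tied, then B2, then B1.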
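
Similarly, a small sketch of the Wasted Effort and Hit@X performance metrics from slide 9, under the common reading that wasted effort counts the elements inspected in vain before the faulty element is reached; the faulty line (L2) and the class and method names are assumptions made for the sake of the example.

    import java.util.Arrays;
    import java.util.List;

    // Illustrates Wasted Effort (WE) and Hit@X on the example ranking from slide 9.
    public class SbflPerformanceSketch {

        // WE = number of elements ranked above the faulty element.
        static int wastedEffort(List<String> ranking, String faultyElement) {
            int position = ranking.indexOf(faultyElement);
            // If the fault does not appear in the ranking at all, charge maximal effort.
            return position < 0 ? ranking.size() : position;
        }

        // A bug counts as localized within budget X if fewer than X elements
        // must be inspected before reaching it.
        static boolean hitAtX(List<String> ranking, String faultyElement, int x) {
            return wastedEffort(ranking, faultyElement) < x;
        }

        public static void main(String[] args) {
            // Example ranking from slide 9; L2 is assumed to be the faulty line,
            // which yields the wasted effort of 2 (20% of 10 elements) quoted there.
            List<String> ranking = Arrays.asList(
                    "L4", "L3", "L2", "L7", "L6", "L1", "L5", "L9", "L10", "L8");
            int we = wastedEffort(ranking, "L2");
            System.out.printf("Wasted effort: %d (%.0f%%)%n",
                    we, 100.0 * we / ranking.size());
            System.out.println("Hit@5: " + hitAtX(ranking, "L2", 5));
        }
    }

The Proportion of Bugs Localized (PBL) from the same slide would then be the fraction of bugs in a data set whose relative wasted effort stays below the chosen threshold p%.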