TMPA-2017: 5W+1H Static Analysis Report Quality Measure

TMPA-2017: Tools and Methods of Program Analysis
3-4 March, 2017, Hotel Holiday Inn Moscow Vinogradovo, Moscow

5W+1H Static Analysis Report Quality Measure
Maxim Menshchikov, Timur Lepikhin, Oktetlabs

For video follow the link: https://youtu.be/bjW6_rMCZB8

Would you like to know more?
Visit our websites:
www.tmpaconf.org
www.exactprosystems.com/events/tmpa

Follow us:
https://www.linkedin.com/company/exactpro-systems-llc?trk=biz-companies-cym
https://twitter.com/exactpro

Exactpro

March 23, 2017

Transcript

  1. 5W+1H static analysis report quality measure

    Maxim Menshchikov, Timur Lepikhin
    Saint Petersburg State University, OKTET Labs
    March 3, 2017
  2. Authors

    Maxim Menshchikov: student, Saint Petersburg State University; software engineer at OKTET Labs.
    Timur Lepikhin: Candidate of Sciences, Associate Professor, Saint Petersburg State University.
  3. Static analysis quality evaluation

    How is quality usually evaluated?
    1. Precision: $PPV = \frac{TP}{TP + FP}$
    2. Recall: $TPR = \frac{TP}{TP + FN}$
    3. F1 (F-measure): $F_1 = \frac{2TP}{2TP + FP + FN}$
  4. Static analysis quality evaluation

    How is quality usually evaluated? (continued)
    4. False-positive rate: $FPR = \frac{FP}{FP + TN}$
    5. Accuracy: $ACC = \frac{TP + TN}{P + N}$
    6. ...
    What is missing from these measures?
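As a quick illustration of the five measures listed on slides 3 and 4, here is a minimal Python sketch. The function name `classic_measures` and the sample counts are hypothetical, chosen only to exercise the formulas; the sketch assumes every denominator is non-zero and uses $P = TP + FN$, $N = FP + TN$.

```python
def classic_measures(tp, fp, fn, tn):
    """Return the classic confusion-matrix measures as a dict.

    Assumes all denominators are non-zero (no guard for empty classes).
    """
    return {
        "PPV": tp / (tp + fp),                   # precision
        "TPR": tp / (tp + fn),                   # recall
        "F1": 2 * tp / (2 * tp + fp + fn),       # F-measure
        "FPR": fp / (fp + tn),                   # false-positive rate
        "ACC": (tp + tn) / (tp + fp + fn + tn),  # accuracy, P + N in the denominator
    }


# Hypothetical counts, not taken from the talk.
measures = classic_measures(tp=8, fp=2, fn=4, tn=6)
for name, value in measures.items():
    print(f"{name} = {value:.3f}")
```

None of these values depends on what a report actually says, which is the gap the following slides address.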
  5. Missing pieces

    • Informational quality of messages: how good and informative is the message?
    • Generalization of reports: a report about an error can be either positive or negative (“Error in line x” versus “No error in line x”).
    • Error class identification (not always missing): reports can relate to the same problem or point of interest in the code, and should be combined accordingly.
    • Utility support: not every tested utility supports every kind of report.
  6. The input

    Consider the following code sample:

        #include <stdio.h>
        int main()
        {
            int input;
            if (scanf("%d", &input) == 1)
            {
                if (input == 2)
                {
                    int *a;
                    int *n = a;
                    a = n;
                    *n = 5;
                }
                else
                {
                    printf("OK\n");
                }
            }
            return 0;
        }
  7. The output

    Clang 3.9

        main.cpp:10:13: warning: Assigned value is garbage or undefined: int *n = a;
        main.cpp:5:5: note: Taking true branch: if (scanf("%d", &input) == 1)
        main.cpp:7:13: note: Assuming 'input' is equal to 2: if (input == 2)
        main.cpp:7:9: note: Taking true branch: if (input == 2)
        main.cpp:9:13: note: 'a' declared without an initial value: int *a;
        main.cpp:10:13: note: Assigned value is garbage or undefined: int *n = a;
        main.cpp:11:13: warning: Value stored to 'a' is never read: a = n;
        main.cpp:11:13: note: Value stored to 'a' is never read: a = n;
  8. The output

    cppcheck 1.76

        [main.cpp:12]: (style) Variable 'a' is assigned a value that is never used.
        [main.cpp:10]: (error) Uninitialized variable: a
  9. The difference

    1. Clang shows which conditions must be met to encounter the bug.
    2. Clang shows the text of the source line, while cppcheck shows only the file and line number.
    Both reports would be “correct” under all of the previous measures, so they would count as equal contributions to the result.
  10. 5W+1H

    The “5Ws” are actively used in journalism and natural language processing. They are sometimes referred to as “5W+1H”, where “H” denotes “How?”.
    • What?
    • When?
    • Where?
    • Who?
    • Why?
    • How?
  11. 5W+1H

    We suggest rephrasing the sixth question as “How to fix?”
    • What? The error and its consequences: what will happen if the error occurs.
    • When? The conditions under which it happens.
    • Where? Source code line number, module name.
    • Who? Who wrote this line?
    • Why? A more or less formal reason why the code was treated as an error.
    • How to fix? The ways to fix the problem.
  12. How it applies to the previous code sample

    Question   Clang                                Cppcheck
    What?      Assigned value is garbage            Uninitialized variable: a
    Who?       -                                    -
    Where?     lines 5-10                           line 10
    When?      scanf(...) == 1, input == 2          -
    Why?       'a' declared without initial value   -
    How?       -                                    -
  13. 5W+1H

    • It is hard to prove its completeness. (Do you have a counter-example?)
  14. 5W+1H

    • It is hard to prove its completeness. (Do you have a counter-example?)
    • Some way to evaluate reports is still needed.
    • You can always choose the most suitable question to associate a piece of report information with.
  15. Generalization of reports

    Factual error   Report correctness   Result kind     Usefulness
    No              Indeterminate [2]    Indeterminate   Yes
    No              Correct              Positive        No [3]
    No              Correct              Negative        Yes
    No              Incorrect            Positive        No
    No              Incorrect            Negative        No
    Yes             Indeterminate        Indeterminate   No
    Yes             Correct              Positive        Yes
    Yes             Correct              Negative        Yes
    Yes             Incorrect            Positive        No
    Yes             Incorrect            Negative        No

    [2] Or rather, missing.
    [3] Something strange.
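The usefulness table on slide 15 can be encoded as a plain lookup. The sketch below is illustrative only: it assumes the four columns read as (factual error present, report correctness, result kind, usefulness), and the names `USEFULNESS` and `is_useful` are hypothetical.

```python
# Keys: (factual_error_present, report_correctness, result_kind) -> useful?
# Rows are copied from the table; the column reading is an assumption.
USEFULNESS = {
    (False, "indeterminate", "indeterminate"): True,   # no error, no report
    (False, "correct", "positive"): False,             # "something strange"
    (False, "correct", "negative"): True,
    (False, "incorrect", "positive"): False,
    (False, "incorrect", "negative"): False,
    (True, "indeterminate", "indeterminate"): False,   # error present, report missing
    (True, "correct", "positive"): True,
    (True, "correct", "negative"): True,
    (True, "incorrect", "positive"): False,
    (True, "incorrect", "negative"): False,
}


def is_useful(error_present, correctness, kind):
    """Look up the usefulness of a report; raises KeyError for unlisted rows."""
    return USEFULNESS[(error_present, correctness.lower(), kind.lower())]


print(is_useful(True, "Correct", "Positive"))
print(is_useful(True, "Indeterminate", "Indeterminate"))
```

A lookup table keeps the encoding faithful to the slide: no rule is inferred beyond the ten listed rows.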
  16. Report classes

    A report class is an infinite set of reports that are equal from the end user’s point of view. Let’s group reports by their answers to the following questions:
    • Why?
    • What?
    • Where?
  17. Maths: propagate report classes

    Consider the surjective function $f$ that maps each report in the set $R$ to its class in the set of unique classes $R'$:
    $f : R \to R', \quad r \in R$
    From here on, $R$ is used as an alias for $R'$.
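The mapping from reports to report classes can be sketched as a grouping by the answers to “Why?”, “What?” and “Where?”. This is a minimal illustration, not the authors' implementation: the dictionary layout, the names `class_key` and `group_into_classes`, and the sample reports are hypothetical, and the sketch assumes the answers from different tools have already been normalized to comparable values.

```python
from collections import defaultdict


def class_key(report):
    """Map a report to its class, identified by (why, what, where)."""
    return (report["why"], report["what"], report["where"])


def group_into_classes(reports):
    """Group reports that share the same class key."""
    classes = defaultdict(list)
    for report in reports:
        classes[class_key(report)].append(report)
    return dict(classes)


# Hypothetical, already-normalized reports from two tools A and B.
reports = [
    {"tool": "A", "why": "uninitialized", "what": "garbage value", "where": ("main.cpp", 10)},
    {"tool": "B", "why": "uninitialized", "what": "garbage value", "where": ("main.cpp", 10)},
    {"tool": "A", "why": "dead store", "what": "value never read", "where": ("main.cpp", 11)},
]
classes = group_into_classes(reports)
print(len(classes))
```

Every class produced this way has at least one report mapped to it, which is what makes the mapping surjective onto the set of classes.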
  18. Maths: introduce weights

    Consider the set of questions: {What, When, Where, Who, Why, HowToFix}.
    Let $W = \{w_1, w_2, \ldots, w_6\}$ be the set of answer weights for questions 1-6, respectively.
    Then the following mapping can be applied (make your own mapping to suit the needs of your test):
    $W = \{0.2, 0.15, 0.1, 0.05, 0.2, 0.3\}$
  19. Maths: introduce weights, pt. 2

    Let $I$ be the informational quality of the message and $A = \{a_1, a_2, \ldots, a_6\}$ be the set of answer qualities, where $a_i \in [0, 1]$, $i = 1..6$.
    $I = \sum_{i=1}^{6} w_i \cdot a_i$    (1)
    Let $I_{max}$ be the maximal informational quality across $m$ utilities:
    $I_{max} = \sum_{i=1}^{6} w_i \cdot \max_j a_{ij}, \quad j \in 1..m$    (2)
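Equations (1) and (2) can be sketched in a few lines of Python. The weights are the example mapping from slide 18. The answer qualities below are an assumption made for illustration: 1.0 where the slide 12 table shows an answer for the sample code, 0.0 where it shows none. The function names are hypothetical.

```python
QUESTIONS = ("What", "When", "Where", "Who", "Why", "HowToFix")
WEIGHTS = (0.2, 0.15, 0.1, 0.05, 0.2, 0.3)  # example mapping from slide 18


def informational_quality(answers, weights=WEIGHTS):
    """Equation (1): weighted sum of answer qualities a_i in [0, 1]."""
    if len(answers) != len(weights):
        raise ValueError("one answer quality is required per question")
    return sum(w * a for w, a in zip(weights, answers))


def max_informational_quality(answers_by_utility, weights=WEIGHTS):
    """Equation (2): take the best answer per question across utilities."""
    best = [max(column) for column in zip(*answers_by_utility)]
    return informational_quality(best, weights)


# Assumed qualities, in QUESTIONS order (1.0 = answered, 0.0 = not answered).
clang = (1.0, 1.0, 1.0, 0.0, 1.0, 0.0)     # What, When, Where, Why
cppcheck = (1.0, 0.0, 1.0, 0.0, 0.0, 0.0)  # What, Where

i_clang = informational_quality(clang)
i_cppcheck = informational_quality(cppcheck)
i_max = max_informational_quality([clang, cppcheck])
print(round(i_clang, 2), round(i_cppcheck, 2), round(i_max, 2))
```

Under these assumed qualities the two reports, which the classic measures treat as equal, receive different scores.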
  20. Maths: introduce weights, pt. 3

    With $I_{max}$ defined, the sum over all $n$ reports is:
    $S_R = \sum_{i=1}^{n} I_{max_i}$    (3)
    where $I_{max_i}$ is $I_{max}$ for the $i$-th report.
  21. Maths: introduce weights, pt. 4

    Let $m \in \mathbb{N}$ be the number of tested static analyzers. Utility support for the $i$-th report can be abstractly represented as:
    $u_{ij} \in U_i, \quad j = 1..m, \quad i = 1..n, \quad u_{ij} \in \{0, 1\}$    (4)
    where $u_{ij}$ is a boolean value indicating whether the $j$-th utility supports the error type underlying the $i$-th report. With that, we can find the sum of all reports for the $j$-th utility, taking utility support into account:
    $S_j = \sum_{i=1}^{n} I_{ij} \cdot \prod_{k=1}^{m} u_{ik}$    (5)
  22. Maths: “IQ” measure

    We can calculate the informational quality measure for the $j$-th utility:
    $S_{norm_j} = \frac{S_j}{S_R}$    (6)
    We call this measure IQ (Informational Quality).
    TPI includes only true positives. FPI includes false positives, with the informational value taken into account.
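Equations (3), (5) and (6) can be combined into one small routine. This is a sketch under a stated assumption: the support factor in equation (5) is read as a product over all utilities, so a report contributes only when every tested utility supports its error type. The function name `iq_scores` and all numbers are hypothetical.

```python
def iq_scores(quality, i_max, support):
    """Return the normalized score S_j / S_R for every utility j.

    quality[i][j]: informational quality I of report i from utility j
    i_max[i]:      maximal informational quality of report i
    support[i][j]: 1 if utility j supports report i's error type, else 0
    """
    s_r = sum(i_max)  # equation (3)
    n_utilities = len(quality[0])
    scores = []
    for j in range(n_utilities):
        # Equation (5), assuming the support factor is a product over utilities.
        s_j = sum(
            row[j] for row, supported in zip(quality, support) if all(supported)
        )
        scores.append(s_j / s_r)  # equation (6)
    return scores


# Hypothetical data: three reports, two utilities.
quality = [[0.65, 0.3], [0.5, 0.3], [0.65, 0.0]]
i_max = [0.65, 0.5, 0.65]
support = [[1, 1], [1, 1], [1, 0]]  # the third report is unsupported by utility 2

scores = iq_scores(quality, i_max, support)
print([round(s, 3) for s in scores])
```

Note that, following equation (3) as written, the unsupported report still counts towards $S_R$; excluding such reports from the input beforehand avoids that.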
  23. What? Should I measure it manually?

    No.
    • You can make your own parsers, as we did.
    • Many reports look similar. You can evaluate them once and apply the score to all.
    • (It could have been easier if there were some kind of standardized output...)
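To give an idea of how small such a parser can be, here is a sketch that reads the cppcheck 1.76 line format shown on slide 8. It is not the authors' parser; the regular expression, the function name `parse_cppcheck` and the output field names are assumptions, and paths containing a colon are not handled.

```python
import re

# Matches lines such as: [main.cpp:10]: (error) Uninitialized variable: a
CPPCHECK_LINE = re.compile(
    r"^\[(?P<file>[^:\]]+):(?P<line>\d+)\]: \((?P<severity>\w+)\) (?P<message>.+)$"
)


def parse_cppcheck(output):
    """Extract the 'Where?' and 'What?' answers from cppcheck text output."""
    reports = []
    for raw in output.splitlines():
        match = CPPCHECK_LINE.match(raw.strip())
        if match:  # silently skip lines in any other format
            reports.append({
                "where": (match["file"], int(match["line"])),
                "severity": match["severity"],
                "what": match["message"],
            })
    return reports


sample = (
    "[main.cpp:12]: (style) Variable 'a' is assigned a value that is never used.\n"
    "[main.cpp:10]: (error) Uninitialized variable: a"
)
reports = parse_cppcheck(sample)
for report in reports:
    print(report)
```

Because all reports of one kind share the same message shape, a score assigned once per message pattern can be reused for every matching report.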
  24. Real world testing

    We tested the measure on the Toyota ITC benchmarks (https://github.com/mmenshchikov/itc-benchmarks).
    • Clang 3.9, cppcheck 1.76, Frama-C Silicon, PVS-Studio (Linux) and ReSharper were tested.
    • The original benchmark was forked, errors were patched, and limited Win32 support was added.
    • We created many quick, five-minute parsers capable of reading the output we got. They cannot be applied to all outputs.
    • pthread tests were excluded from the comparison, as not all utilities support them.
    • We checked generic report informativeness.
    • All measures were calculated and analyzed.
    • The hypothesis: the measure is different from the Precision, Recall and F1 scores.
  25. Test methodology

    • Prepared the Toyota ITC benchmarks (https://github.com/mmenshchikov/itc-benchmarks).
    • Coded parsers for all tested utilities (https://github.com/mmenshchikov/sa_parsers).
    • Prepared scripts to do the comparison (https://github.com/mmenshchikov/sa_comparison_003) and verify the results, except for the parts that cannot be automated.
    • Scripts check only the lines that carry special comments from Toyota.
    • Reports were semi-automatically checked for correctness.
    • Report quality was evaluated manually, applying the same score to similar reports (which takes very little time).
    • The hypothesis was evaluated using a t-test.
  26. Results: Informativeness

    Question      Clang    cppcheck   Frama-C   PVS      RS
    What?         100%     100%       100%      100%     100%
    When?         97.41%   0%         100%      0%       0%
    Where?        100%     100%       100%      100%     100%
    Who?          0%       0%         0%        0%       0%
    Why?          35.78%   0%         99.77%    48.46%   0%
    How to fix?   0%       0%         0%        17.15%   38.27%

    RS: ReSharper C++.
  27. Results: IQ

    Utility    IQ      TPI     TP    FPI    FP   PPV     TPR     F1
    Clang      0.52    57.75   111   1.55   3    0.974   0.183   0.308
    Cppcheck   0.3     30      100   0.6    2    0.98    0.165   0.282
    Frama-C    0.649   196.1   302   57.2   88   0.774   0.498   0.606
    PVS        0.459   53.67   117   4.32   12   0.907   0.193   0.318
    RS         -       -       -     -      -    -       -       -

    PPV: precision. TPR: recall. RS: ReSharper, which was excluded because it found “other” defects, although we considered it general-purpose from the beginning.
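The columns of the IQ results table can be cross-checked with a few lines of arithmetic. The sketch below recomputes PPV from TP and FP, and F1 from the printed PPV and TPR. It also computes TPI divided by TP, which in this table happens to match the printed IQ to within rounding; that is an observation about these numbers, not a definition taken from the slides. The names `TABLE` and `recompute` are hypothetical.

```python
# Values copied from the IQ results table (ReSharper has no data and is omitted).
TABLE = {
    "Clang":    dict(iq=0.52,  tpi=57.75, tp=111, fp=3,  ppv=0.974, tpr=0.183, f1=0.308),
    "Cppcheck": dict(iq=0.3,   tpi=30.0,  tp=100, fp=2,  ppv=0.98,  tpr=0.165, f1=0.282),
    "Frama-C":  dict(iq=0.649, tpi=196.1, tp=302, fp=88, ppv=0.774, tpr=0.498, f1=0.606),
    "PVS":      dict(iq=0.459, tpi=53.67, tp=117, fp=12, ppv=0.907, tpr=0.193, f1=0.318),
}


def recompute(row):
    """Recompute the derived columns of one table row."""
    return {
        "ppv": row["tp"] / (row["tp"] + row["fp"]),
        # Harmonic mean of the printed (rounded) precision and recall.
        "f1": 2 * row["ppv"] * row["tpr"] / (row["ppv"] + row["tpr"]),
        "tpi_per_tp": row["tpi"] / row["tp"],
    }


RECOMPUTED = {name: recompute(row) for name, row in TABLE.items()}
for name, values in RECOMPUTED.items():
    print(name, {key: round(value, 3) for key, value in values.items()})
```

The comparison also makes the contrast visible: Cppcheck has the highest precision of the four tools but the lowest IQ.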
  28. Results: dependency

    In this test we found a dependency between Precision (PPV) and IQ.
    • Utilities provide similar reports (the measures for reports are similar): test more utilities.
    • Emitted messages are only error-related, with no messages about error absence: include tools that also report bug absence. (Many developers ignored our requests for academic versions.)
    The test is not generally representative. We evaluated the informational values ourselves, which decreases the reliability of the results.
  29. What’s next

    You can use this information to improve your utilities:
    • Add answers to some of the questions (“Who?”, “When?”).
    • Explain decisions more formally (“Why?”).
    • Suggest fixes, if possible (“How to fix?”).
    How to improve the measure:
    • Prepare better-explained weights.
    How to improve the test:
    • Better rules, less automation.
    • A richer selection of tools.
  30. Verbosity

    • Good verbosity: more information on the analyzer’s decision. You can still filter out unneeded information.
    • Bad verbosity: many messages about the same error; a lot of “rubbish” messages that scatter the user’s attention.
  31. Who?

    It asks who wrote a bad line or made the most significant change to it.
    • svn blame? The information is too basic: for example, if a constant in a function invocation is wrong, you will not know for sure who is to blame.
    • Ethical aspects of blaming are out of scope here.
    You can use static analysis results to automatically create tasks in a bug tracker and assign them to the right person.
  32. 5Ws

    The term comes from journalism, natural language processing, problem-solving, etc. Something similar was mentioned by various philosophers and rhetoricians. It was taught in high-school journalism classes by 1917.