TMPA-2017: 5W+1H Static Analysis Report Quality Measure

5W+1H static analysis report quality measure
Maxim Menshchikov, Timur Lepikhin
March 3, 2017
Saint Petersburg State University, OKTET Labs

Authors
Maxim Menshchikov: Student, Saint Petersburg State University. Software Engineer at OKTET Labs.
Timur Lepikhin: Candidate of Sciences, Associate Professor, Saint Petersburg State University.

Static analysis quality evaluation
How is the quality usually evaluated?
1. Precision. $PPV = \frac{TP}{TP + FP}$
2. Recall. $TPR = \frac{TP}{TP + FN}$
3. F1 (F-measure). $F_1 = \frac{2TP}{2TP + FP + FN}$

Static analysis quality evaluation
How is the quality usually evaluated?
4. False-positive rate. $FPR = \frac{FP}{FP + TN}$
5. Accuracy. $ACC = \frac{TP + TN}{P + N}$
6. ...
What's missing in these measures?
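The measures listed so far all derive from the four confusion-matrix counts. A minimal Python sketch, using made-up counts that are not taken from any benchmark in this talk (the `Confusion` class and its method names are illustrative):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts for one analyzer (illustrative container)."""
    tp: int  # real defects that were reported
    fp: int  # reports on defect-free code
    tn: int  # defect-free sites with no report
    fn: int  # real defects that were missed

    def precision(self) -> float:  # PPV
        return self.tp / (self.tp + self.fp)

    def recall(self) -> float:  # TPR
        return self.tp / (self.tp + self.fn)

    def f1(self) -> float:
        return 2 * self.tp / (2 * self.tp + self.fp + self.fn)

    def false_positive_rate(self) -> float:  # FPR
        return self.fp / (self.fp + self.tn)

    def accuracy(self) -> float:  # ACC, with P = TP + FN and N = FP + TN
        return (self.tp + self.tn) / (self.tp + self.fn + self.fp + self.tn)


# Hypothetical counts, chosen only to exercise the formulas.
example = Confusion(tp=8, fp=2, tn=85, fn=5)
print(f"PPV={example.precision():.3f} TPR={example.recall():.3f} "
      f"F1={example.f1():.3f} FPR={example.false_positive_rate():.3f} "
      f"ACC={example.accuracy():.3f}")
```

None of these values changes when the text of a report becomes more or less informative; only the counts matter.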

Missing pieces
• Informational quality of messages. How good and informative is the message?
• Generalization of reports. Reports can be either positive or negative when talking about errors: “Error in line x”, “No error in line x”.
• Error class identification¹. Reports can relate to the same problem or point of interest in the code, and should be combined accordingly.
• Utility support. Not every tested utility supports every kind of report.
¹ Not always missing :)

The input
Consider the following code sample:

#include <stdio.h>
int main()
{
    int input;
    if (scanf("%d", &input) == 1)
    {
        if (input == 2)
        {
            int *a;
            int *n = a;
            a = n;
            *n = 5;
        }
        else
        {
            printf("OK\n");
        }
    }
    return 0;
}

The output
Clang 3.9
main.cpp:10:13: warning: Assigned value is garbage or undefined: int *n = a;
main.cpp:5:5: note: Taking true branch: if (scanf("%d", &input) == 1)
main.cpp:7:13: note: Assuming 'input' is equal to 2: if (input == 2)
main.cpp:7:9: note: Taking true branch: if (input == 2)
main.cpp:9:13: note: 'a' declared without an initial value: int *a;
main.cpp:10:13: note: Assigned value is garbage or undefined: int *n = a;
main.cpp:11:13: warning: Value stored to 'a' is never read: a = n;
main.cpp:11:13: note: Value stored to 'a' is never read: a = n;

The output
cppcheck 1.76
[main.cpp:12]: (style) Variable 'a' is assigned a value that is never used.
[main.cpp:10]: (error) Uninitialized variable: a

The difference
1. Clang shows which conditions must be met to encounter the bug.
2. Clang shows the source code line text, while cppcheck only shows the file and line number.
Both reports would be “correct” in the sense of all previous measures. They would be considered equal with respect to their contribution to the result.

5W+1H
The “5Ws” are actively used in journalism and natural language processing. Sometimes they are referred to as “5W+1H”, where “H” denotes “How?”.
• What?
• When?
• Where?
• Who?
• Why?
• How?

5W+1H
We suggest rephrasing the 6th question as “How to fix?”
• What? The error and its consequences: what will happen if the error occurs.
• When? Conditions under which it happens.
• Where? Source code line number, module name.
• Who? Who wrote this line?
• Why? A more or less formal reason why the error was treated as such.
• How to fix? The ways to fix the problem.

How it applies to the previous code sample
Question   Clang                                Cppcheck
What?      Assigned value is garbage            Uninitialized variable: a
Who?       -                                    -
Where?     lines 5-10                           line 10
When?      scanf(...) == 1, input == 2          -
Why?       'a' declared without initial value   -
How?       -                                    -
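The comparison for the code sample can be held in a small record with one optional field per question. A sketch in Python; the `Answers` type and its field names are illustrative and not part of any tool:

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Answers:
    """One report's answers to the six questions; None means not answered."""
    what: Optional[str] = None
    when: Optional[str] = None
    where: Optional[str] = None
    who: Optional[str] = None
    why: Optional[str] = None
    how_to_fix: Optional[str] = None

    def answered(self) -> list:
        """Names of the questions this report answers, in declaration order."""
        return [f.name for f in fields(self) if getattr(self, f.name) is not None]


# Values transcribed from the comparison of the two tools' output.
clang = Answers(
    what="Assigned value is garbage",
    when="scanf(...) == 1, input == 2",
    where="lines 5-10",
    why="'a' declared without initial value",
)
cppcheck = Answers(what="Uninitialized variable: a", where="line 10")

print("clang:", clang.answered())
print("cppcheck:", cppcheck.answered())
```

Clang answers four of the six questions for this sample and cppcheck two, a difference that precision and recall cannot express.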

5W+1H
• It is hard to prove its completeness. (Do you have a counter-example?)
• Some way to evaluate reports is still needed.
• You can always choose the most suitable question to associate report information with.

Generalization of reports
Factual error presence   Report correctness   Report result kind   Report usefulness
No                       Indeterminate²       Indeterminate        Yes
No                       Correct              Positive             No³
No                       Correct              Negative             Yes
No                       Incorrect            Positive             No
No                       Incorrect            Negative             No
Yes                      Indeterminate        Indeterminate        No
Yes                      Correct              Positive             Yes
Yes                      Correct              Negative             Yes
Yes                      Incorrect            Positive             No
Yes                      Incorrect            Negative             No
² Or rather missing
³ Something strange
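The generalization table is small enough to store as a lookup. A Python sketch with rows transcribed from the slide; the string labels and the `usefulness` helper are illustrative names:

```python
# (factual error present, report correctness, result kind) -> useful?
USEFULNESS = {
    (False, "indeterminate", "indeterminate"): True,   # no error and no report
    (False, "correct", "positive"): False,             # "something strange" on the slide
    (False, "correct", "negative"): True,              # "No error in line x", rightly
    (False, "incorrect", "positive"): False,
    (False, "incorrect", "negative"): False,
    (True, "indeterminate", "indeterminate"): False,   # an existing error goes unreported
    (True, "correct", "positive"): True,
    (True, "correct", "negative"): True,
    (True, "incorrect", "positive"): False,
    (True, "incorrect", "negative"): False,
}


def usefulness(error_present: bool, correctness: str, kind: str) -> bool:
    """Look up whether a report (or its absence) is useful to the end user."""
    return USEFULNESS[(error_present, correctness, kind)]


print(usefulness(True, "correct", "positive"))
print(usefulness(True, "indeterminate", "indeterminate"))
```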

Report classes
A report class is an infinite set of reports that are equal from the end user's point of view. Let's group reports by the answers to the following questions:
• Why?
• What?
• Where?

Maths: propagate report classes
Consider the surjective function $f$ that maps reports from the set $R$ onto the set of unique classes $R'$:
$f : R \to R', \quad r \in R$
We'll use $R$ as an alias for $R'$ later on.
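Applying the function amounts to grouping reports by a class key. A Python sketch, assuming the key is the (Why, What, Where) triple named on the report-classes slide and that the answers have already been normalized to comparable labels; the `Report` type, its fields and the sample labels are hypothetical:

```python
from collections import defaultdict
from typing import NamedTuple


class Report(NamedTuple):
    tool: str
    why: str
    what: str
    where: str  # e.g. "file:line"


def report_class(r: Report) -> tuple:
    """f(r): the class key shared by reports that look equal to the end user."""
    return (r.why, r.what, r.where)


def group_by_class(reports):
    """Collect reports per class; the keys form the set of unique classes."""
    classes = defaultdict(list)
    for r in reports:
        classes[report_class(r)].append(r)
    return dict(classes)


# Hypothetical, already-normalized reports from two tools.
reports = [
    Report("clang", "declared-without-value", "uninitialized-read", "main.cpp:10"),
    Report("cppcheck", "declared-without-value", "uninitialized-read", "main.cpp:10"),
    Report("clang", "dead-store", "unused-value", "main.cpp:11"),
]
classes = group_by_class(reports)
print(len(reports), "reports ->", len(classes), "classes")
```

Because every key in `classes` comes from at least one report, the mapping is surjective onto the set of classes by construction.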

Maths: introduce weights
Consider the set of questions: {What, When, Where, Who, Why, HowToFix}.
Let $W$ be a set of answer weights for questions 1-6, respectively:
$W = \{w_1, w_2, ..., w_6\}$
Then the following mapping can be applied⁴:
$W = \{0.2, 0.15, 0.1, 0.05, 0.2, 0.3\}$
⁴ Make your own mapping satisfying the needs of your test

Maths: introduce weights, pt.2
Let $I$ be the informational quality of the message and $A = \{a_1, a_2, ..., a_6\}$ be the set of answer qualities, where $a_i \in [0, 1]$, $i = 1..6$.
$I = \sum_{i=1}^{6} w_i \cdot a_i$ (1)
Let $I_{max}$ be the maximal informational quality across $m$ utilities, where $a_{ij}$ is the quality of the $j$-th utility's answer to question $i$.
$I_{max} = \sum_{i=1}^{6} w_i \cdot \max_j a_{ij}, \quad j \in 1..m$ (2)
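Equations (1) and (2) can be tried on the earlier Clang/cppcheck comparison. A Python sketch using the example weights from the slides; treating each answer quality as 1.0 when the question is answered and 0.0 otherwise is an assumption made here for illustration, and the function names are not from the talk:

```python
QUESTIONS = ("What", "When", "Where", "Who", "Why", "HowToFix")
WEIGHTS = (0.2, 0.15, 0.1, 0.05, 0.2, 0.3)  # example mapping from the slides


def informational_quality(answers, weights=WEIGHTS):
    """Equation (1): weighted sum of the answer qualities a_i in [0, 1]."""
    return sum(w * a for w, a in zip(weights, answers))


def max_informational_quality(answers_per_utility, weights=WEIGHTS):
    """Equation (2): weight the best answer to each question across utilities."""
    best = [max(column) for column in zip(*answers_per_utility)]
    return informational_quality(best, weights)


# Assumption: 1.0 if the question is answered for the code sample, else 0.0.
clang = (1.0, 1.0, 1.0, 0.0, 1.0, 0.0)     # What, When, Where, Why
cppcheck = (1.0, 0.0, 1.0, 0.0, 0.0, 0.0)  # What, Where

i_clang = informational_quality(clang)
i_cppcheck = informational_quality(cppcheck)
i_max = max_informational_quality([clang, cppcheck])
print(f"I(clang)={i_clang:.2f} I(cppcheck)={i_cppcheck:.2f} I_max={i_max:.2f}")
```

Note that the maximum is taken per question, so `i_max` can exceed every individual utility's score when different tools answer different questions.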

Maths: introduce weights, pt.3
Taking $I_{max}$ into account, we can find the sum over all $n$ reports, where $I_{max_i}$ is the maximal informational quality of the $i$-th report:
$S_R = \sum_{i=1}^{n} I_{max_i}$ (3)

Maths: introduce weights, pt.4
Let $m \in \mathbb{N}$ be the number of tested static analyzers. Utility support for the $i$-th report can be abstractly represented as:
$u_{ij} \in U_i, \quad j = 1..m, \quad i = 1..n, \quad u_{ij} \in \{0, 1\}$ (4)
where $u_{ij}$ is a boolean value indicating whether the $j$-th utility supports the error type underlying the $i$-th report. With that, we can find the sum over all reports for the $j$-th utility, taking utility support into account:
$S_j = \sum_{i=1}^{n} I_{ij} \cdot \prod_{k=1}^{m} u_{ik}$ (5)

Maths: “IQ” measure
We can calculate the informational quality measure for the $j$-th utility:
$S_{norm_j} = \frac{S_j}{S_R}$ (6)
We call this measure IQ (Informational Quality).
TPI only includes true positives. FPI includes false positives with the informational value taken into account.
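Equations (3), (5) and (6) combine into a short computation. A Python sketch on hypothetical data; it reads the support factor in (5) as a product over all utilities, so a report counts only when every tested utility supports its error type, which is an interpretation and not something the slides state explicitly:

```python
def sum_of_max(i_max):
    """Equation (3): S_R, the sum of I_max over all n reports."""
    return sum(i_max)


def utility_sum(quality, support, j):
    """Equation (5) for utility j.

    quality[i][j] is I_ij; support[i][k] is u_ik in {0, 1}.
    Assumption: report i counts only if all utilities support its error type.
    """
    return sum(row[j] if all(sup) else 0.0 for row, sup in zip(quality, support))


def iq(quality, support, i_max, j):
    """Equation (6): S_j normalized by S_R."""
    return utility_sum(quality, support, j) / sum_of_max(i_max)


# Hypothetical data: 3 reports (rows), 2 utilities (columns).
quality = [[0.65, 0.30], [0.50, 0.30], [0.65, 0.00]]
i_max = [0.65, 0.50, 0.65]
support = [[1, 1], [1, 1], [1, 0]]  # utility 1 lacks support for the third report

for j in range(2):
    print(f"utility {j}: IQ = {iq(quality, support, i_max, j):.3f}")
```

With this reading, the third report contributes to neither utility's sum, mirroring the way unsupported test groups are left out of a comparison.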

What? Should I measure it manually?
No.
• You can make your own parsers, as we did.
• Many reports look similar. You can evaluate them once and apply the score to all.
• (It could have been easier if there were some kind of standardized output...)
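A parser of this kind can be a couple of regular expressions. The Python sketch below is not the authors' parser; its patterns are written only against the two output samples shown earlier in this talk (Clang 3.9 and cppcheck 1.76), and other versions or options may print different formats. The `Report` type and `parse_line` are illustrative names:

```python
import re
from typing import NamedTuple, Optional


class Report(NamedTuple):
    tool: str
    file: str
    line: int
    column: Optional[int]  # cppcheck's sample output carries no column
    severity: str
    message: str


CLANG_RE = re.compile(
    r"^(?P<file>[^:\s]+):(?P<line>\d+):(?P<column>\d+): "
    r"(?P<severity>warning|note|error): (?P<message>.+)$"
)
CPPCHECK_RE = re.compile(
    r"^\[(?P<file>[^:\]]+):(?P<line>\d+)\]: \((?P<severity>\w+)\) (?P<message>.+)$"
)


def parse_line(text: str) -> Optional[Report]:
    """Parse one output line; return None if neither pattern matches."""
    for tool, pattern in (("clang", CLANG_RE), ("cppcheck", CPPCHECK_RE)):
        m = pattern.match(text.strip())
        if m:
            column = m.groupdict().get("column")
            return Report(tool, m["file"], int(m["line"]),
                          int(column) if column else None,
                          m["severity"], m["message"])
    return None


print(parse_line("main.cpp:10:13: warning: Assigned value is garbage or undefined"))
print(parse_line("[main.cpp:10]: (error) Uninitialized variable: a"))
```

Once reports are in a common shape, identical messages can be scored once and the score reused for every report with the same text.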

Real world testing
We tested the measure on Toyota ITC benchmarks⁵.
• Clang 3.9, cppcheck 1.76, Frama-C Silicon, PVS-Studio (Linux) and ReSharper were tested.
• The original benchmark was forked, errors were patched, and limited Win32 support was added.
• We created a lot of 5-minute-work parsers capable of reading the output we got. They cannot be applied to all outputs.
• pthread tests were excluded from the comparison, as not all utilities support them.
• We checked generic report informativeness.
• All measures were calculated and analyzed.
• The hypothesis: the measure is different from Precision, Recall and F1 scores.
⁵ https://github.com/mmenshchikov/itc-benchmarks

Test methodology
• Prepared Toyota ITC benchmarks⁶.
• Coded parsers for all tested utilities⁷.
• Prepared scripts to do the comparison⁸ and verify results, except for parts that cannot be automated.
• Scripts only check lines having special comments from Toyota.
• Reports were semi-automatically checked for correctness.
• Report quality was evaluated manually, applying the same score to similar reports (this takes very little time).
• The hypothesis was evaluated using a t-test.
⁶ https://github.com/mmenshchikov/itc-benchmarks
⁷ https://github.com/mmenshchikov/sa_parsers
⁸ https://github.com/mmenshchikov/sa_comparison_003

Results: Informativeness
Question      Clang    cppcheck   Frama-C   PVS      RS⁹
What?         100%     100%       100%      100%     100%
When?         97.41%   0%         100%      0%       0%
Where?        100%     100%       100%      100%     100%
Who?          0%       0%         0%        0%       0%
Why?          35.78%   0%         99.77%    48.46%   0%
How to fix?   0%       0%         0%        17.15%   38.27%
⁹ ReSharper C++
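As a rough cross-check, the per-question percentages can be folded into a single number with the example weights from the “introduce weights” slide. This Python sketch is only an aggregate approximation and not the authors' procedure, which scores individual reports; the dictionary layout and the `weighted_informativeness` name are illustrative:

```python
WEIGHTS = {"What": 0.2, "When": 0.15, "Where": 0.1,
           "Who": 0.05, "Why": 0.2, "HowToFix": 0.3}

# Share of reports answering each question, in percent, from the results table.
ANSWERED = {
    "Clang":     {"What": 100, "When": 97.41, "Where": 100, "Who": 0, "Why": 35.78, "HowToFix": 0},
    "cppcheck":  {"What": 100, "When": 0,     "Where": 100, "Who": 0, "Why": 0,     "HowToFix": 0},
    "Frama-C":   {"What": 100, "When": 100,   "Where": 100, "Who": 0, "Why": 99.77, "HowToFix": 0},
    "PVS":       {"What": 100, "When": 0,     "Where": 100, "Who": 0, "Why": 48.46, "HowToFix": 17.15},
    "ReSharper": {"What": 100, "When": 0,     "Where": 100, "Who": 0, "Why": 0,     "HowToFix": 38.27},
}


def weighted_informativeness(shares):
    """Weighted sum of the answered shares, with percentages scaled to [0, 1]."""
    return sum(WEIGHTS[q] * shares[q] / 100 for q in WEIGHTS)


for tool, shares in ANSWERED.items():
    print(f"{tool}: {weighted_informativeness(shares):.3f}")
```

Since every tool answers “What?” and “Where?” and none answers “Who?”, the differences come entirely from “When?”, “Why?” and “How to fix?”.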

Results: IQ
Utility    IQ      TPI     TP    FPI    FP   PPV¹⁰   TPR¹¹   F1
Clang      0.52    57.75   111   1.55   3    0.974   0.183   0.308
Cppcheck   0.3     30      100   0.6    2    0.98    0.165   0.282
Frama-C    0.649   196.1   302   57.2   88   0.774   0.498   0.606
PVS        0.459   53.67   117   4.32   12   0.907   0.193   0.318
RS¹²       –       –       –     –      –    –       –       –
¹⁰ Precision
¹¹ Recall
¹² ReSharper was excluded as it found “other” defects, although we considered it general-purpose from the beginning
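Two columns of the results can be re-derived from the others. PPV follows from TP and FP by the usual formula; and, as an observation about these particular numbers rather than a definition given in the talk, the IQ column agrees with TPI/TP to the printed precision. A Python sketch with values transcribed from the table (the helper names are illustrative):

```python
# (TPI, TP, FPI, FP) per utility, transcribed from the results table.
RESULTS = {
    "Clang": (57.75, 111, 1.55, 3),
    "Cppcheck": (30.0, 100, 0.6, 2),
    "Frama-C": (196.1, 302, 57.2, 88),
    "PVS": (53.67, 117, 4.32, 12),
}


def precision(tp, fp):
    """PPV = TP / (TP + FP)."""
    return tp / (tp + fp)


def mean_tp_quality(tpi, tp):
    """Average informational value per true positive."""
    return tpi / tp


for tool, (tpi, tp, fpi, fp) in RESULTS.items():
    print(f"{tool}: PPV={precision(tp, fp):.3f} "
          f"TPI/TP={mean_tp_quality(tpi, tp):.3f}")
```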

Results: dependency
In this test we found a dependency between Precision (PPV) and IQ. Possible explanations and what to do about them:
• Utilities provide similar reports (measures for reports are similar): test more utilities.
• Emitted messages are only error-related, with no messages on error absence: include tools that inform about bug absence as well¹³.
The test is not generally representative. We evaluated informational values ourselves, and that decreases the reliability of the results.
¹³ Many developers ignored our requests for academic versions

What's next
You can use this information to improve your utilities:
• Add answers to some of the questions (“Who?”, “When?”).
• Explain decisions more formally (“Why?”).
• Suggest fixes, if possible (“How to fix?”).
How to improve the measure:
• Prepare better explained weights.
How to improve the test:
• Better rules, less automation.
• Richer selection of tools.

Questions?

Verbosity
• Good verbosity. More information on the analyzer's decision. You can still filter out unneeded information.
• Bad verbosity. Many messages about the same error. A lot of “rubbish” messages scattering the user's attention.

Who?
It asks who wrote a bad line or made the most significant change to it.
• svn blame? Too basic: e.g. if a constant in a function invocation is wrong, you will not know for sure who is to blame.
• Ethical aspects of blaming are out of scope here.
You can use static analysis results to automatically create tasks in a bug tracker and assign them to the right person.

5Ws
The term comes from journalism, natural language processing, problem-solving, etc. Something similar was mentioned by various philosophers and rhetoricians. It was taught in high-school journalism classes by 1917.
