Slide 1

Slide 1 text

5W+1H static analysis report quality measure
Maxim Menshchikov, Timur Lepikhin
March 3, 2017
Saint Petersburg State University, OKTET Labs

Slide 2

Slide 2 text

Authors
• Maxim Menshchikov: Student, Saint Petersburg State University. Software Engineer at OKTET Labs.
• Timur Lepikhin: Candidate of Sciences, Associate Professor, Saint Petersburg State University.

Slide 3

Slide 3 text

Static analysis quality evaluation
How is quality usually evaluated?
1. Precision. $PPV = \frac{TP}{TP + FP}$
2. Recall. $TPR = \frac{TP}{TP + FN}$
3. F1 (F-measure). $F_1 = \frac{2TP}{2TP + FP + FN}$

Slide 4

Slide 4 text

Static analysis quality evaluation
How is quality usually evaluated?
4. False-positive rate. $FPR = \frac{FP}{FP + TN}$
5. Accuracy. $ACC = \frac{TP + TN}{P + N}$
6. ...
What’s missing in these measures?
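The five classic measures listed on these two slides can be sketched in a few lines of Python. The function name `confusion_metrics` and the sample counts are illustrative assumptions, not taken from the talk, and the sketch assumes every denominator is non-zero.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Classic quality measures computed from confusion-matrix counts.

    Assumes every denominator is non-zero (illustrative sketch only).
    """
    return {
        "precision": tp / (tp + fp),                  # PPV
        "recall": tp / (tp + fn),                     # TPR
        "f1": 2 * tp / (2 * tp + fp + fn),            # F-measure
        "fpr": fp / (fp + tn),                        # false-positive rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),  # P + N = all cases
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = confusion_metrics(tp=8, fp=2, tn=6, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

None of these values depends on what a report actually says, which is the gap the following slides address.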

Slide 5

Slide 5 text

Missing pieces
• Informational quality of messages
  How good and informative is the message?
• Generalization of reports
  Reports can be either positive or negative when talking about errors: “Error in line x” or “No error in line x”.
• Error class identification [1]
  Reports can relate to the same problem or point of interest in the code, and should be combined accordingly.
• Utility support
  Not every tested utility supports every kind of report.
[1] Not always missing :)

Slide 6

Slide 6 text

The input
Consider the following code sample:

    #include <stdio.h>
    int main()
    {
        int input;
        if (scanf("%d", &input) == 1)
        {
            if (input == 2)
            {
                int *a;
                int *n = a;
                a = n;
                *n = 5;
            }
            else
            {
                printf("OK\n");
            }
        }
        return 0;
    }

Slide 7

Slide 7 text

The output
Clang 3.9

    main.cpp:10:13: warning: Assigned value is garbage or undefined: int *n = a;
    main.cpp:5:5: note: Taking true branch: if (scanf("%d", &input) == 1)
    main.cpp:7:13: note: Assuming 'input' is equal to 2: if (input == 2)
    main.cpp:7:9: note: Taking true branch: if (input == 2)
    main.cpp:9:13: note: 'a' declared without an initial value: int *a;
    main.cpp:10:13: note: Assigned value is garbage or undefined: int *n = a;
    main.cpp:11:13: warning: Value stored to 'a' is never read: a = n;
    main.cpp:11:13: note: Value stored to 'a' is never read: a = n;

Slide 8

Slide 8 text

The output
cppcheck 1.76

    [main.cpp:12]: (style) Variable 'a' is assigned a value that is never used.
    [main.cpp:10]: (error) Uninitialized variable: a

Slide 9

Slide 9 text

The difference
1. Clang shows which conditions must be met to encounter the bug.
2. Clang shows the source code line text, while cppcheck shows only the file and line number.
Both reports would be “correct” in the sense of all the previous measures, so they would be considered equal with respect to their contribution to the result.

Slide 10

Slide 10 text

5W+1H
The “5Ws” are actively used in journalism and natural language processing. Sometimes they are referred to as “5W+1H”, where “H” denotes “How?”.
• What?
• When?
• Where?
• Who?
• Why?
• How?

Slide 11

Slide 11 text

5W+1H
We suggest rephrasing the 6th question as “How to fix?”
• What? The error and its consequences: what will happen if the error occurs.
• When? The conditions under which it happens.
• Where? Source code line number, module name.
• Who? Who wrote this line?
• Why? A more or less formal reason why the error was treated as such.
• How to fix? The ways to fix the problem.

Slide 12

Slide 12 text

How it applies to the previous code sample

| Question | Clang | Cppcheck |
|----------|-------|----------|
| What? | Assigned value is garbage | Uninitialized variable: a |
| Who? | - | - |
| Where? | lines 5-10 | line 10 |
| When? | scanf(...) == 1, input == 2 | - |
| Why? | 'a' declared without initial value | - |
| How? | - | - |

Slide 14

Slide 14 text

5W+1H
• It is hard to prove its completeness. (Do you have a counter-example?)
• Some way to evaluate reports is still needed.
• You can always choose the most suitable question to associate a piece of report information with.

Slide 15

Slide 15 text

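Generalization of reports

| Factual error: presence | Report: correctness | Report: result kind | Usefulness |
|-------------------------|---------------------|---------------------|------------|
| No | Indeterminate [2] | Indeterminate | Yes |
| No | Correct | Positive | No [3] |
| No | Correct | Negative | Yes |
| No | Incorrect | Positive | No |
| No | Incorrect | Negative | No |
| Yes | Indeterminate | Indeterminate | No |
| Yes | Correct | Positive | Yes |
| Yes | Correct | Negative | Yes |
| Yes | Incorrect | Positive | No |
| Yes | Incorrect | Negative | No |

[2] Or rather missing
[3] Something strange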
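One way to make the table above usable in an evaluation script is a plain lookup. The sketch below is an assumption about encoding only: the keys, the use of `None` for a missing ("indeterminate") report, and the name `is_useful` are hypothetical, while the Yes/No values are copied from the table.

```python
# (error_present, report_correctness, result_kind) -> usefulness.
# correctness/kind are None when the report is missing ("indeterminate").
USEFULNESS = {
    (False, None, None): True,
    (False, "correct", "positive"): False,   # flagged "something strange"
    (False, "correct", "negative"): True,
    (False, "incorrect", "positive"): False,
    (False, "incorrect", "negative"): False,
    (True, None, None): False,
    (True, "correct", "positive"): True,
    (True, "correct", "negative"): True,
    (True, "incorrect", "positive"): False,
    (True, "incorrect", "negative"): False,
}


def is_useful(error_present, correctness=None, kind=None):
    """Look up the usefulness of one generalized report outcome."""
    return USEFULNESS[(error_present, correctness, kind)]


print(is_useful(True, "correct", "positive"))  # a true positive
print(is_useful(True))                         # error present, no report
```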

Slide 16

Slide 16 text

Report classes
A report class is an infinite set of reports that are equal from the end user’s point of view.
Let’s group reports by the answers to the following questions:
• Why?
• What?
• Where?
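A minimal sketch of this grouping, assuming each report has already been reduced to normalized answers for the three questions. The `Report` type, the label strings, and the `group_into_classes` helper are hypothetical; real tool messages differ in wording and would need to be normalized first.

```python
from collections import defaultdict
from typing import NamedTuple


class Report(NamedTuple):
    tool: str
    why: str    # normalized reason label (assumed to be assigned by the evaluator)
    what: str   # normalized error label
    where: str  # file:line


def group_into_classes(reports):
    """Group reports whose Why/What/Where answers coincide."""
    classes = defaultdict(list)
    for report in reports:
        classes[(report.why, report.what, report.where)].append(report)
    return dict(classes)


reports = [
    Report("clang", "no-initial-value", "uninitialized-variable", "main.cpp:10"),
    Report("cppcheck", "no-initial-value", "uninitialized-variable", "main.cpp:10"),
    Report("clang", "value-never-read", "dead-store", "main.cpp:11"),
]
classes = group_into_classes(reports)
for key, members in classes.items():
    print(key, [r.tool for r in members])
```

Here the two uninitialized-variable reports fall into one class, so they are compared against each other rather than counted as separate findings.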

Slide 17

Slide 17 text

Maths: propagate report classes
Consider the surjective function that maps reports from the set $R$ to the set of unique classes $R'$:
$f(r) : R \to R', \quad r \in R$
We’ll use $R$ as an alias for $R'$ later on.

Slide 18

Slide 18 text

Maths: introduce weights
Consider the set of questions:
{What, When, Where, Who, Why, HowToFix}
Let $W$ be the set of answer weights for questions 1-6, respectively:
$W = \{w_1, w_2, \ldots, w_6\}$
Then the following mapping can be applied [4]:
$W = \{0.2, 0.15, 0.1, 0.05, 0.2, 0.3\}$
[4] Make your own mapping satisfying the needs of your test.

Slide 19

Slide 19 text

Maths: introduce weights, pt.2
Let $I$ be the informational quality of a message and $A = \{a_1, a_2, \ldots, a_6\}$ be the set of answer qualities, where $a_i \in [0, 1]$, $i = 1..6$.
$I = \sum_{i=1}^{6} w_i \cdot a_i$  (1)
Let $I_{max}$ be the maximal informational quality among $m$ utilities.
$I_{max} = \sum_{i=1}^{6} w_i \cdot \max_j a_{ij}, \quad j \in 1..m$  (2)
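Equations (1) and (2) can be tried on the earlier Clang/cppcheck comparison. This is a sketch under stated assumptions: the weights are the example mapping from the previous slide, and the binary answer scores (1 if the tool answered the question, 0 otherwise) are an illustrative simplification, since the talk allows any value in [0, 1]. The function names are hypothetical.

```python
# Question order: What, When, Where, Who, Why, HowToFix.
WEIGHTS = (0.2, 0.15, 0.1, 0.05, 0.2, 0.3)


def info_quality(answers, weights=WEIGHTS):
    """Eq. (1): weighted sum of answer qualities for one report."""
    return sum(w * a for w, a in zip(weights, answers))


def max_info_quality(answers_per_tool, weights=WEIGHTS):
    """Eq. (2): take the best answer per question across all tools."""
    best = [max(column) for column in zip(*answers_per_tool)]
    return info_quality(best, weights)


# Assumed binary scores for the uninitialized-variable report.
clang = (1, 1, 1, 0, 1, 0)     # answers What, When, Where, Why
cppcheck = (1, 0, 1, 0, 0, 0)  # answers What, Where

print(round(info_quality(clang), 2))
print(round(info_quality(cppcheck), 2))
print(round(max_info_quality([clang, cppcheck]), 2))
```

Under these assumed scores the Clang report gets about 0.65, the cppcheck report about 0.3, and the maximum across both tools is again about 0.65, because cppcheck answers no question that Clang does not.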

Slide 20

Slide 20 text

Maths: introduce weights, pt.3
Taking $I_{max}$ into account, we can find the sum over all $n$ reports:
$S_R = \sum_{i=1}^{n} I_{max_i}$  (3)

Slide 21

Slide 21 text

Maths: introduce weights, pt.4
Let $m \in \mathbb{N}$ be the number of tested static analyzers. Utility support for the $i$-th report can be abstractly represented as:
$u_{ij} \in U_i, \quad j = 1..m, \quad i = 1..n, \quad u_{ij} \in \{0, 1\}$  (4)
where $u_{ij}$ is a boolean value indicating whether the $j$-th utility supports the error type underlying the $i$-th report.
With that, we can find the sum of all reports for the $j$-th utility, taking utility support into account:
$S_j = \sum_{i=1}^{n} I_{ij} \cdot \prod_{k=1}^{m} u_{ik}$  (5)

Slide 22

Slide 22 text

Maths: “IQ” measure
We can calculate the informational quality measure for the $j$-th utility:
$S_{norm_j} = \frac{S_j}{S_R}$  (6)
We call this measure IQ (Informational Quality).
TPI only includes true positives. FPI includes false positives with the informational value taken into account.
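Equations (3), (5) and (6) can be combined into one small routine. This is a sketch, not the authors' scripts: it assumes that utility support in (5) enters as a product over all utilities (a report counts only if every tested utility supports its error type), and the function names, binary answer scores and support flags are invented for illustration.

```python
from math import prod

# Question order: What, When, Where, Who, Why, HowToFix.
WEIGHTS = (0.2, 0.15, 0.1, 0.05, 0.2, 0.3)


def info_quality(answers, weights=WEIGHTS):
    """Eq. (1): weighted sum of answer qualities for one report."""
    return sum(w * a for w, a in zip(weights, answers))


def iq_scores(answers, support, weights=WEIGHTS):
    """Return the IQ of every utility.

    answers[i][j] holds the six answer scores of utility j for report i;
    support[i][j] is 1 if utility j supports report i's error type, else 0.
    """
    utilities = len(answers[0])
    # Eq. (3): sum of the best achievable quality over all reports.
    s_r = sum(
        info_quality([max(col) for col in zip(*per_tool)], weights)
        for per_tool in answers
    )
    scores = []
    for j in range(utilities):
        # Eq. (5), assuming support is a product over all utilities.
        s_j = sum(
            info_quality(per_tool[j], weights) * prod(flags)
            for per_tool, flags in zip(answers, support)
        )
        scores.append(s_j / s_r)  # Eq. (6)
    return scores


# Hypothetical data: three reports, two utilities.
answers = [
    [(1, 1, 1, 0, 1, 0), (1, 0, 1, 0, 0, 0)],
    [(1, 0, 1, 0, 0, 0), (1, 0, 1, 0, 0, 0)],
    [(1, 0, 1, 0, 0, 0), (0, 0, 0, 0, 0, 0)],
]
support = [(1, 1), (1, 1), (1, 0)]  # the second utility lacks the third check
scores = iq_scores(answers, support)
print([round(s, 2) for s in scores])
```

With this made-up data the third report contributes to $S_R$ but to neither $S_j$, so the two utilities score roughly 0.76 and 0.48.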

Slide 23

Slide 23 text

What? Should I measure it manually?
No.
• You can make your own parsers, as we did.
• Many reports look similar. You can evaluate them once and apply the score to all of them.
• (It could have been easier if there were some kind of standardized output...)
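As an illustration of how small such a parser can be, here is a sketch for the cppcheck 1.76 line format shown earlier. It is not the authors' parser: the regular expression, the function names and the score table are assumptions, and the pattern would need adjusting for paths that contain colons.

```python
import re

# Matches lines such as "[main.cpp:10]: (error) Uninitialized variable: a".
CPPCHECK_LINE = re.compile(
    r"^\[(?P<file>[^:\]]+):(?P<line>\d+)\]: "
    r"\((?P<severity>\w+)\) (?P<message>.+)$"
)

# Hypothetical scores, evaluated once per message template and reused.
# Question order: What, When, Where, Who, Why, HowToFix.
SCORES_BY_TEMPLATE = [
    (re.compile(r"^Uninitialized variable: "), (1, 0, 1, 0, 0, 0)),
]


def parse_cppcheck(output):
    """Turn raw cppcheck output into a list of report dictionaries."""
    reports = []
    for raw in output.splitlines():
        match = CPPCHECK_LINE.match(raw.strip())
        if match:
            reports.append({
                "file": match["file"],
                "line": int(match["line"]),
                "severity": match["severity"],
                "message": match["message"],
            })
    return reports


def score_for(message):
    """Return the stored answer scores for a known template, else None."""
    for template, scores in SCORES_BY_TEMPLATE:
        if template.match(message):
            return scores
    return None


SAMPLE = """\
[main.cpp:12]: (style) Variable 'a' is assigned a value that is never used.
[main.cpp:10]: (error) Uninitialized variable: a
"""
parsed = parse_cppcheck(SAMPLE)
for report in parsed:
    print(report["line"], report["severity"], score_for(report["message"]))
```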

Slide 24

Slide 24 text

Real world testing
We tested the measure on the Toyota ITC benchmarks [5].
• Clang 3.9, cppcheck 1.76, Frama-C Silicon, PVS-Studio (Linux) and ReSharper were tested.
• The original benchmark was forked, errors were patched, and limited Win32 support was added.
• We created many 5-minute-work parsers capable of reading the output we got. They cannot be applied to all outputs.
• pthread tests were excluded from the comparison, as not all utilities support them.
• We checked generic report informativeness.
• All measures were calculated and analyzed.
• The hypothesis: the measure is different from the Precision, Recall and F1 scores.
[5] https://github.com/mmenshchikov/itc-benchmarks

Slide 25

Slide 25 text

Test methodology
• Prepared the Toyota ITC benchmarks [6].
• Coded parsers for all tested utilities [7].
• Prepared scripts to do the comparison [8] and verify results, except for the parts that cannot be automated.
• Scripts only check lines that have special comments from Toyota.
• Reports were semi-automatically checked for correctness.
• Report quality was evaluated manually, applying the same score to similar reports (this takes very little time).
• The hypothesis was evaluated using a t-test.
[6] https://github.com/mmenshchikov/itc-benchmarks
[7] https://github.com/mmenshchikov/sa_parsers
[8] https://github.com/mmenshchikov/sa_comparison_003

Slide 26

Slide 26 text

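Results: Informativeness

| Question | Clang | cppcheck | Frama-C | PVS | RS [9] |
|----------|-------|----------|---------|-----|--------|
| What? | 100% | 100% | 100% | 100% | 100% |
| When? | 97.41% | 0% | 100% | 0% | 0% |
| Where? | 100% | 100% | 100% | 100% | 100% |
| Who? | 0% | 0% | 0% | 0% | 0% |
| Why? | 35.78% | 0% | 99.77% | 48.46% | 0% |
| How to fix? | 0% | 0% | 0% | 17.15% | 38.27% |

[9] ReSharper C++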

Slide 27

Slide 27 text

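Results: IQ

| Utility | IQ | TPI | TP | FPI | FP | PPV [10] | TPR [11] | F1 |
|---------|----|-----|----|-----|----|----------|----------|----|
| Clang | 0.52 | 57.75 | 111 | 1.55 | 3 | 0.974 | 0.183 | 0.308 |
| Cppcheck | 0.3 | 30 | 100 | 0.6 | 2 | 0.98 | 0.165 | 0.282 |
| Frama-C | 0.649 | 196.1 | 302 | 57.2 | 88 | 0.774 | 0.498 | 0.606 |
| PVS | 0.459 | 53.67 | 117 | 4.32 | 12 | 0.907 | 0.193 | 0.318 |
| RS [12] | - | - | - | - | - | - | - | - |

[10] Precision
[11] Recall
[12] ReSharper was excluded as it found “other” defects, although we considered it general-purpose from the beginning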
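The table can be sanity-checked with a few lines of arithmetic. The sketch below recomputes precision from the TP and FP columns using the PPV formula from the start of the talk. It also prints TPI divided by TP; that this ratio matches the IQ column to rounding is an observation about these numbers, not a definition given in the slides.

```python
# Values copied from the results table (ReSharper has no data).
ROWS = {
    "Clang":    {"iq": 0.52,  "tpi": 57.75, "tp": 111, "fp": 3,  "ppv": 0.974},
    "Cppcheck": {"iq": 0.3,   "tpi": 30.0,  "tp": 100, "fp": 2,  "ppv": 0.98},
    "Frama-C":  {"iq": 0.649, "tpi": 196.1, "tp": 302, "fp": 88, "ppv": 0.774},
    "PVS":      {"iq": 0.459, "tpi": 53.67, "tp": 117, "fp": 12, "ppv": 0.907},
}

checks = {}
for name, row in ROWS.items():
    ppv = row["tp"] / (row["tp"] + row["fp"])  # PPV = TP / (TP + FP)
    per_tp = row["tpi"] / row["tp"]            # informational value per true positive
    checks[name] = (ppv, per_tp)
    print(f"{name}: PPV={ppv:.3f} (table {row['ppv']}), "
          f"TPI/TP={per_tp:.3f} (IQ {row['iq']})")
```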

Slide 28

Slide 28 text

Results: dependency
In this test we found a dependency between Precision (PPV) and IQ.
• Utilities provide similar reports (measures for reports are similar): test more utilities.
• Emitted messages are only error-related, with no messages about error absence: include tools that report bug absence as well [13].
This result is not generally representative. We evaluated informational values ourselves, and that decreases the reliability of the results.
[13] Many developers ignored our requests for academic versions

Slide 29

Slide 29 text

What’s next
You can use this information to improve your utilities:
• Add answers to some of the questions (“Who?”, “When?”).
• Explain decisions more formally (“Why?”).
• Suggest fixes, if possible (“How to fix?”).
How to improve the measure:
• Prepare better-explained weights.
How to improve the test:
• Better rules, less automation.
• Richer selection of tools.

Slide 30

Slide 30 text

Questions?

Slide 31

Slide 31 text

Verbosity
• Good verbosity
  More information on the analyzer’s decision. You can still filter out unneeded information.
• Bad verbosity
  Many messages about the same error. A lot of “rubbish” messages scattering the user’s attention.

Slide 32

Slide 32 text

Who?
This question asks who wrote a bad line or made the most significant change to it.
• svn blame?
  Too basic: for example, if a constant in a function invocation is wrong, you will not know for sure who is to blame.
• Ethical aspects of blaming are out of scope.
  You can use static analysis results to automatically create tasks in a bug tracker and assign them to the right person.

Slide 33

Slide 33 text

5Ws
The term comes from journalism, natural language processing, problem-solving, etc.
Similar question sets were mentioned by various philosophers and rhetoricians.
They were taught in high-school journalism classes by 1917.