Code Coverage is a Strong Predictor of Test Suite Effectiveness in the Real World

Rahul Gopinath, Iftekhar Ahmed

When should we stop testing?

How do we evaluate test suite effectiveness?

Previous research (GTAC’15, Inozemtseva): do not trust coverage (in theory).

Factors affecting test suite quality: coverage and assertions.

According to previous research (GTAC’15, Inozemtseva), test suite size also affects test suite quality, alongside coverage and assertions.

What is the adequate test suite size? But…
• Is there a maximum number of test cases for a given program?
• Are different test cases equivalent in strength?
• How do we account for duplicate tests?
• Test suite sizes are not comparable even for the same program.

Can I use coverage to measure suite effectiveness?

Mutation score is best predicted by statement coverage (developer-written test suites)
A fault in a statement has an 87% probability of being detected if a test covers it: M = 0.87 × S, R² = 0.94.
Results from 250 real-world programs, the largest > 100 KLOC. (Plot: size of dots follows size of projects.)

Mutation score is best predicted by statement coverage (Randoop-generated test suites)
A fault in a statement has a 61% probability of being detected if a test covers it: M = 0.61 × S, R² = 0.70.
Results from 250 real-world programs, the largest > 100 KLOC.
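Both slides fit a line through the origin, M = k × S, with k = 0.87 for developer-written suites and k = 0.61 for Randoop-generated suites. As a minimal sketch, assuming the fit is ordinary least squares with no intercept (the slides do not say how it was computed), the slope and the resulting prediction look like this; the function names are hypothetical.

```python
def fit_through_origin(coverage, mutation):
    """Least-squares slope k for the model M = k * S (no intercept term)."""
    num = sum(s * m for s, m in zip(coverage, mutation))
    den = sum(s * s for s in coverage)
    return num / den


# Slopes as reported on the two slides.
DEVELOPER_K = 0.87
RANDOOP_K = 0.61


def predicted_mutation_score(statement_coverage, k):
    """Expected mutation score for a suite with the given statement coverage."""
    return k * statement_coverage


# A suite covering 80% of statements: about 0.70 for developer-written suites,
# about 0.49 for Randoop-generated ones.
dev = predicted_mutation_score(0.8, DEVELOPER_K)
gen = predicted_mutation_score(0.8, RANDOOP_K)
```

The same coverage buys noticeably less fault detection from generated suites, which matches the gap between the two slopes.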

Controlling for test suite size, coverage provides little extra information. Hence, don't use coverage (GTAC’15, Inozemtseva).
But mutation score provides little extra information (< 6%) compared to coverage. Why use mutation, then?

Does coverage follow test suite size?

                              GTAC’15 Inozemtseva          Our research
# Programs                    5                            250
Selection of programs         Ad hoc                       Systematic sample from GitHub
Tools used                    CodeCover, PIT               Emma, Cobertura, CodeCover, PIT
Test suites                   Random subsets of original   Organic & randomly generated (new results)
Removal of influence of size  Ad hoc                       Statistical

Our study is much larger, systematic (not ad hoc), and follows real-world usage.

Our research (new results): variability in mutation score M explained by each model (S = statement coverage)
  M ~ TestSuiteSize                 12.84%
  M ~ log(TSize)                    51.26%
  residuals(M ~ log(TSize)) ~ S     75.25%

Statement coverage can explain 75% of the variability in mutation score after eliminating the influence of test suite size.
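The formulas in the results read like R model formulas: first regress mutation score on log(test suite size), then ask how much of what is left over statement coverage explains. A minimal pure-Python sketch of that two-step procedure, assuming simple one-variable ordinary least squares at each step (the slides do not give the exact fitting setup); all function names are hypothetical.

```python
import math


def ols(x, y):
    """Simple linear regression y = a + b*x by ordinary least squares."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def residuals(x, y):
    """What remains of y after removing its linear dependence on x."""
    a, b = ols(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def r_squared(x, y):
    """Fraction of the variance of y explained by a linear fit on x."""
    res = residuals(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum(r * r for r in res)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def coverage_beyond_size(suite_sizes, coverage, mutation_scores):
    """R² of residuals(M ~ log(TSize)) ~ S: coverage's share after size is removed."""
    log_sizes = [math.log(t) for t in suite_sizes]
    leftover = residuals(log_sizes, mutation_scores)
    return r_squared(coverage, leftover)
```

Fitting on residuals is one common way to partial out a confounder such as suite size; the sketch only illustrates the shape of the analysis, not the study's data.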

Is mutation analysis better than coverage analysis?

Mutation Analysis: High cost of analysis

Original: Δ = b² − 4ac (d = b^2 - 4 * a * c;)

Mutants:
d = b^2 + 4 * a * c;  d = b^2 * 4 * a * c;  d = b^2 / 4 * a * c;  d = b^2 ^ 4 * a * c;  d = b^2 % 4 * a * c;  d = b^2 << 4 * a * c;  d = b^2 >> 4 * a * c;
d = b^2 - 4 + a * c;  d = b^2 - 4 - a * c;  d = b^2 - 4 / a * c;  d = b^2 - 4 ^ a * c;  d = b^2 - 4 % a * c;  d = b^2 - 4 << a * c;  d = b^2 - 4 >> a * c;
d = b^2 - 4 * a + c;  d = b^2 - 4 * a - c;  d = b^2 - 4 * a / c;  d = b^2 - 4 * a ^ c;  d = b^2 - 4 * a % c;  d = b^2 - 4 * a << c;  d = b^2 - 4 * a >> c;
d = b + 2 - 4 * a * c;  d = b - 2 - 4 * a * c;  d = b * 2 - 4 * a * c;  d = b / 2 - 4 * a * c;  d = b % 2 - 4 * a * c;
d = b^0 - 4 * a * c;  d = b^1 - 4 * a * c;  d = b^-1 - 4 * a * c;  d = b^MAX - 4 * a * c;  d = b^MIN - 4 * a * c;
d = b^2 - 0 * a * c;  d = b^2 - 1 * a * c;  d = b^2 - (-1) * a * c;  d = b^2 - MAX * a * c;  d = b^2 - MIN * a * c;
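The slide lists 36 mutants for a single statement, each of which needs its own run of the test suite. To make that cost concrete, here is a minimal sketch of operator-replacement mutation using Python's standard `ast` module. It is an illustration, not the tool used in the study: it writes the power as `**` (in Python `^` is XOR), only swaps binary operators from a hypothetical operator set, and skips the constant replacements, so its count differs from the slide's.

```python
import ast
import copy

# Hypothetical replacement set; the slide also mutates constants (0, 1, -1, MAX, MIN).
OPERATORS = [ast.Add, ast.Sub, ast.Mult, ast.Div, ast.Mod,
             ast.BitXor, ast.LShift, ast.RShift]


def operator_mutants(expr):
    """Yield source strings of expr with exactly one binary operator replaced."""
    tree = ast.parse(expr, mode="eval")
    n_binops = sum(isinstance(n, ast.BinOp) for n in ast.walk(tree))
    for i in range(n_binops):
        for op in OPERATORS:
            mutant = copy.deepcopy(tree)
            # ast.walk visits the copy in the same order as the original tree.
            target = [n for n in ast.walk(mutant) if isinstance(n, ast.BinOp)][i]
            if isinstance(target.op, op):
                continue  # replacing an operator with itself is not a mutant
            target.op = op()
            yield ast.unparse(mutant.body)


mutants = list(operator_mutants("b**2 - 4*a*c"))
# 4 binary operators: '-' and the two '*' get 7 alternatives each, '**' gets 8.
```

Because it edits the syntax tree rather than the text, a replaced operator keeps its original operands, which is not always what a textual substitution such as the slide's `b^2 / 4 * a * c` would give. Requires Python 3.9+ for `ast.unparse`.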

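Mutation Analysis: Equivalent Mutants
Original: Δ = b² − 2²ac (d = b^2 - (2^2) * a * c;)
Equivalent mutants: d = b^2 - (2*2) * a * c;  d = b^2 - (2+2) * a * c;
(Diagram: mutants of the original divide into equivalent mutants and normal mutants.)
An equivalent mutant computes the same result as the original for every input, so no test can kill it, and it drags the mutation score down.
Or: do not trust low mutation scores.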
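A small sketch of why equivalent mutants cap the score: even a test set that tries every input in a small domain cannot kill them. It assumes Python `**` for the slide's `^`, treats the integer grid -3..3 as the test inputs, and adds one hypothetical non-equivalent mutant, `(2-2)`, that is not on the slide, purely for contrast.

```python
import itertools


def original(a, b, c):
    return b**2 - (2**2) * a * c


MUTANTS = {
    "(2**2) -> (2*2)": lambda a, b, c: b**2 - (2 * 2) * a * c,
    "(2**2) -> (2+2)": lambda a, b, c: b**2 - (2 + 2) * a * c,
    # Hypothetical non-equivalent mutant, added for contrast (not on the slide).
    "(2**2) -> (2-2)": lambda a, b, c: b**2 - (2 - 2) * a * c,
}

# Every integer triple with components in -3..3.
TESTS = list(itertools.product(range(-3, 4), repeat=3))


def is_killed(mutant, tests):
    """A mutant is killed if some test input yields a different output."""
    return any(mutant(*t) != original(*t) for t in tests)


killed = {name: is_killed(m, TESTS) for name, m in MUTANTS.items()}
mutation_score = sum(killed.values()) / len(killed)
```

Only the `(2-2)` mutant is killed, so the score is 1/3 despite the exhaustive inputs: the two equivalent mutants make the suite look weaker than it is.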

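Mutation Analysis: Redundant Mutants
Original: Δ = b² − 4ac
Redundant mutants: d = b^2 - (-4) * a * c;  d = b^2 + 4 * a * c; (both compute the same function, so any test that kills one also kills the other)
Equivalent mutant: d = (-b)^2 - 4 * a * c;
(Diagram: mutants of the original divide into equivalent mutants and redundant mutants.)
Redundant mutants inflate the mutation score, because a single kill is counted several times.
Or: do not trust high mutation scores.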
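A sketch of how redundant mutants inflate the score, assuming duplicates are detected approximately by comparing outputs on an integer grid (a stand-in for true semantic equivalence, which is undecidable in general). The third mutant, `b**2 - 4*a*a`, is hypothetical and only there to give the weak suite something it misses; `**` stands in for the slide's `^`.

```python
import itertools


def original(a, b, c):
    return b**2 - 4 * a * c


MUTANTS = {
    "b**2 + 4*a*c": lambda a, b, c: b**2 + 4 * a * c,
    "b**2 - (-4)*a*c": lambda a, b, c: b**2 - (-4) * a * c,
    # Hypothetical extra mutant (not on the slide): c replaced by a.
    "b**2 - 4*a*a": lambda a, b, c: b**2 - 4 * a * a,
}

# Approximate each mutant's behaviour by its outputs on a grid of inputs.
GRID = list(itertools.product(range(-3, 4), repeat=3))


def signature(f):
    return tuple(f(*x) for x in GRID)


def is_killed(mutant, tests):
    return any(mutant(*t) != original(*t) for t in tests)


def scores(tests):
    """Return (raw score, score with one mutant kept per behaviour signature)."""
    raw = sum(is_killed(m, tests) for m in MUTANTS.values()) / len(MUTANTS)
    unique = {}
    for m in MUTANTS.values():
        unique.setdefault(signature(m), m)
    dedup = sum(is_killed(m, tests) for m in unique.values()) / len(unique)
    return raw, dedup


# A single weak test: a=1, b=0, c=1.
raw, dedup = scores([(1, 0, 1)])
```

The raw score is 2/3, but once the two redundant mutants are counted once, the same suite scores 1/2.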

Mutation Analysis: Different Operators
Original: Δ = b² − 4ac
d = b^2 + 4 * a * c;
>>> dis.dis(d)
  2     0 LOAD_FAST      0 (b)
        3 LOAD_CONST     1 (2)
        6 LOAD_CONST     2 (4)
        9 LOAD_FAST      1 (a)
       12 BINARY_MULTIPLY
       13 LOAD_FAST      2 (c)
       16 BINARY_MULTIPLY
       17 BINARY_SUBTRACT
       18 BINARY_XOR
       19 RETURN_VALUE
[2016 Software Quality Journal]
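The bytecode applies BINARY_SUBTRACT before BINARY_XOR, which is how Python reads `b^2 - 4*a*c`: `^` is bitwise XOR and binds more loosely than `-` and `*`, so the expression means `b ^ (2 - 4*a*c)`. One reading of the slide is that a tool mutating bytecode and a tool mutating source therefore see different operators for the same line. Since `dis` opcode names change across Python versions (3.11 merged the arithmetic opcodes into `BINARY_OP`), this sketch inspects the parse tree with the standard `ast` module instead; the helper names are hypothetical.

```python
import ast


def top_level_operator(expr):
    """Name of the outermost binary operator in expr."""
    node = ast.parse(expr, mode="eval").body
    if not isinstance(node, ast.BinOp):
        raise ValueError("expression is not a binary operation")
    return type(node.op).__name__


def operator_sequence(expr):
    """Binary operators in post-order, i.e. the order a stack machine applies them."""
    ops = []

    def visit(node):
        if isinstance(node, ast.BinOp):
            visit(node.left)
            visit(node.right)
            ops.append(type(node.op).__name__)

    visit(ast.parse(expr, mode="eval").body)
    return ops


xor_version = operator_sequence("b^2 - 4*a*c")   # same order as the bytecode above
pow_version = operator_sequence("b**2 - 4*a*c")
```

For `b = 3, a = 1, c = 1` the intended discriminant is 5, while `b^2 - 4*a*c` evaluates to `3 ^ -2`, which is -3.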

Does a high coverage test suite actually prevent bugs?

FSE 2016: We looked at the incidence of bug fixes in actual programs. An uncovered line is twice as likely to have a bug fix as a line covered by any test case.
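"Twice as likely" compares per-line bug-fix rates between uncovered and covered lines. A minimal sketch of that ratio; the counts below are hypothetical and chosen only to illustrate the arithmetic, not taken from the study.

```python
def fix_rate(lines_fixed, lines_total):
    """Fraction of lines that later received a bug fix."""
    return lines_fixed / lines_total


def uncovered_vs_covered(uncovered_fixed, uncovered_total, covered_fixed, covered_total):
    """How many times likelier an uncovered line is to receive a bug fix than a covered one."""
    return fix_rate(uncovered_fixed, uncovered_total) / fix_rate(covered_fixed, covered_total)


# Hypothetical counts for illustration only; these are not the study's data.
ratio = uncovered_vs_covered(uncovered_fixed=40, uncovered_total=2000,
                             covered_fixed=80, covered_total=8000)
```

Here uncovered lines have a 2% fix rate and covered lines 1%, giving a ratio of about 2.0 even though covered lines received more fixes in absolute terms.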

That is,
• Coverage is highly correlated with mutation score (92%).
• Mutation score provides little extra information compared to coverage.
• Coverage explains 75% of the variability in mutation score that remains after removing the influence of test suite size.
• Mutation score can be unreliable.
• Coverage thresholds actually help reduce the incidence of bugs.

Summary: Beware of theoretical spherical cows.
