Classifying generated white-box tests: an exploratory study

Dávid Honfi, Zoltán Micskei
ICST’21 Journal-First Papers
micskeiz mit.bme.hu/~micskeiz
Software Quality Journal, 27:3, pp. 1339–1380, Springer, 2019. DOI: 10.1007/s11219-019-09446-5

White-Box test generation

Method under test:

    /// <summary>Calculates the sum of given number of
    /// elements from an index in an array.</summary>
    int CalculateSum(int start, int number, int[] a)
    {
        if (start + number > a.Length || a.Length <= 1)
            throw new ArgumentException();
        int sum = 0;
        for (int i = start; i < start + number - 1; i++)
            sum += a[i];
        return sum;
    }

Generated tests:

    [TestMethod]
    public void CalculateSumTest284()
    {
        int[] ints = new int[5] { 4, 5, 6, 7, 8 };
        int i = CalculateSum(0, 0, ints);
        Assert.AreEqual<int>(0, i);
    }

    [TestMethod]
    public void CalculateSumTest647()
    {
        int[] ints = new int[5] { 4, 5, 6, 7, 8 };
        int i = CalculateSum(0, 4, ints);
        Assert.AreEqual<int>(15, i);
    }

Test generator:
• Select test inputs
• Observe behavior
• Generate test code

White-Box test generation (same method and generated tests as on the previous slide)
Question: Do these tests “look good”?
• OK: expected w.r.t. the specification
• WRONG: unexpected w.r.t. the specification
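To make the question concrete, the sketch below transliterates the slide's CalculateSum into Python and compares what each generated test observed with what the summary comment seems to specify. The `specified_sum` oracle is an assumption: it is one literal reading of "the sum of given number of elements from an index", not something the slides define.

```python
def calculate_sum(start, number, a):
    """Python transliteration of CalculateSum from the slide, kept unchanged."""
    if start + number > len(a) or len(a) <= 1:
        raise ValueError("invalid arguments")  # stands in for ArgumentException
    total = 0
    for i in range(start, start + number - 1):  # same bounds as the C# loop
        total += a[i]
    return total


def specified_sum(start, number, a):
    """Hypothetical oracle: sum of `number` elements starting at `start`."""
    return sum(a[start:start + number])


ints = [4, 5, 6, 7, 8]
generated_tests = [("CalculateSumTest284", 0, 0), ("CalculateSumTest647", 0, 4)]
for name, start, number in generated_tests:
    observed = calculate_sum(start, number, ints)    # value the test asserts on
    specified = specified_sum(start, number, ints)   # value the comment implies
    verdict = "OK" if observed == specified else "WRONG"
    print(f"{name}: observed={observed}, specified={specified} -> {verdict}")
```

Under this reading, CalculateSumTest284 encodes expected behavior (0 in both cases), while CalculateSumTest647 asserts 15 where the comment implies 22. Both tests pass against the implementation, so the test outcome alone does not tell the developer which one is WRONG.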

Goal and why should we care about it?
Goal: How do developers who use test generator tools perform in deciding whether the generated tests encode expected or unexpected behavior?
Typical research evaluation setup (diagram): Faulty impl. / Correct impl., Generated test, OK / Bug!
Real setup (diagram): “Do not know whether faulty or correct” implementation, Generated test, OK?? / Bug?
Classification is a non-trivial task affecting the practical fault-finding capability of test generators and empirical evaluations

Planning: overview of the study
RQ1: How do developers perform in the classification of generated tests?
RQ2: How much time do developers spend with the classification?
Subjects
• Students only
• MSc V&V course
• Basic experience
• Apply voluntarily
Objects
• C#, from GitHub
• 5 methods in 4 repos
• 3 tests per method
• Artificial faults (ODC)
Environment
• IntelliTest
• Experiment portal
• Visual Studio
• Test runs and debug
Procedure
• 15 mins tutorial
• Classify all 15 tests
• At most 60 minutes
• Activities recorded

Planning: experiment portal
Screenshot of the portal, annotated with:
• Specification of the method (open-source project)
• Current test to classify
• Observed behavior encoded in asserts
• Answer of participant

Execution
             Session  Date        Object            Participants
Original     #1       2016-12-01  NBitCoin          30
Original     #2       2016-12-08  Math.NET          24
Replication  #3       2017-11-30  NodaTime          22
Replication  #4       2017-12-07  NetTopologySuite  30
SUM                                                 106

Results: RQ1 classification performance
Figure: Matthews correlation coefficient
Figure: results from session #2
• Columns: participants
• Rows: test methods
• Cells: answers
• Moderate classification ability
• Incorrectly classifying both expected and unexpected results
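For readers unfamiliar with the metric, the following sketch computes the Matthews correlation coefficient from one participant's answers using the standard formula, MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The answer data are made up for illustration, and treating WRONG as the positive class is an assumption rather than something stated on the slide.

```python
import math


def confusion_counts(truth, answers, positive="WRONG"):
    """Count TP, TN, FP, FN for one participant (positive class is assumed)."""
    tp = sum(t == positive and a == positive for t, a in zip(truth, answers))
    tn = sum(t != positive and a != positive for t, a in zip(truth, answers))
    fp = sum(t != positive and a == positive for t, a in zip(truth, answers))
    fn = sum(t == positive and a != positive for t, a in zip(truth, answers))
    return tp, tn, fp, fn


def matthews_corrcoef(tp, tn, fp, fn):
    """Standard MCC; returns 0.0 when the denominator is zero (a common convention)."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical participant: 15 tests, 6 of which encode unexpected behavior.
truth = ["WRONG"] * 6 + ["OK"] * 9
answers = ["WRONG"] * 4 + ["OK"] * 2 + ["OK"] * 7 + ["WRONG"] * 2

mcc = matthews_corrcoef(*confusion_counts(truth, answers))
print(round(mcc, 3))  # prints 0.444
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level answers, so it penalizes both kinds of mistake: marking an expected test as WRONG and accepting an unexpected one as OK.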

Results: RQ2 time spent on tasks
• Median time needed for one test method: 55–117 s
• The time needed is not negligible in a larger project
• Learning effect (method, process)

Discussion: limitations
Intentionally simple study design
– 2 pilots to refine timing, length, materials…
Current design resembles setting of
– Junior developers testing a legacy project with test generators
Limitations and how to extend in future studies
– Students (professionals with experience on test generation?)
– A priori knowledge of objects
– Specification: code comments vs. other forms
– Used only 1 tool (IntelliTest)

Discussion: recommendations
Classification task is challenging
– Median misclassification 20% (14–33%)
– Consider in future evaluations and tool development
Participants’ feedback (see paper/dataset for more)
– “It was hard to decide … when it tests an unspecified case”
– “I think that some assertions are useless…”
Summary of recommendations (see paper for more)
– Structure the test code (Arrange/Act/Assert) and categorize tests
– Naming variables, commenting etc. is important in generated tests
– Instead of assert use something else, e.g., observed
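One possible reading of these recommendations is sketched below: a generated test laid out as Arrange/Act/Observe, with descriptive names and comments, that reports the observed value instead of asserting it. Everything here is hypothetical and for illustration only: `method_under_test`, the `Observation` record, and the test name are not taken from the paper or from any tool.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical record of what a generated test saw (replaces an assert)."""
    description: str
    value: object


def method_under_test(start, number, values):
    """Hypothetical stand-in for the method the generator explored."""
    return sum(values[start:start + number])


def test_sum_of_four_elements_from_index_zero():
    # Arrange: inputs selected by the generator, given descriptive names
    values = [4, 5, 6, 7, 8]
    start_index, element_count = 0, 4

    # Act: a single call to the method under test
    observed_sum = method_under_test(start_index, element_count, values)

    # Observe: report the behavior and leave the verdict to the developer
    return Observation("sum of 4 elements from index 0", observed_sum)


print(test_sum_of_four_elements_from_index_zero())
```

Reporting an observation rather than an assertion makes it explicit that the value was recorded from the implementation and still has to be checked against the specification.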

Towards a theory of classification

Why is this classification task hard?
Many factors + some puzzling combinations
– Especially with exceptions
No “common” approach in tools, papers
Our preliminary, systematic proposal: 3 main factors (8 cases)

Possible cases of the classification task
Conditions: Shall raise exception, Fault is triggered, Exception is raised
Outcome: Test encodes, Developer action, Classif., Test outcome

ID  Shall raise exception  Fault is triggered  Exception is raised  Test encodes         Developer action                          Classif.  Test outcome
C1  F                      F                   F                    Expected behavior    Acknowledge the test                      OK        Pass
C2  F                      F                   T                    -                    -                                         -         -
C3  F                      T                   F                    Unexpected behavior  Realize faulty impl.                      WRONG     Pass
C4  F                      T                   T                    Unexpected behavior  Recognize the fault-indicating exception  WRONG     Pass or fail
C5  T                      F                   F                    -                    -                                         -         -
C6  T                      F                   T                    Expected behavior    Acknowledge the test                      OK        Pass or fail
C7  T                      T                   F                    Unexpected behavior  Realize an exception is missing           WRONG     Pass
C8  T                      T                   T                    Unexpected behavior  Recognize the invalid exception           WRONG     Pass or fail

See paper for examples
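The table can also be read as a lookup from the three conditions to the outcome. The sketch below encodes it that way; the names `Outcome`, `CASES`, and `classify_case` are illustrative. Returning `None` for C2 and C5 reflects the "-" entries in the table, which presumably mark combinations that cannot occur.

```python
from typing import NamedTuple, Optional


class Outcome(NamedTuple):
    case_id: str
    test_encodes: str
    developer_action: str
    classification: str
    test_outcome: str


# Keys: (shall_raise_exception, fault_is_triggered, exception_is_raised)
CASES = {
    (False, False, False): Outcome("C1", "Expected behavior", "Acknowledge the test", "OK", "Pass"),
    (False, False, True): None,  # C2: "-" in the table
    (False, True, False): Outcome("C3", "Unexpected behavior", "Realize faulty impl.", "WRONG", "Pass"),
    (False, True, True): Outcome("C4", "Unexpected behavior",
                                 "Recognize the fault-indicating exception", "WRONG", "Pass or fail"),
    (True, False, False): None,  # C5: "-" in the table
    (True, False, True): Outcome("C6", "Expected behavior", "Acknowledge the test", "OK", "Pass or fail"),
    (True, True, False): Outcome("C7", "Unexpected behavior",
                                 "Realize an exception is missing", "WRONG", "Pass"),
    (True, True, True): Outcome("C8", "Unexpected behavior",
                                "Recognize the invalid exception", "WRONG", "Pass or fail"),
}


def classify_case(shall_raise: bool, fault_triggered: bool, exception_raised: bool) -> Optional[Outcome]:
    """Look up the table row for one combination of the three conditions."""
    return CASES[(shall_raise, fault_triggered, exception_raised)]


# A fault is triggered but no exception is involved: the test passes yet is WRONG.
print(classify_case(False, True, False))
```

In this encoding the classification is WRONG exactly when a fault is triggered, while the test outcome depends only on whether an exception is raised, which is why a passing test says little about which row applies.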

How do tools handle these specific cases?
IntelliTest
• Exceptions thrown inside the class are expected, test passes
• Otherwise failing test with no assert
EvoSuite
• Wraps every detected exception in try/catch, test passes
• No failing tests
Randoop
• Error-revealing: potential error, test fails (several heuristics)
• Regression tests: captures current behavior, test passes

Suggestions for future research
Change perspective
– From “Does this test encode a fault?” to “Does the user recognize that there is a fault with the help of this test?”
– From “Correct/faulty test” to “Appropriate or useful test”
– Proxy measures could overestimate faults that are detected
In your approach/tool, think about:
– When should a generated test pass or fail?
– What should you do when an exception is raised in a test?
– When is a generated test useful for developers?

Summary
Open access paper: 10.1007/s11219-019-09446-5 | Dataset: 10.5281/zenodo.1472714
Dávid Honfi, Zoltán Micskei

Critical Systems Research Group
