Slide 1

Classifying generated white-box tests: an exploratory study
Dávid Honfi, Zoltán Micskei
ICST’21 Journal-First Papers
micskeiz | mit.bme.hu/~micskeiz
Software Quality Journal, 27(3), pp. 1339–1380, Springer, 2019. DOI: 10.1007/s11219-019-09446-5

Slide 2

White-Box test generation

    /// Calculates the sum of given number of
    /// elements from an index in an array.
    int CalculateSum(int start, int number, int[] a) {
        if (start + number > a.Length || a.Length <= 1)
            throw new ArgumentException();
        int sum = 0;
        for (int i = start; i < start + number - 1; i++)
            sum += a[i];
        return sum;
    }

    [TestMethod]
    public void CalculateSumTest284() {
        int[] ints = new int[5] { 4, 5, 6, 7, 8 };
        int i = CalculateSum(0, 0, ints);
        Assert.AreEqual(0, i);
    }

    [TestMethod]
    public void CalculateSumTest647() {
        int[] ints = new int[5] { 4, 5, 6, 7, 8 };
        int i = CalculateSum(0, 4, ints);
        Assert.AreEqual(15, i);
    }

Test generator: select test inputs → observe behavior → generate test code.

Slide 3

White-Box test generation

(Same CalculateSum method and the two generated tests, CalculateSumTest284 and CalculateSumTest647, as on the previous slide.)

Question: Do these tests “look good”?
– OK: expected w.r.t. the specification
– WRONG: unexpected w.r.t. the specification

Slide 4

Goal and why should we care about it?

Goal: How do developers who use test generator tools perform in deciding whether the generated tests encode expected or unexpected behavior?

Typical research evaluation setup: a generated test is run against a known correct implementation and a known faulty implementation, so its verdict is clear (OK vs. Bug!).
Real setup: the developer does not know whether the implementation is faulty or correct, so the verdict of a generated test remains uncertain (OK?? vs. Bug?).

Classification is a non-trivial task affecting the practical fault-finding capability of test generators and empirical evaluations.

Slide 5

Planning: overview of the study

RQ1: How do developers perform in the classification of generated tests?
RQ2: How much time do developers spend with the classification?

Subjects: students only (MSc V&V course), basic experience, applied voluntarily
Objects: C# code from GitHub; 5 methods in 4 repos; 3 tests per method; artificial faults (ODC); IntelliTest
Environment: experiment portal, Visual Studio, test runs and debugging
Procedure: 15 mins tutorial, classify all 15 tests, at most 60 minutes, activities recorded

Slide 6

Planning: experiment portal

Screenshot of the portal, annotated with: specification of the method (open-source project), current test to classify, observed behavior encoded in asserts, answer of the participant.

Slide 7

Execution

Session        | Date       | Object           | Participants
Original #1    | 2016-12-01 | NBitCoin         | 30
Original #2    | 2016-12-08 | Math.NET         | 24
Replication #3 | 2017-11-30 | NodaTime         | 22
Replication #4 | 2017-12-07 | NetTopologySuite | 30
SUM            |            |                  | 106

Slide 8

Results: RQ1 classification performance

Results from session #2 (heatmap): columns are participants, rows are test methods, cells are the answers given; per-participant performance is summarized with the Matthews correlation coefficient.

• Moderate classification ability
• Participants incorrectly classify both expected and unexpected behavior
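
For reference (not shown on the slide), the Matthews correlation coefficient over a binary confusion matrix is the standard

    \mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}

where TP, TN, FP, and FN are the entries of a participant's confusion matrix, i.e., the counts of correctly and incorrectly classified expected and unexpected tests.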

Slide 9

Results: RQ2 time spent on tasks

• Median time needed for one test method: 55–117 s
• The time needed is not negligible in a larger project
• Learning effect (method, process)

Slide 10

Discussion: limitations

Intentionally simple study design
– 2 pilots to refine timing, length, materials…

Current design resembles the setting of
– junior developers testing a legacy project with test generators

Limitations and how to extend in future studies
– Students (professionals with experience in test generation?)
– A priori knowledge of the objects
– Specification: code comments vs. other forms
– Only 1 tool used (IntelliTest)

Slide 11

Discussion: recommendations

Classification task is challenging
– Median misclassification 20% (14–33%)
– Consider this in future evaluations and tool development

Participants’ feedback (see paper/dataset for more)
– “It was hard to decide … when it tests an unspecified case”
– “I think that some assertions are useless…”

Summary of recommendations (see paper for more; an illustrative sketch follows below)
– Structure the test code (Arrange/Act/Assert) and categorize tests
– Naming variables, commenting etc. is important in generated tests
– Instead of assert, use something else, e.g., “observed”
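
A minimal sketch of what these recommendations could look like in a generated test. The Observed helper and all names below are hypothetical illustrations, not part of IntelliTest or of the paper's tooling; the method under test is the CalculateSum example from slide 2.

    using System;
    using Microsoft.VisualStudio.TestTools.UnitTesting;

    [TestClass]
    public class CalculateSumGeneratedTests
    {
        // Hypothetical helper: records the observed value instead of asserting it,
        // so the test documents current behavior without claiming it is correct.
        static class Observed
        {
            public static void Value(string what, object generatorSaw, object thisRun) =>
                Console.WriteLine($"{what}: generator observed {generatorSaw}, this run returned {thisRun}");
        }

        // The method under test, copied from slide 2.
        static int CalculateSum(int start, int number, int[] a)
        {
            if (start + number > a.Length || a.Length <= 1)
                throw new ArgumentException();
            int sum = 0;
            for (int i = start; i < start + number - 1; i++)
                sum += a[i];
            return sum;
        }

        [TestMethod]
        public void CalculateSum_FromIndex0_FourElements()
        {
            // Arrange: the fixed 5-element array chosen by the generator
            int[] numbers = { 4, 5, 6, 7, 8 };

            // Act: sum 4 elements starting at index 0
            int sum = CalculateSum(0, 4, numbers);

            // Observe: the implementation currently returns 15; the developer still
            // has to judge whether 15 matches the specification.
            Observed.Value("sum of four elements from index 0", 15, sum);
        }
    }

The exact helper does not matter; the point is that the reader can tell a generator-observed value apart from a value the specification actually requires.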

Slide 12

Towards a theory of classification

Slide 13

Why is this classification task hard?

• Many factors + some puzzling combinations, especially with exceptions
• No “common” approach in tools or papers
• Our preliminary, systematic proposal: 3 main factors (8 cases)

Slide 14

Possible cases of the classification task

Table skeleton (filled in on the next slide): Conditions — ID, Shall raise exception, Fault is triggered, Exception is raised; Outcome — Test encodes, Developer action, Classif., Test outcome.

Slide 15

Possible cases of the classification task (see paper for examples; a compact sketch of this mapping follows below)

ID | Shall raise exception | Fault is triggered | Exception is raised | Test encodes        | Developer action                         | Classif. | Test outcome
C1 | F                     | F                  | F                   | Expected behavior   | Acknowledge the test                     | OK       | Pass
C2 | F                     | F                  | T                   | -                   | -                                        | -        | -
C3 | F                     | T                  | F                   | Unexpected behavior | Realize faulty impl.                     | WRONG    | Pass
C4 | F                     | T                  | T                   | Unexpected behavior | Recognize the fault-indicating exception | WRONG    | Pass or fail
C5 | T                     | F                  | F                   | -                   | -                                        | -        | -
C6 | T                     | F                  | T                   | Expected behavior   | Acknowledge the test                     | OK       | Pass or fail
C7 | T                     | T                  | F                   | Unexpected behavior | Realize an exception is missing          | WRONG    | Pass
C8 | T                     | T                  | T                   | Unexpected behavior | Recognize the invalid exception          | WRONG    | Pass or fail
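
A compact way to read the table: the classification follows mechanically from the three conditions. The sketch below only restates the table in C#; treating the empty rows C2 and C5 as combinations that cannot occur is an assumption based on the table, not something stated on the slide.

    enum Classification { Ok, Wrong, NotApplicable }

    static class CaseTable
    {
        // Maps the three conditions of the table above to the classification
        // a developer should reach.
        public static Classification Classify(bool shallRaiseException,
                                              bool faultTriggered,
                                              bool exceptionRaised)
        {
            // Rows C2 and C5 are empty in the table: without a triggered fault,
            // the raised exception cannot contradict the specification
            // (assumption, not stated on the slide).
            if (!faultTriggered && shallRaiseException != exceptionRaised)
                return Classification.NotApplicable;

            // C1 and C6: the behavior matches the specification, so the
            // developer acknowledges the test as OK.
            if (!faultTriggered)
                return Classification.Ok;

            // C3, C4, C7, C8: a fault is triggered, so the test encodes
            // unexpected behavior and should be classified as WRONG.
            return Classification.Wrong;
        }
    }

For example, Classify(shallRaiseException: true, faultTriggered: true, exceptionRaised: false) yields Wrong, matching row C7 (a required exception is missing).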

Slide 16

How do tools handle these specific cases?

IntelliTest
• Exceptions thrown inside the tested class are expected; the test passes
• Otherwise a failing test with no assert is generated

EvoSuite
• Wraps every detected exception in try/catch; the test passes
• No failing tests

Randoop
• Error-revealing tests: potential error, the test fails (several heuristics)
• Regression tests: capture the current behavior, the test passes
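
To make the contrast concrete, here is a hand-written C# illustration (not actual output of IntelliTest, EvoSuite, or Randoop) of three ways a generated test can package the same ArgumentException raised by CalculateSum(0, 9, ...) from slide 2. The class and method names are made up, and CalculateSum is assumed to be accessible in this scope.

    using System;
    using Microsoft.VisualStudio.TestTools.UnitTesting;

    [TestClass]
    public class ExceptionPackagingIllustration
    {
        // CalculateSum(int, int, int[]) from slide 2 is assumed to be in scope.

        // (a) The exception is declared as expected: the test passes when it is thrown.
        [TestMethod]
        [ExpectedException(typeof(ArgumentException))]
        public void CalculateSum_CountTooLarge_DeclaredExpected()
        {
            CalculateSum(0, 9, new int[] { 4, 5, 6, 7, 8 });
        }

        // (b) The exception is wrapped in try/catch: the test passes as long as the
        // observed exception is still thrown, merely recording current behavior.
        [TestMethod]
        public void CalculateSum_CountTooLarge_Wrapped()
        {
            try
            {
                CalculateSum(0, 9, new int[] { 4, 5, 6, 7, 8 });
                Assert.Fail("no exception was thrown");
            }
            catch (ArgumentException)
            {
                // observed: the call throws; the test still passes
            }
        }

        // (c) The exception is left unhandled and no assert is generated: the test fails.
        [TestMethod]
        public void CalculateSum_CountTooLarge_Unhandled()
        {
            int result = CalculateSum(0, 9, new int[] { 4, 5, 6, 7, 8 });  // throws, so the test fails
        }
    }

Which of these shapes a tool emits determines whether the classification burden appears as a failing test to investigate or as a passing test the developer must read carefully.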

Slide 17

Suggestions for future research

Change perspective
– from “Does this test encode a fault?” to “Does the user recognize that there is a fault with the help of this test?”
– from “correct/faulty test” to “appropriate or useful test”
– proxy measures could overestimate the faults that are detected

In your approach/tool, think about:
– When should a generated test pass or fail?
– What should you do when an exception is raised in a test?
– When is a generated test useful for developers?

Slide 18

Summary

Dávid Honfi, Zoltán Micskei
Open access paper: 10.1007/s11219-019-09446-5 | Dataset: 10.5281/zenodo.1472714