
Classifying generated white-box tests: an exploratory study

Software Quality Journal paper, presented at ICST'21 as a journal-first paper.
DOI: https://dx.doi.org/10.1007/s11219-019-09446-5

Can developers understand tests generated from the source code? We performed a study with 106 participants and the results show that this is a non-trivial task that can affect the practical fault-finding capability of test generators.

Transcript

  1. Classifying generated white-box tests: an exploratory study

    Dávid Honfi, Zoltán Micskei
    ICST’21 Journal-First Papers
    micskeiz mit.bme.hu/~micskeiz
    Software Quality Journal, 27:3, pp. 1339–1380, Springer, 2019. DOI: 10.1007/s11219-019-09446-5
  2. White-box test generation

    Test generator: select test inputs, observe behavior, generate test code.

    Method under test:

        /// <summary>Calculates the sum of given number of
        /// elements from an index in an array.</summary>
        int CalculateSum(int start, int number, int[] a)
        {
            if (start + number > a.Length || a.Length <= 1)
                throw new ArgumentException();
            int sum = 0;
            for (int i = start; i < start + number - 1; i++)
                sum += a[i];
            return sum;
        }

    Generated tests:

        [TestMethod]
        public void CalculateSumTest284()
        {
            int[] ints = new int[5] { 4, 5, 6, 7, 8 };
            int i = CalculateSum(0, 0, ints);
            Assert.AreEqual<int>(0, i);
        }

        [TestMethod]
        public void CalculateSumTest647()
        {
            int[] ints = new int[5] { 4, 5, 6, 7, 8 };
            int i = CalculateSum(0, 4, ints);
            Assert.AreEqual<int>(15, i);
        }
  3. White-box test generation

    Same CalculateSum method and generated tests (CalculateSumTest284, CalculateSumTest647) as on the previous slide.

    Question: Do these tests “look good”?
    • OK: expected w.r.t. specification
    • WRONG: unexpected w.r.t. specification
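
One way to see why the question is not trivial is to compare each asserted value with what the summary comment implies. The sketch below assumes the comment means "sum `number` elements starting at index `start`"; that reading, and the helpers `SpecSum` and `Classify`, are illustrative assumptions and not part of the study materials.

```csharp
using System;

public static class ClassificationSketch
{
    // Method under test, copied from the slide (loop bound left unchanged).
    public static int CalculateSum(int start, int number, int[] a)
    {
        if (start + number > a.Length || a.Length <= 1)
            throw new ArgumentException();
        int sum = 0;
        for (int i = start; i < start + number - 1; i++)
            sum += a[i];
        return sum;
    }

    // Hypothetical oracle: our reading of the summary comment,
    // i.e. the sum of `number` elements beginning at index `start`.
    public static int SpecSum(int start, int number, int[] a)
    {
        int sum = 0;
        for (int i = start; i < start + number; i++)
            sum += a[i];
        return sum;
    }

    // A generated test is OK if the value it asserts matches the specification.
    public static string Classify(int assertedValue, int start, int number, int[] a)
    {
        return assertedValue == SpecSum(start, number, a) ? "OK" : "WRONG";
    }

    public static void Main()
    {
        int[] ints = { 4, 5, 6, 7, 8 };
        // CalculateSumTest284 asserts 0 for start = 0, number = 0.
        Console.WriteLine(Classify(0, 0, 0, ints)); // prints "OK"
        // CalculateSumTest647 asserts 15 for start = 0, number = 4.
        Console.WriteLine(Classify(15, 0, 4, ints)); // prints "WRONG"
    }
}
```

Under this reading, the second test passes against the implementation yet encodes unexpected behavior: the specification implies 4 + 5 + 6 + 7 = 22, while the implementation returns 15.
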
  4. Goal and why should we care about it?

    Goal: How do developers who use test generator tools perform in deciding whether the generated tests encode expected or unexpected behavior?

    • Typical research evaluation setup: a faulty impl. and a correct impl. are both available, so a generated test is labeled “OK” or “Bug!”
    • Real setup: a “do not know whether faulty or correct” implementation, so a generated test is “OK??” or “Bug?”

    Classification is a non-trivial task affecting the practical fault-finding capability of test generators and empirical evaluations.
  5. Planning: overview of the study

    RQ1: How do developers perform in the classification of generated tests?
    RQ2: How much time do developers spend with the classification?

    Subjects
    • Students only
    • MSc V&V course
    • Basic experience
    • Apply voluntarily
    Objects
    • C#, from GitHub
    • 5 methods in 4 repos
    • 3 tests per method
    • Artificial faults (ODC)
    Environment
    • IntelliTest
    • Experiment portal
    • Visual Studio
    • Test runs and debug
    Procedure
    • 15 mins tutorial
    • Classify all 15 tests
    • At most 60 minutes
    • Activities recorded
  6. Planning: experiment portal

    • Specification of the method (open-source project)
    • Current test to classify
    • Observed behavior encoded in asserts
    • Answer of participant
  7. Execution

    |             | Session | Date       | Object           | Participants |
    | Original    | #1      | 2016-12-01 | NBitCoin         | 30           |
    | Original    | #2      | 2016-12-08 | Math.NET         | 24           |
    | Replication | #3      | 2017-11-30 | NodaTime         | 22           |
    | Replication | #4      | 2017-12-07 | NetTopologySuite | 30           |
    | SUM         |         |            |                  | 106          |
  8. Results: RQ1 classification performance

    Matthews correlation coefficient
    Results from session #2
    • Column: participants
    • Rows: test methods
    • Cell: answers
    • Moderate classification ability
    • Incorrectly classifying both expected and unexpected results
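
For reference, the Matthews correlation coefficient is computed from the four cells of a binary confusion matrix as (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below is a minimal implementation of that standard formula. Which class counts as "positive" (for example, tests that should be marked WRONG) and the counts used in `Main` are assumptions for illustration, not data from the study.

```csharp
using System;

public static class Mcc
{
    // Matthews correlation coefficient from confusion-matrix counts.
    // Returns a value in [-1, 1]; 1 is perfect agreement, 0 is chance level.
    public static double Compute(long tp, long tn, long fp, long fn)
    {
        double denominator = Math.Sqrt(
            (double)(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));

        // Common convention: report 0 when any marginal sum is zero.
        if (denominator == 0.0)
            return 0.0;

        return ((double)tp * tn - (double)fp * fn) / denominator;
    }

    public static void Main()
    {
        // Hypothetical participant: 15 answers, 3 of them misclassified.
        double mcc = Compute(tp: 6, tn: 6, fp: 2, fn: 1);
        Console.WriteLine(mcc.ToString("F2"));
    }
}
```

Unlike plain accuracy, the coefficient penalizes errors in both directions, which matters here because participants misclassified both expected and unexpected results.
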
  9. Results: RQ2 time spent on tasks

    • Median time needed for one test method: 55–117 s
    • Time needed is not negligible in a larger project
    • Learning effect (method, process)
  10. Discussion: limitations

    Intentionally simple study design
    – 2 pilots to refine timing, length, materials…
    Current design resembles the setting of
    – Junior developers testing a legacy project with test generators
    Limitations and how to extend in future studies
    – Students (professionals with experience in test generation?)
    – A priori knowledge of objects
    – Specification: code comments vs. other forms
    – Used only 1 tool (IntelliTest)
  11. Discussion: recommendations

    Classification task is challenging
    – Median misclassification 20% (14–33%)
    – Consider in future evaluations and tool development
    Participants’ feedback (see paper/dataset for more)
    – “It was hard to decide … when it tests an unspecified case”
    – “I think that some assertions are useless…”
    Summary of recommendations (see paper for more)
    – Structure the test code (Arrange/Act/Assert) and categorize tests
    – Naming variables, commenting etc. is important in generated tests
    – Instead of assert use something else, e.g., observed
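
As an illustration of the first two recommendations, the sketch below restructures a generated test for the `CalculateSum` example into Arrange/Act/Assert sections with descriptive names and comments. The class name, test name, and variable names are hypothetical, and the code assumes the MSTest package is referenced; it is not output of any tool discussed in the talk.

```csharp
using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class CalculateSumGeneratedTests
{
    // Method under test, copied from the example slide so the sketch compiles on its own.
    private static int CalculateSum(int start, int number, int[] a)
    {
        if (start + number > a.Length || a.Length <= 1)
            throw new ArgumentException();
        int sum = 0;
        for (int i = start; i < start + number - 1; i++)
            sum += a[i];
        return sum;
    }

    [TestMethod]
    public void CalculateSum_Start0Number4_ObservedSumIs15()
    {
        // Arrange: a five-element input and the range to sum.
        int[] input = { 4, 5, 6, 7, 8 };
        int start = 0;
        int number = 4;

        // Act: call the method under test.
        int observedSum = CalculateSum(start, number, input);

        // Assert: 15 is the behavior the generator observed,
        // not necessarily the behavior the specification requires.
        Assert.AreEqual(15, observedSum);
    }
}
```

Naming the result `observedSum` and saying so in the comment is one lightweight way to signal that the assertion records observed behavior, in the spirit of the third recommendation.
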
  12. Towards a theory of classification
  13. Why is this classification task hard?

    • Many factors + some puzzling combinations
    • Especially with exceptions
    • No “common” approach in tools, papers
    • Our preliminary, systematic proposal: 3 main factors (8 cases)
  14. Possible cases of the classification task

    Table columns
    • ID
    • Conditions: Shall raise exception, Fault is triggered, Exception is raised
    • Outcome: Test encodes, Developer action, Classif., Test outcome
  15. Possible cases of the classification task

    | ID | Shall raise exception | Fault is triggered | Exception is raised | Test encodes        | Developer action                         | Classif. | Test outcome |
    | C1 | F                     | F                  | F                   | Expected behavior   | Acknowledge the test                     | OK       | Pass         |
    | C2 | F                     | F                  | T                   | -                   | -                                        | -        | -            |
    | C3 | F                     | T                  | F                   | Unexpected behavior | Realize faulty impl.                     | WRONG    | Pass         |
    | C4 | F                     | T                  | T                   | Unexpected behavior | Recognize the fault-indicating exception | WRONG    | Pass or fail |
    | C5 | T                     | F                  | F                   | -                   | -                                        | -        | -            |
    | C6 | T                     | F                  | T                   | Expected behavior   | Acknowledge the test                     | OK       | Pass or fail |
    | C7 | T                     | T                  | F                   | Unexpected behavior | Realize an exception is missing          | WRONG    | Pass         |
    | C8 | T                     | T                  | T                   | Unexpected behavior | Recognize the invalid exception          | WRONG    | Pass or fail |

    See paper for examples
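
The eight cases can be read as a small decision function over the three conditions. The sketch below encodes the Classif. column of the table. The type and method names are hypothetical, and treating the two "-" rows (C2, C5) as combinations that do not occur without a fault is our reading of the table, not a statement from the slides.

```csharp
using System;

public enum Classification { Ok, Wrong, NotApplicable }

public static class ClassificationCases
{
    // Maps the three conditions to the case ID and the Classif. column.
    public static (string CaseId, Classification Result) Classify(
        bool shallRaiseException, bool faultTriggered, bool exceptionRaised)
    {
        // IDs follow the table's binary ordering: C1 = F F F, ..., C8 = T T T.
        int index = (shallRaiseException ? 4 : 0)
                  + (faultTriggered ? 2 : 0)
                  + (exceptionRaised ? 1 : 0);
        string caseId = "C" + (index + 1);

        // Any triggered fault means the test encodes unexpected behavior (C3, C4, C7, C8).
        if (faultTriggered)
            return (caseId, Classification.Wrong);

        // No fault and the exception behavior matches the specification (C1, C6).
        if (shallRaiseException == exceptionRaised)
            return (caseId, Classification.Ok);

        // Rows marked "-" in the table (C2, C5).
        return (caseId, Classification.NotApplicable);
    }

    public static void Main()
    {
        // Enumerate all combinations in table order.
        foreach (bool shall in new[] { false, true })
            foreach (bool fault in new[] { false, true })
                foreach (bool raised in new[] { false, true })
                {
                    var c = Classify(shall, fault, raised);
                    Console.WriteLine($"{c.CaseId}: {c.Result}");
                }
    }
}
```

The function only reproduces the classification; the "Test outcome" column is left out because, per the table, several cases can either pass or fail depending on how the tool handles exceptions.
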
  16. How do tools handle these specific cases?

    IntelliTest
    • Exceptions thrown inside the class are expected, test passes
    • Otherwise failing test with no assert
    EvoSuite
    • Wraps every detected exception in try/catch, test passes
    • No failing tests
    Randoop
    • Error-revealing: potential error, test fails (several heuristics)
    • Regression tests: captures current behavior, test passes
  17. Suggestions for future research

    Change perspective
    • From “Does this test encode a fault?” to “Does the user recognize that there is a fault with the help of this test?”
    • From “Correct/faulty test” to “Appropriate or useful test”

    In your approach/tool, think about:
    – When should a generated test pass or fail?
    – What should you do when an exception is raised in a test?
    – When is a generated test useful for developers?

    Proxy measures could overestimate faults that are detected
  18. Summary

    Open access paper: 10.1007/s11219-019-09446-5 | Dataset: 10.5281/zenodo.1472714
    Dávid Honfi, Zoltán Micskei