Test Amplification - Speaker Deck

Slide 1

Slide 1 text

Test Amplification
Arie van Deursen, TU Delft
Dutch Testing Day, November 6, 2018
Pouria Derakhshanfar, Xavier Devroey, Mozhan Soltani, Annibale Panichella, Andy Zaidman

Slide 2

Slide 2 text

Slide 3

Slide 3 text

Test Amplification Tools Under Construction
• DSPOT: Detect and generate missing assertions for JUnit test cases
• DESCARTES: Speed up mutation testing by making bigger changes (drop method body)
• CAMP: Environment test amplification through Docker mutations
• EvoCrash / Botsing: Test suite amplification via crash reproduction
https://www.stamp-project.eu/
https://github.com/STAMP-project

Slide 4

Slide 4 text

java.lang.ClassCastException: […]
    at org…..SolrEntityReferenceResolver.getWikiReference(....java:93)
    at org…..SolrEntityReferenceResolver.getEntityReference(….java:70)
    at org…..SolrEntityReferenceResolver.resolve(….java:63)
    at org…..SolrDocumentReferenceResolver.resolve(….java:48)
    at …
Crash!?!

Slide 5

Slide 5 text

Crash Reproduction
• File an issue
• Discover steps to reproduce the crash
• Example: XWIKI-13031
Challenges:
• Labor intensive
• Hard to automate

Slide 6

Slide 6 text

Java Stack Trace (Issue XWIKI-13031)
java.lang.ClassCastException: […]
    at org…..SolrEntityReferenceResolver.getWikiReference(....java:93)
    at org…..SolrEntityReferenceResolver.getEntityReference(….java:70)
    at org…..SolrEntityReferenceResolver.resolve(….java:63)
    at org…..SolrDocumentReferenceResolver.resolve(….java:48)
    at …
[Slide annotations: Exception, Frames, Target ➞]

Slide 7

Slide 7 text

Search-Based Crash Reproduction
[Figure labels: Stack trace (Exception: at x(…) at y(…) at e(…)); Random initial test suite (call sequences over a(), b(), c(), d(), e()); Evolutionary search; Crash reproducing test case (a() e() d())]
Soltani, Panichella, van Deursen. “Search-Based Crash Reproduction and Its Impact on Debugging.” IEEE Transactions on Software Engineering, 2018. pure.tudelft.nl

Slide 8

Slide 8 text

Crash-reproducing Test Case

public void test0() throws Throwable {
    …
    SolrEntityReferenceResolver solrEntityReferenceResolver0 = new …();
    EntityReferenceResolver entityReferenceResolver0 = … mock(…);
    solrDocument0.put("wiki", (Object) entityType0);
    Injector.inject(solrEntityReferenceResolver0, …);
    Injector.validateBean(solrEntityReferenceResolver0, …);
    …
    // Undeclared exception!
    solrEntityReferenceResolver0.resolve(solrDocument0, entityType0, objectArray0);
}

java.lang.ClassCastException: […]
    at org…..SolrEntityReferenceResolver.getWikiReference(....java:93)
    at org…..SolrEntityReferenceResolver.getEntityReference(….java:70)
    at org…..SolrEntityReferenceResolver.resolve(….java:63)

Slide 9

Slide 9 text

EvoSuite
• Search-based test generation
• Many random JUnit tests
• Optimized to maximize (e.g.) branch coverage
• Combines and improves tests to optimize overall fitness
• http://www.evosuite.org/
Search loop (diagram): Initialize population → Evaluate fitness → Next generation (Selection, Crossover, Mutation, Reinsertion) → repeat until [fitness == 0 or budget exhausted]
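The loop in the diagram can be sketched as a generic genetic algorithm. This is a toy illustration, not EvoSuite code: individuals are bit strings rather than JUnit tests, and the fitness function, class name, and parameters are all assumptions made for the example. Fitness is minimized, so 0 is the stopping value, as on the slide.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

/** Toy genetic search over bit strings; hypothetical, NOT the EvoSuite implementation. */
final class GeneticSearchSketch {
    static final int LENGTH = 16;

    /** Toy fitness to minimize: number of unset bits (0 means the goal is reached). */
    static int fitness(boolean[] individual) {
        int zeros = 0;
        for (boolean bit : individual) {
            if (!bit) zeros++;
        }
        return zeros;
    }

    static boolean[] search(int populationSize, int maxGenerations, long seed) {
        Random rnd = new Random(seed);
        // Initialize population with random individuals.
        List<boolean[]> population = new ArrayList<>();
        for (int i = 0; i < populationSize; i++) {
            boolean[] individual = new boolean[LENGTH];
            for (int j = 0; j < LENGTH; j++) individual[j] = rnd.nextBoolean();
            population.add(individual);
        }
        Comparator<boolean[]> byFitness = Comparator.comparingInt(GeneticSearchSketch::fitness);
        for (int gen = 0; gen < maxGenerations; gen++) {
            population.sort(byFitness);                  // evaluate fitness
            if (fitness(population.get(0)) == 0) break;  // fitness == 0: stop
            List<boolean[]> next = new ArrayList<>();
            next.add(population.get(0));                 // reinsertion of the best individual
            while (next.size() < populationSize) {
                boolean[] child = crossover(tournament(population, rnd),
                                            tournament(population, rnd), rnd);
                mutate(child, rnd);
                next.add(child);
            }
            population = next;                           // next generation
        }
        population.sort(byFitness);
        return population.get(0);
    }

    /** Selection: binary tournament, the fitter of two random individuals wins. */
    private static boolean[] tournament(List<boolean[]> population, Random rnd) {
        boolean[] a = population.get(rnd.nextInt(population.size()));
        boolean[] b = population.get(rnd.nextInt(population.size()));
        return fitness(a) <= fitness(b) ? a : b;
    }

    /** Single-point crossover producing a fresh child array. */
    private static boolean[] crossover(boolean[] p1, boolean[] p2, Random rnd) {
        int cut = 1 + rnd.nextInt(LENGTH - 1);
        boolean[] child = new boolean[LENGTH];
        System.arraycopy(p1, 0, child, 0, cut);
        System.arraycopy(p2, cut, child, cut, LENGTH - cut);
        return child;
    }

    /** Mutation: flip each bit with probability 1/LENGTH. */
    private static void mutate(boolean[] individual, Random rnd) {
        for (int i = 0; i < LENGTH; i++) {
            if (rnd.nextInt(LENGTH) == 0) individual[i] = !individual[i];
        }
    }
}
```

Because the best individual is always reinserted unchanged, the best fitness can never get worse from one generation to the next. Here `maxGenerations` stands in for the time budget.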

Slide 10

Slide 10 text

EvoCrash
• Implemented on top of EvoSuite
• Requires
  • Stack trace
  • Binaries: .jar files
  • Time budget: Set by user
• Produces a single test case
  • That reproduces the crash

Slide 11

Slide 11 text

Initialize Population
• Guided initialization
  • Random method calls
  • Guarantees that a target method call is inserted in each test at least once
    • Direct for public and protected methods
    • Indirect for private methods
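A minimal sketch of guided initialization, assuming a test can be modeled as a list of method-call names. The class and parameter names are hypothetical, and the real tool works on executable JUnit statements rather than strings. For a private target, `targetCall` would be a public or protected method that calls it (the "indirect" case).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch: random call sequences that always contain the target call. */
final class GuidedInitializationSketch {
    static List<List<String>> initialize(List<String> availableCalls, String targetCall,
                                         int populationSize, int maxLength, long seed) {
        Random rnd = new Random(seed);
        List<List<String>> population = new ArrayList<>();
        for (int i = 0; i < populationSize; i++) {
            // Random method calls, between 1 and maxLength of them.
            int length = 1 + rnd.nextInt(maxLength);
            List<String> test = new ArrayList<>();
            for (int j = 0; j < length; j++) {
                test.add(availableCalls.get(rnd.nextInt(availableCalls.size())));
            }
            // Guarantee: the target call occurs at least once in every test.
            if (!test.contains(targetCall)) {
                test.add(rnd.nextInt(test.size() + 1), targetCall);
            }
            population.add(test);
        }
        return population;
    }
}
```

For example, `initialize(Arrays.asList("a()", "b()", "c()", "d()"), "e()", 10, 5, 1L)` yields ten sequences that each contain `e()`.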

Slide 12

Slide 12 text

Evaluate Fitness
• Global fitness function to guide generation process
  • Line coverage
    • How far are we from the line where the exception is thrown?
  • Exception coverage
    • Is the exception thrown?
  • Stack trace similarity
    • How similar is the stack trace compared to the original (given) stack trace?
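One way to combine the three components into a single value to minimize is sketched below. The weights, the normalization, and the rule that a later component only counts once the earlier one is satisfied are assumptions chosen for illustration; the published EvoCrash fitness function may differ in its details.

```java
import java.util.List;

/** Hypothetical crash-reproduction fitness; 0.0 means the crash is reproduced. */
final class CrashFitnessSketch {
    // Illustrative weights (assumption): reaching the line matters most.
    static final double W_LINE = 3.0;
    static final double W_EXCEPTION = 2.0;
    static final double W_TRACE = 1.0;

    /** Maps a non-negative distance into [0, 1). */
    static double normalize(double distance) {
        return distance / (distance + 1.0);
    }

    /** Fraction of expected frames that are not matched at the same position. */
    static double traceDistance(List<String> expectedFrames, List<String> actualFrames) {
        if (expectedFrames.isEmpty()) return 0.0;
        int mismatches = 0;
        for (int i = 0; i < expectedFrames.size(); i++) {
            if (i >= actualFrames.size() || !expectedFrames.get(i).equals(actualFrames.get(i))) {
                mismatches++;
            }
        }
        return (double) mismatches / expectedFrames.size();
    }

    static double fitness(double lineDistance, boolean exceptionThrown,
                          List<String> expectedFrames, List<String> actualFrames) {
        double line = normalize(lineDistance);
        if (line > 0.0) {
            // Target line not reached: the other two components count as maximal.
            return W_LINE * line + W_EXCEPTION + W_TRACE;
        }
        if (!exceptionThrown) {
            // Line reached, but no exception thrown there.
            return W_EXCEPTION + W_TRACE;
        }
        // Exception thrown: only the stack trace difference remains.
        return W_TRACE * traceDistance(expectedFrames, actualFrames);
    }
}
```

With this ordering, a test that reaches the target line always scores better than one that does not, and a test that throws the exception always scores better than one that only reaches the line.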

Slide 13

Slide 13 text

Next Generation
• Selection
  • Fittest tests according to the fitness function
• Guided crossover
  • Single-point crossover
  • Check that call to target method is preserved
• Guided mutation
  • Add/change/drop statements
  • Check that call to target method is preserved
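The "check that call to target method is preserved" step can be sketched for crossover as follows. Tests are again modeled as lists of call names, and falling back to a copy of the first parent when the child loses the target call is an assumption made for this example, not a confirmed detail of the tool.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of single-point crossover that keeps the target call. */
final class GuidedCrossoverSketch {
    /**
     * Joins the first `cut` calls of p1 with the calls of p2 from position `cut` onward.
     * `cut` is assumed to be non-negative.
     */
    static List<String> crossover(List<String> p1, List<String> p2, int cut, String targetCall) {
        int cut1 = Math.min(cut, p1.size());
        int cut2 = Math.min(cut, p2.size());
        List<String> child = new ArrayList<>(p1.subList(0, cut1));
        child.addAll(p2.subList(cut2, p2.size()));
        // Guided step: discard a child that no longer calls the target method.
        return child.contains(targetCall) ? child : new ArrayList<>(p1);
    }
}
```

Guided mutation could apply the same check after adding, changing, or dropping a statement.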

Slide 14

Slide 14 text

Does it Work? The JCrashPack Crash Replication Benchmark
• 200 crashes from various open source projects
• XWiki
  • STAMP partner
  • From XWiki issue tracking system: 51 crashes
• Defects4J applications
  • State of the art fault localization benchmark
  • 73 crashes (with fixes)
• Elasticsearch
  • Based on popularity
  • From Elasticsearch issue tracking system: 76 crashes
• Filtered, verified, cleaned up, right jar versions, …
https://github.com/STAMP-project/JCrashPack

Slide 15

Slide 15 text

ExRunner: Running Crash Replication
• Python tool to benchmark crash reproduction tools
• Multithreaded execution
• Monitors tool executions
[Figure labels: Stack traces, Tool configs., Jar files; Job Generator; Job 1 … Job n; Thread 1 … Thread n; Observer; per job: Logs, results, Test case]

Slide 16

Slide 16 text

EvoCrash Applied to JCrashPack
[Figure: one panel each for Defects4J, XWiki, and Elasticsearch. Outcome categories: crashed, ex. thrown, failed, line reached, reproduced. Axis: Number of frames (log. scale), ticks at 1, 10, 100.]

Slide 17

Slide 17 text

Identified 12 Key Challenges
• Input data generation
  • For complex inputs, generic types, etc.
• Environmental dependencies
  • Environment state hard to manage at unit level
• Complex code
  • Long methods, with a lot of nested predicates
• Abstract classes and methods
  • Cannot be instantiated; one concrete implementation is picked randomly
• […]
Mozhan Soltani, Pouria Derakhshanfar, Xavier Devroey, Arie van Deursen. An Empirical Evaluation of Search-Based Crash Reproduction. TU Delft, 2018. In preparation.

Slide 18

Slide 18 text

Work in Progress: Improved Seeding
Hypothesis: An initial test suite close to actual usage of the class is more likely to lead to a crash-reproducing test case than a random one.
[Figure labels: Random initial test suite (call sequences over a(), b(), c(), d(), e()); Stack trace (Exception: at x(…) at y(…) at e(…))]
Pouria Derakhshanfar, Xavier Devroey, Gilles Perrouin, Andy Zaidman, and Arie van Deursen. Search-based Crash Reproduction using Behavioral Model Seeding. TU Delft, 2018. Submitted.

Slide 19

Slide 19 text

Test seeding
Use the existing tests to generate the initial test suite
[Figure labels: Existing tests; Existing tests subset; Random initial test suite (call sequences over a(), b(), c(), d(), e()); Stack trace (Exception: at x(…) at y(…) at e(…))]

Slide 20

Slide 20 text

Model seeding
Use a model of method usage to generate the initial test suite
[Figure labels: Model (over a(), b(), c(), e(), d()); Model-driven initial test suite; Random initial test suite; Stack trace (Exception: at x(…) at y(…) at e(…))]
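As a rough illustration of deriving a call sequence from a usage model, the sketch below performs a random walk over a transition map. The map representation, the class name, and the uniform choice among successors are assumptions for this example; the slides do not specify how sequences are selected from the model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical sketch: sample one abstract call sequence from a usage model. */
final class ModelSeedingSketch {
    /**
     * Walks the model from `start`, picking a random successor at each step,
     * until a call without successors is reached or maxLength calls are emitted.
     */
    static List<String> walk(Map<String, List<String>> model, String start,
                             int maxLength, Random rnd) {
        List<String> sequence = new ArrayList<>();
        String current = start;
        while (current != null && sequence.size() < maxLength) {
            sequence.add(current);
            List<String> successors = model.get(current);
            current = (successors == null || successors.isEmpty())
                    ? null
                    : successors.get(rnd.nextInt(successors.size()));
        }
        return sequence;
    }
}
```

Sequences sampled this way could then be turned into concrete tests and placed in the initial population alongside randomly generated ones.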

Slide 21

Slide 21 text

Call Sequence Models
• Model generated from sequences of method calls
• Coming from
  • Source code
    • Static analysis
  • Test cases
    • Dynamic analysis
  • Operations logs
    • Online analysis
N-gram inference, from call sequences such as:
[b(), a(), e()]
[c(), d(), a(), e()]
[b(), a(), d(), a(), d(), a(), e()]
…
[Figure: Model over a(), b(), c(), e(), d()]
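A minimal sketch of n-gram inference for n = 2: count how often one call directly follows another in the observed sequences, then turn the counts into transition probabilities. The class and method names are hypothetical, and the choice of bigrams is an assumption; the slides do not state which n is used.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical bigram model of method-call sequences. */
final class CallSequenceModelSketch {
    /** Counts, for every call, how often each other call directly follows it. */
    static Map<String, Map<String, Integer>> infer(List<List<String>> sequences) {
        Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
        for (List<String> sequence : sequences) {
            for (int i = 0; i + 1 < sequence.size(); i++) {
                counts.computeIfAbsent(sequence.get(i), k -> new LinkedHashMap<>())
                      .merge(sequence.get(i + 1), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Relative frequency of the transition from -> to; 0.0 if `from` never has a successor. */
    static double probability(Map<String, Map<String, Integer>> counts, String from, String to) {
        Map<String, Integer> successors = counts.get(from);
        if (successors == null) return 0.0;
        int total = 0;
        for (int count : successors.values()) total += count;
        return successors.getOrDefault(to, 0) / (double) total;
    }
}
```

On the three example sequences from the slide, `a()` is followed by `e()` three times and by `d()` twice, so the estimated probability of `a()` → `e()` is 3/5.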

Slide 22

Slide 22 text

Example Transitions for java.util.LinkedList
[Figure: state machine with states S0, S1, S2, S3, S4, S5, SX and transitions labeled size(), add(Object), iterator(), get(int), remove(int)]

Slide 23

Slide 23 text

Model example

Slide 24

Slide 24 text

Model Seeding Approach
[Figure labels:
Inputs: app.jar, tests.jar, Stacktrace.
Guided Genetic Algorithm: Guided initialization, Fitness evaluation, Selection, Guided crossover, Guided mutation, Reinsertion, [fitness == 0 or budget exhausted].
Test seeding: Instrumented execution, Test cases, populates Objects pool.
Behavioral model seeding: Static analysis, Instrumented execution, Call sequ., Models inference, Models, Abstract test cases selection, populates Objects pool.
Probabilities: Pr[clone], Pr[pick init], Pr[pick mut].
Callouts: 1, 2, 3, 4, 5, A.]

Slide 25

Slide 25 text

[Figure: results per seeding configuration. Outcome categories: not started, failed, line reached, ex. thrown, reproduced. Axis: Number of frames (ticks at 0, 25, 50, 75). Configurations: no s., test s. 0.2, test s. 0.5, test s. 0.8, test s. 1.0, model s. 0.2, model s. 0.5, model s. 0.8, model s. 1.0.]

Slide 26

Slide 26 text

Model Seeding: Results
• Behavioral seeding outperforms test seeding / no seeding
  • 13 more crashes could be reproduced
  • No performance overhead
• Extra crashes are more complex
  • Higher frame levels
  • Better at “industrial” cases
• Model seeding outperforms test seeding.

Slide 27

Slide 27 text

Implementation: Botsing
• Open source implementation of EvoCrash approach
• Uses EvoSuite as library for instrumentation
• Extensible, modular, tested
• Used as test bed for new crash reproduction ideas (model seeding)
• https://github.com/STAMP-project/botsing