
Summarization Techniques for Code, Changes, and Testing

Sebastiano Panichella

July 12, 2016

Transcript

  1. Summarization Techniques for Code, Changes, and Testing Sebastiano Panichella Institut

    für Informatik Universität Zürich [email protected] http://www.ifi.uzh.ch/seal/people/panichella.html
  2. Outline

     I. Source Code Summaries
        - Why? To reduce maintenance cost
        - How? Using term-based Text Retrieval (TR) techniques
     II. Code Change Summarization
        - Generating commit messages via summarization of source code changes
        - Automatic generation of release notes
     III. Test Case Summarization
        - Generating human-readable test cases via source code summarization techniques
        - Evaluation involving 30 developers
  3. Activities in Software Maintenance

     - Source code comprehension: 50%
     - Change testing: 25%
     - Change planning: 10%
     - Change implementation: 10%
     - Change documentation: 5%
     Source: Principles of Software Engineering and Design, Zelkowitz, Shaw, Gannon, 1979.
     Source Code Summaries: Why? To reduce maintenance cost.
  4. Understanding Code…

     Not-so-happy developers: "Absence of comments in the code again!!"
     Happy developers: "Comments in the code again!!"
     Solution???
  5. Source Code Summaries: How? Generating Summaries of Source Code:

    “Automatically generated, short, yet accurate descriptions of source code entities”.
  6. Questions when Generating Summaries of Java Classes

     1) What information to include in the summaries?
     2) How much information to include in the summaries?
     3) How to generate and present the summaries?
  7. What information to include in the summaries?

     - Methods and attributes relevant for the class
     - Class stereotypes [Dragan et al., ICSM'10]
     - Method stereotypes [Dragan et al., ICSM'06]
     - Access-level heuristics: private, protected, package-protected, public
     [L. Moreno et al. - ASE 2012 - "JStereoCode: automatically identifying method and class stereotypes in Java code"]
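     A small, hypothetical sketch (not JStereoCode's actual implementation; the class and
     method names are illustrative) of how access-level heuristics and a crude getter/setter
     stereotype check could be used to select the methods worth mentioning in a class summary:

     import java.lang.reflect.Method;
     import java.lang.reflect.Modifier;
     import java.util.ArrayList;
     import java.util.List;

     public class SummaryContentSelector {

         // Crude method stereotype check: getters/setters are usually boilerplate
         // and can be omitted (or compressed) in a generated class summary.
         static boolean isAccessorOrMutator(Method m) {
             String name = m.getName();
             return name.startsWith("get") || name.startsWith("set") || name.startsWith("is");
         }

         // Access-level heuristic: keep public, non-accessor methods, since they are
         // the most likely candidates for describing what the class offers to clients.
         static List<Method> selectForSummary(Class<?> clazz) {
             List<Method> selected = new ArrayList<>();
             for (Method m : clazz.getDeclaredMethods()) {
                 if (Modifier.isPublic(m.getModifiers()) && !isAccessorOrMutator(m)) {
                     selected.add(m);
                 }
             }
             return selected;
         }

         public static void main(String[] args) {
             for (Method m : selectForSummary(java.util.ArrayList.class)) {
                 System.out.println("Summary candidate: " + m.getName());
             }
         }
     }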
  8. Example of Important Attributes/Methods of an Entity (Java Class)

     We look at:
     - Attributes
     - Methods
     - Dependencies between classes
  9. How to present and generate the summaries?

     Other code artefacts can be summarised as well:
     - Packages
     - Classes
     - Methods
     - etc.
  10. Task-Driven Summaries [Binkley et al. - ICSM 2013]

     1) Generating commit messages via summarization of source code changes - to improve commit quality
     2) Automatic generation of release notes - to improve release note quality
  12. Commit Message Should Describe…

     The what: the changes implemented during the incremental change
     The why: the motivation and context behind the changes
  13. Commit Message Should Describe…

     The what: the changes implemented during the incremental change
     The why: the motivation and context behind the changes
     More than 20% of the messages were removed: they were empty, had very short strings, or lacked any semantic sense [Maalej and Happel - MSR 2010].
  14. Generating Commit Messages via Summarization of Source Code Changes

     Given a Java project at version i-1 and at version i, the approach applies:
     1. Changes Extractor
     2. Stereotypes Detector
     3. Message Generator
     https://github.com/SEMERU-WM/ChangeScribe
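     A minimal sketch of how such a three-step pipeline could be wired together; the interface
     and class names below are hypothetical and do not correspond to ChangeScribe's actual API:

     import java.util.List;

     interface ChangesExtractor {
         // Extracts fine-grained code changes between two versions of a project
         List<String> extract(String oldVersionPath, String newVersionPath);
     }

     interface StereotypeDetector {
         // Classifies the commit, e.g. "degenerate modifier commit"
         String detect(List<String> changes);
     }

     interface MessageGenerator {
         // Renders the stereotype and the changes as a natural-language message
         String generate(String commitStereotype, List<String> changes);
     }

     class CommitMessagePipeline {
         private final ChangesExtractor extractor;
         private final StereotypeDetector detector;
         private final MessageGenerator generator;

         CommitMessagePipeline(ChangesExtractor e, StereotypeDetector d, MessageGenerator g) {
             this.extractor = e;
             this.detector = d;
             this.generator = g;
         }

         String describeCommit(String oldVersionPath, String newVersionPath) {
             List<String> changes = extractor.extract(oldVersionPath, newVersionPath);
             String stereotype = detector.detect(changes);
             return generator.generate(stereotype, changes);
         }
     }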
  15. Example:

     This is a degenerate modifier commit: this change set is composed of empty, incidental, and abstract methods. These methods indicate that a new feature is planned. This change set is mainly composed of:
     1. Changes to package org.springframework.social.connect.web:
        1.1. Modifications to ConnectController.java:
             1.1.1. Add try statement at oauth1Callback(String,NativeWebRequest) method
             1.1.2. Add catch clause at oauth1Callback(String,NativeWebRequest) method
             1.1.3. Add method invocation to method warn of logger object at oauth1Callback(String,NativeWebRequest) method
        1.2. Modifications to ConnectControllerTest.java:
             1.2.1. Modify method invocation mockMvc at oauth1Callback() method
             1.2.2. Add a functionality to oauth 1 callback exception while fetching access token
     2. Changes to package org.springframework.social.connect.web.test:
        2.1. Add a ConnectionRepository implementation for stub connection repository. It allows to: Find all connections; Find connections; Find connections to users; Get connection; Get primary connection; Find primary connection; Add connection; Update connection; Remove connections; Remove connection
     [..............]
  16. Generating Commit Messages via Summarization of Source Code Changes

     Impact = relative number of methods impacted by a class in the commit.
     The impact value is used as a threshold to decide which classes are described in detail in the generated message (the message shown on the previous slide). Example: impact >= 17%.
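     A rough sketch of how the impact value could be computed and used as an inclusion
     threshold; the helper name and the numbers below are hypothetical:

     class ImpactFilter {

         // Impact of a class = methods it impacts in the commit divided by all
         // methods impacted in the commit, expressed as a percentage.
         static double impact(int methodsImpactedByClass, int totalMethodsImpactedInCommit) {
             return 100.0 * methodsImpactedByClass / totalMethodsImpactedInCommit;
         }

         public static void main(String[] args) {
             double threshold = 17.0;
             // Hypothetical commit: 12 methods impacted in total, 3 of them via ConnectController
             double classImpact = impact(3, 12); // 25.0
             System.out.printf("ConnectController impact = %.1f%% -> %s%n",
                     classImpact,
                     classImpact >= threshold ? "describe in detail" : "summarize briefly");
         }
     }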
  17. Original Message vs. Message Automatically Generated

     Automatically generated message:
     This is a large modifier commit: this is a commit with many methods and combines multiple roles. This commit includes changes to internationalization, properties or configuration files (pom.xml). This change set is mainly composed of:
     1. Changes to package retrofit.converter:
        1.1. Add a Converter implementation for simple XML converter. It allows to: Instantiate simple XML converter with serializer; Process simple XML converter from body; Convert simple XML converter to body
             Referenced by: SimpleXMLConverterTest class
  18. Manual Testing is still Dominant in Industry… Why?

     - "Automatically generated tests do not improve the ability of developers to detect faults when compared to manual testing." Fraser et al.
     - "Developers spend up to 50% of their time in understanding and analyzing the output of automatic tools." Fraser et al.
     - "Professional developers perceive generated test cases as hard to understand." Daka et al.
     [The slide shows the first pages of "Does Automated White-Box Test Generation Really Help Software Testers?" by Fraser, Staats, McMinn, Arcuri, and Padberg, and "Modeling Readability to Improve Unit Tests" by Daka, Campos, Fraser, Dorn, and Weimer.]
  19. Example of Test Case Generated by Evosuite

     Test case automatically generated by Evosuite (for the class apache.commons.Option.Java). [The slide shows the generated JUnit test code.]
  20. Example of Test Case Generated by Evosuite

     Test case automatically generated by Evosuite (for the class apache.commons.Option.Java).
     Problem: the test methods do not have meaningful names, and it is difficult to tell, without reading the contents of the target class, what behavior is under test.
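     To make the problem concrete, here is a hypothetical EvoSuite-style test for
     org.apache.commons.cli.Option (not the exact test shown on the slide): the auto-generated
     method name test0 and the machine-chosen variable names say nothing about the behavior
     under test.

     import static org.junit.Assert.assertEquals;
     import static org.junit.Assert.assertFalse;

     import org.apache.commons.cli.Option;
     import org.junit.Test;

     public class Option_ESTest {

         // Auto-generated name: "test0" does not describe the behavior under test
         @Test
         public void test0() throws Throwable {
             Option option0 = new Option("a", "a");
             option0.setArgs(0);
             assertEquals("a", option0.getOpt());
             assertFalse(option0.hasArg());
             assertEquals(0, option0.getArgs());
         }
     }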
  21. Test Case Automatically Generated by Evosuite (for the class apache.commons.Option.Java)

    Our Solution: Automatically Generate Summaries of Test Cases
  22. Our Solution: Automatically Generate Summaries of Test Cases Sebastiano Panichella,

    Annibale Panichella, Moritz Beller, Andy Zaidman, and Harald Gall: “The impact of test case summaries on bug fixing performance: An empirical investigation” - ICSE 2016.
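     As an illustration of the idea (not the actual output of the tool), the same kind of
     generated test can be enriched with an automatically generated, natural-language summary
     and inline comments describing what it exercises:

     import static org.junit.Assert.assertEquals;
     import static org.junit.Assert.assertFalse;

     import org.apache.commons.cli.Option;
     import org.junit.Test;

     public class Option_SummarizedTest {

         /**
          * The test case instantiates an Option with the short name "a" and the
          * description "a", sets its number of arguments to 0, and then checks that
          * the option keeps its short name, does not expect an argument value, and
          * reports zero arguments.
          */
         @Test
         public void testOptionWithoutArguments() throws Throwable {
             // Creates an option "a" with description "a"
             Option option0 = new Option("a", "a");
             // The option takes no arguments
             option0.setArgs(0);
             assertEquals("a", option0.getOpt());
             assertFalse(option0.hasArg());
             assertEquals(0, option0.getArgs());
         }
     }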
  23. Empirical Study: Evaluating the Usefulness of the Generated Summaries

     Task: bug fixing, performed WITH and WITHOUT the generated comments.
     Participants: 30 developers (22 researchers and 8 professional developers), split into two groups of 15.
  24. Future work…

     - Automatically (re-)documenting test cases
     - Automatically optimizing test case readability by minimizing (generated) code smells
     - Automatically assigning/generating meaningful names for test cases