
Summarization Techniques for Code, Changes, and Testing

Sebastiano Panichella

July 12, 2016

Transcript

  1. Summarization Techniques for Code, Changes, and Testing Sebastiano Panichella Institut

    für Informatik Universität Zürich [email protected] http://www.ifi.uzh.ch/seal/people/panichella.html
  2. Outline

     I. Source Code Summaries
        - Why? To reduce maintenance cost
        - How? Using term-based Text Retrieval (TR) techniques
     II. Code Change Summarization
        - Generating commit messages via summarization of source code changes
        - Automatic generation of release notes
     III. Test Case Summarization
        - Generating human-readable test cases via source code summarization techniques
        - Evaluation involving 30 developers
  3. Activities in Software Maintenance

     - Source code comprehension: 50%
     - Change testing: 25%
     - Change planning: 10%
     - Change implementation: 10%
     - Change documentation: 5%
     Source: Principles of Software Engineering and Design, Zelkowitz, Shaw, Gannon, 1979.
     Source Code Summaries: Why? To reduce maintenance cost.
  4. Understanding Code…

     Not-so-happy developers: "Absence of comments in the code again!!"
     Happy developers: "Comments in the code again!!"
     Solution???
  5. Source Code Summaries: How? Generating Summaries of Source Code:

    “Automatically generated, short, yet accurate descriptions of source code entities”.
  6. Questions when Generating Summaries of Java Classes

     1) What information to include in the summaries?
     2) How much information to include in the summaries?
     3) How to generate and present the summaries?
  7. What information to include in the summaries?

     - Methods and attributes relevant for the class
     - Class stereotypes [Dragan et al., ICSM'10]
     - Method stereotypes [Dragan et al., ICSM'06]
     - Access-level heuristics: private, protected, package-protected, public
     [L. Moreno et al. - ASE 2012 - "JStereoCode: automatically identifying method and class stereotypes in Java code"]
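     A small, hypothetical sketch (not JStereoCode's actual implementation; the class and
     method names are illustrative) of how access-level heuristics and a crude getter/setter
     stereotype check could be used to select the methods worth mentioning in a class summary:

     import java.lang.reflect.Method;
     import java.lang.reflect.Modifier;
     import java.util.ArrayList;
     import java.util.List;

     public class SummaryContentSelector {

         // Crude method stereotype check: getters/setters are usually boilerplate
         // and can be omitted (or compressed) in a generated class summary.
         static boolean isAccessorOrMutator(Method m) {
             String name = m.getName();
             return name.startsWith("get") || name.startsWith("set") || name.startsWith("is");
         }

         // Access-level heuristic: keep public, non-accessor methods, since they are
         // the most likely candidates for describing what the class offers to clients.
         static List<Method> selectForSummary(Class<?> clazz) {
             List<Method> selected = new ArrayList<>();
             for (Method m : clazz.getDeclaredMethods()) {
                 if (Modifier.isPublic(m.getModifiers()) && !isAccessorOrMutator(m)) {
                     selected.add(m);
                 }
             }
             return selected;
         }

         public static void main(String[] args) {
             for (Method m : selectForSummary(java.util.ArrayList.class)) {
                 System.out.println("Summary candidate: " + m.getName());
             }
         }
     }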
  8. Example of Important Attributes/Methods of an Entity (Java Class)

     We look at:
     - Attributes
     - Methods
     - Dependencies between classes
  9. How to present and generate the summaries?

     Other code artefacts can be summarised as well:
     - Packages
     - Classes
     - Methods
     - etc.
  10. Task-Driven Summaries [Binkley et al. - ICSM 2013]

     1) Generating commit messages via summarization of source code changes - to improve commit quality
     2) Automatic generation of release notes - to improve release note quality
  12. Commit Message Should Describe…

     The what: the changes implemented during the incremental change
     The why: the motivation and context behind the changes
  13. Commit Message Should Describe…

     The what: the changes implemented during the incremental change
     The why: the motivation and context behind the changes
     More than 20% of the messages were removed: they were empty, had very short strings, or lacked any semantic sense [Maalej and Happel - MSR 2010].
  14. Generating Commit Messages via Summarization of Source Code Changes

     Given a Java project at version i-1 and at version i, the approach applies:
     1. Changes Extractor
     2. Stereotypes Detector
     3. Message Generator
     https://github.com/SEMERU-WM/ChangeScribe
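     A minimal sketch of how such a three-step pipeline could be wired together; the interface
     and class names below are hypothetical and do not correspond to ChangeScribe's actual API:

     import java.util.List;

     interface ChangesExtractor {
         // Extracts fine-grained code changes between two versions of a project
         List<String> extract(String oldVersionPath, String newVersionPath);
     }

     interface StereotypeDetector {
         // Classifies the commit, e.g. "degenerate modifier commit"
         String detect(List<String> changes);
     }

     interface MessageGenerator {
         // Renders the stereotype and the changes as a natural-language message
         String generate(String commitStereotype, List<String> changes);
     }

     class CommitMessagePipeline {
         private final ChangesExtractor extractor;
         private final StereotypeDetector detector;
         private final MessageGenerator generator;

         CommitMessagePipeline(ChangesExtractor e, StereotypeDetector d, MessageGenerator g) {
             this.extractor = e;
             this.detector = d;
             this.generator = g;
         }

         String describeCommit(String oldVersionPath, String newVersionPath) {
             List<String> changes = extractor.extract(oldVersionPath, newVersionPath);
             String stereotype = detector.detect(changes);
             return generator.generate(stereotype, changes);
         }
     }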
  15. Example:

     This is a degenerate modifier commit: this change set is composed of empty, incidental, and abstract methods. These methods indicate that a new feature is planned. This change set is mainly composed of:
     1. Changes to package org.springframework.social.connect.web:
        1.1. Modifications to ConnectController.java:
             1.1.1. Add try statement at oauth1Callback(String,NativeWebRequest) method
             1.1.2. Add catch clause at oauth1Callback(String,NativeWebRequest) method
             1.1.3. Add method invocation to method warn of logger object at oauth1Callback(String,NativeWebRequest) method
        1.2. Modifications to ConnectControllerTest.java:
             1.2.1. Modify method invocation mockMvc at oauth1Callback() method
             1.2.2. Add a functionality to oauth 1 callback exception while fetching access token
     2. Changes to package org.springframework.social.connect.web.test:
        2.1. Add a ConnectionRepository implementation for stub connection repository. It allows to: Find all connections; Find connections; Find connections to users; Get connection; Get primary connection; Find primary connection; Add connection; Update connection; Remove connections; Remove connection
     [..............]
  16. Generating Commit Messages via Summarization of Source Code Changes

     Impact = relative number of methods impacted by a class in the commit.
     The impact value is used as a threshold to decide which classes are described in detail in the generated message (the message shown on the previous slide). Example: impact >= 17%.
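     A rough sketch of how the impact value could be computed and used as an inclusion
     threshold; the helper name and the numbers below are hypothetical:

     class ImpactFilter {

         // Impact of a class = methods it impacts in the commit divided by all
         // methods impacted in the commit, expressed as a percentage.
         static double impact(int methodsImpactedByClass, int totalMethodsImpactedInCommit) {
             return 100.0 * methodsImpactedByClass / totalMethodsImpactedInCommit;
         }

         public static void main(String[] args) {
             double threshold = 17.0;
             // Hypothetical commit: 12 methods impacted in total, 3 of them via ConnectController
             double classImpact = impact(3, 12); // 25.0
             System.out.printf("ConnectController impact = %.1f%% -> %s%n",
                     classImpact,
                     classImpact >= threshold ? "describe in detail" : "summarize briefly");
         }
     }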
  17. Original Message vs. Message Automatically Generated

     Automatically generated message:
     This is a large modifier commit: this is a commit with many methods and combines multiple roles. This commit includes changes to internationalization, properties or configuration files (pom.xml). This change set is mainly composed of:
     1. Changes to package retrofit.converter:
        1.1. Add a Converter implementation for simple XML converter. It allows to: Instantiate simple XML converter with serializer; Process simple XML converter from body; Convert simple XML converter to body
             Referenced by: SimpleXMLConverterTest class
  18. Manual Testing is still Dominant in Industry… Why?

     - "Automatically generated tests do not improve the ability of developers to detect faults when compared to manual testing." Fraser et al.
     - "Developers spend up to 50% of their time in understanding and analyzing the output of automatic tools." Fraser et al.
     - "Professional developers perceive generated test cases as hard to understand." Daka et al.
     [The slide shows the first pages of "Does Automated White-Box Test Generation Really Help Software Testers?" by Fraser, Staats, McMinn, Arcuri, and Padberg, and "Modeling Readability to Improve Unit Tests" by Daka, Campos, Fraser, Dorn, and Weimer.]
  19. Example of Test Case Generated by Evosuite

     Test case automatically generated by Evosuite (for the class apache.commons.Option.Java). [The slide shows the generated JUnit test code.]
  20. Example of Test Case Generated by Evosuite

     Test case automatically generated by Evosuite (for the class apache.commons.Option.Java).
     Problem: the test methods do not have meaningful names, and it is difficult to tell, without reading the contents of the target class, what behavior is under test.
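     To make the problem concrete, here is a hypothetical EvoSuite-style test for
     org.apache.commons.cli.Option (not the exact test shown on the slide): the auto-generated
     method name test0 and the machine-chosen variable names say nothing about the behavior
     under test.

     import static org.junit.Assert.assertEquals;
     import static org.junit.Assert.assertFalse;

     import org.apache.commons.cli.Option;
     import org.junit.Test;

     public class Option_ESTest {

         // Auto-generated name: "test0" does not describe the behavior under test
         @Test
         public void test0() throws Throwable {
             Option option0 = new Option("a", "a");
             option0.setArgs(0);
             assertEquals("a", option0.getOpt());
             assertFalse(option0.hasArg());
             assertEquals(0, option0.getArgs());
         }
     }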
  21. Test Case Automatically Generated by Evosuite (for the class apache.commons.Option.Java)

    Our Solution: Automatically Generate Summaries of Test Cases
  22. Our Solution: Automatically Generate Summaries of Test Cases Sebastiano Panichella,

    Annibale Panichella, Moritz Beller, Andy Zaidman, and Harald Gall: “The impact of test case summaries on bug fixing performance: An empirical investigation” - ICSE 2016.
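     As an illustration of the idea (not the actual output of the tool), the same kind of
     generated test can be enriched with an automatically generated, natural-language summary
     and inline comments describing what it exercises:

     import static org.junit.Assert.assertEquals;
     import static org.junit.Assert.assertFalse;

     import org.apache.commons.cli.Option;
     import org.junit.Test;

     public class Option_SummarizedTest {

         /**
          * The test case instantiates an Option with the short name "a" and the
          * description "a", sets its number of arguments to 0, and then checks that
          * the option keeps its short name, does not expect an argument value, and
          * reports zero arguments.
          */
         @Test
         public void testOptionWithoutArguments() throws Throwable {
             // Creates an option "a" with description "a"
             Option option0 = new Option("a", "a");
             // The option takes no arguments
             option0.setArgs(0);
             assertEquals("a", option0.getOpt());
             assertFalse(option0.hasArg());
             assertEquals(0, option0.getArgs());
         }
     }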
  23. Empirical Study: Evaluating the Usefulness of the Generated Summaries

     Task: bug fixing, performed WITH and WITHOUT the generated comments.
     Participants: 30 developers (22 researchers and 8 professional developers), split into two groups of 15.
  24. Future work…

     - Automatically (re-)documenting test cases
     - Automatically optimizing test case readability by minimizing (generated) code smells
     - Automatically assigning/generating meaningful names for test cases