Clone Detection in Python @ EuroPython 2012

EuroPython 2012 Talk.

**Abstract**:
Clone detection is a longstanding and very active research area in the field of Software Maintenance, aimed at identifying duplications in source code. The presence of clones may affect maintenance activities: for example, errors in the “original” fragment of code have to be fixed in every clone. To make things worse, code clones are usually not documented, so their location in the source code is not known. In small software systems clone detection may be performed manually, but in large software systems it can be accomplished only by means of automatic techniques.

This talk presents an approach that exploits structural (i.e., AST) and lexical information of the code (e.g., names of methods and variables) to identify clones. The main innovation of this approach is the adoption of a Machine Learning technique based on (Tree) Kernel functions. Some insights into the mathematical properties of these Kernel-based methods, along with the corresponding (efficient) Python implementation (NumPy, SciPy), will be presented.

Afterwards, the talk will focus on explaining some detection results gathered on well-known Python systems (Eric, Plone, networkx, Zope), compared with non-Python ones (Eclipse-Jdtcore, JHotDraw). The aim of this part is to analyze which Python features could possibly avoid (or allow) duplications w.r.t. other OO languages. Some snippets for analyzing Python code “by itself” will also be presented, emphasizing Python's powerful built-in reflection capabilities, which are extremely useful in this specific code analysis task.
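
As a rough illustration of analyzing Python code “by itself”, the standard `ast` module can parse source text into a syntax tree that is then inspected from Python. This is only a sketch: the helper `node_type_counts` and the sample function `keep_marked` are hypothetical names, not taken from the talk.

```python
import ast
from collections import Counter

SOURCE = '''
def keep_marked(filepath, marker='---end---'):
    lines = list()
    with open(filepath) as report:
        for l in report:
            if l.endswith(marker):
                lines.append(l)
    return lines
'''


def node_type_counts(source):
    """Parse Python source and count how often each AST node type occurs."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


counts = node_type_counts(SOURCE)
print(counts["FunctionDef"], counts["With"], counts["For"], counts["If"])
```

Such node-type profiles are a crude structural fingerprint; the approach described in the talk works on the tree structure itself rather than on counts.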

Basic maths skills and basic knowledge of the Python language are the only suggested prerequisites for the talk.

Valerio Maggio

July 04, 2012

Transcript

  1. Introduction - Duplicated Code

    Number one in the stink parade is duplicated code. If you see the same code structure in more than one place, you can be sure that your program will be better if you find a way to unify them.
  2. Introduction - Duplicated Code

    ‣ Exists: 5% to 30% of code is similar
      • In extreme cases, even up to 50%
        - This is the case of Payroll, a COBOL system
  3. Introduction - Duplicated Code

    ‣ Is often created during development
      • due to time pressure for an upcoming deadline
      • to overcome limitations of the programming language
  4. Introduction - Duplicated Code

    ‣ Three Public Enemies:
      • Copy, Paste and Modify
  5. Part I: Clone Detection - Code Clones

    (Def.) “Software Clones are segments of code that are similar according to some definition of similarity” (I.D. Baxter, 1998)

    ‣ There can be different definitions of similarity, based on:
      • Program Text (text, syntax)
      • Semantics
  6. Part I: Clone Detection - Code Clones

    ‣ Four Different Types of Clones
  7. Part I: Clone Detection - The original one

        # Original Fragment
        def do_something_cool_in_Python(filepath, marker='---end---'):
            lines = list()
            with open(filepath) as report:
                for l in report:
                    if l.endswith(marker):
                        lines.append(l)  # Stores only lines that end with "marker"
            return lines  # Return the list of different lines
  8. Part I: Clone Detection - Type 1: Exact Copy

    ‣ Identical code segments except for differences in layout, whitespace, and comments
  9. Part I: Clone Detection - Type 1: Exact Copy

        def do_something_cool_in_Python (filepath, marker='---end---'):
            lines = list()  # This list is initially empty
            with open(filepath) as report:
                for l in report:  # It goes through the lines of the file
                    if l.endswith(marker):
                        lines.append(l)
            return lines
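
Because Type 1 clones differ only in layout, whitespace, and comments, one simple way to check for them in Python is to compare parsed syntax trees, which discard exactly that information. The sketch below uses only the standard `ast` module; the helper name `is_type1_clone` is hypothetical, and this is not the tree-kernel technique presented later in the talk.

```python
import ast

ORIGINAL = '''
def do_something_cool_in_Python(filepath, marker='---end---'):
    lines = list()
    with open(filepath) as report:
        for l in report:
            if l.endswith(marker):
                lines.append(l)  # Stores only lines that end with "marker"
    return lines  # Return the list of different lines
'''

TYPE_1_COPY = '''
def do_something_cool_in_Python (filepath, marker='---end---'):
    lines = list()  # This list is initially empty

    with open(filepath) as report:
        for l in report:  # It goes through the lines of the file
            if l.endswith(marker):
                lines.append(l)
    return lines
'''


def is_type1_clone(source_a, source_b):
    """True when both sources parse to the same AST.

    ast.dump() omits line and column attributes by default, so layout,
    blank lines, and comments do not influence the comparison.
    """
    return ast.dump(ast.parse(source_a)) == ast.dump(ast.parse(source_b))


print(is_type1_clone(ORIGINAL, TYPE_1_COPY))
```

Renaming even a single identifier makes the dumps differ, which is why Type 2 clones need a normalization step first.
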
  10. Part I: Clone Detection - Type 2: Parameter Substituted Clones

    ‣ Structurally identical segments except for differences in identifiers, literals, layout, whitespace, and comments
  11. Part I: Clone Detection - Type 2: Parameter Substituted Clones

        # Type 2 Clone
        def do_something_cool_in_Python(path, end='---end---'):
            targets = list()
            with open(path) as data_file:
                for t in data_file:
                    if t.endswith(end):
                        targets.append(t)  # Stores only lines that end with "marker"
            # Return the list of different lines
            return targets
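
Type 2 clones can be checked in a similar spirit by first erasing identifiers and literals from the syntax tree and then comparing what is left. The following is a minimal sketch, assuming Python 3.8 or later (where literals parse to `ast.Constant`); the `Normalizer` class and the helper names are hypothetical. The renaming is deliberately blind, so it also erases names of built-ins such as `open`.

```python
import ast

ORIGINAL = '''
def do_something_cool_in_Python(filepath, marker='---end---'):
    lines = list()
    with open(filepath) as report:
        for l in report:
            if l.endswith(marker):
                lines.append(l)
    return lines
'''

TYPE_2_CLONE = '''
def do_something_cool_in_Python(path, end='---stop---'):
    targets = list()
    with open(path) as data_file:
        for t in data_file:
            if t.endswith(end):
                targets.append(t)
    return targets
'''


class Normalizer(ast.NodeTransformer):
    """Replace every identifier and literal with a fixed placeholder."""

    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)  # also normalize arguments and body
        return node

    def visit_arg(self, node):
        node.arg = "_"
        return node

    def visit_Name(self, node):
        node.id = "_"
        return node

    def visit_Constant(self, node):
        node.value = None
        return node


def normalized_dump(source):
    """Dump the AST of source after erasing identifiers and literals."""
    return ast.dump(Normalizer().visit(ast.parse(source)))


def is_type2_clone(source_a, source_b):
    return normalized_dump(source_a) == normalized_dump(source_b)


print(is_type2_clone(ORIGINAL, TYPE_2_CLONE))
```

Attribute names such as `endswith` and `append` are left untouched here, so two fragments calling different methods are still told apart.
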
  12. Part I: Clone Detection - Type 3: Structure Substituted Clones

    ‣ Similar segments with further modifications such as changed, added (or deleted) statements, in addition to variations in identifiers, literals, layout and comments
  13. Part I: Clone Detection - Type 3: Structure Substituted Clones

        import os

        def do_something_with(path, marker='---end---'):
            # Check if the input path corresponds to a file
            if not os.path.isfile(path):
                return None
            bad_ones = list()
            good_ones = list()
            with open(path) as report:
                for line in report:
                    line = line.strip()
                    if line.endswith(marker):
                        good_ones.append(line)
                    else:
                        bad_ones.append(line)
            # Return the lists of different lines
            return good_ones, bad_ones
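
Since Type 3 clones contain added, changed, or deleted statements, an exact comparison no longer works and some graded similarity score is needed. The sketch below is a deliberately naive stand-in, not the tree-kernel method of this talk: it flattens each AST into a sequence of node type names and scores the two sequences with the standard `difflib.SequenceMatcher`. The helper names are hypothetical, and any decision threshold would have to be tuned empirically.

```python
import ast
import difflib

ORIGINAL = '''
def do_something_cool_in_Python(filepath, marker='---end---'):
    lines = list()
    with open(filepath) as report:
        for l in report:
            if l.endswith(marker):
                lines.append(l)
    return lines
'''

TYPE_3_CLONE = '''
import os

def do_something_with(path, marker='---end---'):
    if not os.path.isfile(path):
        return None
    bad_ones = list()
    good_ones = list()
    with open(path) as report:
        for line in report:
            line = line.strip()
            if line.endswith(marker):
                good_ones.append(line)
            else:
                bad_ones.append(line)
    return good_ones, bad_ones
'''


def node_types(source):
    """Flatten the AST of source into a list of node type names."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def structural_similarity(source_a, source_b):
    """Similarity in [0, 1] between the node-type sequences of two sources."""
    matcher = difflib.SequenceMatcher(None, node_types(source_a), node_types(source_b))
    return matcher.ratio()


print(round(structural_similarity(ORIGINAL, TYPE_3_CLONE), 2))
```

A flat sequence ignores which node is nested under which, which is one reason structured approaches compare subtrees instead.
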
  15. Part I: Clone Detection - Type 4: “Semantic” Clones

    ‣ Semantically equivalent segments that perform the same computation but are implemented by different syntactic variants
  16. Part I: Clone Detection - Type 4: “Semantic” Clones

        def do_always_the_same_stuff(filepath, marker='---end---'):
            report = open(filepath)
            file_lines = report.readlines()
            report.close()
            # Filters only the lines ending with marker
            # (list() keeps the result a list on Python 3, where filter() is lazy)
            return list(filter(lambda l: len(l) and l.endswith(marker), file_lines))
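
Type 4 clones share behaviour rather than syntax, so a syntactic comparison cannot reveal them. One lightweight sanity check is to run both fragments on the same input and compare the results, as sketched below. The helper `agree_on` and the sample text are hypothetical, the `list(...)` wrapper is an assumption made for Python 3, and agreement on one input is only evidence of equivalence, not a proof.

```python
import os
import tempfile


def do_something_cool_in_Python(filepath, marker='---end---'):
    lines = list()
    with open(filepath) as report:
        for l in report:
            if l.endswith(marker):
                lines.append(l)
    return lines


def do_always_the_same_stuff(filepath, marker='---end---'):
    report = open(filepath)
    file_lines = report.readlines()
    report.close()
    return list(filter(lambda l: len(l) and l.endswith(marker), file_lines))


def agree_on(text, marker):
    """Write text to a temporary file and check both functions agree on it."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "report.txt")
        with open(path, "w") as handle:
            handle.write(text)
        first = do_something_cool_in_Python(path, marker)
        second = do_always_the_same_stuff(path, marker)
    return first == second


# Lines read from a file keep their trailing newline, so the marker
# used here includes it explicitly.
SAMPLE = "first---end---\nsecond\nthird---end---\n"
print(agree_on(SAMPLE, "---end---\n"))
```
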
  17. Part I: Clone Detection - What are the consequences?

    ‣ Do clones increase the maintenance effort?
    ‣ Hypothesis:
      • Cloned code increases code size
      • A fix to a clone must be applied to all similar fragments
      • Bugs are duplicated together with their clones
    ‣ However: it is not always possible to remove clones
      • Removal of clones is harder if variations exist.
  18. Part I: Clone Detection - Clone Detection Tools

    Duplix, Scorpio, PMD, CCFinder, Dup, CPD, Duplix, Shinobi, Clone Detective, Gemini, iClones, KClone, ConQAT, Deckard, Clone Digger, JCCD, CloneDr, SimScan, CLICS, NiCAD, Simian, Duploc, Dude, SDD
  19. Part I: Clone Detection - Clone Detection Tools

    ‣ Text Based Tools:
      • Lines are compared to other lines
  20. Part I: Clone Detection - Clone Detection Tools

    ‣ Token Based Tools:
      • Token sequences are compared to other token sequences
  21. Part I: Clone Detection - Clone Detection Tools

    ‣ Syntax Based Tools:
      • Syntax subtrees are compared to each other
  22. Part I: Clone Detection - Clone Detection Tools

    ‣ Graph Based Tools:
      • (Sub)graphs are compared to each other
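
To make the simplest of these families concrete, a text-based detector can be sketched in a few lines: strip each line, index it, and report any text that occurs more than once. The function name `duplicated_lines` and the `min_length` parameter are hypothetical, and none of the tools listed above is claimed to work exactly this way.

```python
from collections import defaultdict


def duplicated_lines(source, min_length=10):
    """Map each repeated line (ignoring indentation) to its line numbers.

    Lines shorter than min_length are skipped to avoid reporting trivial
    matches such as "else:" or closing brackets.
    """
    seen = defaultdict(list)
    for number, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if len(stripped) >= min_length:
            seen[stripped].append(number)
    return {text: numbers for text, numbers in seen.items() if len(numbers) > 1}


SAMPLE = (
    "total = price * quantity\n"
    "print(total)\n"
    "\n"
    "    total = price * quantity\n"
    "x = 1\n"
)
print(duplicated_lines(SAMPLE))
```

A comparison of this kind is fast, but it misses any copy in which an identifier was renamed.
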
  23. Part I: Clone Detection - Clone Detection Techniques

    ‣ String/Token based Techniques:
      • Pros: Run very fast
      • Cons: Too many false clones
    ‣ Syntax based (AST) Techniques:
      • Pros: Well suited to detect structural similarities
      • Cons: Not properly suited to detect Type 3 Clones
    ‣ Graph based Techniques:
      • Pros: The only ones able to deal with Type 4 Clones
      • Cons: Performance issues
  24. Part I: Clone Detection - The idea: Use Machine Learning, Luke

    ‣ Use Machine Learning techniques to compute the similarity of fragments by exploiting specific features of the code.
    ‣ Combine different sources of information
      • Structural Information: ASTs, PDGs
      • Lexical Information: Program Text
  25. Part I: Clone Detection - Kernel Methods for Structured Data

    ‣ Well-grounded on solid and awful Math
    ‣ Based on the idea that objects can be described in terms of their constituent parts
    ‣ Can be easily tailored to specific domains
      • Tree Kernels
      • Graph Kernels
      • ....
  26. Part I: Clone Detection - Defining a Kernel for Structured Data

    The definition of a new Kernel for a Structured Object requires the definition of:
  27. Part I: Clone Detection - Defining a Kernel for Structured Data

    ‣ Set of features to annotate each part of the object
  28. Part I: Clone Detection - Defining a Kernel for Structured Data

    ‣ A Kernel function to measure the similarity on the smallest part of the object
      • e.g., Nodes for AST and Graphs
  29. Part I: Clone Detection - Defining a Kernel for Structured Data

    ‣ A Kernel function to apply the computation on the different (sub)parts of the structured object
  30. Part I: Clone Detection - Kernel Methods for Clones: Tree Kernels, Example on AST

    ‣ Features: We annotate each node with a set of 4 features
      • Instruction Class - e.g., LOOP, CONDITIONAL_STATEMENT, CALL
      • Instruction - e.g., FOR, IF, WHILE, RETURN
      • Context - i.e., Instruction Class of the closest statement node
      • Lexemes - Lexical information gathered (recursively) from leaves

    [Figure: an example AST node labeled FOR]
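
The four features can be approximated on Python's own AST, as in the sketch below. It is only an illustration under stated assumptions: the mapping `INSTRUCTION_CLASSES` covers just the classes named on the slide and falls back to a made-up `"OTHER"` label, the context is taken to be the instruction class of the closest enclosing statement, and the function names are hypothetical.

```python
import ast

# Node types mapped to the instruction classes named on the slide.
# Anything else gets the placeholder class "OTHER" (an assumption).
INSTRUCTION_CLASSES = {
    ast.For: "LOOP",
    ast.While: "LOOP",
    ast.If: "CONDITIONAL_STATEMENT",
    ast.Call: "CALL",
}


def instruction_class(node):
    return INSTRUCTION_CLASSES.get(type(node), "OTHER")


def lexemes(node):
    """Collect identifiers and literals from the subtree rooted at node."""
    found = []
    for child in ast.walk(node):
        if isinstance(child, ast.Name):
            found.append(child.id)
        elif isinstance(child, ast.Attribute):
            found.append(child.attr)
        elif isinstance(child, ast.Constant):
            found.append(repr(child.value))
    return found


def annotate(source):
    """Return one feature record per AST node, in depth-first order."""
    records = []

    def visit(node, context):
        if isinstance(node, ast.expr_context):
            return  # Load/Store markers carry no information of their own
        records.append({
            "instruction_class": instruction_class(node),
            "instruction": type(node).__name__.upper(),
            "context": context,
            "lexemes": lexemes(node),
        })
        # A statement becomes the context of everything nested inside it.
        inner = instruction_class(node) if isinstance(node, ast.stmt) else context
        for child in ast.iter_child_nodes(node):
            visit(child, inner)

    visit(ast.parse(source), "NONE")
    return records


records = annotate("for x in items:\n    if x:\n        print(x)\n")
for record in records[:3]:
    print(record)
```
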
  31. Part I: Clone Detection - Kernel Methods for Clones: Tree Kernels, Example on AST

    ‣ Kernel Function:
      • Aims at identifying the maximum isomorphic tree/subtree

    $$K(T_1, T_2) = \sum_{n \in T_1} \sum_{n' \in T_2} \Delta(n, n') \cdot K_{subt}(n, n')$$

    $$K_{subt}(n, n') = \mu \, sim(n, n') + (1 - \mu) \sum_{(n_1, n_2) \in Ch(n, n')} k(n_1, n_2)$$

    [Figure: two example ASTs, each rooted at a block node with print and assignment nodes]
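
The structure of such a kernel, a sum over all node pairs of a subtree similarity that mixes a node-level score with the scores of the children, can be sketched on toy trees as follows. Every concrete choice here is an assumption made for illustration: the pair weight `delta` and the node similarity `sim` are simple label-equality tests, children are paired by position, the recursion reuses `k_subt` for the child term, and the mixing weight `mu` defaults to 0.5. The actual definitions used in the talk may differ.

```python
from itertools import product


class Node:
    """A minimal labeled tree node standing in for an annotated AST node."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

    def walk(self):
        """Yield this node and all its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def sim(a, b):
    """Node-level similarity: 1 for equal labels, 0 otherwise (assumption)."""
    return 1.0 if a.label == b.label else 0.0


def delta(a, b):
    """Weight of a node pair: only same-label pairs contribute (assumption)."""
    return 1.0 if a.label == b.label else 0.0


def k_subt(a, b, mu=0.5):
    """Subtree similarity: node score mixed with position-paired child scores."""
    child_score = sum(k_subt(c1, c2, mu) for c1, c2 in zip(a.children, b.children))
    return mu * sim(a, b) + (1.0 - mu) * child_score


def tree_kernel(t1, t2, mu=0.5):
    """Sum the weighted subtree similarity over every pair of nodes."""
    return sum(
        delta(a, b) * k_subt(a, b, mu)
        for a, b in product(t1.walk(), t2.walk())
    )


t1 = Node("block", [Node("print", [Node("p")]), Node("=", [Node("s"), Node("1.0")])])
t2 = Node("block", [Node("print", [Node("y")]), Node("=", [Node("x"), Node("1.0")])])

print(tree_kernel(t1, t1), tree_kernel(t1, t2))
```

With these toy definitions a tree scores higher against itself than against a variant with renamed leaves, which is the behaviour a clone detector needs from its similarity measure.
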
  32. Part II: In Python - The Overall Process Sketch

    [Figure: example ASTs flowing through the process steps]

    Steps: 1. Pre Processing, 2. Extraction
  33. Part II: In Python - The Overall Process Sketch

    Steps: 3. Detection
  34. Part II: In Python - The Overall Process Sketch

    Steps: 4. Aggregation
  35. Part II: In Python - Empirical Evaluation

    ‣ Comparison with another (pure) AST-based tool: Clone Digger
      • It was the first clone detector for and in Python :-)
      • Presented at EuroPython 2006
    ‣ Comparison on a system with randomly seeded clones
    ‣ Results refer only to Type 3 Clones
    ‣ On Type 1 and Type 2 we got the same results
  36. Part II: In Python - Precision/Recall Plot

    [Figure: Precision, Recall and F-Measure (F1) curves; axis ticks run from 0 to 1.00 and from 0.6 to 0.98]

    ‣ Precision: How accurate are the obtained results? (Altern.) How many errors do they contain?
    ‣ Recall: How complete are the obtained results? (Altern.) How many clones have been retrieved w.r.t. Total Clones?
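
For reference, the three measures can be computed from a set of reported clone pairs and a set of known (for example, seeded) clone pairs. This is a minimal sketch using the standard definitions; the function name and the sample pairs are hypothetical and unrelated to the results plotted in the talk.

```python
def precision_recall_f1(detected, actual):
    """Compute precision, recall and F1 for detected vs. actual clone pairs."""
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: four reported pairs, three real ones, two in common.
detected = {("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")}
actual = {("a", "b"), ("c", "d"), ("x", "y")}

precision, recall, f1 = precision_recall_f1(detected, actual)
print(precision, recall, f1)
```

Here half of the reported pairs are correct (precision) and two thirds of the real clones are found (recall); F1 is their harmonic mean.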