Unsupervised Machine Learning for Clone Detection

Duplicated source code occurs frequently in large software systems. Programmers duplicate code for many reasons. The best known is a common bad programming practice, copy and paste, which gives rise to software clones, or simply clones. Software clones may affect the reliability and maintainability of software systems. For example, an error affecting a fragment of code must be fixed in every one of its duplicates.

Clones are usually not documented, and their identification is usually complicated since programmers adapt software copies by applying multiple modifications (e.g., adding/removing statements, renaming variables). Therefore, automatic approaches are required in order to reliably tackle this problem.

Machine Learning (ML) algorithms have proven to be of great practical value in a variety of application domains, providing flexible solutions able to analyse large data sets at an affordable computational cost.
This talk presents an approach that exploits structural (e.g., AST) and lexical information in the code (e.g., names of methods and variables) to identify clones. In particular, the proposed contribution leverages the benefits of ML algorithms, which have been tailored and customised to suit the considered task/domain.

Valerio Maggio

June 25, 2013

Transcript

  1. General Disclaimer: All the Maths appearing in the next slides

    is only intended to better introduce the considered case studies. Speakers are not responsible for any possible disease or “brain consumption” caused by too many formulas. So BEWARE; use this information at your own risk! Its intention is solely educational. We would strongly encourage you to use this information in cooperation with a medical or health professional. Awful Maths
  2. Number one in the stink parade is duplicated code. If

    you see the same code structure in more than one place, you can be sure that your program will be better if you find a way to unify them.
  3. PROBLEM STATEMENT: CLONE DETECTION

    Software clones are fragments of code that are similar according to some predefined measure of similarity (I.D. Baxter, 1998)
  5. PROBLEM STATEMENT: CLONE DETECTION

    Clones: Textual Similarity
  6. PROBLEM STATEMENT: CLONE DETECTION

    Clones: Functional Similarity
  7. PROBLEM STATEMENT: CLONE DETECTION

    Clones affect the reliability of the system! Sneaky Bug!
  8. THE ORIGINAL ONE

    # Original Fragment
    def do_something_cool_in_Python(filepath, marker='---end---'):
        lines = list()
        with open(filepath) as report:
            for l in report:
                if l.endswith(marker):
                    lines.append(l)  # Stores only lines that end with "marker"
        return lines  # Return the list of different lines
  9. TYPE 1: Exact Copy • Identical code segments except for

    differences in layout, whitespace, and comments
  10. TYPE 1: Exact Copy • Identical code segments except for differences in layout, whitespace, and comments

    def do_something_cool_in_Python (filepath, marker='---end---'):
        lines = list()  # This list is initially empty
        with open(filepath) as report:
            for l in report:  # It goes through the lines of the file
                if l.endswith(marker):
                    lines.append(l)
        return lines

    # Original Fragment
    def do_something_cool_in_Python(filepath, marker='---end---'):
        lines = list()
        with open(filepath) as report:
            for l in report:
                if l.endswith(marker):
                    lines.append(l)  # Stores only lines that end with "marker"
        return lines  # Return the list of different lines
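A Type 1 clone can be recognised by comparing parsed code rather than raw text, since parsing discards layout, whitespace and comments. The following is a minimal sketch, assuming Python sources and the standard `ast` module; the helper name `is_type1_clone` and the shortened sample function `keep` are hypothetical and not part of the talk.

```python
import ast


def is_type1_clone(source_a, source_b):
    """Return True when two snippets differ only in layout, whitespace and comments."""
    # ast.parse drops comments and formatting, so equal dumps mean equal syntax trees.
    return ast.dump(ast.parse(source_a)) == ast.dump(ast.parse(source_b))


original = '''
def keep(filepath, marker='---end---'):
    lines = list()
    with open(filepath) as report:
        for l in report:
            if l.endswith(marker):
                lines.append(l)  # keep matching rows
    return lines
'''

exact_copy = '''
def keep (filepath, marker='---end---'):
    lines = list()  # This list is initially empty
    with open(filepath) as report:
        for l in report:  # It goes through the file
            if l.endswith(marker):
                lines.append(l)
    return lines
'''

renamed_copy = original.replace("lines", "targets")

print(is_type1_clone(original, exact_copy))
print(is_type1_clone(original, renamed_copy))
```

Renaming a variable changes the syntax tree, so this check deliberately rejects anything beyond a Type 1 copy.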
  11. TYPE 2: Parameter Substituted • Structurally identical segments except for

    differences in identifiers, literals, layout, whitespace, and comments
  12. TYPE 2: Parameter Substituted • Structurally identical segments except for differences in identifiers, literals, layout, whitespace, and comments

    # Type 2 Clone
    def do_something_cool_in_Python(path, end='---end---'):
        targets = list()
        with open(path) as data_file:
            for t in data_file:
                if t.endswith(end):
                    targets.append(t)  # Stores only lines that end with "marker"
        # Return the list of different lines
        return targets

    # Original Fragment
    def do_something_cool_in_Python(filepath, marker='---end---'):
        lines = list()
        with open(filepath) as report:
            for l in report:
                if l.endswith(marker):
                    lines.append(l)  # Stores only lines that end with "marker"
        return lines  # Return the list of different lines
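One common way to expose Type 2 clones is to replace identifiers and literals with placeholders before comparing syntax trees. The sketch below assumes Python 3.8+ and the standard `ast` module; the class `PlaceholderNormalizer` and the functions `normalized_dump` and `is_type2_clone` are hypothetical names chosen for illustration. It is deliberately coarse: every identifier maps to the same placeholder, so it does not check that the renaming is consistent.

```python
import ast


class PlaceholderNormalizer(ast.NodeTransformer):
    """Replace identifiers and literals with fixed placeholders (illustrative helper)."""

    def visit_FunctionDef(self, node):
        node.name = "_F"
        self.generic_visit(node)  # also normalise parameters, defaults and body
        return node

    def visit_arg(self, node):
        node.arg = "_ID"
        return node

    def visit_Name(self, node):
        node.id = "_ID"
        return node

    def visit_Constant(self, node):
        return ast.copy_location(ast.Constant(value="_LIT"), node)


def normalized_dump(source):
    return ast.dump(PlaceholderNormalizer().visit(ast.parse(source)))


def is_type2_clone(source_a, source_b):
    """True when the snippets are structurally identical up to names and literals."""
    return normalized_dump(source_a) == normalized_dump(source_b)


original = '''
def do_something_cool_in_Python(filepath, marker='---end---'):
    lines = list()
    with open(filepath) as report:
        for l in report:
            if l.endswith(marker):
                lines.append(l)
    return lines
'''

renamed = '''
def do_something_cool_in_Python(path, end='---end---'):
    targets = list()
    with open(path) as data_file:
        for t in data_file:
            if t.endswith(end):
                targets.append(t)
    return targets
'''

print(is_type2_clone(original, renamed))
```

Method names such as `endswith` and `append` are attribute names rather than `Name` nodes, so they are left untouched and still have to match.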
  13. TYPE 3: Structure Substituted • Similar segments with further modifications

    such as changed, added (or deleted) statements, in addition to variations in identifiers, literals, layout and comments
  14. TYPE 3: Structure Substituted • Similar segments with further modifications such as changed, added (or deleted) statements, in addition to variations in identifiers, literals, layout and comments

    import os

    def do_something_with(path, marker='---end---'):
        # Check if the input path corresponds to a file
        if not os.path.isfile(path):
            return None
        bad_ones = list()
        good_ones = list()
        with open(path) as report:
            for line in report:
                line = line.strip()
                if line.endswith(marker):
                    good_ones.append(line)
                else:
                    bad_ones.append(line)
        # Return the lists of different lines
        return good_ones, bad_ones
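Because Type 3 clones add, remove or change statements, exact matching no longer works and a similarity score is needed instead. The sketch below is one simple possibility, not the approach of the talk: it compares normalised token sequences with the standard `tokenize` and `difflib` modules. The function names and both sample snippets are hypothetical, and any decision threshold on the score would have to be tuned.

```python
import difflib
import io
import keyword
import tokenize

# Token kinds that carry layout or comments only.
_SKIPPED = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def normalized_tokens(source):
    """Token sequence with identifiers and literals replaced by placeholders."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("ID")
        elif tok.type in (tokenize.STRING, tokenize.NUMBER):
            tokens.append("LIT")
        else:
            tokens.append(tok.string)  # keywords, operators, punctuation
    return tokens


def similarity(source_a, source_b):
    """Score in [0, 1]; 1.0 means identical normalised token sequences."""
    matcher = difflib.SequenceMatcher(
        None, normalized_tokens(source_a), normalized_tokens(source_b), autojunk=False)
    return matcher.ratio()


original = '''
def keep(filepath, marker):
    lines = list()
    with open(filepath) as report:
        for l in report:
            if l.endswith(marker):
                lines.append(l)
    return lines
'''

near_miss = '''
def split(path, marker):
    good, bad = list(), list()
    with open(path) as report:
        for line in report:
            line = line.strip()
            if line.endswith(marker):
                good.append(line)
            else:
                bad.append(line)
    return good, bad
'''

print(similarity(original, near_miss))
```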
  16. TYPE 4: “Functional” Copies • Semantically equivalent segments that perform

    the same computation but are implemented by different syntactic variants
  17. TYPE 4: “Functional” Copies • Semantically equivalent segments that perform the same computation but are implemented by different syntactic variants

    # Original Fragment
    def do_something_cool_in_Python(filepath, marker='---end---'):
        lines = list()
        with open(filepath) as report:
            for l in report:
                if l.endswith(marker):
                    lines.append(l)  # Stores only lines that end with "marker"
        return lines  # Return the list of different lines

    def do_always_the_same_stuff(filepath, marker='---end---'):
        report = open(filepath)
        file_lines = report.readlines()
        report.close()
        # Filters only the lines ending with marker
        return filter(lambda l: len(l) and l.endswith(marker), file_lines)
  18. SOURCE CODE INFORMATION

    [Figure: abstract syntax tree of the function parser_compare with parameters node *left and node *right; node kinds shown include FUNCTION, PARAMS, PARAM, BODY, IF-STMT, COND, OR, ==, CALL-STMT (parser_compare_node), STRUCT-OP (st_node) and RETURN-STMT]
  19. SOURCE CODE INFORMATION

    [Figure: program dependence graph; node kinds shown include ENTRY, EXIT, FORMAL-IN, FORMAL-OUT, ACTUAL-IN, ACTUAL-OUT, BODY, CONTROL-POINT, CALL-SITE, EXPR and RETURN]
  20. STATE OF THE ART TOOLS

    Duplix, Scorpio, PMD, CCFinder, Dup, CPD, Shinobi, Clone Detective, Gemini, iClones, KClone, ConQAT, Deckard, Clone Digger, JCCD, CloneDr, SimScan, CLICS, NiCAD, Simian, Duploc, Dude, SDD
  21. STATE OF THE ART TOOLS

    Text Based Tools: Text is compared line by line
  22. STATE OF THE ART TOOLS

    Token Based Tools: Token sequences are compared to each other
  23. STATE OF THE ART TOOLS

    Syntax Based Tools: Syntax subtrees are compared to each other
  24. STATE OF THE ART TOOLS

    Graph Based Tools: (sub) graphs are compared to each other
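The text-based strategy (text compared line by line) can be illustrated with the standard `difflib` module. This is a rough sketch under simplifying assumptions, not how any of the listed tools works: the function name `duplicated_line_runs`, the `min_lines` parameter and the sample texts are hypothetical, and `SequenceMatcher` reports one alignment of the two texts rather than every repeated block.

```python
import difflib


def duplicated_line_runs(text_a, text_b, min_lines=3):
    """Return (start_a, start_b, length) for runs of identical stripped lines."""
    lines_a = [line.strip() for line in text_a.splitlines()]
    lines_b = [line.strip() for line in text_b.splitlines()]
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    # The final matching block is a zero-length sentinel and is filtered out here.
    return [(m.a, m.b, m.size)
            for m in matcher.get_matching_blocks() if m.size >= min_lines]


text_a = "x = 1\nfor i in range(3):\n    total += i\n    print(total)\ny = 2\n"
text_b = "z = 0\nfor i in range(3):\n  total += i\n  print(total)\n"

print(duplicated_line_runs(text_a, text_b))
```

Stripping whitespace makes the comparison insensitive to indentation, but any renamed identifier still breaks the match, which is one reason such techniques are usually combined with normalisation.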
  27. STATE OF THE ART TECHNIQUES

    • String/Token based Techniques:
      • Pros: Run very fast
      • Cons: Too many false clones
    • Syntax based (AST) Techniques:
      • Pros: Well suited to detect structural similarities
      • Cons: Not properly suited to detect Type 3 Clones
    • Graph based Techniques:
      • Pros: The only ones able to deal with Type 4 Clones
      • Cons: Performance issues
  32. USE MACHINE LEARNING, LUKE

    • Provides computationally efficient solutions to analyze large data sets
    • Provides solutions that can be tailored to different tasks/domains
    • Requires considerable effort in:
      • the definition of the relevant information best suited for the specific task/domain
      • the application of the learning algorithms to the considered data
  34. UNSUPERVISED LEARNING

    • Supervised Learning:
      • Learn from labelled samples
    • Unsupervised Learning:
      • Learn (directly) from the data
    Learn by examples
    (+) No cost of labeling samples
    (-) Trade-off imposed on the quality of the data
  35. CODE STRUCTURES / KERNELS FOR STRUCTURES

    Code structures:
    • Abstract Syntax Tree (AST): tree structure representing the syntactic structure of the different instructions of a program (function)
    • Program Dependencies Graph (PDG): (directed) graph structure representing the relationships among the different statements of a program
    Kernels for structures: computation of the dot product between (graph) structures, K(·, ·)
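The idea of a kernel as a dot product between structures can be illustrated with a much simpler feature map than the tree kernel discussed in this talk. The sketch below, an illustration only, maps a Python snippet to counts of its AST node types and takes the dot product of two such count vectors; the function names are hypothetical.

```python
import ast
import math
from collections import Counter


def node_type_counts(source):
    """Feature map: how many times each AST node type occurs in the snippet."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))


def kernel(source_a, source_b):
    """Dot product of the two node-type count vectors."""
    counts_a, counts_b = node_type_counts(source_a), node_type_counts(source_b)
    return sum(count * counts_b[name] for name, count in counts_a.items())


def normalized_kernel(source_a, source_b):
    """Cosine-style normalisation so that every snippet scores 1.0 with itself."""
    return kernel(source_a, source_b) / math.sqrt(
        kernel(source_a, source_a) * kernel(source_b, source_b))


loop_a = "while x < y:\n    x = x + 1\n    y = y - 1\n"
loop_b = "while b > a:\n    a = a + 1\n    b = b - 1\n"
other = "print('hello')\n"

print(normalized_kernel(loop_a, loop_b))
print(normalized_kernel(loop_a, other))
```

A bag of node types ignores how nodes are arranged, which is exactly the information that tree and graph kernels are designed to capture.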
  36. KERNEL FOR CLONES

    [Figure: two code fragments (CODE) and their abstract syntax trees (AST): a while loop over x and y, and a while loop over a and b together with an if statement assigning c]
  37. KERNEL FOR CLONES

    [Figure: the same two code fragments (CODE) and their abstract syntax trees (AST), with the subtrees compared by the AST KERNEL]
  42. KERNELS FOR CODE STRUCTURES: AST KERNEL FEATURES

    • Instruction Class (IC): e.g., LOOP, CALL, CONDITIONAL_STATEMENT
    • Instruction (I): e.g., FOR, IF, WHILE, RETURN
    • Context (C): i.e., Instruction Class of the closest statement node
    • Lexemes (Ls): Lexical information gathered (recursively) from leaves
    Example (AST of a while loop with condition x < y):
    • node "while": IC = Loop, I = while-loop, C = Function-Body, Ls = [x, y]
    • node "<": IC = Conditional-Expr, I = Less-operator, C = Loop, Ls = [x, y]
    • node "block": IC = Block, I = while-body, C = Loop, Ls = [x]
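The four node features can be approximated for Python code with the standard `ast` module. The sketch below is an assumption-laden illustration, not the tool presented in the talk: the `INSTRUCTION_CLASS` table, the class labels and the function names are hypothetical, and the Python node type name stands in for the Instruction (I) feature.

```python
import ast

# Hypothetical mapping from Python AST node types to instruction classes.
INSTRUCTION_CLASS = {
    ast.FunctionDef: "FUNCTION_BODY",
    ast.While: "LOOP",
    ast.For: "LOOP",
    ast.If: "CONDITIONAL_STATEMENT",
    ast.Compare: "CONDITIONAL_EXPR",
    ast.Call: "CALL",
    ast.Assign: "ASSIGNMENT",
    ast.Return: "RETURN",
}


def lexemes(node):
    """Identifiers and constants gathered recursively from the leaves below node."""
    found = []
    for child in ast.walk(node):
        if isinstance(child, ast.Name):
            found.append(child.id)
        elif isinstance(child, ast.Constant):
            found.append(repr(child.value))
    return found


def features(node, context="MODULE"):
    """Yield one (IC, I, C, Ls) tuple per classified node, walking top-down."""
    instruction_class = INSTRUCTION_CLASS.get(type(node))
    if instruction_class is not None:
        yield instruction_class, type(node).__name__, context, lexemes(node)
        if isinstance(node, ast.stmt):
            # Statements become the context of everything nested inside them.
            context = instruction_class
    for child in ast.iter_child_nodes(node):
        yield from features(child, context)


source = "def f(x, y):\n    while x < y:\n        x = x + 1\n    return x\n"
feats = list(features(ast.parse(source)))
for feat in feats:
    print(feat)
```

In this sketch the comparison `x < y` receives the context of its enclosing loop, mirroring the `C = Loop` entry of the example above.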
  43. CLONE DETECTION RESULTS

    • Comparison with another (pure) AST-based clone detector
    • Comparison on a system with randomly seeded clones
    [Figure: Precision, Recall and F-measure (scale 0 to 1) for CloneDigger and the Tree Kernel Tool]
    Results refer to clones where code fragments have been modified by adding/removing or changing code statements
  44. Precision, Recall and F-Measure

    [Figure: Precision, Recall and F1 curves; axis ticks run from 0 to 1 and from 0.6 to 0.98]
    Precision: How accurate are the obtained results? (Altern.) How many errors do they contain?
    Recall: How complete are the obtained results? (Altern.) How many clones have been retrieved w.r.t. Total Clones?
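Precision, recall and F1 follow their standard definitions: precision is the fraction of reported clones that are real, recall is the fraction of real clones that were reported, and F1 is their harmonic mean. A minimal sketch, where the function name and the sample clone pairs are made up for illustration:

```python
def precision_recall_f1(reported, actual):
    """Standard precision, recall and F1 over two collections of clone pairs."""
    reported, actual = set(reported), set(actual)
    true_positives = len(reported & actual)
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up example: 4 pairs reported by a detector, 3 real clone pairs, 2 in common.
reported = {("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")}
actual = {("a", "b"), ("c", "d"), ("x", "y")}

precision, recall, f1 = precision_recall_f1(reported, actual)
print(precision, recall, f1)
```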
  46. CODE STRUCTURES: PDG NODES AND EDGES

    • Two Types of Nodes
      • Control Nodes (Dashed ones), e.g., if - for - while - function calls...
      • Data Nodes, e.g., expressions - parameters...
    • Two Types of Edges (i.e., dependencies)
      • Control edges (Dashed ones)
      • Data edges
    [Figure: example PDG with nodes while, call-site, arg and expr]
  48. KERNELS FOR CODE STRUCTURES: PDG / GRAPH KERNELS FOR PDG

    • Features of nodes:
      • Node Label, e.g., WHILE, CALL-SITE, EXPR, ...
      • Node Type, i.e., Data Node or Control Node
    • Features of edges:
      • Edge Type, i.e., Data Edge or Control Edge
    Example: Node Label = WHILE, Node Type = Control Node
    [Figure: example PDG with nodes while, call-site, arg, expr and expr, connected by Control Edges and Data Edges]
  49. GRAPH KERNELS FOR PDG

    • Goal: Identify common subgraphs
    • Selectors: Compare nodes to each other and explore the subgraphs of only “compatible” nodes (i.e., Nodes of the same type)
    • Context: The subgraph of a node (with paths whose lengths are at most L to avoid loops)
    [Figure: two example PDGs, one with nodes while, call-site, arg, expr and expr, the other with nodes while, call-site, arg, expr and call-site]
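The selector and bounded-context ideas can be sketched with a simplified walk-counting kernel. This is a swapped-in, much simpler technique than the graph kernel of the talk: it counts pairs of walks of length at most L that agree on node labels, node types and edge types. The dictionary layout of the graphs, the function names and the two toy PDGs are assumptions made for this example.

```python
def compatible(g1, n1, g2, n2):
    """Selector: nodes are compatible when label and node type both match."""
    return g1["nodes"][n1] == g2["nodes"][n2]


def common_walks(g1, n1, g2, n2, max_len):
    """Count common walks of length <= max_len starting at the pair (n1, n2)."""
    if not compatible(g1, n1, g2, n2):
        return 0
    total = 1  # the pair itself is a common walk of length 0
    if max_len > 0:
        for target1, edge_type1 in g1["edges"].get(n1, []):
            for target2, edge_type2 in g2["edges"].get(n2, []):
                if edge_type1 == edge_type2:
                    total += common_walks(g1, target1, g2, target2, max_len - 1)
    return total


def graph_kernel(g1, g2, max_len=2):
    """Sum of common bounded walks over all pairs of start nodes."""
    return sum(common_walks(g1, a, g2, b, max_len)
               for a in g1["nodes"] for b in g2["nodes"])


# Toy PDGs: node id -> (label, node type); node id -> [(target id, edge type)].
pdg_a = {
    "nodes": {"w": ("WHILE", "control"), "c": ("CALL-SITE", "control"),
              "a": ("ARG", "data"), "e": ("EXPR", "data")},
    "edges": {"w": [("c", "control"), ("e", "control")], "c": [("a", "data")]},
}
pdg_b = {
    "nodes": {"w": ("WHILE", "control"), "c": ("CALL-SITE", "control"),
              "a": ("ARG", "data")},
    "edges": {"w": [("c", "control")], "c": [("a", "data")]},
}

print(graph_kernel(pdg_a, pdg_b))
print(graph_kernel(pdg_a, pdg_a))
```

Bounding the walk length by `max_len` plays the role of L in the slide: it keeps the recursion finite even when a graph contains cycles.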
  53. PROBLEM STATEMENT: (MODEL) CLONE DETECTION

    Models: models are typically represented visually, as box-and-arrow diagrams, and the clones we are searching for are similar subgraphs of these diagrams.
    Model Granularity: models can be represented at different levels of granularity (as with source code), corresponding to different syntactic (and semantic) units.
    Model clones are categorized into (three) different types.
  54. TYPE 1 CLONES: (MODEL) CLONE DETECTION

    • Type 1 (exact) model clones: Identical model fragments except for variations in visual presentation, layout and formatting.
  55. TYPE 2 CLONES: (MODEL) CLONE DETECTION

    • Type 2 (renamed) model clones: Structurally identical model fragments except for variations in labels, values, types, visual presentation, layout and formatting.
    Figure labels: model@Friction Mode Logic/Break Apart Detection; model@Friction Mode Logic/Lockup Detection/Required Friction for Lockup
  56. TYPE 3 CLONES: (MODEL) CLONE DETECTION

    • Type 3 (near-miss) model clones: Model fragments with further modifications, such as changes in position or connection with respect to other model fragments and small additions or removals of blocks or lines, in addition to variations in labels, values, types, visual presentation, layout and formatting.
    Figure labels: [email protected]_estimation; [email protected]_estimation