
Improving Software Maintenance using Unsupervised Machine Learning techniques - Ph.D. Defence Dissertation

Ph.D. Dissertation for the Thesis entitled "Improving Software Maintenance using Unsupervised Machine Learning techniques"

Unsupervised Machine Learning techniques have been used to address different software maintenance issues, such as Software Modularisation and Clone Detection.

Valerio Maggio

June 05, 2013

Transcript

  1. UNSUPERVISED MACHINE LEARNING FOR SOFTWARE MAINTENANCE

    Ph.D. Programme in Computational and Computer Sciences (Dottorato in Scienze Computazionali e Informatiche), XXV Cycle. Ph.D. Candidate: Valerio Maggio. Thesis Advisors: Dr. Sergio Di Martino, Dr. Anna Corazza. June 5th, 2013.
  2. THESIS GOAL: UNSUPERVISED MACHINE LEARNING FOR SOFTWARE MAINTENANCE

    Define and experimentally evaluate solutions (techniques and prototype tools) for automatic software analysis to support software maintenance activities. These solutions exploit (unsupervised) machine learning techniques to mine information from the source code.
  3. SOFTWARE MAINTENANCE

    There exist different types of Software Maintenance. “A software system must be continually adapted during its overall life cycle or it progressively becomes less satisfactory.” (cit. Lehman’s First Law of Software Evolution) Software Maintenance represents the most expensive, time-consuming and challenging phase of the whole development process. Software Maintenance can account for up to 85-90% of the total software costs.
  6. SOFTWARE MAINTENANCE: ISSUES

    “Software Maintenance is about change!” (cit. S. Jarzabek) • Change Analysis • Program Comprehension • Reverse Engineering • The documentation is usually scarce or not up to date! • The source code (usually) represents the most reliable source of information about the system
  9. REVERSE ENGINEERING

    • Definition of tools and techniques to support maintenance activities • Goal: Build higher-level software models in an automatic fashion, gathering information from the source code or any other document • Goal: To aid the comprehension of the system • Approaches: STATIC ANALYSIS, DYNAMIC ANALYSIS
  12. MACHINE LEARNING

    • Provides computationally effective solutions to analyze large data sets • Provides solutions that can be tailored to different tasks/domains • Requires considerable effort in: • the definition of the relevant information best suited for the specific task/domain • the application of the learning algorithms to the considered data
  16. UNSUPERVISED LEARNING

    • Supervised Learning: learn from labelled samples (learn by examples) • Unsupervised Learning: learn (directly) from the data • (+) No cost of labelling samples • (-) Trade-off imposed on the quality of the data
  18. THESIS CONTRIBUTIONS

    Contributions to three relevant and related open issues in Software Maintenance: 1. SOFTWARE RE-MODULARIZATION 2. SOURCE CODE NORMALIZATION 3. CLONE DETECTION
  20. SOFTWARE RE-MODULARIZATION: PROBLEM STATEMENT

    Re-modularization provides a way to support software maintainers by automatically grouping together (clustering) “related” software classes. [Figure: Software Classes grouped into Clusters of Software Classes. Presentation: UI Components, UI Process Components. Business: Application Facade, Business Workflows, Business Components. Data: Data Access Components, Data Helpers / Utilities. Services: Service Interfaces, Messages Interfaces. Cross Cutting: Security, Operational Management, Communications. Also shown: External Systems, Service Consumers.]
  22. SOURCE CODE LEXICAL INFORMATION

    (IDEA): Exploit the lexical information gathered from the source code to produce clusters of classes that are lexically related. State of the Art: Information Retrieval (IR) based approaches.
  24. IR INDEXING

    1. Tokenization: Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ... 2. Normalization: draw, the, are, null, handl, box, r, rectangl, g, graphic, box, display, box, ... Implicit assumption: The “same” words are used whenever a particular concept occurs.
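The two indexing steps above can be sketched in a few lines of Python. This is only an illustration, not the tooling used in the thesis: `toy_stem` is a hypothetical, deliberately crude stand-in for a real stemmer (such as Porter's), and the names `tokenize`, `split_camel` and `index_terms` are assumptions introduced here.

```python
import re

def tokenize(text):
    """Step 1: extract alphabetic tokens from raw source text."""
    return re.findall(r"[A-Za-z]+", text)

def split_camel(token):
    """Split a camelCase token on each capitalised word that is not at the start."""
    return re.sub(r"(?<!^)([A-Z][a-z]+)", r" \1", token).split()

def toy_stem(word):
    """Step 2 (toy version): lowercase, then strip a plural 's' and a final 'e'.

    A real pipeline would use a proper stemmer; this only mimics the
    terms shown on the slide.
    """
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    if len(word) > 3 and word.endswith("e"):
        word = word[:-1]
    return word

def index_terms(text):
    """Tokenize, split identifiers, and normalize every resulting part."""
    return [toy_stem(part)
            for token in tokenize(text)
            for part in split_camel(token)]

terms = index_terms(
    "Draws the are NullHandle box r Rectangle g Graphics box displayBox"
)
print(terms)
```

On the slide's example tokens this yields the normalized terms listed above (draw, the, are, null, handl, box, ...).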
  26. SOURCE CODE ZONES: LEXICAL INFORMATION

    Zones: Class Names, Attribute Names, Method Names, Parameter Names, Comments, Source Code.
  29. ZONE INDEXING

    1. Tokenization: Draws, the, are, NullHandle, box, r, draw, g, Graphics, color, displayBox, ... 2. Normalization: draw, the, are, null, handl, box, r, draw, g, graphic, color, display, box, ... RQ1: Do terms in different Zones provide different contributions?
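A minimal sketch of what zone indexing could look like in code: each term is counted separately per zone, and the per-zone counts are then combined into one weighted term vector per class. The zone names and the weight values below are hypothetical and hand-picked for illustration; they are not the weights used in the thesis.

```python
from collections import defaultdict

def zone_vector(zone_terms, zone_weights):
    """Combine per-zone term occurrences into a single weighted term vector.

    zone_terms:   {zone name: list of normalized terms found in that zone}
    zone_weights: {zone name: weight of that zone} (hypothetical values)
    """
    vector = defaultdict(float)
    for zone, terms in zone_terms.items():
        weight = zone_weights.get(zone, 0.0)
        for term in terms:
            vector[term] += weight
    return dict(vector)

# Terms of one class, grouped by the zone they were found in.
zone_terms = {
    "class_name": ["null", "handl"],
    "method_name": ["draw"],
    "comment": ["draw", "the", "box"],
}
# Hand-picked weights, for illustration only.
zone_weights = {"class_name": 2.0, "method_name": 1.0, "comment": 0.5}

print(zone_vector(zone_terms, zone_weights))
```

With flat indexing (no zones) every occurrence would count the same, which corresponds to setting all weights to 1.0.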
  30. SOURCE CODE LEXICON: JEdit

    • Very good in Comments • Very poor in Method and Parameter names
  31. SOURCE CODE LEXICON: Xerces

    • Poor in Method names, Parameter names and Comments • Good in Method Code and Variable names
  32. SOURCE CODE LEXICON: JFreeChart, JEdit, JUnit, Xerces

    RQ2: How to automatically weight the different contributions?
  33. CONTRIBUTION: SOFTWARE RE-MODULARIZATION

    • We propose a source code Zone Indexing • An approach to automatically weight the contribution of different terms in Zones • Maximum-Likelihood Estimation (MLE) Approach • A variant of the K-medoid clustering algorithm • Tailored to the software clustering task. References: Corazza, A., Di Martino, S., Maggio, V., Scanniello, G. Investigating the use of lexical information for software system clustering (2011) Proceedings of the European Conference on Software Maintenance and Reengineering, CSMR, art. no. 5741257, pp. 35-44. ISSN: 15345351 ISBN: 978-076954343-7. Corazza, A., Di Martino, S., Maggio, V., Scanniello, G. Combining machine learning and information retrieval techniques for software clustering (2012) Communications in Computer and Information Science, 255 CCIS, pp. 42-60. ISSN: 18650929 ISBN: 978-364228032-0.
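For readers unfamiliar with K-medoid clustering, the sketch below shows the plain, textbook algorithm, not the tailored variant proposed in the thesis. It assumes a precomputed distance matrix `dist` (for software clustering this could be, for example, a lexical distance between class term vectors) and uses a naive deterministic initialisation; both choices are assumptions made for illustration.

```python
def k_medoids(dist, k, max_iter=100):
    """Plain K-medoids on a precomputed distance matrix.

    dist[i][j] is the distance between items i and j.
    Returns {medoid index: list of member indices}.
    """
    n = len(dist)
    medoids = list(range(k))  # naive initialisation: the first k items
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: attach every item to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            nearest = min(medoids, key=lambda m: dist[i][m])
            clusters[nearest].append(i)
        # Update step: in each cluster, pick the member that minimises
        # the total distance to all other members.
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            for members in clusters.values() if members
        ]
        if sorted(new_medoids) == sorted(medoids):
            break  # medoids are stable: converged
        medoids = new_medoids
    return clusters

# Toy data: six points on a line forming two obvious groups.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
print(k_medoids(dist, k=2))
```

Unlike K-means, the cluster representative is always an actual item (here, a class), which is convenient when only pairwise distances are available.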
  36. K-MEDOID VARIANT

    [Figure: example software classes (StackedVerticalBarRenderer, LineAndShapeRenderer, ArrayHolder, Plot, LinearPlotFitAlgorithm, LinearPlotFit, HorizontalNumberAxis, NumberAxisPropertyEditPanel, PlotFitAlgorithm, ChartFrame, GraphicsHolder, PropertyEditPanel, DateAxisPropertyEditPanel, StandardLegend, LegendItem, FigureLegend, VerticalNumberAxis, VerticalXYDataPlot, DataPlot, SampleXYDataset, SampleXYDatasetThread, VerticalPlot, FitAlgorithm) illustrating the Distribution property, with classes marked as Extreme or Non-extreme.]
  40. AUTHORITATIVENESS RESULTS

    Comparison of Clustering Results with an Authoritative Target Partition. 19 target Open Source Java Systems: Ant, Lucene, Tomcat, Azureus, Hibernate, ITextdotNet, jEdit, JFreeChart, JFTP, JHotDraw, JRefactory, JUnit, LiferayPortal, Pmd, Synapse, TigerEnvelopes, Velocity, Xalan, Xerces. [Figure: bar chart, scale 0 to 1.0, comparing FLAT INDEXING (NO ZONES) = STATE OF THE ART, ZONE INDEXING, and ZONE INDEXING + MLE WEIGHTS.]
  44. STATE OF THE ART TOOLS: IDENTIFIER SPLITTING

    1. Tokenization 1.5 Identifier Splitting 2. Normalization • Splitting algorithms based on naming conventions • camelCase Splitter: r'(?<!^)([A-Z][a-z]+)' • NullHandle ==> Null | Handle • displayBox ==> display | Box. Resulting terms: draw, the, are, null, handl, box, r, rectangl, g, graphic, color, display, box, ...
  48. IDENTIFIERS SPLITTING

    • camelCase Splitter: r'(?<!^)([A-Z][a-z]+)' • drawXORRect ==> drawXOR | Rect • drawxorrect ==> NO SPLIT. Splitting algorithms based on naming conventions are not robust enough.
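The behaviour of the camelCase splitter can be reproduced with Python's `re` module. This sketch assumes the intended pattern is `(?<!^)([A-Z][a-z]+)`, i.e. a capitalised word that is not at the start of the identifier; the function name `split_camel_case` is introduced here for illustration.

```python
import re

# A capital letter followed by lowercase letters, not at the very start.
CAMEL_CASE = re.compile(r"(?<!^)([A-Z][a-z]+)")

def split_camel_case(identifier):
    """Insert a separator before each capitalised word, then split on it."""
    return CAMEL_CASE.sub(r" \1", identifier).split()

for name in ["NullHandle", "displayBox", "drawXORRect", "drawxorrect"]:
    print(name, "==>", " | ".join(split_camel_case(name)))
```

The last two identifiers show the limitation named on the slide: the acronym in drawXORRect is not separated from draw, and drawxorrect is not split at all because it carries no case information.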
  49. ABBREVIATIONS EXPANSION

    Splitting algorithms based on naming conventions are not robust enough • Heavy use of Abbreviations in the source code • r for Rectangle • rect for Rectangle
  51. CODE NORMALIZATION

    1. Tokenization 1.5 Code Normalization 2. Normalization • AMAP (Hill and Pollock, 2008) • SAMURAI (Enslen et al., 2011) • TIDIER (Guerrouj et al., 2011) • GenTest+Normalize (Lawrie and Binkley, 2011) • LINSEN (Corazza, Di Martino and Maggio, 2012). Resulting terms: draw, the, are, null, handl, box, rectangl, graphic, color, display, box, rectangl, draw, xor, rectangl
  53. CONTRIBUTION: SOURCE CODE NORMALIZATION

    • LINSEN: novel technique for code normalization • Able to both Split Identifiers and Expand any abbreviations that occur • Based on an efficient String Matching technique: Baeza-Yates and Perleberg Algorithm (BYP) • Exploits different Sources of Information to find the matching words. Reference: Corazza, A., Di Martino, S., Maggio, V. LINSEN: An efficient approach to split identifiers and expand abbreviations (2012) IEEE International Conference on Software Maintenance, ICSM, art. no. 6405277, pp. 233-242. ISBN: 978-146732312-3.
  55. SKETCH OF THE ALGORITHM

    Model: Weighted Directed Graph. Example: drawXORRect identifier. • NODES correspond to characters of the current identifier • ARCS correspond to matchings between identifier substrings and dictionary words • Padding Arcs ensure the Graph is always connected • Every Arc is labelled with the corresponding dictionary word • Weights represent the “cost” of each matching • The cost function [c(“word”)] favors longer words and domain-related information • The final Mapping Solution corresponds to the sequence of labels in the path with the minimum cost (Dijkstra's algorithm). [Figure: graph with nodes 0 to 11 for drawXORRect; padding arcs labelled with the single characters “d”, “r”, “a”, “w”, “X”, “O”, “R”, “R”, “e”, “c”, “t”, each with cost C-MAX; word arcs “draw”, “raw”, “or”, “xor”, “rectangle”, each with cost c(“word”).]
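The graph model can be illustrated with a short Python sketch. It is not the LINSEN implementation: exact and prefix matching stand in for the approximate string matching of the BYP algorithm, the cost function (`C_MAX` divided by the length of the matched substring) is a hypothetical choice that simply favors longer matches, and the dictionary is a toy one. All names below are assumptions made for illustration.

```python
import heapq

C_MAX = 1.0  # cost of a padding arc (a single unmatched character)

def match(substring, dictionary):
    """Return the dictionary word matched by `substring`, or None.

    Hypothetical matching rule: exact match, or `substring` (3+ chars)
    is a prefix of a dictionary word, e.g. "rect" -> "rectangle".
    """
    if substring in dictionary:
        return substring
    if len(substring) >= 3:
        for word in sorted(dictionary):
            if word.startswith(substring):
                return word
    return None

def build_arcs(identifier, dictionary):
    """Arcs of the graph: arcs[i] = list of (target node, label, cost)."""
    text = identifier.lower()
    n = len(text)
    arcs = {i: [] for i in range(n + 1)}
    for i in range(n):
        # Padding arc: keeps the graph connected.
        arcs[i].append((i + 1, identifier[i], C_MAX))
        for j in range(i + 2, n + 1):
            word = match(text[i:j], dictionary)
            if word is not None:
                # Longer matched substrings get a lower cost.
                arcs[i].append((j, word, C_MAX / (j - i)))
    return arcs

def normalize_identifier(identifier, dictionary):
    """Labels along the minimum-cost path from node 0 to node n (Dijkstra)."""
    arcs = build_arcs(identifier, dictionary)
    target = len(identifier)
    heap = [(0.0, 0, [])]
    visited = set()
    while heap:
        cost, node, labels = heapq.heappop(heap)
        if node == target:
            return labels
        if node in visited:
            continue
        visited.add(node)
        for nxt, label, weight in arcs[node]:
            if nxt not in visited:
                heapq.heappush(heap, (cost + weight, nxt, labels + [label]))
    return []

words = {"draw", "raw", "or", "xor", "rectangle"}
print(normalize_identifier("drawXORRect", words))
print(normalize_identifier("drawxorrect", words))
```

Because the search ignores letter case, both spellings of the identifier are mapped to the same sequence of dictionary words, which is the robustness the naming-convention splitters lack.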
  60. SPLITTING RESULTS #1

    Accuracy Rates for the comparison with DTW (Madani et al., 2010) on JHotDraw 5.1 and Lynx 2.8.5, and for the comparison with GenTest (Lawrie and Binkley, 2011) on which 2.20 and a2ps 4.14. DTW: O(n^3). [Figure: bar charts, accuracy scale 0 to 1, DTW vs LINSEN and GenTest vs LINSEN.]
  61. EXPANSION RESULTS #2

    Accuracy Rates for the comparison with AMAP (Hill and Pollock, 2008), by abbreviation category. CW: Combination Words. DL: Dropped Letters. OO: Others. AC: Acronyms. PR: Prefix. SL: Single Letters. TOTAL (Avg). [Figure: bar chart, accuracy scale 0 to 1, AMAP vs LINSEN.]
  62. CLONE DETECTION: PROBLEM STATEMENT

    • Clones: Textual Similarity • Clones: Functional Similarity • Clones affect the reliability of the system! Sneaky Bug!
  66. STATE OF THE ART TOOLS

    Duplix, Scorpio, PMD, CCFinder, Dup, CPD, Shinobi, Clone Detective, Gemini, iClones, KClone, ConQAT, Deckard, Clone Digger, JCCD, CloneDr, SimScan, CLICS, NiCAD, Simian, Duploc, Dude, SDD • Text Based Tools: Text is compared line by line • Token Based Tools: Token sequences are compared to each other • Syntax Based Tools: Syntax subtrees are compared to each other • Graph Based Tools: (sub) graphs are compared to each other
  71. CONTRIBUTION: CLONE DETECTION

    • Combining different sources of information to improve the effectiveness of the detection process • Structural and Lexical Information • Application of Kernel Methods to Source Code Structures • Typically applied in other fields such as NLP or Bioinformatics. References: Corazza, A., Di Martino, S., Maggio, V., Scanniello, G. A Tree Kernel based approach for clone detection (2010) IEEE International Conference on Software Maintenance, ICSM, art. no. 5609715. ISBN: 978-142448629-8. Corazza, A., Di Martino, S., Maggio, V., Moschitti, A., Passerini, A., Scanniello, G., Silvestri, F. Using Machine Learning and Information Retrieval Techniques to Improve Software Maintainability (2013) Communications in Computer and Information Science, In Press.
  75. KERNELS FOR STRUCTURES: CODE STRUCTURES

    • Abstract Syntax Tree (AST): Tree structure representing the syntactic structure of the different instructions of a program (function) • Program Dependence Graph (PDG): (Directed) Graph structure representing the relationships among the different statements of a program • Kernel K(·,·): Computation of the dot product between (Graph) Structures
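To make the idea of a "dot product between structures" concrete, the sketch below implements a very simple subtree kernel: each tree is mapped to the counts of the complete subtrees it contains, and the kernel value is the dot product of the two count vectors. This is a generic textbook kernel chosen for illustration, not the Tree Kernel defined in the thesis; the nested-tuple tree encoding and all names are assumptions.

```python
from collections import Counter

def leaf(label):
    """A tree is a (label, children) pair; children is a tuple of trees."""
    return (label, ())

def subtree_counts(tree, counts=None):
    """Count every complete subtree rooted at a node of `tree`."""
    if counts is None:
        counts = Counter()
    _label, children = tree
    for child in children:
        subtree_counts(child, counts)
    counts[tree] += 1  # nested tuples are hashable, so a tree can be a key
    return counts

def subtree_kernel(tree_a, tree_b):
    """Dot product of the two subtree-count vectors."""
    counts_a = subtree_counts(tree_a)
    counts_b = subtree_counts(tree_b)
    return sum(counts_a[s] * counts_b[s] for s in counts_a if s in counts_b)

# while (x < y) ...   versus   while (x < z) ...
loop_a = ("while", (("<", (leaf("x"), leaf("y"))), leaf("block")))
loop_b = ("while", (("<", (leaf("x"), leaf("z"))), leaf("block")))

print(subtree_kernel(loop_a, loop_a))  # self-similarity
print(subtree_kernel(loop_a, loop_b))  # shared subtrees only
```

A higher kernel value means the two fragments share more structure, so thresholding the (normalised) value is one way a clone detector could flag candidate pairs.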
  77. KERNEL FOR CLONES: CODE, AST, AST KERNEL

    [Figure: code fragments shown next to their Abstract Syntax Trees and the subtrees compared by the AST kernel. Node labels visible in the trees: while, block, if, the comparisons x < y, b > a and b > 0, and the assignments x = x + 1, y = y - 1, a = a + 1, b = b - 1 and c = 3.]
  79. KERNELS FOR CODE STRUCTURES: AST KERNEL FEATURES

    • Instruction Class (IC), e.g., LOOP, CALL, CONDITIONAL_STATEMENT • Instruction (I), e.g., FOR, IF, WHILE, RETURN • Context (C), i.e., Instruction Class of the closest statement node • Lexemes (Ls): Lexical information gathered (recursively) from leaves. Example for the tree of a while loop with condition x < y: while node: IC = Loop, I = while-loop, C = Function-Body, Ls = [x, y]; < node: IC = Conditional-Expr, I = Less-operator, C = Loop, Ls = [x, y]; block node: IC = Block, I = while-body, C = Loop, Ls = [x]
  84. CLONE DETECTION RESULTS

    • Comparison with another (pure) AST-based clone detector • Comparison on a system with randomly seeded clones. Results refer to clones where code fragments have been modified by adding, removing or changing code statements. [Figure: bar chart, scale 0 to 1, Precision, Recall and F-measure for CloneDigger vs the Tree Kernel Tool.]
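For reference, the three measures reported in this comparison can be computed as sketched below from the set of clone pairs a tool reports and the set of clone pairs known to exist (here, the seeded ones). The clone-pair names are made up for the example and are not data from the thesis.

```python
def precision_recall_f_measure(detected, actual):
    """Precision, recall and F-measure of a set of detected clone pairs."""
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical example: 4 pairs reported, 2 of which are real clones.
detected = {("f1", "f2"), ("f3", "f4"), ("f5", "f6"), ("f7", "f8")}
actual = {("f1", "f2"), ("f3", "f4")}
print(precision_recall_f_measure(detected, actual))
```

Precision penalises false alarms, recall penalises missed clones, and the F-measure is their harmonic mean.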
  85. CONCLUSIONS: FUTURE RESEARCH DIRECTIONS

    • Re-modularization: • Integrate the code normalization algorithm (LINSEN) into the lexical analysis • Code Normalization: • Investigate the contributions of the different sources of information used for disambiguation • Clone Detection: • Combine Static Analysis (and Kernel methods) with Dynamic Analysis
  86. THANK YOU

    June 5th, 2013 - Ph.D. Defense. Valerio Maggio, Ph.D. Student, University of Naples “Federico II”