
LINSEN: an efficient approach to split identifiers and expand abbreviations

Valerio Maggio
September 26, 2012

Slides of the talk presented at IEEE ICSM 2012 (International Conference on Software Maintenance), held in Riva del Garda (TN), Italy, in September 2012.

**Abstract**:
Information Retrieval (IR) techniques are being exploited by an increasing number of tools supporting Software Maintenance activities. Indeed, the lexical information embedded in the source code can be valuable for tasks such as concept location, clustering, or recovery of traceability links. The application of such IR-based techniques relies on the consistency of the lexicon available in the different artifacts, and their effectiveness can worsen if programmers introduce abbreviations (e.g., rect) and/or do not strictly follow naming conventions such as Camel Case (e.g., UTFtoASCII). In this paper we propose an approach to automatically split identifiers into their component words and expand abbreviations. The solution is based on a graph model and runs in linear time with respect to the size of the dictionary, taking advantage of an approximate string matching technique. The proposed technique exploits a number of different dictionaries, referring to increasingly broader contexts, in order to achieve a disambiguation strategy based on the knowledge gathered from the most appropriate domain. The approach has been compared to other splitting and expansion techniques, using freely available oracles for the identifiers extracted from 24 C/C++ and Java open source systems. Results show an improvement in both splitting and expansion performance, in addition to a strong enhancement in computational efficiency.

Transcript

  1. LINSEN: AN EFFICIENT APPROACH TO SPLIT IDENTIFIERS AND EXPAND ABBREVIATIONS

    Anna Corazza, Sergio Di Martino, Valerio Maggio, Università di Napoli “Federico II”. 26th Sept. 2012, ICSM 2012 @ Riva del Garda (Trento), Italy
  2. IR FOR NATURAL LANGUAGE

    • 1. Tokenization: Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...
  3. IR FOR NATURAL LANGUAGE

    • 1. Tokenization: Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...
    • 2. Normalization
  4. IR FOR NATURAL LANGUAGE

    • 1. Tokenization: Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...
    • 2. Normalization: 1. Change to Lower case
    • Tokens after normalization: draws, the, are, nullhandle, box, r, rectangle, g, graphics, box, displaybox, ...
  5. IR FOR NATURAL LANGUAGE

    • 1. Tokenization: Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...
    • 2. Normalization: 1. Change to Lower case; 2. Remove StopWords
    • Tokens after normalization: draws, the, are, nullhandle, box, r, rectangle, g, graphics, box, displaybox, ...
  6. IR FOR NATURAL LANGUAGE

    • 1. Tokenization: Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...
    • 2. Normalization: 1. Change to Lower case; 2. Remove StopWords; 3. Apply Stemming; 4. ...
    • Tokens after normalization: draw, the, are, nullhandl, box, r, rectangl, g, graphic, box, displaybox, ...
  7. IR FOR NATURAL LANGUAGE

    • Implicit assumption: The “same” words are used whenever a particular concept is described
    • 1. Tokenization: Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...
    • 2. Normalization: 1. Change to Lower case; 2. Remove StopWords; 3. Apply Stemming; 4. ...
    • Tokens after normalization: draw, the, are, nullhandl, box, r, rectangl, g, graphic, box, displaybox, ...
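The tokenization and normalization steps listed on these slides can be sketched in a few lines of Python. This is only an illustration: the stop-word list and the `naive_stem` helper are hypothetical stand-ins (a real pipeline would use a full stop-word list and a proper stemmer such as Porter's), and the function names are not taken from the talk.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {"the", "are", "a", "an", "of", "is"}

def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer, not Porter's algorithm."""
    for suffix in ("ing", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(text):
    tokens = re.findall(r"[A-Za-z]+", text)               # 1. tokenization
    tokens = [t.lower() for t in tokens]                  # 2.1 change to lower case
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 2.2 remove stop words
    return [naive_stem(t) for t in tokens]                # 2.3 apply stemming

print(normalize("Draws the Rectangle Graphics displayBox"))
```

Note that `displayBox` comes out as the single token `displaybox`: a pipeline designed for natural language has no step that looks inside a compound identifier.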
  9. IR FOR SOURCE CODE

    • 1. Tokenization
    • 1.5 Identifier Splitting
    • snake_case Splitter: r'(?<=\w)_'
    • display_box ==> display | box
    • camelCase/PascalCase Splitter: r'(?<!^)([A-Z][a-z]+)'
    • displayBox ==> display | Box
    • 2. Normalization
    • Tokens after splitting and normalization: draw, the, are, null, handl, box, r, rectangl, g, graphic, box, display, box, ...
  10. IR FOR SOURCE CODE

    • camelCase Splitter: r'(?<!^)([A-Z][a-z]+)'
    • drawXORRect ==> drawXOR | Rect
    • drawxorrect ==> NO SPLIT
    • Splitting algorithms based on naming conventions are not robust enough
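The two convention-based splitters can be tried directly in Python. The regular expressions are the ones shown on the slides; the wrapper function `split_identifier` and the way the two patterns are combined are assumptions made for this sketch.

```python
import re

SNAKE = re.compile(r"(?<=\w)_")              # underscore preceded by a word character
CAMEL = re.compile(r"(?<!^)([A-Z][a-z]+)")   # Capitalized run that is not at the start

def split_identifier(identifier):
    parts = []
    for chunk in SNAKE.split(identifier):
        # Put a space before every Capitalized run, then split on whitespace.
        parts.extend(CAMEL.sub(r" \1", chunk).split())
    return parts

for name in ("display_box", "displayBox", "drawXORRect", "drawxorrect"):
    print(name, "==>", " | ".join(split_identifier(name)))
```

The last two identifiers show the limitation stated on the slide: `drawXORRect` is only split into `drawXOR | Rect`, and `drawxorrect` is not split at all, because the patterns rely entirely on case and underscores.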
  11. IR FOR SOURCE CODE

    • Splitting algorithms based on naming conventions are not robust enough
    • Heavy use of Abbreviations in the source code
  12. IR FOR SOURCE CODE

    • Splitting algorithms based on naming conventions are not robust enough
    • Heavy use of Abbreviations in the source code
    • rect for Rectangle
    • r for Rectangle
  14. THE AMBIGUITY PROBLEM

    • There could be multiple, equally correct splitting or expansion solutions
    • r for Rectangle OR red
  15. THE AMBIGUITY PROBLEM

    • There could be multiple, equally correct splitting or expansion solutions
    • r for Rectangle OR red
    • nsISupports ==> ns|IS|up|ports OR ns|I|Supports
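The ambiguity can be reproduced with a brute-force enumeration of every way to cut an identifier into dictionary words. The function name and the toy dictionary below are hypothetical, chosen only so that both readings from the slide appear; this is not how LINSEN searches for a solution.

```python
def segmentations(identifier, dictionary):
    """Return every split of the identifier into words found in the dictionary."""
    s = identifier.lower()

    def expand(start):
        if start == len(s):
            return [[]]  # one way to split the empty remainder
        results = []
        for end in range(start + 1, len(s) + 1):
            piece = s[start:end]
            if piece in dictionary:
                results.extend([piece] + rest for rest in expand(end))
        return results

    return expand(0)

# Toy dictionary (assumption for illustration only).
TOY_DICTIONARY = {"ns", "i", "is", "up", "ports", "supports"}

for split in segmentations("nsISupports", TOY_DICTIONARY):
    print("|".join(split))
```

Both splits consist only of dictionary words, so a dictionary lookup alone cannot choose between them; some notion of cost or context is needed.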
  16. IDENTIFIER MAPPING

    • 1. Tokenization
    • 1.5 Identifier Mapping
    • SAMURAI (Enslen et al., 2011)
    • TIDIER (Guerrouj et al., 2011)
    • GenTest+Normalize (Lawrie and Binkley, 2011)
    • AMAP (Hill and Pollock, 2008)
    • ...
    • LINSEN
    • 2. Normalization
    • Tokens after mapping and normalization: draw, the, are, null, handl, box, r, rectangl, g, graphic, box, display, box, ...
  17. LINSEN ALGORITHM: CONTRIBUTION

    • Novel technique for the Identifier Mapping
  18. LINSEN ALGORITHM: CONTRIBUTION

    • Novel technique for the Identifier Mapping
    • Based on an efficient String Matching technique: Baeza-Yates & Perleberg Algorithm (BYP)
  19. LINSEN ALGORITHM: CONTRIBUTION

    • Novel technique for the Identifier Mapping
    • Based on an efficient String Matching technique: Baeza-Yates & Perleberg Algorithm (BYP)
    • Applied on a Graph-based model
  20. LINSEN ALGORITHM: CONTRIBUTION

    • Novel technique for the Identifier Mapping
    • Based on an efficient String Matching technique: Baeza-Yates & Perleberg Algorithm (BYP)
    • Applied on a Graph-based model
    • Able to both Split Identifiers and Expand any abbreviations occurring in them
  21. GRAPH MODEL

    • Model: Weighted Directed Graph
    • NODES correspond to characters of the current identifier
    • Example: drawXORRect identifier
    • [Figure: graph with nodes 0 to 11]
  22. GRAPH MODEL

    • Model: Weighted Directed Graph
    • NODES correspond to characters of the current identifier
    • ARCS correspond to matchings between identifier substrings and dictionary words
    • Example: drawXORRect identifier
    • [Figure: graph with nodes 0 to 11; word arcs “raw”,c(“raw”); “draw”,c(“draw”); “or”,c(“or”); “xor”,c(“xor”); “rectangle”,c(“rectangle”)]
  23. GRAPH MODEL

    • Model: Weighted Directed Graph
    • NODES correspond to characters of the current identifier
    • ARCS correspond to matchings between identifier substrings and dictionary words
    • Application of the String Matching Algorithm (BYP)
    • Example: drawXORRect identifier
    • [Figure: graph with nodes 0 to 11; word arcs “raw”,c(“raw”); “draw”,c(“draw”); “or”,c(“or”); “xor”,c(“xor”); “rectangle”,c(“rectangle”)]
  24. GRAPH MODEL

    • Model: Weighted Directed Graph
    • NODES correspond to characters of the current identifier
    • ARCS correspond to matchings between identifier substrings and dictionary words
    • Application of the String Matching Algorithm (BYP)
    • Padding Arcs ensure that the Graph is always connected
    • Example: drawXORRect identifier
    • [Figure: graph with nodes 0 to 11; one padding arc per character (“d”, “r”, “a”, “w”, “X”, “O”, “R”, “R”, “e”, “c”, “t”), each with cost C-MAX; word arcs “raw”,c(“raw”); “draw”,c(“draw”); “or”,c(“or”); “xor”,c(“xor”); “rectangle”,c(“rectangle”)]
  25. GRAPH MODEL

    • Model: Weighted Directed Graph
    • Every Arc is Labelled with the corresponding dictionary word
    • Weights represent the “cost” of each matching
    • Cost function [c(“word”)] favors longest words and words coming from the application-aware dictionaries
    • Example: drawXORRect identifier
    • [Figure: graph with nodes 0 to 11; one padding arc per character, each with cost C-MAX; word arcs “raw”,c(“raw”); “draw”,c(“draw”); “or”,c(“or”); “xor”,c(“xor”); “rectangle”,c(“rectangle”)]
  26. GRAPH MODEL

    • Model: Weighted Directed Graph
    • The final Mapping Solution corresponds to the sequence of labels in the path with the minimum cost (Dijkstra Algorithm)
    • Example: drawXORRect identifier
    • [Figure: graph with nodes 0 to 11; one padding arc per character, each with cost C-MAX; word arcs “raw”,c(“raw”); “draw”,c(“draw”); “or”,c(“or”); “xor”,c(“xor”); “rectangle”,c(“rectangle”)]
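A minimal sketch of the graph model follows, restricted to exact matches (the splitting case). The slides do not give the actual cost function, so the weights here are assumptions: each word arc costs the rank of the dictionary it comes from (1 for the most specific), and each single-character padding arc costs a hypothetical `C_MAX = 10`. The dictionaries are small samples in the spirit of the ones shown in the talk, and matching is done with plain `str.find` rather than BYP.

```python
import heapq

C_MAX = 10.0  # hypothetical cost of a single-character padding arc

def build_graph(identifier, dictionaries):
    """graph[i] holds (j, label, cost) arcs; node i is the position before character i."""
    s = identifier.lower()
    graph = {i: [] for i in range(len(s) + 1)}
    for i, ch in enumerate(s):                      # padding arcs keep the graph connected
        graph[i].append((i + 1, ch, C_MAX))
    for rank, words in enumerate(dictionaries, start=1):
        for word in words:                          # word arcs from exact substring matches
            start = s.find(word)
            while start != -1:
                graph[start].append((start + len(word), word, float(rank)))
                start = s.find(word, start + 1)
    return graph

def shortest_mapping(graph, source, target):
    """Dijkstra's algorithm; returns the arc labels on the cheapest path and its cost."""
    dist, back, heap = {source: 0.0}, {}, [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, label, cost in graph[u]:
            if d + cost < dist.get(v, float("inf")):
                dist[v] = d + cost
                back[v] = (u, label)
                heapq.heappush(heap, (d + cost, v))
    labels, node = [], target
    while node != source:
        node, label = back[node]
        labels.append(label)
    return labels[::-1], dist[target]

# Sample dictionaries, ordered from most to least specific (assumption).
DICTS = [{"draw", "rectangle", "red", "box"}, {"xor", "xpm"}, {"or", "raw", "abort"}]
name = "drawXORRect"
print(shortest_mapping(build_graph(name, DICTS), 0, len(name)))
```

With exact matching only, the cheapest path is `draw`, `xor`, followed by four padding arcs for `r`, `e`, `c`, `t`: nothing in the dictionaries matches `rect` literally, which is the gap the expansion step is meant to fill.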
  27. STRING MATCHING

    • Application of the Baeza-Yates and Perleberg (BYP) Algorithm
    • Signature: BYP(identifier, word, φ(word))
    • identifier: target string
    • word: string to match
    • φ(·): Tolerance (Error) function; bounds the length of acceptable matchings
    • Advantage: Use the same algorithm for both the splitting and the expansion step, with a different input Tolerance function
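The idea of one matcher driven by two tolerance functions can be illustrated as follows. This is not the BYP algorithm: it is a naive quadratic stand-in with the same call shape. The notion of "error" used here (letters of the dictionary word that are missing from the identifier fragment), the `phi_exp` formula, and all function names are assumptions made for this sketch.

```python
def phi_split(word):
    """Splitting tolerance: exact matching, no errors allowed."""
    return 0

def phi_exp(word):
    """Hypothetical expansion tolerance: up to two thirds of the word may be missing."""
    return (2 * len(word)) // 3

def is_abbreviation(fragment, word):
    """True if fragment starts like word and its letters occur in word in order."""
    if not fragment or fragment[0] != word[0]:
        return False
    letters = iter(word)
    return all(ch in letters for ch in fragment)  # 'in' consumes the iterator

def match(identifier, word, max_errors):
    """Return (start, end) spans of identifier fragments accepted as matches of word."""
    s = identifier.lower()
    min_len = max(1, len(word) - max_errors)
    spans = []
    for start in range(len(s)):
        for end in range(start + min_len, min(len(s), start + len(word)) + 1):
            if is_abbreviation(s[start:end], word):
                spans.append((start, end))
    return spans

print(match("drawXORRect", "xor", phi_split("xor")))              # exact match
print(match("drawXORRect", "rectangle", phi_split("rectangle")))  # no exact match
print(match("drawXORRect", "rectangle", phi_exp("rectangle")))    # approximate match
```

With zero tolerance the matcher only accepts literal occurrences, which is what splitting needs; with a non-zero tolerance the same call also accepts `rec` and `rect` as candidate abbreviations of `rectangle`.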
  28. BYP FOR SPLITTING

    • BYP(identifier, word, φ_Split(word))
    • φ_Split: Exact Matching (i.e., No Errors allowed)
    • Identifier: drawXORRect
    • D_File: ... draw, the, are, null, handle, box, red, rectangle, ...
    • D_COMPUTER-SCIENCE: ... echo, testing, threading, xpm, xor, ...
    • D_English: ... abort absolute abstract ... or ... raw ...
  29. BYP FOR SPLITTING

    • BYP(identifier, word, φ_Split(word))
    • φ_Split: Exact Matching (i.e., No Errors allowed)
    • Identifier: drawXORRect
    • Dictionaries: D_File, D_COMPUTER-SCIENCE, D_English
    • [Figure: graph with nodes 0 to 11]
  30. BYP FOR SPLITTING

    • BYP(identifier, word, φ_Split(word))
    • φ_Split: Exact Matching (i.e., No Errors allowed)
    • Identifier: drawXORRect
    • Dictionaries: D_File, D_COMPUTER-SCIENCE, D_English
    • [Figure: graph with nodes 0 to 11; one padding arc per character, each with cost C-MAX]
  31. BYP FOR SPLITTING

    • BYP(identifier, word, φ_Split(word))
    • φ_Split: Exact Matching (i.e., No Errors allowed)
    • Identifier: drawXORRect
    • Dictionaries: D_File, D_COMPUTER-SCIENCE, D_English
    • [Figure: graph with padding arcs; word arc “draw”,c(“draw”), a word listed in D_File]
  32. BYP FOR SPLITTING

    • BYP(identifier, word, φ_Split(word))
    • φ_Split: Exact Matching (i.e., No Errors allowed)
    • Identifier: drawXORRect
    • Dictionaries: D_File, D_COMPUTER-SCIENCE, D_English
    • [Figure: graph with padding arcs; word arcs “draw”,c(“draw”) and “xor”,c(“xor”), the latter a word listed in D_COMPUTER-SCIENCE]
  33. BYP FOR SPLITTING

    • BYP(identifier, word, φ_Split(word))
    • φ_Split: Exact Matching (i.e., No Errors allowed)
    • Identifier: drawXORRect
    • Dictionaries: D_File, D_COMPUTER-SCIENCE, D_English
    • [Figure: graph with padding arcs; word arcs “raw”,c(“raw”); “draw”,c(“draw”); “xor”,c(“xor”); “raw” is a word listed in D_English]
  34. BYP FOR SPLITTING

    • BYP(identifier, word, φ_Split(word))
    • φ_Split: Exact Matching (i.e., No Errors allowed)
    • Identifier: drawXORRect
    • Dictionaries: D_File, D_COMPUTER-SCIENCE, D_English
    • [Figure: graph with padding arcs; word arcs “raw”,c(“raw”); “draw”,c(“draw”); “or”,c(“or”); “xor”,c(“xor”); “or” is a word listed in D_English]
  36. BYP FOR EXPANSION

    • BYP(identifier, word, φ_Exp(word))
    • φ_Exp: Approximate Matching
    • Identifier: drawXORRect
    • D_File: ... draw, the, are, null, handle, box, red, rectangle, ...
    • D_COMPUTER-SCIENCE: ... echo, testing, threading, xpm, xor, ...
    • D_English: ... abort absolute abstract ... or ... raw ...
    • [Figure: graph with nodes 0 to 11; padding arcs with cost C-MAX; word arcs “raw”,c(“raw”); “draw”,c(“draw”); “or”,c(“or”); “xor”,c(“xor”)]
  37. BYP FOR EXPANSION

    • BYP(identifier, word, φ_Exp(word))
    • φ_Exp: Approximate Matching
    • Identifier: drawXORRect
    • Dictionaries: D_File, D_COMPUTER-SCIENCE, D_English
    • [Figure: graph with padding arcs; word arcs “draw”,c(“draw”) and “xor”,c(“xor”)]
  38. BYP FOR EXPANSION

    • BYP(identifier, word, φ_Exp(word))
    • φ_Exp: Approximate Matching
    • Identifier: drawXORRect
    • Dictionaries: D_File, D_COMPUTER-SCIENCE, D_English
    • [Figure: graph with padding arcs; word arcs “draw”,c(“draw”); “xor”,c(“xor”); “red”,c(“red”); “red” is a word listed in D_File]
  39. BYP FOR EXPANSION

    • BYP(identifier, word, φ_Exp(word))
    • φ_Exp: Approximate Matching
    • Identifier: drawXORRect
    • Dictionaries: D_File, D_COMPUTER-SCIENCE, D_English
    • [Figure: graph with padding arcs; word arcs “draw”,c(“draw”); “xor”,c(“xor”); “red”,c(“red”); “rectangle”,c(“rectangle”); “rectangle” is a word listed in D_File]
  41. EMPIRICAL EVALUATION: RESEARCH QUESTIONS

    • RQ1: How does LINSEN compare with state-of-the-art approaches in splitting identifiers?
  42. EMPIRICAL EVALUATION: RESEARCH QUESTIONS

    • RQ1: How does LINSEN compare with state-of-the-art approaches in splitting identifiers?
    • RQ2: How does LINSEN compare with state-of-the-art approaches in mapping identifiers to dictionary words?
  43. EMPIRICAL EVALUATION: RESEARCH QUESTIONS

    • RQ1: How does LINSEN compare with state-of-the-art approaches in splitting identifiers?
    • RQ2: How does LINSEN compare with state-of-the-art approaches in mapping identifiers to dictionary words?
    • RQ3: How well does the LINSEN approach deal with different types of abbreviations?
  44. EMPIRICAL EVALUATION: CASE STUDIES

    • DTW (Madani et al., 2010) and GenTest+Normalize (Lawrie and Binkley, 2011): RQ1 and RQ2
    • LUDISO Dataset (2012): RQ1 only; 15 out of 750 software systems, covering 58% of total identifiers
    • AMAP (Hill and Pollock, 2008): RQ3 only
  45. EMPIRICAL EVALUATION: EVALUATION METRICS

    • Comparability of Results: Accuracy rate
    • Identifier Level evaluation: Each mapping result must be completely correct
    • Soft-word Level evaluation: “Partial credit” given to each word correctly mapped
    • Qualitative Evaluation: Precision/Recall/F-1 [Guerrouj et al., 2011]
    • As for the comparison with GenTest+Normalize (Lawrie and Binkley, 2011)
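The difference between the two accuracy rates can be made concrete with a small sketch. The function names, the position-by-position comparison used for partial credit, and the two-identifier oracle are all assumptions made for illustration; the slides do not specify how soft words are aligned.

```python
def identifier_level_accuracy(predicted, oracle):
    """Fraction of identifiers whose whole mapping equals the oracle's mapping."""
    correct = sum(1 for name in oracle if predicted.get(name) == oracle[name])
    return correct / len(oracle)

def soft_word_level_accuracy(predicted, oracle):
    """Fraction of oracle soft words mapped correctly (partial credit per word)."""
    total = hits = 0
    for name, expected in oracle.items():
        guess = predicted.get(name, [])
        total += len(expected)
        # Assumption: words are compared position by position.
        hits += sum(1 for g, e in zip(guess, expected) if g == e)
    return hits / total

# Hypothetical oracle and tool output.
ORACLE = {"drawXORRect": ["draw", "xor", "rectangle"], "displayBox": ["display", "box"]}
PREDICTED = {"drawXORRect": ["draw", "xor", "rect"], "displayBox": ["display", "box"]}

print(identifier_level_accuracy(PREDICTED, ORACLE))
print(soft_word_level_accuracy(PREDICTED, ORACLE))
```

A single unexpanded abbreviation makes the whole identifier count as wrong at the identifier level, while at the soft-word level only that one word loses credit.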
  46. RESULTS RQ1: SPLITTING

    • Accuracy Rates for the comparison with DTW (Madani et al., 2010)
    • [Bar chart: accuracy rate (axis 0 to 1) of DTW and LINSEN on JHotDraw 5.1 and Lynx 2.8.5]
  47. RESULTS RQ1: SPLITTING

    • Accuracy Rates for the comparison with GenTest (Lawrie and Binkley, 2011)
    • [Bar charts: Identifier Level (axis 0 to 0.7) and Soft-word Level (axis 0 to 0.8) accuracy rates of GenTest and LINSEN on which 2.20 and a2ps 4.14]
  48. RESULTS RQ2: MAPPING

    • Accuracy Rates for the comparison with DTW (Madani et al., 2010)
    • [Bar chart: accuracy rate (axis 0 to 1) of DTW and LINSEN on JHotDraw 5.1 and Lynx 2.8.5]
  49. RESULTS RQ2: MAPPING

    • Accuracy Rates for the comparison with Normalize (Lawrie and Binkley, 2011)
    • [Bar charts: Identifier Level (axis 0 to 0.6) and Soft-word Level (axis 0 to 0.9) accuracy rates of Normalize and LINSEN on which 2.20 and a2ps 4.14]
  50. RESULTS RQ3: EXPANSION

    • Accuracy Rates for the comparison with AMAP (Hill and Pollock, 2008)
    • [Bar chart: accuracy rate (axis 0 to 0.9) of AMAP and LINSEN for each abbreviation type: CW, DL, OO, AC, PR, SL]
    • CW: Combination Words; DL: Dropped Letters; OO: Others; AC: Acronyms; PR: Prefix; SL: Single Letters
  51. FUTURE WORK

    • Evaluation of the impact of each adopted dictionary on the performance
    • Improve, change, or add dictionaries
    • Improve the implementation of the prototype to speed up the computation
    • Make use of parallel computation to process each identifier in isolation
  52. THANK YOU

    Valerio Maggio, Ph.D. Student, University of Naples “Federico II” [email protected]. 26th Sept. 2012, ICSM 2012 @ Riva del Garda (Trento), Italy