LINSEN an efficient approach to split identifiers and expand abbreviations

Anna Corazza, Sergio Di Martino, Valerio Maggio
Università di Napoli “Federico II”
26th Sept. 2012, ICSM 2012, Riva del Garda (Trento), Italy

MOTIVATIONS
IR FOR SE

IR FOR NATURAL LANGUAGE
1. Tokenization
Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...
2. Normalization
1. Change to lower case
2. Remove stop words
3. Apply stemming
4. ...
draw, the, are, nullhandl, box, r, rectangl, g, graphic, box, displaybox, ...
Implicit assumption: the “same” words are used whenever a particular concept is described
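The normalization steps listed above can be sketched in a few lines of Python. This is only an illustration: the stop-word set is a two-word toy list, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer such as Porter's. Unlike the token list on the slide, the sketch actually drops the stop words.

```python
STOP_WORDS = {"the", "are"}  # toy list for illustration, not a real stop-word set


def crude_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a final 's', then a final 'e'."""
    if len(word) > 3 and word.endswith("s"):
        word = word[:-1]
    if len(word) > 3 and word.endswith("e"):
        word = word[:-1]
    return word


def normalize(tokens):
    lowered = [token.lower() for token in tokens]                 # 1. lower case
    kept = [token for token in lowered if token not in STOP_WORDS]  # 2. stop words
    return [crude_stem(token) for token in kept]                  # 3. stemming


tokens = ["Draws", "the", "are", "NullHandle", "box", "r",
          "Rectangle", "g", "Graphics", "box", "displayBox"]
print(normalize(tokens))
```

Note that `NullHandle` and `displayBox` survive as single terms (`nullhandl`, `displaybox`), which is the problem the following slides address.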

IR FOR SOURCE CODE
1. Tokenization
1.5 Identifier Splitting
• snake_case Splitter: r'(?<=\w)_'
• display_box ==> display | box
• camelCase/PascalCase Splitter: r'(?<!^)([A-Z][a-z]+)'
• displayBox ==> display | Box
2. Normalization
draw, the, are, null, handl, box, r, rectangl, g, graphic, box, display, box, ...

IR FOR SOURCE CODE
• camelCase Splitter: r'(?<!^)([A-Z][a-z]+)'
• drawXORRect ==> drawXOR | Rect
• drawxorrect ==> NO SPLIT
Splitting algorithms based on naming conventions are not robust enough
• Heavy use of abbreviations in the source code
• rect for Rectangle
• r for Rectangle
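The behaviour of the convention-based splitters can be reproduced with Python's `re` module. The two patterns are the ones quoted in the slides. The function names and the way the camelCase pattern is applied (inserting a space before each match and then splitting) are assumptions made for this sketch.

```python
import re


def split_snake(identifier):
    # Split on underscores that follow a word character.
    return re.split(r"(?<=\w)_", identifier)


def split_camel(identifier):
    # Put a space before every Capitalised run that is not at the start, then split.
    return re.sub(r"(?<!^)([A-Z][a-z]+)", r" \1", identifier).split()


for name in ["display_box", "displayBox", "drawXORRect", "drawxorrect"]:
    parts = split_snake(name) if "_" in name else split_camel(name)
    print(name, "==>", " | ".join(parts))
```

The last two identifiers show the limits of the approach: `drawXORRect` keeps `drawXOR` as one piece, and `drawxorrect` is not split at all.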

THE AMBIGUITY PROBLEM
There could be multiple, equally correct splitting or expansion solutions
• r for Rectangle OR red
• nsISupports ==> ns|IS|up|ports OR ==> ns|I|Supports
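The splitting ambiguity is easy to reproduce by enumerating every way of cutting an identifier into dictionary words. The sketch below is not part of LINSEN. `TOY_DICTIONARY` is a hypothetical word list chosen so that the two splittings of nsISupports from the slide both appear.

```python
def segmentations(text, dictionary):
    """Return every way to cut `text` into consecutive dictionary words."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            for tail in segmentations(text[end:], dictionary):
                results.append([head] + tail)
    return results


TOY_DICTIONARY = {"ns", "i", "is", "up", "ports", "supports"}  # hypothetical

for split in segmentations("nsISupports".lower(), TOY_DICTIONARY):
    print("|".join(split))
```

Both results are valid with respect to the dictionary, so a splitter needs some notion of cost to prefer one over the other.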

IDENTIFIER MAPPING
1. Tokenization
1.5 Identifier Mapping
• SAMURAI (Enslen et al., 2011)
• TIDIER (Guerrouj et al., 2011)
• GenTest+Normalize (Lawrie and Binkley, 2011)
• AMAP (Hill and Pollock, 2008)
• ...
• LINSEN
2. Normalization
draw, the, are, null, handl, box, r, rectangl, g, graphic, box, display, box, ...

LINSEN ALGORITHM
CONTRIBUTION
• Novel technique for Identifier Mapping
• Based on an efficient String Matching technique: the Baeza-Yates & Perleberg Algorithm (BYP)
• Applied on a Graph-based model
• Able to both Split Identifiers and Expand any abbreviations occurring in them

DICTIONARIES
Application-aware Dictionaries
(22,940 Entries) (108,315 Entries) (588 Entries)

GRAPH MODEL
Model: Weighted Directed Graph
Example: drawXORRect identifier
• NODES correspond to characters of the current identifier
• ARCS correspond to matchings between identifier substrings and dictionary words
• Application of the String Matching Algorithm (BYP)
• Padding Arcs ensure that the Graph is always connected
• Every Arc is labelled with the corresponding dictionary word
• Weights represent the “cost” of each matching
• Cost function [c(“word”)] favors longer words and words coming from the application-aware dictionaries
• The final Mapping Solution corresponds to the sequence of labels in the path with the minimum cost (Dijkstra's Algorithm)
[Figure: graph for drawXORRect with nodes 0 to 11; padding arcs “d”, “r”, “a”, “w”, “X”, “O”, “R”, “R”, “e”, “c”, “t”, each with cost C-MAX; matching arcs “draw”, “raw”, “xor”, “or”, “rectangle”, each with cost c(word)]
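A minimal sketch of the graph model, assuming invented numbers throughout: the slides name the padding cost C-MAX and the cost function c(word) but give no values, so `C_MAX = 10.0` and the per-dictionary costs below are placeholders. `build_arcs` and `best_mapping` are hypothetical helper names. Matching arcs come from plain exact substring search, and the “rectangle” arc over “Rect” is added by hand to mirror the figure.

```python
import heapq

C_MAX = 10.0  # assumed padding cost; the slides only name it C-MAX


def build_arcs(identifier, dictionaries):
    """arcs[i] holds (target node, label, cost) for every arc leaving node i."""
    text = identifier.lower()
    arcs = {node: [] for node in range(len(text) + 1)}
    # Padding arcs: one per character, so node 0 always reaches the last node.
    for i, char in enumerate(identifier):
        arcs[i].append((i + 1, char, C_MAX))
    # Matching arcs: one per exact occurrence of a dictionary word.
    for words, cost in dictionaries:
        for word in words:
            start = text.find(word)
            while start != -1:
                arcs[start].append((start + len(word), word, cost))
                start = text.find(word, start + 1)
    return arcs


def best_mapping(arcs, last_node):
    """Dijkstra from node 0 to last_node; returns (arc labels, total cost)."""
    dist, back, heap = {0: 0.0}, {}, [(0.0, 0)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        if node == last_node:
            break
        for target, label, cost in arcs[node]:
            if d + cost < dist.get(target, float("inf")):
                dist[target] = d + cost
                back[target] = (node, label)
                heapq.heappush(heap, (d + cost, target))
    labels, node = [], last_node
    while node != 0:
        node, label = back[node]
        labels.append(label)
    return labels[::-1], dist[last_node]


identifier = "drawXORRect"
dictionaries = [
    (["draw", "xor"], 1.0),  # application-aware words: cheaper (assumed cost)
    (["raw", "or"], 2.0),    # general English words (assumed cost)
]
arcs = build_arcs(identifier, dictionaries)
arcs[7].append((11, "rectangle", 1.0))  # expansion arc for "Rect", added by hand
print(best_mapping(arcs, len(identifier)))
```

Because every padding arc costs far more than any matching arc, the cheapest path uses dictionary words wherever they exist and falls back to single characters only where nothing matches.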

STRING MATCHING
• Application of the Baeza-Yates and Perleberg (BYP) Algorithm
• Signature: BYP(identifier, word, φ(word))
• identifier: target string
• word: string to match
• φ(·): Tolerance (Error) function
• Bounds the length of acceptable matchings
Advantage: the same algorithm is used for both the splitting and the expansion steps, with a different input Tolerance function
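To make the role of the tolerance function concrete, the sketch below mimics the BYP(identifier, word, φ(word)) signature with a brute-force scan that counts character mismatches at each alignment. It does not reproduce the BYP algorithm itself, and treating the tolerance as a mismatch count is an assumption of this sketch. `find_matches`, `phi_split` and `phi_one_error` are hypothetical names.

```python
def find_matches(identifier, word, tolerance):
    """Start offsets where `word` aligns with at most tolerance(word) mismatches."""
    text = identifier.lower()
    allowed = tolerance(word)
    hits = []
    for start in range(len(text) - len(word) + 1):
        window = text[start:start + len(word)]
        mismatches = sum(a != b for a, b in zip(window, word))
        if mismatches <= allowed:
            hits.append(start)
    return hits


def phi_split(word):
    return 0  # exact matching: no errors allowed


def phi_one_error(word):
    # Hypothetical tolerance: one error for words of four or more characters.
    return 1 if len(word) >= 4 else 0


print(find_matches("drawXORRect", "xor", phi_split))       # exact match
print(find_matches("drawXORRect", "drew", phi_split))      # no exact match
print(find_matches("drawXORRect", "drew", phi_one_error))  # one mismatch tolerated
```

Only the third argument changes between the exact and the approximate call, which is the advantage the slide points out.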

BYP FOR SPLITTING
• BYP(identifier, word, φ_Split(word))
• φ_Split: Exact Matching (i.e., No Errors allowed)
Identifier: drawXORRect
• D_File: ..., draw, the, are, null, handle, box, red, rectangle, ...
• D_Computer-Science: ..., echo, testing, threading, xpm, xor, ...
• D_English: ..., abort, absolute, abstract, ..., or, ..., raw, ...
[Figure: graph for drawXORRect with nodes 0 to 11 and the C-MAX padding arcs; matching arcs are added dictionary by dictionary: “draw” (D_File), “xor” (D_Computer-Science), “raw” and “or” (D_English)]

BYP FOR EXPANSION
• BYP(identifier, word, φ_Exp(word))
• φ_Exp: Approximate Matching
Identifier: drawXORRect
• D_File: ..., draw, the, are, null, handle, box, red, rectangle, ...
• D_Computer-Science: ..., echo, testing, threading, xpm, xor, ...
• D_English: ..., abort, absolute, abstract, ..., or, ..., raw, ...
[Figure: graph for drawXORRect with nodes 0 to 11, the C-MAX padding arcs and the arcs “draw” and “xor”; the expansion step adds the arcs “red” and “rectangle” (D_File)]
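The expansion step proposes dictionary words that a short fragment may abbreviate. The slides do not spell out the tolerance function used for this, so the sketch below swaps in a much simpler rule, clearly not BYP: a word is a candidate when the fragment is a proper prefix of it. `expansion_candidates` is a hypothetical name, and `D_FILE` contains only the words visible on the slide.

```python
D_FILE = ["draw", "the", "are", "null", "handle", "box", "red", "rectangle"]


def expansion_candidates(fragment, dictionary):
    """Words that start with `fragment` and are strictly longer than it."""
    fragment = fragment.lower()
    return [word for word in dictionary
            if word.startswith(fragment) and len(word) > len(fragment)]


print(expansion_candidates("r", D_FILE))     # both "red" and "rectangle" qualify
print(expansion_candidates("Rect", D_FILE))  # only "rectangle" qualifies
```

The single-letter fragment reproduces the ambiguity discussed earlier, while the longer fragment narrows the candidates to one word.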

EMPIRICAL EVALUATION
RESEARCH QUESTIONS
• RQ1: How does LINSEN compare with state-of-the-art approaches in splitting identifiers?
• RQ2: How does LINSEN compare with state-of-the-art approaches in mapping identifiers to dictionary words?
• RQ3: How well does LINSEN deal with different types of abbreviations?

CASE STUDIES
• DTW (Madani et al., 2010) and GenTest+Normalize (Lawrie and Binkley, 2011): RQ1 and RQ2
• LUDISO Dataset (2012): RQ1 only
• 15 out of 750 software systems, covering 58% of the total identifiers
• AMAP (Hill and Pollock, 2008): RQ3 only

EVALUATION METRICS
Comparability of Results: Accuracy rate
• Identifier Level evaluation: each mapping result must be completely correct
• Soft-word Level evaluation: “partial credit” given to each word correctly mapped
• Both levels are used for the comparison with GenTest+Normalize (Lawrie and Binkley, 2011)
Qualitative Evaluation: Precision/Recall/F-1 [Guerrouj et al., 2011]
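The difference between the two accuracy levels can be shown on a toy example. The exact scoring rules are not given in the slides, so the partial-credit definition used here (gold words recovered divided by gold words, per identifier, compared as multisets) is an assumption. The function names and the data are hypothetical.

```python
from collections import Counter


def identifier_level_accuracy(predicted, oracle):
    """Share of identifiers whose whole mapping equals the oracle mapping."""
    correct = sum(1 for p, o in zip(predicted, oracle) if p == o)
    return correct / len(oracle)


def soft_word_level_accuracy(predicted, oracle):
    """Share of oracle words that also appear in the predicted mapping."""
    matched = total = 0
    for p, o in zip(predicted, oracle):
        overlap = Counter(p) & Counter(o)  # multiset intersection
        matched += sum(overlap.values())
        total += len(o)
    return matched / total


oracle = [["draw", "xor", "rectangle"], ["display", "box"]]
predicted = [["draw", "xor", "red"], ["display", "box"]]

print(identifier_level_accuracy(predicted, oracle))
print(soft_word_level_accuracy(predicted, oracle))
```

One wrong word makes the first identifier count as a full miss at identifier level, while at soft-word level it still earns credit for the two words mapped correctly.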

RESULTS RQ1: SPLITTING
[Chart: accuracy rates, DTW (Madani et al., 2010) vs. LINSEN, on JHotDraw 5.1 and Lynx 2.8.5]
[Charts: accuracy rates, GenTest (Lawrie and Binkley, 2011) vs. LINSEN, on which 2.20 and a2ps 4.14, at Identifier Level and Soft-word Level]

RESULTS RQ2: MAPPING
[Chart: accuracy rates, DTW (Madani et al., 2010) vs. LINSEN, on JHotDraw 5.1 and Lynx 2.8.5]
[Charts: accuracy rates, Normalize (Lawrie and Binkley, 2011) vs. LINSEN, on which 2.20 and a2ps 4.14, at Identifier Level and Soft-word Level]

RESULTS RQ3: EXPANSION
[Chart: accuracy rates, AMAP (Hill and Pollock, 2008) vs. LINSEN, by abbreviation type]
CW: Combination Words, DL: Dropped Letters, OO: Others, AC: Acronyms, PR: Prefix, SL: Single Letters

CONCLUSIONS

FUTURE WORKS
• Evaluate the impact of each adopted dictionary on performance
• Improve, change, or add dictionaries
• Improve the implementation of the prototype to speed up the computation
• Make use of parallel computation to process each identifier in isolation

THANK YOU
Valerio Maggio, Ph.D. Student, University of Naples “Federico II”
