Training data selection for cross-project defect prediction

Steffen Herbold

Outline
•  Motivation
•  Training data selection
•  Case study
•  Conclusion

Every software has failures!

But where?

Defect prediction!

Software metrics as foundation

    public class GUIElementFactory implements IGUIElementFactory {

        private static GUIElementFactory instance = new GUIElementFactory();

        private GUIElementFactory() {}

        public static synchronized GUIElementFactory getInstance() {
            return instance;
        }

        private Properties mappingsFromConfiguration;

        @Override
        public IGUIElement instantiateGUIElement(
                IGUIElementSpec specification, IGUIElement parent)
                throws GUIModelConfigurationException {
            Properties mappings = getMappingsFromConfiguration();
            IGUIElement guiElement = null;
            String[] typeHierarchy = specification.getTypeHierarchy();
            int i = 0;
            String className = null;
            while ((className == null) && (i < typeHierarchy.length)) {
                className = mappings.getProperty(typeHierarchy[i]);
                i++;
            }
            if (className != null) {
                try {
                    …

GUIElementFactory
Lines of Code (LOC): 193
Weighted Methods per Class (WMC): 34
Number of Methods (NOM): 3
…

Defect prediction
[Diagram: Target Project (Ant 1.3), Predictor Training, Defect Prediction]

Cross-project defect prediction
[Diagram: Available Data (arc, Xerces 1.4, ...), Predictor Training, Defect Prediction, Target Project (Ant 1.3)]

Training data as subset of available data
[Diagram: Available Data (arc, Xerces 1.4, ...), Training Data Selection based on the Target Project (Ant 1.3), Training Data, Predictor Training, Defect Prediction]

Set-wise selection
[Diagram: Available Data (arc, Xerces 1.4, ...), Training Data Selection based on the Target Project (Ant 1.3), Training Data (Version 1, ..., Version k), Predictor Training, Defect Prediction]

Relationship between distributional characteristics and success

Distributional characteristics of a project

Project Data
GUIElementFactory: Lines of Code (LOC) 193, Weighted Methods per Class (WMC) 34, …
GUIElement: Lines of Code (LOC) 75, Weighted Methods per Class (WMC) 10, …
…

Project Characteristics
mean(LOC) 110
stddev(LOC) 30
…
mean(WMC) 15
stddev(WMC) 5
…
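As a rough illustration of how per-class metric values could be condensed into project characteristics, the sketch below computes the mean and standard deviation of each metric column. The class name `ProjectCharacteristics` and its methods are hypothetical, and the sample standard deviation (n - 1 denominator) is an assumption, since the slides do not say which variant was used. The example input contains only the two classes shown on the slide, so its output will differ from the project-wide values listed there.

```java
import java.util.Arrays;

public class ProjectCharacteristics {

    /** Arithmetic mean of one metric over all classes of a project. */
    public static double mean(double[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    /** Sample standard deviation (n - 1 denominator, an assumption). */
    public static double stddev(double[] values) {
        if (values.length < 2) {
            return 0.0;
        }
        double m = mean(values);
        double sumOfSquares = 0.0;
        for (double v : values) {
            sumOfSquares += (v - m) * (v - m);
        }
        return Math.sqrt(sumOfSquares / (values.length - 1));
    }

    /**
     * Rows are classes, columns are metrics.
     * Returns [mean(m0), stddev(m0), mean(m1), stddev(m1), ...].
     */
    public static double[] characteristics(double[][] projectData) {
        int metricCount = projectData[0].length;
        double[] result = new double[2 * metricCount];
        for (int j = 0; j < metricCount; j++) {
            double[] column = new double[projectData.length];
            for (int i = 0; i < projectData.length; i++) {
                column[i] = projectData[i][j];
            }
            result[2 * j] = mean(column);
            result[2 * j + 1] = stddev(column);
        }
        return result;
    }

    public static void main(String[] args) {
        // Columns: LOC, WMC. Rows: GUIElementFactory, GUIElement (values from the slide).
        double[][] data = { { 193, 34 }, { 75, 10 } };
        System.out.println(Arrays.toString(characteristics(data)));
    }
}
```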

k-Nearest Neighbor Selection
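The slides show this step only as figures, so the following is a minimal sketch of one way a k-nearest-neighbor selection over project characteristics could look, not the method as implemented in the study. It assumes each project is described by a vector of distributional characteristics (for example mean and stddev per metric) and that closeness is measured by unscaled Euclidean distance. The class name `NearestProjectSelection`, the distance choice, and the candidate values are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class NearestProjectSelection {

    /** Euclidean distance between two characteristic vectors of equal length. */
    public static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    /** Returns the indices of the k candidates closest to the target, nearest first. */
    public static List<Integer> selectNearest(double[] target, double[][] candidates, int k) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < candidates.length; i++) {
            indices.add(i);
        }
        indices.sort(Comparator.comparingDouble(i -> distance(target, candidates[i])));
        return new ArrayList<>(indices.subList(0, Math.min(k, indices.size())));
    }

    public static void main(String[] args) {
        // Target: mean(LOC), stddev(LOC), mean(WMC), stddev(WMC) as on the slide.
        double[] target = { 110, 30, 15, 5 };
        // Hypothetical candidate projects, invented for this example.
        double[][] candidates = {
            { 100, 28, 14, 6 },
            { 400, 200, 60, 30 },
            { 120, 35, 16, 5 }
        };
        System.out.println(selectNearest(target, candidates, 2));
    }
}
```

In practice the characteristics would likely need to be normalized first, because metrics such as LOC otherwise dominate the distance.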

EM clustering selection

Case study data
•  14 Java projects
•  44 releases
•  20 software metrics
•  34% defect-prone in total

Defect proneness density

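Weighting to counter bias

$$\sum_{i \in \text{defect-prone}} w_i = \sum_{i \in \text{non-defect-prone}} w_i$$

$$w_{\text{defect-prone}} = 0.5 \cdot \frac{\#\text{instances}}{\#\text{defect-prone instances}}, \qquad w_{\text{non-defect-prone}} = 0.5 \cdot \frac{\#\text{instances}}{\#\text{non-defect-prone instances}}$$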
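A small sketch of the weighting idea, assuming the slide's formula reads "weight = 0.5 times the number of all training instances divided by the number of instances in that class", so that both classes end up with the same total weight. The class name `ClassBalanceWeights` is hypothetical, and the counts in `main` are made up to mirror the 34% defect-prone share mentioned for the case study data.

```java
public class ClassBalanceWeights {

    /**
     * Returns {weight per defect-prone instance, weight per non-defect-prone instance}.
     * With these weights, each class contributes half of the total weight.
     */
    public static double[] weights(int defectProne, int nonDefectProne) {
        if (defectProne <= 0 || nonDefectProne <= 0) {
            throw new IllegalArgumentException("both classes need at least one instance");
        }
        double total = defectProne + nonDefectProne;
        return new double[] {
            0.5 * total / defectProne,
            0.5 * total / nonDefectProne
        };
    }

    public static void main(String[] args) {
        // Illustrative counts: 34 defect-prone and 66 non-defect-prone instances.
        double[] w = weights(34, 66);
        System.out.println("defect-prone weight:     " + w[0]);
        System.out.println("non-defect-prone weight: " + w[1]);
        // Both class totals are (up to rounding) half of the 100 instances.
        System.out.println("class totals: " + (w[0] * 34) + " and " + (w[1] * 66));
    }
}
```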

Predictor models
•  Logistic Regression
•  Naïve Bayes
•  Bayesian Networks
•  SVM with RBF kernel
•  C4.5 Decision Trees
•  Random Forest
•  Multilayer Perceptron

Case study workflow
[Diagram: Available Projects (Ant 1.3, Ant 1.7, ..., arc, Xerces 1.4, ...), Candidate Training Data (arc, Xerces 1.4, ...), Training Data Selection based on the Target Project (Ant 1.3), Training Data (Version 1, ..., Version k), Predictor Training, Defect Prediction; repeated for each predictor model and for each available project]

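Evaluation criteria

$$\text{recall} = \frac{tp}{tp + fn}, \qquad \text{precision} = \frac{tp}{tp + fp}$$

$$\text{success} = (\text{recall} > 0.7) \land (\text{precision} > 0.5)$$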
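The criteria can be sketched in a few lines, assuming the usual definitions of recall and precision from true positives (tp), false negatives (fn), and false positives (fp), and reading the two thresholds on the slide as recall above 0.7 and precision above 0.5. The class name `EvaluationCriteria` is hypothetical, and returning 0 when a denominator is zero is a convention chosen for this sketch.

```java
public class EvaluationCriteria {

    /** Share of actually defect-prone classes that were predicted as defect-prone. */
    public static double recall(int tp, int fn) {
        return (tp + fn == 0) ? 0.0 : (double) tp / (tp + fn);
    }

    /** Share of classes predicted as defect-prone that really are defect-prone. */
    public static double precision(int tp, int fp) {
        return (tp + fp == 0) ? 0.0 : (double) tp / (tp + fp);
    }

    /** A prediction counts as a success if both thresholds are exceeded. */
    public static boolean success(int tp, int fp, int fn) {
        return recall(tp, fn) > 0.7 && precision(tp, fp) > 0.5;
    }

    public static void main(String[] args) {
        // Illustrative confusion-matrix counts, not taken from the case study.
        int tp = 8, fp = 4, fn = 2;
        System.out.println("recall    = " + recall(tp, fn));
        System.out.println("precision = " + precision(tp, fp));
        System.out.println("success   = " + success(tp, fp, fn));
    }
}
```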

Results: success

Results: recall and precision

Key findings
•  Set-wise selection improves results
•  Equal weighting improves results
•  Within-project performance still out of reach
•  SVM performs best

Open Issues

Advertisement
•  http://autoquest.informatik.uni-goettingen.de


PROMISE'13: The 9th International Conference on Predictive Models in Software Engineering
