Training data selection for cross-project defect prediction

by Steffen Herbold

More Decks by PROMISE'13: The 9th International Conference on Predictive Models in Software Engineering

Transcript

  1. Software metrics as foundation

    Code excerpt shown on the slide (cut off after the opening try block):

    public class GUIElementFactory implements IGUIElementFactory {

        private static GUIElementFactory instance = new GUIElementFactory();

        private GUIElementFactory() {}

        public static synchronized GUIElementFactory getInstance() {
            return instance;
        }

        private Properties mappingsFromConfiguration;

        @Override
        public IGUIElement instantiateGUIElement(
                IGUIElementSpec specification, IGUIElement parent)
                throws GUIModelConfigurationException {
            Properties mappings = getMappingsFromConfiguration();
            IGUIElement guiElement = null;
            String[] typeHierarchy = specification.getTypeHierarchy();
            int i = 0;
            String className = null;
            while ((className == null) && (i < typeHierarchy.length)) {
                className = mappings.getProperty(typeHierarchy[i]);
                i++;
            }
            if (className != null) {
                try {

    GUIElementFactory
      Lines of Code (LOC): 193
      Weighted Methods per Class (WMC): 34
      Number of Methods (NOM): 3
      …
  2. Cross-project defect prediction

    Diagram labels: Predictor Training; Defect Prediction; Target Project (Ant 1.3); Available Data (arc, Xerces 1.4, ...)
  3. Training data as subset of available data

    Diagram labels: Training Data Selection; Predictor Training; Defect Prediction; Target Project (Ant 1.3); Training Data; Available Data (arc, Xerces 1.4, ...); Based on
  4. Set-wise selection

    Diagram labels: Training Data Selection; Predictor Training; Defect Prediction; Target Project (Ant 1.3); Training Data (Version 1, Version k, ...); Available Data (arc, Xerces 1.4, ...); Based on
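The slide says that whole versions are picked as training data based on the target project, but the transcript does not state the selection criterion. The sketch below is one possible reading, not the method from the talk: it assumes every version is summarized by a numeric vector (for example means and standard deviations of its metrics) and keeps the k versions closest to the target by Euclidean distance. The class name, method names, candidate names, and all candidate values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: set-wise selection picks whole versions, not single classes.
final class SetWiseSelection {

    // Euclidean distance between two characteristic vectors of equal length.
    // Assumption: the vectors are used unscaled; in practice metrics with large
    // ranges (such as LOC) would dominate unless the values are normalized first.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the names of the k candidate versions closest to the target.
    static List<String> selectNearest(double[] target, Map<String, double[]> candidates, int k) {
        List<String> names = new ArrayList<>(candidates.keySet());
        names.sort(Comparator.comparingDouble(n -> distance(target, candidates.get(n))));
        return new ArrayList<>(names.subList(0, Math.min(k, names.size())));
    }

    public static void main(String[] args) {
        // Made-up vectors in the order mean(LOC), stddev(LOC), mean(WMC), stddev(WMC).
        Map<String, double[]> candidates = new LinkedHashMap<>();
        candidates.put("candidate-A", new double[] {100, 28, 14, 6});
        candidates.put("candidate-B", new double[] {300, 120, 40, 20});
        candidates.put("candidate-C", new double[] {120, 35, 16, 5});
        double[] target = {110, 30, 15, 5};
        System.out.println(selectNearest(target, candidates, 2));
    }
}
```

With these made-up vectors the two selected versions are candidate-A and candidate-C; all classes of the selected versions would then form the training data.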
  5. Distributional characteristics of a project

    Project Characteristics
      mean(LOC): 110
      stddev(LOC): 30
      …
      mean(WMC): 15
      stddev(WMC): 5
      …

    Project Data
      GUIElementFactory
        Lines of Code (LOC): 193
        Weighted Methods per Class (WMC): 34
        …
      GUIElement
        Lines of Code (LOC): 75
        Weighted Methods per Class (WMC): 10
        …
      …
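The step from per-class metric values to project-level characteristics can be sketched as follows. This is a minimal illustration, not code from the talk: the class and method names are hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption, since the slide does not say which form is used. The example input contains only the two classes visible on the slide, so its output will not match the slide's project-wide values such as mean(LOC) = 110.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: summarizes a project by the mean and standard
// deviation of each class-level metric.
final class ProjectCharacteristics {

    // Arithmetic mean of one metric over all classes (NaN for an empty array).
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Sample standard deviation (divides by n - 1). Assumption: the slide does
    // not state whether the sample or the population form is meant.
    static double stddev(double[] values) {
        if (values.length < 2) {
            return 0.0;
        }
        double m = mean(values);
        double squares = 0.0;
        for (double v : values) {
            squares += (v - m) * (v - m);
        }
        return Math.sqrt(squares / (values.length - 1));
    }

    // Input: metric name -> one value per class. Output: characteristic name -> value.
    static Map<String, Double> characterize(Map<String, double[]> metrics) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : metrics.entrySet()) {
            result.put("mean(" + e.getKey() + ")", mean(e.getValue()));
            result.put("stddev(" + e.getKey() + ")", stddev(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Only the two classes shown on the slide: GUIElementFactory and GUIElement.
        Map<String, double[]> metrics = new LinkedHashMap<>();
        metrics.put("LOC", new double[] {193, 75});
        metrics.put("WMC", new double[] {34, 10});
        System.out.println(characterize(metrics));
    }
}
```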
  6. Case study data

    •  14 Java projects
    •  44 releases
    •  20 software metrics
    •  34% defect-prone in total
  7. Predictor models

    •  Logistic Regression
    •  Naïve Bayes
    •  Bayesian Networks
    •  SVM with RBF kernel
    •  C4.5 Decision Trees
    •  Random Forest
    •  Multilayer Perceptron
  8. Case study workflow

    Diagram labels: For each predictor model; For each available project; Available Projects (Ant 1.3, Ant 1.7, ..., arc, Xerces 1.4, ...); Target Project (Ant 1.3); Candidate Training Data (arc, Xerces 1.4, ...); Training Data Selection; Training Data (Version 1, Version k, ...); Predictor Training; Defect Prediction; Based on
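The two loops named on the slide ("for each predictor model", "for each available project") can be sketched as an experiment plan. This is an illustration under assumptions, not the study's code: the class and method names are hypothetical, and the rule that candidate training data excludes every release of the target's own project is read off the diagram labels, where the candidate data for Ant 1.3 lists arc and Xerces 1.4 but not Ant 1.7. Selection, training, and prediction are left as a placeholder.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the case study loop structure.
final class CaseStudyWorkflow {

    // One release of a project, e.g. project "Ant", version "1.3".
    static final class Release {
        final String project;
        final String version;

        Release(String project, String version) {
            this.project = project;
            this.version = version;
        }

        @Override
        public String toString() {
            return version.isEmpty() ? project : project + " " + version;
        }
    }

    // Candidate training data: all releases that belong to a different
    // project than the target (assumption, see the lead-in).
    static List<Release> candidates(Release target, List<Release> available) {
        List<Release> result = new ArrayList<>();
        for (Release r : available) {
            if (!r.project.equals(target.project)) {
                result.add(r);
            }
        }
        return result;
    }

    // Enumerates one run per (predictor model, target release) pair.
    static List<String> plan(List<String> models, List<Release> available) {
        List<String> runs = new ArrayList<>();
        for (String model : models) {
            for (Release target : available) {
                List<Release> candidateData = candidates(target, available);
                // Placeholder: select training data from candidateData, train
                // the model, and predict defects for the target release.
                runs.add(model + " -> " + target + " (" + candidateData.size() + " candidate releases)");
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        List<Release> available = List.of(
                new Release("Ant", "1.3"), new Release("Ant", "1.7"),
                new Release("arc", ""), new Release("Xerces", "1.4"));
        List<String> models = List.of("Logistic Regression", "Random Forest");
        plan(models, available).forEach(System.out::println);
    }
}
```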
  9. Key findings

    •  Set-wise selection improves results
    •  Equal weighting improves results
    •  Within-project performance still out of reach
    •  SVM performs best