Building a Second Opinion: Learning Cross-Company Data


by Ekrem Kocaguneli, Bojan Cukic, Tim Menzies and Huihua Lu

Transcript

1. Building a Second Opinion: Learning Cross-Company Data
Ekrem Kocaguneli, Bojan Cukic, Tim Menzies, Huihua Lu
10/09/2013, PROMISE’13
2. Software Effort Estimation (SEE)
• Software effort estimation (SEE) methods are often supervised, i.e. they need data which is:
– Described with independent variables
• A.k.a. descriptive features
• E.g. metrics describing completed software projects, such as analyst capability, function points, lines of code, etc.
– As well as with dependent variables
• A.k.a. labels
• E.g. effort values of past projects
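As a concrete illustration, one labeled training instance could be represented as below (a minimal sketch; the feature names and values are hypothetical, chosen to mirror the examples above):

```python
# One completed project: independent variables (descriptive features)
# plus the dependent variable (label) that supervised SEE requires.
project = {
    "analyst_capability": 4,      # ordinal rating (hypothetical 1-5 scale)
    "function_points": 120,
    "lines_of_code": 15000,
    "effort_person_hours": 400,   # the label: actual effort of the past project
}
```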
3. Cost of Data
• Collecting independent features (metrics) from local data is generally possible
– E.g. via static code analysis tools, questionnaires
• Collecting effort values (labels) may be costly [1]
– Experts frequently have to investigate and collect the effort associated with past projects manually
– Sometimes label information may not even exist
• The lack of (or the high cost of) dependent variables from local projects makes the development of effort models with supervised learning challenging.
4. Cross-company Learning
• Cross-company learning has been proposed as a practical supplement that mitigates local data drought problems
– When local (a.k.a. within) data is absent or missing, cross-company data is used [Turhan et al., Kocaguneli et al.]
• The basic idea is to use another organization’s data (a.k.a. cross data) for training,
• and to use the within data as the test set; a sketch of this setup follows below.
[Figure: within test project(s) and cross data (used as training data) flow into an estimation method, which produces an estimate]
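A minimal sketch of this training/testing split in Python (the 1-NN analogy-based estimator, feature layout, and toy numbers are illustrative assumptions, not the authors' exact method):

```python
import numpy as np

def knn_effort_estimate(cross_X, cross_y, test_X, k=1):
    """Estimate effort for within-company test projects, training only
    on cross-company projects (k nearest neighbours, Euclidean distance)."""
    estimates = []
    for x in test_X:
        dists = np.linalg.norm(cross_X - x, axis=1)  # distance to every cross project
        nearest = np.argsort(dists)[:k]
        estimates.append(cross_y[nearest].mean())    # mean effort of the k most similar
    return np.array(estimates)

# Toy data: features = [function points, team experience]; label = effort
cross_X = np.array([[120.0, 3], [300, 5], [80, 2]])
cross_y = np.array([400.0, 950.0, 250.0])            # person-hours
within_test_X = np.array([[100.0, 3]])
print(knn_effort_estimate(cross_X, cross_y, within_test_X))  # -> [400.]
```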
5. Improving Cross-company Learning
• Using another organization’s irrelevant effort data may lead to questionable performance [Turhan et al.]
– Kitchenham et al. report equal evidence for and against the use of cross-company data
• Relevance filtering has been shown to be effective in defect prediction [Turhan et al.] and effort estimation [Kocaguneli et al.]
– Premise of relevance filtering: learn only from cross-company projects that are similar to the local context; the general idea, and a sketch of one such filter, follow below.
[Figure: general idea of relevance filtering — cross data passes through a filter to become filtered cross data, which, together with the within test project(s), feeds an estimation method that produces an estimate]
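One simple instantiation is a nearest-neighbour relevance filter in the style of Turhan et al. (a sketch under that assumption; the parameter k and the toy data are ours):

```python
import numpy as np

def nn_relevance_filter(cross_X, within_X, k=2):
    """Keep only cross-company rows that are among the k nearest
    neighbours of at least one within-company project."""
    keep = set()
    for x in within_X:
        dists = np.linalg.norm(cross_X - x, axis=1)
        keep.update(int(i) for i in np.argsort(dists)[:k])
    idx = sorted(keep)
    return cross_X[idx], idx

cross = np.array([[1.0, 1], [2, 2], [50, 50], [3, 1]])
within = np.array([[2.0, 1]])
print(nn_relevance_filter(cross, within))  # keeps only the 2 nearest cross rows
```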
6. Challenges with State-of-the-art Transfer Learners
• The lack of human-in-the-loop interpretation of software engineering data
– No second opinion,
– merely an estimate (a single number) is provided
• E.g. it is difficult to interpret estimates coming from another organization’s data
• We need to incorporate the contributions inherent in other, supplementary learning techniques
– E.g. use of semi-supervised learning, when data is only partially labeled (Lu et al.)
– E.g. use of active learning, for finding the essential content of data (Kocaguneli et al.)
7. Expectations from State-of-the-art Transfer Learners
• Competency: The performance should not be worse than that of existing automated supervised methods
• Locality: The model should offer estimates using local data (though it may be limited)
• Succinctness: Interpreting the estimate should not take too much time, i.e. the estimate should come from an amount of data small enough to be manually interpretable by human experts
• Low cost: The second opinion coming from a transfer learner should be easy to implement and automate
8. Different Learning Methods
• Method 1: A cross-company learner called TEAK (cross data → TEAK filter → filtered cross data)
– Pro: Filters cross-company data without performance loss (Competency)
– Con: The resulting filtered cross data with labels is limited in size (few instances)
• Semi-supervised learning can solve the limited-labeled-data problem
– Con: Estimates come from cross data, hence are difficult to interpret (lacks Locality); a loose sketch of the filtering idea follows below.
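TEAK builds a cluster tree over the data and prunes regions whose effort labels have high variance; the following is only a loose, simplified sketch of that variance-pruning idea (the crude sort-based grouping, group count, and keep fraction are our assumptions, not the published algorithm):

```python
import numpy as np

def variance_prune(cross_X, cross_y, n_groups=4, keep_frac=0.5):
    """Group the cross data, measure the variance of effort labels per
    group, and keep only rows from the lowest-variance groups."""
    order = np.argsort(cross_X[:, 0])            # crude grouping: sort on feature 0
    groups = np.array_split(order, n_groups)
    ranked = sorted(groups, key=lambda g: cross_y[g].var())
    n_keep = max(1, int(len(ranked) * keep_frac))
    kept = np.concatenate(ranked[:n_keep])       # rows from the calmest groups
    return cross_X[kept], cross_y[kept]
```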
9. Different Learning Methods (cntd.)
• Method 2: An active learner called QUICK (past within data, without labels → QUICK → essential within data)
– Pro: Unsupervised, i.e. does not need labels
– Pro: Identifies a small subset of the within data, a.k.a. the essential data (is succinct)
– Con: Requires an expert to provide labels once the essential data is found
• Semi-supervised learning can instead use labels coming from cross data; a rough sketch of QUICK's selection steps follows below.
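QUICK prunes redundant ("synonym") features and then keeps only "popular" rows; a very rough sketch of those two unsupervised steps (the correlation threshold and the popularity rule are simplifications of the published algorithm [1]):

```python
import numpy as np

def quick_sketch(X, corr_thresh=0.95):
    """(1) Drop features highly correlated with an already-kept feature;
    (2) keep rows that are the 1-nearest neighbour of at least one other row."""
    corr = np.corrcoef(X.T)                      # feature-by-feature correlations
    keep_f = [0]
    for j in range(1, X.shape[1]):
        if all(abs(corr[j, i]) < corr_thresh for i in keep_f):
            keep_f.append(j)
    Xf = X[:, keep_f]
    popularity = np.zeros(len(Xf), dtype=int)
    for i, x in enumerate(Xf):
        d = np.linalg.norm(Xf - x, axis=1)
        d[i] = np.inf                            # a row cannot vote for itself
        popularity[np.argmin(d)] += 1            # vote for the nearest other row
    keep_r = np.where(popularity > 0)[0]
    return keep_r, keep_f

X = np.array([[1.0, 2.0, 10.0], [2, 4, 11], [3, 6, 30], [3.1, 6.2, 31]])
print(quick_sketch(X))  # feature 1 is dropped as a synonym of feature 0
```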
10. Combining 2 Learning Methods via Semi-supervised Learning
[Figure: (1) cross data → TEAK filter → filtered cross data; (2) past within data (without labels) → QUICK → essential within data; (3) SSL → essential within data with pseudo-labels; (4) estimation method → estimate for the within test project(s)]
– Step 1: The cross data is filtered
– Step 2: The essential rows/columns of the existing within data are found
– Step 3: SSL uses the filtered cross-data labels to generate pseudo-labels for the essential within data
– Step 4: The estimate for the within test data is generated from the local/within training data with pseudo-labels
A schematic of the four steps follows below.
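An end-to-end schematic of the four steps (a minimal sketch: the 1-NN pseudo-labelling, the stand-ins for TEAK and QUICK, and all toy numbers are our assumptions, not the authors' implementation):

```python
import numpy as np

def nn1(train_X, train_y, X):
    """1-NN regression: label each row of X with the effort of its
    nearest neighbour in train_X."""
    return np.array([train_y[np.argmin(np.linalg.norm(train_X - x, axis=1))]
                     for x in X])

cross_X = np.array([[120.0, 3], [300, 5], [80, 2], [150, 4]])
cross_y = np.array([400.0, 950.0, 250.0, 520.0])

# Step 1: filter the cross data (stand-in for TEAK; kept whole here for brevity)
filt_X, filt_y = cross_X, cross_y

# Step 2: essential within rows (stand-in for QUICK's unsupervised selection)
essential_X = np.array([[110.0, 3], [290.0, 5]])

# Step 3: SSL pseudo-labels the essential within data from the filtered cross data
pseudo_y = nn1(filt_X, filt_y, essential_X)

# Step 4: the estimate comes from the pseudo-labelled local data only (Locality)
test_X = np.array([[100.0, 3]])
print(nn1(essential_X, pseudo_y, test_X))  # -> [400.]
```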
11. Calculating the percentage of essential content
• Assume a data set of N rows and F features; the data set then has N*F cells
• Also assume that N’ rows and F’ features are identified to be essential
• Then the percentage of essential data is: (N’*F’)*100/(N*F)
• Example: N = 4 and F = 4 gives 4*4 = 16 cells; with N’ = 2 and F’ = 2, the result is (2*2)*100/(4*4) = 25%
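The same calculation as a one-liner (names mirror the slide's notation):

```python
def essential_pct(n, f, n_sel, f_sel):
    """Percentage of cells (rows x features) identified as essential."""
    return (n_sel * f_sel) * 100.0 / (n * f)

print(essential_pct(4, 4, 2, 2))  # -> 25.0
```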
12. How big is the essential within data as identified by QUICK?
[Table: the data sets we used and their sizes (i.e. N and F)]
13. How big is the essential within data as identified by QUICK?
[Table: selected instances and features (i.e. N’ and F’)]
14. How big is the essential within data as identified by QUICK?
[Figure: percentage of selected data per data set]
• At the median we used 11.54% of the local data
• At most we used 15% of the local data
15. Semi-supervised learning approaches
• Generative
– Assumes a multivariate distribution
• Iterative
• Density based
– Data needs to be naturally divided into clusters
• Graph based
– Points in high-density regions should belong to the same class
A minimal example of the iterative family follows below.
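As a concrete illustration of the iterative family, a minimal self-training loop for regression (illustrative only; the 1-NN pseudo-labelling rule and toy data are our assumptions, not the SSL method evaluated in the paper):

```python
import numpy as np

def self_train(lab_X, lab_y, unlab_X, rounds=3):
    """Iterative self-training: repeatedly pseudo-label the unlabeled row
    closest to the labeled pool, then absorb it into that pool."""
    lab_X, lab_y = lab_X.copy(), lab_y.copy()
    unlab = list(range(len(unlab_X)))
    for _ in range(min(rounds, len(unlab))):
        d = [np.linalg.norm(lab_X - unlab_X[i], axis=1).min() for i in unlab]
        i = unlab.pop(int(np.argmin(d)))                    # most confident row
        y = lab_y[np.argmin(np.linalg.norm(lab_X - unlab_X[i], axis=1))]
        lab_X = np.vstack([lab_X, unlab_X[i]])
        lab_y = np.append(lab_y, y)                         # pseudo-label
    return lab_X, lab_y

lab_X, lab_y = np.array([[1.0, 1], [10, 10]]), np.array([100.0, 900.0])
unlab_X = np.array([[1.5, 1], [9, 10], [5, 5]])
print(self_train(lab_X, lab_y, unlab_X)[1])  # labels propagate outward
```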
16. Comparison of Performance
• The proposed method uses at most 15% of the local data plus the labels of the cross-company data
• We compared its performance to within- and cross-company learners
– The comparison is performed w.r.t. 7 error measures
– For each error measure, win/tie/loss values are calculated w.r.t. the Wilcoxon signed-rank test; a sketch of such a tally follows below.
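Win/tie/loss tallies of this kind are commonly computed as follows (a sketch: scipy's wilcoxon is a standard paired test, but the 0.05 threshold, the "lower median error wins" rule, and the toy numbers are our assumptions, not necessarily the paper's exact setup):

```python
import numpy as np
from scipy.stats import wilcoxon

def win_tie_loss(err_a, err_b, alpha=0.05):
    """'tie' if the Wilcoxon signed-rank test finds no significant
    difference between paired error vectors; else lower median error wins."""
    _, p = wilcoxon(err_a, err_b)
    if p >= alpha:
        return "tie"
    return "win" if np.median(err_a) < np.median(err_b) else "loss"

err_a = [0.10, 0.20, 0.15, 0.30, 0.12, 0.22, 0.18, 0.25]  # proposed method
err_b = [0.41, 0.53, 0.49, 0.67, 0.46, 0.58, 0.54, 0.63]  # baseline
print(win_tie_loss(err_a, err_b))  # -> win
```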
17. Comparison of Performance
• The data sets and the 7 error measures
– The cases where the proposed method loses the majority of the comparisons are highlighted
18. Conclusion: Evaluation of the proposed method
– The proposed method performs at least as well as within and cross learners in the majority of the cases (Competency)
– It has Succinctness (uses at most 15% of the local data)
– The estimates come from local data (Locality), which is given pseudo-labels via semi-supervised learning
– The proposed method is easy to implement, with no need for calibration (Low Cost)
19. Summary
• Importance of the human in the loop
• Feasibility of obtaining predictions using synergistic methodologies and data
• Future work: extension to other prediction problems, performance improvement
20. References
[1] E. Kocaguneli, T. Menzies, J. Keung, D. Cok, and R. Madachy, "Active learning and effort estimation: Finding the essential content of software effort estimation data," IEEE Transactions on Software Engineering, vol. 39, no. 8, pp. 1040-1053, Aug. 2013.
[2] B. A. Kitchenham, E. Mendes, and G. H. Travassos, "Cross versus within-company cost estimation studies: A systematic review," IEEE Transactions on Software Engineering, vol. 33, no. 5, pp. 316-329, 2007.
[3] B. Turhan, T. Menzies, A. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540-578, 2009.
[4] Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Information and Software Technology, vol. 54, no. 3, pp. 248-256, 2012.
[5] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, 2010.
[6] E. Kocaguneli and T. Menzies, "How to find relevant data for effort estimation," in ESEM'11: International Symposium on Empirical Software Engineering and Measurement, 2011.
[7] H. Lu, B. Cukic, and M. Culp, "Software defect prediction using semi-supervised learning with dimension reduction," in ASE 2012, pp. 314-317.