Building a Second Opinion: Learning Cross-Company Data


by Ekrem Kocaguneli, Bojan Cukic, Tim Menzies and Huihua Lu

Transcript

1. Building a Second Opinion: Learning Cross-Company Data
Ekrem Kocaguneli, Bojan Cukic, Tim Menzies, Huihua Lu
10/09/2013, PROMISE’13
2. Software Effort Estimation (SEE)
• Software effort estimation (SEE) methods are often supervised, i.e. they need data which is:
– Described with independent variables
• A.k.a. descriptive features
• E.g. metrics describing completed software projects, such as analyst capability, function points, lines of code, etc.
– As well as with dependent variables
• A.k.a. labels
• E.g. effort values of past projects
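As a concrete illustration, one labeled training instance could be represented as below (a minimal sketch; the feature names and values are hypothetical, chosen to mirror the examples above):

```python
# One completed project: independent variables (descriptive features)
# plus the dependent variable (label) that supervised SEE requires.
project = {
    "analyst_capability": 4,      # ordinal rating (hypothetical 1-5 scale)
    "function_points": 120,
    "lines_of_code": 15000,
    "effort_person_hours": 400,   # the label: actual effort of the past project
}
```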
3. Cost of Data
• Collecting independent features (metrics) from local data is generally possible
– E.g. via static code analysis tools, questionnaires
• Collecting effort values (labels) may be costly [1]
– Experts frequently have to investigate and collect the effort associated with past projects manually
– Sometimes label information may not even exist
• The lack of (or the high cost of) dependent variables from local projects makes the development of effort models with supervised learning challenging.
4. Cross-company Learning
• Cross-company learning has been proposed as a practical supplement that mitigates local data drought problems
– When local (a.k.a. within) data is absent or missing, cross-company data is used [Turhan et al., Kocaguneli et al.]
• The basic idea is to use another organization’s data (a.k.a. cross data) for training,
• and to use the within data as the test set; a sketch of this setup follows below.
[Figure: within test project(s) and cross data (used as training data) flow into an estimation method, which produces an estimate]
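A minimal sketch of this training/testing split in Python (the 1-NN analogy-based estimator, feature layout, and toy numbers are illustrative assumptions, not the authors' exact method):

```python
import numpy as np

def knn_effort_estimate(cross_X, cross_y, test_X, k=1):
    """Estimate effort for within-company test projects, training only
    on cross-company projects (k nearest neighbours, Euclidean distance)."""
    estimates = []
    for x in test_X:
        dists = np.linalg.norm(cross_X - x, axis=1)  # distance to every cross project
        nearest = np.argsort(dists)[:k]
        estimates.append(cross_y[nearest].mean())    # mean effort of the k most similar
    return np.array(estimates)

# Toy data: features = [function points, team experience]; label = effort
cross_X = np.array([[120.0, 3], [300, 5], [80, 2]])
cross_y = np.array([400.0, 950.0, 250.0])            # person-hours
within_test_X = np.array([[100.0, 3]])
print(knn_effort_estimate(cross_X, cross_y, within_test_X))  # -> [400.]
```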
5. Improving Cross-company Learning
• Using another organization’s irrelevant effort data may lead to questionable performance [Turhan et al.]
– Kitchenham et al. report equal evidence for and against the use of cross-company data
• Relevance filtering has been shown to be effective in defect prediction [Turhan et al.] and effort estimation [Kocaguneli et al.]
– Premise of relevance filtering: learn only from cross-company projects that are similar to the local context; the general idea, and a sketch of one such filter, follow below.
[Figure: general idea of relevance filtering — cross data passes through a filter to become filtered cross data, which, together with the within test project(s), feeds an estimation method that produces an estimate]
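One simple instantiation is a nearest-neighbour relevance filter in the style of Turhan et al. (a sketch under that assumption; the parameter k and the toy data are ours):

```python
import numpy as np

def nn_relevance_filter(cross_X, within_X, k=2):
    """Keep only cross-company rows that are among the k nearest
    neighbours of at least one within-company project."""
    keep = set()
    for x in within_X:
        dists = np.linalg.norm(cross_X - x, axis=1)
        keep.update(int(i) for i in np.argsort(dists)[:k])
    idx = sorted(keep)
    return cross_X[idx], idx

cross = np.array([[1.0, 1], [2, 2], [50, 50], [3, 1]])
within = np.array([[2.0, 1]])
print(nn_relevance_filter(cross, within))  # keeps only the 2 nearest cross rows
```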
6. Challenges with State-of-the-art Transfer Learners
• The lack of human-in-the-loop interpretation of software engineering data
– No second opinion,
– merely an estimate (a single number) is provided
• E.g. it is difficult to interpret estimates coming from another organization’s data
• We need to incorporate the contributions inherent in other, supplementary learning techniques
– E.g. use of semi-supervised learning, when data is only partially labeled (Lu et al.)
– E.g. use of active learning, for finding the essential content of data (Kocaguneli et al.)
7. Expectations from State-of-the-art Transfer Learners
• Competency: The performance should not be worse than that of existing automated supervised methods
• Locality: The model should offer estimates using local data (though it may be limited)
• Succinctness: Interpreting the estimate should not take too much time, i.e. the estimate should come from an amount of data small enough to be manually interpretable by human experts
• Low cost: The second opinion coming from a transfer learner should be easy to implement and automate
8. Different Learning Methods
• Method 1: A cross-company learner called TEAK (cross data → TEAK filter → filtered cross data)
– Pro: Filters cross-company data without performance loss (Competency)
– Con: The resulting filtered cross data with labels is limited in size (few instances)
• Semi-supervised learning can solve the limited-labeled-data problem
– Con: Estimates come from cross data, hence are difficult to interpret (lacks Locality); a loose sketch of the filtering idea follows below.
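TEAK builds a cluster tree over the data and prunes regions whose effort labels have high variance; the following is only a loose, simplified sketch of that variance-pruning idea (the crude sort-based grouping, group count, and keep fraction are our assumptions, not the published algorithm):

```python
import numpy as np

def variance_prune(cross_X, cross_y, n_groups=4, keep_frac=0.5):
    """Group the cross data, measure the variance of effort labels per
    group, and keep only rows from the lowest-variance groups."""
    order = np.argsort(cross_X[:, 0])            # crude grouping: sort on feature 0
    groups = np.array_split(order, n_groups)
    ranked = sorted(groups, key=lambda g: cross_y[g].var())
    n_keep = max(1, int(len(ranked) * keep_frac))
    kept = np.concatenate(ranked[:n_keep])       # rows from the calmest groups
    return cross_X[kept], cross_y[kept]
```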
9. Different Learning Methods (cntd.)
• Method 2: An active learner called QUICK (past within data, without labels → QUICK → essential within data)
– Pro: Unsupervised, i.e. does not need labels
– Pro: Identifies a small subset of the within data, a.k.a. the essential data (is succinct)
– Con: Requires an expert to provide labels once the essential data is found
• Semi-supervised learning can instead use labels coming from cross data; a rough sketch of QUICK's selection steps follows below.
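QUICK prunes redundant ("synonym") features and then keeps only "popular" rows; a very rough sketch of those two unsupervised steps (the correlation threshold and the popularity rule are simplifications of the published algorithm [1]):

```python
import numpy as np

def quick_sketch(X, corr_thresh=0.95):
    """(1) Drop features highly correlated with an already-kept feature;
    (2) keep rows that are the 1-nearest neighbour of at least one other row."""
    corr = np.corrcoef(X.T)                      # feature-by-feature correlations
    keep_f = [0]
    for j in range(1, X.shape[1]):
        if all(abs(corr[j, i]) < corr_thresh for i in keep_f):
            keep_f.append(j)
    Xf = X[:, keep_f]
    popularity = np.zeros(len(Xf), dtype=int)
    for i, x in enumerate(Xf):
        d = np.linalg.norm(Xf - x, axis=1)
        d[i] = np.inf                            # a row cannot vote for itself
        popularity[np.argmin(d)] += 1            # vote for the nearest other row
    keep_r = np.where(popularity > 0)[0]
    return keep_r, keep_f

X = np.array([[1.0, 2.0, 10.0], [2, 4, 11], [3, 6, 30], [3.1, 6.2, 31]])
print(quick_sketch(X))  # feature 1 is dropped as a synonym of feature 0
```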
10. Combining 2 Learning Methods via Semi-supervised Learning
[Figure: (1) cross data → TEAK filter → filtered cross data; (2) past within data (without labels) → QUICK → essential within data; (3) SSL → essential within data with pseudo-labels; (4) estimation method → estimate for the within test project(s)]
– Step 1: The cross data is filtered
– Step 2: The essential rows/columns of the existing within data are found
– Step 3: SSL uses the filtered cross-data labels to generate pseudo-labels for the essential within data
– Step 4: The estimate for the within test data is generated from the local/within training data with pseudo-labels
A schematic of the four steps follows below.
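An end-to-end schematic of the four steps (a minimal sketch: the 1-NN pseudo-labelling, the stand-ins for TEAK and QUICK, and all toy numbers are our assumptions, not the authors' implementation):

```python
import numpy as np

def nn1(train_X, train_y, X):
    """1-NN regression: label each row of X with the effort of its
    nearest neighbour in train_X."""
    return np.array([train_y[np.argmin(np.linalg.norm(train_X - x, axis=1))]
                     for x in X])

cross_X = np.array([[120.0, 3], [300, 5], [80, 2], [150, 4]])
cross_y = np.array([400.0, 950.0, 250.0, 520.0])

# Step 1: filter the cross data (stand-in for TEAK; kept whole here for brevity)
filt_X, filt_y = cross_X, cross_y

# Step 2: essential within rows (stand-in for QUICK's unsupervised selection)
essential_X = np.array([[110.0, 3], [290.0, 5]])

# Step 3: SSL pseudo-labels the essential within data from the filtered cross data
pseudo_y = nn1(filt_X, filt_y, essential_X)

# Step 4: the estimate comes from the pseudo-labelled local data only (Locality)
test_X = np.array([[100.0, 3]])
print(nn1(essential_X, pseudo_y, test_X))  # -> [400.]
```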
11. Calculating the percentage of essential content
• Assume a data set of N rows and F features; the data set then has N*F cells
• Also assume that N’ rows and F’ features are identified to be essential
• Then the percentage of essential data is: (N’*F’)*100/(N*F)
• Example: N = 4 and F = 4 gives 4*4 = 16 cells; with N’ = 2 and F’ = 2, the result is (2*2)*100/(4*4) = 25%
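The same calculation as a one-liner (names mirror the slide's notation):

```python
def essential_pct(n, f, n_sel, f_sel):
    """Percentage of cells (rows x features) identified as essential."""
    return (n_sel * f_sel) * 100.0 / (n * f)

print(essential_pct(4, 4, 2, 2))  # -> 25.0
```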
12. How big is the essential within data as identified by QUICK?
[Table: the data sets we used and their sizes (i.e. N and F)]
13. How big is the essential within data as identified by QUICK?
[Table: selected instances and features (i.e. N’ and F’)]
14. How big is the essential within data as identified by QUICK?
[Figure: percentage of selected data per data set]
• At the median we used 11.54% of the local data
• At most we used 15% of the local data
15. Semi-supervised learning approaches
• Generative
– Assumes a multivariate distribution
• Iterative
• Density based
– Data needs to be naturally divided into clusters
• Graph based
– Points in high-density regions should belong to the same class
A minimal example of the iterative family follows below.
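As a concrete illustration of the iterative family, a minimal self-training loop for regression (illustrative only; the 1-NN pseudo-labelling rule and toy data are our assumptions, not the SSL method evaluated in the paper):

```python
import numpy as np

def self_train(lab_X, lab_y, unlab_X, rounds=3):
    """Iterative self-training: repeatedly pseudo-label the unlabeled row
    closest to the labeled pool, then absorb it into that pool."""
    lab_X, lab_y = lab_X.copy(), lab_y.copy()
    unlab = list(range(len(unlab_X)))
    for _ in range(min(rounds, len(unlab))):
        d = [np.linalg.norm(lab_X - unlab_X[i], axis=1).min() for i in unlab]
        i = unlab.pop(int(np.argmin(d)))                    # most confident row
        y = lab_y[np.argmin(np.linalg.norm(lab_X - unlab_X[i], axis=1))]
        lab_X = np.vstack([lab_X, unlab_X[i]])
        lab_y = np.append(lab_y, y)                         # pseudo-label
    return lab_X, lab_y

lab_X, lab_y = np.array([[1.0, 1], [10, 10]]), np.array([100.0, 900.0])
unlab_X = np.array([[1.5, 1], [9, 10], [5, 5]])
print(self_train(lab_X, lab_y, unlab_X)[1])  # labels propagate outward
```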
16. Comparison of Performance
• The proposed method uses at most 15% of the local data plus the labels of the cross-company data
• We compared its performance to within- and cross-company learners
– The comparison is performed w.r.t. 7 error measures
– For each error measure, win/tie/loss values are calculated w.r.t. the Wilcoxon signed-rank test; a sketch of such a tally follows below.
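Win/tie/loss tallies of this kind are commonly computed as follows (a sketch: scipy's wilcoxon is a standard paired test, but the 0.05 threshold, the "lower median error wins" rule, and the toy numbers are our assumptions, not necessarily the paper's exact setup):

```python
import numpy as np
from scipy.stats import wilcoxon

def win_tie_loss(err_a, err_b, alpha=0.05):
    """'tie' if the Wilcoxon signed-rank test finds no significant
    difference between paired error vectors; else lower median error wins."""
    _, p = wilcoxon(err_a, err_b)
    if p >= alpha:
        return "tie"
    return "win" if np.median(err_a) < np.median(err_b) else "loss"

err_a = [0.10, 0.20, 0.15, 0.30, 0.12, 0.22, 0.18, 0.25]  # proposed method
err_b = [0.41, 0.53, 0.49, 0.67, 0.46, 0.58, 0.54, 0.63]  # baseline
print(win_tie_loss(err_a, err_b))  # -> win
```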
17. Comparison of Performance
• The data sets and the 7 error measures
– The cases where the proposed method loses the majority of the comparisons are highlighted
18. Conclusion: Evaluation of the proposed method
– The proposed method performs at least as well as within and cross learners in the majority of the cases (Competency)
– It has Succinctness (uses at most 15% of the local data)
– The estimates come from local data (Locality), which is given pseudo-labels via semi-supervised learning
– The proposed method is easy to implement, with no need for calibration (Low Cost)
19. Summary
• Importance of the human in the loop
• Feasibility of obtaining predictions using synergistic methodologies and data
• Future work: extension to other prediction problems, performance improvement
20. References
[1] E. Kocaguneli, T. Menzies, J. Keung, D. Cok, and R. Madachy, "Active learning and effort estimation: Finding the essential content of software effort estimation data," IEEE Transactions on Software Engineering, vol. 39, no. 8, pp. 1040-1053, Aug. 2013.
[2] B. A. Kitchenham, E. Mendes, and G. H. Travassos, "Cross versus within-company cost estimation studies: A systematic review," IEEE Transactions on Software Engineering, vol. 33, no. 5, pp. 316-329, 2007.
[3] B. Turhan, T. Menzies, A. Bener, and J. Di Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540-578, 2009.
[4] Y. Ma, G. Luo, X. Zeng, and A. Chen, "Transfer learning for cross-company software defect prediction," Information and Software Technology, vol. 54, no. 3, pp. 248-256, 2012.
[5] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, 2010.
[6] E. Kocaguneli and T. Menzies, "How to find relevant data for effort estimation," in ESEM'11: International Symposium on Empirical Software Engineering and Measurement, 2011.
[7] H. Lu, B. Cukic, and M. Culp, "Software defect prediction using semi-supervised learning with dimension reduction," in ASE 2012, pp. 314-317.