An Algorithmic Approach to Missing Data Problem in Modeling Human Aspects in Software Development

by Gul Calikli and Ayse Bener

Presented at PROMISE'13: The 9th International Conference on Predictive Models in Software Engineering


Transcript

  1. An Algorithmic Approach to Missing Data Problem in Modeling Human Aspects in Software Development
     Gul Calikli and Ayse Bener
     Data Science Laboratory, Department of Mechanical and Industrial Engineering, Ryerson University
     {gcalikli, ayse.bener}@ryerson.ca
  2. Roadmap
     }  Background and Previous Research
     }  Current Research Problem
     }  Proposed Solution
     }  Empirical Analysis
        }  Dataset
        }  Methodology
        }  Results
     }  Threats to Validity
     }  Conclusion and Future Work
  3. Background: Defect Prediction Models
     •  Aim to guide managers in making decisions under uncertainty.
     •  Enable efficient allocation of testing resources.
     •  Help to complete software projects on time, within budget and with minimum errors.
     [Figure: model-construction pipeline, from metrics data (NASA metrics data, company metrics data) through metric weighting (direct usage of metrics, InfoGain/PCA, equal weighting, weighted metrics), training the model parameters (Decision Tree, Naïve Bayes, ANN, Linear Discriminant) and validation, to prediction on new metrics data]
  4. Background: Defect Prediction Models
     How can we enhance the performance of defect prediction models?
     }  Algorithms: k-NN, Naïve Bayes, Bayesian Networks, Neural Networks, SVM, Logistic Regression, ...
     }  Data Size: under-sampling outperformed over-sampling; micro-sampling
     }  Data Content:
        •  Product/process-related: design metrics, file dependency graphs, churn metrics, static code metrics, CGBR, organizational metrics (# of developers, developer experience, social interaction networks)
        •  People-related: we have focused on people's thought processes.
  5. Background: Human Cognitive Aspects
     [Figure: human cognitive aspects & thought processes affect software development, which in turn affects the quality of the software product]
     }  People's thought processes and cognitive aspects have a significant impact on software quality, since software is designed, implemented and tested by people.
     }  In our research, we have focused on a specific human cognitive aspect, namely confirmation bias.
  6. Background: Confirmation Bias
     }  Confirmation bias is defined as the tendency of people to seek evidence that verifies a hypothesis rather than evidence that refutes it.
     }  Due to confirmation bias, developers tend to perform unit tests to make their program work rather than to break their code.
     }  At all levels of software testing, we must employ a testing strategy that includes adequate attempts to fail the code, in order to reduce software defect density.
  7. Previous Research¹: Construct Prediction Models
     }  Dataset:

        Dataset    # of Active Files   Defect Rate   # of Developers
        ERP        3199                0.07          6
        Telecom1   826                 0.11          7
        Telecom2   1481                0.03          4
        Telecom3   63                  0.05          10

     }  Confirmation bias measurement:
        •  Interactive Test (based on "Wason's Rule Discovery Task")
        •  Written Test (based on "Wason's Selection Task")

        Written Test Contents:
        Question Type   # of Questions
        Abstract        8
        Thematic        6
        SW Thematic     8
        TOTAL           22

     }  Construction of the Prediction Model:
        }  Algorithm: Naive Bayes
        }  Input data: static code metrics, churn metrics and confirmation bias values
        }  Confirmation bias values are evaluated group-based
        }  Models are constructed for each combination of these metrics
        }  Preprocessing: undersampling
        }  10x10 cross validation
        }  Performance measures: pd, pf, balance

     ¹ G. Calikli and A. Bener, "Influence of Confirmation Biases of Developers on Software Quality: An Empirical Study", Software Quality Journal, 21(2):377-416, 2013.
  8. Previous Research¹: Results
     }  In our previous research, the performance of the prediction models built using confirmation bias was as good as the performance of the models built with static code and churn metrics.

     (CB: confirmation bias metrics, SC: static code metrics; + = included, - = excluded)

     Dataset: Telecom1
     CB  SC  Churn   pd    pf    balance
     -   +   -       0.60  0.41  0.58
     +   -   -       0.66  0.38  0.62
     -   -   +       0.49  0.30  0.55
     +   +   -       0.67  0.33  0.67
     -   +   +       0.57  0.32  0.61
     +   -   +       0.60  0.26  0.62
     +   +   +       0.62  0.28  0.66

     Dataset: Telecom2
     CB  SC  Churn   pd    pf    balance
     -   +   -       0.63  0.33  0.63
     +   -   -       0.60  0.35  0.61
     -   -   +       0.70  0.32  0.64
     +   +   -       0.69  0.29  0.69
     -   +   +       0.68  0.26  0.68
     +   -   +       0.64  0.35  0.62
     +   +   +       0.70  0.32  0.67

     Dataset: Telecom3
     CB  SC  Churn   pd    pf    balance
     -   +   -       0.91  0.08  0.88
     +   -   -       0.93  0.15  0.85
     -   -   +       0.90  0.11  0.86
     +   +   -       0.93  0.21  0.81
     -   +   +       0.83  0.04  0.85
     +   -   +       0.94  0.11  0.88
     +   +   +       0.94  0.10  0.89

     Dataset: ERP
     CB  SC  Churn   pd    pf    balance
     -   +   -       0.72  0.29  0.69
     +   -   -       0.91  0.31  0.74
     -   -   +       0.81  0.38  0.66
     +   +   -       0.93  0.30  0.76
     -   +   +       0.71  0.15  0.74
     +   -   +       0.77  0.27  0.69
     +   +   +       0.93  0.32  0.74

     ¹ G. Calikli and A. Bener, "Influence of Confirmation Biases of Developers on Software Quality: An Empirical Study", Software Quality Journal, 21(2):377-416, 2013.
  9. Current Research Problem
     }  Collecting data (e.g. confirmation bias metrics) through interviews/tests might be challenging:
     }  Tight Schedules: In many cases developers are under tight schedules to rush the code for the next release. Hence they may see the data collection process as a waste of time.
     }  Evaluation Apprehension: Many people are anxious about being evaluated (a threat to construct validity).
     }  Staff Turnover: Some of the reused code may have been developed by developers who have already left the company.
     }  Lack of Motivation: Developers may not see the direct benefit of the data collection process.
     All these result in the "missing data problem".
  10. Methods to Handle Missing Data
      }  Discard incomplete data
         •  Suitable only for a small amount of missing data
      }  Weighting procedures (used in the case of non-response data)
      }  Imputation-based procedures (e.g. hot-deck imputation, mean imputation, regression imputation)
         •  Bias is introduced in the imputed data
      }  Model-based procedures (e.g. Expectation-Maximization algorithms)
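To make the imputation-based procedures above concrete, here is a minimal single-variable sketch of mean and hot-deck imputation (illustrative only; the function names and the use of NumPy are our own, not from the talk):

```python
import numpy as np

def mean_impute(x):
    """Mean imputation: replace each missing value (NaN) with the mean
    of the observed values of the same variable."""
    x = np.asarray(x, dtype=float).copy()
    x[np.isnan(x)] = np.nanmean(x)
    return x

def hot_deck_impute(x, seed=0):
    """Hot-deck imputation: replace each missing value with a randomly
    drawn observed ("donor") value of the same variable."""
    x = np.asarray(x, dtype=float).copy()
    rng = np.random.default_rng(seed)
    miss = np.isnan(x)
    x[miss] = rng.choice(x[~miss], size=miss.sum())
    return x
```

Both shrink the variance of the imputed variable relative to the truth, which is one source of the bias noted above.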
  11. Proposed Solution
      }  Use the Expectation-Maximization (EM) Algorithm to impute missing data.
      }  Why the EM Algorithm?
         }  Proven to be very powerful, leading to highly accurate results.
         }  Conceptually and computationally simple.
      }  In general, the EM Algorithm handles the missing data problem as follows:
         (1) Replace missing values by estimated values.
         (2) Estimate parameters.
         (3) Re-estimate missing values assuming the new parameter estimates are correct.
         (4) Re-estimate parameters, and so forth, iterating until convergence.
  12. Proposed Solution
      }  Use Roweis' EM Algorithm for PCA²
      Model: Y = CX, where Y is the observed data matrix, X the latent (principal component) coordinates, and C the loading matrix.
      Expectation Step (E-Step): X = (C^T C)^{-1} C^T Y
      Maximization Step (M-Step): C = Y X^T (X X^T)^{-1}
      ² Roweis S., "EM Algorithms for PCA and SPCA", in Advances in Neural Information Processing Systems, pp. 626-632, MIT Press, 1998.
  13. Empirical Analysis: Dataset
      }  Dataset ERP belongs to a large-scale company specialized in the development of Enterprise Resource Planning (ERP) software.
      }  Datasets Telecom1, Telecom2 and Telecom3 belong to Turkey's largest wireless telecom operator (GSM) company.
      }  Files are Java and JSP files. Dataset Telecom3 also contains PL/SQL files.

      Dataset    # of Active Files   Defect Rate   # of Developers
      ERP        3199                0.07          6
      Telecom1   826                 0.11          7
      Telecom2   1481                0.03          4
      Telecom3   63                  0.05          10
  14. Empirical Analysis: Methodology
      }  Form 2^N - 2 different missing data configurations (N: total number of developers).
      }  Use EM to handle the missing data.
      }  Build defect prediction models using the imputed data.
      }  Compare the obtained performance results with the performance of prediction models built using the complete data.

      Missing data percentage = (total number of entries with missing confirmation bias metric values) / (total number of entries)
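The enumeration of missing-data configurations above can be sketched as follows (illustrative; function names and the interpretation of the percentage as a share of all entries are our own assumptions):

```python
from itertools import combinations

def missing_configurations(developers):
    """Enumerate the 2^N - 2 missing-data configurations: every
    non-empty, proper subset of developers whose confirmation bias
    metric values are treated as missing."""
    n = len(developers)
    configs = []
    for k in range(1, n):  # subset sizes 1 .. N-1 (excludes empty and full set)
        for subset in combinations(developers, k):
            configs.append(set(subset))
    return configs

def missing_percentage(n_missing_entries, n_total_entries):
    """Share of file entries whose confirmation bias metric values
    are missing, as a percentage."""
    return 100.0 * n_missing_entries / n_total_entries
```

For N = 3 developers this yields 2^3 - 2 = 6 configurations, matching the count in the methodology.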
  15. Empirical Analysis: Results

      Dataset: ERP
      Missing %   pd    pf    balance
      0%          0.91  0.31  0.74
      (0-10]%     0.77  0.29  0.68
      (10-20]%    0.74  0.29  0.67
      (20-30]%    0.72  0.29  0.66
      (30-40]%    0.70  0.31  0.65
      (40-50]%    0.67  0.31  0.64
      (50-60]%    0.62  0.16  0.68
      (60-70]%    0.60  0.20  0.63
      (70-80]%    0.64  0.28  0.58
      (80-90]%    0.76  0.46  0.52

      Dataset: Telecom1
      Missing %   pd    pf    balance
      0%          0.66  0.38  0.62
      (0-10]%     0.65  0.33  0.64
      (10-20]%    0.62  0.32  0.63
      (20-30]%    0.60  0.31  0.63
      (30-40]%    0.58  0.31  0.61
      (40-50]%    0.56  0.31  0.58
      (50-60]%    0.50  0.28  0.54
      (60-70]%    0.46  0.26  0.51
      (70-80]%    0.40  0.22  0.48
      (80-90]%    --    --    --

      }  Decrease in pd values: the imputed form of dataset ERP yields lower pd values compared to those obtained for the dataset's complete form.
      }  Decrease in pf values: lower pf values for missing data percentages > 50%.
      }  The reasons are due to the uneven distribution of workload, leading to mental exhaustion of top developers.
      }  Our results support that using imputation techniques can be useful for removing noise in the data.
  16. Empirical Analysis: Results

      Dataset: Telecom2
      Missing %   pd    pf    balance
      0%          0.60  0.35  0.61
      (0-10]%     --    --    --
      (10-20]%    --    --    --
      (20-30]%    0.69  0.47  0.57
      (30-40]%    --    --    --
      (40-50]%    0.60  0.40  0.57
      (50-60]%    0.66  0.52  0.52
      (60-70]%    0.65  0.50  0.51
      (70-80]%    --    --    --
      (80-90]%    0.53  0.42  0.43

      Dataset: Telecom3
      Missing %   pd    pf    balance
      0%          0.93  0.15  0.85
      (0-10]%     0.94  0.21  0.78
      (10-20]%    0.93  0.22  0.77
      (20-30]%    0.92  0.27  0.72
      (30-40]%    0.91  0.32  0.67
      (40-50]%    0.88  0.25  0.72
      (50-60]%    0.93  0.27  0.72
      (60-70]%    0.89  0.28  0.70
      (70-80]%    0.89  0.34  0.64
      (80-90]%    0.89  0.38  0.60

      }  Decrease in pd values and increase in pf values: in line with our expectations, as information content degrades in the presence of missing data.
      }  Unlike in projects ERP and Telecom1, in projects Telecom2 and Telecom3 the workload is evenly distributed among developers.
      }  Project teams release software every 6 months; hence developers had enough time to take the test and were highly motivated.
  17. Threats to Validity
      }  Construct Validity: To avoid threats to construct validity, we used three popular performance measures from software defect prediction research: pd, pf and balance.
      }  Internal Validity: To avoid threats to internal validity, 10x10 cross validation was performed.
      }  External Validity: To avoid threats to external validity:
         }  We used 4 datasets from 2 different software companies (1 from an ISV specialized in ERP and 3 from a GSM operator/telecommunication company).
         }  Our datasets cover two different software development domains (ERP and telecommunications).
         }  The 3 datasets from the GSM operator were collected from 2 different project groups (dataset Telecom1 comes from the project group that developed software for launching GSM tariffs; Telecom2 and Telecom3 come from the billing and charging system).
  18. Conclusion and Future Work
      }  We proposed an imputation algorithm to fill in missing confirmation bias metric values.
      }  Our empirical results showed that, by using the imputation algorithm with only confirmation bias metrics, we can achieve prediction results as good as those obtained with the other metric sets (static code attributes, churn and history metrics).
      }  Our future directions are to:
         }  include other cognitive bias types, and
         }  investigate different imputation techniques to improve the proposed approach.