An Algorithmic Approach to Missing Data Problem in Modeling Human Aspects in Software Development

by Gul Calikli and Ayse Bener

Presented at PROMISE'13: The 9th International Conference on Predictive Models in Software Engineering


Transcript

  1. An Algorithmic Approach to Missing Data Problem in Modeling Human Aspects in Software Development
     Gul Calikli and Ayse Bener
     Data Science Laboratory, Department of Mechanical and Industrial Engineering, Ryerson University
     {gcalikli, ayse.bener}@ryerson.ca
  2. Roadmap
     }  Background and Previous Research
     }  Current Research Problem
     }  Proposed Solution
     }  Empirical Analysis
        }  Dataset
        }  Methodology
        }  Results
     }  Threats to Validity
     }  Conclusion and Future Work
  3. Background: Defect Prediction Models
     •  Aim to guide managers in making decisions under uncertainty.
     •  Enable efficient allocation of testing resources.
     •  Help to complete software projects on time, within budget and with minimum errors.
     [Figure: model-construction pipeline, from metrics data (NASA metrics data, company metrics data) through metric weighting (direct usage of metrics, InfoGain/PCA, equal weighting, weighted metrics), training the model parameters (Decision Tree, Naïve Bayes, ANN, Linear Discriminant) and validation, to prediction on new metrics data]
  4. Background: Defect Prediction Models
     How can we enhance the performance of defect prediction models?
     }  Algorithms: k-NN, Naïve Bayes, Bayesian Networks, Neural Networks, SVM, Logistic Regression, ...
     }  Data Size: under-sampling outperformed over-sampling; micro-sampling
     }  Data Content:
        •  Product/process-related: design metrics, file dependency graphs, churn metrics, static code metrics, CGBR, organizational metrics (# of developers, developer experience, social interaction networks)
        •  People-related: we have focused on people's thought processes.
  5. Background: Human Cognitive Aspects
     [Figure: human cognitive aspects & thought processes affect software development, which in turn affects the quality of the software product]
     }  People's thought processes and cognitive aspects have a significant impact on software quality, since software is designed, implemented and tested by people.
     }  In our research, we have focused on a specific human cognitive aspect, namely confirmation bias.
  6. Background: Confirmation Bias
     }  Confirmation bias is defined as the tendency of people to seek evidence that verifies a hypothesis rather than evidence that refutes it.
     }  Due to confirmation bias, developers tend to perform unit tests to make their program work rather than to break their code.
     }  At all levels of software testing, we must employ a testing strategy that includes adequate attempts to fail the code, in order to reduce software defect density.
  7. Previous Research¹: Construct Prediction Models
     }  Dataset:

        Dataset    # of Active Files   Defect Rate   # of Developers
        ERP        3199                0.07          6
        Telecom1   826                 0.11          7
        Telecom2   1481                0.03          4
        Telecom3   63                  0.05          10

     }  Confirmation bias measurement:
        •  Interactive Test (based on "Wason's Rule Discovery Task")
        •  Written Test (based on "Wason's Selection Task")

        Written Test Contents:
        Question Type   # of Questions
        Abstract        8
        Thematic        6
        SW Thematic     8
        TOTAL           22

     }  Construction of the Prediction Model:
        }  Algorithm: Naive Bayes
        }  Input data: static code metrics, churn metrics and confirmation bias values
        }  Confirmation bias values are evaluated group-based
        }  Models are constructed for each combination of these metrics
        }  Preprocessing: undersampling
        }  10x10 cross validation
        }  Performance measures: pd, pf, balance

     ¹ G. Calikli and A. Bener, "Influence of Confirmation Biases of Developers on Software Quality: An Empirical Study", Software Quality Journal, 21(2):377-416, 2013.
  8. Previous Research¹: Results
     }  In our previous research, the performance of the prediction models built using confirmation bias was as good as the performance of the models built with static code and churn metrics.

     (CB: confirmation bias metrics, SC: static code metrics; + = included, - = excluded)

     Dataset: Telecom1
     CB  SC  Churn   pd    pf    balance
     -   +   -       0.60  0.41  0.58
     +   -   -       0.66  0.38  0.62
     -   -   +       0.49  0.30  0.55
     +   +   -       0.67  0.33  0.67
     -   +   +       0.57  0.32  0.61
     +   -   +       0.60  0.26  0.62
     +   +   +       0.62  0.28  0.66

     Dataset: Telecom2
     CB  SC  Churn   pd    pf    balance
     -   +   -       0.63  0.33  0.63
     +   -   -       0.60  0.35  0.61
     -   -   +       0.70  0.32  0.64
     +   +   -       0.69  0.29  0.69
     -   +   +       0.68  0.26  0.68
     +   -   +       0.64  0.35  0.62
     +   +   +       0.70  0.32  0.67

     Dataset: Telecom3
     CB  SC  Churn   pd    pf    balance
     -   +   -       0.91  0.08  0.88
     +   -   -       0.93  0.15  0.85
     -   -   +       0.90  0.11  0.86
     +   +   -       0.93  0.21  0.81
     -   +   +       0.83  0.04  0.85
     +   -   +       0.94  0.11  0.88
     +   +   +       0.94  0.10  0.89

     Dataset: ERP
     CB  SC  Churn   pd    pf    balance
     -   +   -       0.72  0.29  0.69
     +   -   -       0.91  0.31  0.74
     -   -   +       0.81  0.38  0.66
     +   +   -       0.93  0.30  0.76
     -   +   +       0.71  0.15  0.74
     +   -   +       0.77  0.27  0.69
     +   +   +       0.93  0.32  0.74

     ¹ G. Calikli and A. Bener, "Influence of Confirmation Biases of Developers on Software Quality: An Empirical Study", Software Quality Journal, 21(2):377-416, 2013.
  9. Current Research Problem
     }  Collecting data (e.g. confirmation bias metrics) through interviews/tests might be challenging:
     }  Tight Schedules: In many cases developers are under tight schedules to rush the code for the next release. Hence they may see the data collection process as a waste of time.
     }  Evaluation Apprehension: Many people are anxious about being evaluated (a threat to construct validity).
     }  Staff Turnover: Some of the reused code may have been developed by developers who have already left the company.
     }  Lack of Motivation: Developers may not see the direct benefit of the data collection process.
     All these result in the "missing data problem".
  10. Methods to Handle Missing Data
      }  Discard incomplete data
         •  Suitable only for a small amount of missing data
      }  Weighting procedures (used in the case of non-response data)
      }  Imputation-based procedures (e.g. hot-deck imputation, mean imputation, regression imputation)
         •  Bias is introduced in the imputed data
      }  Model-based procedures (e.g. Expectation-Maximization algorithms)
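To make the imputation-based procedures above concrete, here is a minimal single-variable sketch of mean and hot-deck imputation (illustrative only; the function names and the use of NumPy are our own, not from the talk):

```python
import numpy as np

def mean_impute(x):
    """Mean imputation: replace each missing value (NaN) with the mean
    of the observed values of the same variable."""
    x = np.asarray(x, dtype=float).copy()
    x[np.isnan(x)] = np.nanmean(x)
    return x

def hot_deck_impute(x, seed=0):
    """Hot-deck imputation: replace each missing value with a randomly
    drawn observed ("donor") value of the same variable."""
    x = np.asarray(x, dtype=float).copy()
    rng = np.random.default_rng(seed)
    miss = np.isnan(x)
    x[miss] = rng.choice(x[~miss], size=miss.sum())
    return x
```

Both shrink the variance of the imputed variable relative to the truth, which is one source of the bias noted above.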
  11. Proposed Solution
      }  Use the Expectation-Maximization (EM) Algorithm to impute missing data.
      }  Why the EM Algorithm?
         }  Proven to be very powerful, leading to highly accurate results.
         }  Conceptually and computationally simple.
      }  In general, the EM Algorithm handles the missing data problem as follows:
         (1) Replace missing values by estimated values.
         (2) Estimate parameters.
         (3) Re-estimate missing values assuming the new parameter estimates are correct.
         (4) Re-estimate parameters, and so forth, iterating until convergence.
  12. Proposed Solution
      }  Use Roweis' EM Algorithm for PCA²
      Model: Y = CX, where Y is the observed data matrix, X the latent (principal component) coordinates, and C the loading matrix.
      Expectation Step (E-Step): X = (C^T C)^{-1} C^T Y
      Maximization Step (M-Step): C = Y X^T (X X^T)^{-1}
      ² Roweis S., "EM Algorithms for PCA and SPCA", in Advances in Neural Information Processing Systems, pp. 626-632, MIT Press, 1998.
  13. Empirical Analysis: Dataset
      }  Dataset ERP belongs to a large-scale company specialized in the development of Enterprise Resource Planning (ERP) software.
      }  Datasets Telecom1, Telecom2 and Telecom3 belong to Turkey's largest wireless telecom operator (GSM) company.
      }  Files are Java and JSP files. Dataset Telecom3 also contains PL/SQL files.

      Dataset    # of Active Files   Defect Rate   # of Developers
      ERP        3199                0.07          6
      Telecom1   826                 0.11          7
      Telecom2   1481                0.03          4
      Telecom3   63                  0.05          10
  14. Empirical Analysis: Methodology
      }  Form 2^N - 2 different missing data configurations (N: total number of developers).
      }  Use EM to handle the missing data.
      }  Build defect prediction models using the imputed data.
      }  Compare the obtained performance results with the performance of prediction models built using the complete data.

      Missing data percentage = (total number of entries with missing confirmation bias metric values) / (total number of entries)
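The enumeration of missing-data configurations above can be sketched as follows (illustrative; function names and the interpretation of the percentage as a share of all entries are our own assumptions):

```python
from itertools import combinations

def missing_configurations(developers):
    """Enumerate the 2^N - 2 missing-data configurations: every
    non-empty, proper subset of developers whose confirmation bias
    metric values are treated as missing."""
    n = len(developers)
    configs = []
    for k in range(1, n):  # subset sizes 1 .. N-1 (excludes empty and full set)
        for subset in combinations(developers, k):
            configs.append(set(subset))
    return configs

def missing_percentage(n_missing_entries, n_total_entries):
    """Share of file entries whose confirmation bias metric values
    are missing, as a percentage."""
    return 100.0 * n_missing_entries / n_total_entries
```

For N = 3 developers this yields 2^3 - 2 = 6 configurations, matching the count in the methodology.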
  15. Empirical Analysis: Results

      Dataset: ERP
      Missing %   pd    pf    balance
      0%          0.91  0.31  0.74
      (0-10]%     0.77  0.29  0.68
      (10-20]%    0.74  0.29  0.67
      (20-30]%    0.72  0.29  0.66
      (30-40]%    0.70  0.31  0.65
      (40-50]%    0.67  0.31  0.64
      (50-60]%    0.62  0.16  0.68
      (60-70]%    0.60  0.20  0.63
      (70-80]%    0.64  0.28  0.58
      (80-90]%    0.76  0.46  0.52

      Dataset: Telecom1
      Missing %   pd    pf    balance
      0%          0.66  0.38  0.62
      (0-10]%     0.65  0.33  0.64
      (10-20]%    0.62  0.32  0.63
      (20-30]%    0.60  0.31  0.63
      (30-40]%    0.58  0.31  0.61
      (40-50]%    0.56  0.31  0.58
      (50-60]%    0.50  0.28  0.54
      (60-70]%    0.46  0.26  0.51
      (70-80]%    0.40  0.22  0.48
      (80-90]%    --    --    --

      }  Decrease in pd values: the imputed form of dataset ERP yields lower pd values compared to those obtained for the dataset's complete form.
      }  Decrease in pf values: lower pf values for missing data percentages > 50%.
      }  The reasons are due to the uneven distribution of workload, leading to mental exhaustion of top developers.
      }  Our results support that using imputation techniques can be useful for removing noise in the data.
  16. Empirical Analysis: Results

      Dataset: Telecom2
      Missing %   pd    pf    balance
      0%          0.60  0.35  0.61
      (0-10]%     --    --    --
      (10-20]%    --    --    --
      (20-30]%    0.69  0.47  0.57
      (30-40]%    --    --    --
      (40-50]%    0.60  0.40  0.57
      (50-60]%    0.66  0.52  0.52
      (60-70]%    0.65  0.50  0.51
      (70-80]%    --    --    --
      (80-90]%    0.53  0.42  0.43

      Dataset: Telecom3
      Missing %   pd    pf    balance
      0%          0.93  0.15  0.85
      (0-10]%     0.94  0.21  0.78
      (10-20]%    0.93  0.22  0.77
      (20-30]%    0.92  0.27  0.72
      (30-40]%    0.91  0.32  0.67
      (40-50]%    0.88  0.25  0.72
      (50-60]%    0.93  0.27  0.72
      (60-70]%    0.89  0.28  0.70
      (70-80]%    0.89  0.34  0.64
      (80-90]%    0.89  0.38  0.60

      }  Decrease in pd values and increase in pf values: in line with our expectations, as information content degrades in the presence of missing data.
      }  Unlike in projects ERP and Telecom1, in projects Telecom2 and Telecom3 the workload is evenly distributed among developers.
      }  Project teams release software every 6 months; hence developers had enough time to take the test and were highly motivated.
  17. Threats to Validity
      }  Construct Validity: To avoid threats to construct validity, we used three popular performance measures from software defect prediction research: pd, pf and balance.
      }  Internal Validity: To avoid threats to internal validity, 10x10 cross validation was performed.
      }  External Validity: To avoid threats to external validity:
         }  We used 4 datasets from 2 different software companies (1 from an ISV specialized in ERP and 3 from a GSM operator/telecommunication company).
         }  Our datasets cover two different software development domains (ERP and telecommunications).
         }  The 3 datasets from the GSM operator were collected from 2 different project groups (dataset Telecom1 comes from the project group that developed software for launching GSM tariffs; Telecom2 and Telecom3 come from the billing and charging system).
  18. Conclusion and Future Work
      }  We proposed an imputation algorithm to fill in missing confirmation bias metric values.
      }  Our empirical results showed that, by using the imputation algorithm with only confirmation bias metrics, we can achieve prediction results as good as those obtained with the other metric sets (static code attributes, churn and history metrics).
      }  Our future directions are to:
         }  include other cognitive bias types, and
         }  investigate different imputation techniques to improve the proposed approach.