Machine Learning and Value Generation in Software Development: A Survey

Barakat Akinsanya, Luiz Araujo, Mariia Charikova, Susanna Gimaeva, Alexandr Grichshenko, Adil Khan, Manuel Mazzara, Ozioma Okonicha and Daniil Shilintsev, Innopolis University

Software development challenges
• Inaccurate programming effort prediction
• Poor code
• External risks

Presentation outline
❏ Machine learning and its potential usage
❏ Predicting risks to a project
❏ Predicting programming effort
❏ Predicting software defects
❏ Discussion
❏ Conclusion and future research

Machine learning

• A subfield of artificial intelligence
• Mathematical models identify patterns in the input data and draw conclusions from that data
• Steadily gaining popularity

Predicting project risks

Types of risks
• Budget
• Management
• Schedule
• Technical
(Hu et al. Software project risk management modeling with neural network and support vector machine approaches. (2007))

Budget
Regression techniques
All of the learning algorithms used in the research showed similar prediction performance.
(Ceylan et al. Software defect identification using machine learning techniques. (2006))

Management
Multiple logistic regression
Helps point out the risk factors.
(Christiansen et al. Prediction of risk factors of software development project by using multiple logistic regression. (2015))
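
One common way a logistic regression "points out" risk factors is through odds ratios: the exponential of each fitted coefficient. The sketch below illustrates that idea only; the intercept, the factor names, and the coefficient values are hypothetical and are not taken from the cited paper.

```python
import math

# Hypothetical fitted coefficients (assumptions for illustration only).
INTERCEPT = -2.0
COEFFICIENTS = {
    "unclear_requirements": 1.2,
    "high_staff_turnover": 0.8,
    "experienced_manager": -0.5,
}

def failure_probability(factors):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = INTERCEPT + sum(COEFFICIENTS[name] * value for name, value in factors.items())
    return 1.0 / (1.0 + math.exp(-z))

def odds_ratios():
    """exp(b_i): multiplicative change in the odds when factor i is present."""
    return {name: math.exp(beta) for name, beta in COEFFICIENTS.items()}

# List factors from most risk-increasing to most risk-reducing.
for name, ratio in sorted(odds_ratios().items(), key=lambda item: -item[1]):
    print(f"{name}: odds ratio {ratio:.2f}")
```

An odds ratio above 1 marks a factor that raises the odds of the modelled outcome; a ratio below 1 marks one that lowers them.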

Predicting programming effort

Metrics
• Lines of code
• Function points
• Use case points
• Labour hours
• Story points

Techniques
1. Expert estimation
2. Logical statistical models
3. Classical ML models
4. Deep learning models

Expert estimation
Planning Poker
• Considerable human bias
• Overestimates in 40% of instances
• Very high Mean Magnitude of Relative Error (MMRE) of 106.8%
(Moharreri et al. (2016))
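
For reference, MMRE is the mean over all estimated items of |actual - estimated| / actual. A minimal sketch of the calculation follows; the effort values are hypothetical and are not data from the cited study.

```python
def mmre(actual, predicted):
    """Mean Magnitude of Relative Error: mean of |actual - predicted| / actual."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical effort values in hours (assumptions for illustration only).
actual_hours = [10.0, 20.0, 40.0]
estimated_hours = [15.0, 30.0, 20.0]
print(f"MMRE = {mmre(actual_hours, estimated_hours):.1%}")  # MMRE = 50.0%
```

An MMRE above 100% therefore means that, on average, an estimate misses by more than the actual effort itself.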

Logical statistical models
• Constructive Cost Model (COCOMO)
• Software Lifecycle Management (SLIM)
• Function Points
Inconsistent performance due to the noisy nature of datasets
(Azzeh, M.: Software effort estimation based on optimized model tree. (2011))
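
As an illustration of what a parametric model of this kind looks like, here is a minimal sketch of Basic COCOMO, which estimates effort as a * KLOC^b. The slide names COCOMO only in general; choosing the basic variant is an assumption made here, and the function name is hypothetical.

```python
# Basic COCOMO coefficients (a, b) per project class, as published by Boehm (1981).
COEFFICIENTS = {
    "organic": (2.4, 1.05),
    "semi-detached": (3.0, 1.12),
    "embedded": (3.6, 1.20),
}

def basic_cocomo_effort(kloc, mode="organic"):
    """Estimated effort in person-months for a size given in thousands of lines of code."""
    a, b = COEFFICIENTS[mode]
    return a * kloc ** b

# Hypothetical 10 KLOC project evaluated under each project class.
for mode in COEFFICIENTS:
    print(f"{mode}: {basic_cocomo_effort(10.0, mode):.1f} person-months")
```

Because the coefficients are fixed constants, the estimate depends entirely on the size input, which is one reason such models can behave inconsistently on noisy data.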

Classical ML models
1. Case-based reasoning (more suitable if data is limited)
2. Decision trees
Highly interpretable; superior to, or at least competitive with, parametric methods
(Wen et al. Systematic literature review of machine learning based software development effort estimation models. (2012))
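
Case-based reasoning estimates a new project from the most similar past projects. The sketch below shows one simple form of it (averaging the effort of the k nearest cases by Euclidean distance); the case base, the feature choice, and the function name are hypothetical, and a real system would normally scale the features first.

```python
import math

def estimate_by_analogy(past_projects, new_features, k=2):
    """Average the effort of the k past projects closest to the new one."""
    ranked = sorted(past_projects, key=lambda p: math.dist(p[0], new_features))
    nearest = ranked[:k]
    return sum(effort for _, effort in nearest) / len(nearest)

# Hypothetical case base: ((size in KLOC, team size), effort in person-months).
case_base = [
    ((10.0, 3.0), 24.0),
    ((12.0, 4.0), 30.0),
    ((50.0, 10.0), 160.0),
]
print(estimate_by_analogy(case_base, (11.0, 3.0)))  # 27.0
```

Even a handful of past cases yields an estimate, which is consistent with the approach being suggested for limited data.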

Deep learning models
• Noise tolerance
• High parallelism
• Generalisation capabilities
Outperformed Regression Tree, KNN and Regression Analysis.
(Kim et al. A comparison of techniques for software development effort estimating. (2005))

Predicting software defects

Metrics
• Lines of code
• Weighted methods for class
• Coupling between objects
• Response for class
• Branch count

Comparison of models w.r.t. dataset size
Large datasets: Random Forest
Small datasets: Naive Bayes
(Catal et al. Investigating the effect of dataset size, metrics sets, and feature selection techniques on software fault prediction problem. (2009))

Comparison of models w.r.t. metrics
Method-level metrics: Random Forest classifier
Class-level metrics: SVM
(Shanthini et al. Applying machine learning for fault prediction using software metrics. (2012))

Results
Comparison between Random Forest, AdaBoost, Bagging, Multilayer Perceptron and Genetic Programming
Best results: Random Forest and Bagging algorithms
(Malhotra et al. Fault prediction using statistical and machine learning methods for improving software quality. (2012))

Summary
Best fault prediction models: Random Forest and SVM
Class-level metrics show better prediction performance than method-level metrics.
(Karim et al. Software metrics for fault prediction using machine learning approaches: A literature review with PROMISE repository dataset. (2017))

Discussion

Models
Widely used models: case-based reasoning, neural networks
ML models are gaining popularity in the academic community. In industry, however, these models are not used as frequently as their reported performance would suggest.

Limitations
• Lack of large software datasets
• Imbalance of datasets
• Outdated datasets
• Lack of a unified, shared dataset
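
To see why dataset imbalance is a limitation, consider a defect dataset in which very few modules are defective. The sketch below uses hypothetical labels and a deliberately useless "model" that marks every module as clean; it is an illustration, not a result from any of the cited studies.

```python
def accuracy(actual, predicted):
    """Share of modules whose label was predicted correctly."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

def recall(actual, predicted):
    """Share of truly defective modules (label 1) that were flagged."""
    defective = sum(actual)
    found = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    return found / defective if defective else 0.0

# Hypothetical dataset: 100 modules, only 5 of them defective (label 1).
actual = [1] * 5 + [0] * 95
# A useless "model" that labels every module as clean.
predicted = [0] * 100

print(f"accuracy = {accuracy(actual, predicted):.2f}")  # accuracy = 0.95
print(f"recall   = {recall(actual, predicted):.2f}")    # recall   = 0.00
```

The model reaches 95% accuracy while finding no defects at all, so results on imbalanced data are better judged with recall or similar measures than with accuracy alone.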

Conclusion and future research

Conclusion
ML methods have made considerable progress in the field over the last decades.

Future research and recommendations
• Reinforcement learning
• Convolutional NN
• Recurrent NN
• Larger, more representative datasets
• Closer interaction between academic and industrial communities

Thank you for your attention!

Contacts Email: [email protected]
