Slide 1

Slide 1 text

Machine Learning and value generation in Software Development: a survey
Barakat Akinsanya, Luiz Araujo, Mariia Charikova, Susanna Gimaeva, Alexandr Grichshenko, Adil Khan, Manuel Mazzara, Ozioma Okonicha and Daniil Shilintsev
Innopolis University

Slide 2

Slide 2 text

Software Development challenges
● Inaccurate programming effort prediction
● Poor code
● External risks

Slide 3

Slide 3 text

Presentation outline
❏ Machine learning and its potential usage
❏ Predicting risks to a project
❏ Predicting programming effort
❏ Predicting software defects
❏ Discussion
❏ Conclusion and future research

Slide 4

Slide 4 text

Machine learning

Slide 5

Slide 5 text

Machine learning
● Subfield of artificial intelligence
● Mathematical models identify patterns in the input data and draw conclusions from that data
● Growing steadily in popularity

Slide 6

Slide 6 text

Predicting project risks

Slide 7

Slide 7 text

Types of risks
● Budget
● Management
● Schedule
● Technical
(Hu et al., Software project risk management modeling with neural network and support vector machine approaches, 2007)

Slide 8

Slide 8 text

Budget
Regression techniques
All of the learning algorithms used in the research show similar prediction performance.
(Ceylan et al., Software defect identification using machine learning techniques, 2006)

Slide 9

Slide 9 text

Management
Multiple Logistic Regression
Helps point out the risk factors.
(Christiansen et al., Prediction of risk factors of software development project by using multiple logistic regression, 2015)
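
To illustrate how a multiple logistic regression can point out risk factors, here is a minimal sketch in plain Python. It is not the method of the cited study: the two binary factors ("unclear requirements", "distributed team"), the eight project records, and the function name `fit_logistic` are all hypothetical, and the model is fitted with simple batch gradient descent. A large positive weight suggests that the factor raises the odds of a troubled project; a weight near zero suggests it does not.

```python
import math

def fit_logistic(X, y, lr=0.5, epochs=5000):
    """Fit a multiple logistic regression with batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus label
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    return w, b

# Hypothetical project records: [unclear_requirements, distributed_team]
X = [[1, 0], [1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0], [0, 1]]
# Hypothetical outcome: 1 = project ran into trouble, 0 = it did not
y = [1, 1, 1, 0, 0, 0, 0, 1]

weights, bias = fit_logistic(X, y)
odds_ratios = [math.exp(wj) for wj in weights]  # odds multiplier per factor
```

In this made-up data, the first factor receives a clearly positive weight while the second stays close to zero, which is the kind of signal a project manager could use to rank risk factors.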

Slide 10

Slide 10 text

Predicting programming effort

Slide 11

Slide 11 text

Metrics
● Lines of code
● Function points
● Use case points
● Labour hours
● Story points

Slide 12

Slide 12 text

Techniques
1. Expert estimation
2. Logical statistical models
3. Classical ML models
4. Deep learning models

Slide 13

Slide 13 text

Expert estimation
● Planning Poker
● Considerable human bias
● Overestimates in 40% of instances
● Very high Mean Magnitude of Relative Error (MMRE) score of 106.8%
(Moharreri et al., 2016)
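
For readers unfamiliar with the MMRE figure quoted above, the following sketch shows how the metric is usually computed: the relative error |actual - estimate| / actual is taken per task and then averaged. The function name `mmre` and the three sample tasks are hypothetical and are not data from the cited study.

```python
def mmre(actual, predicted):
    """Mean Magnitude of Relative Error over paired actual/predicted efforts."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    relative_errors = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return sum(relative_errors) / len(relative_errors)

# Hypothetical tasks: actual effort vs. expert estimate, in hours
actual_hours = [10.0, 20.0, 40.0]
estimated_hours = [15.0, 10.0, 80.0]

score = mmre(actual_hours, estimated_hours)  # relative errors: 0.5, 0.5, 1.0
```

An MMRE above 100% therefore means that, on average, the estimate misses the actual effort by more than the actual effort itself.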

Slide 14

Slide 14 text

Logical statistical models
● Constructive Cost Model (COCOMO)
● Software Lifecycle Management (SLIM)
● Function Points
Inconsistent performance due to noisy nature of datasets
(Azzeh, Software effort estimation based on optimized model tree, 2011)
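
As an illustration of what such a parametric model looks like, here is a sketch of Basic COCOMO, which estimates effort in person-months as a * KLOC^b. The slide does not say which COCOMO variant the survey considers, so the choice of the basic form is an assumption; the coefficients are the standard published ones for the three project modes, and the function name is hypothetical.

```python
# Basic COCOMO coefficients (a, b) per project mode
COEFFICIENTS = {
    "organic": (2.4, 1.05),
    "semi-detached": (3.0, 1.12),
    "embedded": (3.6, 1.20),
}

def basic_cocomo_effort(kloc, mode="organic"):
    """Estimated effort in person-months for a project of `kloc` thousand lines."""
    if kloc <= 0:
        raise ValueError("kloc must be positive")
    a, b = COEFFICIENTS[mode]
    return a * kloc ** b

# Hypothetical 32 KLOC project evaluated under each mode
efforts = {mode: basic_cocomo_effort(32.0, mode) for mode in COEFFICIENTS}
```

Because the estimate depends on a single size input and fixed coefficients, noisy or unrepresentative size data translates directly into unstable predictions.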

Slide 15

Slide 15 text

Classical ML models
1. Case-based reasoning (more suitable if data is limited)
2. Decision trees
Highly interpretable, superior to or at least competitive with parametric methods
(Wen et al., Systematic literature review of machine learning based software development effort estimation models, 2012)
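
Case-based reasoning can be sketched as estimation by analogy: find the past projects most similar to the new one and reuse their recorded effort. The example below is one simple way to do this, not the procedure of any study cited here. The features (size in KLOC, team size), the four historical cases, and the function name `estimate_by_analogy` are hypothetical; similarity is Euclidean distance on min-max scaled features.

```python
import math

def estimate_by_analogy(history, query, k=2):
    """Average the effort of the k past projects closest to `query`.

    history: list of (features, effort) pairs; query: list of features.
    """
    n_features = len(query)
    lows = [min(f[i] for f, _ in history) for i in range(n_features)]
    highs = [max(f[i] for f, _ in history) for i in range(n_features)]

    def scale(features):
        # Min-max scale using the range seen in the historical cases
        return [(x - lo) / (hi - lo) if hi > lo else 0.0
                for x, lo, hi in zip(features, lows, highs)]

    target = scale(query)
    ranked = sorted(history, key=lambda case: math.dist(scale(case[0]), target))
    nearest = ranked[:k]
    return sum(effort for _, effort in nearest) / len(nearest)

# Hypothetical past projects: ([size in KLOC, team size], effort in person-hours)
history = [
    ([10, 2], 100.0),
    ([12, 3], 120.0),
    ([50, 8], 600.0),
    ([55, 9], 650.0),
]

estimate = estimate_by_analogy(history, [11, 2], k=2)
```

The approach needs no training step and works with only a handful of stored cases, which fits the slide's remark that it suits limited data.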

Slide 16

Slide 16 text

Deep learning models
● Noise tolerance
● High parallelism
● Generalisation capabilities
Outperformed Regression Tree, KNN and Regression Analysis.
(Kim et al., A comparison of techniques for software development effort estimating, 2005)

Slide 17

Slide 17 text

Predicting software defects

Slide 18

Slide 18 text

Metrics
● Lines of code
● Weighted methods for class
● Coupling between objects
● Response for class
● Branch count

Slide 19

Slide 19 text

Comparison of models w.r.t. dataset size
● Large datasets: Random Forest
● Small datasets: Naive Bayes
(Catal et al., Investigating the effect of dataset size, metrics sets, and feature selection techniques on software fault prediction problem, 2009)

Slide 20

Slide 20 text

Comparison of models w.r.t. metrics
● Method-level metrics: Random Forest classifier
● Class-level metrics: SVM
(Shanthini et al., Applying machine learning for fault prediction using software metrics, 2012)

Slide 21

Slide 21 text

Results
Comparison between Random Forest, AdaBoost, Bagging, Multilayer Perceptron and Genetic Programming
Best results: Random Forest and Bagging algorithms
(Malhotra et al., Fault prediction using statistical and machine learning methods for improving software quality, 2012)

Slide 22

Slide 22 text

Summary
● Best fault prediction models: Random Forest and SVM
● Class-level metrics show better prediction performance than method-level metrics.
(Karim et al., Software metrics for fault prediction using machine learning approaches: A literature review with PROMISE repository dataset, 2017)

Slide 23

Slide 23 text

Discussion

Slide 24

Slide 24 text

Models
● Widely used models: case-based reasoning, neural networks
● ML models are gaining popularity in the academic community.
● In industry, however, these models are used less often than their reported performance would suggest.

Slide 25

Slide 25 text

Limitations
● Lack of large software datasets
● Imbalance of datasets
● Outdated datasets
● Lack of a united and shared dataset

Slide 26

Slide 26 text

Conclusion and future research

Slide 27

Slide 27 text

Conclusion
ML methods have made considerable progress in the field over the last decades.

Slide 28

Slide 28 text

Future research and recommendations
● Reinforcement learning
● Convolutional NN
● Recurrent NN
● Larger, more representative datasets
● Closer interaction between academic and industrial communities

Slide 29

Slide 29 text

Thank you for your attention!

Slide 30

Slide 30 text

Contacts Email: [email protected]