Machine Learning and Value Generation in Software Development: A Survey

Barakat Akinsanya, Luiz Araujo, Mariia Charikova, Susanna Gimaeva, Alexandr Grichshenko, Adil Khan, Manuel Mazzara, Ozioma Okonicha and Daniil Shilintsev

International Conference on Software Testing, Machine Learning and Complex Process Analysis (TMPA-2019)
7-9 November 2019, Tbilisi

Video: https://youtu.be/mpjjDqNOx8Q

TMPA Conference website https://tmpaconf.org/
TMPA Conference on Facebook https://www.facebook.com/groups/tmpaconf/



November 07, 2019


  1. Machine Learning and value generation in Software Development: a survey

    Barakat Akinsanya, Luiz Araujo, Mariia Charikova, Susanna Gimaeva, Alexandr Grichshenko, Adil Khan, Manuel Mazzara, Ozioma Okonicha and Daniil Shilintsev, Innopolis University
  2. Software Development challenges

    • Inaccurate programming effort prediction
    • Poor code
    • External risks
  3. Presentation outline

    ❏ Machine learning and its potential usage
    ❏ Predicting risks to a project
    ❏ Predicting programming effort
    ❏ Predicting software defects
    ❏ Discussion
    ❏ Conclusion and future research
  4. Machine learning

  5. Machine learning

    • Subfield of artificial intelligence
    • Mathematical models identify patterns in the input data and draw conclusions from that data
    • Steadily gaining popularity
  6. Predicting project risks

  7. Types of risks

    • Budget
    • Management
    • Schedule
    • Technical
    (Hu et al. Software project risk management modeling with neural network and support vector machine approaches. (2007))
  8. Budget

    Regression techniques
    All of the learning algorithms used in the study showed similar prediction performance. (Ceylan et al. Software defect identification using machine learning techniques. (2006))
  9. Management

    Multiple Logistic Regression
    Helps point out the risk factors. (Christiansen et al. Prediction of risk factors of software development project by using multiple logistic regression (2015))
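
The way a multiple logistic regression model can point out risk factors might be sketched as follows. This is only an illustration: the factor names, the coefficient values, and the intercept are hypothetical and are not taken from the cited study. Each fitted coefficient is exponentiated into an odds ratio, and a ratio above 1 marks a factor that raises the odds of a troubled project.

```python
import math


def failure_probability(intercept, coefficients, factors):
    """Logistic model: P(failure) = sigmoid(intercept + sum(beta_i * x_i))."""
    z = intercept + sum(coefficients[name] * value for name, value in factors.items())
    return 1.0 / (1.0 + math.exp(-z))


def odds_ratios(coefficients):
    """exp(beta) is the multiplicative change in odds per unit increase of a factor."""
    return {name: math.exp(beta) for name, beta in coefficients.items()}


# Hypothetical coefficients, invented for illustration only.
coefficients = {
    "unclear_requirements": 1.2,
    "staff_turnover": 0.7,
    "experienced_manager": -0.9,
}

ratios = odds_ratios(coefficients)
# Factors sorted from most risk-increasing to most risk-reducing.
ranked = sorted(ratios, key=ratios.get, reverse=True)

# A hypothetical project with unclear requirements and an experienced manager.
p = failure_probability(
    -1.0,
    coefficients,
    {"unclear_requirements": 1, "staff_turnover": 0, "experienced_manager": 1},
)
print(ranked, round(p, 3))
```

In practice the coefficients would come from fitting the model to historical project data rather than being written by hand.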
  10. Predicting programming effort

  11. Metrics

    • Lines of code
    • Function points
    • Use case points
    • Labour hours
    • Story points
  12. Techniques

    1. Expert estimation
    2. Logical statistical models
    3. Classical ML models
    4. Deep learning models
  13. Expert estimation

    Planning Poker
    • Considerable human bias
    • Overestimates in 40% of instances
    • Very high Mean Magnitude of Relative Error (MMRE) score of 106.8%
    (Moharreri et al. (2016))
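
For reference, MMRE averages the relative error |actual - estimate| / actual over all estimated items. A minimal sketch follows; the effort values are hypothetical and unrelated to the dataset behind the 106.8% figure.

```python
def mmre(actual, predicted):
    """Mean Magnitude of Relative Error, returned as a fraction (x100 for percent)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # The MRE of a single item is |actual - predicted| / actual.
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical effort values in hours, invented for illustration only.
actual_hours = [10.0, 20.0, 40.0]
estimated_hours = [15.0, 10.0, 40.0]

score = mmre(actual_hours, estimated_hours)
print(f"MMRE = {score:.1%}")
```

An MMRE above 100% therefore means that, on average, an estimate misses the actual effort by more than the actual effort itself.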
  14. Logical statistical models

    • Constructive Cost Model (COCOMO)
    • Software Lifecycle Management (SLIM)
    • Function Points
    Inconsistent performance due to the noisy nature of datasets (Azzeh, M.: Software effort estimation based on optimized model tree. (2011))
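
As an example of such a parametric model, Basic COCOMO estimates effort in person-months as a * KLOC^b, with coefficients that depend on the project class. The sketch below uses the Basic COCOMO coefficients published by Boehm (1981); the slides do not say which COCOMO variant the surveyed studies used, so treat the choice of the basic variant as an assumption.

```python
# Basic COCOMO coefficients (a, b) per project class, from Boehm (1981).
COEFFICIENTS = {
    "organic": (2.4, 1.05),
    "semi-detached": (3.0, 1.12),
    "embedded": (3.6, 1.20),
}


def basic_cocomo_effort(kloc, mode="organic"):
    """Estimated effort in person-months for a project of `kloc` thousand lines of code."""
    if kloc <= 0:
        raise ValueError("kloc must be positive")
    a, b = COEFFICIENTS[mode]
    return a * kloc ** b


# Hypothetical 10 KLOC project, estimated under each project class.
for mode in COEFFICIENTS:
    print(mode, round(basic_cocomo_effort(10.0, mode), 1))
```

Because the coefficients are fixed constants, the estimate depends entirely on the size input, which is one reason such models can behave inconsistently on noisy data.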
  15. Classical ML models

    1. Case-based reasoning (more suitable if data is limited)
    2. Decision trees
    Highly interpretable, superior or at least competitive with parametric methods (Wen et al. Systematic literature review of machine learning based software development effort estimation models. (2012))
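
Case-based reasoning estimates a new project from the most similar completed ones. A minimal sketch, assuming a Euclidean distance over min-max scaled features and a plain average of the k nearest efforts; the feature set and the case base are hypothetical, and published approaches differ in their similarity measures and adaptation rules.

```python
import math


def estimate_by_analogy(past_projects, new_features, k=2):
    """past_projects: list of (features, effort) pairs; features are numeric tuples."""
    if not past_projects:
        raise ValueError("at least one past project is required")
    # Min-max scale every feature so that no single feature dominates the distance.
    columns = list(zip(*(features for features, _ in past_projects)))
    lows = [min(column) for column in columns]
    spans = [(max(column) - min(column)) or 1.0 for column in columns]

    def scale(features):
        return [(v - low) / span for v, low, span in zip(features, lows, spans)]

    target = scale(new_features)
    # Retrieve the k most similar cases and reuse their mean effort.
    ranked = sorted(past_projects, key=lambda case: math.dist(scale(case[0]), target))
    nearest = ranked[:k]
    return sum(effort for _, effort in nearest) / len(nearest)


# Hypothetical case base: (function points, team size) -> effort in person-months.
case_base = [
    ((100, 3), 10.0),
    ((120, 4), 14.0),
    ((400, 8), 60.0),
    ((450, 10), 70.0),
]

estimate = estimate_by_analogy(case_base, (110, 3), k=2)
print(estimate)
```

The method needs no training step, only a store of past cases, which fits the slide's note that it suits situations with limited data.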
  16. Deep learning models

    • Noise tolerance
    • High parallelism
    • Generalisation capabilities
    Outperformed Regression Tree, KNN and Regression Analysis. (Kim et al. A comparison of techniques for software development effort estimating. (2005))
  17. Predicting software defects

  18. Metrics

    • Lines of code
    • Weighted methods for class
    • Coupling between objects
    • Response for class
    • Branch count
  19. Comparison of models w.r.t. dataset size

    • Large datasets: Random Forest
    • Small datasets: Naive Bayes
    (Catal et al. Investigating the effect of dataset size, metrics sets, and feature selection techniques on software fault prediction problem. (2009))
  20. Comparison of models w.r.t. metrics

    • Method level metrics: Random Forest classifier
    • Class level metrics: SVM
    (Shanthini et al. Applying machine learning for fault prediction using software metrics. (2012))
  21. Results

    Comparison between Random Forest, AdaBoost, Bagging, Multilayer Perceptron and Genetic Programming
    Best results: Random Forest and Bagging algorithms (Malhotra et al. Fault prediction using statistical and machine learning methods for improving software quality. (2012))
  22. Summary

    • Best fault prediction models: Random Forest and SVM
    • Class level metrics show better prediction performance compared to method level metrics.
    (Karim et al. Software metrics for fault prediction using machine learning approaches: A literature review with PROMISE repository dataset (2017))
  23. Discussion

  24. Models

    Widely used models: case-based reasoning, neural networks
    ML models are gaining popularity in the academic community. However, in industry these models are not used as frequently as their reported performance would suggest.
  25. Limitations

    • Lack of large software datasets
    • Imbalance of datasets
    • Outdated datasets
    • Lack of a united and shared dataset
  26. Conclusion and future research

  27. Conclusion

    Considerable progress of ML methods in the field over the last decades.
  28. Future research and recommendations

    • Reinforcement learning
    • Convolutional NN
    • Recurrent NN
    • Larger, more representative datasets
    • Closer interaction between academic and industrial communities
  29. Thank you for your attention!

  30. Contacts

    Email: s.gimaeva@innopolis.ru