Machine Learning and Value Generation in Software Development: A Survey

Exactpro
November 07, 2019

Barakat Akinsanya, Luiz Araujo, Mariia Charikova, Susanna Gimaeva, Alexandr Grichshenko, Adil Khan, Manuel Mazzara, Ozioma Okonicha and Daniil Shilintsev

International Conference on Software Testing, Machine Learning and Complex Process Analysis (TMPA-2019)
7-9 November 2019, Tbilisi

Video: https://youtu.be/mpjjDqNOx8Q

TMPA Conference website https://tmpaconf.org/
TMPA Conference on Facebook https://www.facebook.com/groups/tmpaconf/

Transcript

  1. Machine Learning and value generation in Software Development: a survey
    Barakat Akinsanya, Luiz Araujo, Mariia Charikova, Susanna Gimaeva, Alexandr Grichshenko, Adil Khan, Manuel Mazzara, Ozioma Okonicha and Daniil Shilintsev, Innopolis University

  2. Software Development challenges
    ● Inaccurate programming effort prediction
    ● Poor code
    ● External risks

  3. Presentation outline
    ❏ Machine learning and its potential usage
    ❏ Predicting risks to a project
    ❏ Predicting programming effort
    ❏ Predicting software defects
    ❏ Discussion
    ❏ Conclusion and future research

  4. Machine learning


  5. Machine learning
    A subfield of artificial intelligence
    Mathematical models identify patterns in the input data and draw conclusions from them
    Steadily gaining popularity

  6. Predicting project risks

  7. Types of risks
    ● Budget
    ● Management
    ● Schedule
    ● Technical
    (Hu et al., "Software project risk management modeling with neural network and support vector machine approaches", 2007)

  8. Budget
    Regression techniques
    All of the learning algorithms used in the study show similar prediction performance.
    (Ceylan et al., "Software defect identification using machine learning techniques", 2006)

  9. Management
    Multiple logistic regression
    Helps identify the risk factors.
    (Christiansen et al., "Prediction of risk factors of software development project by using multiple logistic regression", 2015)

  10. Predicting programming effort

  11. Metrics
    ● Lines of code
    ● Function points
    ● Use case points
    ● Labour hours
    ● Story points

  12. Techniques
    1. Expert estimation
    2. Logical statistical models
    3. Classical ML models
    4. Deep learning models

  13. Expert estimation
    Planning Poker
    Considerable human bias
    Overestimates in 40% of instances
    Very high Mean Magnitude of Relative Error (MMRE) score of 106.8%
    (Moharreri et al., 2016)
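For readers unfamiliar with the metric, MMRE is the mean of |actual - estimated| / actual over all estimated items. The sketch below is only an illustration of that definition: the function names and the effort values are hypothetical and are not taken from Moharreri et al.

```python
def mmre(actual, estimated):
    """Mean Magnitude of Relative Error: mean of |actual - estimated| / actual."""
    if len(actual) != len(estimated) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - e) / a for a, e in zip(actual, estimated)) / len(actual)


def overestimate_rate(actual, estimated):
    """Share of items whose estimate exceeds the actual effort."""
    return sum(e > a for a, e in zip(actual, estimated)) / len(actual)


# Hypothetical effort values in person-hours (illustrative only).
actual_hours = [10.0, 20.0, 40.0]
estimated_hours = [15.0, 10.0, 50.0]

score = mmre(actual_hours, estimated_hours)
over = overestimate_rate(actual_hours, estimated_hours)
print(f"MMRE = {score:.1%}, overestimated in {over:.0%} of items")
```

An MMRE above 100%, as reported on the slide, means the average estimate is off by more than the actual effort itself.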

  14. Logical statistical models
    ● Constructive Cost Model (COCOMO)
    ● Software Lifecycle Management (SLIM)
    ● Function Points
    Inconsistent performance due to the noisy nature of the datasets
    (Azzeh, M., "Software effort estimation based on optimized model tree", 2011)
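As an illustration of what a parametric model of this kind looks like, here is a minimal sketch of basic COCOMO, which estimates effort as a * KLOC^b. The slide does not say which COCOMO variant the survey refers to, so the choice of the basic model is an assumption; the coefficients are Boehm's published values for the three project modes, and the function name is hypothetical.

```python
# Basic COCOMO coefficients (a, b) per project mode, as published by Boehm (1981).
COCOMO_COEFFICIENTS = {
    "organic": (2.4, 1.05),
    "semi-detached": (3.0, 1.12),
    "embedded": (3.6, 1.20),
}


def basic_cocomo_effort(kloc, mode="organic"):
    """Estimated effort in person-months for a project of `kloc` thousand lines of code."""
    if kloc <= 0:
        raise ValueError("kloc must be positive")
    a, b = COCOMO_COEFFICIENTS[mode]
    return a * kloc ** b


# Hypothetical 10 KLOC project, estimated under each mode.
for mode in COCOMO_COEFFICIENTS:
    print(f"{mode}: {basic_cocomo_effort(10, mode):.1f} person-months")
```

Because the coefficients are fixed in advance rather than learned from the project data at hand, a model like this can be sensitive to how well a given dataset matches the projects it was calibrated on.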

  15. Classical ML models
    1. Case-based reasoning (more suitable if data is limited)
    2. Decision trees
    Highly interpretable; superior to or at least competitive with parametric methods
    (Wen et al., "Systematic literature review of machine learning based software development effort estimation models", 2012)
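Case-based reasoning estimates a new project from the most similar past projects. The sketch below shows one simple way to do that (nearest neighbours on min-max scaled features, averaging their effort). The feature choice, the value of k, the function name and all project data are assumptions made for illustration, not details from Wen et al.

```python
import math


def estimate_by_analogy(cases, query, k=2):
    """Average the effort of the k past projects closest to `query`.

    `cases` is a list of (features, effort) pairs; features are min-max
    scaled per dimension so that no single feature dominates the distance.
    """
    dims = range(len(query))
    lows = [min(features[i] for features, _ in cases) for i in dims]
    highs = [max(features[i] for features, _ in cases) for i in dims]

    def scale(vec):
        return [
            (vec[i] - lows[i]) / (highs[i] - lows[i]) if highs[i] > lows[i] else 0.0
            for i in dims
        ]

    scaled_query = scale(query)
    ranked = sorted(cases, key=lambda case: math.dist(scale(case[0]), scaled_query))
    nearest = ranked[:k]
    return sum(effort for _, effort in nearest) / len(nearest)


# Hypothetical past projects: ((KLOC, team size), effort in person-hours).
past_projects = [
    ((10, 3), 120.0),
    ((12, 4), 150.0),
    ((50, 10), 900.0),
    ((45, 8), 800.0),
]

estimate = estimate_by_analogy(past_projects, query=(11, 3), k=2)
print(f"Estimated effort: {estimate} person-hours")
```

With only four stored cases the method still produces an estimate, which is consistent with the slide's remark that case-based reasoning suits limited data.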

  16. Deep learning models
    ● Noise tolerance
    ● High parallelism
    ● Generalisation capabilities
    Outperformed Regression Tree, KNN and Regression Analysis.
    (Kim et al., "A comparison of techniques for software development effort estimating", 2005)

  17. Predicting software defects

  18. Metrics
    ● Lines of code
    ● Weighted methods per class
    ● Coupling between objects
    ● Response for a class
    ● Branch count

  19. Comparison of models w.r.t. dataset size
    Large datasets - Random Forest
    Small datasets - Naive Bayes
    (Catal et al., "Investigating the effect of dataset size, metrics sets, and feature selection techniques on software fault prediction problem", 2009)

  20. Comparison of models w.r.t. metrics
    Method-level metrics - Random Forest classifier
    Class-level metrics - SVM
    (Shanthini et al., "Applying machine learning for fault prediction using software metrics", 2012)

  21. Results
    Comparison between Random Forest, AdaBoost, Bagging, Multilayer Perceptron and Genetic Programming
    Best results: Random Forest and Bagging algorithms
    (Malhotra et al., "Fault prediction using statistical and machine learning methods for improving software quality", 2012)

  22. Summary
    Best fault prediction models: Random Forest and SVM
    Class-level metrics show better prediction performance than method-level metrics.
    (Karim et al., "Software metrics for fault prediction using machine learning approaches: A literature review with PROMISE repository dataset", 2017)

  23. Discussion


  24. Models
    Widely used models: case-based reasoning, neural networks
    ML models are gaining popularity in the academic community.
    However, in industry these models are not used as frequently as their reported performance would suggest.

  25. Limitations
    ● Lack of large software datasets
    ● Imbalance of datasets
    ● Outdated datasets
    ● Lack of a unified, shared dataset

  26. Conclusion and future research


  27. Conclusion
    ML methods in the field have made considerable progress over the last decades.

  28. Future research and recommendations
    ● Reinforcement learning
    ● Convolutional NN
    ● Recurrent NN
    ● Larger, more representative datasets
    ● Closer interaction between academic and industrial communities

  29. Thank you for your attention!


  30. Contacts
    Email: [email protected]
