
Machine Learning + Software Engineering


Machinelearner

January 28, 2021



Transcript

1. What is Software Engineering?
• A field of knowledge, a scientific and engineering discipline
  ◦ accumulation of experience
  ◦ discovery of best practices
• NATO Software Engineering Conferences, 1968–1969
• "Software crisis"
  ◦ budget overruns
  ◦ missed deadlines
  ◦ ineffective software
  ◦ poor quality software
  ◦ requirements that are unclear and not satisfied
  ◦ uncontrollable projects
  ◦ maintenance hell
  ◦ ...
http://homepages.cs.ncl.ac.uk/brian.randell/NATO/nato1968.PDF
http://homepages.cs.ncl.ac.uk/brian.randell/NATO/nato1969.PDF
2. Software Engineering Body of Knowledge (SWEBOK)
1. Software Requirements
2. Software Design
3. Software Construction
4. Software Testing
5. Software Maintenance
6. Software Configuration Management
7. Software Engineering Management
8. Software Engineering Process
9. Software Engineering Models and Methods
10. Software Quality
11. Software Engineering Professional Practice
12. Software Engineering Economics
13. Computing, Mathematical and Engineering Foundations
https://www.computer.org/education/bodies-of-knowledge/software-engineering
3. Software engineering entities
• Type
  ◦ processes (design, coding, testing, ...)
  ◦ products (documents, code, deliverables, ...)
  ◦ resources (personnel, tools, hardware, ...)
• Attributes
  ◦ internal
  ◦ external
4. Applications of ML in SE
• Predicting or estimating measurements of software entities
• Discovering properties of software entities
• Enhancing software entities
• Transforming products
• Synthesizing/generating products
• Reusing products
• Enhancing processes
5. What this talk is about
• ML in software development processes & tools
  ◦ 11 examples of application areas
  ◦ brief history & recent works
  ◦ problems and challenges
• Tool support for ML
6. Where does the SE data come from?
• Formal data
  ◦ source code
  ◦ configuration files
  ◦ binary code
  ◦ logs and execution traces
• Text
  ◦ documentation
  ◦ specifications, design documents
  ◦ communication tools (email, Q&A forums, ...)
• Metadata
  ◦ version control systems
  ◦ bug/issue/task trackers, planning systems, ...
7. How to get this data
• Existing datasets for specific tasks
• SE data repositories (Qualitas Corpus, PROMISE, ...)
• Code databases (PGA, GHTorrent, GH Archive, ...)
• Software data in the wild (GitHub, Gerrit, Jira, ...)
  ◦ lack of appropriate mining tools
  ◦ proper data mining is hard
    ▪ especially when applied to VCS data
8. Building vector representations
• Text
  ◦ various NLP techniques
• Source code embeddings
  ◦ explicit features (software metrics, simple NLP features, path-based representations, ...)
  ◦ implicit features (N-grams, AST encodings, feature hashing, autoencoders, GNNs, ...; a feature-hashing sketch follows this list)
  ◦ on different levels
    ▪ tokens, methods, API calls, system events, execution traces, code changes, ...
• IR, binary code
  ◦ paths in CFG, NLP features, bitmaps, ...
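To make the implicit-features idea concrete, here is a minimal sketch of feature hashing over token bigrams. The snippet, vector size, and hashing scheme are all illustrative and not taken from any of the approaches above:

```python
import hashlib
import io
import tokenize

def code_to_vector(source: str, dim: int = 64, n: int = 2) -> list:
    """Embed a code snippet as a bag of hashed token n-grams."""
    toks = [t.string for t in tokenize.generate_tokens(io.StringIO(source).readline)
            if t.string.strip()]
    vec = [0.0] * dim
    for i in range(len(toks) - n + 1):
        ngram = " ".join(toks[i:i + n])
        bucket = int(hashlib.md5(ngram.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0  # hash collisions are tolerated by design
    return vec

print(code_to_vector("def add(a, b):\n    return a + b\n"))
```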
9. code2vec: AST path-contexts
(elements, Name↑FieldAccess↑Foreach↓Block↓IfStmt↓Block↓Return↓BooleanExpr, true)
Alon et al. code2vec: Learning Distributed Representations of Code (POPL'19)
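For intuition, the sketch below extracts code2vec-style path-contexts from a Python AST: for each pair of leaves, the path climbs from the left leaf to their lowest common ancestor and descends to the right leaf. The paper works on Java with its own node labels; everything here is a simplified reconstruction:

```python
import ast
import itertools

def leaf_paths(tree):
    """Collect (leaf_token, root-to-leaf path of (node_id, node_label)) pairs."""
    out = []
    def walk(node, path):
        step = path + [(id(node), type(node).__name__)]
        children = [c for c in ast.iter_child_nodes(node)
                    if not isinstance(c, (ast.Load, ast.Store, ast.Del))]
        if not children:
            for attr in ("id", "arg", "value"):
                if hasattr(node, attr):
                    out.append((str(getattr(node, attr)), step))
                    break
            else:
                out.append((type(node).__name__, step))
        for child in children:
            walk(child, step)
    walk(tree, [])
    return out

def path_contexts(source: str):
    """Yield (left_token, path between the leaves via their LCA, right_token)."""
    leaves = leaf_paths(ast.parse(source))
    for (tok_a, pa), (tok_b, pb) in itertools.combinations(leaves, 2):
        k = 0                                # length of the common prefix
        while k < min(len(pa), len(pb)) and pa[k][0] == pb[k][0]:
            k += 1
        up = [label for _, label in reversed(pa[k - 1:])]   # leaf a up to the LCA
        down = [label for _, label in pb[k:]]               # below the LCA to leaf b
        yield tok_a, "↑".join(up) + "↓" + "↓".join(down), tok_b

sample = "def is_empty(xs):\n    return len(xs) == 0\n"
for ctx in itertools.islice(path_contexts(sample), 3):
    print(ctx)
```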
10. code2vec: the neural architecture
Alon et al. code2vec: Learning Distributed Representations of Code (POPL'19)
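The heart of the architecture is an attention-weighted aggregation of encoded path-contexts into a single code vector. A NumPy sketch with random stand-in parameters (real dimensions and weights come from training):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128                                  # context embedding size (illustrative)
contexts = rng.normal(size=(20, d))      # 20 encoded path-contexts of one method

W = rng.normal(size=(d, d))              # fully connected layer (learned; random here)
a = rng.normal(size=d)                   # global attention vector (learned; random here)

combined = np.tanh(contexts @ W)         # combine each path-context
scores = combined @ a                    # attention logits, one per context
weights = np.exp(scores - scores.max())
weights /= weights.sum()                 # softmax over the 20 contexts
code_vector = weights @ combined         # attention-weighted sum
print(code_vector.shape)                 # (128,)
```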
11. code2vec: suggesting method names
Alon et al. code2vec: Learning Distributed Representations of Code (POPL'19)
12. 1. Estimating size, cost and effort
• Various targets: project, development, maintenance, correction, ...
• One of the first research tasks tackled in SE
  ◦ a survey published in 2007 covers 250+ methods
• Types of models
  ◦ expert judgement
  ◦ parametric models (Use Cases, Function Points, COCOMO I/II, ...; see the sketch below)
  ◦ non-parametric models (estimation via analogy)
• Various learners
  ◦ linear regression, Bayesian networks, GAs, NNs, DTs, HMMs, association rules, ...
• Challenges
  ◦ the factors that affect effort and productivity are not well understood
  ◦ lack of decent historical and production data
  ◦ the need to adjust models to the local environment
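As an example of the parametric family, basic COCOMO estimates effort as a power law of program size. A minimal sketch using the classic "organic mode" coefficients (Boehm, 1981):

```python
def cocomo_basic_effort(kloc: float, a: float = 2.4, b: float = 1.05) -> float:
    """Basic COCOMO effort in person-months, 'organic mode' coefficients."""
    return a * kloc ** b

effort = cocomo_basic_effort(32)         # a 32 KLOC project
schedule = 2.5 * effort ** 0.38          # nominal development time, months
print(f"{effort:.1f} person-months, {schedule:.1f} months")
```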
13. 2. Software quality prediction
• Defect estimation, bug prediction, reliability prediction
  ◦ 200+ papers reviewed in a 2012 study
• Input data
  ◦ process, software and developer metrics, change data, historical data (previous bug reports)
• Various learners
  ◦ genetic programming, neural networks, decision trees, Bayesian belief networks, ...
  ◦ unsupervised approaches
  ◦ effort-aware defect prediction
• Popular datasets
  ◦ 12 NASA datasets
  ◦ a selection of open-source projects (PROMISE, Eclipse dataset, ...)
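A toy end-to-end version of metric-based defect prediction, with a decision tree as the learner. The module metrics and labels are invented for illustration, not drawn from the datasets above:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# toy module-level metrics: [LOC, cyclomatic complexity, churn]; label: had a bug?
X = np.array([[120, 4, 2], [900, 25, 14], [60, 2, 1], [450, 18, 9],
              [300, 7, 3], [1200, 31, 22], [80, 3, 0], [640, 21, 11]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(clf.predict([[700, 20, 10]]))   # defect-proneness of an unseen module
```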
14. Deep learning-based bug detectors
• Learns representations with a word2vec-like neural architecture
• Trained on an artificially generated dataset (correct code plus seeded bugs)
• Models detect swapped arguments, incorrect binary operators, and incorrect binary operands
• Plugins for WebStorm (JavaScript) and PyCharm (Python)
  ◦ https://plugins.jetbrains.com/plugin/12220-deepbugsjavascript
  ◦ https://plugins.jetbrains.com/plugin/12218-deepbugspython
Michael Pradel and Koushik Sen. DeepBugs: A Learning Approach to Name-based Bug Detection (OOPSLA'18)
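The idea behind the swapped-argument detector, schematically: embed the callee and argument names, feed them to a binary classifier, and flag calls whose argument order scores as likely buggy. The embeddings and weights below are random stand-ins, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(1)
emb = {name: rng.normal(size=16) for name in ("setSize", "width", "height")}
w = rng.normal(size=3 * 16)              # a "trained" classifier (random here)

def buggy_score(callee: str, arg1: str, arg2: str) -> float:
    """Probability-like score that the call's argument order is wrong."""
    x = np.concatenate([emb[callee], emb[arg1], emb[arg2]])
    return 1.0 / (1.0 + np.exp(-x @ w))  # logistic regression over embeddings

# the detector compares a call against its swapped variant
print(buggy_score("setSize", "width", "height"),
      buggy_score("setSize", "height", "width"))
```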
15. 3. Automatic software repair
• Fixing software bugs without human intervention
  ◦ errors, faults, and failures
• Behavioral repairs (compile time)
  ◦ oracles (tests, pre- and post-conditions, behavioral models)
  ◦ static analysis
  ◦ domain-specific
• State repairs (runtime)
  ◦ reinitialization and restart, checkpoint and rollback, reconfiguration, ...
• Challenges
  ◦ non-trivial syntactic and semantic bugs
16. Prophet
• Generate-and-validate patching (see the sketch below)
• Localize the defect
  ◦ execution traces on negative and positive inputs
• Generate candidate patches
  ◦ modification of only one statement
• Rank candidates
  ◦ program value features and modification features
  ◦ probabilistic model of correct code
• Validate candidates
  ◦ test suite as an oracle
  ◦ passed tests == fixed bug?
Fan Long and Martin Rinard. Automatic Patch Generation by Learning Correct Code (POPL'16)
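A minimal generate-and-validate loop on a toy Python function. Prophet itself works on C and ranks candidates with a learned probabilistic model; here the candidate list and its order are hard-coded for illustration:

```python
def apply_patch(source: str, patch: tuple) -> str:
    old, new = patch
    return source.replace(old, new, 1)   # single-statement, single-site edit

def passes_tests(source: str, tests) -> bool:
    scope = {}
    try:
        exec(source, scope)
        return all(t(scope) for t in tests)
    except Exception:
        return False

buggy = "def max2(a, b):\n    return a if a < b else b\n"   # comparison flipped
candidates = [("a < b", "a > b"), ("a if", "b if")]         # ranking omitted
tests = [lambda s: s["max2"](1, 2) == 2,
         lambda s: s["max2"](5, 3) == 5]                    # the test-suite oracle

for patch in candidates:
    if passes_tests(apply_patch(buggy, patch), tests):
        print("plausible patch:", patch)   # "plausible": it only passes the tests
        break
```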
17. 4. ML applications in software testing
• Tasks
  ◦ test and test data generation
  ◦ fault localization
  ◦ code repair
  ◦ test prioritization
  ◦ finding relevant tests
  ◦ estimation of testing effort
  ◦ replacement of test suites
• Data
  ◦ execution traces, logs
  ◦ coverage information
  ◦ failure data: where and why
• https://testsigma.com, https://eggplant.io/, ...
18. Test case prioritization and selection
• Model-free, online reinforcement learning method
  ◦ language-agnostic, requires no source code access
• Rewards based on duration, last execution time, and failure history
  ◦ rewards are either zero or positive
• An effective prioritization strategy is discovered after ~60 CI cycles
Spieker et al. Reinforcement Learning for Automatic Test Case Prioritization and Selection in Continuous Integration (ISSTA'17)
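A bare-bones sketch of the online setting. This is my simplification (a tabular priority per test, updated from zero-or-positive rewards); the paper evaluates several reward functions and memory representations:

```python
from collections import defaultdict

priority = defaultdict(float)   # learned priority per test case
ALPHA = 0.1                     # learning rate

def ci_cycle(tests, run_test, budget):
    """One CI cycle: schedule the top-`budget` tests, learn from the outcomes."""
    ranked = sorted(tests, key=lambda t: priority[t], reverse=True)
    for t in ranked[:budget]:
        reward = 1.0 if run_test(t) else 0.0      # zero or positive, as in the paper
        priority[t] += ALPHA * (reward - priority[t])

# toy usage: test "t2" always fails and floats to the top over the cycles
for _ in range(60):
    ci_cycle(["t1", "t2", "t3"], run_test=lambda t: t == "t2", budget=2)
print(max(priority, key=priority.get))            # -> t2
```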
19. 5. Detection of code smells and refactoring recommendation
• Finding code smells in code, automatically suggesting refactoring opportunities
• Features
  ◦ structural information from the source code (mostly software metrics)
  ◦ patterns for code smells
• Various learners
• Most tools are standalone applications or Eclipse plugins
• Challenges
  ◦ computational and memory complexity
  ◦ ambiguous evaluation metrics
  ◦ low agreement between different detectors
  ◦ datasets
  ◦ design patterns
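The classic metric-based baseline is a threshold rule. Below is a God Class check in the style of Lanza and Marinescu's detection strategies; the thresholds should be treated as illustrative, and learned detectors calibrate them instead:

```python
def is_god_class(m: dict) -> bool:
    """Naive God Class heuristic over per-class metrics (thresholds illustrative)."""
    return (m["wmc"] > 47        # weighted methods per class: very high complexity
            and m["atfd"] > 5    # accesses to foreign data: uses other classes' data
            and m["tcc"] < 1/3)  # tight class cohesion: low internal cohesion

print(is_god_class({"wmc": 61, "atfd": 9, "tcc": 0.12}))   # -> True
```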
20. Automatic recommendation of refactoring opportunities
• Detection of defects in object-oriented architecture and automatic recommendation of appropriate refactorings that optimize code structure
  ◦ a clustering ensemble of 3 existing approaches
  ◦ path-based representations + an SVM model (work in progress)
• ArchitectureReloaded plugin for IntelliJ IDEA
  ◦ https://plugins.jetbrains.com/plugin/10411-architecturereloaded
• A dataset generator and a dataset for evaluating Move Method refactoring recommendation approaches
Bryksin et al. Automatic Recommendation of Move Method Refactorings Using Clustering Ensembles (IWoR'18)
Novozhilov et al. Evaluation of Move Method Refactorings Recommendation Algorithms: Are We Doing It Right? (IWoR'19)
Kurbatova et al. Recommendation of Move Method Refactoring Using Path-Based Representation of Code (IWoR'20)
21. 6. Duplicate management in SE
• Copy/paste is evil (?)
• Duplicates in source code, documentation, ...
  ◦ detection of duplicated knowledge
• 4 types of code clones
• All kinds of embeddings and learners involved
• Language-specific and language-agnostic algorithms
• Challenges
  ◦ computation time
  ◦ semantic clones
Chanchal Roy and James Cordy. A Survey on Software Clone Detection Research (2007)
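For intuition about the first clone types, here is a tiny detector that normalizes identifiers and literals (Type-2 normalization) and groups fragments whose normalized token sequences match exactly. Production tools add suffix trees, AST-subtree hashing, or embeddings on top of this idea:

```python
import io
import keyword
import token
import tokenize
from collections import defaultdict

def normalized_tokens(source: str) -> tuple:
    """Replace identifiers and literals with placeholders (Type-2 normalization)."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == token.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")
        elif tok.type in (token.NUMBER, token.STRING):
            out.append("LIT")
        elif tok.string.strip():
            out.append(tok.string)
    return tuple(out)

def clone_groups(fragments: dict) -> list:
    """Group fragments whose normalized token sequences coincide."""
    buckets = defaultdict(list)
    for name, src in fragments.items():
        buckets[normalized_tokens(src)].append(name)
    return [names for names in buckets.values() if len(names) > 1]

print(clone_groups({"a": "x = price * 2\n",
                    "b": "y = cost * 3\n",       # same shape, different names
                    "c": "if done: stop()\n"}))  # different shape
```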
22. AntiCopyPaster
Kirilenko et al. AntiCopyPaster: Extracting Code Duplicates As Soon As They Are Introduced in the IDE (submitted to MSR'21)
https://doi.org/10.5281/zenodo.4432720
23. 7. Code completion
• Make the best suggestion based on the context observed during training
  ◦ full-line and snippet-based completion
• Extracting context
  ◦ rich structural features (e.g. types)
  ◦ recurring patterns in source code, found with text mining techniques
  ◦ and everything in between
• Various learners
  ◦ mostly deep learning models
• Challenges
  ◦ performance and memory limitations
  ◦ synthetic datasets and evaluation approaches
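Before deep models, the simplest statistical completion engines were token n-gram models; a minimal bigram sketch over a stand-in corpus:

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count token bigrams: the simplest statistical completion model."""
    model = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        model[prev][nxt] += 1
    return model

corpus = "for i in range ( n ) : total += i".split()   # stand-in training data
model = train_bigram(corpus)

def suggest(prev_token: str, k: int = 3):
    return [tok for tok, _ in model[prev_token].most_common(k)]

print(suggest("range"))   # -> ['(']
```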
24. Deep-AutoCoder
Hu et al. Deep-AutoCoder: Learning to Complete Code Precisely with Induced Code Tokens (COMPSAC'19)
25. 8. Processing code changes
• Detection of refactorings
• Predicting, analyzing and fixing bugs
• Auto-patching
• Test generation
• Vector representations of code changes
  ◦ edit scripts (see the sketch below)
  ◦ all kinds of neural networks
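At the simplest, a code change can be serialized as a line-level edit script; real work typically uses tree diffs over ASTs (e.g. GumTree-style edit scripts). This difflib sketch only illustrates the idea:

```python
import difflib

before = ["def area(r):", "    return 3.14 * r * r"]
after  = ["import math", "def area(r):", "    return math.pi * r ** 2"]

# serialize the change as an edit script of insert/delete/replace operations
matcher = difflib.SequenceMatcher(a=before, b=after)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        print(tag, before[i1:i2], "->", after[j1:j2])
```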
26. Classification of error types for programming MOOCs
• Based on clustering code changes (bug fixes)
  ◦ fixes represented as edit scripts between the ASTs of incorrect and correct submissions
• Prototype implementation for an introductory Java MOOC
  ◦ ~1M submissions for 34 tasks (5–21 LOC each)
  ◦ currently being integrated into Stepik.org
Lobanov et al. Automatic Classification of Error Types in Solutions to Programming Assignments at Online Learning Platform (AIED'19)
27. 9. Anomaly detection in SE
• Used to find possible
  ◦ bugs
  ◦ security issues
  ◦ architectural design flaws
  ◦ workflow errors
  ◦ synchronization errors in concurrent programs
  ◦ performance issues
  ◦ compiler defects and atypical programs
• Mostly unsupervised learning
  ◦ anomaly detection algorithms, clustering, autoencoders, statistical methods, ...
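A minimal unsupervised setup on made-up per-file metrics (the work cited on the next slide uses much richer source code and bytecode features), with Isolation Forest as the detector:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# toy per-file features: [LOC, max nesting depth, identifiers per line]
normal_files = rng.normal(loc=[200, 3, 2.5], scale=[50, 1, 0.5], size=(500, 3))
suspicious = [[4000, 14, 0.2]]           # one very atypical file

detector = IsolationForest(contamination=0.01, random_state=0).fit(normal_files)
print(detector.predict(suspicious))      # -1 marks an anomaly
```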
28. Finding anomalies in Kotlin programs
• A dataset of more than 1.5 million unique files
• Several experiments with different features and anomaly detectors
• Analysis of both source code and bytecode
• Detected 30 types of code anomalies; ~60 reported unique anomalies of 10 types were used by the Kotlin compiler team
Bryksin et al. Using Large-Scale Anomaly Detection on Code to Improve Kotlin Compiler (MSR'20)
29. 10. Program synthesis
• User intent or constraints -> program
  ◦ usually involves a search over some space of programs
• Program repair, automatic programming
• Deductive synthesis, transformation-based synthesis
• Inductive synthesis (see the sketch below)
  ◦ input-output examples, natural language, partial programs, grammars, assertions
• Challenges
  ◦ search space
  ◦ user intent
  ◦ search technique
Gulwani et al. Program Synthesis (Foundations and Trends in Programming Languages, Vol. 4, No. 1-2, 2017)
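A toy inductive synthesizer over an invented four-operation DSL: enumerate programs in order of length and return the first one consistent with all input-output examples. Real systems prune this search aggressively:

```python
from itertools import product

OPS = {"inc": lambda x: x + 1,           # a four-operation toy DSL
       "dec": lambda x: x - 1,
       "dbl": lambda x: x * 2,
       "neg": lambda x: -x}

def run(program, x):
    for op in program:
        x = OPS[op](x)
    return x

def synthesize(examples, max_len=3):
    """Return the shortest op sequence consistent with all I/O examples."""
    for length in range(1, max_len + 1):
        for program in product(OPS, repeat=length):
            if all(run(program, i) == o for i, o in examples):
                return program
    return None

print(synthesize([(1, 4), (3, 8)]))      # -> ('inc', 'dbl'), i.e. (x + 1) * 2
```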
30. Bayou (Bayesian Sketch Learning)
• Input: method calls and class names
  ◦ aimed at generating API-heavy code
• Sketches as the program representation
• Bayesian encoder-decoder technique
• Combinatorial concretization
  ◦ a random-walk-based technique
• IntelliJ IDEA plugin
  ◦ implementations for the Java standard library and the Android SDK
  ◦ https://plugins.jetbrains.com/plugin/10729-bsl-code-synthesizer
Murali et al. Neural Sketch Learning for Conditional Program Generation (ICLR'18)
Vladislav Tankov and Timofey Bryksin. Data-Based Code Synthesis in IntelliJ IDEA (SEIM'18)
31. 11. Code summarization
• Generation of NL sequences from source code snippets
  ◦ creating documentation, suggesting better function names, commit messages, etc.
• Approaches
  ◦ rule/template-based text generation (see the sketch below)
  ◦ models adopted from the IR and NLP domains
  ◦ deep learning models from the NMT field
    ▪ code2seq (Alon et al., 2019)
• CoNaLa: The Code/Natural Language Challenge
  ◦ https://conala-corpus.github.io
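The rule/template-based end of the spectrum is almost trivial to sketch; the neural approaches (e.g. code2seq) learn this mapping instead of hard-coding it. A hypothetical template over a Python function:

```python
import ast

def summarize(source: str) -> str:
    """One-line summary from the function name and parameters."""
    fn = ast.parse(source).body[0]               # assumes a single function def
    action = fn.name.replace("_", " ")
    params = ", ".join(a.arg for a in fn.args.args)
    return f"{action} (parameters: {params})" if params else action

print(summarize("def parse_config_file(path):\n    ...\n"))
# -> "parse config file (parameters: path)"
```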
32. PtrGNCMsg: generation of commit messages
Liu et al. Generating Commit Messages from Diffs Using Pointer-Generator Network (MSR'19)
33. Tools for ML
• Data availability, collection, cleaning, and management
• End-to-end pipeline support
• Rich visualisation tools
• Model evolution
• Debugging, testing and interpretability
• Integration of ML pipelines into production
Amershi et al. Software Engineering for Machine Learning: A Case Study (ICSE'19)
34. (Some of the) Challenges of ML4SE
• Feature engineering
• Datasets & mining tools
• Reproducibility
• Extensibility
• Interpretability
• Evaluation metrics
• ML for the sake of ML
• Immaturity for the real world
• Gap between academia and industry
https://d30womf5coomej.cloudfront.net/sa/2c/25ac5102-aa66-4edd-887d-6babe41a20e3.png