Slide 1

Machine Learning + Software Engineering
Timofey Bryksin

Slide 2

What is Software Engineering?
● Field of knowledge, scientific and engineering discipline
  ○ accumulation of experience
  ○ discovery of best practices
● NATO Software Engineering Conferences, 1968–1969
● “Software crisis”
  ○ budget overruns
  ○ missed deadlines
  ○ ineffective software
  ○ poor-quality software
  ○ requirements that are unclear and not satisfied
  ○ uncontrollable projects
  ○ maintenance hell
  ○ ...
http://homepages.cs.ncl.ac.uk/brian.randell/NATO/nato1968.PDF
http://homepages.cs.ncl.ac.uk/brian.randell/NATO/nato1969.PDF

Slide 3

And now?
https://standishgroup.com/sample_research_files/CHAOSReport2015-Final.pdf

Slide 4

Software Engineering Body of Knowledge (SWEBOK)
1. Software Requirements
2. Software Design
3. Software Construction
4. Software Testing
5. Software Maintenance
6. Software Configuration Management
7. Software Engineering Management
8. Software Engineering Process
9. Software Engineering Models and Methods
10. Software Quality
11. Software Engineering Professional Practice
12. Software Engineering Economics
13. Computing, Mathematical and Engineering Foundations
https://www.computer.org/education/bodies-of-knowledge/software-engineering

Slide 5

Software engineering entities
● Type
  ○ processes (design, coding, testing, ...)
  ○ products (documents, code, deliverables, ...)
  ○ resources (personnel, tools, hardware, ...)
● Attributes
  ○ internal
  ○ external

Slide 6

http://www.cognub.com/wp-content/uploads/2016/02/1.png

Slide 7

Applications of ML in SE
● Predicting or estimating measurements of software entities
● Discovering properties of software entities
● Enhancing software entities
● Transforming products
● Synthesizing/generating products
● Reusing products
● Enhancing processes

Slide 8

What this talk is about
● ML in software development processes & tools
  ○ 11 examples of application areas
  ○ brief history & recent works
  ○ problems and challenges
● Tool support for ML

Slide 9

Where does the SE data come from?
● Formal data
  ○ source code
  ○ configuration files
  ○ binary code
  ○ logs and execution traces
● Text
  ○ documentation
  ○ specifications, design documents
  ○ communication tools (email, QA forums, ...)
● Metadata
  ○ version control systems
  ○ bug/issue/task trackers, planning systems, ...

Slide 10

And all of them change and influence each other 😕

Slide 11

How to get this data
● Existing datasets for specific tasks
● SE data repositories (Qualitas Corpus, PROMISE, ...)
● Code databases (PGA, GHTorrent, GH Archive, ...)
● Software data in the wild (GitHub, Gerrit, Jira, ...)
  ○ lack of appropriate mining tools
  ○ proper data mining is hard
    ■ especially when applied to VCS

Slide 12

Building vector representations
● Text
  ○ various NLP techniques
● Source code embeddings
  ○ explicit features (software metrics, simple NLP features, path-based representations, ...)
  ○ implicit features (N-grams, AST encodings, feature hashing, autoencoders, GNNs, ...)
  ○ on different levels
    ■ tokens, methods, API calls, system events, execution traces, code changes, ...
● IR, binary code
  ○ paths in CFG, NLP features, bitmaps, ...
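As a toy illustration of the implicit-feature route, feature hashing maps an arbitrary bag of code tokens to a fixed-size vector without any learned vocabulary. A minimal sketch; the 16-dimension size and the sign trick are arbitrary choices for illustration, not from any particular paper:

```python
import hashlib

def hash_embed(tokens, dim=16):
    """Map a bag of code tokens to a fixed-size vector via feature hashing."""
    vec = [0.0] * dim
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)  # stable across runs
        idx = h % dim
        sign = 1.0 if (h // dim) % 2 == 0 else -1.0  # signed hashing reduces collision bias
        vec[idx] += sign
    return vec

snippet = "for item in items : if item . is_valid ( ) : return True".split()
print(hash_embed(snippet))
```

Unseen tokens need no out-of-vocabulary handling, which is why hashing is popular for the huge, open-ended vocabularies of source code.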

Slide 13

code2vec: AST path-contexts
(elements, Name↑FieldAccess↑Foreach↓Block↓IfStmt↓Block↓Return↓BooleanExpr, true)
Alon et al. code2vec: Learning distributed representations of code (POPL’19)
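The path-context above can be reproduced on a toy AST. A sketch assuming the tree is hand-built as (label, children) tuples rather than parsed from real code; code2vec itself extracts these contexts from full parse trees:

```python
# Toy AST node: (label, children); a leaf's label is its token.
tree = ("Foreach", [
    ("FieldAccess", [("Name", [("elements", [])])]),
    ("Block", [("IfStmt", [("Block", [("Return", [("BooleanExpr", [("true", [])])])])])]),
])

def leaf_paths(node, prefix=()):
    """Yield (token, chain-of-ancestor-labels) for every leaf."""
    label, children = node
    if not children:
        yield label, prefix
    for child in children:
        yield from leaf_paths(child, prefix + (label,))

def path_context(tree, leaf_a, leaf_b):
    """Build the code2vec-style triple: up from leaf_a to the LCA, down to leaf_b."""
    paths = dict(leaf_paths(tree))
    pa, pb = paths[leaf_a], paths[leaf_b]
    i = 0
    while i < min(len(pa), len(pb)) and pa[i] == pb[i]:
        i += 1  # skip the common prefix; pa[i-1] is the lowest common ancestor
    up = "↑".join(list(reversed(pa[i:])) + [pa[i - 1]])
    down = "".join("↓" + label for label in pb[i:])
    return (leaf_a, up + down, leaf_b)

print(path_context(tree, "elements", "true"))
```

For the toy tree this reconstructs exactly the path-context string shown on the slide.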

Slide 14

code2vec: the neural architecture
Alon et al. code2vec: Learning distributed representations of code (POPL’19)

Slide 15

code2vec: suggesting method names
Alon et al. code2vec: Learning distributed representations of code (POPL’19)

Slide 16

1. Estimating size, cost and effort
● Various targets: project, development, maintenance, correction, …
● One of the first research tasks tackled in SE
  ○ a survey with 250+ methods published in 2007
● Types of models
  ○ expert judgement
  ○ parametric models (Use Cases, Function Points, COCOMO I/II, …)
  ○ non-parametric models (estimation-via-analogy)
● Various learners
  ○ linear regression, Bayesian networks, GAs, NNs, DTs, HMMs, association rules, …
● Challenges
  ○ factors that affect effort and productivity are not well understood
  ○ lack of decent historical and production data
  ○ the need to adjust models to the local environment
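As a concrete example of a parametric model, basic COCOMO estimates effort from size alone. A sketch using the standard published coefficients for basic COCOMO (Boehm, 1981); the 32 KLOC input is an arbitrary example:

```python
# Basic COCOMO: effort in person-months from size in KLOC.
COEFFS = {  # (a, b, c, d) per project class
    "organic":       (2.4, 1.05, 2.5, 0.38),
    "semi-detached": (3.0, 1.12, 2.5, 0.35),
    "embedded":      (3.6, 1.20, 2.5, 0.32),
}

def cocomo(kloc, mode="organic"):
    a, b, c, d = COEFFS[mode]
    effort = a * kloc ** b   # person-months
    tdev = c * effort ** d   # development time, months
    return effort, tdev

effort, tdev = cocomo(32, "organic")
print(f"{effort:.1f} person-months over {tdev:.1f} months")
```

The exponents b > 1 encode the diseconomy of scale: doubling size more than doubles estimated effort, which is exactly the kind of assumption later ML-based estimators try to learn from data instead of fixing a priori.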

Slide 17

2. Software quality prediction
● Defect estimation, bug prediction, reliability prediction
  ○ 200+ papers overviewed in a 2012 study
● Input data
  ○ process, software and developer metrics, change data, historical data (prev. bug reports)
● Various learners
  ○ genetic programming, neural networks, decision trees, Bayesian belief networks, ...
  ○ unsupervised approaches
  ○ effort-aware defect prediction
● Popular datasets
  ○ 12 NASA datasets
  ○ a selection of open-source projects (PROMISE, Eclipse dataset, ...)
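A minimal sketch of one such learner: k-nearest-neighbors over module metrics, predicting "defective" by majority vote among the most similar known modules. The metric values and labels below are invented for illustration:

```python
import math

# Rows: (LOC, cyclomatic complexity, churn) -> defective? (labels made up for illustration)
TRAIN = [
    ((120, 4, 2), 0), ((950, 31, 14), 1), ((300, 8, 3), 0),
    ((700, 25, 9), 1), ((80, 2, 1), 0), ((1200, 40, 20), 1),
]

def predict(metrics, k=3):
    """Majority vote among the k nearest modules in metric space."""
    nearest = sorted(TRAIN, key=lambda row: math.dist(metrics, row[0]))[:k]
    votes = sum(label for _, label in nearest)
    return int(votes > k / 2)

print(predict((1000, 35, 15)))  # resembles the large, churn-heavy defective modules
```

Real studies would normalize the features first (LOC dominates the distance here), which is one reason feature engineering keeps showing up as a challenge.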

Slide 18

Deep learning-based bug detectors
● Learning representations with a word2vec-like neural architecture
● Generated artificial dataset
● Models to detect swapped arguments, incorrect binary operator, and incorrect binary operand issues
● Plugins for WebStorm (JavaScript) and PyCharm (Python)
  ○ https://plugins.jetbrains.com/plugin/12220-deepbugsjavascript
  ○ https://plugins.jetbrains.com/plugin/12218-deepbugspython
Michael Pradel and Koushik Sen. DeepBugs: A Learning Approach to Name-based Bug Detection (OOPSLA'18)
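The name-based intuition behind the swapped-arguments detector can be caricatured without any neural network: if the call's argument names match the parameter names better after swapping, the call is suspicious. A sketch using plain string similarity in place of the learned embeddings DeepBugs actually uses:

```python
from difflib import SequenceMatcher

def sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def swapped_args_suspicious(param_names, arg_names):
    """Flag a two-argument call if swapping the arguments matches parameter names better."""
    (p1, p2), (a1, a2) = param_names, arg_names
    straight = sim(p1, a1) + sim(p2, a2)
    swapped = sim(p1, a2) + sim(p2, a1)
    return swapped > straight

# Declared as f(callback, delay) but called as f(delayMs, onDone): likely swapped.
print(swapped_args_suspicious(("callback", "delay"), ("delayMs", "onDone")))
```

The real model replaces `sim` with distances between learned identifier embeddings, so it also catches semantically related names that share no characters.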

Slide 19

3. Automatic software repair
● Finding a solution to software bugs without human intervention
  ○ errors, faults, and failures
● Behavioral repairs (compile-time)
  ○ oracles (tests, pre- and post-conditions, behavioral models)
  ○ static analysis
  ○ domain-specific
● State repairs (runtime)
  ○ reinitialization and restart, checkpoint and rollback, reconfiguration, …
● Challenges
  ○ non-trivial syntactic and semantic bugs

Slide 20

Prophet
● Generate-and-validate patching
● Localize defect
  ○ execution traces on negative and positive inputs
● Generate candidate patches
  ○ modification of only one statement
● Rank candidates
  ○ program value features and modification features
  ○ probabilistic model of correct code
● Validate candidates
  ○ test suite as an oracle
  ○ passed test == fixed bug?
Fan Long and Martin Rinard. Automatic Patch Generation by Learning Correct Code (POPL’16)
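The generate-and-validate loop itself is easy to sketch. A toy version where candidate patches are single-operator mutations and the test suite is the oracle; this omits Prophet's actual localization and feature-based ranking:

```python
import operator

# Buggy implementation under repair: should compute a - b but uses +.
def buggy(a, b):
    return a + b

TESTS = [((5, 3), 2), ((10, 4), 6), ((0, 0), 0)]  # (inputs, expected output)

CANDIDATE_OPS = [operator.add, operator.sub, operator.mul]  # one-statement mutations

def repair():
    for op in CANDIDATE_OPS:                             # generate
        patched = lambda a, b, op=op: op(a, b)
        if all(patched(*i) == out for i, out in TESTS):  # validate against the oracle
            return op.__name__
    return None

print(repair())  # 'sub'
```

The "passed test == fixed bug?" caveat on the slide is visible even here: a patch that merely satisfies the given tests may still be wrong on unseen inputs, which is why Prophet ranks candidates with a learned model of correct code.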

Slide 21

4. ML applications in software testing
● Tasks
  ○ tests and test data generation
  ○ fault localization
  ○ code repair
  ○ test prioritization
  ○ finding relevant tests
  ○ estimation of testing efforts
  ○ replacement of test suites
● Data
  ○ execution traces, logs
  ○ coverage information
  ○ failure data: where and why
● https://testsigma.com, https://eggplant.io/, ...

Slide 22

Test case prioritization and selection
● Model-free and online learning method
  ○ language-agnostic, requires no source code access
● Rewards based on duration, previous last execution time and failure history
  ○ are either zero or positive
● An effective prioritization strategy is discovered after ~60 CI cycles
Spieker et al. Reinforcement Learning for Automatic Test Case Prioritization and Selection in Continuous Integration (ISSTA’17)
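A minimal tabular sketch of the idea: keep a value estimate per test case, update it online from a zero-or-positive reward, and run tests in decreasing value order. The reward shape, learning rate, and test names below are illustrative, not the paper's:

```python
# Tabular value estimates per test case, updated online from CI feedback.
values = {"t1": 0.0, "t2": 0.0, "t3": 0.0}
ALPHA = 0.5  # learning rate

def reward(failed, duration):
    """Zero-or-positive reward: failures are valuable, fast tests slightly preferred."""
    return (1.0 if failed else 0.0) + 0.1 / (1.0 + duration)

def update(test, failed, duration):
    values[test] += ALPHA * (reward(failed, duration) - values[test])

def prioritize():
    return sorted(values, key=values.get, reverse=True)

# Simulate a few CI cycles where t3 keeps failing and t1 is slow but green.
for _ in range(5):
    update("t1", failed=False, duration=30.0)
    update("t2", failed=False, duration=2.0)
    update("t3", failed=True, duration=5.0)

print(prioritize())  # t3 first: it has been failing, so it carries the most information
```

Because the signal is only pass/fail and timing, the approach needs no access to source code, matching the language-agnostic claim on the slide.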

Slide 23

5. Detection of code smells and refactoring recommendation
● Finding code smells in code, automatic suggestion of refactoring opportunities
● Features
  ○ structural information of the source code (mostly software metrics)
  ○ patterns for code smells
● Various learners
● Most of the tools are standalone applications or Eclipse plugins
● Challenges
  ○ computational and memory complexity
  ○ ambiguous evaluation metrics
  ○ low agreement between different detectors
  ○ datasets
  ○ design patterns
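The metric-based route can be sketched with hand-set thresholds; real detectors typically learn such thresholds, or feed the metrics into a classifier. The `max_loc`/`max_params` values below are arbitrary, and the crude signature parsing is for illustration only:

```python
def metrics(source):
    """Two toy metrics for a single function: non-blank LOC and parameter count."""
    lines = [l for l in source.splitlines() if l.strip()]
    params = source.split("(", 1)[1].split(")", 1)[0]
    n_params = len([p for p in params.split(",") if p.strip()])
    return {"loc": len(lines), "params": n_params}

def smells(source, max_loc=20, max_params=4):
    m = metrics(source)
    found = []
    if m["loc"] > max_loc:
        found.append("Long Method")
    if m["params"] > max_params:
        found.append("Long Parameter List")
    return found

code = "def f(a, b, c, d, e):\n" + "\n".join(f"    x{i} = {i}" for i in range(25))
print(smells(code))
```

The "low agreement between detectors" challenge follows directly from this design: two tools with slightly different thresholds will flag different sets of methods.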

Slide 24

Automatic recommendation of refactoring opportunities
● Detection of defects in object-oriented architecture and automatic recommendation of appropriate refactorings that optimize code structure
  ○ clustering ensemble of 3 existing approaches
  ○ path-based representations + SVM model (work in progress)
● ArchitectureReloaded plugin for IntelliJ IDEA
  ○ https://plugins.jetbrains.com/plugin/10411-architecturereloaded
● A dataset generator and a dataset for evaluation of Move Method refactoring recommendation approaches
Bryksin et al. Automatic Recommendation of Move Method Refactorings Using Clustering Ensembles (IWoR’18)
Novozhilov et al. Evaluation of Move Method refactorings recommendation algorithms: are we doing it right? (IWoR’19)
Kurbatova et al. Recommendation of Move Method Refactoring Using Path-Based Representation of Code (IWoR’20)

Slide 25

6. Duplicate management in SE
● Copy/paste is evil (?)
● Duplicates in source code, documentation, …
  ○ detection of duplicated knowledge
● 4 types of code clones
● All kinds of embeddings and learners involved
● Language-specific and language-agnostic algorithms
● Challenges
  ○ computational time
  ○ semantic clones
Chanchal Roy and James Cordy. A Survey on Software Clone Detection Research (2007)
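A Type-2 clone detector in miniature: normalize identifiers and literals, then compare token bags. The keyword list is truncated and the Jaccard measure is one illustrative choice among many:

```python
import re
from collections import Counter

KEYWORDS = {"def", "return", "for", "in", "if", "else", "while"}

def tokens(code):
    """Token bag with identifiers -> ID and numbers -> NUM, so renamings don't matter."""
    out = []
    for t in re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code):
        if t.isdigit():
            out.append("NUM")
        elif t[0].isalpha() or t[0] == "_":
            out.append(t if t in KEYWORDS else "ID")
        else:
            out.append(t)
    return Counter(out)

def similarity(a, b):
    ta, tb = tokens(a), tokens(b)
    return sum((ta & tb).values()) / sum((ta | tb).values())  # Jaccard on multisets

f1 = "def total(xs):\n    s = 0\n    for x in xs:\n        s = s + x\n    return s"
f2 = "def acc(items):\n    r = 0\n    for i in items:\n        r = r + i\n    return r"
print(similarity(f1, f2))  # 1.0: a Type-2 clone after normalization
```

Type-4 (semantic) clones defeat this entirely, which is why they appear under challenges: two token-wise unrelated functions can compute the same thing.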

Slide 26

CCLearner
Li et al. CCLearner: A Deep Learning-Based Clone Detection Approach (ICSME’17)

Slide 27

AntiCopyPaster
Kirilenko et al. AntiCopyPaster: extracting code duplicates as soon as they are introduced in the IDE (submitted to MSR’21)
https://doi.org/10.5281/zenodo.4432720

Slide 28

7. Code completion
● Make the best suggestion based on the context observed in training
  ○ full-line and snippet-based completion
● Extracting context
  ○ rich structural features (e.g. types)
  ○ recurring patterns in source code based on text mining techniques
  ○ everything in between
● Various learners
  ○ mostly deep learning models
● Challenges
  ○ performance and memory limitations
  ○ synthetic datasets and evaluation approaches
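The text-mining end of the spectrum can be sketched with a bigram language model over a token corpus: count which token follows which, and suggest the most frequent continuation. The four-line corpus below is made up:

```python
from collections import Counter, defaultdict

CORPUS = [
    "for i in range ( n ) :",
    "for item in items :",
    "for i in range ( len ( xs ) ) :",
    "if i in seen :",
]

bigrams = defaultdict(Counter)
for line in CORPUS:
    toks = line.split()
    for prev, nxt in zip(toks, toks[1:]):
        bigrams[prev][nxt] += 1

def complete(prev_token, k=2):
    """Suggest the k most likely next tokens given the previous one."""
    return [tok for tok, _ in bigrams[prev_token].most_common(k)]

print(complete("in"))  # 'range' is the most frequent continuation of 'in' here
```

Modern completion models replace the bigram table with a neural language model, but the memory/latency challenge on the slide is already visible: the table grows with the vocabulary squared.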

Slide 29

Deep-AutoCoder
Hu et al. Deep-AutoCoder: Learning to Complete Code Precisely with Induced Code Tokens (COMPSAC’19)

Slide 30

8. Processing code changes
● Detection of refactorings
● Predicting, analyzing and fixing bugs
● Auto-patching
● Test generation
● Vector representations of code changes
  ○ edit scripts
  ○ all kinds of neural networks
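An edit script between two versions of a snippet can be sketched with standard diff machinery; this works line-by-line, whereas research tools usually diff ASTs so that edits map to syntactic units:

```python
import difflib

before = ["x = compute()", "if x > 0:", "    log(x)", "return x"]
after  = ["x = compute()", "if x >= 0:", "    log(x)", "    audit(x)", "return x"]

def edit_script(a, b):
    """A flat edit script (replace/insert/delete operations) between two code versions."""
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if tag != "equal":
            ops.append((tag, a[i1:i2], b[j1:j2]))
    return ops

for op in edit_script(before, after):
    print(op)
```

Sequences of such operations are exactly the raw material that change-embedding models consume, whether as hand-crafted edit features or as input to a neural encoder.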

Slide 31

Classification of error types for programming MOOCs
● Based on clustering code changes (bug fixes)
  ○ fixes as edit scripts between the ASTs of incorrect and correct submissions
● Prototype implementation for an introductory Java MOOC
  ○ ~1M submissions for 34 tasks (5–21 LOC each)
  ○ currently being integrated into Stepik.org
Lobanov et al. Automatic Classification of Error Types in Solutions to Programming Assignments at Online Learning Platform (AIED’19)

Slide 32

9. Anomaly detection in SE
● Used to find possible
  ○ bugs
  ○ security issues
  ○ architectural design flaws
  ○ workflow errors
  ○ synchronization errors in concurrent programs
  ○ performance issues
  ○ compiler defects and atypical programs
● Mostly unsupervised learning
  ○ anomaly detection algorithms, clustering, autoencoders, statistical methods, …
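The simplest statistical method from that list: flag items whose metric deviates strongly from the codebase mean. The function names and complexity values below are invented for illustration:

```python
import statistics

# Cyclomatic complexity per function in a made-up codebase.
complexities = {"parse": 4, "render": 6, "dispatch": 5, "migrate_legacy": 48,
                "validate": 3, "save": 5, "load": 4}

def anomalies(metric_by_item, threshold=2.0):
    """Flag items whose metric lies more than `threshold` std devs from the mean."""
    vals = list(metric_by_item.values())
    mu, sigma = statistics.mean(vals), statistics.stdev(vals)
    return [name for name, v in metric_by_item.items() if abs(v - mu) / sigma > threshold]

print(anomalies(complexities))  # ['migrate_legacy']
```

Note that the outlier itself inflates the standard deviation, so z-scoring misses moderate anomalies; robust statistics or dedicated detectors (isolation forests, autoencoders) handle this better on real code corpora.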

Slide 33

Finding anomalies in Kotlin programs
● A dataset of more than 1.5 million unique files
● Several experiments with different features and anomaly detectors
● Analysis of both source code and bytecode
● Detected 30 types of code anomalies; ~60 reported unique anomalies of 10 types were used by the Kotlin compiler team
Bryksin et al. Using Large-Scale Anomaly Detection on Code to Improve Kotlin Compiler (MSR’20)

Slide 34

10. Program synthesis
● User intent or constraints -> program
  ○ usually involves a search over some space of programs
● Program repair, automatic programming
● Deductive synthesis, transformation-based synthesis
● Inductive synthesis
  ○ input-output examples, natural language, partial programs, grammars, assertions
● Challenges
  ○ search space
  ○ user intent
  ○ search technique
Gulwani et al. Program Synthesis (Foundations and Trends in Programming Languages, Vol. 4, No. 1-2, 2017)
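Inductive synthesis from input-output examples can be sketched as enumerative search over a tiny expression DSL. The grammar below (a handful of unary operations over one input, composed up to a small depth) is an arbitrary toy, not any published system:

```python
from itertools import product

# Tiny DSL: named expressions over the single input x.
UNARY = {"x": lambda x: x, "x+1": lambda x: x + 1, "x*2": lambda x: x * 2,
         "x*x": lambda x: x * x, "x-1": lambda x: x - 1}

def synthesize(examples, depth=2):
    """Enumerate compositions of DSL operations until one fits all I/O examples."""
    programs = dict(UNARY)
    for _ in range(depth):
        for (n1, f1), (n2, f2) in product(list(programs.items()), list(UNARY.items())):
            name = n2.replace("x", f"({n1})")
            # default args freeze f1/f2 so each lambda keeps its own pair
            programs.setdefault(name, lambda x, f1=f1, f2=f2: f2(f1(x)))
    for name, fn in programs.items():
        if all(fn(i) == o for i, o in examples):
            return name
    return None

# Target: f(x) = 2x + 1, specified only by examples.
print(synthesize([(0, 1), (1, 3), (5, 11)]))
```

All three challenges on the slide are visible even at this scale: the program space explodes with depth (search space and technique), and several distinct expressions can fit a small example set (user intent).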

Slide 35

Bayou (Bayesian Sketch Learning)
● Input: method calls and class names
  ○ aimed at generating API-heavy code
● Sketches for program representation
● Bayesian encoder-decoder technique
● Combinatorial concretization
  ○ random walk-based technique
● IntelliJ IDEA plugin
  ○ implementations for the Java standard library and Android SDK
  ○ https://plugins.jetbrains.com/plugin/10729-bsl-code-synthesizer
Murali et al. Neural sketch learning for conditional program generation (ICLR’17)
Vladislav Tankov and Timofey Bryksin. Data-based code synthesis in IntelliJ IDEA (SEIM’18)

Slide 36

11. Code summarization
● Generation of NL sequences from source code snippets
  ○ creating documentation, suggesting better function names, commit messages, etc.
● Approaches
  ○ rule/template-based text generation
  ○ models adopted from the IR and NLP domains
  ○ deep learning models from the NMT field
    ■ code2seq (Alon et al., 2019)
● CoNaLa: The Code/Natural Language Challenge
  ○ https://conala-corpus.github.io
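The rule/template-based approach can be sketched in a few lines: split the identifier into words and slot them into a fixed sentence template. The template itself is an arbitrary choice for illustration:

```python
import re

def split_name(name):
    """Split camelCase / snake_case identifiers into lowercase words."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name).replace("_", " ")
    return spaced.lower().split()

def summarize(method_name, params):
    """Template-based summary: '<verb>s the <object> for the given <params>.'"""
    words = split_name(method_name)
    verb, obj = words[0], " ".join(words[1:]) or "result"
    summary = f"{verb.capitalize()}s the {obj}"
    if params:
        summary += " for the given " + " and ".join(params)
    return summary + "."

print(summarize("getUserName", ["userId"]))  # 'Gets the user name for the given userId.'
```

Templates are brittle (they paraphrase the name rather than the behavior), which is exactly the gap the IR-based and NMT-based models on this slide try to close.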

Slide 37

PtrGNCMsg: generation of commit messages
Liu et al. Generating Commit Messages from Diffs using Pointer-Generator Network (MSR’19)

Slide 38

Tools for ML
● Data availability, collection, cleaning, and management
● End-to-end pipeline support
● Rich visualisation tools
● Model evolution
● Debugging, testing and interpretability
● Integration of ML pipelines into production
Amershi et al. Software Engineering for Machine Learning: A Case Study (ICSE’19)

Slide 39

(Some of the) Challenges of ML4SE
● Feature engineering
● Datasets & mining tools
● Reproducibility
● Extensibility
● Interpretability
● Evaluation metrics
● ML for the sake of ML
● Immaturity for the real world
● Gap between academia and industry

Slide 40

Thank you!
[email protected]
https://research.jetbrains.org/groups/ml_methods/