
Machine Learning + Software Engineering


Machinelearner

January 28, 2021



Transcript

1. What is Software Engineering?
• A field of knowledge, a scientific and engineering discipline
  ◦ accumulation of experience
  ◦ discovery of best practices
• NATO Software Engineering Conferences, 1968–1969
• "Software crisis"
  ◦ budget overruns
  ◦ missed deadlines
  ◦ ineffective software
  ◦ poor quality software
  ◦ requirements that are unclear and not satisfied
  ◦ uncontrollable projects
  ◦ maintenance hell
  ◦ ...
http://homepages.cs.ncl.ac.uk/brian.randell/NATO/nato1968.PDF
http://homepages.cs.ncl.ac.uk/brian.randell/NATO/nato1969.PDF
2. Software Engineering Body of Knowledge (SWEBOK)
1. Software Requirements
2. Software Design
3. Software Construction
4. Software Testing
5. Software Maintenance
6. Software Configuration Management
7. Software Engineering Management
8. Software Engineering Process
9. Software Engineering Models and Methods
10. Software Quality
11. Software Engineering Professional Practice
12. Software Engineering Economics
13. Computing, Mathematical and Engineering Foundations
https://www.computer.org/education/bodies-of-knowledge/software-engineering
3. Software engineering entities
• Type
  ◦ processes (design, coding, testing, ...)
  ◦ products (documents, code, deliverables, ...)
  ◦ resources (personnel, tools, hardware, ...)
• Attributes
  ◦ internal
  ◦ external
4. Applications of ML in SE
• Predicting or estimating measurements of software entities
• Discovering properties of software entities
• Enhancing software entities
• Transforming products
• Synthesizing/generating products
• Reusing products
• Enhancing processes
5. What this talk is about
• ML in software development processes & tools
  ◦ 11 examples of application areas
  ◦ brief history & recent works
  ◦ problems and challenges
• Tool support for ML
6. Where does the SE data come from?
• Formal data
  ◦ source code
  ◦ configuration files
  ◦ binary code
  ◦ logs and execution traces
• Text
  ◦ documentation
  ◦ specifications, design documents
  ◦ communication tools (email, Q&A forums, ...)
• Metadata
  ◦ version control systems
  ◦ bug/issue/task trackers, planning systems, ...
7. How to get this data
• Existing datasets for specific tasks
• SE data repositories (Qualitas Corpus, PROMISE, ...)
• Code databases (PGA, GHTorrent, GH Archive, ...)
• Software data in the wild (GitHub, Gerrit, Jira, ...)
  ◦ lack of appropriate mining tools
  ◦ proper data mining is hard
    ▪ especially when applied to VCS data
8. Building vector representations
• Text
  ◦ various NLP techniques
• Source code embeddings
  ◦ explicit features (software metrics, simple NLP features, path-based representations, ...)
  ◦ implicit features (N-grams, AST encodings, feature hashing, autoencoders, GNNs, ...; a feature-hashing sketch follows this list)
  ◦ on different levels
    ▪ tokens, methods, API calls, system events, execution traces, code changes, ...
• IR, binary code
  ◦ paths in CFG, NLP features, bitmaps, ...
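To make the implicit-features idea concrete, here is a minimal sketch of feature hashing over token bigrams. The snippet, vector size, and hashing scheme are all illustrative and not taken from any of the approaches above:

```python
import hashlib
import io
import tokenize

def code_to_vector(source: str, dim: int = 64, n: int = 2) -> list:
    """Embed a code snippet as a bag of hashed token n-grams."""
    toks = [t.string for t in tokenize.generate_tokens(io.StringIO(source).readline)
            if t.string.strip()]
    vec = [0.0] * dim
    for i in range(len(toks) - n + 1):
        ngram = " ".join(toks[i:i + n])
        bucket = int(hashlib.md5(ngram.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0  # hash collisions are tolerated by design
    return vec

print(code_to_vector("def add(a, b):\n    return a + b\n"))
```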
9. code2vec: AST path-contexts
(elements, Name↑FieldAccess↑Foreach↓Block↓IfStmt↓Block↓Return↓BooleanExpr, true)
Alon et al. code2vec: Learning Distributed Representations of Code (POPL'19)
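For intuition, the sketch below extracts code2vec-style path-contexts from a Python AST: for each pair of leaves, the path climbs from the left leaf to their lowest common ancestor and descends to the right leaf. The paper works on Java with its own node labels; everything here is a simplified reconstruction:

```python
import ast
import itertools

def leaf_paths(tree):
    """Collect (leaf_token, root-to-leaf path of (node_id, node_label)) pairs."""
    out = []
    def walk(node, path):
        step = path + [(id(node), type(node).__name__)]
        children = [c for c in ast.iter_child_nodes(node)
                    if not isinstance(c, (ast.Load, ast.Store, ast.Del))]
        if not children:
            for attr in ("id", "arg", "value"):
                if hasattr(node, attr):
                    out.append((str(getattr(node, attr)), step))
                    break
            else:
                out.append((type(node).__name__, step))
        for child in children:
            walk(child, step)
    walk(tree, [])
    return out

def path_contexts(source: str):
    """Yield (left_token, path between the leaves via their LCA, right_token)."""
    leaves = leaf_paths(ast.parse(source))
    for (tok_a, pa), (tok_b, pb) in itertools.combinations(leaves, 2):
        k = 0                                # length of the common prefix
        while k < min(len(pa), len(pb)) and pa[k][0] == pb[k][0]:
            k += 1
        up = [label for _, label in reversed(pa[k - 1:])]   # leaf a up to the LCA
        down = [label for _, label in pb[k:]]               # below the LCA to leaf b
        yield tok_a, "↑".join(up) + "↓" + "↓".join(down), tok_b

sample = "def is_empty(xs):\n    return len(xs) == 0\n"
for ctx in itertools.islice(path_contexts(sample), 3):
    print(ctx)
```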
10. code2vec: the neural architecture
Alon et al. code2vec: Learning Distributed Representations of Code (POPL'19)
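The heart of the architecture is an attention-weighted aggregation of encoded path-contexts into a single code vector. A NumPy sketch with random stand-in parameters (real dimensions and weights come from training):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128                                  # context embedding size (illustrative)
contexts = rng.normal(size=(20, d))      # 20 encoded path-contexts of one method

W = rng.normal(size=(d, d))              # fully connected layer (learned; random here)
a = rng.normal(size=d)                   # global attention vector (learned; random here)

combined = np.tanh(contexts @ W)         # combine each path-context
scores = combined @ a                    # attention logits, one per context
weights = np.exp(scores - scores.max())
weights /= weights.sum()                 # softmax over the 20 contexts
code_vector = weights @ combined         # attention-weighted sum
print(code_vector.shape)                 # (128,)
```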
11. code2vec: suggesting method names
Alon et al. code2vec: Learning Distributed Representations of Code (POPL'19)
12. 1. Estimating size, cost and effort
• Various targets: project, development, maintenance, correction, ...
• One of the first research tasks tackled in SE
  ◦ a survey published in 2007 covers 250+ methods
• Types of models
  ◦ expert judgement
  ◦ parametric models (Use Cases, Function Points, COCOMO I/II, ...; see the sketch below)
  ◦ non-parametric models (estimation via analogy)
• Various learners
  ◦ linear regression, Bayesian networks, GAs, NNs, DTs, HMMs, association rules, ...
• Challenges
  ◦ the factors that affect effort and productivity are not well understood
  ◦ lack of decent historical and production data
  ◦ the need to adjust models to the local environment
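As an example of the parametric family, basic COCOMO estimates effort as a power law of program size. A minimal sketch using the classic "organic mode" coefficients (Boehm, 1981):

```python
def cocomo_basic_effort(kloc: float, a: float = 2.4, b: float = 1.05) -> float:
    """Basic COCOMO effort in person-months, 'organic mode' coefficients."""
    return a * kloc ** b

effort = cocomo_basic_effort(32)         # a 32 KLOC project
schedule = 2.5 * effort ** 0.38          # nominal development time, months
print(f"{effort:.1f} person-months, {schedule:.1f} months")
```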
13. 2. Software quality prediction
• Defect estimation, bug prediction, reliability prediction
  ◦ 200+ papers reviewed in a 2012 study
• Input data
  ◦ process, software and developer metrics, change data, historical data (previous bug reports)
• Various learners
  ◦ genetic programming, neural networks, decision trees, Bayesian belief networks, ...
  ◦ unsupervised approaches
  ◦ effort-aware defect prediction
• Popular datasets
  ◦ 12 NASA datasets
  ◦ a selection of open-source projects (PROMISE, Eclipse dataset, ...)
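A toy end-to-end version of metric-based defect prediction, with a decision tree as the learner. The module metrics and labels are invented for illustration, not drawn from the datasets above:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# toy module-level metrics: [LOC, cyclomatic complexity, churn]; label: had a bug?
X = np.array([[120, 4, 2], [900, 25, 14], [60, 2, 1], [450, 18, 9],
              [300, 7, 3], [1200, 31, 22], [80, 3, 0], [640, 21, 11]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(clf.predict([[700, 20, 10]]))   # defect-proneness of an unseen module
```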
14. Deep learning-based bug detectors
• Learns representations with a word2vec-like neural architecture
• Trained on an artificially generated dataset (correct code plus seeded bugs)
• Models detect swapped arguments, incorrect binary operators, and incorrect binary operands
• Plugins for WebStorm (JavaScript) and PyCharm (Python)
  ◦ https://plugins.jetbrains.com/plugin/12220-deepbugsjavascript
  ◦ https://plugins.jetbrains.com/plugin/12218-deepbugspython
Michael Pradel and Koushik Sen. DeepBugs: A Learning Approach to Name-based Bug Detection (OOPSLA'18)
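The idea behind the swapped-argument detector, schematically: embed the callee and argument names, feed them to a binary classifier, and flag calls whose argument order scores as likely buggy. The embeddings and weights below are random stand-ins, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(1)
emb = {name: rng.normal(size=16) for name in ("setSize", "width", "height")}
w = rng.normal(size=3 * 16)              # a "trained" classifier (random here)

def buggy_score(callee: str, arg1: str, arg2: str) -> float:
    """Probability-like score that the call's argument order is wrong."""
    x = np.concatenate([emb[callee], emb[arg1], emb[arg2]])
    return 1.0 / (1.0 + np.exp(-x @ w))  # logistic regression over embeddings

# the detector compares a call against its swapped variant
print(buggy_score("setSize", "width", "height"),
      buggy_score("setSize", "height", "width"))
```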
15. 3. Automatic software repair
• Fixing software bugs without human intervention
  ◦ errors, faults, and failures
• Behavioral repairs (compile time)
  ◦ oracles (tests, pre- and post-conditions, behavioral models)
  ◦ static analysis
  ◦ domain-specific
• State repairs (runtime)
  ◦ reinitialization and restart, checkpoint and rollback, reconfiguration, ...
• Challenges
  ◦ non-trivial syntactic and semantic bugs
16. Prophet
• Generate-and-validate patching (see the sketch below)
• Localize the defect
  ◦ execution traces on negative and positive inputs
• Generate candidate patches
  ◦ modification of only one statement
• Rank candidates
  ◦ program value features and modification features
  ◦ probabilistic model of correct code
• Validate candidates
  ◦ test suite as an oracle
  ◦ passed tests == fixed bug?
Fan Long and Martin Rinard. Automatic Patch Generation by Learning Correct Code (POPL'16)
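A minimal generate-and-validate loop on a toy Python function. Prophet itself works on C and ranks candidates with a learned probabilistic model; here the candidate list and its order are hard-coded for illustration:

```python
def apply_patch(source: str, patch: tuple) -> str:
    old, new = patch
    return source.replace(old, new, 1)   # single-statement, single-site edit

def passes_tests(source: str, tests) -> bool:
    scope = {}
    try:
        exec(source, scope)
        return all(t(scope) for t in tests)
    except Exception:
        return False

buggy = "def max2(a, b):\n    return a if a < b else b\n"   # comparison flipped
candidates = [("a < b", "a > b"), ("a if", "b if")]         # ranking omitted
tests = [lambda s: s["max2"](1, 2) == 2,
         lambda s: s["max2"](5, 3) == 5]                    # the test-suite oracle

for patch in candidates:
    if passes_tests(apply_patch(buggy, patch), tests):
        print("plausible patch:", patch)   # "plausible": it only passes the tests
        break
```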
17. 4. ML applications in software testing
• Tasks
  ◦ test and test data generation
  ◦ fault localization
  ◦ code repair
  ◦ test prioritization
  ◦ finding relevant tests
  ◦ estimation of testing effort
  ◦ replacement of test suites
• Data
  ◦ execution traces, logs
  ◦ coverage information
  ◦ failure data: where and why
• https://testsigma.com, https://eggplant.io/, ...
18. Test case prioritization and selection
• Model-free, online reinforcement learning method
  ◦ language-agnostic, requires no source code access
• Rewards based on duration, last execution time, and failure history
  ◦ rewards are either zero or positive
• An effective prioritization strategy is discovered after ~60 CI cycles
Spieker et al. Reinforcement Learning for Automatic Test Case Prioritization and Selection in Continuous Integration (ISSTA'17)
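A bare-bones sketch of the online setting. This is my simplification (a tabular priority per test, updated from zero-or-positive rewards); the paper evaluates several reward functions and memory representations:

```python
from collections import defaultdict

priority = defaultdict(float)   # learned priority per test case
ALPHA = 0.1                     # learning rate

def ci_cycle(tests, run_test, budget):
    """One CI cycle: schedule the top-`budget` tests, learn from the outcomes."""
    ranked = sorted(tests, key=lambda t: priority[t], reverse=True)
    for t in ranked[:budget]:
        reward = 1.0 if run_test(t) else 0.0      # zero or positive, as in the paper
        priority[t] += ALPHA * (reward - priority[t])

# toy usage: test "t2" always fails and floats to the top over the cycles
for _ in range(60):
    ci_cycle(["t1", "t2", "t3"], run_test=lambda t: t == "t2", budget=2)
print(max(priority, key=priority.get))            # -> t2
```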
19. 5. Detection of code smells and refactoring recommendation
• Finding code smells in code, automatically suggesting refactoring opportunities
• Features
  ◦ structural information from the source code (mostly software metrics)
  ◦ patterns for code smells
• Various learners
• Most tools are standalone applications or Eclipse plugins
• Challenges
  ◦ computational and memory complexity
  ◦ ambiguous evaluation metrics
  ◦ low agreement between different detectors
  ◦ datasets
  ◦ design patterns
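The classic metric-based baseline is a threshold rule. Below is a God Class check in the style of Lanza and Marinescu's detection strategies; the thresholds should be treated as illustrative, and learned detectors calibrate them instead:

```python
def is_god_class(m: dict) -> bool:
    """Naive God Class heuristic over per-class metrics (thresholds illustrative)."""
    return (m["wmc"] > 47        # weighted methods per class: very high complexity
            and m["atfd"] > 5    # accesses to foreign data: uses other classes' data
            and m["tcc"] < 1/3)  # tight class cohesion: low internal cohesion

print(is_god_class({"wmc": 61, "atfd": 9, "tcc": 0.12}))   # -> True
```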
20. Automatic recommendation of refactoring opportunities
• Detection of defects in object-oriented architecture and automatic recommendation of appropriate refactorings that optimize code structure
  ◦ a clustering ensemble of 3 existing approaches
  ◦ path-based representations + an SVM model (work in progress)
• ArchitectureReloaded plugin for IntelliJ IDEA
  ◦ https://plugins.jetbrains.com/plugin/10411-architecturereloaded
• A dataset generator and a dataset for evaluating Move Method refactoring recommendation approaches
Bryksin et al. Automatic Recommendation of Move Method Refactorings Using Clustering Ensembles (IWoR'18)
Novozhilov et al. Evaluation of Move Method Refactorings Recommendation Algorithms: Are We Doing It Right? (IWoR'19)
Kurbatova et al. Recommendation of Move Method Refactoring Using Path-Based Representation of Code (IWoR'20)
21. 6. Duplicate management in SE
• Copy/paste is evil (?)
• Duplicates in source code, documentation, ...
  ◦ detection of duplicated knowledge
• 4 types of code clones
• All kinds of embeddings and learners involved
• Language-specific and language-agnostic algorithms
• Challenges
  ◦ computation time
  ◦ semantic clones
Chanchal Roy and James Cordy. A Survey on Software Clone Detection Research (2007)
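For intuition about the first clone types, here is a tiny detector that normalizes identifiers and literals (Type-2 normalization) and groups fragments whose normalized token sequences match exactly. Production tools add suffix trees, AST-subtree hashing, or embeddings on top of this idea:

```python
import io
import keyword
import token
import tokenize
from collections import defaultdict

def normalized_tokens(source: str) -> tuple:
    """Replace identifiers and literals with placeholders (Type-2 normalization)."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == token.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")
        elif tok.type in (token.NUMBER, token.STRING):
            out.append("LIT")
        elif tok.string.strip():
            out.append(tok.string)
    return tuple(out)

def clone_groups(fragments: dict) -> list:
    """Group fragments whose normalized token sequences coincide."""
    buckets = defaultdict(list)
    for name, src in fragments.items():
        buckets[normalized_tokens(src)].append(name)
    return [names for names in buckets.values() if len(names) > 1]

print(clone_groups({"a": "x = price * 2\n",
                    "b": "y = cost * 3\n",       # same shape, different names
                    "c": "if done: stop()\n"}))  # different shape
```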
22. AntiCopyPaster
Kirilenko et al. AntiCopyPaster: Extracting Code Duplicates As Soon As They Are Introduced in the IDE (submitted to MSR'21)
https://doi.org/10.5281/zenodo.4432720
23. 7. Code completion
• Make the best suggestion based on the context observed during training
  ◦ full-line and snippet-based completion
• Extracting context
  ◦ rich structural features (e.g. types)
  ◦ recurring patterns in source code, found with text mining techniques
  ◦ and everything in between
• Various learners
  ◦ mostly deep learning models
• Challenges
  ◦ performance and memory limitations
  ◦ synthetic datasets and evaluation approaches
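Before deep models, the simplest statistical completion engines were token n-gram models; a minimal bigram sketch over a stand-in corpus:

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count token bigrams: the simplest statistical completion model."""
    model = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        model[prev][nxt] += 1
    return model

corpus = "for i in range ( n ) : total += i".split()   # stand-in training data
model = train_bigram(corpus)

def suggest(prev_token: str, k: int = 3):
    return [tok for tok, _ in model[prev_token].most_common(k)]

print(suggest("range"))   # -> ['(']
```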
24. Deep-AutoCoder
Hu et al. Deep-AutoCoder: Learning to Complete Code Precisely with Induced Code Tokens (COMPSAC'19)
25. 8. Processing code changes
• Detection of refactorings
• Predicting, analyzing and fixing bugs
• Auto-patching
• Test generation
• Vector representations of code changes
  ◦ edit scripts (see the sketch below)
  ◦ all kinds of neural networks
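At the simplest, a code change can be serialized as a line-level edit script; real work typically uses tree diffs over ASTs (e.g. GumTree-style edit scripts). This difflib sketch only illustrates the idea:

```python
import difflib

before = ["def area(r):", "    return 3.14 * r * r"]
after  = ["import math", "def area(r):", "    return math.pi * r ** 2"]

# serialize the change as an edit script of insert/delete/replace operations
matcher = difflib.SequenceMatcher(a=before, b=after)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        print(tag, before[i1:i2], "->", after[j1:j2])
```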
26. Classification of error types for programming MOOCs
• Based on clustering code changes (bug fixes)
  ◦ fixes represented as edit scripts between the ASTs of incorrect and correct submissions
• Prototype implementation for an introductory Java MOOC
  ◦ ~1M submissions for 34 tasks (5–21 LOC each)
  ◦ currently being integrated into Stepik.org
Lobanov et al. Automatic Classification of Error Types in Solutions to Programming Assignments at Online Learning Platform (AIED'19)
27. 9. Anomaly detection in SE
• Used to find possible
  ◦ bugs
  ◦ security issues
  ◦ architectural design flaws
  ◦ workflow errors
  ◦ synchronization errors in concurrent programs
  ◦ performance issues
  ◦ compiler defects and atypical programs
• Mostly unsupervised learning
  ◦ anomaly detection algorithms, clustering, autoencoders, statistical methods, ...
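A minimal unsupervised setup on made-up per-file metrics (the work cited on the next slide uses much richer source code and bytecode features), with Isolation Forest as the detector:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# toy per-file features: [LOC, max nesting depth, identifiers per line]
normal_files = rng.normal(loc=[200, 3, 2.5], scale=[50, 1, 0.5], size=(500, 3))
suspicious = [[4000, 14, 0.2]]           # one very atypical file

detector = IsolationForest(contamination=0.01, random_state=0).fit(normal_files)
print(detector.predict(suspicious))      # -1 marks an anomaly
```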
28. Finding anomalies in Kotlin programs
• A dataset of more than 1.5 million unique files
• Several experiments with different features and anomaly detectors
• Analysis of both source code and bytecode
• Detected 30 types of code anomalies; ~60 reported unique anomalies of 10 types were used by the Kotlin compiler team
Bryksin et al. Using Large-Scale Anomaly Detection on Code to Improve Kotlin Compiler (MSR'20)
29. 10. Program synthesis
• User intent or constraints -> program
  ◦ usually involves a search over some space of programs
• Program repair, automatic programming
• Deductive synthesis, transformation-based synthesis
• Inductive synthesis (see the sketch below)
  ◦ input-output examples, natural language, partial programs, grammars, assertions
• Challenges
  ◦ search space
  ◦ user intent
  ◦ search technique
Gulwani et al. Program Synthesis (Foundations and Trends in Programming Languages, Vol. 4, No. 1-2, 2017)
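A toy inductive synthesizer over an invented four-operation DSL: enumerate programs in order of length and return the first one consistent with all input-output examples. Real systems prune this search aggressively:

```python
from itertools import product

OPS = {"inc": lambda x: x + 1,           # a four-operation toy DSL
       "dec": lambda x: x - 1,
       "dbl": lambda x: x * 2,
       "neg": lambda x: -x}

def run(program, x):
    for op in program:
        x = OPS[op](x)
    return x

def synthesize(examples, max_len=3):
    """Return the shortest op sequence consistent with all I/O examples."""
    for length in range(1, max_len + 1):
        for program in product(OPS, repeat=length):
            if all(run(program, i) == o for i, o in examples):
                return program
    return None

print(synthesize([(1, 4), (3, 8)]))      # -> ('inc', 'dbl'), i.e. (x + 1) * 2
```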
30. Bayou (Bayesian Sketch Learning)
• Input: method calls and class names
  ◦ aimed at generating API-heavy code
• Sketches as the program representation
• Bayesian encoder-decoder technique
• Combinatorial concretization
  ◦ a random-walk-based technique
• IntelliJ IDEA plugin
  ◦ implementations for the Java standard library and the Android SDK
  ◦ https://plugins.jetbrains.com/plugin/10729-bsl-code-synthesizer
Murali et al. Neural Sketch Learning for Conditional Program Generation (ICLR'18)
Vladislav Tankov and Timofey Bryksin. Data-Based Code Synthesis in IntelliJ IDEA (SEIM'18)
31. 11. Code summarization
• Generation of NL sequences from source code snippets
  ◦ creating documentation, suggesting better function names, commit messages, etc.
• Approaches
  ◦ rule/template-based text generation (see the sketch below)
  ◦ models adopted from the IR and NLP domains
  ◦ deep learning models from the NMT field
    ▪ code2seq (Alon et al., 2019)
• CoNaLa: The Code/Natural Language Challenge
  ◦ https://conala-corpus.github.io
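The rule/template-based end of the spectrum is almost trivial to sketch; the neural approaches (e.g. code2seq) learn this mapping instead of hard-coding it. A hypothetical template over a Python function:

```python
import ast

def summarize(source: str) -> str:
    """One-line summary from the function name and parameters."""
    fn = ast.parse(source).body[0]               # assumes a single function def
    action = fn.name.replace("_", " ")
    params = ", ".join(a.arg for a in fn.args.args)
    return f"{action} (parameters: {params})" if params else action

print(summarize("def parse_config_file(path):\n    ...\n"))
# -> "parse config file (parameters: path)"
```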
32. PtrGNCMsg: generation of commit messages
Liu et al. Generating Commit Messages from Diffs Using Pointer-Generator Network (MSR'19)
33. Tools for ML
• Data availability, collection, cleaning, and management
• End-to-end pipeline support
• Rich visualisation tools
• Model evolution
• Debugging, testing and interpretability
• Integration of ML pipelines into production
Amershi et al. Software Engineering for Machine Learning: A Case Study (ICSE'19)
34. (Some of the) Challenges of ML4SE
• Feature engineering
• Datasets & mining tools
• Reproducibility
• Extensibility
• Interpretability
• Evaluation metrics
• ML for the sake of ML
• Immaturity for the real world
• Gap between academia and industry
https://d30womf5coomej.cloudfront.net/sa/2c/25ac5102-aa66-4edd-887d-6babe41a20e3.png