
NLP + SE = ❤️


We present an overview of the current opportunities and challenges of applying NLP techniques to solve software engineering problems.

Georgios Gousios

October 24, 2019


Transcript

  1. Finding bugs
     function setTimeout(callBack, delay) {...};
     browserSingleton.startPoller(100, function(delay, fn) { setTimeout(delay, fn); });
     Annotations: “Name that denotes a function” (×2)
  2. Finding bugs
     function setTimeout(callBack, delay) {...};
     browserSingleton.startPoller(100, function(delay, fn) { setTimeout(delay, fn); });
     Annotations: “Name that denotes a function” (×2), “Order of application” (×2)
  3. “Source code is bimodal: it combines a formal, algorithmic channel and a natural language channel of identifiers and comments.” (Earl Barr, UCL)
  4. Finding bugs
     function setTimeout(callBack, delay) {...};
     browserSingleton.startPoller(100, function(delay, fn) { setTimeout(delay, fn); });
     Annotations: “Natural language channel” (×2)
  5. Finding bugs
     function setTimeout(callBack, delay) {...};
     browserSingleton.startPoller(100, function(delay, fn) { setTimeout(delay, fn); });
     Annotations: “Natural language channel” (×2), “Code semantics channel” (×2)
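To make the “natural language channel” idea concrete, here is a toy, purely lexical check (not the technique of any paper cited in this deck): compare the argument names at a call site against the declared parameter names, and flag the call if swapping the first two arguments would match the parameter names noticeably better. The helper names and the 0.2 margin are invented for this sketch.

    # Toy name-based check: flag call sites whose argument names look
    # "swapped" relative to the declared parameter names.
    from difflib import SequenceMatcher

    def name_similarity(a: str, b: str) -> float:
        """Cheap lexical similarity between two identifiers (0..1)."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def looks_swapped(param_names: list, arg_names: list, margin: float = 0.2) -> bool:
        """True if reversing the first two arguments matches the declared
        parameter names noticeably better than the current order."""
        if len(param_names) < 2 or len(arg_names) < 2:
            return False
        as_is = (name_similarity(param_names[0], arg_names[0])
                 + name_similarity(param_names[1], arg_names[1]))
        swapped = (name_similarity(param_names[0], arg_names[1])
                   + name_similarity(param_names[1], arg_names[0]))
        return swapped > as_is + margin

    # The setTimeout example above: parameters (callBack, delay), call site
    # passes (delay, fn); "delay" matches the second slot far better.
    print(looks_swapped(["callBack", "delay"], ["delay", "fn"]))  # True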
  6. Static Analysis
     • Static analysis only captures the semantics channel
     • Bug detection and other forms of static analysis are pattern matching on increasingly precise semantics
     • Most static bug detectors find a subset of bugs (Habib and Pradel, ASE 2018)
     • Humans need to identify the patterns
     • As the semantics relax, static analysis becomes unsound
     • Almost impossible for dynamic languages (“stringly typed”)
  7. function setTimeout(callBack: a -> b, delay: int) {…};
     browserSingleton.startPoller(100, function(delay, fn) { setTimeout(delay, fn); });
     The compiler can only help if humans add semantic information
  8. How can NLP help?
     NLP approaches to software analysis aim to exploit the natural language information channel to help with tasks such as:
     • Bug finding
     • Type annotations
     • Inconsistencies
     • Source code summarisation
     • …
  9. The Naturalness hypothesis
     “Software is a form of human communication; software corpora have similar statistical properties to natural language corpora; and these properties can be exploited to build better software engineering tools.”
     Hindle et al. On the naturalness of software. ICSE 2012
  10. Naturalness showcased
      Hindle et al. On the naturalness of software. ICSE 2012
      Code n-grams are less “surprising” to a language model than English
  11. Naturalness showcased
      Hindle et al. On the naturalness of software. ICSE 2012
      Code n-grams are less “surprising” to a language model than English
      We can train language models to predict next tokens better in code
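What “surprising” means here, in a nutshell: train an n-gram language model and measure the average number of bits needed to predict the next token (cross-entropy); repetitive, locally predictable token streams score lower. The bigram model, add-one smoothing, and the two toy token streams below are invented for illustration and are not the corpora or models of Hindle et al.

    # Minimal bigram language model with add-one smoothing, to illustrate
    # "surprisal" (per-token cross-entropy) over token streams.
    import math
    from collections import Counter

    def train_bigram(tokens):
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        vocab = len(unigrams)
        def prob(prev, cur):
            # Add-one (Laplace) smoothed P(cur | prev)
            return (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
        return prob

    def cross_entropy(prob, tokens):
        """Average negative log2 probability per token (lower = less surprising)."""
        logs = [-math.log2(prob(p, c)) for p, c in zip(tokens, tokens[1:])]
        return sum(logs) / len(logs)

    # Repetitive "code-like" token stream vs. a more varied "English-like" one.
    code = ("for i in range ( n ) : total = total + x [ i ] ; "
            "for j in range ( n ) : total = total + y [ j ] ;").split()
    english = ("the quick brown fox jumps over the lazy dog while rain "
               "falls softly on ancient quiet rooftops nearby").split()

    print(cross_entropy(train_bigram(code), code))        # lower entropy
    print(cross_entropy(train_bigram(english), english))  # higher entropy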
  12. Finding bugs
      Pradel and Shen. DeepBugs: A Learning Approach to Name-based Bug Detection. OOPSLA 2018
      Training models to distinguish correct from buggy code
      [Diagram labels: “Buggy code”, “Correct code”, “Buggy”, “Correct”]
  13. Finding bugs
      Pradel and Shen. DeepBugs: A Learning Approach to Name-based Bug Detection. OOPSLA 2018
      How to produce buggy code?
      • Swap function arguments: foo(a, b) -> foo(b, a) (see the sketch below)
      • Replace binary operators: i <= length -> i % length
      • Replace binary operands: i <= length -> i <= foo
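A rough sketch of the first mutation (swapped arguments), transposed to Python ASTs purely for illustration; DeepBugs itself mutates JavaScript and learns over embeddings of the involved identifier names.

    # Sketch of DeepBugs-style negative-example generation (illustrative only).
    # For every call with at least two arguments, emit a "buggy" variant with
    # the first two arguments swapped; the original call stays "correct".
    # Note: ast.unparse needs Python 3.9+.
    import ast
    import copy

    def make_swapped_variants(source: str):
        tree = ast.parse(source)
        examples = []  # (label, code) pairs
        for node in ast.walk(tree):
            if isinstance(node, ast.Call) and len(node.args) >= 2:
                buggy_call = copy.deepcopy(node)
                buggy_call.args[0], buggy_call.args[1] = buggy_call.args[1], buggy_call.args[0]
                examples.append(("correct", ast.unparse(node)))
                examples.append(("buggy", ast.unparse(buggy_call)))
        return examples

    code = "set_timeout(tick, delay)\nstart_poller(100, poll)\n"
    for label, snippet in make_swapped_variants(code):
        print(label, snippet)
    # correct set_timeout(tick, delay)
    # buggy   set_timeout(delay, tick)
    # correct start_poller(100, poll)
    # buggy   start_poller(poll, 100)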
  14. Finding bugs
      Pradel and Shen. DeepBugs: A Learning Approach to Name-based Bug Detection. OOPSLA 2018
      Training on 150k JavaScript files
      Accuracy: swapped arguments 94%, wrong binary operator 92%, wrong binary operand 89%
  15. Predicting types
      def bigger_number(a, b):
          if a > b:
              return a
          else:
              return b
      Python 2.7 code. No Python 2.7 after 2019!
  16. Predicting types
      def bigger_number(a: ???, b: ???) -> ???:
          if a > b:
              return a
          else:
              return b
      Python 3.5+ code. Can you guess the types?
  17. Predicting types
      def bigger_number(a: ???, b: ???) -> ???:
          if a > b:
              return a
          else:
              return b
      Python 3.5+ code. Can you guess the types?
      How can we automatically annotate JavaScript/Python code with types?
  18. Predicting types
      def is_bigger(a: int, b: int) -> bool:
          """Returns True if number a is bigger than b, else False"""
          return a > b
      Learning from existing code annotations
  19.–20. Predicting types
      (same code as slide 18)
      Learning from existing code annotations: embedding
  21.–23. Predicting types
      (same code as slide 18)
      Learning from existing code annotations: embedding, sequence learning
  24. Predicting types
      (same code as slide 18)
      Learning from existing code annotations: embedding, sequence learning, concat (+)
  25. Predicting types
      (same code as slide 18)
      Learning from existing code annotations: embedding, sequence learning, concat (+), prediction
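Putting the annotated build-up of slides 18–25 together: embed the natural-language inputs (identifier tokens, docstring tokens), summarise each with a sequence learner, concatenate the two vectors, and predict a type. Below is a minimal Keras-style sketch of that shape; every size, vocabulary, and layer choice is invented, and this is not the authors' implementation.

    # Sketch of the slide 18-25 pipeline: embed parameter-name tokens and
    # docstring tokens, run each through an RNN, concatenate, and predict a
    # type from a fixed type vocabulary.
    from tensorflow import keras
    from tensorflow.keras import layers

    NAME_VOCAB, DOC_VOCAB, TYPE_VOCAB = 10_000, 20_000, 500  # assumed vocab sizes
    NAME_LEN, DOC_LEN = 8, 40                                 # assumed max lengths

    name_in = keras.Input(shape=(NAME_LEN,), name="identifier_tokens")
    doc_in = keras.Input(shape=(DOC_LEN,), name="docstring_tokens")

    # Embedding: map each token id to a dense vector
    name_emb = layers.Embedding(NAME_VOCAB, 64, mask_zero=True)(name_in)
    doc_emb = layers.Embedding(DOC_VOCAB, 64, mask_zero=True)(doc_in)

    # Sequence learning: summarise each token sequence with an LSTM
    name_vec = layers.LSTM(128)(name_emb)
    doc_vec = layers.LSTM(128)(doc_emb)

    # Concat + prediction: join both views and classify into a type vocabulary
    merged = layers.Concatenate()([name_vec, doc_vec])
    type_out = layers.Dense(TYPE_VOCAB, activation="softmax", name="predicted_type")(merged)

    model = keras.Model(inputs=[name_in, doc_in], outputs=type_out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.summary()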
  26. Predicting types
      Results on 500 partially annotated GitHub projects

                          return type   argument type   combined
      Top-1   precision       .67            .61           .65
              recall          .62            .57           .59
      Top-3   precision       .76            .77           .80
              recall          .70            .70           .71
  27.–31. Finding inconsistencies
      How to find inconsistent function/variable names?
      Liu et al. Learning to Spot and Refactor Inconsistent Method Names. ICSE 2019
  32. Finding inconsistencies
      How to find inconsistent function/variable names?
      Liu et al. Learning to Spot and Refactor Inconsistent Method Names. ICSE 2019
      Methods with similar names should have similar bodies
  33. Finding inconsistencies
      Liu et al. Learning to Spot and Refactor Inconsistent Method Names. ICSE 2019
      1. Build embeddings of function names and body vectors
      2. For each function body:
         1. Find functions close to it in vector space
         2. Check their respective name distance
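A compact sketch of that two-step procedure, assuming the name and body embeddings are already computed (producing them is the actual contribution of the paper and is elided here); the thresholds and variable names are invented.

    # Sketch of the slide 33 procedure: flag a method whose body is close to
    # other bodies in vector space but whose name is far from those methods'
    # names. Embeddings are assumed given; thresholds are arbitrary.
    import numpy as np

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

    def suspicious_names(body_vecs, name_vecs, k=3, body_sim=0.8, name_sim=0.5):
        """body_vecs / name_vecs: dict of method id -> np.ndarray embedding."""
        flagged = []
        for m, bv in body_vecs.items():
            # 1. Find the k methods whose bodies are closest in vector space
            neighbours = sorted(
                (other for other in body_vecs if other != m),
                key=lambda o: cosine(bv, body_vecs[o]),
                reverse=True,
            )[:k]
            close = [o for o in neighbours if cosine(bv, body_vecs[o]) >= body_sim]
            if not close:
                continue
            # 2. Check whether this method's *name* is also close to theirs
            avg_name_sim = np.mean([cosine(name_vecs[m], name_vecs[o]) for o in close])
            if avg_name_sim < name_sim:
                flagged.append((m, close, float(avg_name_sim)))
        return flagged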
  34. Code summarization
      Wan et al. Improving automatic source code summarization via deep reinforcement learning. ASE 2018
      def add(a, b): return a + b
      def ???(a, b): return a + b
  35. Code summarization
      Wan et al. Improving automatic source code summarization via deep reinforcement learning. ASE 2018
      def add(a, b): return a + b
      """ Adds two numbers """
      def ???(a, b): return a + b
  36. Code summarization
      Wan et al. Improving automatic source code summarization via deep reinforcement learning. ASE 2018
      def add(a, b): return a + b
      """ Adds two numbers """
      def ???(a, b): return a + b → add
  37. Code summarization
      Wan et al. Improving automatic source code summarization via deep reinforcement learning. ASE 2018
      Use a critic network to re-adjust model weights
      BLEU score of 0.35
  38.–39. Code summarization
      Wan et al. Improving automatic source code summarization via deep reinforcement learning. ASE 2018
      NL channel / Semantics channel
      Use a critic network to re-adjust model weights
      BLEU score of 0.35
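For context, BLEU measures n-gram overlap between the generated summary and a reference summary (e.g. the original docstring). A toy sentence-level example using NLTK; the sentences are invented, and the paper reports corpus-level scores.

    # Toy illustration of the BLEU metric used to evaluate generated summaries.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = ["adds", "two", "numbers", "and", "returns", "the", "result"]
    generated = ["adds", "two", "given", "numbers"]

    smooth = SmoothingFunction().method1  # avoid zero scores on short sentences
    score = sentence_bleu([reference], generated, smoothing_function=smooth)
    print(f"BLEU: {score:.2f}")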
  40. Main challenges
      • Developers like to invent names
      • Code vocabularies are 10x the size of NL ones
        • Compression techniques (e.g. BPE) to the rescue (sketched below)
      • How to feed code to a network without losing info from either the NL or the semantics channel?
        • Code2Vec, TreeLSTMs, GGNNs, …
      • Keeping up with evolution
      • Making tools, not just research papers
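A minimal sketch of byte-pair encoding over identifier-like “words”, as referenced in the BPE bullet above: repeatedly merge the most frequent adjacent symbol pair so that common subwords (get, set, name, value) become single vocabulary entries. The corpus and merge count are invented; real code language models use much larger corpora and tuned merge counts.

    # Minimal byte-pair-encoding sketch for shrinking a code token vocabulary.
    from collections import Counter

    def bpe_merges(words, num_merges=10):
        # Each word is a tuple of symbols, starting from single characters.
        vocab = Counter(tuple(w) for w in words)
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for symbols, freq in vocab.items():
                for a, b in zip(symbols, symbols[1:]):
                    pairs[(a, b)] += freq
            if not pairs:
                break
            best = max(pairs, key=pairs.get)
            merges.append(best)
            new_vocab = Counter()
            for symbols, freq in vocab.items():
                merged, i = [], 0
                while i < len(symbols):
                    if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                        merged.append(symbols[i] + symbols[i + 1])
                        i += 2
                    else:
                        merged.append(symbols[i])
                        i += 1
                new_vocab[tuple(merged)] += freq
            vocab = new_vocab
        return merges, vocab

    # Identifier-like corpus: frequent subwords end up as single symbols.
    corpus = ["getname", "setname", "getvalue", "setvalue", "getname", "getvalue"]
    merges, vocab = bpe_merges(corpus, num_merges=8)
    print(merges)
    print(list(vocab))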
  41. ML4SE course!
      • Student presentations of course projects, including poster sessions
      • Oct 30, 13:45 - 17:00, Pulse-Hall 7