Slide 1

Machine Learning on Source Code #MLonCode Francesc Campoy, source{d}

Slide 2

Machine Learning on Source Code
● The why
● The way
● The what?

Slide 3

Francesc Campoy
VP of Developer Relations, source{d}
@francesc · [email protected]
Previously: Google Cloud (ML + Go)

Slide 4

speakerdeck.com/campoy/machine-learning-on-source-code

Slide 5

The why
Why do we want to do this?

Slide 6

Deep Learning Revolutions
- Computer Vision: ImageNet
- Natural Language Processing: Siri, Alexa, Google Assistant
- Go: AlphaGo

Slide 7

Deep Learning Revolutions
(source: wikimedia)

Slide 8

Machine Learning on Source Code
- Automated code review
- Source code similarity: clustering + search
- Translation:
  - source code to natural language
  - source code to source code
  - natural language to source code?

Slide 9

source: WOCinTech

Slide 10

The way
How are we tackling the challenge?

Slide 11

What is Source Code

As a sequence of bytes:

112 97 99 107 97 103 101 32 109 97 105 110 10 10 105 109 112 111 114 116 32 34 102 109 116 34 10 10 102 117 110 99 32 109 97 105 110 40 41 32 123 10 9 102 109 116 46 80 114 105 110 116 108 110 40 34 72 101 108 108 111 44 32 67 104 105 99 97 103 111 34 41 10 125 10

As text:

package main

import "fmt"

func main() {
	fmt.Println("Hello, Chicago")
}
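As a minimal sketch of the point this slide makes, a Go program held as a string can be printed back out as the raw byte values a machine actually sees (the hello-world program here is the one from the slide):

```go
package main

import "fmt"

func main() {
	// The same program the slide shows, held as a plain string.
	src := "package main\n\nimport \"fmt\"\n\nfunc main() {\n\tfmt.Println(\"Hello, Chicago\")\n}\n"

	// To a machine, source code is first of all a sequence of bytes.
	for _, b := range []byte(src) {
		fmt.Printf("%d ", b)
	}
	fmt.Println()
}
```

Running it prints the decimal byte values, starting with 112 97 99 107 97 103 101 for "package".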

Slide 12

What is Source Code

package main

import "fmt"

func main() {
	fmt.Println("Hello, Chicago")
}

As a sequence of tokens:

package · IDENT main · ; · import · STRING "fmt" · ; · func · IDENT main · ( · ) · { · IDENT fmt · . · IDENT Println · ( · STRING "Hello, Chicago" · ) · ; · } · ;

Slide 13

What is Source Code

package main

import "fmt"

func main() {
	fmt.Println("Hello, Chicago")
}

Slide 14

Datasets

Slide 15

Data retrieval at source{d}

Borges (github.com/src-d/borges)
● Stores git repositories
● Rooted repositories

Rovers (github.com/src-d/rovers)
● Crawls for git repos
● Supports: GitHub, BitBucket, cgit

Siva (github.com/src-d/siva)
● Seekable Indexed Block Archiver
● Small, concatenable, constant-time access

Slide 16

Rooted repositories

Slide 17

https://arxiv.org/abs/1803.10144

Slide 18

github.com/src-d/datasets

Slide 19

Learning from Source Code

Slide 20

Learning from Code

Code is:
- a sequence of bytes
- a sequence of tokens
- a tree

Slide 21

Learning sequences

Input → Output
- Speech recognition: audio clip → text
- Sentiment analysis: text → rating from -1 to 1
- Machine translation: "Hello, everyone!" → "Hola a tothom!"
- Video recognition: sequence of frames → text

Slide 22

Learning from Characters

Slide 23

Neural networks, right?

Slide 24

Recurrent Neural Networks

Slide 25

source: Martin Gorner

Slide 26

Predicting the next character

Input: 'p', 'a', 'c', 'k', 'a', 'g'
Output: 'e' 80%, 'i' 15%, ...

Slide 27

Training the neural network

Input: p a c k a g
Expected output: a c k a g e
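The shifted input/output pairs used for training can be generated mechanically; a tiny sketch (the `nextCharPairs` helper is mine, not from the talk):

```go
package main

import "fmt"

// nextCharPairs builds (input, target) pairs for next-character
// prediction: the target is the input shifted forward by one character.
func nextCharPairs(text string, window int) (inputs, targets []string) {
	for i := 0; i+window < len(text); i++ {
		inputs = append(inputs, text[i:i+window])
		targets = append(targets, text[i+1:i+window+1])
	}
	return inputs, targets
}

func main() {
	ins, outs := nextCharPairs("package", 6)
	for i := range ins {
		fmt.Printf("%q -> %q\n", ins[i], outs[i])
	}
}
```

For the word "package" with a window of 6 this yields the single pair "packag" → "ackage", matching the slide.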

Slide 28

Demo time!

Slide 29

More demo time!

Slide 30

Is this useful?
- Predicting implies understanding
- We can predict aspects of code:
  - help with assisted editing
  - detect anomalies

Slide 31

splittingidentifiersforthewin (Paper coming soon)

Slide 32

Learning from Tokens

Slide 33

We could use the same mechanism!

Input: package main ; import "fmt" ;
Output: main ; import "fmt" ; func

Slide 34

But … one-hot encoding

Categorical (non-continuous) variables require one-hot encoding. To predict the digits 0 to 9:

0 -> [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
1 -> [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
2 -> [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
3 -> [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
…
9 -> [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
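As a sketch, one-hot encoding is just a zeroed vector with a single 1 (the `oneHot` helper is illustrative, not from the talk):

```go
package main

import "fmt"

// oneHot returns a vector of length size that is all zeros
// except for a 1 at index i.
func oneHot(i, size int) []float64 {
	v := make([]float64, size)
	v[i] = 1
	return v
}

func main() {
	// Encode the digits 0..3 over a vocabulary of size 10.
	for d := 0; d <= 3; d++ {
		fmt.Println(d, "->", oneHot(d, 10))
	}
}
```

The catch the next slide raises: the vector length equals the vocabulary size, which explodes for identifiers.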

Slide 35

But … one-hot encoding

Possible characters in human text: hundreds
Possible characters in computer programs: hundreds
Possible words in English: 171,476 (Oxford dictionary*)

* urban dictionary not included

Slide 36

How many possible identifiers in X language?

Slide 37

Word embeddings

Slide 38

No content

Slide 39

No content

Slide 40

word2vec

Slide 41

Splitting and stemming

allSeenElements
→ splitting (and we know how!): all, seen, elements
→ stemming: all, see, element
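The splitting step (camelCase → words) can be sketched in a few lines of Go; real stemming needs a stemmer library, so only splitting and lowercasing are shown here (the `splitIdentifier` helper is mine, not source{d}'s implementation):

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitIdentifier breaks a camelCase identifier into lowercase words.
// It assumes ASCII identifiers, which keeps byte indexing simple.
func splitIdentifier(id string) []string {
	var words []string
	start := 0
	for i, r := range id {
		// An uppercase rune starts a new word.
		if i > 0 && unicode.IsUpper(r) {
			words = append(words, strings.ToLower(id[start:i]))
			start = i
		}
	}
	words = append(words, strings.ToLower(id[start:]))
	return words
}

func main() {
	fmt.Println(splitIdentifier("allSeenElements"))
}
```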

Slide 42

id2vec
● Vocabulary size: 500,000
● Dataset: >100,000 most starred repos on GitHub
● Result: https://github.com/src-d/ml

Slide 43

https://blog.sourced.tech/post/id2vec/

Slide 44

Receive is to send as read is to ...

Slide 45

Learning from Trees

Slide 46

Abstract Syntax Trees
● Each programming language has its own grammar
● Each grammar generates slightly different ASTs
● We want to learn from *all* languages!

Slide 47

Babelfish (https://bblf.sh) Universal Abstract Syntax Trees

Slide 48

No content

Slide 49

No content

Slide 50

Structural Embeddings

? → 0.1 0.5 0.9 ... 0.0

Slide 51

https://arxiv.org/abs/1703.00572

Slide 52

https://arxiv.org/abs/1803.09544

Slide 53

And more coming soon! PS: we’re hiring

Slide 54

Other projects
- Vecino: finding similar repositories
- Apollo: finding source code duplication at scale
- TMSC: topic modeling on source code repositories
- Snippet-Ranger: topic modeling on source code snippets

Slide 55

Awesome Machine Learning on Source Code https://github.com/src-d/awesome-machine-learning-on-source-code

Slide 56

Want to know more?

@sourcedtech · sourced.tech · slack!

Slide 57

Thanks

Francesc Campoy, source{d}
@francesc · [email protected]