Slide 1

Slide 1 text

“Software is eating the world”

Slide 2

Slide 2 text

128k LoC

Slide 3

Slide 3 text

4-5M LoC

Slide 4

Slide 4 text

9M LoC

Slide 5

Slide 5 text

18M LoC

Slide 6

Slide 6 text

45M LoC

Slide 7

Slide 7 text

150M LoC

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

Invented in 1725

Slide 12

Slide 12 text

Founded in 1896, later became IBM

Slide 13

Slide 13 text

Created in 1969

Slide 14

Slide 14 text

Created in 1976 - iMproved in 1991

Slide 15

Slide 15 text

Created in 1981

Slide 16

Slide 16 text

Released in 2014

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

Machine Learning will change programming
Francesc Campoy

Slide 20

Slide 20 text

Francesc Campoy
VP of Product at Dgraph Labs
@francesc
Previously:
● VP of Product & DevRel at source{d}
● Senior Developer Advocate at Google (Go team and Google Cloud Platform)

Slide 21

Slide 21 text

just for func

Slide 22

Slide 22 text

Agenda
● Machine Learning on Source Code
● Research
● Use Cases
● The Future

Slide 23

Slide 23 text

Machine Learning on Source Code

Slide 24

Slide 24 text

Machine Learning on Source Code
The field of Machine Learning where the input data is source code. Also known as MLonCode.

Slide 25

Slide 25 text

Machine Learning on Source Code
Requires:
● Lots of data
● Really, lots and lots of data
● Fancy ML algorithms
● A little bit of luck
Related fields:
● Data Mining
● Natural Language Processing
● Graph-Based Machine Learning

Slide 26

Slide 26 text

What is source code?

Slide 27

Slide 27 text

What is Source Code

'112', '97', '99', '107', '97', '103', '101', '32', '109', '97', '105', '110', '10', '10', '105', '109', '112', '111', '114', '116', '32', '40', '10', '9', '34', '102', '109', '116', '34', '10', '41', '10', '10', '102', '117', '110', '99', '32', '109', '97', '105', '110', '40', '41', '32', '123', '10', '9', '102', '109', '116', '46', '80', '114', '105', '110', '116', '108', '110', '40', '34', '72', '101', '108', '108', '111', '44', '32', '112', '108', '97', '121', '103', '114', '111', '117', '110', '100', '34', '41', '10', '125', '10'

package main

import "fmt"

func main() {
	fmt.Println("Hello, Denver")
}
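To make the bytes view concrete, here is a minimal Go sketch (not from the talk's materials) that prints the hello-world program above as the raw byte values a bytes-based model would consume:

package main

import "fmt"

func main() {
	src := "package main\n\nimport \"fmt\"\n\nfunc main() {\n\tfmt.Println(\"Hello, Denver\")\n}\n"
	// At the lowest level, source code is just a sequence of bytes.
	for _, b := range []byte(src) {
		fmt.Printf("%d ", b)
	}
	fmt.Println()
}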

Slide 28

Slide 28 text

What is Source Code

package package  IDENT main  ;  import import  STRING "fmt"  ;  func func  IDENT main  (  )  {  IDENT fmt  .  IDENT Println  (  STRING "Hello, Denver"  )  ;  }  ;

package main

import "fmt"

func main() {
	fmt.Println("Hello, Denver")
}
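The token stream above is what Go's standard library scanner produces. A small sketch (not from the slides) using go/scanner to tokenize the same program:

package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

func main() {
	src := []byte("package main\n\nimport \"fmt\"\n\nfunc main() {\n\tfmt.Println(\"Hello, Denver\")\n}\n")

	fset := token.NewFileSet()
	file := fset.AddFile("main.go", fset.Base(), len(src))

	var s scanner.Scanner
	s.Init(file, src, nil, 0)

	// Print each token kind and its literal: the token stream a model would see.
	for {
		_, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		fmt.Println(tok, lit)
	}
}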

Slide 29

Slide 29 text

What is Source Code

package main

import "fmt"

func main() {
	fmt.Println("Hello, Denver")
}
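The same program can also be viewed as an abstract syntax tree. A minimal sketch (not from the slides) that parses it with the standard go/parser package and dumps the tree:

package main

import (
	"go/ast"
	"go/parser"
	"go/token"
	"log"
)

func main() {
	src := `package main

import "fmt"

func main() {
	fmt.Println("Hello, Denver")
}
`
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "main.go", src, 0)
	if err != nil {
		log.Fatal(err)
	}
	// Dump the abstract syntax tree of the program above.
	ast.Print(fset, f)
}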

Slide 30

Slide 30 text

What is Source Code

package main

import "fmt"

func main() {
	fmt.Println("Hello, Denver")
}

Slide 31

Slide 31 text

What is Source Code
● A sequence of bytes
● A sequence of tokens
● An abstract syntax tree
● A graph (e.g. a Control Flow Graph)

Slide 32

Slide 32 text

learning from source code

Slide 33

Slide 33 text

learning from source code as bytes

Slide 34

Slide 34 text

Neural Networks
Basically fancy linear regression machines.
Given an input of constant length, they predict an output of constant length.
Example: MNIST
Input: images of 28x28 px
Output: a digit from 0 to 9
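As an illustration (not the talk's code), here is a single dense layer in Go: a fixed-length input mapped through a weight matrix and a softmax to fixed-length class probabilities. All weights and shapes below are made up; MNIST would use 784 inputs and 10 outputs.

package main

import (
	"fmt"
	"math"
)

// forward computes softmax(w*x + b): one dense layer mapping a
// fixed-length input to a fixed-length vector of probabilities.
func forward(w [][]float64, b, x []float64) []float64 {
	out := make([]float64, len(w))
	for i, row := range w {
		sum := b[i]
		for j, v := range row {
			sum += v * x[j]
		}
		out[i] = sum
	}
	// softmax turns raw scores into probabilities that sum to 1.
	var total float64
	for i, v := range out {
		out[i] = math.Exp(v)
		total += out[i]
	}
	for i := range out {
		out[i] /= total
	}
	return out
}

func main() {
	// Toy shapes: 4 inputs, 3 output classes.
	w := [][]float64{{0.1, 0.2, -0.3, 0.4}, {-0.5, 0.1, 0.2, 0.0}, {0.3, -0.2, 0.1, 0.1}}
	b := []float64{0.0, 0.1, -0.1}
	x := []float64{1, 0, 1, 0}
	fmt.Println(forward(w, b, x))
}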

Slide 35

Slide 35 text

MNIST: the ten output values for a sample digit, e.g. ~0 ~0 ~0 ~0 ~0 ~0 ~0 ~0 ~1 ~0 (the network predicts an 8).

Slide 36

Slide 36 text

MLonCode: Predict the next token
f o r   i   : =   0   ;   i

Slide 37

Slide 37 text

Recurrent Neural Networks
Can process sequences of variable length.
Use their own output as a new input.
Example: Natural Language Translation
Input (Catalan): “Estic molt constipat”
Output: “I got a serious cold”
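The defining trick is the recurrence: each step consumes one input element plus the previous hidden state. A toy scalar sketch (not from the talk; real RNNs use weight matrices, and these numbers are made up):

package main

import (
	"fmt"
	"math"
)

// step is one RNN cell: the new hidden state depends on the current
// input and the previous hidden state (the "output used as a new
// input" from the slide).
func step(wx, wh, h, x float64) float64 {
	return math.Tanh(wx*x + wh*h)
}

func main() {
	wx, wh := 0.8, 0.5 // made-up weights
	h := 0.0
	for _, x := range []float64{1, 0, 1, 1} { // a variable-length sequence
		h = step(wx, wh, h, x)
		fmt.Println(h)
	}
}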

Slide 38

Slide 38 text

MLonCode: Code Generation
charRNN: given n characters, predict the next one.
Trained over the Go standard library.
Achieved 61% accuracy on predictions.

Slide 39

Slide 39 text

Before training r t, kp0t@pp kpktp 0p000 xS%%%?ttk?^@p0rk^@%ppp@ta#p^@ #pp}}%p^@?P%^@@k#%@P}}ta S?@}^@t%@% %%aNt i ^@SSt@@ pyikkp?%y ?t k L P0L t% ^@i%yy ^@p i? %L%LL tyLP?a ?L@Ly?tkk^@ @^@ykk^@i#P^@iL@??@%1tt%^@tPTta L ^@LL%% %i1::yyy^@^@t tP @?@a#Patt 1^@@ k^@k ? yt%L1^@tP%k1?k? % ^@i ^@ta1?1taktt1P?a^@^@Pkt?#^@t^@##1?## #^@t11#:^@%??t%1^@a 1?a at1P ^@^@Pt #%^@^@ ^@aaak^@#a#?P1Pa^@tt%?^@kt?#akP ?#^@i%%aa ^@1%t tt?a?% t^@k^@^@k^@ a : ^@1 P# % ^@^@#t% :% kkP ^@#?P: t^@a ?%##?kkPaP^@ #a k?t?? ai?i%PPk taP% P^@ k^@iiai#?^@# #t ?# P?P^@ i^@ttPt # 1%11 ti a^@k P^@k ^@kt %^@%y?#a a#% @? kt ^@t%k? ^@PtttkL tkLa1 ^@iaay?p1% Pta tt ik?ty k^@kpt%^@tktpkryyp^@?pP# %kt?ki? i @t^@k^@%#P} ?at}akP##Pa11%^@i% ^@?ia ia%##%tki % }i%%%}} a ay^@%yt }%t ^@tU%a% t}yi^@ ^@ @t yt%? aP @% ^@??^@%? ^@??k#% kk#%t?a: P}^@t :#^@#1t^@#: w^@P#%w:Pt t # t%aa%i@ak@@^@ka@^@a # y}^@# ^@? % tP i? ?tk ktPPt a tpprrpt? a^@ pP pt %p ? k? ^@^@ kP^@%%?tk a Pt^@# tP? P kkP1L1tP a%? t1P%PPti^@?%ytk %#%%t?@?^@ty^@iyk%1#^@@^@1#t a t@P^@^@ P@^@1P^@%%#@P:^@%^@ t 1:#P#@LtL#@L L1 %%dt??^@L ^@iBt yTk%p ^@i

Slide 40

Slide 40 text

After one epoch (dataset seen once) if testingValuesIntering() { t.SetCaterCleen(time.SewsallSetrive(true) if weq := nil { t.Errorf("eshould: wont %v", touts anverals, prc.Strnared, error } t, err := ntr.Soare(cueper(err, err) if err != nil { t.Errorf("preveth dime il resetests:%d; want %#',", tr.test.Into } if err != nil { return } if err == nel { t.Errorf("LoconserrSathe foot %q::%q: %s;%want %d", d, err) }, defarenContateFule(temt.Canses) } if err != nil { return err } // Treters and restives of the sesconse stmpeletatareservet // This no to the result digares wheckader. Constate bytes alleal

Slide 41

Slide 41 text

After two epochs if !ok { t.Errorf("%d: %v not %v", i, err) } if !ot.Close() if enr != nil { t.Fatal(err) } if !ers != nil { t.Fatal(err) } if err != nil { t.Fatal(err) } if err != nil { t.Errorf("error %q: %s not %v", i, err) } return nil }

Slide 42

Slide 42 text

After many epochs if got := t.struct(); !ok { t.Fatalf("Got %q: %q, %v, want %q", test, true } if !strings.Connig(t) { t.Fatalf("Got %q: %q", want %q", t, err) } if !ot { t.Errorf("%s < %v", x, y) } if !ok { t.Errorf("%d <= %d", err) } if !stricgs(); !ot { t.Errorf("!(%d <= %v", x, e) } } if !ot != nil { return "" }

Slide 43

Slide 43 text

learning from source code as tokens

Slide 44

Slide 44 text

bytes vs tokens
- Number of values
- Can we invent new values?
- Semantic content
- A is to H as D is to ???
- Man is to King as Woman is to ???

Slide 45

Slide 45 text

Embeddings
A kind of dimensionality reduction:
1. Assign an identifier to every token.
2. One-hot encode it, so N numbers become N vectors with N dimensions.
3. Try to represent the same information… but with M < N dimensions.
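A toy Go sketch of steps 1-3 (the vocabulary, token ids, and embedding values below are all made up; real models learn them during training): one-hot encode a token, then multiply by an N x M embedding matrix to get its M-dimensional vector.

package main

import "fmt"

func main() {
	// 1. Assign an identifier to every token (N = 4 here).
	ids := map[string]int{"func": 0, "main": 1, "import": 2, "package": 3}

	// The embedding matrix maps each of the N ids to M < N dimensions (M = 2).
	embedding := [][]float64{{0.9, 0.1}, {0.2, 0.8}, {0.7, 0.3}, {0.8, 0.2}}

	tok := "func"
	// 2. One-hot encode the token: a vector with N dimensions...
	oneHot := make([]float64, len(ids))
	oneHot[ids[tok]] = 1

	// 3. ...and multiplying by the matrix selects its M-dimensional vector.
	vec := make([]float64, 2)
	for i, row := range embedding {
		for j := range vec {
			vec[j] += oneHot[i] * row[j]
		}
	}
	fmt.Println(tok, "->", vec)
}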

Slide 46

Slide 46 text

word2vec source: http://jalammar.github.io/illustrated-word2vec/

Slide 47

Slide 47 text

word2vec

Slide 48

Slide 48 text

projector.tensorflow.org

Slide 49

Slide 49 text

code2vec.org

Slide 50

Slide 50 text

code2vec.org

Slide 51

Slide 51 text

Benefits of embeddings
- They provide a “semantic” space for tokens.
- They’re normally pre-trained, which speeds up our training.
- Our model can handle tokens it’s never seen.
- Using the word “embedding” makes you sound cool at parties.

Slide 52

Slide 52 text

learning from source code as graphs

Slide 53

Slide 53 text

Learning from graphs
Three main approaches:
- Transforming into tables
- Node embeddings
- Graph Neural Networks
source: https://medium.com/octavian-ai/how-to-get-started-with-machine-learning-on-graphs-7f0795c83763

Slide 54

Slide 54 text

Node embeddings
- Similar to the previous embeddings: encode information as vectors.
- Goal: similarity in embedding space ⇒ similarity in the original network.
source: http://snap.stanford.edu/proj/embeddings-www/files/nrltutorial-part1-embeddings.pdf
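Similarity in embedding space is typically measured with dot products or cosine similarity. A small sketch (the vectors below are made up):

package main

import (
	"fmt"
	"math"
)

// cosine measures how similar two embedding vectors are: 1 means
// pointing the same way, 0 unrelated, -1 opposite.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	// Made-up embeddings for three nodes.
	u := []float64{0.9, 0.1}
	v := []float64{0.8, 0.3}
	w := []float64{-0.5, 0.9}
	fmt.Println(cosine(u, v)) // nearby nodes: high similarity
	fmt.Println(cosine(u, w)) // distant nodes: low similarity
}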

Slide 55

Slide 55 text

Node embeddings
- They can be applied at multiple levels, leading to a kind of “summary” of a graph.
source: https://arxiv.org/pdf/1709.07604.pdf

Slide 56

Slide 56 text

Random walks
- Transform a graph into a series of paths (aka a matrix).
- They are often used to create embeddings.
- Dot product in embedding space ~ probability of two nodes appearing in the same random walk.
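A minimal sketch of the first step (the graph and node names are made up; nodes could be AST nodes or functions): turn a graph into paths by repeatedly hopping to a random neighbor.

package main

import (
	"fmt"
	"math/rand"
)

// randomWalk turns a graph into a path: starting at a node,
// repeatedly hop to a randomly chosen neighbor.
func randomWalk(adj map[string][]string, start string, length int) []string {
	path := []string{start}
	node := start
	for i := 0; i < length; i++ {
		next := adj[node]
		if len(next) == 0 {
			break
		}
		node = next[rand.Intn(len(next))]
		path = append(path, node)
	}
	return path
}

func main() {
	// A tiny made-up graph as an adjacency list.
	adj := map[string][]string{
		"a": {"b", "c"},
		"b": {"a", "c"},
		"c": {"a"},
	}
	for i := 0; i < 3; i++ {
		fmt.Println(randomWalk(adj, "a", 4))
	}
}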

Slide 57

Slide 57 text

Learning to Represent Programs with Graphs
Miltiadis Allamanis, Marc Brockschmidt, Mahmoud Khademi
https://arxiv.org/abs/1711.00740

The VARMISUSE task: given a program and a gap in it, predict which variable is missing.

from, err := os.Open("a.txt")
if err != nil {
	log.Fatal(err)
}
defer from.Close()

to, err := os.Open("b.txt")
if err != nil {
	log.Fatal(err)
}
defer ???.Close()

io.Copy(to, from)

Slide 58

Slide 58 text

code2vec: Learning Distributed Representations of Code
Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav
https://arxiv.org/abs/1803.09473 | https://code2vec.org/

Slide 59

Slide 59 text

code2vec.org

Slide 60

Slide 60 text

Graph Neural Networks
sources:
- A Gentle Introduction to Graph Neural Networks (Basics, DeepWalk, and GraphSage)
- The Graph Neural Network Model
- Graph Neural Networks: A Review of Methods and Applications

Slide 61

Slide 61 text

Much more research github.com/src-d/awesome-machine-learning-on-source-code

Slide 62

Slide 62 text

Graph Graph Graph ...

Slide 63

Slide 63 text

D is for distributed!

Slide 64

Slide 64 text

A new generation of tools

Slide 65

Slide 65 text

Microsoft IntelliCode
Uses concepts from Learning to Represent Programs with Graphs.
source: www.microsoft.com/en-us/research/blog/learning-source-code/

Slide 66

Slide 66 text

No content

Slide 67

Slide 67 text

Is this a good name?
From code2vec: Learning Distributed Representations of Code.

func XXX(list []string, text string) bool {
	for _, s := range list {
		if s == text {
			return true
		}
	}
	return false
}
Suggestions:
● Contains
● Has

func XXX(list []string, text string) int {
	for i, s := range list {
		if s == text {
			return i
		}
	}
	return -1
}
Suggestions:
● Find
● Index

Slide 68

Slide 68 text

Towards Natural Language Semantic Code Search
Embedding code and its descriptions together for semantic search.
source: github.blog/2018-09-18-towards-natural-language-semantic-code-search/

Slide 69

Slide 69 text

experiments.github.com/semantic-code-search

Slide 70

Slide 70 text

Facebook Sapienz and SapFix
Automated bug detection at scale.
source: Finding and fixing software bugs automatically with SapFix and Sapienz

Slide 71

Slide 71 text

Ubisoft + Mozilla: CLEVER-Commit
Initially by Ubisoft as Commit-Assistant.
Research: CLEVER: Combining Code Metrics with Clone Detection for Just-In-Time Fault Prevention and Resolution in Large Industrial Projects
CLEVER detects “risky commits” and provides potential fixes.

Slide 72

Slide 72 text

And so much more
● Automated style guide enforcement
● Automated code review
● Education
● Code generation: from unit tests, specifications, natural language descriptions
● ...

Slide 73

Slide 73 text

Will developers be replaced?

Slide 74

Slide 74 text

Developers will be empowered.

Slide 75

Slide 75 text

Want to know more?
References:
● github.com/src-d/awesome-machine-learning-on-source-code
● speakerdeck.com/campoy/oscon19
Me:
● [email protected]
● @francesc

Slide 76

Slide 76 text

We’re hiring! dgraph.io/careers

Slide 77

Slide 77 text

Thanks francesc