Machine Learning will change programming

ML has revolutionized many fields, from cancer detection to self-driving cars, and let’s not forget the connected toilets that let Alexa flush at your command.

Researchers have been working on applying ML to source code to predict bugs, find patterns in code, and much more, and are building products that apply this research to improve day-to-day developer tasks.

Francesc Campoy Flores

July 17, 2019

Transcript

  1. “Software is eating the world”

  2. 128k LoC

  3. 4-5M LoC

  4. 9M LoC

  5. 18M LoC

  6. 45M LoC

  7. 150M LoC

  8. None
  9. None
  10. None
  11. 80 Invented in 1725

  12. Founded in 1896, later became IBM

  13. Created in 1969

  14. Created in 1976 - iMproved in 1991

  15. Created in 1981

  16. Released in 2014

  17. None
  18. None
  19. Machine Learning will change programming Francesc Campoy

  20. VP of Product at Dgraph Labs @francesc Previously: • VP

    of Product & DevRel at source{d} • Senior Developer Advocate at Google (Go team and Google Cloud Platform) Francesc Campoy
  21. just for func

  22. Agenda • Machine Learning on Source Code • Research •

    Use Cases • The Future
  23. Machine Learning on Source Code

  24. Machine Learning on Source Code Field of Machine Learning where

    the input data is source code. MLonCode
  25. Machine Learning on Source Code Requires: • Lots of data

    • Really, lots and lots of data • Fancy ML Algorithms • A little bit of luck Related Fields: • Data Mining • Natural Language Processing • Graph Based Machine Learning
  26. What is source code?

  27. '112', '97', '99', '107', '97', '103', '101', '32', '109', '97',

    '105', '110', '10', '10', '105', '109', '112', '111', '114', '116', '32', '40', '10', '9', '34', '102', '109', '116', '34', '10', '41', '10', '10', '102', '117', '110', '99', '32', '109', '97', '105', '110', '40', '41', '32', '123', '10', '9', '102', '109', '116', '46', '80', '114', '105', '110', '116', '108', '110', '40', '34', '72', '101', '108', '108', '111', '44', '32', '112', '108', '97', '121', '103', '114', '111', '117', '110', '100', '34', '41', '10', '125', '10' package main import “fmt” func main() { fmt.Println(“Hello, Denver”) } What is Source Code
  28. package package IDENT main ; import import STRING "fmt" ;

    func func IDENT main ( ) What is Source Code { IDENT fmt . IDENT Println ( STRING "Hello, Denver" ) ; } ; package main import "fmt" func main() { fmt.Println("Hello, Denver") }
  29. What is Source Code package main import "fmt" func main()

    { fmt.Println("Hello, Denver") }
  30. What is Source Code package main import "fmt" func main()

    { fmt.Println("Hello, Denver") }
  31. What is Source Code • A sequence of bytes •

    A sequence of tokens • An abstract syntax tree • A Graph (e.g. Control Flow Graph)
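Go's standard library ships the machinery behind the byte and token views on these slides. As a small sketch, go/scanner turns the raw bytes of the hello program into the token stream shown earlier (IDENT, STRING, plus the semicolons the scanner inserts automatically):

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

// Tokenize returns the token stream for a Go source snippet,
// mirroring the bytes -> tokens step shown on the slides.
func Tokenize(src []byte) []string {
	var s scanner.Scanner
	fset := token.NewFileSet()
	file := fset.AddFile("snippet.go", fset.Base(), len(src))
	s.Init(file, src, nil, 0)

	var out []string
	for {
		_, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		if tok.IsLiteral() {
			// Literals (IDENT, STRING, INT, ...) carry their text.
			out = append(out, fmt.Sprintf("%s(%q)", tok, lit))
		} else {
			out = append(out, tok.String())
		}
	}
	return out
}

func main() {
	src := []byte("package main\n\nimport \"fmt\"\n\nfunc main() { fmt.Println(\"Hello, Denver\") }\n")
	fmt.Printf("%d bytes -> %d tokens\n", len(src), len(Tokenize(src)))
	for _, t := range Tokenize(src) {
		fmt.Println(t)
	}
}
```

The next step up, the abstract syntax tree, comes from the sibling go/parser package in the same way.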
  32. learning from source code

  33. learning from source code as bytes

  34. Neural Networks Basically fancy linear regression machines. Given an input

    of constant length, they predict an output of constant length. Example: MNIST: Input: 28x28 px images Output: a digit from 0 to 9
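As a minimal sketch of the "constant length in, constant length out" idea, here is a single dense layer with a softmax in plain Go; the weights are hand-picked toy values, not trained ones:

```go
package main

import (
	"fmt"
	"math"
)

// Dense applies one fully connected layer: y = softmax(Wx + b).
// Fixed-size input, fixed-size output: the constant-length
// constraint the slides mention.
func Dense(w [][]float64, b, x []float64) []float64 {
	y := make([]float64, len(w))
	sum := 0.0
	for i := range w {
		for j := range x {
			y[i] += w[i][j] * x[j]
		}
		y[i] = math.Exp(y[i] + b[i])
		sum += y[i]
	}
	for i := range y {
		y[i] /= sum // softmax: raw scores become probabilities
	}
	return y
}

func main() {
	// Toy 2-class classifier over a 3-value input.
	w := [][]float64{{1, 0, 0}, {0, 1, 1}}
	b := []float64{0, 0}
	fmt.Println(Dense(w, b, []float64{1, 0, 0}))
}
```

An MNIST model is the same shape with a 784-value input (28x28 pixels) and a 10-value output, one probability per digit.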
  35. MNIST ~0 ~0 ~0 ~0 ~0 ~0 ~0 ~0 ~1

    ~0
  36. MLonCode: Predict the next token f o r i :

    = 0 ; i
  37. Recurrent Neural Networks Can process sequences of variable length. Uses

    its own output as a new input. Example: Natural Language Translation: Input: “Estic molt constipat” Output: “I got a serious cold”
  38. MLonCode: Code Generation charRNN: Given n characters, predict the next

    one. Trained over the Go standard library. Achieved 61% accuracy on predictions.
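A charRNN itself needs a neural-network library, but the task it solves can be shown with a much dumber stand-in: an n-gram model that predicts the next character by counting occurrences. This is not the model from the talk, just the same next-character interface on a toy corpus:

```go
package main

import (
	"fmt"
	"strings"
)

// NextChar predicts the character most likely to follow ctx in corpus,
// using plain n-gram counts: a tiny stand-in for the charRNN on the
// slides, which learns the same next-character task with a neural net.
func NextChar(corpus, ctx string) byte {
	counts := map[byte]int{}
	for i := 0; i+len(ctx) < len(corpus); i++ {
		if corpus[i:i+len(ctx)] == ctx {
			counts[corpus[i+len(ctx)]]++
		}
	}
	var best byte
	bestN := 0
	for c, n := range counts {
		if n > bestN {
			best, bestN = c, n
		}
	}
	return best
}

func main() {
	corpus := strings.Repeat("for i := 0; i < n; i++ {\n", 3)
	fmt.Printf("after %q comes %q\n", "for i :", string(NextChar(corpus, "for i :")))
}
```

Unlike an RNN, this model generalizes to nothing it has not literally seen, which is exactly the gap the neural approach closes.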
  39. Before training r t, kp0t@pp kpktp 0p000 xS%%%?ttk?^@p0rk^@%ppp@ta#p^@ #pp}}%p^@?P%^@@k#%@P}}ta S?@}^@t%@%

    %%aNt i ^@SSt@@ pyikkp?%y ?t k L P0L t% ^@i%yy ^@p i? %L%LL tyLP?a ?L@Ly?tkk^@ @^@ykk^@i#P^@iL@??@%1tt%^@tPTta L ^@LL%% %i1::yyy^@^@t tP @?@a#Patt 1^@@ k^@k ? yt%L1^@tP%k1?k? % ^@i ^@ta1?1taktt1P?a^@^@Pkt?#^@t^@##1?## #^@t11#:^@%??t%1^@a 1?a at1P ^@^@Pt #%^@^@ ^@aaak^@#a#?P1Pa^@tt%?^@kt?#akP ?#^@i%%aa ^@1%t tt?a?% t^@k^@^@k^@ a : ^@1 P# % ^@^@#t% :% kkP ^@#?P: t^@a ?%##?kkPaP^@ #a k?t?? ai?i%PPk taP% P^@ k^@iiai#?^@# #t ?# P?P^@ i^@ttPt # 1%11 ti a^@k P^@k ^@kt %^@%y?#a a#% @? kt ^@t%k? ^@PtttkL tkLa1 ^@iaay?p1% Pta tt ik?ty k^@kpt%^@tktpkryyp^@?pP# %kt?ki? i @t^@k^@%#P} ?at}akP##Pa11%^@i% ^@?ia ia%##%tki % }i%%%}} a ay^@%yt }%t ^@tU%a% t}yi^@ ^@ @t yt%? aP @% ^@??^@%? ^@??k#% kk#%t?a: P}^@t :#^@#1t^@#: w^@P#%w:Pt t # t%aa%i@ak@@^@ka@^@a # y}^@# ^@? % tP i? ?tk ktPPt a tpprrpt? a^@ pP pt %p ? k? ^@^@ kP^@%%?tk a Pt^@# tP? P kkP1L1tP a%? t1P%PPti^@?%ytk %#%%t?@?^@ty^@iyk%1#^@@^@1#t a t@P^@^@ P@^@1P^@%%#@P:^@%^@ t 1:#P#@LtL#@L L1 %%dt??^@L ^@iBt yTk%p ^@i
  40. After one epoch (dataset seen once) if testingValuesIntering() { t.SetCaterCleen(time.SewsallSetrive(true)

    if weq := nil { t.Errorf("eshould: wont %v", touts anverals, prc.Strnared, error } t, err := ntr.Soare(cueper(err, err) if err != nil { t.Errorf("preveth dime il resetests:%d; want %#',", tr.test.Into } if err != nil { return } if err == nel { t.Errorf("LoconserrSathe foot %q::%q: %s;%want %d", d, err) }, defarenContateFule(temt.Canses) } if err != nil { return err } // Treters and restives of the sesconse stmpeletatareservet // This no to the result digares wheckader. Constate bytes alleal
  41. After two epochs if !ok { t.Errorf("%d: %v not %v",

    i, err) } if !ot.Close() if enr != nil { t.Fatal(err) } if !ers != nil { t.Fatal(err) } if err != nil { t.Fatal(err) } if err != nil { t.Errorf("error %q: %s not %v", i, err) } return nil }
  42. if got := t.struct(); !ok { t.Fatalf("Got %q: %q, %v,

    want %q", test, true } if !strings.Connig(t) { t.Fatalf("Got %q: %q", want %q", t, err) } if !ot { t.Errorf("%s < %v", x, y) } if !ok { t.Errorf("%d <= %d", err) } if !stricgs(); !ot { t.Errorf("!(%d <= %v", x, e) } } if !ot != nil { return "" } After many epochs
  43. learning from source code as tokens

  44. bytes vs tokens - Number of values - Can we

    invent new values? - Semantic content - A is to H as D is to ??? - Man is to King as Woman is to ???
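The "man is to king as woman is to ???" analogy is usually answered with vector arithmetic on embeddings: compute king - man + woman and find the nearest vector. A sketch with hand-made 2-D toy vectors (real embeddings are trained and have hundreds of dimensions):

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / math.Sqrt(na*nb)
}

// Nearest returns the word whose vector is most similar to v,
// skipping the words used to build the query.
func Nearest(vecs map[string][]float64, v []float64, exclude ...string) string {
	skip := map[string]bool{}
	for _, w := range exclude {
		skip[w] = true
	}
	best, bestSim := "", math.Inf(-1)
	for w, u := range vecs {
		if skip[w] {
			continue
		}
		if s := cosine(u, v); s > bestSim {
			best, bestSim = w, s
		}
	}
	return best
}

func main() {
	// Toy 2-D embeddings: axis 0 ~ royalty, axis 1 ~ gender.
	vecs := map[string][]float64{
		"man":   {0.1, 1.0},
		"woman": {0.1, -1.0},
		"king":  {1.0, 1.0},
		"queen": {1.0, -1.0},
	}
	// king - man + woman
	v := []float64{
		vecs["king"][0] - vecs["man"][0] + vecs["woman"][0],
		vecs["king"][1] - vecs["man"][1] + vecs["woman"][1],
	}
	fmt.Println(Nearest(vecs, v, "king", "man", "woman")) // prints "queen"
}
```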
  45. A kind of dimensionality reduction. 1. Assign an identifier to

    every token. 2. One-hot encode it, so N numbers become N vectors with N dimensions. 3. Try to represent the same information … but with M < N dimensions. Embeddings
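Because a one-hot vector times a matrix just selects one row of the matrix, an embedding lookup reduces to indexing into an N x M table. A sketch of those three steps with a toy, untrained table:

```go
package main

import "fmt"

// Embed maps a one-hot encoded token (N dimensions) down to M dimensions
// by "multiplying" with an N x M matrix. One-hot times a matrix is just
// selecting a row, which is all an embedding layer's forward pass does.
// In a real model the table is learned; here it is fixed toy data.
func Embed(tokenID int, table [][]float64) []float64 {
	return table[tokenID]
}

func main() {
	// Vocabulary of N = 4 tokens, embedded in M = 2 dimensions.
	vocab := []string{"package", "import", "func", "return"}
	table := [][]float64{
		{0.1, 0.9},
		{0.2, 0.8},
		{0.7, 0.3},
		{0.6, 0.4},
	}
	for id, tok := range vocab {
		fmt.Printf("%-8s -> %v\n", tok, Embed(id, table))
	}
}
```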
  46. word2vec source: http://jalammar.github.io/illustrated-word2vec/

  47. word2vec

  48. projector.tensorflow.org

  49. code2vec.org

  50. code2vec.org

  51. - They provide a “semantic” space for tokens. - They’re

    normally pre-trained, which speeds up our training. - Our model can handle tokens it’s never seen. - Using the word “embedding” makes you sound cool at parties. Benefits of embeddings
  52. learning from source code as graphs

  53. Three main approaches: - Transforming into tables - Node embeddings

    - Graph Neural Networks source: https://medium.com/octavian-ai/how-to-get-started-with-machine-learning-on-graphs-7f0795c83763 Learning from graphs
  54. Node embeddings - Similar to the previous embeddings, encode information

    as vectors. - Goal: similarity in embedding space ⇒ similarity on original network source: http://snap.stanford.edu/proj/embeddings-www/files/nrltutorial-part1-embeddings.pdf
  55. Node embeddings - They can be applied at multiple levels,

    leading to some kind of “summary” of a graph. source: https://arxiv.org/pdf/1709.07604.pdf
  56. Random walks - Transforms a graph into a series of

    paths (aka a matrix) - They are often used to create embeddings. - dot product on embedding space ~ prob. of nodes in a random walk.
  57. Learning to Represent Programs with Graphs from, err := os.Open("a.txt")

    if err != nil { log.Fatal(err) } defer from.Close() to, err := os.Open("b.txt") if err != nil { log.Fatal(err) } defer ???.Close() io.Copy(to, from) Miltiadis Allamanis, Marc Brockschmidt, Mahmoud Khademi https://arxiv.org/abs/1711.00740 The VARMISUSE Task: Given a program and a gap in it, predict what variable is missing.
  58. code2vec: Learning Distributed Representations of Code Uri Alon, Meital Zilberstein,

    Omer Levy, Eran Yahav https://arxiv.org/abs/1803.09473 | https://code2vec.org/
  59. code2vec.org

  60. source: A Gentle Introduction to Graph Neural Networks (Basics, DeepWalk,

    and GraphSage) source: The graph neural network model source: Graph Neural Networks: A Review of Methods and Applications Graph Neural Networks
  61. Much more research github.com/src-d/awesome-machine-learning-on-source-code

  62. Graph Graph Graph ...

  63. D is for distributed!

  64. A new generation of tools

  65. Microsoft IntelliCode source: www.microsoft.com/en-us/research/blog/learning-source-code/ Uses concepts from Learning to Represent

    Programs with Graphs.
  66. None
  67. Is this a good name? func XXX(list []string, text string)

    bool { for _, s := range list { if s == text { return true } } return false } Suggestions: • Contains • Has func XXX(list []string, text string) int { for i, s := range list { if s == text { return i } } return -1 } Suggestions: • Find • Index code2vec: Learning Distributed Representations of Code
  68. Towards Natural Language Semantic Code Search source: github.blog/2018-09-18-towards-natural-language-semantic-code-search/ Embedding code

    snippets and their descriptions together for semantic search.
  69. experiments.github.com/semantic-code-search

  70. Facebook Sapienz and SapFix source: Finding and fixing software bugs

    automatically with SapFix and Sapienz Automated bug detection at scale.
  71. Initially by Ubisoft: Commit-Assistant Research: CLEVER: Combining Code Metrics with

    Clone Detection for Just-In-Time Fault Prevention and Resolution in Large Industrial Projects CLEVER detects “risky commits” and provides potential fixes. Ubisoft + Mozilla: CLEVER-Commit
  72. And so much more • Automated Style Guide Enforcement •

    Automated Code Review • Education • Code Generation: from unit tests, specification, natural language description. • ...
  73. Will developers be replaced?

  74. Developers will be empowered.

  75. Want to know more? References: • github.com/src-d/awesome-machine-learning-on-source-code • speakerdeck.com/campoy/oscon19 Me:

    • francesc@dgraph.io • @francesc
  76. We’re hiring! dgraph.io/careers

  77. Thanks francesc