Machine Learning On Go Code

We've all wondered how to use Machine Learning with Go, but what about turning the tables for once? What can Machine Learning do *for* Go? During this presentation, we will discover how different Machine Learning models can help us write better Go by predicting everything from our next character to our next bug!

Francesc’s talk will cover the basics of which Machine Learning techniques can be applied to source code, specifically:

- [embeddings over identifiers](https://bit.ly/2HEcQhg),
- structural embeddings over source code, answering the question of how similar two fragments of code are,
- recurrent neural networks for code completion,
- future direction of the research.

While the topic is advanced, the level of mathematics required for this talk will be kept to a minimum. Rather than getting stuck in the details, we'll discuss the advantages and limitations of these techniques, and their possible implications for our lives as developers.

Francesc Campoy Flores

August 28, 2018

Transcript

  1. “Software is eating the world”

  2. 128k LoC

  3. 4-5M LoC

  4. 9M LoC

  5. 18M LoC

  6. 45M LoC

  7. 150M LoC

  8.–12. (image-only slides)

  13. Machine Learning on Go Code
    Francesc Campoy

  14. Machine Learning in Go Code
    Francesc Campoy

  15. Francesc Campoy
    VP of Developer Relations
    Previously:
    ● Developer Advocate at Google (Go team and Google Cloud Platform)
    twitter.com/francesc | github.com/campoy

  16. just for func

  17. Agenda
    ● Machine Learning on Source Code
    ● Research
    ● Use Cases
    ● The Future

  18. Machine Learning on Source Code

  19. Machine Learning on Source Code
    Field of Machine Learning where the input data is source code.
    MLonCode

  20. Machine Learning on Source Code
    Requires:
    ● Lots of data
    ● Really, lots and lots of data
    ● Fancy ML Algorithms
    ● A little bit of luck
    Related Fields:
    ● Data Mining
    ● Natural Language Processing
    ● Graph Based Machine Learning

  21. Challenge #1
    Data Retrieval

  22. The datasets of ML on Code
    ● GH Archive: https://www.gharchive.org
    ● Public Git Archive: https://pga.sourced.tech

  23. Retrieving data for ML on Code
    Tasks
    ● Language Classification
    ● File Parsing
    ● Token Extraction
    ● Reference Resolution
    ● History Analysis
    Tools
    ● enry, linguist, etc.
    ● Babelfish, ad-hoc parsers
    ● XPath / CSS selectors
    ● Kythe
    ● go-git
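    As a concrete illustration of the retrieval step, here is a minimal sketch (not from the deck) that fetches a repository with go-git, one of the tools listed above. The repository URL and destination path are only examples.

    package main

    import (
        "log"
        "os"

        git "gopkg.in/src-d/go-git.v4"
    )

    func main() {
        // Clone a repository so its history and file contents can be fed
        // into an MLonCode pipeline.
        _, err := git.PlainClone("/tmp/go", false, &git.CloneOptions{
            URL:      "https://github.com/golang/go",
            Progress: os.Stdout,
        })
        if err != nil {
            log.Fatal(err)
        }
    }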

  24. srcd sql
    # total lines of code per language in the Go repo
    SELECT lang, SUM(lines) as total_lines
    FROM (
        SELECT
            LANGUAGE(t.tree_entry_name, b.blob_content) AS lang,
            ARRAY_LENGTH(SPLIT(b.blob_content, '\n')) as lines
        FROM refs r
        NATURAL JOIN commits c
        NATURAL JOIN commit_trees ct
        NATURAL JOIN tree_entries t
        NATURAL JOIN blobs b
        WHERE r.ref_name = 'HEAD' and r.repository_id = 'go'
    ) AS lines
    WHERE lang is not null
    GROUP BY lang
    ORDER BY total_lines DESC;

  25. srcd sql
    SELECT files.repository_id, files.file_path,
        ARRAY_LENGTH(UAST(
            files.blob_content,
            LANGUAGE(files.file_path, files.blob_content),
            '//*[@roleFunction and @roleDeclaration]')
        ) as functions
    FROM files
    NATURAL JOIN refs
    WHERE
        LANGUAGE(files.file_path, files.blob_content) = 'Go'
        AND refs.ref_name = 'HEAD'

  26. (image-only slide)

  27. source{d} engine
    github.com/src-d/engine

  28. Challenge #2
    Data Analysis

  29. What is Source Code
    '112', '97', '99', '107', '97', '103',
    '101', '32', '109', '97', '105', '110',
    '10', '10', '105', '109', '112', '111',
    '114', '116', '32', '40', '10', '9',
    '34', '102', '109', '116', '34', '10',
    '41', '10', '10', '102', '117', '110',
    '99', '32', '109', '97', '105', '110',
    '40', '41', '32', '123', '10', '9',
    '102', '109', '116', '46', '80', '114',
    '105', '110', '116', '108', '110', '40',
    '34', '72', '101', '108', '108', '111',
    '44', '32', '112', '108', '97', '121',
    '103', '114', '111', '117', '110', '100',
    '34', '41', '10', '125', '10'
    package main

    import "fmt"

    func main() {
        fmt.Println("Hello, Denver")
    }
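    For reference, a small sketch (not part of the deck) of this byte-level view: it reads a Go file and prints the same kind of byte values shown above. The file name is just an example.

    package main

    import (
        "fmt"
        "io/ioutil"
    )

    func main() {
        // Read a source file and print it as raw byte values: the
        // lowest-level representation a model can consume.
        src, err := ioutil.ReadFile("main.go")
        if err != nil {
            panic(err)
        }
        for _, b := range src {
            fmt.Printf("%d ", b)
        }
        fmt.Println()
    }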

  30. What is Source Code
    package package
    IDENT main
    ;
    import import
    STRING "fmt"
    ;
    func func
    IDENT main
    (
    )
    {
    IDENT fmt
    .
    IDENT Println
    (
    STRING "Hello, Denver"
    )
    ;
    }
    ;
    package main

    import "fmt"

    func main() {
        fmt.Println("Hello, Denver")
    }
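    The token-level view above can be reproduced with the standard library; a minimal sketch (not from the deck) using go/scanner:

    package main

    import (
        "fmt"
        "go/scanner"
        "go/token"
    )

    func main() {
        src := []byte("package main\n\nimport \"fmt\"\n\nfunc main() {\n\tfmt.Println(\"Hello, Denver\")\n}\n")

        fset := token.NewFileSet()
        file := fset.AddFile("main.go", fset.Base(), len(src))

        var s scanner.Scanner
        s.Init(file, src, nil, 0)

        // Print every token and its literal, mirroring the list above.
        for {
            _, tok, lit := s.Scan()
            if tok == token.EOF {
                break
            }
            fmt.Println(tok, lit)
        }
    }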

  31. What is Source Code
    package main

    import "fmt"

    func main() {
        fmt.Println("Hello, Denver")
    }

  32. What is Source Code
    package main

    import "fmt"

    func main() {
        fmt.Println("Hello, Denver")
    }

  33. What is Source Code
    ● A sequence of bytes
    ● A sequence of tokens
    ● An abstract syntax tree
    ● A Graph (e.g. Control Flow Graph)
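    A small sketch (not from the deck) of the tree-level view using go/parser and go/ast, printing the node types of the abstract syntax tree:

    package main

    import (
        "fmt"
        "go/ast"
        "go/parser"
        "go/token"
    )

    func main() {
        src := "package main\n\nimport \"fmt\"\n\nfunc main() {\n\tfmt.Println(\"Hello, Denver\")\n}\n"

        fset := token.NewFileSet()
        f, err := parser.ParseFile(fset, "main.go", src, 0)
        if err != nil {
            panic(err)
        }

        // Walk the AST and print the type of every node.
        ast.Inspect(f, func(n ast.Node) bool {
            if n != nil {
                fmt.Printf("%T\n", n)
            }
            return true
        })
    }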

  34. Challenge #3
    Learning from Source Code

  35. Neural Networks
    Basically fancy linear regression machines.
    Given an input of a fixed size, they predict an output of a fixed size.
    Example: MNIST
    Input: 28x28 px images
    Output: a digit from 0 to 9
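    A tiny sketch (not from the deck, with made-up weights) of that idea in Go: a single linear layer plus softmax mapping a fixed-size input vector to ten digit probabilities.

    package main

    import (
        "fmt"
        "math"
    )

    // forward computes softmax(W·x + b): a fixed-size input (here 4 values
    // standing in for 28x28 pixels) mapped to 10 digit probabilities.
    func forward(W [][]float64, b, x []float64) []float64 {
        out := make([]float64, len(b))
        for i := range out {
            out[i] = b[i]
            for j := range x {
                out[i] += W[i][j] * x[j]
            }
        }
        // softmax turns raw scores into probabilities that sum to 1.
        var sum float64
        for i, v := range out {
            out[i] = math.Exp(v)
            sum += out[i]
        }
        for i := range out {
            out[i] /= sum
        }
        return out
    }

    func main() {
        // Untrained, made-up weights: 10 outputs, 4 inputs.
        W := make([][]float64, 10)
        for i := range W {
            W[i] = []float64{0.1, -0.2, 0.05, 0.3}
        }
        b := make([]float64, 10)
        x := []float64{0.0, 0.5, 0.9, 0.1}
        fmt.Println(forward(W, b, x))
    }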

  36. MNIST
    Output vector: ~0, ~0, ~0, ~0, ~0, ~0, ~0, ~0, ~1, ~0

  37. MLonCode: Predict the next token
    for i := 0 ; i < 10 ; i ++

  38. Recurrent Neural Networks
    Can process sequences of variable length.
    Uses its own output as a new input.
    Example:
    Natural Language Translation:
    Input: “bonjour, les gaufres”
    Output: “hi, waffles”
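    A minimal sketch (not from the deck, with made-up weights and toy dimensions) of the recurrence behind this: one step computes the new hidden state from the current input and the previous state, and reusing the same step lets the network walk a sequence of any length.

    package main

    import (
        "fmt"
        "math"
    )

    // step is one recurrent update: h' = tanh(Wx·x + Wh·h + b).
    // The same weights are reused at every position, which is what lets the
    // network process sequences of variable length.
    func step(Wx, Wh [][]float64, b, x, h []float64) []float64 {
        next := make([]float64, len(h))
        for i := range next {
            v := b[i]
            for j := range x {
                v += Wx[i][j] * x[j]
            }
            for j := range h {
                v += Wh[i][j] * h[j]
            }
            next[i] = math.Tanh(v)
        }
        return next
    }

    func main() {
        // Toy, untrained weights: 2-dimensional inputs, 3-dimensional state.
        Wx := [][]float64{{0.1, 0.2}, {-0.3, 0.4}, {0.5, -0.1}}
        Wh := [][]float64{{0.1, 0, 0}, {0, 0.1, 0}, {0, 0, 0.1}}
        b := []float64{0, 0, 0}

        h := []float64{0, 0, 0}
        seq := [][]float64{{1, 0}, {0, 1}, {1, 1}} // a variable-length sequence
        for _, x := range seq {
            h = step(Wx, Wh, b, x, h)
        }
        fmt.Println(h)
    }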

  39. MLonCode: Code Generation
    charRNN: Given n characters, predict the next one
    Trained over the Go standard library
    Achieved 61% accuracy on predictions.

  40. Before training
    r t,
    [email protected] kpktp 0p000 xS%%%?ttk?^@p0rk^@%[email protected]#p^@ #pp}}%p^@?P%^@@k#%@P}}ta [email protected]}^@t%@% %%aNt i
    ^@[email protected]@ pyikkp?%y ?t k L P0L t% ^@i%yy ^@p i? %L%LL tyLP?a [email protected]?tkk^@ @^@ykk^@i#P^@[email protected][email protected]%1tt%^@tPTta L
    ^@LL%% %i1::yyy^@^@t tP @[email protected]#Patt 1^@@ k^@k ? yt%L1^@tP%k1?k? % ^@i ^@ta1?1taktt1P?a^@^@Pkt?#^@t^@##1?##
    #^@t11#:^@%??t%1^@a 1?a at1P ^@^@Pt #%^@^@ ^@aaak^@#a#?P1Pa^@tt%?^@kt?#akP ?#^@i%%aa ^@1%t tt?a?%
    t^@k^@^@k^@ a : ^@1 P# % ^@^@#t% :% kkP ^@#?P: t^@a
    ?%##?kkPaP^@ #a k?t?? ai?i%PPk taP% P^@ k^@iiai#?^@# #t ?# P?P^@ i^@ttPt #
    1%11 ti a^@k P^@k ^@kt %^@%y?#a a#% @? kt ^@t%k? ^@PtttkL tkLa1 ^@iaay?p1% Pta tt ik?ty
    k^@kpt%^@tktpkryyp^@?pP# %kt?ki? i @t^@k^@%#P} ?at}akP##Pa11%^@i% ^@?ia ia%##%tki %
    }i%%%}} a ay^@%yt }%t ^@tU%a% t}yi^@ ^@ @t yt%? aP @% ^@??^@%? ^@??k#%
    kk#%t?a: P}^@t :#^@#1t^@#: w^@P#%w:Pt t # t%aa%[email protected]@@^@[email protected]^@a # y}^@# ^@? % tP i?
    ?tk ktPPt a tpprrpt? a^@ pP pt %p ? k? ^@^@ kP^@%%?tk a Pt^@#
    tP? P kkP1L1tP a%? t1P%PPti^@?%ytk %#%%[email protected]?^@ty^@iyk%1#^@@^@1#t a [email protected]^@^@ [email protected]^@1P^@%%#@P:^@%^@ t
    1:#P#@LtL#@L L1 %%dt??^@L ^@iBt yTk%p ^@i

  41. After one epoch (dataset seen once)
    if testingValuesIntering() {
    t.SetCaterCleen(time.SewsallSetrive(true)
    if weq := nil {
    t.Errorf("eshould: wont %v", touts anverals, prc.Strnared, error
    }
    t, err := ntr.Soare(cueper(err, err)
    if err != nil {
    t.Errorf("preveth dime il resetests:%d; want %#',", tr.test.Into
    }
    if err != nil {
    return
    }
    if err == nel {
    t.Errorf("LoconserrSathe foot %q::%q: %s;%want %d", d, err)
    },
    defarenContateFule(temt.Canses)
    }
    if err != nil {
    return err
    }
    // Treters and restives of the sesconse stmpeletatareservet
    // This no to the result digares wheckader. Constate bytes alleal

  42. After two epochs
    if !ok {
    t.Errorf("%d: %v not %v", i, err)
    }
    if !ot.Close()
    if enr != nil {
    t.Fatal(err)
    }
    if !ers != nil {
    t.Fatal(err)
    }
    if err != nil {
    t.Fatal(err)
    }
    if err != nil {
    t.Errorf("error %q: %s not %v", i, err)
    }
    return nil
    }

  43. After many epochs
    if got := t.struct(); !ok {
    t.Fatalf("Got %q: %q, %v, want %q", test, true
    }
    if !strings.Connig(t) {
    t.Fatalf("Got %q: %q", want %q", t, err)
    }
    if !ot {
    t.Errorf("%s < %v", x, y)
    }
    if !ok {
    t.Errorf("%d <= %d", err)
    }
    if !stricgs(); !ot {
    t.Errorf("!(%d <= %v", x, e)
    }
    }
    if !ot != nil {
    return ""
    }

  44. Learning to Represent Programs with Graphs
    Miltiadis Allamanis, Marc Brockschmidt, Mahmoud Khademi
    https://arxiv.org/abs/1711.00740
    The VARMISUSE task: given a program and a gap in it,
    predict what variable is missing.
    from, err := os.Open("a.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer from.Close()
    to, err := os.Open("b.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer ???.Close()
    io.Copy(to, from)

  45. code2vec: Learning Distributed Representations of Code
    Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav
    https://arxiv.org/abs/1803.09473 | https://code2vec.org/

  46. Much more research
    github.com/src-d/awesome-machine-learning-on-source-code

  47. Challenge #4
    What can we build?

  48. Predictable vs Predicted
    Output vector: ~0, ~0, ~0, ~0, ~0, ~0, ~0, ~0, ~1, ~0

  49. A Go PR
    An attention model for code reviews.

  50.–51. (image-only slides)

  52. Can you see the mistake?
    VARMISUSE
    from, err := os.Open("a.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer from.Close()
    to, err := os.Open("b.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer from.Close()
    io.Copy(to, from)

  53. Can you see the mistake?
    VARMISUSE
    from, err := os.Open("a.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer from.Close()
    to, err := os.Open("b.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer from.Close() ← s/from/to/
    io.Copy(to, from)

  54. Is this a good name?
    code2vec: Learning Distributed Representations of Code
    func XXX(list []string, text string) bool {
        for _, s := range list {
            if s == text {
                return true
            }
        }
        return false
    }
    Suggestions:
    ● Contains
    ● Has
    func XXX(list []string, text string) int {
        for i, s := range list {
            if s == text {
                return i
            }
        }
        return -1
    }
    Suggestions:
    ● Find
    ● Index

  55. Assisted code review: src-d/lookout
    (image source: WOCinTech)

  56. And so much more
    Coming soon:
    ● Automated Style Guide Enforcement
    ● Bug Prediction
    ● Automated Code Review
    ● Education
    Coming … later:
    ● Code Generation: from unit tests, specification, natural language description.
    ● Natural Analysis: code description and conversational analysis.

  57. Will developers be replaced?

  58. Developers will be empowered.

  59. Want to know more?
    ● sourced.tech (psst, we’re hiring)
    ● github.com/src-d/awesome-machine-learning-on-source-code
    [email protected]
    ● come say hi, I have stickers

  60. Thanks
    francesc
