Machine Learning on Source Code

source{d} is building the open-source components to enable large-scale code analysis and machine learning on source code.

Their powerful tools can ingest all of the world’s public git repositories, turning code into ASTs ready for machine learning and other analyses, all exposed through a flexible and friendly API.

Francesc Campoy, VP of Developer Relations at source{d}, will show you how to run machine learning on source code with a series of live demos.

Francesc Campoy Flores

April 25, 2018
Transcript

  1. Machine Learning on Source Code
    #MLonCode
    Francesc Campoy, source{d}

  2. ● The why
    ● The way
    ● The what?
    Machine Learning on Source Code

  3. Francesc Campoy
    VP of Developer Relations
    @francesc
    [email protected]
    Previously Google Cloud (ML + Go)

  4. speakerdeck.com/campoy/machine-learning-on-source-code

  5. The why
    Why do we want to do this?

  6. Deep Learning Revolutions
    - Computer Vision: ImageNet
- Natural Language Processing: Siri, Alexa, Google Assistant
    - Go: AlphaGo

  7. Deep Learning Revolutions
    source: wikimedia

  8. Machine Learning on Source Code
    - Automated code review
    - Source code similarity: clustering + search
    - Translation:
    - source code to natural language
    - source code to source code
    - natural language to source code?

  9. source: WOCinTech

  10. The way
    How are we tackling the challenge?

  11. '112', '97', '99', '107', '97', '103', '101', '32', '109',
    '97', '105', '110', '10', '10', '105', '109', '112', '111',
    '114', '116', '32', '40', '10', '9', '34', '102', '109',
    '116', '34', '10', '41', '10', '10', '102', '117', '110',
    '99', '32', '109', '97', '105', '110', '40', '41', '32',
    '123', '10', '9', '102', '109', '116', '46', '80', '114',
    '105', '110', '116', '108', '110', '40', '34', '72', '101',
    '108', '108', '111', '44', '32', '67', '104', '105', '99',
    '97', '103', '111', '34', '41', '10', '125', '10'
    package main
    import "fmt"
    func main() {
    	fmt.Println("Hello, Chicago")
    }
    What is Source Code
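To a machine, the listing above is nothing special: the quoted numbers are just byte values, and decoding them recovers the program text. A minimal sketch in Go (the deck's language of choice); the `decode` helper is ours, purely for illustration:

```go
package main

import "fmt"

// decode turns a sequence of ASCII byte values back into source text.
func decode(bs []byte) string {
	return string(bs)
}

func main() {
	// The first seven values from the slide spell "package".
	fmt.Println(decode([]byte{112, 97, 99, 107, 97, 103, 101})) // package
}
```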

  12. What is Source Code
    package main
    import "fmt"
    func main() {
    	fmt.Println("Hello, Chicago")
    }
    Tokens:
    package package
    IDENT main
    ;
    import import
    STRING "fmt"
    ;
    func func
    IDENT main
    (
    )
    {
    IDENT fmt
    .
    IDENT Println
    (
    STRING "Hello, Chicago"
    )
    ;
    }
    ;
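A token stream like this one can be produced with Go's standard go/scanner package. This is only a sketch of what a lexer emits, not source{d}'s actual pipeline:

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

// tokens lexes Go source and returns each token with its literal text.
func tokens(src string) []string {
	fset := token.NewFileSet()
	file := fset.AddFile("main.go", fset.Base(), len(src))
	var s scanner.Scanner
	s.Init(file, []byte(src), nil, 0)
	var out []string
	for {
		_, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		out = append(out, fmt.Sprintf("%s %q", tok, lit))
	}
	return out
}

func main() {
	// Go inserts a semicolon automatically at the end of the line,
	// which is why the slide's token stream is full of ';'.
	for _, t := range tokens("package main") {
		fmt.Println(t)
	}
}
```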

  13. package main
    import "fmt"
    func main() {
    	fmt.Println("Hello, Chicago")
    }
    What is Source Code
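At the tree level, Go's standard go/parser builds the AST directly. This sketch walks the tree to collect declared function names; it stands in for the language-agnostic tooling discussed later, not source{d}'s own code:

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// funcNames parses Go source and walks its AST, collecting the names
// of all declared functions.
func funcNames(src string) []string {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "main.go", src, 0)
	if err != nil {
		panic(err)
	}
	var names []string
	ast.Inspect(f, func(n ast.Node) bool {
		if fn, ok := n.(*ast.FuncDecl); ok {
			names = append(names, fn.Name.Name)
		}
		return true
	})
	return names
}

func main() {
	src := `package main

import "fmt"

func main() {
	fmt.Println("Hello, Chicago")
}`
	fmt.Println(funcNames(src)) // [main]
}
```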

  14. Datasets

  15. Data retrieval at source{d}
    Borges (github.com/src-d/borges)
    ● Store git repositories
    ● Rooted repositories
    Rovers (github.com/src-d/rovers)
    ● Crawls for git repos
    ● Supports: GitHub, BitBucket, Cgit
    Siva (github.com/src-d/siva)
    ● Seekable Indexed Block Archiver
    ● Small, concatenable, constant access.

  16. Rooted repositories

  17. https://arxiv.org/abs/1803.10144

  18. github.com/src-d/datasets

  19. Learning from Source Code

  20. Learning from Code
    Code is:
    - a sequence of bytes
    - a sequence of tokens
    - a tree

  21. Learning sequences
    Input → Output
    Speech recognition: audio clip → text
    Sentiment analysis: text → rating -1 to 1
    Machine Translation: Hello, everyone! → Hola a tothom!
    Video recognition: sequence of frames → text

  22. Learning from Characters

  23. Neural networks, right?

  24. Recurrent Neural Networks

  25. source: Martin Gorner

  26. Predicting the next character
    Input: ‘p’, ‘a’, ‘c’, ‘k’, ‘a’, ‘g’
    Output: ‘e’ 80%
            ‘i’ 15%
            ...

  27. Training the neural network
    Input:  p a c k a g
    Output: a c k a g e
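As a stand-in for the RNN in the demos, here is a deliberately naive frequency model in Go: it predicts the next character as the one that most often follows the previous character in a training corpus. The nextChar helper is hypothetical, purely for illustration:

```go
package main

import "fmt"

// nextChar returns the character that most frequently follows prev in
// the training corpus. A real model (like the RNN in the demos) would
// condition on a whole sequence, not a single character.
func nextChar(corpus string, prev byte) byte {
	counts := map[byte]int{}
	for i := 0; i+1 < len(corpus); i++ {
		if corpus[i] == prev {
			counts[corpus[i+1]]++
		}
	}
	var best byte
	max := -1
	for c, n := range counts {
		if n > max {
			max, best = n, c
		}
	}
	return best
}

func main() {
	corpus := "package main package main package"
	fmt.Printf("%c\n", nextChar(corpus, 'p')) // a
}
```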

  28. Demo time!

  29. More demo time!

  30. Is this useful?
    - Predicting implies understanding
    - We can predict aspects of code:
    - Help with assisted editing
    - Detect anomalies

  31. splittingidentifiersforthewin
    (Paper coming soon)

  32. Learning from Tokens

  33. We could use the same mechanism!
    Input:  package main ; import "fmt" ;
    Output: main ; import "fmt" ; func

  34. Categorical (non-continuous) variables require one-hot encoding.
    Predicting digits 0 to 9:
    0 -> [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    1 -> [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    2 -> [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
    3 -> [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
    ...
    9 -> [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
    But … one-hot encoding
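A one-hot encoder is a one-liner; the helper name oneHot is ours, not from any library:

```go
package main

import "fmt"

// oneHot encodes class i out of n classes as a vector that is all
// zeros except for a single 1 at position i.
func oneHot(i, n int) []int {
	v := make([]int, n)
	v[i] = 1
	return v
}

func main() {
	fmt.Println(oneHot(2, 10)) // [0 0 1 0 0 0 0 0 0 0]
}
```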

  35. Possible characters in human text: hundreds
    Possible characters in computer programs: hundreds
    Possible words in English: 171,476 (Oxford dictionary*)
    * urban dictionary not included
    But … one-hot encoding

  36. How many possible identifiers are there in language X?

  37. Word embeddings

  38. (image-only slide)

  39. (image-only slide)

  40. word2vec

  41. Splitting and stemming
    allSeenElements
    → splitting (and we know how!)
    all, seen, elements
    → stemming
    all, see, element
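For identifiers that use camelCase, splitting can be done with a simple rule; the learned splitter the slides allude to handles the harder, delimiter-free cases. A rule-based sketch (splitIdentifier is a hypothetical helper):

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitIdentifier breaks a camelCase identifier into lowercase words
// by starting a new word at every uppercase letter.
func splitIdentifier(id string) []string {
	var words []string
	start := 0
	for i, r := range id {
		if i > 0 && unicode.IsUpper(r) {
			words = append(words, strings.ToLower(id[start:i]))
			start = i
		}
	}
	words = append(words, strings.ToLower(id[start:]))
	return words
}

func main() {
	fmt.Println(splitIdentifier("allSeenElements")) // [all seen elements]
}
```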

  42. id2vec
    Vocabulary size: 500,000
    Dataset: >100,000 most starred repos on GitHub
    Result: https://github.com/src-d/ml

  43. https://blog.sourced.tech/post/id2vec/

  44. Receive is to send as read is to ...
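Analogy queries like this are answered with vector arithmetic: find the vector closest to send - receive + read. The tiny 2-d embeddings below are invented for illustration only; real id2vec vectors are high-dimensional and learned from code:

```go
package main

import (
	"fmt"
	"math"
)

// Toy 2-d embeddings, invented for this example.
var emb = map[string][]float64{
	"receive": {1, 0},
	"send":    {-1, 0},
	"read":    {1, 1},
	"write":   {-1, 1},
}

// cosine returns the cosine similarity of two vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// analogy answers "a is to b as c is to ?" by finding the word whose
// vector is closest to b - a + c, excluding the query words.
func analogy(a, b, c string) string {
	target := make([]float64, len(emb[a]))
	for i := range target {
		target[i] = emb[b][i] - emb[a][i] + emb[c][i]
	}
	best, bestSim := "", -2.0
	for w, v := range emb {
		if w == a || w == b || w == c {
			continue
		}
		if s := cosine(target, v); s > bestSim {
			best, bestSim = w, s
		}
	}
	return best
}

func main() {
	fmt.Println(analogy("receive", "send", "read")) // write
}
```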

  45. Learning from Trees

  46. Abstract Syntax Trees
    ● Each programming language has its own grammar
    ● Each grammar generates slightly different ASTs
    ● We want to learn from *all* languages!

  47. Babelfish (https://bblf.sh)
    Universal Abstract Syntax Trees

  48. (image-only slide)

  49. (image-only slide)

  50. Structural Embeddings
    ? 0.1 0.5 0.9 ... 0.0

  51. https://arxiv.org/abs/1703.00572

  52. https://arxiv.org/abs/1803.09544

  53. And more coming soon!
    PS: we’re hiring

  54. Other projects
    - Vecino: finding similar repositories
    - Apollo: finding source code duplication at scale
    - TMSC: topic modeling on source code repositories
    - Snippet-Ranger: topic modeling on source code snippets

  55. Awesome Machine Learning on Source Code
    https://github.com/src-d/awesome-machine-learning-on-source-code

  56. Want to know more?
    @sourcedtech
    sourced.tech
    slack!

  57. Thanks
    Francesc Campoy
    source{d}
    @francesc
    [email protected]
