CSI: Gopher

If we could intelligently parse all of the open-source Go code on GitHub, what could we learn? We’re going to show you some of the interesting things we’ve found in Go projects, from library usage, idioms & package layouts, to how Gophers can use this data to make decisions about their own APIs.

Francesc Campoy Flores

April 12, 2019

Transcript

  1. Francesc Campoy @francesc - funemployed
    Matt Silverlock @elithrar - Google

  2. Two years ago, Francesc wrote a blog post

  3. What are the most imported packages?
    Got some cool results

  4. It wasn’t easy ...

  5. Last year I attended a GoSF meetup

  6. Francesc attending Matt’s talk

  7. ● Wrote a LOT of regex
    ● Found some neat bugs!
    ● Francesc had been working with tools that made what I was doing more accessible.

  8. How do you find bugs or API (mis)usage?
    ● Across microservice codebases?
    ● In all consumers of your library?
    ● Accurately?

  9. Previously: ugly regular expressions

  10. And: really ugly regular expressions

  11. Obviously, this is No Good ™
    How can we do better?

  12. What's this talk actually about?

  13. Your task
    Find the most common string literal in your source code

  14. What’s a string? It depends:
    'single quotes'
    "double quotes"
    '''triple single quotes'''
    """triple double quotes"""
    `back quotes`
    «Please»
    „make it“
    qq§ STOP! §

  15. https://xkcd.com/1171/

  16. Oh, also prepare to escape your escaping characters ...

  17. https://xkcd.com/1638/

  18. Well, that’s a complicated* regular expression
    *impossible without recursive regular expressions

  19. Recursive regular expressions?

  20. Well, we can parse the repositories into ASTs (Abstract Syntax Trees).
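
For Go alone, the string-literal task can be sketched with the standard library's go/parser and go/ast packages instead of regular expressions. This is not the tooling used in the talk: countStrings and exampleSrc are hypothetical names, and the input is made up for illustration.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"sort"
	"strconv"
)

// countStrings parses src as a Go file and counts every string literal by
// its unquoted value, so "fmt" and a raw-string fmt share one bucket.
// Import paths and struct tags are string literals too, so they are counted.
func countStrings(src string) (map[string]int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, 0)
	if err != nil {
		return nil, err
	}
	counts := make(map[string]int)
	ast.Inspect(file, func(n ast.Node) bool {
		lit, ok := n.(*ast.BasicLit)
		if !ok || lit.Kind != token.STRING {
			return true
		}
		if s, err := strconv.Unquote(lit.Value); err == nil {
			counts[s]++
		}
		return true
	})
	return counts, nil
}

// exampleSrc is a made-up input, not taken from the dataset in the talk.
const exampleSrc = `package main

import "fmt"

func main() {
	fmt.Println("Hello, gophers")
	fmt.Println("Hello, gophers", "")
}
`

func main() {
	counts, err := countStrings(exampleSrc)
	if err != nil {
		panic(err)
	}
	// Sort by descending count, then alphabetically for stable output.
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if counts[keys[i]] != counts[keys[j]] {
			return counts[keys[i]] > counts[keys[j]]
		}
		return keys[i] < keys[j]
	})
	for _, k := range keys {
		fmt.Printf("%q: %d\n", k, counts[k])
	}
}
```

Because the parser already knows where each literal starts and ends, quoting styles and escape sequences need no special handling here.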

  21. Universal Abstract Syntax Trees
    package main
    import "fmt"
    func main() {
        fmt.Println("Hello, gophers")
    }
    [Tree diagram of the code above]
    File
      Declaration
        Import
          “fmt”
      Declaration
        FunctionGroup
          main
          Body
            go:CallExpr
              Fun: fmt, Println
              Args: “Hello, gophers”

  22. Universal Abstract Syntax Trees
    print('''Hello, Pythonistas''')
    [Tree diagram of the code above]
    File
      CallExpr
        Fun: print
        Args: “Hello, Pythonistas”

  23. A single tree format for all languages: a single tool for all

  24. A single label set: a single concept list

  25. Universal Abstract Syntax Trees
    XPath: //uast:String
    [Tree diagram with the uast:String nodes highlighted]
    File
      Declaration
        Import
          “fmt” (uast:String)
      Declaration
        FunctionGroup
          main
          Body
            go:CallExpr
              Fun: fmt, Println
              Args: “Hello, gophers” (uast:String)
    Matches: “fmt”, “Hello, gophers”

  26. Universal Abstract Syntax Trees
    XPath: //uast:String
    [Tree diagram with the uast:String node highlighted]
    File
      CallExpr
        Fun: print
        Args: '''Hello, Pythonistas''' (uast:String)
    Matches: “Hello, Pythonistas”

  27. Let’s run it!

  28. While this is running ...

  29. So, how are ASTs relevant here?
    - We downloaded ~19GB of Go repositories from GitHub*
    - Parsed them as UAST with src-d/engine
    - Queried the generated databases with SQL + UAST extensions
    https://github.com/src-d/engine

  30. OK, so: what are some of the things we can investigate?
    - Finding ‘bad’ code
    - Best practices and idioms
    - Usage analysis of APIs

  31. Investigation #1: Finding 'bad' code

  32. "Bad crypto"
    - One of my favorite topics
    - How is a non-expert supposed to know that math/rand is bad vs. crypto/rand?
    - What about hash functions vs. KDFs?
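
A first pass at spotting candidates could simply flag files that import math/rand, leaving a human to decide whether the use is security-sensitive. The sketch below uses only go/parser from the standard library. importsPackage and exampleSrc are hypothetical names, and an import alone does not prove misuse, since math/rand is fine for simulations or jitter.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
)

// importsPackage reports whether the Go source in src imports the package
// at the given import path. Only the import block is parsed.
func importsPackage(src, path string) (bool, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, parser.ImportsOnly)
	if err != nil {
		return false, err
	}
	for _, imp := range file.Imports {
		p, err := strconv.Unquote(imp.Path.Value)
		if err != nil {
			continue
		}
		if p == path {
			return true, nil
		}
	}
	return false, nil
}

// exampleSrc is a made-up file that builds a "token" from math/rand.
const exampleSrc = `package auth

import (
	"math/rand"
	"strconv"
)

func newToken() string {
	return strconv.Itoa(rand.Int())
}
`

func main() {
	flagged, err := importsPackage(exampleSrc, "math/rand")
	if err != nil {
		panic(err)
	}
	fmt.Println("imports math/rand:", flagged)
}
```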

  33. Investigation #2: Best practices & idioms

  34. Computer says ...

  35. Number of init functions per file: mean 0.12
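
For a single file, the per-file count behind a statistic like this could be computed roughly as follows. This is a sketch with go/ast, not the SQL + UAST query from the talk; countInits and exampleSrc are hypothetical names.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// countInits returns the number of package-level init functions declared
// in the Go source src. Go permits several init functions in one file.
func countInits(src string) (int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, 0)
	if err != nil {
		return 0, err
	}
	n := 0
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		// A real init has no receiver; methods named init are skipped.
		if ok && fn.Recv == nil && fn.Name.Name == "init" {
			n++
		}
	}
	return n, nil
}

// exampleSrc is a made-up file with two init functions and one method
// that merely happens to be called init.
const exampleSrc = `package demo

type T struct{}

func (T) init() {}

func init() {}

func init() {}
`

func main() {
	n, err := countInits(exampleSrc)
	if err != nil {
		panic(err)
	}
	fmt.Println("init functions:", n)
}
```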

  36. Number of init functions per file (0 to 10)

  37. Number of init functions per file (log scale)
    352 init functions LOL

  38. Smells like vendoring ...

  39. github.com/vmware/govmomi/vim25/types/enum.go

  40. Applied the scientific method to best practices
    Make an observation.
    Ask a question.
    Form a hypothesis, or testable explanation.
    Make a prediction based on the hypothesis.
    Test the prediction.
    Iterate: use the results to make new hypotheses or predictions.

  41. Investigation #3: API usage & breaking them

  42. Premise: You maintain a popular OSS library.
    Problem: You're thinking about deprecating a method, but want to quantify the impact.
    How do you accurately find these cases?

  43. Actual problem:
    ● gorilla/context predates net/http's Request.Context() implementation.
    ● Using both causes a memory leak, due to "islanding" the pointer to the original *Request.
    ● Can we find these users & help?!
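
A rough syntactic check for affected files might look like the sketch below. It assumes that a call to any method named WithContext is a good enough proxy for net/http context usage, because Request.WithContext returns a shallow copy with a new pointer. That heuristic, along with the names usesBoth and exampleSrc, is an assumption for illustration and not the query used in the talk.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
)

// usesBoth reports whether src imports github.com/gorilla/context and also
// calls a method named WithContext. No type checking is done, so any
// WithContext method matches (a deliberate simplification).
func usesBoth(src string) (bool, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, 0)
	if err != nil {
		return false, err
	}
	importsGorilla := false
	for _, imp := range file.Imports {
		p, err := strconv.Unquote(imp.Path.Value)
		if err == nil && p == "github.com/gorilla/context" {
			importsGorilla = true
		}
	}
	if !importsGorilla {
		return false, nil
	}
	found := false
	ast.Inspect(file, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		if sel, ok := call.Fun.(*ast.SelectorExpr); ok && sel.Sel.Name == "WithContext" {
			found = true
			return false
		}
		return true
	})
	return found, nil
}

// exampleSrc is a made-up handler that mixes both context mechanisms.
const exampleSrc = `package app

import (
	"net/http"

	gcontext "github.com/gorilla/context"
)

func handler(w http.ResponseWriter, r *http.Request) {
	gcontext.Set(r, "user", "gopher")
	r = r.WithContext(r.Context())
	_ = r
}
`

func main() {
	both, err := usesBoth(exampleSrc)
	if err != nil {
		panic(err)
	}
	fmt.Println("mixes gorilla/context with WithContext:", both)
}
```

A type-aware version would need golang.org/x/tools/go/packages or similar to confirm that the receiver really is an *http.Request.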

  44. Compared to...

  45. ● Order your godoc by most-used identifiers, rather than a bunch of ErrSomething at the top
    ● Identify dependency version usage across your org, or as a maintainer.
    ● Smart auto-completion based on previous usages of the API.
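
Ranking a package's identifiers by usage could start from a count like the one sketched here, which tallies selector expressions such as errors.New in one file. It is purely syntactic. countUses and exampleSrc are hypothetical names, the local package name is assumed to equal the last import path element unless the import is renamed, and a local variable that shadows the package name would be miscounted.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"path"
	"sort"
	"strconv"
)

// countUses counts how often each exported identifier of importPath is
// referenced as pkg.Ident in the Go source src.
func countUses(src, importPath string) (map[string]int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, 0)
	if err != nil {
		return nil, err
	}
	// Work out the name this file uses for the package.
	local := ""
	for _, imp := range file.Imports {
		p, err := strconv.Unquote(imp.Path.Value)
		if err != nil || p != importPath {
			continue
		}
		if imp.Name != nil {
			local = imp.Name.Name
		} else {
			local = path.Base(p) // assumption: package name == last path element
		}
	}
	counts := make(map[string]int)
	if local == "" || local == "_" || local == "." {
		return counts, nil
	}
	ast.Inspect(file, func(n ast.Node) bool {
		sel, ok := n.(*ast.SelectorExpr)
		if !ok {
			return true
		}
		if id, ok := sel.X.(*ast.Ident); ok && id.Name == local {
			counts[sel.Sel.Name]++
		}
		return true
	})
	return counts, nil
}

// exampleSrc is a made-up consumer of the errors package.
const exampleSrc = `package demo

import "errors"

var ErrA = errors.New("a")
var ErrB = errors.New("b")

func isA(err error) bool { return errors.Is(err, ErrA) }
`

func main() {
	counts, err := countUses(exampleSrc, "errors")
	if err != nil {
		panic(err)
	}
	names := make([]string, 0, len(counts))
	for name := range counts {
		names = append(names, name)
	}
	// Most-used identifiers first, ties broken alphabetically.
	sort.Slice(names, func(i, j int) bool {
		if counts[names[i]] != counts[names[j]] {
			return counts[names[i]] > counts[names[j]]
		}
		return names[i] < names[j]
	})
	for _, name := range names {
		fmt.Printf("errors.%s: %d\n", name, counts[name])
	}
}
```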

  46. Tools we used and where to find them

  47. Tools:
    ● Google BigQuery
    ● source{d} engine: github.com/src-d/engine
    ● A lot of RAM, CPUs, and time
    Resources:
    ● Finding Bugs with BigQuery & GitHub: bit.ly/gosf-bq
    ● Analyzing Go code with BigQuery: bit.ly/go-bq

  48. So … what about those strings?

  49. Most common strings in Go
    #1 with 1,260,374 occurrences: ''
    #2 with 456,480 occurrences: 'fmt'
    #3 with 337,064 occurrences: 'json:"-"'

  50. Other languages
    most common strings in Python:
    'automanaged' : 25,446
    '//visibility:public' : 21,962
    'go_library' : 16,485
    most common strings in Ruby:
    '\n' : 658
    '' : 551
    'shell' : 363
    most common strings in Java:
    '' : 613
    '\n' : 392
    '0' : 293
    most common strings in PHP:
    'strict_param' : 1
    'strict' : 1
    'short_array_syntax' : 1

  51. Thanks!
    @elithrar
    @francesc
