CSI: Gopher

If we could intelligently parse all of the open-source Go code on GitHub, what could we learn? We’re going to show you some of the interesting things we’ve found in Go projects, from library usage, idioms & package layouts, to how Gophers can use this data to make decisions about their own APIs.

Francesc Campoy Flores

April 12, 2019


  1. CSI: Gopher

  2. Francesc Campoy (@francesc), funemployed; Matt Silverlock (@elithrar), Google

  3. Two years ago, Francesc wrote a blog post

  4. What are the most imported packages? Got some cool results

  5. It wasn’t easy ...

  9. Last year I attended a GoSF meetup

  10. Francesc attending Matt’s talk

  11. • Wrote a LOT of regex
      • Found some neat bugs!
      • Francesc had been working with tools that made what I was doing more accessible.
  12. How do you find bugs or API (mis)usage?
      • Across microservice code-bases
      • In all consumers of your library?
      • Accurately?
  13. Previously: ugly regular expressions

  15. And: really ugly regular expressions

  16. Obviously, this is No Good ™ How can we do

  17. What's this talk actually about?

  18. Your task: find the most common string literal in your source code
  19. What’s a string? It depends:
      'single quotes'
      "double quotes"
      '''triple single quotes'''
      """triple double quotes"""
      `back quotes`
      «Please»
      „make it“
      qq§ STOP! §
  20. https://xkcd.com/1171/

  21. Oh, also prepare to escape your escaping characters ...

  22. https://xkcd.com/1638/

  23. Well, that’s a complicated* regular expression. *Impossible without recursive regular expressions. Recursive regular expressions?
  24. Recursive regular expressions?

  25. Well, we can parse the repositories as AST (Abstract Syntax

  26. Universal Abstract Syntax Trees
      Source: package main; import "fmt"; func main() { fmt.Println("Hello, gophers") }
      Tree labels: File, Declaration, Import “fmt”, Declaration, FunctionGroup, main, Body, go:CallExpr, Fun (fmt, Println), Args (“Hello, gophers”)
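For a single Go file, the string-literal task can be sketched with the standard library's own parser. This uses Go's native AST (`go/ast`), not the UAST format from the slides, so it only handles Go; the function name `countStringLiterals` and the sample constant `helloSrc` are made up for illustration.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
)

// countStringLiterals parses one Go source file and counts how often each
// string literal value occurs. Import paths are string literals too.
func countStringLiterals(src string) (map[string]int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, 0)
	if err != nil {
		return nil, err
	}
	counts := make(map[string]int)
	ast.Inspect(file, func(n ast.Node) bool {
		lit, ok := n.(*ast.BasicLit)
		if !ok || lit.Kind != token.STRING {
			return true
		}
		// Unquote handles both "interpreted" and `raw` string literals.
		if s, err := strconv.Unquote(lit.Value); err == nil {
			counts[s]++
		}
		return true
	})
	return counts, nil
}

// helloSrc is the hello-world program from the slide.
const helloSrc = `package main

import "fmt"

func main() { fmt.Println("Hello, gophers") }
`

func main() {
	counts, err := countStringLiterals(helloSrc)
	if err != nil {
		panic(err)
	}
	for s, n := range counts {
		fmt.Printf("%q: %d\n", s, n)
	}
}
```

The parser already knows where strings start and end, so no escaping rules need to be re-implemented by hand, which is the point the regex slides make.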
  27. Universal Abstract Syntax Trees
      Source: print('''Hello, Pythonistas''')
      Tree labels: File, CallExpr, Fun (print), Args (“Hello, Pythonistas”)
  28. A single tree format for all languages: a single tool for all
  29. A single label set: a single concept list

  30. Universal Abstract Syntax Trees
      XPath: //uast:String
      Tree labels: File, Declaration, Import “fmt”, Declaration, FunctionGroup, main, Body, go:CallExpr, Fun (fmt, Println), Args (“Hello, gophers”)
      uast:String matches: “fmt”, “Hello, gophers”
  31. Universal Abstract Syntax Trees
      Source: '''Hello, Pythonistas'''
      XPath: //uast:String
      Tree labels: File, CallExpr, Fun (print), Args
      uast:String match: “Hello, Pythonistas”
  32. Let’s run it!

  33. While this is running ...

  34. So, how are ASTs relevant here?
      - We downloaded ~19GB of Go repositories from GitHub*
      - Parsed them as UAST with src-d/engine
      - Queried the generated databases with SQL + UAST extensions
      https://github.com/src-d/engine
  35. OK, so: what are some of the things we can investigate?
      - Finding ‘bad’ code
      - Best practices and idioms
      - Usage analysis of APIs
  36. Investigation #1: Finding 'bad' code

  37. "Bad crypto"
      - One of my favorite topics
      - How is a non-expert supposed to know that math/rand is bad vs. crypto/rand?
      - What about hash functions vs. KDFs?
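One rough way to surface candidates for this kind of review is to flag files that import `math/rand` at all. The sketch below does that for a single file with `go/parser`. An import alone does not prove misuse, so this is a first-pass filter under that assumption, not the query used in the talk. The helper `importsPackage` and both sample sources are hypothetical.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
)

// importsPackage reports whether the Go source imports the exact path pkg.
// Only the import block is parsed, so function bodies are never inspected.
func importsPackage(src, pkg string) (bool, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, parser.ImportsOnly)
	if err != nil {
		return false, err
	}
	for _, imp := range file.Imports {
		path, err := strconv.Unquote(imp.Path.Value)
		if err != nil {
			continue
		}
		if path == pkg {
			return true, nil
		}
	}
	return false, nil
}

// weakTokenSrc is a made-up example that builds a token from math/rand.
const weakTokenSrc = `package token

import (
	"encoding/hex"
	"math/rand"
)

func NewToken() string {
	b := make([]byte, 16)
	rand.Read(b)
	return hex.EncodeToString(b)
}
`

// safeTokenSrc is the same made-up example using crypto/rand instead.
const safeTokenSrc = `package token

import (
	"crypto/rand"
	"encoding/hex"
)

func NewToken() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}
`

func main() {
	for name, src := range map[string]string{"weak": weakTokenSrc, "safe": safeTokenSrc} {
		flagged, err := importsPackage(src, "math/rand")
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s imports math/rand: %v\n", name, flagged)
	}
}
```

Because both packages are conventionally referred to as `rand` inside the file, the import path is the only reliable syntactic signal; the call sites look identical.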
  40. Investigation #2: Best practices & idioms

  45. Computer says ...

  47. Number of init functions per file: mean 0.12
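A per-file count like this can be reproduced for one file with `go/ast`. The sketch below is a minimal single-file version, assuming the metric counts only top-level `func init()` declarations and not methods that happen to be named `init`; the names `countInits` and `initSrc` are illustrative.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// countInits returns the number of package-level init functions in one file.
// Go allows several init functions in the same file.
func countInits(src string) (int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, 0)
	if err != nil {
		return 0, err
	}
	n := 0
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		// Recv == nil excludes methods that are merely named init.
		if ok && fn.Recv == nil && fn.Name.Name == "init" {
			n++
		}
	}
	return n, nil
}

// initSrc is a made-up file with two initializers and one init method.
const initSrc = `package demo

var registry = map[string]int{}

func init() { registry["a"] = 1 }

func init() { registry["b"] = 2 }

type T struct{}

func (T) init() {}
`

func main() {
	n, err := countInits(initSrc)
	if err != nil {
		panic(err)
	}
	fmt.Println("init functions:", n)
}
```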

  49. Number of init functions per file (0 to 10)

  50. Number of init functions per file (log scale). 352 init functions LOL
  51. Smells like vendoring ...

  52. github.com/vmware/govmomi/vim25/types/enum.go

  54. Applied the scientific method to best practices
      1. Make an observation.
      2. Ask a question.
      3. Form a hypothesis, or testable explanation.
      4. Make a prediction based on the hypothesis.
      5. Test the prediction.
      6. Iterate: use the results to make new hypotheses or predictions.
  55. Investigation #3: API usage & breaking them

  56. Premise: You maintain a popular OSS library.
      Problem: You're thinking about deprecating a method, but want to quantify the impact. How do you accurately find these cases?
  57. Actual problem:
      • gorilla/context predates net/http's Request.Context() implementation.
      • Using both causes a memory leak, due to "islanding" the pointer to the original *Request.
      • Can we find these users & help?!
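A purely syntactic approximation of "using both" is a file that imports `github.com/gorilla/context` and also calls some method named `WithContext`. The sketch below checks that for one file. It has no type information, so it cannot confirm the receiver is an `*http.Request`; treat hits as candidates to review. `mixesContexts` and `handlerSrc` are hypothetical names, and this is not the query from the talk.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
)

// mixesContexts reports whether a file imports gorilla/context and also
// contains a call of the form x.WithContext(...).
func mixesContexts(src string) (bool, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, 0)
	if err != nil {
		return false, err
	}
	importsGorilla := false
	for _, imp := range file.Imports {
		path, err := strconv.Unquote(imp.Path.Value)
		if err == nil && path == "github.com/gorilla/context" {
			importsGorilla = true
		}
	}
	if !importsGorilla {
		return false, nil
	}
	callsWithContext := false
	ast.Inspect(file, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		if sel, ok := call.Fun.(*ast.SelectorExpr); ok && sel.Sel.Name == "WithContext" {
			callsWithContext = true
		}
		return true
	})
	return callsWithContext, nil
}

// handlerSrc is a made-up handler that stores data both ways.
const handlerSrc = `package web

import (
	stdctx "context"
	"net/http"

	"github.com/gorilla/context"
)

type key int

const userKey key = 0

func handler(w http.ResponseWriter, r *http.Request) {
	context.Set(r, userKey, "alice")
	r = r.WithContext(stdctx.WithValue(r.Context(), userKey, "alice"))
	_ = r
}
`

func main() {
	mixed, err := mixesContexts(handlerSrc)
	if err != nil {
		panic(err)
	}
	fmt.Println("mixes gorilla/context with WithContext:", mixed)
}
```

A type-aware version would need the file's dependencies available so the receiver type can be resolved; the syntactic check trades precision for being runnable on any file in isolation.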
  58. The Problem

  59. Users!

  60. Compared to...

  61. What else?

  62. • Order your godoc by most-used identifiers - rather than a bunch of ErrSomething at the top
      • Identify dependency version usage across your org, or as a maintainer.
      • Smart auto-completion based on previous usages of the API.
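Ranking identifiers by use boils down to counting `pkg.Name` selector expressions across consumer code. Here is a minimal single-file sketch, assuming the package is referred to by one known name and ignoring shadowing (a local variable with the same name would be miscounted). `countPackageUses`, `rankByUse`, and `usageSrc` are hypothetical.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"sort"
)

// countPackageUses counts selector expressions of the form pkgName.Ident.
func countPackageUses(src, pkgName string) (map[string]int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, 0)
	if err != nil {
		return nil, err
	}
	counts := make(map[string]int)
	ast.Inspect(file, func(n ast.Node) bool {
		sel, ok := n.(*ast.SelectorExpr)
		if !ok {
			return true
		}
		if x, ok := sel.X.(*ast.Ident); ok && x.Name == pkgName {
			counts[sel.Sel.Name]++
		}
		return true
	})
	return counts, nil
}

// rankByUse orders identifiers by descending count, then alphabetically.
func rankByUse(counts map[string]int) []string {
	names := make([]string, 0, len(counts))
	for name := range counts {
		names = append(names, name)
	}
	sort.Slice(names, func(i, j int) bool {
		if counts[names[i]] != counts[names[j]] {
			return counts[names[i]] > counts[names[j]]
		}
		return names[i] < names[j]
	})
	return names
}

// usageSrc is a made-up consumer of the strings package.
const usageSrc = `package demo

import "strings"

func clean(line string) []string {
	line = strings.TrimSpace(line)
	parts := strings.Split(line, ",")
	for i := range parts {
		parts[i] = strings.TrimSpace(parts[i])
	}
	return parts
}
`

func main() {
	counts, err := countPackageUses(usageSrc, "strings")
	if err != nil {
		panic(err)
	}
	for _, name := range rankByUse(counts) {
		fmt.Printf("%s: %d\n", name, counts[name])
	}
}
```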
  65. Tools we used and where to find them

  66. Tools:
      • Google BigQuery
      • source{d} engine: github.com/src-d/engine
      • A lot of RAM, CPUs, and time
      Resources:
      • Finding Bugs with BigQuery & GitHub: bit.ly/gosf-bq
      • Analyzing Go code with BigQuery: bit.ly/go-bq
  67. So … what about those strings?

  69. Most common strings in Go
      #1 with 1,260,374 occurrences: ''
      #2 with 456,480 occurrences: 'fmt'
      #3 with 337,064 occurrences: 'json:"-"'
  70. Other languages
      Most common strings in Python: 'automanaged' : 25,446; '//visibility:public' : 21,962; 'go_library' : 16,485
      Most common strings in Ruby: '\n' : 658; '' : 551; 'shell' : 363
      Most common strings in Java: '' : 613; '\n' : 392; '0' : 293
      Most common strings in PHP: 'strict_param' : 1; 'strict' : 1; 'short_array_syntax' : 1
  71. Thanks! @elithrar @francesc