Upgrade to Pro — share decks privately, control downloads, hide ads and more …

What can you learn about thousands of Groovy source files in Github using BigQuery

What can you learn about thousands of Groovy source files in Github using BigQuery

Github and Google recently released a huge dataset on BigQuery about all the Open Source code contained on Github, thanks to the data from the Github Archive project. What can you learn from such a dataset? What about learning about the most frequently used dependencies and imports? Which version of Gradle is the most widespread? What are the most frequent test libraries used?

Guillaume will show you some concrete examples from the dataset, with examples applied to the Apache Groovy language, to learn more about our programming habits. And figure out on your own about your favorite language, web framework and build tool usage patterns!

Guillaume Laforge

November 22, 2016
Tweet

More Decks by Guillaume Laforge

Other Decks in Technology

Transcript

  1. What can you learn from thousands of source files in

    Github? Guillaume Laforge Developer Advocate Google Cloud Platform Apache Groovy PMC chair @glaforge Groovy using BigQuery G3 Summit 2016
  2. The numbers? HUGE 3TB+ of data over 2.8 million repositories

    with 145 million unique commits 2 billion file paths and the contents of the latest revision of 163 million files
  3. What’s BigQuery? “BigQuery is Google's fully managed, petabyte scale, low

    cost enterprise data warehouse for analytics. BigQuery is serverless. There is no infrastructure to manage and you don't need a database administrator, so you can focus on analyzing data to find meaningful insights using familiar SQL. ” Source: cloud.google.com/bigquery
  4. BigQuery is… Dremel Google published a research paper on “Dremel:

    Interactive Analysis of Web-scale Datasets”. Dremel in production since 2008, and BigQuery since 2012, processing exabytes every month! Source: research.google.com/pubs/pub36632.html
  5. Apache Groovy “A multi-faceted language for the Java platform Apache

    Groovy is a powerful, optionally typed and dynamic language, with static-typing and static compilation capabilities, for the Java platform aimed at improving developer productivity thanks to a concise, familiar and easy to learn syntax. It integrates smoothly with any Java program, and immediately delivers to your application powerful features, including scripting capabilities, Domain-Specific Language authoring, runtime and compile-time meta-programming and functional programming.” Source: www.groovy-lang.org
  6. Analyzing Groovy source code What can we learn from all

    the Groovy files in those million repositories on Github? ➔ How many Groovy files are there on Github? ➔ What are the most popular Groovy file names? ➔ How many lines of Groovy source code are there? ➔ What's the distribution of size of source files? ➔ What are the most frequent imported packages? ➔ What are the most popular Groovy APIs used? ➔ What are the most used AST transformations? ➔ Do people use import aliases much? ➔ Did developers adopt traits?
  7. A bit of setup... SELECT * FROM [bigquery-public-data:github_repos.files] WHERE RIGHT(path,

    7) = '.groovy' SELECT * FROM [bigquery-public-data:github_repos.contents] WHERE id IN (SELECT id FROM [github.files])
  8. A bit of setup... SELECT * FROM [bigquery-public-data:github_repos.files] WHERE RIGHT(path,

    7) = '.gradle' SELECT * FROM [bigquery-public-data:github_repos.contents] WHERE id IN (SELECT id FROM [github.gradle_build_files])
  9. A bit of setup... SELECT * FROM [bigquery-public-data:github_repos.contents] WHERE id

    IN ( SELECT id FROM [bigquery-public-data:github_repos.files] WHERE path LIKE '%gradle-wrapper.properties' )
  10. How many Groovy files are there on Github? SELECT COUNT(*)

    FROM [github-groovy-files:github.files] 743,070
  11. What are the most frequent Groovy file names? SELECT TOP(f,

    20) AS filename, COUNT(*) AS size FROM ( SELECT LAST(SPLIT(path, '/')) AS f FROM [github-groovy-files:github.files] )
  12. How many lines of Groovy source code are there? SELECT

    COUNT(line) AS total_lines FROM ( SELECT SPLIT(content, '\n') AS line FROM [github-groovy-files:github.contents] ) 16,464,376
  13. What’s the distribution of size of source files? SELECT QUANTILES(total_lines,

    11) AS lines FROM ( SELECT COUNT(line) AS total_lines FROM ( SELECT SPLIT(content, '\n') AS line, id FROM [github-groovy-files:github.contents] ) GROUP BY id ) 10% < 10 lines 20% < 16 lines 30% < 24 lines 40% < 33 lines 50% < 43 lines 60 % < 54 lines 70% < 72 lines 80% < 101 lines 90% < 162 lines 100% < 9506 lines
  14. What are the most frequently imported packages? SELECT package, COUNT(*)

    AS count FROM ( SELECT REGEXP_EXTRACT(line, r' ([a-z0-9\._]*)\.') AS package, id FROM ( SELECT SPLIT(content, '\n') AS line, id FROM [github-groovy-files:github.contents] WHERE content CONTAINS 'import' HAVING LEFT(line, 6)='import' ) GROUP BY package, id ) GROUP BY 1 ORDER BY count DESC LIMIT 14
  15. What are the most popular Groovy APIs used? SELECT package,

    COUNT(*) AS count FROM ( SELECT REGEXP_EXTRACT(line, r' ([a-z0-9\._]*)\.') AS package, id FROM ( SELECT SPLIT(content, '\n') AS line, id FROM [github-groovy-files:github.contents] WHERE content CONTAINS 'import' HAVING LEFT(line, 6)='import' ) GROUP BY package, id ) WHERE package LIKE 'groovy.%' GROUP BY 1 ORDER BY count DESC LIMIT 14
  16. What are the most used AST transformations? SELECT TOP(class_name, 10)

    AS class_name, COUNT(*) AS count FROM ( SELECT REGEXP_EXTRACT(line, r' [a-z0-9\._]*\.([a-zA-Z0-9_]*)') AS class_name, id FROM ( SELECT SPLIT(content, '\n') AS line, id FROM [github-groovy-files:github.contents] WHERE content CONTAINS 'import' ) WHERE line LIKE '%groovy.transform.%' GROUP BY class_name, id ) WHERE class_name != 'null'
  17. Do people use aliased imports much? SELECT aliased, count(aliased) AS

    total FROM ( SELECT REGEXP_MATCH(line, r'.* (as) .*') AS aliased FROM ( SELECT SPLIT(content, '\n') AS line FROM [github-groovy-files:github.contents] ) WHERE line CONTAINS 'import ' ) GROUP BY aliased
  18. Did developers adopt traits? SELECT COUNT(*) FROM ( SELECT SPLIT(content,

    '\n') AS line FROM [github-groovy-files:github.contents] ) WHERE line CONTAINS 'trait ' 1,698
  19. Analyzing Gradle build files What can we learn from all

    the Gradle build files in those million repositories on Github? ➔ How many Gradle build files are there? ➔ How many Maven build files are there? ➔ Which versions of Gradle are being used? ➔ How many of those Gradle files are settings files? ➔ What are the most frequent build file names? ➔ What are the most frequent Gradle plugins? ➔ What are the most frequent “compile” and “test” dependencies?
  20. How many Gradle build files? SELECT COUNT(*) as count FROM

    [github-groovy-files:github.gradle_build_files] 488,311
  21. Which versions of Gradle are being used? SELECT version, COUNT(version)

    AS count FROM ( SELECT REGEXP_EXTRACT(line, r'gradle-(.*)-all.zip') AS version FROM ( SELECT SPLIT(content, '\n') AS line FROM [github-groovy-files:github.gradle_wrapper_properties_files] ) WHERE line LIKE 'distributionUrl%' ) GROUP BY version ORDER BY count DESC
  22. How many of those Gradle files are settings files? SELECT

    COUNT(*) as count FROM [github-groovy-files:github.gradle_build_files] WHERE path LIKE '%settings.gradle' 102,433
  23. What are the most frequent Gradle build file names? SELECT

    f, COUNT(f) as count FROM ( SELECT LAST(SPLIT(path, '/')) AS f FROM [github-groovy-files:github.gradle_build_files] ) GROUP BY f ORDER BY count DESC
  24. What are the most frequently used Gradle plugins? SELECT plugin,

    COUNT(plugin) AS count FROM ( SELECT REGEXP_EXTRACT(line, r'apply plugin: (?:\'|\")(.*)(?:\'|\")') AS plugin FROM ( SELECT SPLIT(content, '\n') AS line FROM [github-groovy-files:github.gradle_build_contents] ) ) GROUP BY plugin ORDER BY count DESC
  25. What are the most frequently used “id” plugins? SELECT newplugin,

    COUNT(newplugin) AS count FROM ( SELECT REGEXP_EXTRACT(line, r'id (?:\'|\")(.*)(?:\'|\") version') AS newplugin FROM ( SELECT SPLIT(content, '\n') AS line FROM [github-groovy-files:github.gradle_build_contents] ) ) GROUP BY newplugin ORDER BY count DESC
  26. What are the most frequent “compile” dependencies? SELECT dep, COUNT(dep)

    AS count FROM ( SELECT REGEXP_EXTRACT(line, r'compile(?: |\()(?:\'|\")(.*):') AS dep FROM ( SELECT SPLIT(content, '\n') AS line FROM [github-groovy-files:github.gradle_build_contents] ) ) GROUP BY dep ORDER BY count DESC
  27. What are the most frequent “ test compile” dependencies? SELECT

    dep, COUNT(dep) AS count FROM ( SELECT REGEXP_EXTRACT(line, r'testCompile(?: |\()(?:\'|\")(.*):') AS dep FROM ( SELECT SPLIT(content, '\n') AS line FROM [github-groovy-files:github.gradle_build_contents] ) ) GROUP BY dep ORDER BY count DESC
  28. Analyzing Grails apps What can we learn from all the

    Grails apps in those million repositories on Github? ➔ What are the most used SQL database used? ➔ What are the most frequent controller names? ➔ What are the repositories with the biggest number of controllers? ➔ What is the distribution of number of controllers?
  29. What are the most used SQL database drivers used in

    Grails apps? SELECT driver, COUNT(*) AS count FROM ( SELECT REGEXP_EXTRACT(line, r'\s*driverClassName\s*=\s*(?:\"|\')(.*)(?:\"|\')\s*') AS driver, id FROM ( SELECT SPLIT(content, '\n') AS line, id FROM [github-groovy-files:github.contents] WHERE id IN ( SELECT id FROM [github-groovy-files:github.files] WHERE path LIKE '%DataSource.groovy' ) ) GROUP BY driver, id ) GROUP BY driver ORDER BY count DESC
  30. What are the most frequent controller names? SELECT ctrlName, COUNT(ctrlName)

    AS count FROM ( SELECT REGEXP_EXTRACT(path, r'.*/controllers/(\w*)Controller.groovy') AS ctrlName FROM [github-groovy-files:github.files] WHERE path LIKE '%Controller.groovy' ) GROUP BY ctrlName ORDER BY count DESC
  31. What are the repositories with the biggest number of controllers?

    SELECT repo_name, COUNT(path) AS count FROM ( SELECT path, repo_name FROM [github-groovy-files:github.files] WHERE path LIKE '%/controllers/%Controller.groovy' ) GROUP BY repo_name ORDER BY count DESC
  32. What is the distribution of number of controllers? SELECT QUANTILES(count,

    11) AS n FROM ( SELECT repo_name, COUNT(path) AS count FROM ( SELECT path, repo_name FROM [github-groovy-files:github.files] WHERE path LIKE '%/controllers/%Controller.groovy' ) GROUP BY repo_name ORDER BY count DESC ) 30% < 1 ctrl 40% < 2 ctrl 50% < 3 ctrl 60% < 4 ctrl 70% < 6 ctrl 80% < 11 ctrl 90% < 19 ctrl 100% < 136 ctrl
  33. References Some reading for diving more into the Github dataset

    and Google BigQuery ➔ Announcements http://bit.ly/gh-bq-dataset http://bit.ly/gcp-gh-ann ➔ Tabs vs Spaces! http://bit.ly/gh-tabspace ➔ Analyzing Groovy code http://bit.ly/gh-groovy-code ➔ More analyzing http://bit.ly/gh-analysis Spoiler: Spaces win! (except in Go)