Slide 1

Slide 1 text

What can you learn from thousands of source files in GitHub?
Guillaume Laforge
Developer Advocate, Google Cloud Platform
Apache Groovy PMC chair
@glaforge
Groovy using BigQuery
G3 Summit 2016

Slide 2

Slide 2 text

GitHub dataset
Release of the GitHub archive on Google BigQuery

Slide 3

Slide 3 text

The numbers? HUGE
3TB+ of data
over 2.8 million repositories with 145 million unique commits
2 billion file paths
and the contents of the latest revision of 163 million files

Slide 4

Slide 4 text

What’s BigQuery?
“BigQuery is Google's fully managed, petabyte scale, low cost enterprise data warehouse for analytics. BigQuery is serverless. There is no infrastructure to manage and you don't need a database administrator, so you can focus on analyzing data to find meaningful insights using familiar SQL.”
Source: cloud.google.com/bigquery

Slide 5

Slide 5 text

BigQuery is… Dremel
Google published a research paper on “Dremel: Interactive Analysis of Web-scale Datasets”.
Dremel has been in production since 2008, and BigQuery since 2012, processing exabytes every month!
Source: research.google.com/pubs/pub36632.html

Slide 6

Slide 6 text

Apache Groovy
“A multi-faceted language for the Java platform
Apache Groovy is a powerful, optionally typed and dynamic language, with static-typing and static compilation capabilities, for the Java platform aimed at improving developer productivity thanks to a concise, familiar and easy to learn syntax. It integrates smoothly with any Java program, and immediately delivers to your application powerful features, including scripting capabilities, Domain-Specific Language authoring, runtime and compile-time meta-programming and functional programming.”
Source: www.groovy-lang.org

Slide 7

Slide 7 text

Analyzing Groovy source code
What can we learn from all the Groovy files in those millions of repositories on GitHub?
➔ How many Groovy files are there on GitHub?
➔ What are the most popular Groovy file names?
➔ How many lines of Groovy source code are there?
➔ What's the distribution of size of source files?
➔ What are the most frequently imported packages?
➔ What are the most popular Groovy APIs used?
➔ What are the most used AST transformations?
➔ Do people use import aliases much?
➔ Did developers adopt traits?

Slide 8

Slide 8 text

DEMO

Slide 9

Slide 9 text

A bit of setup...

    SELECT *
    FROM [bigquery-public-data:github_repos.files]
    WHERE RIGHT(path, 7) = '.groovy'

    SELECT *
    FROM [bigquery-public-data:github_repos.contents]
    WHERE id IN (SELECT id FROM [github.files])

Slide 10

Slide 10 text

A bit of setup...

    SELECT *
    FROM [bigquery-public-data:github_repos.files]
    WHERE RIGHT(path, 7) = '.gradle'

    SELECT *
    FROM [bigquery-public-data:github_repos.contents]
    WHERE id IN (SELECT id FROM [github.gradle_build_files])

Slide 11

Slide 11 text

A bit of setup...

    SELECT *
    FROM [bigquery-public-data:github_repos.contents]
    WHERE id IN (
      SELECT id
      FROM [bigquery-public-data:github_repos.files]
      WHERE path LIKE '%gradle-wrapper.properties'
    )

Slide 12

Slide 12 text

Let’s Get Groovy!

Slide 13

Slide 13 text

How many Groovy files are there on GitHub?

    SELECT COUNT(*)
    FROM [github-groovy-files:github.files]

743,070

Slide 14

Slide 14 text

What are the most frequent Groovy file names?

    SELECT TOP(f, 20) AS filename, COUNT(*) AS size
    FROM (
      SELECT LAST(SPLIT(path, '/')) AS f
      FROM [github-groovy-files:github.files]
    )
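
For readers who want to try the idea without BigQuery, the query on this slide can be approximated locally. This is only a sketch in Python, not the author's method: the function name `top_file_names` and the sample paths are hypothetical, and it assumes the paths are already available as plain strings.

```python
from collections import Counter

def top_file_names(paths, n=20):
    """Count the last path segment of each path and return the n most common."""
    # Mirrors LAST(SPLIT(path, '/')): keep only the file name.
    names = (path.split("/")[-1] for path in paths)
    return Counter(names).most_common(n)

# Hypothetical sample paths, not taken from the dataset.
sample_paths = [
    "app/src/main/groovy/Foo.groovy",
    "lib/build/Foo.groovy",
    "scripts/Bar.groovy",
]
print(top_file_names(sample_paths, n=2))
```

The same function works for Gradle build file names, since it only looks at the last path segment.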

Slide 15

Slide 15 text

What are the most frequent Groovy file names?

Slide 16

Slide 16 text

How many lines of Groovy source code are there?

    SELECT COUNT(line) AS total_lines
    FROM (
      SELECT SPLIT(content, '\n') AS line
      FROM [github-groovy-files:github.contents]
    )

16,464,376

Slide 17

Slide 17 text

What’s the distribution of size of source files?

    SELECT QUANTILES(total_lines, 11) AS lines
    FROM (
      SELECT COUNT(line) AS total_lines
      FROM (
        SELECT SPLIT(content, '\n') AS line, id
        FROM [github-groovy-files:github.contents]
      )
      GROUP BY id
    )

10% < 10 lines
20% < 16 lines
30% < 24 lines
40% < 33 lines
50% < 43 lines
60% < 54 lines
70% < 72 lines
80% < 101 lines
90% < 162 lines
100% < 9506 lines
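
A rough local equivalent of this decile computation, as a Python sketch. It assumes file contents are available as strings; `line_count_deciles` and the sample data are hypothetical. `statistics.quantiles` computes exact cut points on the sample and returns only the nine interior ones, so its output is not expected to match BigQuery's `QUANTILES` values exactly.

```python
from statistics import quantiles

def line_count_deciles(contents):
    """Return the nine cut points that split per-file line counts into ten groups."""
    # One line count per file, like COUNT(line) ... GROUP BY id.
    line_counts = [len(content.split("\n")) for content in contents]
    return quantiles(line_counts, n=10, method="inclusive")

# Hypothetical sample: eleven files with 1, 2, ..., 11 lines each.
sample_contents = ["\n".join(["println 'x'"] * n) for n in range(1, 12)]
deciles = line_count_deciles(sample_contents)
print(deciles)
```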

Slide 18

Slide 18 text

What are the most frequently imported packages?

    SELECT package, COUNT(*) AS count
    FROM (
      SELECT REGEXP_EXTRACT(line, r' ([a-z0-9\._]*)\.') AS package, id
      FROM (
        SELECT SPLIT(content, '\n') AS line, id
        FROM [github-groovy-files:github.contents]
        WHERE content CONTAINS 'import'
        HAVING LEFT(line, 6)='import'
      )
      GROUP BY package, id
    )
    GROUP BY 1
    ORDER BY count DESC
    LIMIT 14
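
To see what the regular expression on this slide captures, here is a small Python sketch that applies the same pattern to import lines and counts each package once per file. The function name and sample files are hypothetical. Unlike the query, lines where the pattern finds nothing are skipped rather than grouped under a null package.

```python
import re
from collections import Counter

# Same pattern as on the slide: lowercase package segments up to the last dot.
PACKAGE_RE = re.compile(r' ([a-z0-9\._]*)\.')

def count_imported_packages(files):
    """Count, per package, the number of files importing it at least once."""
    counts = Counter()
    for content in files.values():
        packages = set()  # one vote per file, like GROUP BY package, id
        for line in content.split("\n"):
            if line[:6] == "import":  # like LEFT(line, 6) = 'import'
                match = PACKAGE_RE.search(line)
                if match:
                    packages.add(match.group(1))
        counts.update(packages)
    return counts

# Hypothetical sample files keyed by an id.
sample_files = {
    "a": "import java.util.List\nimport java.util.Map\n"
         "import groovy.transform.CompileStatic\nclass A {}",
    "b": "import java.util.*\nprintln 'hi'",
}
package_counts = count_imported_packages(sample_files)
print(package_counts.most_common())
```

Filtering the result to keys starting with `groovy.` gives the local counterpart of the "most popular Groovy APIs" variant.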

Slide 19

Slide 19 text

What are the most frequently imported packages?

Slide 20

Slide 20 text

What are the most popular Groovy APIs used?

    SELECT package, COUNT(*) AS count
    FROM (
      SELECT REGEXP_EXTRACT(line, r' ([a-z0-9\._]*)\.') AS package, id
      FROM (
        SELECT SPLIT(content, '\n') AS line, id
        FROM [github-groovy-files:github.contents]
        WHERE content CONTAINS 'import'
        HAVING LEFT(line, 6)='import'
      )
      GROUP BY package, id
    )
    WHERE package LIKE 'groovy.%'
    GROUP BY 1
    ORDER BY count DESC
    LIMIT 14

Slide 21

Slide 21 text

What are the most popular Groovy APIs used?

Slide 22

Slide 22 text

What are the most used AST transformations?

    SELECT TOP(class_name, 10) AS class_name, COUNT(*) AS count
    FROM (
      SELECT REGEXP_EXTRACT(line, r' [a-z0-9\._]*\.([a-zA-Z0-9_]*)') AS class_name, id
      FROM (
        SELECT SPLIT(content, '\n') AS line, id
        FROM [github-groovy-files:github.contents]
        WHERE content CONTAINS 'import'
      )
      WHERE line LIKE '%groovy.transform.%'
      GROUP BY class_name, id
    )
    WHERE class_name != 'null'

Slide 23

Slide 23 text

What are the most used AST transformations?

Slide 24

Slide 24 text

Do people use aliased imports much?

    SELECT aliased, count(aliased) AS total
    FROM (
      SELECT REGEXP_MATCH(line, r'.* (as) .*') AS aliased
      FROM (
        SELECT SPLIT(content, '\n') AS line
        FROM [github-groovy-files:github.contents]
      )
      WHERE line CONTAINS 'import '
    )
    GROUP BY aliased
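
The aliased/non-aliased split can be tried locally with the same pattern. A minimal Python sketch, assuming the lines are already split out; `count_aliased_imports` and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Same pattern as on the slide: an " as " somewhere in the line.
ALIAS_RE = re.compile(r'.* (as) .*')

def count_aliased_imports(lines):
    """Count import lines by whether they contain an ' as ' alias."""
    counts = Counter()
    for line in lines:
        if "import " in line:  # like WHERE line CONTAINS 'import '
            counts[bool(ALIAS_RE.search(line))] += 1
    return counts

# Hypothetical sample lines.
sample_lines = [
    "import java.util.List as JList",
    "import groovy.json.JsonSlurper",
    "println 'as is'",
]
alias_counts = count_aliased_imports(sample_lines)
print(alias_counts)
```

Note that any line containing both `import ` and ` as ` counts as aliased here, including comments, so the figure is an approximation.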

Slide 25

Slide 25 text

Do people use aliased imports much?

Slide 26

Slide 26 text

Did developers adopt traits?

    SELECT COUNT(*)
    FROM (
      SELECT SPLIT(content, '\n') AS line
      FROM [github-groovy-files:github.contents]
    )
    WHERE line CONTAINS 'trait '

1,698

Slide 27

Slide 27 text

What about Gradle build files?

Slide 28

Slide 28 text

Analyzing Gradle build files
What can we learn from all the Gradle build files in those millions of repositories on GitHub?
➔ How many Gradle build files are there?
➔ How many Maven build files are there?
➔ Which versions of Gradle are being used?
➔ How many of those Gradle files are settings files?
➔ What are the most frequent build file names?
➔ What are the most frequent Gradle plugins?
➔ What are the most frequent “compile” and “test” dependencies?

Slide 29

Slide 29 text

How many Gradle build files?

    SELECT COUNT(*) as count
    FROM [github-groovy-files:github.gradle_build_files]

488,311

Slide 30

Slide 30 text

How many Maven build files?

    SELECT count(*)
    FROM [bigquery-public-data:github_repos.files]
    WHERE path LIKE '%pom.xml'

1,009,745

Slide 31

Slide 31 text

Which versions of Gradle are being used?

    SELECT version, COUNT(version) AS count
    FROM (
      SELECT REGEXP_EXTRACT(line, r'gradle-(.*)-all.zip') AS version
      FROM (
        SELECT SPLIT(content, '\n') AS line
        FROM [github-groovy-files:github.gradle_wrapper_properties_files]
      )
      WHERE line LIKE 'distributionUrl%'
    )
    GROUP BY version
    ORDER BY count DESC
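
The version extraction can be checked on a few wrapper files with a short Python sketch. The function name and the sample property files are hypothetical. As in the query, the pattern only matches `-all.zip` distributions, so wrappers pointing at a `-bin.zip` distribution are not counted.

```python
import re
from collections import Counter

# Same pattern as on the slide; only "-all.zip" distributions match.
VERSION_RE = re.compile(r'gradle-(.*)-all.zip')

def count_gradle_versions(wrapper_files):
    """Count Gradle versions found in distributionUrl lines."""
    counts = Counter()
    for content in wrapper_files:
        for line in content.split("\n"):
            if line.startswith("distributionUrl"):  # like LIKE 'distributionUrl%'
                match = VERSION_RE.search(line)
                if match:
                    counts[match.group(1)] += 1
    return counts

# Hypothetical gradle-wrapper.properties contents.
base = "distributionBase=GRADLE_USER_HOME\ndistributionUrl=https\\://services.gradle.org/distributions/"
sample_wrappers = [
    base + "gradle-2.14-all.zip",
    base + "gradle-2.14-all.zip",
    base + "gradle-3.0-all.zip",
    base + "gradle-2.10-bin.zip",
]
version_counts = count_gradle_versions(sample_wrappers)
print(version_counts.most_common())
```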

Slide 32

Slide 32 text

Which versions of Gradle are being used?

Slide 33

Slide 33 text

How many of those Gradle files are settings files?

    SELECT COUNT(*) as count
    FROM [github-groovy-files:github.gradle_build_files]
    WHERE path LIKE '%settings.gradle'

102,433

Slide 34

Slide 34 text

What are the most frequent Gradle build file names?

    SELECT f, COUNT(f) as count
    FROM (
      SELECT LAST(SPLIT(path, '/')) AS f
      FROM [github-groovy-files:github.gradle_build_files]
    )
    GROUP BY f
    ORDER BY count DESC

Slide 35

Slide 35 text

What are the most frequent Gradle build file names?

Slide 36

Slide 36 text

What are the most frequently used Gradle plugins?

    SELECT plugin, COUNT(plugin) AS count
    FROM (
      SELECT REGEXP_EXTRACT(line, r'apply plugin: (?:\'|\")(.*)(?:\'|\")') AS plugin
      FROM (
        SELECT SPLIT(content, '\n') AS line
        FROM [github-groovy-files:github.gradle_build_contents]
      )
    )
    GROUP BY plugin
    ORDER BY count DESC
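
A local sketch of the plugin extraction, in Python, reusing the slide's pattern for `apply plugin:` lines with either quote style. `count_applied_plugins` and the sample build files are hypothetical. Counting is per line, as in the query, so a plugin applied twice in one file counts twice.

```python
import re
from collections import Counter

# Same pattern as on the slide: plugin name between single or double quotes.
PLUGIN_RE = re.compile(r'apply plugin: (?:\'|\")(.*)(?:\'|\")')

def count_applied_plugins(build_files):
    """Count plugin names from 'apply plugin:' lines across build files."""
    counts = Counter()
    for content in build_files:
        for line in content.split("\n"):
            match = PLUGIN_RE.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts

# Hypothetical build.gradle contents.
sample_builds = [
    "apply plugin: 'java'\napply plugin: \"groovy\"",
    "apply plugin: 'java'\nrepositories { mavenCentral() }",
]
plugin_counts = count_applied_plugins(sample_builds)
print(plugin_counts.most_common())
```

Because `(.*)` is greedy, a line with extra quoted text after the plugin name would capture more than the name; the sample avoids that case.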

Slide 37

Slide 37 text

What are the most frequently used Gradle plugins?

Slide 38

Slide 38 text

What are the most frequently used “id” plugins?

    SELECT newplugin, COUNT(newplugin) AS count
    FROM (
      SELECT REGEXP_EXTRACT(line, r'id (?:\'|\")(.*)(?:\'|\") version') AS newplugin
      FROM (
        SELECT SPLIT(content, '\n') AS line
        FROM [github-groovy-files:github.gradle_build_contents]
      )
    )
    GROUP BY newplugin
    ORDER BY count DESC

Slide 39

Slide 39 text

What are the most frequently used “id” plugins?

Slide 40

Slide 40 text

What are the most frequent “compile” dependencies?

    SELECT dep, COUNT(dep) AS count
    FROM (
      SELECT REGEXP_EXTRACT(line, r'compile(?: |\()(?:\'|\")(.*):') AS dep
      FROM (
        SELECT SPLIT(content, '\n') AS line
        FROM [github-groovy-files:github.gradle_build_contents]
      )
    )
    GROUP BY dep
    ORDER BY count DESC
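
The dependency pattern keeps everything up to the last colon, which yields `group:artifact` without the version. A Python sketch to illustrate, with the configuration name turned into a parameter so the same code covers `testCompile`; the function name, that parameter, and the sample build files are assumptions made for illustration.

```python
import re
from collections import Counter

def count_dependencies(build_files, configuration="compile"):
    """Count 'group:artifact' strings declared under a Gradle configuration."""
    # Same pattern shape as the slide; the configuration name is a parameter here.
    pattern = re.compile(re.escape(configuration) + r'(?: |\()(?:\'|\")(.*):')
    counts = Counter()
    for content in build_files:
        for line in content.split("\n"):
            match = pattern.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts

# Hypothetical build.gradle contents.
sample_build_files = [
    "dependencies {\n"
    "    compile 'org.codehaus.groovy:groovy-all:2.4.7'\n"
    "    compile(\"com.google.guava:guava:19.0\")\n"
    "    testCompile 'junit:junit:4.12'\n"
    "}",
    "dependencies {\n    compile 'org.codehaus.groovy:groovy-all:2.4.6'\n}",
]
compile_counts = count_dependencies(sample_build_files)
test_counts = count_dependencies(sample_build_files, "testCompile")
print(compile_counts.most_common())
print(test_counts.most_common())
```

The match is case-sensitive, so the lowercase `compile` pattern does not pick up `testCompile` lines.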

Slide 41

Slide 41 text

What are the most frequent “compile” dependencies?

Slide 42

Slide 42 text

What are the most frequent “test compile” dependencies?

    SELECT dep, COUNT(dep) AS count
    FROM (
      SELECT REGEXP_EXTRACT(line, r'testCompile(?: |\()(?:\'|\")(.*):') AS dep
      FROM (
        SELECT SPLIT(content, '\n') AS line
        FROM [github-groovy-files:github.gradle_build_contents]
      )
    )
    GROUP BY dep
    ORDER BY count DESC

Slide 43

Slide 43 text

What are the most frequent “test compile” dependencies?

Slide 44

Slide 44 text

And your Grails apps?

Slide 45

Slide 45 text

Analyzing Grails apps
What can we learn from all the Grails apps in those millions of repositories on GitHub?
➔ What are the most used SQL database drivers?
➔ What are the most frequent controller names?
➔ What are the repositories with the biggest number of controllers?
➔ What is the distribution of number of controllers?

Slide 46

Slide 46 text

What are the most used SQL database drivers in Grails apps?

    SELECT driver, COUNT(*) AS count
    FROM (
      SELECT REGEXP_EXTRACT(line, r'\s*driverClassName\s*=\s*(?:\"|\')(.*)(?:\"|\')\s*') AS driver, id
      FROM (
        SELECT SPLIT(content, '\n') AS line, id
        FROM [github-groovy-files:github.contents]
        WHERE id IN (
          SELECT id
          FROM [github-groovy-files:github.files]
          WHERE path LIKE '%DataSource.groovy'
        )
      )
      GROUP BY driver, id
    )
    GROUP BY driver
    ORDER BY count DESC

Slide 47

Slide 47 text

What are the most used SQL database drivers in Grails apps?

Slide 48

Slide 48 text

What are the most frequent controller names?

    SELECT ctrlName, COUNT(ctrlName) AS count
    FROM (
      SELECT REGEXP_EXTRACT(path, r'.*/controllers/(\w*)Controller.groovy') AS ctrlName
      FROM [github-groovy-files:github.files]
      WHERE path LIKE '%Controller.groovy'
    )
    GROUP BY ctrlName
    ORDER BY count DESC
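
A Python sketch of the controller-name extraction, using the slide's pattern on a handful of hypothetical paths. One thing it makes visible: since `\w` does not match `/`, a controller stored in a package subdirectory under `controllers/` does not match in this sketch, so only controllers sitting directly in that directory are counted here.

```python
import re
from collections import Counter

# Same pattern as on the slide: the name between "/controllers/" and "Controller.groovy".
CONTROLLER_RE = re.compile(r'.*/controllers/(\w*)Controller.groovy')

def count_controller_names(paths):
    """Count controller base names for paths ending in 'Controller.groovy'."""
    counts = Counter()
    for path in paths:
        if path.endswith("Controller.groovy"):  # like LIKE '%Controller.groovy'
            match = CONTROLLER_RE.search(path)
            if match:
                counts[match.group(1)] += 1
    return counts

# Hypothetical sample paths.
sample_controller_paths = [
    "grails-app/controllers/BookController.groovy",
    "web/grails-app/controllers/BookController.groovy",
    "grails-app/controllers/AuthorController.groovy",
    "grails-app/controllers/com/acme/OrderController.groovy",
    "src/main/groovy/Util.groovy",
]
controller_counts = count_controller_names(sample_controller_paths)
print(controller_counts.most_common())
```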

Slide 49

Slide 49 text

What are the most frequent controller names?

Slide 50

Slide 50 text

What are the repositories with the biggest number of controllers?

    SELECT repo_name, COUNT(path) AS count
    FROM (
      SELECT path, repo_name
      FROM [github-groovy-files:github.files]
      WHERE path LIKE '%/controllers/%Controller.groovy'
    )
    GROUP BY repo_name
    ORDER BY count DESC

Slide 51

Slide 51 text

What are the repositories with the biggest number of controllers?

Slide 52

Slide 52 text

What is the distribution of number of controllers?

    SELECT QUANTILES(count, 11) AS n
    FROM (
      SELECT repo_name, COUNT(path) AS count
      FROM (
        SELECT path, repo_name
        FROM [github-groovy-files:github.files]
        WHERE path LIKE '%/controllers/%Controller.groovy'
      )
      GROUP BY repo_name
      ORDER BY count DESC
    )

30% < 1 ctrl
40% < 2 ctrl
50% < 3 ctrl
60% < 4 ctrl
70% < 6 ctrl
80% < 11 ctrl
90% < 19 ctrl
100% < 136 ctrl

Slide 53

Slide 53 text

Your turn to play!

Slide 54

Slide 54 text

References
Some reading for diving deeper into the GitHub dataset and Google BigQuery
➔ Announcements http://bit.ly/gh-bq-dataset http://bit.ly/gcp-gh-ann
➔ Tabs vs Spaces! http://bit.ly/gh-tabspace
➔ Analyzing Groovy code http://bit.ly/gh-groovy-code
➔ More analyzing http://bit.ly/gh-analysis
Spoiler: Spaces win! (except in Go)

Slide 55

Slide 55 text

Thanks for your attention @glaforge

Slide 56

Slide 56 text

APPENDIX