Upgrade to Pro — share decks privately, control downloads, hide ads and more …

What can you learn about thousands of Groovy source files in Github using BigQuery

What can you learn about thousands of Groovy source files in Github using BigQuery

Github and Google recently released a huge dataset on BigQuery about all the Open Source code contained on Github, thanks to the data from the Github Archive project. What can you learn from such a dataset? What about learning about the most frequently used dependencies and imports? Which version of Gradle is the most widespread? What are the most frequent test libraries used?

Guillaume will show you some concrete examples from the dataset, with examples applied to the Apache Groovy language, to learn more about our programming habits. And figure out on your own about your favorite language, web framework and build tool usage patterns!

Guillaume Laforge

November 22, 2016
Tweet

More Decks by Guillaume Laforge

Other Decks in Technology

Transcript

  1. What can you learn
    from thousands of
    source files in Github?
    Guillaume Laforge
    Developer Advocate
    Google Cloud Platform
    Apache Groovy PMC chair
    @glaforge
    Groovy
    using BigQuery
    G3 Summit 2016

    View Slide

  2. Github
    dataset
    Release of the
    Github archive
    on Google BigQuery

    View Slide

  3. The numbers?
    HUGE
    3TB+ of data over
    2.8 million
    repositories with
    145 million
    unique commits
    2 billion file
    paths and the
    contents of the
    latest revision of
    163 million files

    View Slide

  4. What’s BigQuery?
    “BigQuery is Google's fully managed, petabyte scale,
    low cost enterprise data warehouse for analytics.
    BigQuery is serverless. There is no infrastructure to
    manage and you don't need a database administrator,
    so you can focus on analyzing data to find meaningful
    insights using familiar SQL. ”
    Source: cloud.google.com/bigquery

    View Slide

  5. BigQuery is… Dremel
    Google published a research paper on
    “Dremel: Interactive Analysis of
    Web-scale Datasets”.
    Dremel in production since 2008,
    and BigQuery since 2012,
    processing exabytes every month!
    Source: research.google.com/pubs/pub36632.html

    View Slide

  6. Apache Groovy
    “A multi-faceted language for the Java platform
    Apache Groovy is a powerful, optionally typed and
    dynamic language, with static-typing and static
    compilation capabilities, for the Java platform aimed at
    improving developer productivity thanks to a concise,
    familiar and easy to learn syntax. It integrates smoothly
    with any Java program, and immediately delivers to your
    application powerful features, including scripting
    capabilities, Domain-Specific Language authoring,
    runtime and compile-time meta-programming and
    functional programming.”
    Source: www.groovy-lang.org

    View Slide

  7. Analyzing Groovy
    source code
    What can we learn from all the
    Groovy files in those million
    repositories on Github?
    ➔ How many Groovy files are there on Github?
    ➔ What are the most popular Groovy file names?
    ➔ How many lines of Groovy source code are there?
    ➔ What's the distribution of size of source files?
    ➔ What are the most frequent imported packages?
    ➔ What are the most popular Groovy APIs used?
    ➔ What are the most used AST transformations?
    ➔ Do people use import aliases much?
    ➔ Did developers adopt traits?

    View Slide

  8. DEMO

    View Slide

  9. A bit of setup...
    SELECT * FROM [bigquery-public-data:github_repos.files]
    WHERE RIGHT(path, 7) = '.groovy'
    SELECT *
    FROM [bigquery-public-data:github_repos.contents]
    WHERE id IN (SELECT id FROM [github.files])

    View Slide

  10. A bit of setup...
    SELECT * FROM [bigquery-public-data:github_repos.files]
    WHERE RIGHT(path, 7) = '.gradle'
    SELECT *
    FROM [bigquery-public-data:github_repos.contents]
    WHERE id IN (SELECT id FROM [github.gradle_build_files])

    View Slide

  11. A bit of setup...
    SELECT *
    FROM [bigquery-public-data:github_repos.contents]
    WHERE id IN (
    SELECT id
    FROM [bigquery-public-data:github_repos.files]
    WHERE path LIKE '%gradle-wrapper.properties'
    )

    View Slide

  12. Let’s
    Get
    Groovy!

    View Slide

  13. How many Groovy files are there on Github?
    SELECT COUNT(*)
    FROM [github-groovy-files:github.files]
    743,070

    View Slide

  14. What are the most frequent Groovy file names?
    SELECT TOP(f, 20) AS filename, COUNT(*) AS size
    FROM ( SELECT LAST(SPLIT(path, '/')) AS f
    FROM [github-groovy-files:github.files] )

    View Slide

  15. What are the most frequent Groovy file names?

    View Slide

  16. How many lines of Groovy source code are there?
    SELECT COUNT(line) AS total_lines
    FROM ( SELECT SPLIT(content, '\n') AS line
    FROM [github-groovy-files:github.contents] )
    16,464,376

    View Slide

  17. What’s the distribution of size of source files?
    SELECT QUANTILES(total_lines, 11) AS lines
    FROM (
    SELECT COUNT(line) AS total_lines
    FROM (
    SELECT SPLIT(content, '\n') AS line, id
    FROM [github-groovy-files:github.contents]
    )
    GROUP BY id
    )
    10% < 10 lines
    20% < 16 lines
    30% < 24 lines
    40% < 33 lines
    50% < 43 lines
    60 % < 54 lines
    70% < 72 lines
    80% < 101 lines
    90% < 162 lines
    100% < 9506 lines

    View Slide

  18. What are the most frequently imported packages?
    SELECT package, COUNT(*) AS count
    FROM (
    SELECT REGEXP_EXTRACT(line, r' ([a-z0-9\._]*)\.') AS package, id
    FROM (
    SELECT SPLIT(content, '\n') AS line, id
    FROM [github-groovy-files:github.contents]
    WHERE content CONTAINS 'import'
    HAVING LEFT(line, 6)='import'
    )
    GROUP BY package, id
    )
    GROUP BY 1
    ORDER BY count DESC
    LIMIT 14

    View Slide

  19. What are the most frequently imported packages?

    View Slide

  20. What are the most popular Groovy APIs used?
    SELECT package, COUNT(*) AS count
    FROM (
    SELECT REGEXP_EXTRACT(line, r' ([a-z0-9\._]*)\.') AS package, id
    FROM (
    SELECT SPLIT(content, '\n') AS line, id
    FROM [github-groovy-files:github.contents]
    WHERE content CONTAINS 'import'
    HAVING LEFT(line, 6)='import' )
    GROUP BY package, id
    )
    WHERE package LIKE 'groovy.%'
    GROUP BY 1
    ORDER BY count DESC
    LIMIT 14

    View Slide

  21. What are the most popular Groovy APIs used?

    View Slide

  22. What are the most used AST transformations?
    SELECT TOP(class_name, 10) AS class_name, COUNT(*) AS count
    FROM (
    SELECT
    REGEXP_EXTRACT(line, r' [a-z0-9\._]*\.([a-zA-Z0-9_]*)') AS class_name, id
    FROM (
    SELECT SPLIT(content, '\n') AS line, id
    FROM [github-groovy-files:github.contents]
    WHERE content CONTAINS 'import'
    )
    WHERE line LIKE '%groovy.transform.%'
    GROUP BY class_name, id
    )
    WHERE class_name != 'null'

    View Slide

  23. What are the most used AST transformations?

    View Slide

  24. Do people use aliased imports much?
    SELECT aliased, count(aliased) AS total
    FROM (
    SELECT REGEXP_MATCH(line, r'.* (as) .*') AS aliased
    FROM (
    SELECT SPLIT(content, '\n') AS line
    FROM [github-groovy-files:github.contents]
    )
    WHERE line CONTAINS 'import '
    )
    GROUP BY aliased

    View Slide

  25. Do people use aliased imports much?

    View Slide

  26. Did developers adopt traits?
    SELECT COUNT(*)
    FROM (
    SELECT SPLIT(content, '\n') AS line
    FROM [github-groovy-files:github.contents]
    )
    WHERE line CONTAINS 'trait '
    1,698

    View Slide

  27. What
    about
    Gradle
    build
    files?

    View Slide

  28. Analyzing Gradle
    build files
    What can we learn from all the
    Gradle build files in those million
    repositories on Github?
    ➔ How many Gradle build files are there?
    ➔ How many Maven build files are there?
    ➔ Which versions of Gradle are being used?
    ➔ How many of those Gradle files are settings files?
    ➔ What are the most frequent build file names?
    ➔ What are the most frequent Gradle plugins?
    ➔ What are the most frequent “compile”
    and “test” dependencies?

    View Slide

  29. How many Gradle build files?
    SELECT COUNT(*) as count
    FROM [github-groovy-files:github.gradle_build_files]
    488,311

    View Slide

  30. How many Maven build files?
    SELECT count(*)
    FROM [bigquery-public-data:github_repos.files]
    WHERE path LIKE '%pom.xml'
    1,009,745

    View Slide

  31. Which versions of Gradle are being used?
    SELECT version, COUNT(version) AS count
    FROM (
    SELECT REGEXP_EXTRACT(line, r'gradle-(.*)-all.zip') AS version
    FROM (
    SELECT SPLIT(content, '\n') AS line
    FROM [github-groovy-files:github.gradle_wrapper_properties_files]
    )
    WHERE line LIKE 'distributionUrl%'
    )
    GROUP BY version
    ORDER BY count DESC

    View Slide

  32. Which versions of Gradle are being used?

    View Slide

  33. How many of those Gradle files are settings files?
    SELECT COUNT(*) as count
    FROM [github-groovy-files:github.gradle_build_files]
    WHERE path LIKE '%settings.gradle'
    102,433

    View Slide

  34. What are the most frequent Gradle build file names?
    SELECT f, COUNT(f) as count
    FROM (
    SELECT LAST(SPLIT(path, '/')) AS f
    FROM [github-groovy-files:github.gradle_build_files]
    )
    GROUP BY f
    ORDER BY count DESC

    View Slide

  35. What are the most frequent Gradle build file names?

    View Slide

  36. What are the most frequently used Gradle plugins?
    SELECT plugin, COUNT(plugin) AS count
    FROM (
    SELECT REGEXP_EXTRACT(line, r'apply plugin: (?:\'|\")(.*)(?:\'|\")') AS plugin
    FROM (
    SELECT SPLIT(content, '\n') AS line
    FROM [github-groovy-files:github.gradle_build_contents]
    )
    )
    GROUP BY plugin
    ORDER BY count DESC

    View Slide

  37. What are the most frequently used Gradle plugins?

    View Slide

  38. What are the most frequently used “id” plugins?
    SELECT newplugin, COUNT(newplugin) AS count
    FROM (
    SELECT REGEXP_EXTRACT(line, r'id (?:\'|\")(.*)(?:\'|\") version') AS newplugin
    FROM (
    SELECT SPLIT(content, '\n') AS line
    FROM [github-groovy-files:github.gradle_build_contents]
    )
    )
    GROUP BY newplugin
    ORDER BY count DESC

    View Slide

  39. What are the most frequently used “id” plugins?

    View Slide

  40. What are the most frequent “compile” dependencies?
    SELECT dep, COUNT(dep) AS count
    FROM (
    SELECT REGEXP_EXTRACT(line, r'compile(?: |\()(?:\'|\")(.*):') AS dep
    FROM (
    SELECT SPLIT(content, '\n') AS line
    FROM [github-groovy-files:github.gradle_build_contents]
    )
    )
    GROUP BY dep
    ORDER BY count DESC

    View Slide

  41. What are the most frequent “compile” dependencies?

    View Slide

  42. What are the most frequent “ test compile” dependencies?
    SELECT dep, COUNT(dep) AS count
    FROM (
    SELECT REGEXP_EXTRACT(line, r'testCompile(?: |\()(?:\'|\")(.*):') AS dep
    FROM (
    SELECT SPLIT(content, '\n') AS line
    FROM [github-groovy-files:github.gradle_build_contents]
    )
    )
    GROUP BY dep
    ORDER BY count DESC

    View Slide

  43. What are the most frequent “ test compile” dependencies?

    View Slide

  44. And
    your
    Grails
    apps?

    View Slide

  45. Analyzing
    Grails apps
    What can we learn from all the
    Grails apps in those million
    repositories on Github?
    ➔ What are the most used SQL database used?
    ➔ What are the most frequent controller names?
    ➔ What are the repositories with the biggest
    number of controllers?
    ➔ What is the distribution of number of controllers?

    View Slide

  46. What are the most used SQL database drivers used in Grails apps?
    SELECT driver, COUNT(*) AS count
    FROM (
    SELECT REGEXP_EXTRACT(line,
    r'\s*driverClassName\s*=\s*(?:\"|\')(.*)(?:\"|\')\s*')
    AS driver, id
    FROM (
    SELECT SPLIT(content, '\n') AS line, id
    FROM [github-groovy-files:github.contents]
    WHERE id IN (
    SELECT id
    FROM [github-groovy-files:github.files]
    WHERE path LIKE '%DataSource.groovy'
    )
    )
    GROUP BY driver, id
    )
    GROUP BY driver
    ORDER BY count DESC

    View Slide

  47. What are the most used SQL database drivers used in Grails apps?

    View Slide

  48. What are the most frequent controller names?
    SELECT ctrlName, COUNT(ctrlName) AS count
    FROM (
    SELECT REGEXP_EXTRACT(path, r'.*/controllers/(\w*)Controller.groovy')
    AS ctrlName
    FROM [github-groovy-files:github.files]
    WHERE path LIKE '%Controller.groovy'
    )
    GROUP BY ctrlName
    ORDER BY count DESC

    View Slide

  49. What are the most frequent controller names?

    View Slide

  50. What are the repositories with the biggest number of controllers?
    SELECT repo_name, COUNT(path) AS count
    FROM (
    SELECT path, repo_name
    FROM [github-groovy-files:github.files]
    WHERE path LIKE '%/controllers/%Controller.groovy'
    )
    GROUP BY repo_name
    ORDER BY count DESC

    View Slide

  51. What are the repositories with the biggest number of controllers?

    View Slide

  52. What is the distribution of number of controllers?
    SELECT QUANTILES(count, 11) AS n
    FROM (
    SELECT repo_name, COUNT(path) AS count
    FROM (
    SELECT path, repo_name
    FROM [github-groovy-files:github.files]
    WHERE path LIKE '%/controllers/%Controller.groovy'
    )
    GROUP BY repo_name
    ORDER BY count DESC
    )
    30% < 1 ctrl
    40% < 2 ctrl
    50% < 3 ctrl
    60% < 4 ctrl
    70% < 6 ctrl
    80% < 11 ctrl
    90% < 19 ctrl
    100% < 136 ctrl

    View Slide

  53. Your
    turn to
    play!

    View Slide

  54. References
    Some reading for diving more into the
    Github dataset and Google BigQuery
    ➔ Announcements
    http://bit.ly/gh-bq-dataset
    http://bit.ly/gcp-gh-ann
    ➔ Tabs vs Spaces!
    http://bit.ly/gh-tabspace
    ➔ Analyzing Groovy code
    http://bit.ly/gh-groovy-code
    ➔ More analyzing
    http://bit.ly/gh-analysis
    Spoiler:
    Spaces win!
    (except in Go)

    View Slide

  55. Thanks
    for your
    attention
    @glaforge

    View Slide

  56. APPENDIX

    View Slide