Lean GHTorrent: Github data on demand

Presentation given at the MSR 2014 data track

Georgios Gousios

June 03, 2014

Transcript

  1. Lean GHTorrent
    Georgios Gousios, Bogdan Vasilescu,
    Alexander Serebrenik, Andy Zaidman
    @gousiosg

  5. MSR: 19 GB

  6. MSR: 19 GB
    VISSOFT: 0.5 GB

  7. MSR: 19 GB
    VISSOFT: 0.5 GB
    GHTorrent: 3.5 TB

  8. MSR: 19 GB
    VISSOFT: 0.5 GB
    GHTorrent: 3.5 TB
    Sun = 109x Earth
    GHTorrent = 184x MSR
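The 184x figure on this slide can be checked with a few lines of arithmetic. This is only a sketch: the sizes are the ones quoted on the slide, decimal units (1 TB = 1000 GB) are an assumption, and the names `SIZES_GB` and `ratio` are made up for illustration.

```python
# Dataset sizes as quoted on the slide, converted to gigabytes.
# Assumption: decimal units, i.e. 1 TB = 1000 GB.
SIZES_GB = {
    "MSR": 19.0,
    "VISSOFT": 0.5,
    "GHTorrent": 3.5 * 1000,
}


def ratio(larger: str, smaller: str) -> float:
    """Return how many times bigger `larger` is than `smaller`."""
    return SIZES_GB[larger] / SIZES_GB[smaller]


if __name__ == "__main__":
    # 3500 / 19 is roughly 184.2, which matches the slide's rounded figure.
    print(f"GHTorrent / MSR = {ratio('GHTorrent', 'MSR'):.0f}x")
```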

  9. I need a fortune for H/W

    I need an army of researchers

    Replication?

  11. There is a solution!

  17. VS

  23. @gousiosg
    http://ghtorrent.org/lean.html
    Lean GHTorrent: Github data on demand
    Georgios Gousios, Bogdan Vasilescu, Alexander Serebrenik and Andy Zaidman
    {g.gousios, a.e.zaidman}@tudelft.nl {b.n.vasilescu, a.serebrenik}@tue.nl
    Software Engineering Research Group
    http://swerl.tudelft.nl/
    Delft University of Technology

    Want to do research with GHTorrent data?
    It is now as easy as:
    1. Filling in the form at ghtorrent.org/lean.html
    2. Getting the data!

    In the package, you will find:
    • A MySQL dump (to query like a boss)
    • MongoDB collection dumps (all Github API data)
    for all repos specified in step 1!

    [Architecture diagram, steps numbered 1 to 9. Components: Web form, Web server, Requests db, Request listener, Requests queue, Dispatcher, Job db, Retrieval workers, GHTorrent db, GitHub API, Responses queue, Response listener, GHTorrent server]
    No need to care about this (but ask if you do!)
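The diagram labels on this slide (requests queue, dispatcher, retrieval workers, responses queue) suggest a queue-based worker pattern. The following is a minimal in-process sketch of that general pattern, not the actual Lean GHTorrent implementation, which the slide text does not describe. The functions `fetch_repo`, `retrieval_worker` and `dispatch` are hypothetical names, and the GitHub API call is replaced by a stub.

```python
from __future__ import annotations

import queue
import threading


def fetch_repo(repo: str) -> dict:
    """Hypothetical stand-in for the data retrieval a worker would perform."""
    return {"repo": repo, "status": "retrieved"}


def retrieval_worker(requests_q: queue.Queue, responses_q: queue.Queue) -> None:
    """Consume repository names until a None sentinel arrives."""
    while True:
        repo = requests_q.get()
        if repo is None:  # sentinel: no more work for this worker
            break
        responses_q.put(fetch_repo(repo))


def dispatch(repos: list[str], n_workers: int = 2) -> list[dict]:
    """Queue one request per repository, run the workers, collect responses."""
    requests_q: queue.Queue = queue.Queue()
    responses_q: queue.Queue = queue.Queue()

    workers = [
        threading.Thread(target=retrieval_worker, args=(requests_q, responses_q))
        for _ in range(n_workers)
    ]
    for worker in workers:
        worker.start()

    for repo in repos:
        requests_q.put(repo)
    for _ in workers:  # one sentinel per worker so each one terminates
        requests_q.put(None)
    for worker in workers:
        worker.join()

    # All workers have finished, so the responses queue can be drained safely.
    results = []
    while not responses_q.empty():
        results.append(responses_q.get())
    # Sort for a deterministic order, since workers may finish in any order.
    return sorted(results, key=lambda r: r["repo"])


if __name__ == "__main__":
    for response in dispatch(["rails/rails", "torvalds/linux"]):
        print(response)
```

A real deployment would presumably use a persistent message broker and separate processes instead of in-memory queues and threads. The sketch only shows how requests and responses are decoupled through two queues.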