Lean GHTorrent: GitHub data on demand

Presentation given at the MSR 2014 data track

Georgios Gousios

June 03, 2014

Transcript

  1. Lean GHTorrent. Georgios Gousios, Bogdan Vasilescu, Alexander Serebrenik, Andy Zaidman (@gousiosg)
  5. MSR: 19 GB

  6. MSR: 19 GB; VISSOFT: 0.5 GB

  7. MSR: 19 GB; VISSOFT: 0.5 GB; GHTorrent: 3.5 TB

  8. MSR: 19 GB; VISSOFT: 0.5 GB; GHTorrent: 3.5 TB

    Sun = 109x Earth; GHTorrent = 184x MSR
  9. I need a fortune for H/W. I need an army of researchers. Replication?
  11. There is a solution!

  17. VS

  23. Lean GHTorrent: GitHub data on demand

    Georgios Gousios, Bogdan Vasilescu, Alexander Serebrenik and Andy Zaidman
    {g.gousios, a.e.zaidman}@tudelft.nl, {b.n.vasilescu, a.serebrenik}@tue.nl
    @gousiosg, http://ghtorrent.org/lean.html
    Software Engineering Research Group (http://swerl.tudelft.nl/), Delft University of Technology

    Want to do research with GHTorrent data? It is now as easy as:

    1. Filling in the form at ghtorrent.org/lean.html
    2. Getting the data!

    [Architecture diagram, steps numbered 1 to 9. Components: Web server, Web form, GHTorrent server, Job db, Retrieval workers, Requests queue, Responses queue, Dispatcher, GHTorrent db, GitHub API, Request listener, Response listener, Requests db.]
    No need to care about this (but ask if you do!)

    In the package, you will find:

    • A MySQL dump (to query like a boss)
    • MongoDB collection dumps (all GitHub API data)

    for all repos specified in step 1!
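
The diagram on the last slide names a requests queue, a pool of retrieval workers and a responses queue. The transcript does not show how these are wired together, so the following is only a minimal sketch of that general pattern using Python's standard library, not the GHTorrent implementation. The names `fetch_repo`, `worker` and `run_job` are hypothetical, and `fetch_repo` is a stub that makes no real GitHub API call.

```python
import queue
import threading


def fetch_repo(repo):
    """Hypothetical stand-in for a GitHub API retrieval; returns a fake record."""
    owner, name = repo.split("/")
    return {"owner": owner, "name": name}


def worker(requests, responses):
    """Retrieval worker: process repos from the requests queue until a sentinel."""
    while True:
        repo = requests.get()
        if repo is None:  # sentinel: no more work for this worker
            break
        responses.put(fetch_repo(repo))


def run_job(repos, n_workers=2):
    """Dispatch one job (a list of 'owner/name' strings) and collect the results."""
    requests = queue.Queue()
    responses = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(requests, responses))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for repo in repos:
        requests.put(repo)
    for _ in threads:
        requests.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    # All workers have finished, so the responses queue can be drained safely.
    results = []
    while not responses.empty():
        results.append(responses.get())
    # Sort so the output does not depend on worker scheduling.
    return sorted(results, key=lambda r: (r["owner"], r["name"]))
```

A call such as `run_job(["rails/rails", "gousiosg/github-mirror"])` returns one record per requested repository. In a real deployment the in-process queues would presumably be replaced by a message broker and the stub by actual API retrieval.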