Lean GHTorrent: Github data on demand

Presentation given at the MSR 2014 data track

Georgios Gousios

June 03, 2014

Transcript

  1. MSR: 19 GB. VISSOFT: 0.5 GB. GHTorrent: 3.5 TB.

    Sun = 109x Earth. GHTorrent = 184x MSR.
  2. I need a fortune for H/W. I need an army of researchers. Replication?
  3. VS

  4. Lean GHTorrent: Github data on demand

    Georgios Gousios, Bogdan Vasilescu, Alexander Serebrenik and Andy Zaidman
    {g.gousios, a.e.zaidman}@tudelft.nl, {b.n.vasilescu, a.serebrenik}@tue.nl
    @gousiosg, http://ghtorrent.org/lean.html
    Software Engineering Research Group, http://swerl.tudelft.nl/, Delft University of Technology

    [Architecture diagram with steps numbered 1 to 9. Labels: Web server, Web form, GHTorrent server, Request listener, Requests db, Job db, Dispatcher, Requests queue, Retrieval workers …, Responses queue, Response listener, GHTorrent db, GitHub API]

    Want to do research with GHTorrent data? It is now as easy as:

    1. Filling in the form at ghtorrent.org/lean.html
    2. Getting the data! No need to care about this (but ask if you do!)

    In the package, you will find:

    • A MySQL dump (to query like a boss)
    • MongoDB collection dumps (all Github API data)

    for all repos specified in step 1!
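The diagram labels on this slide (requests queue, retrieval workers, responses queue) suggest a producer/consumer pipeline. The following is only a minimal sketch of that pattern, not the GHTorrent implementation: it uses in-process Python queues in place of whatever message queue the real service runs, and `fetch_repo_data`, `retrieval_worker` and `process_job` are hypothetical names invented for illustration. The exact order of the nine steps in the diagram is not recoverable from the transcript, so the sketch does not try to reproduce it.

```python
import queue
import threading


def fetch_repo_data(repo):
    # Hypothetical stand-in for a retrieval worker querying the GitHub API.
    return {"repo": repo, "status": "retrieved"}


def retrieval_worker(requests_q, responses_q):
    # Take repository names off the requests queue until a sentinel arrives.
    while True:
        repo = requests_q.get()
        if repo is None:  # sentinel: no more work for this worker
            break
        responses_q.put(fetch_repo_data(repo))


def process_job(repos, n_workers=2):
    # Assumed flow: enqueue one request per repository, let the workers
    # retrieve the data, then collect everything from the responses queue.
    requests_q = queue.Queue()
    responses_q = queue.Queue()
    workers = [
        threading.Thread(target=retrieval_worker, args=(requests_q, responses_q))
        for _ in range(n_workers)
    ]
    for worker in workers:
        worker.start()
    for repo in repos:
        requests_q.put(repo)
    for _ in workers:
        requests_q.put(None)  # one sentinel per worker
    for worker in workers:
        worker.join()

    results = []
    while not responses_q.empty():
        results.append(responses_q.get())
    # Workers finish in arbitrary order, so sort for a stable result.
    return sorted(results, key=lambda r: r["repo"])
```

Calling `process_job(["rails/rails", "torvalds/linux"])` would return one result record per requested repository. In a real deployment the two queues would typically live in a separate message broker so that workers can run on different machines.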