A tale of two datasets

Presentation given at the 2013 ICSM panel on Open Access

Georgios Gousios

September 25, 2013

Transcript

  1. SQO-OSS facts • 6 partners • ~30 publications • 4 PhDs funded • press releases on project releases • rated excellent by the EC
  2. <<event>> PushEvent: recursive dependency retrieval
     <<api>> /users/:user                              ensure_user
     <<api>> /repos/:user/:repo/                       ensure_repo
     <<api>> /repos/:user/:repo/commits                ensure_commits, ensure_user
     <<api>> /:user/:repo/sha                          ensure_commit, ensure_user
     <<api>> /users/:user/followers                    ensure_followers
     <<api>> /repos/:user/:repo/commits/:sha/comments  ensure_commit_comments
     <<api>> /users/:user/orgs                         ensure_orgs
     <<api>> /orgs/:org/teams                          ensure_teams
  3. NoSQL database as cache
     Entities: repositories, users, organizations, issues
     API paths: /users/:user, /user/repos, /repos/:user/:repo/issues, /orgs/:org
     { "type": "User", "public_gists": 0, "login": "gousiosg", "followers": 8, "name": "Georgios Gousios", "public_repos": 4, "created_at": ..., "id": 386172, "following": 4 }
     { ... }
  4. GHTorrent facts • 2.0 TB in MongoDB, 40 GB in MySQL • 1 developer • 3 papers • advertised on social media • 1.5 years
  5. But GitHub is hot! So were SourceForge, GNOME, KDE, etc. The GitHub Archive project offers a subset of the data in an easier-to-query format.
  6.                        SQO-OSS                          GHTorrent
     Tools                  pluggable platform               Ruby library and command-line tools
     Data                   initially none, then raw repos   raw + processed
     Documentation          mostly programming               mostly data formats
     Released               after 2 years                    immediately
     Attracted first user   2 years                          2 months
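
The recursive dependency retrieval on slide 2, combined with the "NoSQL database as cache" idea on slide 3, might look roughly like the sketch below. It is an illustration in Python, not GHTorrent's actual Ruby code: the `Mirror` class, the `FAKE_API` dictionary, the in-memory `cache`, the commit path, and the dependency order (commit needs repo, repo needs owner) are all assumptions made for the example. Only the `ensure_*` naming comes from the slides.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the remote API: maps a request path to a document.
FAKE_API: Dict[str, dict] = {
    "/users/alice": {"login": "alice", "type": "User"},
    "/repos/alice/demo": {"name": "demo", "owner": "alice"},
    "/repos/alice/demo/commits/abc123": {"sha": "abc123", "author": "alice"},
}


class Mirror:
    """Retrieves entities on demand and remembers what it has already stored."""

    def __init__(self, fetch: Callable[[str], dict]) -> None:
        self.fetch = fetch
        self.cache: Dict[str, dict] = {}  # stand-in for the NoSQL store
        self.api_calls: List[str] = []    # records every real API request

    def _get(self, path: str) -> dict:
        # Query the API only when the document is not cached yet.
        if path not in self.cache:
            self.api_calls.append(path)
            self.cache[path] = self.fetch(path)
        return self.cache[path]

    def ensure_user(self, user: str) -> dict:
        return self._get(f"/users/{user}")

    def ensure_repo(self, user: str, repo: str) -> dict:
        self.ensure_user(user)  # assumed dependency: a repo needs its owner
        return self._get(f"/repos/{user}/{repo}")

    def ensure_commit(self, user: str, repo: str, sha: str) -> dict:
        self.ensure_repo(user, repo)  # assumed dependency: a commit needs its repo
        commit = self._get(f"/repos/{user}/{repo}/commits/{sha}")
        self.ensure_user(commit["author"])  # the author must exist as well
        return commit


mirror = Mirror(FAKE_API.__getitem__)
mirror.ensure_commit("alice", "demo", "abc123")
mirror.ensure_commit("alice", "demo", "abc123")  # served entirely from the cache
print(mirror.api_calls)
```

Asking for a single commit pulls in the repository and the user it depends on, and the repeated call issues no further requests because every document is already in the cache.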