A tale of two datasets

Presentation given at the 2013 ICSM panel on Open Access

Georgios Gousios

September 25, 2013

Transcript

  1. A tale of two datasets. Georgios Gousios, TU Delft

  2. open access

  3. None
  4. None
  5. None
  6. Software Quality Observatory for OSS

  7. None
  8. 50k LOC!

  9. 750 OSS repositories, SVN, bugs, emails; 1.5GB processed data dump

  10. demo.sqo-oss.org

  11. SQO-OSS facts
      • 6 partners
      • ~30 publications
      • 4 PhDs funded
      • press releases on project releases
      • rated excellent by the EC

  12. 1 external user, 2 external publications

  13. None
  14.  

  15.   GHTorrent

  16.   GHTorrent

  17.  mirror event stream

  18. recursive dependency retrieval
      <<event>> PushEvent
      <<api>> /users/:user ensure_user
      <<api>> /repos/:user/:repo/ ensure_repo
      <<api>> /repos/:user/:repo/commits ensure_commits ensure_user
      <<api>> /:user/:repo/sha ensure_commit ensure_user
      <<api>> /users/:user/followers ensure_followers
      <<api>> /repos/:user/:repo/commits/:sha/comments ensure_commit_comments
      <<api>> /users/:user/orgs ensure_orgs
      <<api>> /orgs/:org/teams ensure_teams

  19. relational database

  20. NoSQL database as cache
      repositories, users, organizations, issues
      /users/:user, /user/repos, /repos/:user/:repo/issues, /orgs/:org
      {
        "type": "User",
        "public_gists": 0,
        "login": "gousiosg",
        "followers": 8,
        "name": "Georgios Gousios",
        "public_repos": 4,
        "created_at": ...,
        "id": 386172,
        "following": 4,
      }
      { . . .

  21. periodic dumps of DBs online

  22. query DBs online

  23. GHTorrent facts
      • 2.0 TB in MongoDB, 40GB in MySQL
      • 1 developer
      • 3 papers
      • advertised on social media
      • 1.5 years

  24. 5 external users, 3 external papers, MSR14 challenge dataset

  25. why the difference?

  26. but GitHub is hot!

  27. but GitHub is hot!

  28. but GitHub is hot!
      so were SourceForge, GNOME, KDE, etc.

  29. but GitHub is hot!
      so were SourceForge, GNOME, KDE, etc.
      the GitHub Archive project offers a subset of the data in an easier-to-query format

  30.                        SQO-OSS                          GHTorrent
      Tools                  pluggable platform               Ruby library and cmd-line tools
      Data                   initially none, then raw repos   raw + processed
      Documentation          mostly programming               mostly data formats
      Released               after 2 years                    immediately
      Attracted first user   2 years                          2 months

  31. open source research

  32. aim for lean and mean

  33. infrastructures and platforms are overrated

  34. open now trumps open when it’s done
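
Slides 18 to 20 describe how a single event triggers recursive retrieval of everything it depends on, with raw API responses kept in a NoSQL store that acts as a cache in front of the relational database. The Python sketch below illustrates that idea only; it is not GHTorrent's code (slide 30 describes the real tools as a Ruby library). `FAKE_API`, `raw_cache`, `db`, `api_get`, `handle_push_event`, the sample user, repo and commit, and the commit path are all assumptions made for illustration. Only the `ensure_*` names echo labels on slide 18.

```python
# Hypothetical stand-in for the remote API: URL path -> JSON-like response.
# All data here is invented for the example.
FAKE_API = {
    "/users/alice": {"login": "alice"},
    "/repos/alice/demo": {"name": "demo", "owner": "alice"},
    "/repos/alice/demo/commits/abc123": {"sha": "abc123", "author": "alice"},
}

raw_cache = {}   # plays the role of the NoSQL store: path -> raw response
api_calls = []   # records which paths actually reached the "API"
db = {"users": {}, "repos": {}, "commits": {}}  # plays the relational DB


def api_get(path):
    """Return the response for a path, consulting the raw cache first."""
    if path not in raw_cache:
        api_calls.append(path)
        raw_cache[path] = FAKE_API[path]
    return raw_cache[path]


def ensure_user(login):
    """Make sure the user row exists, fetching it only if missing."""
    if login not in db["users"]:
        db["users"][login] = api_get(f"/users/{login}")
    return db["users"][login]


def ensure_repo(owner, name):
    """A repo depends on its owner, so ensure the owner first."""
    key = (owner, name)
    if key not in db["repos"]:
        ensure_user(owner)
        db["repos"][key] = api_get(f"/repos/{owner}/{name}")
    return db["repos"][key]


def ensure_commit(owner, name, sha):
    """A commit depends on its repo and on its author."""
    key = (owner, name, sha)
    if key not in db["commits"]:
        ensure_repo(owner, name)
        commit = api_get(f"/repos/{owner}/{name}/commits/{sha}")
        ensure_user(commit["author"])
        db["commits"][key] = commit
    return db["commits"][key]


def handle_push_event(event):
    """Resolve every commit named in a (simplified) push event."""
    for sha in event["shas"]:
        ensure_commit(event["owner"], event["repo"], sha)


event = {"owner": "alice", "repo": "demo", "shas": ["abc123"]}
handle_push_event(event)
handle_push_event(event)  # replaying the event triggers no new API calls
print(api_calls)
```

In this sketch each `ensure_*` function checks the store before fetching, so replaying the same event is idempotent, and dependencies (user, then repo, then commit) are fetched in the order they are needed.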