Large Scale Empirical Software Engineering Research using GitHub Data

GitHub has in recent years become the world’s largest collection of open source software, with around 9 million users and 17 million public repositories. These numbers make GitHub an invaluable source of data for large-scale research in empirical software engineering. In this talk, we describe recent research conducted in our group, using GitHub data. For example, we are using GitHub to understand and predict the popularity of open source projects, to understand the motivations behind refactoring, to assess the concentration of knowledge in software teams, and to measure code authorship.


December 15, 2016


  1. Large Scale Empirical Software Engineering Research using GitHub Data
      Marco Tulio Valente
      Applied Software Engineering Research Group
      Department of Computer Science
      Federal University of Minas Gerais, Brazil
  2. GitHub
      • Largest collection of open source software
        – 9 million users and 17 million public repositories
      • Public API
      • It is more than a version control system
        – Social coding
        – Issue tracker
        – Code review
  3. Outline
      1. Motivations for refactoring
        – FSE 2016 (distinguished paper and artifact)
      2. Popularity (# stars) of GitHub software
        – ICSME 2016, PROMISE 2016
      3. Code authorship measures
        – ICPC 2016
  4. STUDY #1
      Why We Refactor? Confessions of GitHub Contributors
      Danilo Silva, Nikolaos Tsantalis, Marco Tulio Valente
  5. Why we refactor? [In theory]
      Kim et al. An Empirical Study of Refactoring Challenges and Benefits at Microsoft, IEEE TSE 2014
  6. Dataset
      • 1,000 most popular Java repositories
      • Filter by number of commits
      • 748 repositories (SPRING-FRAMEWORK, ELASTICSEARCH, INTELLIJ-COMMUNITY, ...)
  7. [Diagram: study workflow]
      1. Select code repositories
      2. Mine recent refactorings
      3. Inspect manually
      4. Contact developers
      5. Analyze and classify responses
      • Repeat daily
  8. Tool Support
      Refactoring Miner
      • An automated refactoring detection tool
      • 12 well-known refactoring types:
        – Move Class, Extract Superclass, Rename Package, Extract Interface
        – Extract Method, Inline Method, Pull Up Method, Push Down Method
        – Move Method, Move Attribute, Pull Up Attribute, Push Down Attribute
  9. [Diagram: Refactoring Miner compares a revision with the previous revision in a git repository and outputs a list of refactorings]
      List of refactorings
      • Move Class X to Y
      • Extract Method a from b
      • ...
  10. Contacting developers
      Dear xxxx,
      (…) I found that you recently performed the following refactoring on yyyy project:
      Move Class PropertyRule from org.yyyy.wfs.xml to org.yyyy.util
      This is the GitHub link to the commit: https://github.com/yyyy/commit/abcd
      (…) I am wondering if you could answer the following brief questions:
      1. Could you describe why did you perform this refactoring?
      2. Did you perform this refactoring using automated refactoring support of your IDE?
  11. Some numbers
      • 185 repositories with confirmed refactoring activity
      • 1,411 confirmed refactoring instances
      • 465 e-mails sent
      • 195 responses received (41.9% response rate)
      • 27 commit messages explaining the motivation
  12. Lessons learned
      • Refactoring activity is mainly driven by the need to add new features and fix bugs, and much less by code smell resolution
      • Extract Method is the “Swiss army knife of refactorings” (11 different motivations)
  13. Lessons learned
      • Refactoring tools are still underused, as suggested by previous studies
      • Results differ for users of different IDEs
  14. STUDY #2
      Understanding and Predicting the Popularity of GitHub Repositories
      Hudson Borges, Andre Hora, Marco Tulio Valente
  15. Social Coding Features
      “Stars are used to show appreciation to the repository maintainer for their work”
  16. Correlation Analysis
      • Age: no correlation
      • Contributors: weak correlation
      • Commits: weak correlation
      • Forks: strong correlation
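A correlation analysis of this kind can be sketched with a rank correlation. The slide does not name the coefficient, so Spearman's rho is an assumption here, and the repository figures below are made-up toy values, not data from the study.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    dev_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (dev_x * dev_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data for five repositories (not from the study).
stars = [120, 4500, 900, 15000, 300]
forks = [10, 600, 150, 2100, 25]
age_weeks = [200, 50, 310, 120, 90]

rho_forks = spearman(stars, forks)
rho_age = spearman(stars, age_weeks)
```

In this toy data the forks ranking matches the stars ranking exactly, while the age ranking does not, which mirrors the qualitative contrast reported on the slide.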
  17. Developers Feedback
      1. Impact of account types (users vs orgs)
      2. Reasons for viral growth
  18. Repositories Owned by Users
      Do you plan to migrate to an organization account?
      • All developers answered negatively
      “I worked hard to create the project, and having it under my personal username is necessary to have proper credit for it.”
  19. Repositories Owned by Users
      Do you agree that an organization account would help to attract more users?
      • 80% answered negatively
      “It depends on what organization it is. If it’s a well known org I’m sure it helps, otherwise I don’t think it makes a difference.”
  20. Reasons for Viral Growth
      How do you explain the peaks in the number of stars?
      “I posted about this project on HackerNews. It quickly got a lot of attention ...”
  21. Application: Popularity Prediction
      • Technique: Multiple Linear Regression, where:
        ◦ Y_t → Predicted number of stars at week t
        ◦ b_j → Regression coefficients
        ◦ X_j → Stars at week j (0 ≤ j ≤ r < t)
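As a rough illustration of the technique, the sketch below fits a multiple linear regression by ordinary least squares. It assumes the model is a linear combination of the weekly star counts X_0..X_r plus an intercept; the intercept, the function names `fit` and `predict`, and the toy star histories are all assumptions made for this example, not details taken from the slide.

```python
def solve(a, b):
    """Solve the linear system a * x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit(histories, targets):
    """Least-squares fit via the normal equations (X^T X) b = X^T y.

    histories: one list per repository with its stars at weeks 0..r.
    targets:   observed stars at week t for the same repositories.
    Returns [intercept, b_0, ..., b_r].
    """
    rows = [[1.0] + [float(v) for v in h] for h in histories]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict(coeffs, history):
    """Predicted stars at week t for one repository's history."""
    return coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], history))


# Hypothetical training data: stars at weeks 0, 1, 2 and at the target week.
# The targets follow 2 * X_2 - X_1, so a correct fit should recover that rule.
histories = [[10, 20, 35], [5, 9, 12], [100, 150, 180], [0, 40, 45], [30, 31, 60]]
targets = [50, 15, 210, 50, 89]

coeffs = fit(histories, targets)
forecast = predict(coeffs, [20, 30, 50])
```

Solving the normal equations directly keeps the example dependency-free; for real data a numerically sturdier least-squares routine from a numerical library would be the more usual choice.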
  22. STUDY #3
      Measuring Code Authorship: Algorithms and Applications
      Guilherme Avelino, Leonardo Passos, Andre Hora, Marco Tulio Valente
  23. Degree-of-Authorship (DOA) Metric
      • DOA(d, f) depends on three variables:
        – FA = 1 if d made the first commit in f; 0, otherwise
        – DL = number of further commits to f by d
        – AC = number of commits in f by other devs
      T. Fritz, et al. “Degree-of-knowledge: modeling a developer’s knowledge of code,” ACM TOSEM, 2014
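The three variables can be derived from a file's commit history, as the following sketch shows. The slide does not give the formula that combines them: the linear form with a logarithmic penalty on AC and the default weights below are quoted from memory of the published DOA model and should be checked against the cited paper before use. The helper names and the sample history are hypothetical, and treating the creating commit as counted only by FA is an interpretation.

```python
import math


def doa_inputs(commit_authors, dev):
    """Compute (FA, DL, AC) for developer `dev` on one file.

    commit_authors: authors of the file's commits in chronological order.
    Assumption: the creating commit is captured by FA only, so DL and AC
    are counted over the commits that follow it.
    """
    fa = 1 if commit_authors and commit_authors[0] == dev else 0
    later = commit_authors[1:]
    dl = sum(1 for author in later if author == dev)
    ac = sum(1 for author in later if author != dev)
    return fa, dl, ac


def degree_of_authorship(fa, dl, ac, weights=(3.293, 1.098, 0.164, 0.321)):
    """Combine FA, DL and AC into a DOA score.

    The weights are NOT on the slide; they are assumed values recalled from
    the literature and are exposed as a parameter for that reason.
    """
    base, w_fa, w_dl, w_ac = weights
    return base + w_fa * fa + w_dl * dl - w_ac * math.log(1 + ac)


# Hypothetical history of one file: ana created it, bob and cid also changed it.
history = ["ana", "bob", "ana", "ana", "cid"]
doa_ana = degree_of_authorship(*doa_inputs(history, "ana"))
doa_bob = degree_of_authorship(*doa_inputs(history, "bob"))
```

With this history the creator, who also made further commits, ends up with a higher score than a developer who touched the file once.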
  24. Author Identification
      [Figure: developers plotted by degree of authorship]
  25. Author Identification
      [Figure: developers plotted by degree of authorship; a subset is labeled as authors]
  26. Author Identification
      [Figure: developers plotted by degree of authorship; a subset is labeled as authors, with a threshold of 0.75]
      • Threshold empirically defined: 6 systems of different languages
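One plausible reading of the 0.75 threshold is sketched below: each developer's DOA for a file is divided by the highest DOA on that file, and developers above the threshold are treated as its authors. The normalization step, the function name `file_authors`, and the sample scores are assumptions for illustration only.

```python
def file_authors(doa_by_dev, threshold=0.75):
    """Return the developers whose normalized DOA exceeds the threshold.

    doa_by_dev: mapping developer -> DOA score for a single file.
    Normalizing by the file's highest score is an assumption of this sketch.
    """
    if not doa_by_dev:
        return []
    top = max(doa_by_dev.values())
    if top <= 0:
        return []
    return sorted(dev for dev, score in doa_by_dev.items() if score / top > threshold)


# Hypothetical DOA scores for one file.
scores = {"ana": 4.4, "bob": 3.5, "cid": 2.0}
authors = file_authors(scores)
```

Here bob's normalized score is about 0.80, so he is kept alongside ana, while cid falls below the threshold.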
  27. Linux Subsystems
      • Specialist: 1 subsystem
      • Generalist: at least 2 subsystems
  28. Application: Truck/Bus Factor
      “The number of people on your team that have to be hit by a truck (or quit; or win in the lottery) before the project is in serious trouble”
  29. Estimating Truck Factor
      [Figure: the system's files, each labeled with an author (A1, A2, ..., A10), beside the authors A1 ... An ordered by number of files]
  30. Estimating Truck Factors
      [Figure: same diagram with one author crossed out (X)]
  31. Estimating Truck Factors
      [Figure: same diagram with two authors crossed out (X X)]
  32. Estimating Truck Factors
      [Figure: same diagram with three authors crossed out (X X X) and a 50% mark on the system's files]
  33. Estimating Truck Factors
      [Figure: same diagram with three authors crossed out, the 50% mark, and the result TF = 3]
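The figure sequence suggests a greedy procedure: repeatedly remove the author with the most files until more than half of the files have no author left, and report the number of removed authors. The sketch below follows that reading. The stopping rule, the tie-breaking by name, and the one-author-per-file toy input (taken from the labels visible in the first figure) are assumptions rather than details confirmed by the slides.

```python
from collections import Counter


def truck_factor(file_authors, orphan_limit=0.5):
    """Greedy truck factor estimate.

    file_authors: mapping file path -> set of author names.
    Removes the author covering the most files until the share of files
    without any author exceeds `orphan_limit`.
    Returns (truck factor, removed authors in order).
    """
    remaining = {path: set(authors) for path, authors in file_authors.items()}
    removed = []
    while remaining:
        orphans = sum(1 for authors in remaining.values() if not authors)
        if orphans / len(remaining) > orphan_limit:
            break
        counts = Counter(a for authors in remaining.values() for a in authors)
        if not counts:
            break
        # Ties are broken alphabetically so the result is deterministic.
        top = max(sorted(counts), key=lambda a: counts[a])
        removed.append(top)
        for authors in remaining.values():
            authors.discard(top)
    return len(removed), removed


# Toy input mirroring the file labels in the figure: one author per file.
labels = ["A1"] * 5 + ["A2"] * 2 + ["A3"] * 2 + ["A4", "A5", "A6", "A7", "A8", "A9", "A10"]
files = {f"file_{i}": {author} for i, author in enumerate(labels)}

tf, removed_authors = truck_factor(files)
```

On this input, removing A1, A2 and A3 leaves 9 of the 16 files without an author, which crosses the 50% mark and yields a truck factor of 3, the value shown in the last figure.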
  34. Results
      • 45 systems (34%) have TF = 1
        – mbostock/d3, less/less.js
      • 42 systems (31%) have TF = 2
        – clojure/clojure, cucumber/cucumber, ashkenas/backbone, elasticsearch/elasticsearch
      [Updated results: 12K out of the top-17K (72%) projects have TF = 1]
  35. Do developers agree that the TF authors are the main developers of their projects?
  36. Do developers agree that their projects will be in trouble if they lose the truck factor authors?
  37. Thanks!
      Marco Tulio Valente
      Applied Software Engineering Research Group
      Department of Computer Science
      Federal University of Minas Gerais, Brazil
  38. Ongoing study: why do OSS projects fail? (“dual study”)
      • Top-5000 systems (stars)
      • 540 systems without commits in the last year (fail?)
      • 342 mails sent to main developer (public mail available)
      • 94 answers (27.5%)
      • Do you agree it is no longer under maintenance? Yes (78)
      • Why did you stop maintaining the system? [37 answers]
        ◦ Lack of time: 13
        ◦ Project is completed: 6
        ◦ Usurped by competitor: 5
        ◦ Lack of interest: 5