Large Scale Empirical Software Engineering Research using GitHub Data

GitHub has in recent years become the world’s largest collection of open source software, with around 9 million users and 17 million public repositories. These numbers make GitHub an invaluable source of data for large-scale research in empirical software engineering. In this talk, we describe recent research conducted in our group, using GitHub data. For example, we are using GitHub to understand and predict the popularity of open source projects, to understand the motivations behind refactoring, to assess the concentration of knowledge in software teams, and to measure code authorship.

ASERG, DCC, UFMG

December 15, 2016

Transcript

  1. Large Scale Empirical Software Engineering Research using GitHub Data
    Marco Tulio Valente
    Applied Software Engineering Research Group
    Department of Computer Science
    Federal University of Minas Gerais, Brazil

  2. GitHub
    • Largest collection of open source software
    – 9 million users and 17 million public repositories
    • Public API
    • It is more than a version control system
    – Social coding
    – Issue tracker
    – Code review

  3. Outline
    1. Motivations for refactoring
    – FSE 2016 (distinguished paper and artifact)
    2. Popularity (# stars) of GitHub software
    – ICSME 2016, PROMISE 2016
    3. Code authorship measures
    – ICPC 2016

  4. STUDY #1
    Why We Refactor? Confessions of GitHub Contributors
    Danilo Silva, Nikolaos Tsantalis, Marco Tulio Valente

  5. Why we refactor? [In theory]

  6. Why we refactor? [In theory]
    Kim et al. An Empirical Study of Refactoring Challenges and Benefits at Microsoft, IEEE TSE 2014

  7. But why do we really refactor?
    In daily programming ...

  8. Dataset
    1,000 most popular Java repositories
    → Filter by number of commits
    → 748 repositories (SPRING-FRAMEWORK, ELASTICSEARCH, INTELLIJ-COMMUNITY, ...)

  9.
    1. Select code repositories
    2. Mine recent refactorings
    3. Inspect manually
    4. Contact developers
    5. Analyze and classify responses
    Repeat daily

  10. Tool Support
    Refactoring Miner
    • An automated refactoring detection tool
    • 12 well-known refactoring types
    Move Class          Extract Superclass
    Rename Package      Extract Interface
    Extract Method      Inline Method
    Pull Up Method      Push Down Method
    Move Method         Move Attribute
    Pull Up Attribute   Push Down Attribute

  11.
    git repository (revision, previous revision) → Refactoring Miner → List of refactorings
    • Move Class X to Y
    • Extract Method a from b
    • ...
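
The input and output on this slide (two revisions in, a list of refactorings out) can be illustrated with a deliberately naive sketch. It is not how Refactoring Miner works: it only matches classes by simple name across two sets of fully qualified class names, and the function name `detect_moved_classes` is hypothetical.

```python
def detect_moved_classes(previous, current):
    """Toy "Move Class" detector for two revisions.

    `previous` and `current` are sets of fully qualified class names.
    Returns (simple_name, old_package, new_package) tuples for classes
    that disappeared from one package and appeared in exactly one other.
    """
    def split(qualified):
        package, _, simple = qualified.rpartition(".")
        return package, simple

    removed = previous - current
    added = current - previous

    # Index newly added classes by their simple name.
    added_by_name = {}
    for qualified in added:
        package, simple = split(qualified)
        added_by_name.setdefault(simple, []).append(package)

    moves = []
    for qualified in sorted(removed):
        package, simple = split(qualified)
        candidates = added_by_name.get(simple, [])
        if len(candidates) == 1:  # report only unambiguous matches
            moves.append((simple, package, candidates[0]))
    return moves


previous = {"org.yyyy.wfs.xml.PropertyRule", "org.yyyy.util.Helper"}
current = {"org.yyyy.util.PropertyRule", "org.yyyy.util.Helper"}
for name, old, new in detect_moved_classes(previous, current):
    print(f"Move Class {name} from {old} to {new}")
```

A name-only match would miss a class that is moved and renamed in the same commit, which is one reason a real detector needs to compare code structure as well.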

  12. Contacting developers
    Dear xxxx,

    (…) I found that you recently performed the following refactoring on yyyy project:

    Move Class PropertyRule from org.yyyy.wfs.xml to org.yyyy.util

    This is the GitHub link to the commit: https://github.com/yyyy/commit/abcd
    (…) I am wondering if you could answer the following brief questions:

    1. Could you describe why did you perform this refactoring?
    2. Did you perform this refactoring using automated refactoring support of your IDE?

  13. Some numbers
    185     repositories with confirmed refactoring activity
    1,411   confirmed refactoring instances
    465     e-mails sent
    195     responses received (41.9% response rate)
    27      commit messages explaining the motivation

  14. WHY DO DEVELOPERS REFACTOR?

  15. We found 44 reasons

  16. Extract Method motivations

  17. Lessons learned
    • Refactoring activity is mainly driven by the need to add new features and fix bugs, and much less by code smell resolution
    • Extract Method is the “Swiss army knife of refactorings” (11 different motivations)

  18. DO DEVELOPERS USE REFACTORING TOOLS?

  19. Manual vs. automated refactorings

  20. Refactoring automation per type

  21. Reasons for not using Refactoring Tools

  22. The influence of the IDE

  23. Lessons learned
    • Refactoring tools are still underused, as suggested by previous studies
    • Results are different for users of different IDEs

  24. Dataset
    http://aserg-ufmg.github.io/why-we-refactor

  25. STUDY #2
    Understanding and Predicting the Popularity of GitHub Repositories
    Hudson Borges, Andre Hora, Marco Tulio Valente

  26. Social Coding Features
    “Stars are used to show appreciation to the repository maintainer for their work”

  27. “Our First 50,000 Stars”
    https://facebook.github.io/react/blog/2016/09/28/our-first-50000-stars.html

  28. What are the characteristics of highly successful software?

  29. Top-6 most starred repositories (Dec 14, 2016)
    source: http://gittrends.io

  30. Data Collection
    ● Top-2,500 repositories (March, 2016)
    ● Historical data on # stars
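
As a rough illustration of what "historical data on # stars" can look like once collected, the sketch below turns a list of star timestamps into a cumulative weekly series. The timestamp format (ISO 8601 in UTC) and the function name `weekly_star_counts` are assumptions made for this example; the slide does not say how the data were stored.

```python
from datetime import datetime, timedelta


def weekly_star_counts(starred_at, start, weeks):
    """Cumulative number of stars at the end of each week after `start`.

    `starred_at` is a list of timestamps such as "2016-01-02T10:00:00Z".
    Stars given before `start` are counted in every week.
    """
    stamps = sorted(
        datetime.strptime(s, "%Y-%m-%dT%H:%M:%SZ") for s in starred_at
    )
    counts = []
    idx = 0
    for week in range(1, weeks + 1):
        end = start + timedelta(weeks=week)
        # Advance past every star given before the end of this week.
        while idx < len(stamps) and stamps[idx] < end:
            idx += 1
        counts.append(idx)
    return counts


stars = ["2016-01-02T10:00:00Z", "2016-01-05T00:00:00Z", "2016-01-12T00:00:00Z"]
print(weekly_star_counts(stars, datetime(2016, 1, 1), 3))  # [2, 3, 3]
```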

  31. Top-10 programming languages

  32. Correlation Analysis
    Age: No correlation
    Contributors: Weak correlation
    Commits: Weak correlation
    Forks: Strong correlation
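
The slide does not name the correlation coefficient. A minimal sketch, assuming a rank correlation such as Spearman's (a common choice for skewed counts like stars and forks), is shown below. The sample numbers are invented for illustration and are not from the study.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Invented example: stars and forks of five hypothetical repositories.
stars = [120, 4500, 800, 15000, 2300]
forks = [10, 900, 150, 4000, 300]
print(round(spearman(stars, forks), 3))
```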

  33. Popularity Growth Patterns
    ● K-Spectral Centroid (time series) clustering algorithm
    slow   moderate   fast   viral
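
K-Spectral Centroid groups time series by shape rather than by magnitude. The sketch below shows a scale-invariant distance of the kind usually associated with that algorithm: the second series is rescaled by the factor that best fits the first, and the remaining error is normalized. It is a simplification under stated assumptions: it ignores the time-shift search and the centroid update, and the name `ksc_distance` is hypothetical.

```python
import math


def ksc_distance(x, y):
    """Scale-invariant distance between two equal-length series (no shift).

    Computes min over alpha of ||x - alpha * y|| / ||x||, where the best
    alpha is (x . y) / ||y||^2.
    """
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y_sq = sum(b * b for b in y)
    if norm_x == 0 or norm_y_sq == 0:
        raise ValueError("series must not be all zeros")
    alpha = sum(a * b for a, b in zip(x, y)) / norm_y_sq
    residual = math.sqrt(sum((a - alpha * b) ** 2 for a, b in zip(x, y)))
    return residual / norm_x


# Same growth shape at different scales: distance is (numerically) zero.
print(ksc_distance([1, 2, 3], [10, 20, 30]))
# Completely different shapes: distance is 1.0.
print(ksc_distance([1, 0], [0, 1]))
```

With a distance like this, a repository that grows from 100 to 1,000 stars and one that grows from 1,000 to 10,000 along the same curve would land in the same cluster.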

  34. Clustering Results

  35. Developers Feedback
    1. Impact of account types (users vs orgs)
    2. Reasons for viral growth

  36. Do you plan to migrate to an organization account?
    All developers answered negatively
    Repositories Owned by Users
    “I worked hard to create the project, and having it under my personal username is necessary to have proper credit for it.”

  37. Do you agree that an organization account would help to attract more users?
    80% answered negatively
    Repositories Owned by Users
    “It depends on what organization it is. If it’s a well known org I’m sure it helps, otherwise I don’t think it makes a difference.”

  38. Reasons for Viral Growth
    How do you explain the peaks in the number of stars?
    “I posted about this project on HackerNews. It quickly got a lot of attention ...”

  39. Application: Popularity Prediction
    ● Technique: Multiple Linear Regression
    (equation image not captured in the transcript)
    where:
    ○ Yt → Predicted number of stars at week t
    ○ bj → Regression coefficients
    ○ Xj → Stars at week j (0 ≤ j ≤ r < t)
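
The regression equation itself is an image that the transcript does not capture. One form consistent with the three definitions on the slide is given below. It is written without a separate intercept term, which is an assumption; the original slide may differ.

```latex
\[
  Y_t \;=\; \sum_{j=0}^{r} b_j \, X_j , \qquad 0 \le j \le r < t
\]
```

Read this way, the stars observed in weeks 0 through r are the predictors, and the coefficients b_j are fitted so that their weighted sum approximates the star count at a later week t.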

  40. RQ #1. Prediction Examples

  41. STUDY #3
    Measuring Code Authorship: Algorithms and Applications
    Guilherme Avelino, Leonardo Passos, Andre Hora, Marco Tulio Valente

  42. Degree-of-Authorship (DOA) Metric
    • DOA(d, f), for a developer d and a file f, depends on three variables:
    – FA = 1 if d made the first commit in f; 0, otherwise
    – DL = number of further commits to f by d
    – AC = number of commits in f by other devs
    T. Fritz, et al. “Degree-of-knowledge: modeling a developer’s knowledge of code,” ACM TOSEM, 2014
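
The slide lists the three inputs but not how they are combined. The sketch below uses the linear model with a logarithmic penalty that is commonly quoted from the cited work by Fritz et al. The numeric weights do not appear on the slide, so treat them as an assumption rather than as part of the talk.

```python
import math


def degree_of_authorship(first_author, deliveries, acceptances):
    """DOA of a developer d for a file f.

    first_author: True if d made the first commit in f (FA)
    deliveries:   number of further commits to f by d (DL)
    acceptances:  number of commits in f by other developers (AC)

    The weights are assumed from the cited DOA model, not from the slide.
    """
    fa = 1 if first_author else 0
    return (
        3.293
        + 1.098 * fa
        + 0.164 * deliveries
        - 0.321 * math.log(1 + acceptances)
    )


# The creator of a file who also made 10 later commits, while others made 4.
print(round(degree_of_authorship(True, 10, 4), 3))
```

Under this model, creating a file and committing to it raise a developer's DOA, while commits by other developers lower it, with diminishing effect because of the logarithm.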

  43. Author Identification
    [Chart: Degree of Authorship by Developers]

  44. Author Identification
    [Chart: Degree of Authorship by Developers, with Authors marked]

  45. Author Identification
    [Chart: Degree of Authorship by Developers, with Authors marked]
    Threshold: 0.75 (empirically defined: 6 systems of different languages)
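
A minimal sketch of how a 0.75 threshold could separate authors from other developers of a file is shown below. It assumes that each developer's DOA is first normalized by the highest DOA in that file; the slide shows only the threshold, so the normalization step and the name `file_authors` are assumptions.

```python
def file_authors(doa_by_developer, threshold=0.75):
    """Developers whose normalized DOA for one file exceeds `threshold`.

    `doa_by_developer` maps developer name -> DOA value for the file.
    Values are normalized by the maximum DOA in the file (an assumption).
    """
    if not doa_by_developer:
        return []
    top = max(doa_by_developer.values())
    if top <= 0:
        return []
    return sorted(
        dev for dev, doa in doa_by_developer.items() if doa / top > threshold
    )


# Invented DOA values for one file.
print(file_authors({"alice": 8.0, "bob": 6.5, "carol": 3.0}))  # ['alice', 'bob']
```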

  46. Example: Linux Kernel

  47. Linux Kernel: Devs vs Authors
    8x
    8.5x

  48. Authors Ratio

  49. Linux Subsystems
    Specialist: 1 subsystem
    Generalist: At least 2 subsystems
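
The specialist and generalist definition above can be sketched as follows. Mapping a file to a subsystem by its top-level directory is an assumption made for this example; the slide does not say how Linux subsystems were delimited, and the helper names are hypothetical.

```python
def top_level_dir(path):
    """Hypothetical subsystem mapping: the first path component."""
    return path.split("/", 1)[0]


def classify_authors(authored_files, subsystem_of=top_level_dir):
    """Label each author by the number of subsystems they have files in.

    `authored_files` maps author -> list of file paths they author.
    One subsystem -> "specialist"; two or more -> "generalist".
    """
    labels = {}
    for author, files in authored_files.items():
        subsystems = {subsystem_of(path) for path in files}
        if not subsystems:
            continue  # authors without files are not classified
        labels[author] = "specialist" if len(subsystems) == 1 else "generalist"
    return labels


print(classify_authors({
    "a": ["drivers/net/x.c", "drivers/usb/y.c"],
    "b": ["fs/ext4/a.c", "mm/slab.c"],
}))
```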

  50. Specialists vs Generalists

  51. Application: Estimating Truck/Bus Factor

  52. Application: Truck/Bus Factor
    “The number of people on your team that have to be hit by a truck (or quit; or win in the lottery) before the project is in serious trouble”

  53. Estimating Truck Factor
    [Diagram: System’s Files, each labeled with its author (A1 five times, A2 twice, A3 twice, A4 to A10 once each), next to the list of Authors A1, A2, A3, A4, ..., An by Number of Files]

  54. Estimating Truck Factors
    [Diagram: same layout, one author marked X]

  55. Estimating Truck Factors
    [Diagram: same layout, two authors marked X]

  56. Estimating Truck Factors
    [Diagram: same layout, three authors marked X, 50% mark on System’s Files]

  57. Estimating Truck Factors
    [Diagram: same layout, three authors marked X, 50% mark on System’s Files]
    TF = 3
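
The diagram sequence above suggests a greedy procedure: repeatedly remove the author with the most files until more than 50% of the system's files have no author left, and report the number of authors removed. The sketch below follows that reading. The tie-breaking rule and the function name `truck_factor` are assumptions, and the file layout mirrors the first diagram (A1 with five files, A2 and A3 with two each, A4 to A10 with one each).

```python
def truck_factor(file_authors, coverage=0.5):
    """Greedy truck factor estimate.

    `file_authors` maps file path -> set of author names.
    Authors are removed, most files first, until more than `coverage`
    of the files have no remaining author. Returns (TF, removed authors).
    """
    remaining = {path: set(authors) for path, authors in file_authors.items()}
    total = len(remaining)
    removed = []
    while total:
        orphaned = sum(1 for authors in remaining.values() if not authors)
        if orphaned / total > coverage:
            break
        # Count how many files each remaining author still covers.
        counts = {}
        for authors in remaining.values():
            for author in authors:
                counts[author] = counts.get(author, 0) + 1
        if not counts:
            break
        # Ties are broken by author name (an arbitrary choice).
        top = max(sorted(counts), key=lambda a: counts[a])
        removed.append(top)
        for authors in remaining.values():
            authors.discard(top)
    return len(removed), removed


owners = ["A1"] * 5 + ["A2"] * 2 + ["A3"] * 2 + [f"A{i}" for i in range(4, 11)]
files = {f"file{i}": {owner} for i, owner in enumerate(owners)}
print(truck_factor(files))  # (3, ['A1', 'A2', 'A3'])
```

With 16 files, removing A1, A2 and A3 orphans 5, then 7, then 9 files, so the 50% mark is crossed at the third removal, matching the TF = 3 shown on the last diagram.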

  58. Dataset

  59. Results
    • 45 systems (34%) have TF = 1
    – mbostock/d3, less/less.js
    • 42 systems (31%) have TF = 2
    – clojure/clojure, cucumber/cucumber, jashkenas/backbone, elasticsearch/elasticsearch
    [ updated results: 12K out of the top-17K (72%) projects have TF = 1 ]

  60. Systems with highest TF

  61. Survey with Developers
    ▪ GitHub issues
    ▪ Opened: 114
    ▪ Response ratio: 54%

  62. Do developers agree that the TF authors are the main developers of their projects?

  63. Do developers agree that their projects will be in trouble if they lose the truck factor authors?

  64. What are the development practices that can attenuate the loss of top-ranked authors?

  65. http://gittrends.io

  66. http://gittrends.io

  68. Thanks!
    Marco Tulio Valente
    Applied Software Engineering Research Group
    Department of Computer Science
    Federal University of Minas Gerais, Brazil

  69. Ongoing study: why do OSS projects fail? (“dual study”)
    ● Top-5000 systems (stars)
    ● 540 systems without commits in the last year (fail?)
    ● 342 mails sent to main developer (public mail available)
    ● 94 answers (27.5%)
    ● Do you agree it is no longer under maintenance? Yes (78)
    ● Why did you stop maintaining the system? [37 answers]
    ○ Lack of time: 13
    ○ Project is completed: 6
    ○ Usurped by competitor: 5
    ○ Lack of interest: 5