Of Software Changes: Studying Pull Requests in GitHub

Presentation for the IFI Colloquium of the University of Zurich, November 2013.

http://www.ifi.uzh.ch/agenda/software-changes.html

Based on Georgios Gousios, Martin Pinzger, Arie van Deursen: An exploratory study of the pull-based software development model. ICSE 2014: 345-355

http://swerl.tudelft.nl/twiki/pub/Main/TechnicalReports/TUD-SERG-2014-005.pdf

Arie van Deursen

November 14, 2013

Transcript

  1. Arie van Deursen, Delft University of Technology
      Joint work with Georgios Gousios and Martin Pinzger
      Presentation at University of Zurich, IFI Colloquium, November 14, 2013
  2. The Software Engineering Research Group
      Education
      •  Programming, software engineering
      •  MSc, BSc projects
      Research
      •  Software evolution
      •  Software architecture
      •  Software testing
      •  Collaboration
      •  Software services
      •  End-user programming
      •  Language engineering
  3. The Software Evolution Coordinate System
      [Figure: axes "Business value" and "Agility"; versions v1 to v5]
  4. We Need to Study Software Change
      What can we learn from software repositories?
      [Figure: Requirements, Documentation, Issue Tracker, Test Reports, and Source Control along a Time axis]
  5. Change Granularity
      •  Edit
      •  Commit
      •  Bug fix
      •  Pull request
      •  User visible feature
      •  Project / API release
      •  Product delivery
  6. The (GitHub) Pull Request
      •  Offer a coherent set of changes to a project owner
      •  “I take responsibility for this change; are you willing to integrate it?”
      •  How successful is this model?
      •  What are the underlying success factors?
  7. Git Basics (I)
      •  System is collection of changes (commits)
      •  Each change has unique id (hash)
      •  Version is sequence of applied changes
         – Changes can be ‘replayed’
      •  Changes can be grouped into branches
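
The model on this slide can be sketched in a few lines. This is a toy illustration of "changes with hash ids that can be replayed", not Git's actual storage format; the `Change` class and `replay` function are hypothetical names introduced here.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """Toy commit: one file edit plus a link to the previous change."""
    parent: str   # id of the previous change ("" for the first one)
    path: str     # file touched by this change
    content: str  # new content of that file

    @property
    def id(self) -> str:
        # The id is a hash over the parent id and the change itself,
        # so the same edit on a different history gets a different id.
        data = f"{self.parent}\n{self.path}\n{self.content}".encode()
        return hashlib.sha1(data).hexdigest()


def replay(changes):
    """Apply changes in order; the resulting file map is the 'version'."""
    files = {}
    for change in changes:
        files[change.path] = change.content
    return files


c1 = Change("", "README", "hello")
c2 = Change(c1.id, "README", "hello world")
c3 = Change(c2.id, "LICENSE", "MIT")
version = replay([c1, c2, c3])
```

Replaying only a prefix of the list, for example `replay([c1])`, yields an earlier version of the same project.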
  8. Git Basics (II)
      •  Each dev maintains repository of (all) changes
      •  Devs can exchange changes via branches
      •  I can offer my branch to other developers
         – This is a request to pull changes from my repository
  9. Pull Request Process
  10. Research Questions
      1.  How popular is pull-based development?
      2.  What does the life cycle of a pull request look like?
      3.  What factors affect the decision to merge (accept) a pull request?
      4.  What factors affect the time it takes to decide to merge?
      5.  Why are some pull requests not merged?
  11. GitHub Popularity 2012 / 2013 (Our Dataset!)
      •  7 million repositories
      •  2.3 million users
         – changing 4.9 million repos
      •  1.9 million repos (45%) are originals
         – not forks
  12. Do All Repositories Use Pull Requests?
      •  18% of projects use shared repository model
         – No pull requests
         – Multiple committers to shared repository
      •  14% of projects use pull request model
         – At least one pull request
         – At least one commit not from team
      •  (Rest: single user repositories)
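
The criteria on this slide read like a small decision rule. A minimal sketch follows, assuming each project is described by its pull request count, the set of people who committed, and the set of team members; `classify_project` and its parameters are hypothetical, and the slides do not say how the study actually operationalized these checks.

```python
def classify_project(num_pull_requests, committers, team):
    """Assign a project to one of the development models named on the slide.

    committers: everyone with at least one commit in the repository.
    team: people with direct commit access (an assumption of this sketch).
    """
    committers = set(committers)
    external = committers - set(team)

    # Pull request model: at least one pull request and
    # at least one commit that did not come from the team.
    if num_pull_requests >= 1 and external:
        return "pull request"

    # Shared repository model: no pull requests,
    # but more than one person commits to the same repository.
    if num_pull_requests == 0 and len(committers) > 1:
        return "shared repository"

    # Everything else, which the slide summarizes as single user repositories.
    return "single user or other"
```

For example, `classify_project(3, {"alice", "bob", "carol"}, {"alice", "bob"})` returns `"pull request"` because `carol` committed without being on the team.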
  13. Use of pull requests helps to attract external committers
      [Figure: number of active committers per month by date (2002 to 2012) for the projects cakephp, facter, jquery, junit, monodevelop, phpbb3, puppet, and rubygems; annotation: "GitHub introduces pull requests"]
  14. #Pull requests per project
      •  Median = 2
      •  95th percentile = 21
      •  Rails / Homebrew: > 10,000 pull requests
      [Figure: histogram; x-axis: number of pull requests (log), y-axis: number of projects]
  15. Pull Request Sample
      Obtain better understanding of pull request intensive projects.
      Sample projects with:
      •  > 200 pull requests
      •  test suite
      •  Ruby, Python, Java, Scala
      •  At least one commit from a pull request
      •  Frameworks / applications (not doc)
      Resulting PR sample: 300 projects; 166,000 pull requests
  16. Detecting Merges
      •  Overall: ~85% of pull requests merged.
         – Mostly through GitHub web UI
      •  Alternative approaches:
         – Local git merge, then push
         – Cherry-picking
         – Squash / rebase
         – Apply patch locally, then push to GitHub.
      Developed heuristics to detect alternative ways to merge
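
The slide does not spell out the heuristics, so the following is only a sketch of what such detection could look like, under two assumptions made here: a pull request counts as merged if all of its commit hashes appear on the main branch, or if a main-branch commit message refers to it with a closing keyword such as "fixes #42". The function name and data shapes are hypothetical.

```python
import re

# Closing keywords followed by a pull request number, e.g. "fixes #42".
CLOSE_RE = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)", re.IGNORECASE
)


def detect_merge(pr_number, pr_commits, main_commits):
    """Return the name of the heuristic that fired, or None.

    pr_commits: list of commit hashes in the pull request.
    main_commits: dict mapping main-branch commit hash -> commit message.
    """
    # Heuristic 1: a local `git merge` followed by a push keeps the
    # original hashes, so they show up on the main branch unchanged.
    if pr_commits and all(sha in main_commits for sha in pr_commits):
        return "commits_in_main"

    # Heuristic 2: cherry-picks, squashes, and rebases create new hashes,
    # so fall back to commit messages that reference the pull request.
    for message in main_commits.values():
        if any(int(n) == pr_number for n in CLOSE_RE.findall(message)):
            return "closing_keyword"

    return None


main = {"a1": "Add parser", "b2": "Tidy up, fixes #42"}
```

With this data, `detect_merge(7, ["a1"], main)` fires the first heuristic and `detect_merge(42, ["zz"], main)` fires the second. A pull request that was squashed without any reference in the commit message would be missed by both.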
  17. Time to Merge
      •  95%: within 26 days
      •  90%: within 10 days
      •  80%: within 3.7 days
      •  30%: within one hour
      •  Close with merge: median 7 hours
      •  Close without merge: median 37 hours
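
Statements of the form "80% within 3.7 days" are percentiles of the time-to-merge distribution. A minimal sketch using the nearest-rank definition is shown below; the `merge_hours` values are made up for illustration and are not the study's data, and the slides do not say which percentile definition was used.

```python
def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it.

    p is an integer between 1 and 100.
    """
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical times to merge, in hours, for ten pull requests.
merge_hours = [0.2, 0.5, 0.9, 3, 7, 12, 30, 89, 240, 624]

within = {p: percentile(merge_hours, p) for p in (30, 50, 80, 95)}
```

In this made-up sample, 30% of the pull requests are merged within 0.9 hours and 80% within 89 hours, which mirrors the long-tailed shape the slide reports.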
  18. Tests & Pull Requests
      •  33% of pull requests change test code
      •  4% modify only test code
      •  Presence of tests in PR irrelevant. With and without:
         – 83% of pull requests accepted
         – Median time 7 hours
         – With tests it seems slower (may be due to size)
      [Figure: box plot of mergetime_minutes (log scale) by has_tests (FALSE, TRUE)]
  19. Pull Request Size
                          Median    80%    90%    95%
      # Commits                1      3      6     12
      # Files                  2      7     17     36
      # Lines changed         20    168    497   1227
      [Figure: histogram; x-axis: number of files changed by the pull request (log), y-axis: number of pull requests]
      [Figure: histogram; x-axis: lines of code changed in pull request (log), y-axis: number of pull requests]
  20. Discussions
      •  # comments: 95%: < 12; 80%: < 4
      •  # participants: 95%: < 4
      •  # comments: (weak) correlation with
         – Time to merge (rho = 0.48)
         – Time to close non-merged (rho = 0.37)
      •  99.89% of comments come from devs who contributed at least one commit to the repo
      [Figure: histogram; x-axis: number of code review and discussion comments (log), y-axis: number of pull requests]
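
The "rho" values on this slide suggest a rank correlation such as Spearman's, although the slide does not name the coefficient. A minimal sketch of Spearman's rho in plain Python follows: rank both variables (ties get their average rank), then take the Pearson correlation of the ranks. The sample numbers are invented for illustration.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


def spearman_rho(xs, ys):
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical pull requests: number of comments and hours until merge.
comments = [0, 1, 2, 5, 9]
hours_to_merge = [1, 3, 2, 40, 30]
rho = spearman_rho(comments, hours_to_merge)
```

Because only ranks matter, one pull request with an extreme merge time does not dominate the result, which suits the heavy-tailed distributions shown in the deck.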
  21. Inline Code Review
      •  12% of pull requests have at least one code comment
      •  Existence of code review does not affect decision to merge (84%)
      •  Slows down accept time by factor of 10
         – From 5h median to 50h.
         – (then likely includes additional commits)
      •  Happens most on larger projects (team, code)
  22. Which Factors Affect Merge Decision / Decision Time?
      •  Determine key characteristics
      •  Use machine learning to derive classifier.
  23. Classification Approach
      •  Select classifier, derive model, evaluate stability
         – 10-fold random selection cross validation on all data
         – Explore naïve Bayes, random forests, regression
         – Random forests most stable in terms of AUC and ACC
      •  Determine model variable importance
         – Run random forest with maxed out parameters (depth 5, 2000 trees), 50 times
         – Average parameter importance over iterations
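
The evaluation loop described on this slide can be sketched without any library. The code below shows 10-fold cross-validation scored by AUC; it is not the study's pipeline. Two things are assumptions of this sketch: the folds are stratified so that every fold contains both classes (the slide only says "random selection"), and a trivial one-feature scorer stands in for the random forest. All function names and the sample data are hypothetical.

```python
import random


def stratified_folds(labels, k, seed=0):
    """Split indices into k folds, dealing each class out round-robin."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (True, False):
        members = [i for i, label in enumerate(labels) if bool(label) == cls]
        rng.shuffle(members)
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds


def auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for label, s in zip(labels, scores) if label]
    neg = [s for label, s in zip(labels, scores) if not label]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def fit_direction(xs, ys):
    """Stand-in model: score by the feature, flipped if positives are lower."""
    pos = [x for x, y in zip(xs, ys) if y]
    neg = [x for x, y in zip(xs, ys) if not y]
    sign = 1 if sum(pos) / len(pos) >= sum(neg) / len(neg) else -1
    return lambda x: sign * x


def cross_validate(xs, ys, fit, k=10, seed=0):
    """Train on k-1 folds, compute AUC on the held-out fold, repeat k times."""
    results = []
    for test_idx in stratified_folds(ys, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores = [model(xs[i]) for i in test_idx]
        results.append(auc([ys[i] for i in test_idx], scores))
    return results


# Hypothetical, perfectly separable data: 20 merged and 20 unmerged pull requests.
features = list(range(100, 120)) + list(range(0, 20))
merged = [True] * 20 + [False] * 20
fold_aucs = cross_validate(features, merged, fit_direction)
```

Plotting the per-fold scores for several classifiers side by side is one way to compare their stability across runs.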
  24. Stability Plots
      [Figure: merge decision task, cross validation (166884 items): auc, acc, prec, and rec per run for the classifiers binlogregr, naive bayes, and randomforest]
      [Figure: merge time task, cross validation (3 classes, 166884 items): auc and acc per run for the classifiers multinomregr, naivebayes, and randomforest]
  25. Merge Decision
      [Figure annotation: commits on files touched in previous 3 months]
  26. Merge Time
      [Figure annotations: developer’s past performance; size of the full project; test code density in project]
  27. Summary
      •  14% of GitHub projects use pull requests
         – small; processed in < 1 day
      •  Decision to merge:
         – depends on how hot area of code is
      •  Time to merge:
         – depends on test density of project
      •  Pull request rejection:
         – caused by (lack of) task articulation
  28. Implications
      •  Contributors:
         – Keep it short; keep it hot
         – Figure out what others are doing
      •  Core team:
         – Invest in test suite: it speeds up merging
         – Clarify what everyone is doing!
      •  Research:
         – Incorporate in recommendation tool
         – Integrate task articulation in pull-based model