Assessing Code Authorship: The Case of the Linux Kernel (OSS 2017)

Code authorship is key information in large-scale open-source systems. Among other uses, it allows maintainers to assess the division of work and identify key collaborators. Interestingly, open-source communities lack guidelines on how to manage authorship. This could be mitigated by setting out to build an empirical body of knowledge on how authorship-related measures evolve in successful open-source communities. Toward that goal, we perform a case study on the Linux kernel. Our results show that: (a) only a small portion of developers (26%) makes significant contributions to the code base; (b) the distribution of the number of files per author is highly skewed: a small group of top authors (3%) is responsible for hundreds of files, while most authors (75%) are responsible for at most 11 files; (c) most authors (62%) have a specialist profile; (d) authors with a high number of co-authorship connections tend to collaborate with others who have fewer connections.

ASERG, DCC, UFMG

May 22, 2017

Transcript

  1. Assessing Code Authorship: The Case of the Linux Kernel
      Guilherme Avelino, UFMG/UFPI, BR
      Leonardo Passos, Univ. Waterloo, CA
      Andre Hora, UFMS, BR
      Marco Tulio Valente, @mtov, UFMG, BR
      OSS 2017 - Buenos Aires, Argentina
  2. Authorship is precisely documented in most intellectual work
  3. Books
  4. Songs
  5. Scientific papers
  6. Source code?
      • Author names are not stamped in the code
      • Authorship can evolve with time
  7. Open-source software?
      OSS can have thousands of contributors
  8. In this paper
      • We describe the use of a metric for identifying source code authors from commit histories
      • We use this metric to
        – identify the Linux kernel authors
        – reveal many properties of the teams involved in the Linux project over time
      • We focus on Linux due to its relevance
  9. Part #1: Identifying Linux authors
  10. We all know the main author
      But the kernel (ver. 4.7) has 13,435 other contributors. Should all of them be listed as Linux’s authors?
  11. Author definition
      The ones who made significant changes to at least one Linux source code file
  12. Authors in our study
      The ones who made significant changes to at least one Linux source code file
      What is a significant change?
  13. Degree-of-Authorship (DOA) metric
      • Computed for a developer d on a file f
      • If d created f,
        – DOA(d,f) is initialized with a non-zero constant
        – otherwise, DOA(d,f) = 0
      • After each commit on f by d,
        – DOA(d,f) is incremented by a factor
      • After each commit on f by another developer,
        – DOA(d,f) is decremented by another factor
      Fritz, T., et al. Degree-of-knowledge: modeling a developer’s knowledge of code. ACM TOSEM 2014.
      Fritz, T., et al. A degree-of-knowledge model to capture source code familiarity. ICSE 2010.
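
The update rule on this slide can be sketched in a few lines of Python. This is only an illustration of the slide's simplified description: the function name `doa_scores` and the three weights (`creation_bonus`, `own_gain`, `other_penalty`) are hypothetical placeholders, not the calibrated values from the cited Fritz et al. papers, and the sketch assumes the first commit in the list is the one that created the file.

```python
from collections import Counter


def doa_scores(commit_authors, creation_bonus=3.0, own_gain=1.0, other_penalty=0.25):
    """Return {developer: DOA} for one file.

    commit_authors lists the developer of each commit in order; the first
    entry is assumed to be the commit that created the file.
    All three weights are illustrative placeholders, not published values.
    """
    if not commit_authors:
        return {}
    creator = commit_authors[0]
    # Commits made after the creating commit, counted per developer.
    changes = Counter(commit_authors[1:])
    total_changes = sum(changes.values())
    scores = {}
    for dev in set(commit_authors):
        own = changes[dev]
        by_others = total_changes - own
        base = creation_bonus if dev == creator else 0.0
        # Own commits raise the score, commits by other developers lower it.
        scores[dev] = base + own_gain * own - other_penalty * by_others
    return scores


if __name__ == "__main__":
    history = ["joao", "joao", "maria", "joao"]
    print(doa_scores(history))
```

With these placeholder weights, the creator who keeps committing ends up with a much higher score than a developer who touched the file once.
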
  14. DOA Normalization
      • Suppose a file f:
        – DOA(Joao, f) = 20
        – DOA(Maria, f) = 15
        – DOA(Jose, f) = 10
      • We use normalized DOA values (each value divided by the highest DOA on f):
        – DOA(Joao, f) = 20 / 20 = 1
        – DOA(Maria, f) = 15 / 20 = 0.75
        – DOA(Jose, f) = 10 / 20 = 0.5
  15. Authors
      • d is an author of f if the normalized DOA(d,f) ≥ 0.75
      • In our example, only Joao and Maria are authors of f
        – DOA(Joao, f) = 20 / 20 = 1
        – DOA(Maria, f) = 15 / 20 = 0.75
        – DOA(Jose, f) = 10 / 20 = 0.5
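
The normalization and threshold from the two slides above can be reproduced with the slide's own numbers. The helper names `normalize` and `authors_of` are made up for this sketch, and it assumes the highest raw DOA on the file is positive.

```python
def normalize(doa):
    """Divide every raw DOA value by the highest value on the file.

    Assumes at least one developer and a positive maximum.
    """
    top = max(doa.values())
    return {dev: value / top for dev, value in doa.items()}


def authors_of(doa, threshold=0.75):
    """Developers whose normalized DOA reaches the threshold."""
    return sorted(dev for dev, value in normalize(doa).items() if value >= threshold)


if __name__ == "__main__":
    raw = {"Joao": 20, "Maria": 15, "Jose": 10}
    print(normalize(raw))   # {'Joao': 1.0, 'Maria': 0.75, 'Jose': 0.5}
    print(authors_of(raw))  # ['Joao', 'Maria']
```
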
  16. (A note: all weights, constants, and thresholds are validated elsewhere)
      Fritz, T., et al. Degree-of-knowledge: modeling a developer’s knowledge of code. ACM TOSEM 2014.
      Fritz, T., et al. A degree-of-knowledge model to capture source code familiarity. ICSE 2010.
      Avelino, G., et al. A novel approach for estimating truck factors. ICPC 2016.
      Ferreira, M., et al. A comparison of three algorithms for computing truck factors. ICPC 2017.
  17. Part #2: We analyze the evolution of the Linux kernel using code authorship (i.e., DOA) measures
  18. Research Questions
      1. What is the proportion of authors/developers?
      2. What is the distribution of files per author?
      3. How specialized is the work of Linux authors?
      4. What are the properties of the Linux co-authorship network?
  19. Linux kernel versions
      • 56 stable releases (v2.6.12 to v4.7)
      • Spanning 11 years (June 2005 to July 2016)
  20. RQ1. Authors/Developers
      Linux (ver. 4.7) has 13K developers, but only 26% are authors
      [Chart legend: authors, minor collaborators]
  21. RQ2. Files/Authors
      • Considering only authors:
        – 50% are responsible for at most 3 files
        – 75% are responsible for at most 11 to 16 files, depending on the release
      • Authors with more than 100 files:
        – Always fewer than 7% of authors
  22. RQ2. Torvalds’ authorship over time
      45% (first release) to 9% (last release)
  23. RQ2. Gini coefficients
      • We also use the Gini coefficient to reason about the “inequality” of the files/authors distribution
      • Suppose a system with 100 files and 10 authors
        – Each author has exactly 10 files: Gini = 0.0
        – One author has 91 files and the others have only one file each: Gini ~ 1.0
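
A short sketch of one common way to compute the Gini coefficient, applied to the two distributions from this slide. The slides do not say which estimator was used, so the formula below (the standard population form over sorted values) is an assumption. With it, the second case evaluates to 0.81, close to the maximum of 0.9 attainable with ten authors, rather than exactly 1.0.

```python
def gini(values):
    """Gini coefficient of non-negative values (population formula).

    Returns 0.0 for perfect equality and approaches (n - 1) / n when a
    single item holds everything.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Sum of rank * value over the ascending-sorted values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


if __name__ == "__main__":
    equal = [10] * 10          # each of 10 authors has exactly 10 files
    skewed = [91] + [1] * 9    # one author has 91 files, the rest one each
    print(round(gini(equal), 2))   # 0.0
    print(round(gini(skewed), 2))  # 0.81
```
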
  24. RQ2. Gini coefficients
      Linux is not a “perfect society” in terms of files/author, but it is slowly becoming less centralized
      Gini ≥ 0.78
  25. RQ3. How specialized is the work of Linux authors?
      • Specialists:
        – if he/she authors files in a single subsystem
      • Generalists:
        – if he/she authors files in at least two subsystems
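
The specialist/generalist split follows directly from the definition above. In this sketch the input is assumed to be a list of (author, subsystem) pairs, one per authored file; how a file path maps to a subsystem is not shown on the slides, so that step is left out. The author names and the function name `classify` are invented for illustration.

```python
from collections import defaultdict


def classify(authored):
    """Label each author as 'specialist' or 'generalist'.

    authored: iterable of (author, subsystem) pairs, one per authored file.
    """
    subsystems = defaultdict(set)
    for author, subsystem in authored:
        subsystems[author].add(subsystem)
    return {
        author: "specialist" if len(seen) == 1 else "generalist"
        for author, seen in subsystems.items()
    }


if __name__ == "__main__":
    sample = [
        ("ana", "drivers"),
        ("ana", "drivers"),
        ("bob", "core"),
        ("bob", "fs"),
    ]
    print(classify(sample))  # {'ana': 'specialist', 'bob': 'generalist'}
```
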
  26. RQ3. Specialists vs Generalists
  27. RQ3. Results per subsystem
      • Core has the highest ratio of generalists (87%)
        – They have expertise in Linux’s central features, which allows them to work on other subsystems
      • Drivers has the highest ratio of specialists (+50%)
        – Drivers are independent from other subsystems
  28. RQ4. Linux Co-authorship network
      • In our model, files can have multiple authors
      • Co-authorship network
        – Nodes are authors
        – Edges connect co-authors of at least one file
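
A minimal sketch of how such a network could be built, assuming the authors of each file are already known (for example, from a DOA threshold). The file names, author names, and the function name `coauthorship_network` are hypothetical. The network is kept as an adjacency map, so the degree of an author is simply the number of distinct co-authors.

```python
from itertools import combinations


def coauthorship_network(file_authors):
    """Build {author: set of co-authors} from {file: list of authors}.

    Authors who share no file with anyone remain as isolated nodes.
    """
    graph = {}
    for authors in file_authors.values():
        unique = sorted(set(authors))
        for author in unique:
            graph.setdefault(author, set())
        # Connect every pair of authors of the same file.
        for a, b in combinations(unique, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


if __name__ == "__main__":
    files = {
        "sched.c": ["ana", "bob"],
        "mm.c": ["ana", "carl"],
        "usb.c": ["dana"],
    }
    network = coauthorship_network(files)
    print({author: len(peers) for author, peers in network.items()})
    # {'ana': 2, 'bob': 1, 'carl': 1, 'dana': 0}
```
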
  29. RQ4. Linux Co-authorship network
      Torvalds’ degree = 215
  30. RQ4. Linux Co-authorship network
      • Mean degree: 3.64
  31. RQ4. Linux Co-authorship network
      • Assortativity coefficient
  32. RQ4. Linux Co-authorship network
      • Assortativity coefficient
      Linux: expert authors (many connections) work with less skilled authors (few connections)
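
Degree assortativity is commonly defined as the Pearson correlation between the degrees found at the two ends of each edge; the slides do not state which implementation was used, so the sketch below is an assumption based on that definition. It expects an undirected edge list without duplicates or self-loops. A negative result corresponds to the pattern described here, where highly connected authors are linked to authors with few connections.

```python
from statistics import mean


def degree_assortativity(edges):
    """Pearson correlation of endpoint degrees over an undirected edge list.

    Returns NaN when there are no edges or every endpoint has the same
    degree, since the correlation is undefined in those cases.
    """
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    # Count each undirected edge in both directions so the measure is symmetric.
    xs, ys = [], []
    for a, b in edges:
        xs += [degree[a], degree[b]]
        ys += [degree[b], degree[a]]
    if not xs:
        return float("nan")
    centre = mean(xs)  # equals mean(ys) by symmetry
    covariance = sum((x - centre) * (y - centre) for x, y in zip(xs, ys))
    variance = sum((x - centre) ** 2 for x in xs)
    if variance == 0:
        return float("nan")
    return covariance / variance


if __name__ == "__main__":
    # A star: one hub connected to four otherwise unconnected authors.
    star = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d")]
    print(degree_assortativity(star))  # -1.0
```

The star network is the extreme disassortative case: every edge joins the best-connected node to a node of degree one.
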
  33. Conclusions
  34. Contributions
      • We revealed many properties and characteristics of the Linux project, using source code authorship measures
      • We proposed a conceptual framework for assessing authorship of software projects (authors, specialists, co-authors, etc.)
  35. Future Work
      • Replication in other open- and closed-source systems
  36. Thanks!
      mtov@dcc.ufmg.br
      @mtov
      aserg.labsoft.dcc.ufmg.br