
Assessing Code Authorship: The Case of the Linux Kernel (OSS 2017)

Code authorship is key information in large-scale open-source systems. Among other uses, it allows maintainers to assess the division of work and identify key collaborators. Interestingly, open-source communities lack guidelines on how to manage authorship. This could be mitigated by building an empirical body of knowledge on how authorship-related measures evolve in successful open-source communities. Toward that goal, we perform a case study on the Linux kernel. Our results show that: (a) only a small portion of developers (26%) makes significant contributions to the code base; (b) the distribution of the number of files per author is highly skewed: a small group of top authors (3%) is responsible for hundreds of files, while most authors (75%) are responsible for at most 11 files; (c) most authors (62%) have a specialist profile; (d) authors with a high number of co-authorship connections tend to collaborate with others with fewer connections.

ASERG, DCC, UFMG

May 22, 2017

Transcript

  1. Assessing Code Authorship: The Case of the Linux Kernel
    Guilherme Avelino, UFMG/UFPI, BR
    Leonardo Passos, Univ. Waterloo, CA
    Andre Hora, UFMS, BR
    Marco Tulio Valente, @mtov, UFMG, BR
    OSS 2017 - Buenos Aires, Argentina

  2. Authorship is precisely documented in
    most intellectual work

  3. Books

  4. Songs

  5. Scientific papers

  6. Source code?
    • Author names are not stamped in the code
    • Authorship can evolve with time

  7. Open-source software?
    OSS can have thousands of contributors

  8. In this paper
    • We describe the use of a metric for identifying
    source code authors from commit histories
    • We use this metric to
    – identify the Linux kernel authors
    – reveal many properties of the teams involved
    in the Linux project over time
    • We focus on Linux due to its relevance

  9. Part #1: Identifying Linux authors

  10. We all know the main author
    But the kernel (ver. 4.7) has 13,435 other
    contributors. Should all of them be listed as
    Linux’s authors?

  11. Author definition
    Those who made significant changes
    to at least one Linux source code file

  12. Authors in our study
    Those who made significant changes
    to at least one Linux source code file

    What is a significant change?

  13. Degree-of-Authorship (DOA) metric
    • Computed for a developer d on a file f
    • If d created f,
    – DOA(d,f) is initialized with a non-zero constant
    – otherwise, DOA(d,f) = 0
    • After each commit on f by d,
    – DOA(d,f) is incremented by a factor
    • After each commit on f by another developer,
    – DOA(d,f) is decremented by another factor
    Fritz, T., et al. Degree-of-knowledge: modeling a developer’s knowledge of code. ACM TOSEM 2014.
    Fritz, T., et al. A degree-of-knowledge model to capture source code familiarity. ICSE 2010.
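The update rules on this slide can be sketched as a small function. This is only an illustration: the slide does not give the weights or the exact functional form (those are defined in the cited Fritz et al. papers), so the three constants below are hypothetical and the plain linear update is an assumption.

```python
# Hypothetical constants, chosen only for illustration. The validated
# weights and the exact functional form are in the cited papers.
CREATION_CONSTANT = 1.0   # initial value when d created f
OWN_COMMIT_FACTOR = 0.5   # added per commit on f by d
OTHER_COMMIT_FACTOR = 0.25  # subtracted per commit on f by another developer


def doa(created_file: bool, own_commits: int, other_commits: int) -> float:
    """Raw (non-normalized) degree of authorship of developer d on file f.

    created_file:  True if d created f
    own_commits:   number of later commits on f made by d
    other_commits: number of commits on f made by other developers
    """
    value = CREATION_CONSTANT if created_file else 0.0
    value += OWN_COMMIT_FACTOR * own_commits
    value -= OTHER_COMMIT_FACTOR * other_commits
    return value


# The creator made 3 more commits; other developers made 2 commits.
print(doa(True, 3, 2))   # 2.0
# A developer who never touched the file, which received 4 commits.
print(doa(False, 0, 4))  # -1.0
```

The point of the sketch is the shape of the rule: creating a file and committing to it raise a developer's DOA, while commits by others lower it.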

  14. DOA Normalization
    • Suppose a file f:
    – DOA(Joao, f) = 20
    – DOA(Maria, f) = 15
    – DOA(Jose, f) = 10
    • We use normalized DOA values (each value divided
    by the highest DOA on the file):
    – DOA(Joao, f) = 20 / 20 = 1
    – DOA(Maria, f) = 15 / 20 = 0.75
    – DOA(Jose, f) = 10 / 20 = 0.5
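Written as a formula, the normalization in this example appears to divide each developer's raw DOA by the largest DOA on the same file. The subscript "norm" and the symbol d' (ranging over all developers who touched f) are notation introduced here, not taken from the slide.

```latex
\[
\mathrm{DOA}_{\mathrm{norm}}(d, f) \;=\; \frac{\mathrm{DOA}(d, f)}{\max_{d'} \mathrm{DOA}(d', f)},
\qquad
\text{e.g. } \mathrm{DOA}_{\mathrm{norm}}(\text{Maria}, f) = \frac{15}{20} = 0.75 .
\]
```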

  15. Authors
    • d is an author of f if the normalized DOA(d,f) ≥ 0.75
    • In our example, only Joao and Maria are
    authors of f
    – DOA(Joao, f) = 20 / 20 = 1
    – DOA(Maria, f) = 15 / 20 = 0.75
    – DOA(Jose, f) = 10 / 20 = 0.5
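A minimal sketch of the author test, using the three raw DOA values from the example. The function name `authors_of` is hypothetical, and the sketch assumes at least one developer has a positive raw DOA on the file.

```python
AUTHOR_THRESHOLD = 0.75  # normalized DOA threshold stated on the slide


def authors_of(raw_doa: dict) -> list:
    """Return the developers whose normalized DOA reaches the threshold.

    raw_doa maps developer name -> raw DOA on a single file.
    Assumes the highest raw DOA is positive.
    """
    top = max(raw_doa.values())
    return [dev for dev, value in raw_doa.items()
            if value / top >= AUTHOR_THRESHOLD]


file_doa = {"Joao": 20, "Maria": 15, "Jose": 10}
print(authors_of(file_doa))  # ['Joao', 'Maria']
```

Because the comparison is inclusive, Maria (exactly 0.75) qualifies while Jose (0.5) does not, matching the slide.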

  16. (A note: all weights, constants, and
    thresholds are validated elsewhere)
    Fritz, T., et al. Degree-of-knowledge: modeling a developer’s knowledge of code. ACM TOSEM 2014.
    Fritz, T., et al. A degree-of-knowledge model to capture source code familiarity. ICSE 2010.
    Avelino, G., et al. A novel approach for estimating truck factors. ICPC 2016.
    Ferreira, M., et al. A comparison of three algorithms for computing truck factors. ICPC 2017.

  17. Part #2: we analyze the evolution of
    the Linux kernel using code authorship
    (i.e., DOA) measures

  18. Research Questions
    1. What is the proportion of authors among developers?
    2. What is the distribution of files per author?
    3. How specialized is the work of Linux authors?
    4. What are the properties of the Linux co-authorship
    network?

  19. Linux kernel versions
    • 56 stable releases (v2.6.12–v4.7)
    • Spanning 11 years (June 2005–July 2016)

  20. RQ1. Authors/Developers
    Linux (ver. 4.7) has 13K developers,
    but only 26% are authors
    (chart legend: authors, minor collaborators)

  21. RQ2. Files/Authors
    • Considering only authors:
    – 50% are responsible for at most 3 files
    – 75% are responsible for at most 11 to 16 files
    • Authors with more than 100 files:
    – Always below 7%
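Statements such as "50% are responsible for at most 3 files" are shares computed over the files-per-author counts. A minimal sketch follows; the helper names and the sample counts are hypothetical and are not Linux data.

```python
def share_at_most(files_per_author, limit):
    """Fraction of authors responsible for at most `limit` files."""
    return sum(1 for n in files_per_author if n <= limit) / len(files_per_author)


def share_above(files_per_author, limit):
    """Fraction of authors responsible for more than `limit` files."""
    return sum(1 for n in files_per_author if n > limit) / len(files_per_author)


# Hypothetical file counts for ten authors (illustration only).
counts = [1, 1, 2, 3, 3, 5, 9, 14, 60, 250]
print(share_at_most(counts, 3))   # 0.5
print(share_above(counts, 100))   # 0.1
```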

  22. RQ2. Torvalds’ authorship over time
    45% (first release) to 9% (last release)

  23. RQ2. Gini coefficients
    • We also use Gini to reason about the
    “inequality” of the files/authors distribution
    • Suppose a system with 100 files and 10 authors
    – Each author has exactly 10 files: Gini = 0.0
    – One author has 91 files and the others have only
    one file each: Gini is high (toward 1.0)
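The two scenarios can be checked with a short Gini computation. The sketch below uses the standard population formula over sorted values, which is an assumption: the slide does not say which Gini variant the study uses. Under this formula the skewed scenario evaluates to about 0.81, and with only 10 authors the coefficient cannot exceed 0.9, so values near 1.0 are only reachable with many more authors.

```python
def gini(values):
    """Population Gini coefficient of non-negative values (0 = perfect equality)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


equal = [10] * 10          # each of 10 authors has exactly 10 files
skewed = [1] * 9 + [91]    # one author has 91 files, the other nine have 1 each

print(gini(equal))              # 0.0
print(round(gini(skewed), 2))   # 0.81
```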

  24. RQ2. Gini coefficients
    Linux is not a “perfect society” in terms of files/author,
    but it is slowly becoming less centralized
    Gini ≥ 0.78

  25. RQ3. How specialized is the work of
    Linux authors?
    • Specialists:
    – authors of files in a single subsystem
    • Generalists:
    – authors of files in at least two subsystems
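The two profiles reduce to counting the distinct subsystems in which a person authors files. A minimal sketch, assuming the mapping from files to subsystems has already been done (the slide does not describe it); the function name, the author names and the subsystem labels in the example are hypothetical.

```python
def classify_authors(author_subsystems: dict) -> dict:
    """Label each author as 'specialist' or 'generalist'.

    author_subsystems maps author name -> set of subsystems in which
    the author has at least one authored file. Every author is assumed
    to have at least one file, so every set is non-empty.
    """
    return {
        author: "specialist" if len(subsystems) == 1 else "generalist"
        for author, subsystems in author_subsystems.items()
    }


# Hypothetical input, for illustration only.
profiles = classify_authors({
    "Joao": {"drivers"},
    "Maria": {"core", "fs"},
})
print(profiles)  # {'Joao': 'specialist', 'Maria': 'generalist'}
```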

  26. RQ3. Specialists vs Generalists

  27. RQ3. Results per subsystem
    • Core has the highest ratio of generalists (87%)
    – They have expertise in Linux’s central features,
    which allows them to work on other subsystems
    • Drivers has the highest ratio of specialists
    (over 50%)
    – Drivers are independent of other subsystems

  28. RQ4. Linux Co-authorship network
    • In our model, files can have multiple authors
    • Co-authorship network
    – Nodes are authors
    – Edges connect co-authors of at least one file
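The network definition can be sketched with the standard library alone. The function names, file names and author names are hypothetical; the input is assumed to be a mapping from each file to its authors (for example, developers who pass the DOA threshold).

```python
from itertools import combinations


def coauthorship_network(file_authors: dict):
    """Build an undirected co-authorship network.

    file_authors maps file name -> list of its authors.
    Returns (nodes, edges): nodes is a set of authors, edges is a set of
    (a, b) pairs with a < b, one per pair sharing at least one file.
    """
    nodes, edges = set(), set()
    for authors in file_authors.values():
        unique = sorted(set(authors))
        nodes.update(unique)
        edges.update(combinations(unique, 2))
    return nodes, edges


def mean_degree(nodes, edges) -> float:
    """Average number of co-authors per author (each edge adds 2 to the degree sum)."""
    return 2 * len(edges) / len(nodes) if nodes else 0.0


nodes, edges = coauthorship_network({
    "a.c": ["Joao", "Maria"],
    "b.c": ["Maria", "Jose"],
    "c.c": ["Ana"],          # sole author: a node with no edges
})
print(sorted(edges))              # [('Joao', 'Maria'), ('Jose', 'Maria')]
print(mean_degree(nodes, edges))  # 1.0
```

Sorting each author list before pairing keeps every undirected edge in a single canonical order, so a pair that shares several files is counted once.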

  29. RQ4. Linux Co-authorship network
    Torvalds’ degree = 215

  30. RQ4. Linux Co-authorship network
    • Mean degree: 3.64

  31. RQ4. Linux Co-authorship network
    • Assortativity coefficient

  32. RQ4. Linux Co-authorship network
    • Assortativity coefficient
    Linux: expert authors (many connections) work with
    less skilled authors (few connections)
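Degree assortativity is commonly computed as the Pearson correlation between the degrees found at the two ends of each edge. A negative value means highly connected nodes tend to link to sparsely connected ones, which is the pattern described for Linux. The sketch below is a plain standard-library version under that definition; the function name is hypothetical, the slide does not say which tool the study used, and the edge list is assumed to hold unique undirected edges without self-loops.

```python
def degree_assortativity(edges):
    """Pearson correlation of the degrees at both ends of each edge.

    edges: iterable of (a, b) pairs, each undirected edge listed once.
    Returns a value in [-1, 1]. Raises ValueError when every endpoint
    has the same degree, since the correlation is undefined.
    """
    edges = list(edges)
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1

    # Count every undirected edge in both directions so the measure is symmetric.
    xs, ys = [], []
    for a, b in edges:
        xs += [degree[a], degree[b]]
        ys += [degree[b], degree[a]]

    n = len(xs)
    mean = sum(xs) / n  # xs and ys hold the same values, so they share one mean
    variance = sum((x - mean) ** 2 for x in xs) / n
    if variance == 0:
        raise ValueError("assortativity is undefined when all degrees are equal")
    covariance = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    return covariance / variance


# A star: one hub connected to three otherwise isolated authors.
star = [("hub", "a"), ("hub", "b"), ("hub", "c")]
print(degree_assortativity(star))  # -1.0
```

The star is the extreme disassortative case: the only high-degree node connects exclusively to degree-1 nodes.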

  33. Conclusions

  34. Contributions
    • We revealed many properties and
    characteristics of the Linux project, using
    source code authorship measures
    • We proposed a conceptual framework for
    assessing authorship of software projects
    (authors, specialists, co-authors, etc.)

  35. Future Work
    • Replication in other open- and closed-source systems

  36. Thanks!
    [email protected]
    @mtov
    aserg.labsoft.dcc.ufmg.br