Assessing Code Authorship: The Case of the Linux Kernel (OSS 2017)

Guilherme Avelino, UFMG/UFPI, BR
Leonardo Passos, Univ. Waterloo, CA
Andre Hora, UFMS, BR
Marco Tulio Valente, @mtov, UFMG, BR
OSS 2017 - Buenos Aires, Argentina

Authorship is precisely documented in most intellectual work

Books

Songs

Scientific papers

Source code?
•  Author names are not stamped in the code
•  Authorship can evolve with time

Open-source software?
OSS can have thousands of contributors

In this paper
•  We describe the use of a metric for identifying source code authors from commit histories
•  We use this metric to
   – identify the Linux kernel authors
   – reveal many properties of the teams involved in the Linux project over time
•  We focus on Linux due to its relevance

Part #1: Identifying Linux authors

We all know the main author. But the kernel (ver. 4.7) has 13,435 other contributors. Should all of them be listed as Linux’s authors?

Authors definition
The ones who made significant changes to at least one Linux source code file

Authors in our study
The ones who made significant changes to at least one Linux source code file
What is a significant change?

Degree-of-Authorship (DOA) metric
•  Computed for a developer d on a file f
•  If d created f,
   – DOA(d,f) is initialized with a non-zero constant
   – otherwise, DOA(d,f) = 0
•  After each commit on f by d,
   – DOA(d,f) is incremented by a factor
•  After each commit on f by another dev,
   – DOA(d,f) is decremented by another factor
Fritz, T., et al. Degree-of-knowledge: modeling a developer’s knowledge of code. ACM TOSEM 2014.
Fritz, T., et al. A degree-of-knowledge model to capture source code familiarity. ICSE 2010.
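The update rules on this slide can be sketched in a few lines of Python. This is a minimal illustration only: the slide does not state the constants, so `INITIAL_IF_CREATOR`, `OWN_COMMIT_BONUS`, and `OTHER_COMMIT_PENALTY` are hypothetical values, the `(developer, file, created)` history format is an assumption, and the additive updates simply follow the slide's wording rather than the validated model in the cited papers.

```python
from collections import defaultdict

# Hypothetical weights for illustration only; the slides do not state them.
INITIAL_IF_CREATOR = 3.0    # non-zero constant granted to the file creator
OWN_COMMIT_BONUS = 1.0      # added after each later commit by the developer
OTHER_COMMIT_PENALTY = 0.3  # subtracted after each commit by someone else


def degree_of_authorship(history):
    """Compute DOA(d, f) for every (developer, file) pair.

    `history` is a chronologically ordered list of (developer, file, created)
    tuples, where `created` is True for the commit that added the file.
    The creating commit only initializes the score; it earns no bonus.
    """
    doa = defaultdict(float)          # (developer, file) -> score, default 0
    devs_per_file = defaultdict(set)  # file -> developers seen so far

    for dev, path, created in history:
        # Every other developer already seen on this file loses some authorship.
        for other in devs_per_file[path]:
            if other != dev:
                doa[(other, path)] -= OTHER_COMMIT_PENALTY
        devs_per_file[path].add(dev)
        if created:
            doa[(dev, path)] = INITIAL_IF_CREATOR
        else:
            doa[(dev, path)] += OWN_COMMIT_BONUS
    return dict(doa)
```

In practice the history would be extracted from the version control log; here it is passed in directly so the sketch stays self-contained.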

DOA Normalization
•  Suppose a file f:
   – DOA(Joao, f) = 20
   – DOA(Maria, f) = 15
   – DOA(Jose, f) = 10
•  We use normalized DOA values:
   – DOA(Joao, f) = 20 / 20 = 1
   – DOA(Maria, f) = 15 / 20 = 0.75
   – DOA(Jose, f) = 10 / 20 = 0.5

Authors
•  d is an author of f if DOA(d,f) ≥ 0.75
•  In our example, only Joao and Maria are authors of f
   – DOA(Joao, f) = 20 / 20 = 1
   – DOA(Maria, f) = 15 / 20 = 0.75
   – DOA(Jose, f) = 10 / 20 = 0.5
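The normalization and the 0.75 threshold can be reproduced with the slide's own numbers. The function name `authors_of` and the dictionary input are assumptions made for this sketch; only the threshold and the example values come from the slides.

```python
AUTHOR_THRESHOLD = 0.75  # normalized DOA cutoff stated on the slide


def authors_of(doa_by_dev, threshold=AUTHOR_THRESHOLD):
    """Return {developer: normalized DOA} for developers at or above threshold.

    `doa_by_dev` maps each developer to a raw DOA value for one file.
    Values are divided by the highest DOA on that file.
    """
    if not doa_by_dev:
        return {}
    top = max(doa_by_dev.values())
    if top <= 0:
        return {}  # nobody holds positive authorship on this file
    normalized = {dev: value / top for dev, value in doa_by_dev.items()}
    return {dev: score for dev, score in normalized.items() if score >= threshold}


example = {"Joao": 20, "Maria": 15, "Jose": 10}
print(authors_of(example))  # {'Joao': 1.0, 'Maria': 0.75}
```

Because the comparison is inclusive (≥), Maria at exactly 0.75 counts as an author, while Jose at 0.5 does not.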

(A note: all weights, constants, and thresholds are validated elsewhere.)
Fritz, T., et al. Degree-of-knowledge: modeling a developer’s knowledge of code. ACM TOSEM 2014.
Fritz, T., et al. A degree-of-knowledge model to capture source code familiarity. ICSE 2010.
Avelino, G., et al. A novel approach for estimating truck factors. ICPC 2016.
Ferreira, M., et al. A comparison of three algorithms for computing truck factors. ICPC 2017.

Part #2: We analyze the evolution of the Linux kernel using code authorship (i.e., DOA) measures

Research Questions
1.  What is the proportion of authors/developers?
2.  What is the distribution of files per author?
3.  How specialized is the work of Linux authors?
4.  What are the properties of the Linux co-authorship network?

Linux kernel versions
•  56 stable releases (v2.6.12–v4.7)
•  Spanning 11 years (June 2005–July 2016)

RQ1. Authors/Developers
Linux (ver. 4.7) has 13K developers, but only 26% are authors
[Chart legend: authors, minor collaborators]

RQ2. Files/Authors
•  Considering only authors:
   – 50% are responsible for at most 3 files
   – at the 75th percentile, authors are responsible for 11 to 16 files
•  Authors with more than 100 files:
   – Always fewer than 7% of authors

RQ2. Torvalds’ authorship over time
From 45% (first release) to 9% (last release)

RQ2. Gini coefficients
•  We also use the Gini coefficient to reason about the “inequality” of the files/authors distribution
•  Suppose a system with 100 files and 10 authors
   – Each author has exactly 10 files: Gini = 0.0
   – One author has 91 files and the others have only one file each: Gini ~ 1.0
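The two scenarios above can be checked with a small Gini function. This sketch assumes the standard population formula based on the mean absolute difference; the slides do not say which variant the study uses. Under this formula the skewed example evaluates to 0.81 rather than 1.0, because with n authors the coefficient cannot exceed (n - 1) / n, so values close to 1.0 are only reachable with many more authors.

```python
def gini(values):
    """Gini coefficient of non-negative values (population form).

    Uses the closed form on sorted data:
    G = sum((2i - n - 1) * x_i) / (n * sum(x)), with i starting at 1.
    """
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    ordered = sorted(values)
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(ordered, start=1))
    return weighted / (n * total)


equal_split = [10] * 10        # each of 10 authors has 10 files
skewed_split = [91] + [1] * 9  # one author has 91 files, the rest one each

print(gini(equal_split))   # 0.0
print(gini(skewed_split))  # 0.81
```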

RQ2. Gini coefficients
Linux is not a “perfect society” in terms of files/author, but it is slowly becoming less centralized
Gini ≥ 0.78

RQ3. How specialized is the work of Linux authors?
•  Specialists:
   – authors of files in a single subsystem
•  Generalists:
   – authors of files in at least two subsystems
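The two definitions translate directly into a classification step. In this sketch `subsystem_of` is a hypothetical stand-in that treats the top-level directory of a path as its subsystem; the study's actual file-to-subsystem mapping is not given on the slides.

```python
def subsystem_of(path):
    """Hypothetical mapping: use the top-level directory as the subsystem."""
    return path.split("/", 1)[0]


def classify_authors(files_by_author):
    """Label each author as 'specialist' (one subsystem) or 'generalist' (two or more).

    `files_by_author` maps an author name to the list of files they author.
    Authors with no files are skipped.
    """
    labels = {}
    for author, paths in files_by_author.items():
        subsystems = {subsystem_of(p) for p in paths}
        if not subsystems:
            continue
        labels[author] = "specialist" if len(subsystems) == 1 else "generalist"
    return labels


sample = {
    "dev_a": ["drivers/usb/hub.c", "drivers/net/tun.c"],
    "dev_b": ["kernel/fork.c", "fs/exec.c"],
}
print(classify_authors(sample))  # {'dev_a': 'specialist', 'dev_b': 'generalist'}
```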

RQ3. Specialists vs Generalists

RQ3. Results per subsystem
•  Core has the highest ratio of generalists (87%)
   – They have expertise on Linux’s central features, which allows them to work on other subsystems
•  Drivers has the highest ratio of specialists (over 50%)
   – Drivers are independent from other subsystems

RQ4. Linux Co-authorship network
•  In our model, files can have multiple authors
•  Co-authorship network
   – Nodes are authors
   – Edges connect co-authors in at least one file
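A co-authorship network of this kind can be built from a file-to-authors mapping. The sketch below stores the graph as a plain adjacency dictionary; the function names and the input format are assumptions for illustration, not the tooling used in the study.

```python
from itertools import combinations


def coauthorship_network(authors_by_file):
    """Build an undirected graph as {author: set of co-authors}.

    `authors_by_file` maps a file path to the list of its authors.
    Two authors are connected if they share at least one file.
    """
    graph = {}
    for authors in authors_by_file.values():
        unique = sorted(set(authors))
        for author in unique:
            graph.setdefault(author, set())  # keep sole authors as isolated nodes
        for a, b in combinations(unique, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def mean_degree(graph):
    """Average number of co-authors per node."""
    if not graph:
        return 0.0
    return sum(len(neighbours) for neighbours in graph.values()) / len(graph)


files = {"f1.c": ["A", "B"], "f2.c": ["A", "C"], "f3.c": ["D"]}
network = coauthorship_network(files)
print(len(network["A"]))     # degree of A: 2
print(mean_degree(network))  # 1.0
```

A node's degree is the size of its neighbour set, which is the quantity reported for individual authors and averaged for the whole network.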

RQ4. Linux Co-authorship network
Torvalds’ degree = 215

RQ4. Linux Co-authorship network
•  Mean degree: 3.64

RQ4. Linux Co-authorship network
•  Assortativity coefficient
Linux: expert authors (many connections) work with less skilled authors (few connections)
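Degree assortativity is commonly defined as the Pearson correlation between the degrees found at the two ends of each edge. The sketch below assumes that definition and an adjacency dictionary in which every node appears as a key; the slides do not show how the coefficient was computed. A negative value means that highly connected nodes tend to link to sparsely connected ones, which is the pattern described for Linux.

```python
def degree_assortativity(graph):
    """Pearson correlation between the degrees at both ends of every edge.

    `graph` maps each node to the set of its neighbours (undirected), and
    every node must appear as a key. Returns None when the correlation is
    undefined (no edges, or all edge endpoints share the same degree).
    """
    degree = {node: len(neighbours) for node, neighbours in graph.items()}
    # Each undirected edge contributes two ordered pairs, (u, v) and (v, u),
    # so both coordinates share the same mean and variance.
    pairs = [(degree[u], degree[v]) for u, neighbours in graph.items() for v in neighbours]
    if not pairs:
        return None
    mean = sum(x for x, _ in pairs) / len(pairs)
    covariance = sum((x - mean) * (y - mean) for x, y in pairs)
    variance = sum((x - mean) ** 2 for x, _ in pairs)
    return covariance / variance if variance else None


# A star: one hub connected to three otherwise unconnected authors.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
print(degree_assortativity(star))  # -1.0
```

The star is the extreme case: every edge joins the best connected node to a node with a single connection, so the coefficient reaches its minimum.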

Conclusions

Contributions
•  We revealed many properties and characteristics of the Linux project, using source code authorship measures
•  We proposed a conceptual framework for assessing authorship of software projects (authors, specialists, co-authors, etc.)

Future Work
•  Replication in other open- and closed-source systems

Thanks!
@mtov
aserg.labsoft.dcc.ufmg.br
