Large Scale Empirical Software Engineering Research using GitHub Data

Marco Tulio Valente
Applied Software Engineering Research Group, Department of Computer Science, Federal University of Minas Gerais, Brazil

GitHub
•  Largest collection of open source software
   – 9 million users and 17 million public repositories
•  Public API
•  It is more than a version control system
   – Social coding
   – Issue tracker
   – Code review

Outline
1.  Motivations for refactoring – FSE 2016 (distinguished paper and artifact)
2.  Popularity (# stars) of GitHub software – ICSME 2016, PROMISE 2016
3.  Code authorship measures – ICPC 2016

STUDY #1
Why We Refactor? Confessions of GitHub Contributors
Danilo Silva, Nikolaos Tsantalis, Marco Tulio Valente

Why we refactor? [In theory]

Why we refactor? [In theory]
Kim et al. An Empirical Study of Refactoring Challenges and Benefits at Microsoft, IEEE TSE 2014

But why do we really refactor? In daily programming ...

Dataset
1,000 most popular Java repositories → filter by number of commits → 748 repositories
SPRING-FRAMEWORK, ELASTICSEARCH, INTELLIJ-COMMUNITY, ...
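The selection step can be sketched against GitHub's public search API. This is a minimal sketch, not the study's actual script: it only builds the paged request URLs for the most-starred Java repositories and applies a commit-count filter. The `commits` field and the `min_commits` threshold are hypothetical (the slide does not state the threshold, and the commit count would have to be collected separately for each repository).

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def top_java_repo_urls(total=1000, per_page=100):
    """Build the paged search URLs for the `total` most-starred Java repositories."""
    pages = (total + per_page - 1) // per_page
    urls = []
    for page in range(1, pages + 1):
        query = urlencode({
            "q": "language:java",
            "sort": "stars",
            "order": "desc",
            "per_page": per_page,
            "page": page,
        })
        urls.append(f"{SEARCH_ENDPOINT}?{query}")
    return urls


def filter_by_commits(repos, min_commits):
    """Keep repositories with at least `min_commits` commits.

    `repos` is a list of dicts with a hypothetical "commits" key that the
    caller has filled in beforehand; the threshold is an assumption.
    """
    return [repo for repo in repos if repo["commits"] >= min_commits]
```

Ten pages of 100 results cover the 1,000 repositories; fetching the URLs and counting commits is left to the caller.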

1.  Select code repositories
2.  Mine recent refactorings
3.  Inspect manually
4.  Contact developers
5.  Analyze and classify responses
(Repeat daily)

Tool Support: Refactoring Miner
•  An automated refactoring detection tool
•  12 well-known refactoring types
   – Move Class, Extract Superclass, Rename Package, Extract Interface
   – Extract Method, Inline Method, Pull Up Method, Push Down Method
   – Move Method, Move Attribute, Pull Up Attribute, Push Down Attribute

git repository (revision, previous revision) → Refactoring Miner → List of refactorings
•  Move Class X to Y
•  Extract Method a from b
•  ...
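To make the revision-versus-previous-revision comparison concrete, here is a toy sketch that reports Move Class candidates from two sets of fully qualified class names. It is not Refactoring Miner's algorithm, which matches code elements far more carefully; the function name and the name-based matching rule are assumptions made for illustration only.

```python
def detect_moved_classes(previous, current):
    """Return (simple_name, old_package, new_package) for classes that changed package.

    `previous` and `current` are sets of fully qualified class names taken
    from two consecutive revisions.
    """
    removed = previous - current
    added = current - previous

    # Index the classes that appeared in the new revision by simple name.
    added_by_name = {}
    for qualified in added:
        package, _, simple = qualified.rpartition(".")
        added_by_name.setdefault(simple, []).append(package)

    moves = []
    for qualified in sorted(removed):
        old_package, _, simple = qualified.rpartition(".")
        candidates = added_by_name.get(simple, [])
        if len(candidates) == 1:  # report unambiguous matches only
            moves.append((simple, old_package, candidates[0]))
    return moves


previous_revision = {"org.yyyy.wfs.xml.PropertyRule", "org.yyyy.util.Helper"}
current_revision = {"org.yyyy.util.PropertyRule", "org.yyyy.util.Helper"}
moves = detect_moved_classes(previous_revision, current_revision)
```

A real detector also has to tell a moved class apart from an unrelated class that happens to share the same simple name, which this sketch does not attempt.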

Contacting developers
Dear xxxx, (…)
I found that you recently performed the following refactoring on yyyy project:
Move Class PropertyRule from org.yyyy.wfs.xml to org.yyyy.util
This is the GitHub link to the commit: https://github.com/yyyy/commit/abcd (…)
I am wondering if you could answer the following brief questions:
1.  Could you describe why did you perform this refactoring?
2.  Did you perform this refactoring using automated refactoring support of your IDE?

Some numbers
•  185 repositories with confirmed refactoring activity
•  1,411 confirmed refactoring instances
•  465 e-mails sent
•  195 responses received (41.9% response rate)
•  27 commit messages explaining the motivation

WHY DO DEVELOPERS REFACTOR?

We found 44 reasons

Extract Method motivations

Lessons learned
•  Refactoring activity is mainly driven by the need to add new features and fix bugs, and much less by code smell resolution
•  Extract Method is the “Swiss army knife of refactorings” (11 different motivations)

DO DEVELOPERS USE REFACTORING TOOLS?

Manual vs. automated refactorings

Refactoring automation per type

Reasons for not using Refactoring Tools

The influence of the IDE

Lessons learned
•  Refactoring tools are still underused, as suggested by previous studies
•  Results are different for users of different IDEs

Dataset
http://aserg-ufmg.github.io/why-we-refactor

STUDY #2
Understanding and Predicting the Popularity of GitHub Repositories
Hudson Borges, Andre Hora, Marco Tulio Valente

Social Coding Features
“Stars are used to show appreciation to the repository maintainer for their work”

“Our First 50,000 Stars”
https://facebook.github.io/react/blog/2016/09/28/our-first-50000-stars.html

What are the characteristics of highly successful software?

Top-6 most starred repositories (Dec 14, 2016)
source: http://gittrends.io

Data Collection
•  Top-2,500 repositories (March, 2016)
•  Historical data on # stars

Top-10 programming languages

Correlation Analysis
•  Age: no correlation
•  Contributors: weak correlation
•  Commits: weak correlation
•  Forks: strong correlation
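A correlation check of this kind can be sketched in a few lines. The slide does not name the coefficient, so the sketch below assumes a rank correlation (Spearman's rho) between the number of stars and another repository attribute; the sample values are invented purely to exercise the code.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up sample: stars and forks of four hypothetical repositories.
stars = [100, 250, 900, 4000]
forks = [10, 40, 35, 800]
rho = spearman(stars, forks)
```

Working on ranks rather than raw counts keeps a handful of extremely popular repositories from dominating the result.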

Popularity Growth Patterns
•  K-Spectral Centroid (time series) clustering algorithm
•  Growth patterns: slow, moderate, fast, viral
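The core idea behind K-Spectral Centroid clustering is a distance that ignores the overall scale of a time series, so two repositories with the same growth shape end up close together regardless of how many stars they have. The sketch below shows only that scale-invariant distance, as I recall it from the algorithm's published description; the time-shift alignment and the centroid computation of the full algorithm are omitted, and the function name is an assumption.

```python
import math


def scale_invariant_distance(x, y):
    """Distance between time series x and y after optimally rescaling y.

    Computes min over alpha of ||x - alpha * y|| / ||x||, where the best
    alpha has the closed form (x . y) / ||y||^2.
    """
    dot = sum(a * b for a, b in zip(x, y))
    norm_y_sq = sum(b * b for b in y)
    norm_x = math.sqrt(sum(a * a for a in x))
    alpha = dot / norm_y_sq
    residual = math.sqrt(sum((a - alpha * b) ** 2 for a, b in zip(x, y)))
    return residual / norm_x


# Same shape at a different scale: distance close to 0.
same_shape = scale_invariant_distance([1, 2, 3], [10, 20, 30])
# Orthogonal series: no rescaling helps, distance is 1.
unrelated = scale_invariant_distance([1, 0], [0, 1])
```

With this distance, a repository growing from 10 to 100 stars and one growing from 1,000 to 10,000 along the same curve are treated as the same pattern.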

Clustering Results

Developers Feedback
1.  Impact of account types (users vs orgs)
2.  Reasons for viral growth

Repositories Owned by Users
Do you plan to migrate to an organization account?
All developers answered negatively
“I worked hard to create the project, and having it under my personal username is necessary to have proper credit for it.”

Repositories Owned by Users
Do you agree that an organization account would help to attract more users?
80% answered negatively
“It depends on what organization it is. If it’s a well known org I’m sure it helps, otherwise I don’t think it makes a difference.”

Reasons for Viral Growth
How do you explain the peaks in the number of stars?
“I posted about this project on HackerNews. It quickly got a lot of attention ...”

Application: Popularity Prediction
•  Technique: Multiple Linear Regression, where:
   ◦  Yt → Predicted number of stars at week t
   ◦  bj → Regression coefficients
   ◦  Xj → Stars at week j ( 0 ≤ j ≤ r < t )
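A minimal sketch of such a model, assuming the prediction is a weighted sum of the weekly star counts observed so far: the coefficients bj are fitted by ordinary least squares on repositories whose count at week t is already known. Whether the original model includes an intercept is not visible on the slide, so this sketch leaves it out; all function names and the training data are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit(rows, targets):
    """Least-squares coefficients b_j such that targets ~ sum_j b_j * row[j]."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)  # normal equations: (X^T X) b = X^T y


def predict(coefficients, stars_so_far):
    """Predicted stars at week t from the stars observed at weeks 0..r."""
    return sum(b * x for b, x in zip(coefficients, stars_so_far))


# Hypothetical training data: stars at weeks 0 and 1, and the known count at week t.
history = [[1, 2], [2, 1], [3, 5], [4, 3]]
stars_at_t = [8, 7, 21, 17]
coefficients = fit(history, stars_at_t)
forecast = predict(coefficients, [5, 6])
```

Solving the normal equations directly is fine for a handful of predictors; for many weeks of history a numerically sturdier solver would be the safer choice.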

RQ #1. Prediction Examples

STUDY #3
Measuring Code Authorship: Algorithms and Applications
Guilherme Avelino, Leonardo Passos, Andre Hora, Marco Tulio Valente

Degree-of-Authorship (DOA) Metric
•  DOA(d, f) depends on three variables:
   – FA = 1 if d made the first commit in f; 0, otherwise
   – DL = number of further commits to f by d
   – AC = number of commits in f by other devs
T. Fritz, et al. “Degree-of-knowledge: modeling a developer’s knowledge of code,” ACM TOSEM, 2014
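The three variables combine into a single score through a linear model. The weights below are the ones I recall from the cited Fritz et al. model; they do not appear on the slide, so treat them as an assumption to verify against the paper. The function and parameter names are hypothetical.

```python
import math


def degree_of_authorship(first_author, deliveries, acceptances):
    """DOA of developer d on file f.

    first_author -> FA: True if d made the first commit in f
    deliveries   -> DL: number of further commits to f by d
    acceptances  -> AC: number of commits in f by other developers
    Weights are assumed from the cited model, not taken from the slide.
    """
    fa = 1 if first_author else 0
    return 3.293 + 1.098 * fa + 0.164 * deliveries - 0.321 * math.log(1 + acceptances)


creator = degree_of_authorship(True, 0, 0)
newcomer = degree_of_authorship(False, 0, 0)
```

Creating the file and committing to it raise the score, while commits by other developers lower it, with diminishing effect because of the logarithm.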

Author Identification
Developers → Degree of Authorship → Authors
0.75 (empirically defined: 6 systems of different languages)
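One way to read the 0.75 value is as a threshold on the DOA normalized by the highest DOA on the same file: developers above it count as authors of that file. The slide shows only the number, so this interpretation, the function name and the sample scores are all assumptions.

```python
def identify_authors(doa_by_developer, threshold=0.75):
    """Return the developers whose normalized DOA on a file exceeds `threshold`.

    `doa_by_developer` maps developer name -> DOA on one file (positive values).
    Normalization divides by the highest DOA on that file (assumed reading).
    """
    top = max(doa_by_developer.values())
    return sorted(
        dev for dev, doa in doa_by_developer.items() if doa / top > threshold
    )


# Hypothetical DOA scores of three developers on one file.
authors = identify_authors({"alice": 8.0, "bob": 6.5, "carol": 3.3})
```

Under this reading a file can have several authors, as long as their scores are close to that of the top contributor.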

Example: Linux Kernel

Linux Kernel: Devs vs Authors 8x 8.5x

Authors Ratio

Linux Subsystems
•  Specialist: 1 subsystem
•  Generalist: at least 2 subsystems

Specialists vs Generalists

Application: Estimating Truck/Bus Factor

Application: Truck/Bus Factor
“The number of people on your team that have to be hit by a truck (or quit; or win in the lottery) before the project is in serious trouble”

Estimating Truck Factors
[Figure: the system's files, each labeled with its author (A1 ... An), next to a ranking of authors by number of files. Authors are crossed out (X) one at a time, starting with A1, until the 50% mark on the system's files is reached. In the example, TF = 3.]
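The estimation shown in the figure can be sketched as a greedy loop: repeatedly remove the author with the most files and stop once more than half of the files have no author left. The exact stopping rule is an assumption based on the 50% mark in the figure, and the function name and file names are hypothetical; the author distribution mirrors the one in the figure.

```python
from collections import Counter


def truck_factor(file_authors, orphan_limit=0.5):
    """Greedy truck factor estimate.

    `file_authors` maps file path -> set of its authors. Returns the authors
    removed, in order, until more than `orphan_limit` of the files have no
    remaining author; the truck factor is the length of that list.
    """
    remaining = {path: set(authors) for path, authors in file_authors.items()}
    total = len(remaining)
    removed = []
    while total and sum(1 for a in remaining.values() if not a) <= orphan_limit * total:
        counts = Counter(dev for authors in remaining.values() for dev in authors)
        if not counts:
            break  # nobody left to remove
        # Most files first; ties broken by name for a deterministic result.
        top = min(counts, key=lambda dev: (-counts[dev], dev))
        removed.append(top)
        for authors in remaining.values():
            authors.discard(top)
    return removed


# Author of each of the 16 files in the figure's example.
owners = ["A1"] * 5 + ["A2"] * 2 + ["A3"] * 2 + [f"A{i}" for i in range(4, 11)]
files = {f"file{i}": {owner} for i, owner in enumerate(owners)}
removed_authors = truck_factor(files)
```

Removing A1, A2 and A3 leaves 9 of the 16 files without an author, which is the first point past 50%, so the sketch reproduces the TF = 3 of the example.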

Dataset

Results
•  45 systems (34%) have TF = 1
   –  mbostock/d3, less/less.js
•  42 systems (31%) have TF = 2
   –  clojure/clojure, cucumber/cucumber,
   –  ashkenas/backbone, elasticsearch/elasticsearch
[ updated results: 12K out of the top-17K (72%) projects have TF=1 ]

Systems with highest TF

Survey with Developers
▪  GitHub issues
▪  Opened: 114
▪  Response ratio: 54%

Do developers agree that the TF authors are the main developers of their projects?

Do developers agree that their projects will be in trouble if they lose the truck factor authors?

What are the development practices that can attenuate the loss of top-ranked authors?

http://gittrends.io

Thanks!
Marco Tulio Valente
Applied Software Engineering Research Group, Department of Computer Science, Federal University of Minas Gerais, Brazil

Ongoing study: why OSS fail? (“dual study”)
•  Top-5000 systems (stars)
•  540 systems without commits in the last year (fail?)
•  342 mails sent to main developer (public mail available)
•  94 answers (27.5%)
•  Do you agree it is no longer under maintenance? Yes (78)
•  Why did you stop maintaining the system? [37 answers]
   ◦  Lack of time: 13
   ◦  Project is completed: 6
   ◦  Usurped by competitor: 5
   ◦  Lack of interest: 5
