Slide 1

Slide 1 text

Identifying Unmaintained Projects in GitHub
Jailton Coelho¹, Marco Tulio Valente¹, Luciana Silva², Emad Shihab³
¹Federal University of Minas Gerais, Brazil
²Federal Institute of Minas Gerais, Brazil
³Concordia University, Canada
ESEM 2018

Slide 2

Slide 2 text

● 85M repositories
● 28M developers
● 1.8M organizations

Slide 3

Slide 3 text

GitHub projects are often abandoned by developers

Slide 4

Slide 4 text

Why do open source projects fail?

Reason                   Projects
Usurped by competitor    25%
Obsolete                 18%
Lack of time             16%
Lack of interest         16%
Outdated technologies    13%

Source: Jailton Coelho, Marco Tulio Valente. Why Modern Open Source Projects Fail, FSE 2017.

Slide 5

Slide 5 text

We propose a machine learning approach to identify GitHub projects that are not actively maintained

Slide 6

Slide 6 text


Slide 7

Slide 7 text


Slide 8

Slide 8 text

4 years without commits. Trivial case!

Slide 9

Slide 9 text


Slide 10

Slide 10 text


Slide 11

Slide 11 text

Very few commits in 2018. Unmaintained?

Slide 12

Slide 12 text

Our goal: identify, as soon as possible, GitHub projects that are not actively maintained

Slide 13

Slide 13 text

Our goal: identify, as soon as possible, GitHub projects that are not actively maintained
● "as soon as possible": without having to wait for years of inactivity

Slide 14

Slide 14 text

Our goal: identify, as soon as possible, GitHub projects that are not actively maintained
● "as soon as possible": without having to wait for years of inactivity
● "not actively maintained": projects do not need to be fully dead, deprecated, or archived

Slide 15

Slide 15 text

Besides commits, we consider 12 other features (issues, forks, pull requests, etc.)

Slide 16

Slide 16 text

MACHINE LEARNING MODEL

Slide 17

Slide 17 text

Dataset: 1,002 projects
● 754 active projects (at least one release in the last month)
● 248 unmaintained projects:
  ○ 104 abandoned projects [FSE 2017 paper]
  ○ 144 archived projects

Slide 18

Slide 18 text

Features

Dimension      Feature
Project        Forks
               Open issues
               Closed issues
               Open pull requests
               Closed pull requests
               Merged pull requests
               Commits
               Max days without commits
               Max contributions by developer
Contributor    New contributors
               Distinct contributors
Owner          Projects created by the owner
               Number of commits of the owner

Slide 19

Slide 19 text

Feature Collection
[Figure: timeline in months ending at the last commit date, marked with a 3-month interval and a 24-month span]
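One way to read the timeline above is that features are computed over consecutive 3-month intervals covering the 24 months that end at the last commit date. The sketch below builds those intervals under that assumption. The helper names `shift_months` and `feature_windows` are hypothetical and are not taken from the original study's tooling.

```python
import calendar
from datetime import date


def shift_months(d, months):
    """Return d moved by a (possibly negative) number of calendar months."""
    idx = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(idx, 12)
    month = month0 + 1
    # Clamp the day so that, e.g., May 31 minus 3 months becomes Feb 28/29.
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def feature_windows(last_commit, total_months=24, step_months=3):
    """List (start, end) intervals, oldest first, ending at last_commit.

    Assumption: the 24-month span is cut into equal 3-month intervals.
    """
    windows = []
    for end_offset in range(0, total_months, step_months):
        end = shift_months(last_commit, -end_offset)
        start = shift_months(last_commit, -(end_offset + step_months))
        windows.append((start, end))
    windows.reverse()
    return windows


# Hypothetical last commit date, used only for illustration.
windows = feature_windows(date(2018, 6, 15))
for start, end in windows:
    print(start.isoformat(), "to", end.isoformat())
```

Each feature (commits, forks, issues, and so on) would then be counted once per interval, giving one value per feature per window.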

Slide 20

Slide 20 text

Random Forest
● 5-fold cross validation
● 100 rounds
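The evaluation protocol above (5 folds, repeated for 100 rounds) can be sketched with the standard library alone. The sketch covers only how the 1,002 projects would be split, not the Random Forest training itself. Stratifying the folds by class is an assumption made here because the dataset is imbalanced (754 vs. 248); the slides do not say whether the folds were stratified. The function names are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, rng):
    """Split indices into k folds, keeping class proportions roughly equal."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal shuffled indices round-robin so each fold gets a fair share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def repeated_cross_validation(labels, k=5, rounds=100, seed=0):
    """Yield (round, fold, train_indices, test_indices) for every split."""
    rng = random.Random(seed)
    for r in range(rounds):
        folds = stratified_folds(labels, k, rng)
        for f, test in enumerate(folds):
            train = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield r, f, train, test


# Class sizes taken from the dataset slide; the label order is arbitrary.
labels = ["active"] * 754 + ["unmaintained"] * 248
splits = list(repeated_cross_validation(labels, k=5, rounds=100))
print(len(splits), "train/test splits")
```

A classifier would be trained on each `train` list and scored on the matching `test` list, and the scores would then be averaged over all splits.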

Slide 21

Slide 21 text

Results (mean of 100 iterations)

Slide 22

Slide 22 text

Best Random Forest Model

Metric       Average
Accuracy     0.92
Precision    0.86
Recall       0.81
F-measure    0.83
Kappa        0.78
AUC          0.88
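As a reminder of how most of these metrics relate to a confusion matrix, here is a minimal sketch that treats "unmaintained" as the positive class. The counts are illustrative only: they were picked so that a single 1,002-project confusion matrix lands near the reported averages, and they do not come from the study. AUC is left out because it needs per-project scores rather than counts.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common binary metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: agreement beyond what class frequencies alone predict.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((fn + tn) / total) * ((fp + tn) / total)
    expected = p_yes + p_no
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "kappa": kappa,
    }


# Hypothetical counts (248 unmaintained, 754 active), not from the source.
metrics = classification_metrics(tp=201, fp=33, fn=47, tn=721)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Kappa sits below accuracy here because about three quarters of the projects are active, so a large share of the agreement would be expected by chance.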

Slide 23

Slide 23 text

EMPIRICAL VALIDATION

Slide 24

Slide 24 text

Methodology
5,783 projects (not used in the model construction)

Slide 25

Slide 25 text

Methodology
5,783 projects → ML Model (best)

Slide 26

Slide 26 text

Methodology
5,783 projects → ML Model (best) → projects classification

Project           Classification
twbs/bootstrap    active
zensh/jsgen       unmaintained
torvalds/linux    active

Slide 27

Slide 27 text

Results
● 2,856 unmaintained projects (49%)
● 2,927 active projects (51%)

Slide 28

Slide 28 text

1st validation: unmaintained projects
● 2,856 unmaintained projects
● 2,927 active projects

Slide 29

Slide 29 text

Survey with Developers
● Pilot Study (75 developers):
  ○ Do you confirm your project is unmaintained?
● Final Survey (227 developers):
  ○ What is the status of your project? (a) under maintenance; (b) finished; (c) deprecated; (d) other

Slide 30

Slide 30 text

Survey Answers
● 112 answers (response rate of 37%)
● 21 answers from READMEs

Slide 31

Slide 31 text

Example of a Project's README
"This repository (deis/deis) is no longer developed or maintained."

Slide 32

Slide 32 text

Survey Results

Slide 33

Slide 33 text

Finished Projects (41%)
“It’s just complete, at least for now. I still fix bugs on the rare occasion they are reported.”
“I view it as basically ‘done’. I don’t think it needs any new features for the foreseeable future...”

Slide 34

Slide 34 text

Deprecated Projects (31%)
“The project is unmaintained and I’ll archive it.”
“It is deprecated and I do not plan to implement new features or fix bugs.”

Slide 35

Slide 35 text

Maintained Projects (15%)
“It’s under maintenance and new features are under implementation.”
“This project is under maintenance all the time and open for any new features.”

Slide 36

Slide 36 text

Other Answers (13%)
“I’m planning to fix most important issues, then the project will be finished...I would call this status limbo.”

Slide 37

Slide 37 text

Final Results
● Precision: 80%
● Recall: 96% (based on 112 projects’ READMEs)

Slide 38

Slide 38 text

When was the last commit? (days before dataset collection date)

Slide 39

Slide 39 text

How early can we detect unmaintained projects?
Most projects are classified as unmaintained despite having recent commits (median: last commit 81 days before the dataset collection date)

Slide 40

Slide 40 text

LEVEL OF MAINTENANCE ACTIVITY (LMA)

Slide 41

Slide 41 text

Now, the focus is on active projects
● 2,856 unmaintained projects
● 2,927 active projects

Slide 42

Slide 42 text

Level of Maintenance Activity (LMA)
● Random Forest probability
● p: proportion of trees’ votes
● p ranges from 0.5 (very close to unmaintained) to 1.0 (very active)

LMA = 2 * (p - 0.5) * 100
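The LMA formula can be written as a one-line function. In this minimal sketch, `p` stands for the Random Forest probability (the proportion of trees voting "active") and the function name is hypothetical. The range check reflects the slide's statement that the value lies between 0.5 and 1.0 for projects predicted as active.

```python
def level_of_maintenance_activity(p):
    """Map a Random Forest probability p in [0.5, 1.0] to an LMA in [0, 100]."""
    if not 0.5 <= p <= 1.0:
        raise ValueError("p must lie in [0.5, 1.0] for projects predicted as active")
    return 2 * (p - 0.5) * 100


# A few illustrative probabilities (not taken from the dataset).
for p in (0.5, 0.75, 0.91, 1.0):
    print(p, "->", round(level_of_maintenance_activity(p)))
```

Subtracting 0.5 and doubling stretches the half-range [0.5, 1.0] onto [0, 1], and the factor of 100 turns it into a 0 to 100 score. An LMA of 82, for example, corresponds to p = 0.91.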

Slide 43

Slide 43 text

Level of Maintenance Activity (LMA)
[Figure: LMA scale with ticks at 0, 25, 50, 75, and 100, running from "predicted as active, but very close to unmaintained" (0) to "very active" (100)]

Slide 44

Slide 44 text

Level of Maintenance Activity (LMA)

Slide 45

Slide 45 text

Level of Maintenance Activity (LMA)
Most projects predicted as active are indeed under constant maintenance (median LMA: 82)

Slide 46

Slide 46 text

Spearman’s rank correlation

Slide 47

Slide 47 text

Spearman’s rank correlation

Slide 48

Slide 48 text

Spearman’s rank correlation
LMA vs STARS: ρ = 0.10 (very weak)

Slide 49

Slide 49 text

Spearman’s rank correlation

Slide 50

Slide 50 text

Spearman’s rank correlation
LMA vs CONTRIBUTORS: ρ = 0.44 (moderate)

Slide 51

Slide 51 text

Spearman’s rank correlation

Slide 52

Slide 52 text

Spearman’s rank correlation
LMA vs CORE CONTRIBUTORS: ρ = 0.15 (very weak)

Slide 53

Slide 53 text

Spearman’s rank correlation

Slide 54

Slide 54 text

Spearman’s rank correlation
LMA vs LOC: ρ = 0.38 (weak)
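For readers who want to reproduce this kind of analysis, Spearman's ρ is the Pearson correlation of the two variables' ranks, with tied values sharing their average rank. The sketch below implements that definition with the standard library. The sample values are made up for illustration and are not taken from the study's data.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical LMA scores and contributor counts for six projects.
lma_scores = [12, 35, 50, 64, 82, 97]
contributors = [3, 40, 8, 25, 19, 120]
print(round(spearman_rho(lma_scores, contributors), 2))
```

Because only ranks matter, ρ is insensitive to the heavy skew that is typical of counts such as stars or contributors.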

Slide 55

Slide 55 text

CONCLUSION

Slide 56

Slide 56 text

Conclusion
We proposed a machine learning model to identify unmaintained GitHub projects.

Slide 57

Slide 57 text

Conclusion
We defined a metric to express the level of maintenance activity of GitHub projects.
