Identifying Unmaintained Projects in GitHub (ESEM 2018)

Background: Open source software is increasingly important in modern software development. However, there is also growing concern about the sustainability of such projects, which are usually managed by a small number of developers, frequently working as volunteers. Aims: In this paper, we propose an approach to identify GitHub projects that are not actively maintained. Our goal is to alert users about the risks of using these projects and possibly motivate other developers to assume their maintenance. Method: We train machine learning models to identify unmaintained or sparsely maintained projects, based on a set of features about project activity (commits, forks, issues, etc.). We empirically validate the best-performing model with the principal developers of 129 GitHub projects. Results: The proposed machine learning approach has a precision of 80%, based on feedback from real open source developers, and a recall of 96%. We also show that our approach can be used to assess the risk of projects becoming unmaintained. Conclusions: The model proposed in this paper can be used by open source users and developers to identify GitHub projects that are no longer actively maintained.

ASERG, DCC, UFMG

October 11, 2018

Transcript

  1. Identifying Unmaintained Projects in GitHub

    Jailton Coelho¹, Marco Tulio Valente¹, Luciana Silva², Emad Shihab³
    ¹Federal University of Minas Gerais, Brazil; ²Federal Institute of Minas Gerais, Brazil; ³Concordia University, Canada
    ESEM 2018
  2. 85M repositories, 28M developers, 1.8M organizations

  3. GitHub projects are often abandoned by developers

  4. Why do open source projects fail?

    Reason: Projects
    • Usurped by competitor: 25%
    • Obsolete: 18%
    • Lack of time: 16%
    • Lack of interest: 16%
    • Outdated technologies: 13%
    Source: Jailton Coelho, Marco Tulio Valente. Why Modern Open Source Projects Fail, FSE 2017.
  5. We propose a machine learning approach to identify GitHub projects that are not actively maintained
  8. 4 years without commits. Trivial case!

  11. Very few commits in 2018. Unmaintained?

  14. Our goal: identify, as soon as possible, GitHub projects that are not actively maintained

    • "as soon as possible": without having to wait for years of inactivity
    • "not actively maintained": projects do not need to be fully dead, deprecated, or archived
  15. Besides commits, we consider 12 other features (issues, forks, pull requests, etc.)
  16. MACHINE LEARNING MODEL

  17. Dataset: 1,002 projects

    • 754 active projects (one release, last month)
    • 248 unmaintained projects:
      ◦ 104 abandoned projects [FSE 2017 paper]
      ◦ 144 archived projects
  18. Features

    Dimension: Feature
    • Project: Forks; Open issues; Closed issues; Open pull requests; Closed pull requests; Merged pull requests; Commits; Max days without commits; Max contributions by developer
    • Contributor: New contributors; Distinct contributors
    • Owner: Projects created by the owner; Number of commits of the owner
  19. Feature Collection (timeline figure; labels: months, last commit date, 3 months, 24 months)
  20. Random Forest: 5-fold cross-validation, 100 rounds
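
The evaluation protocol on this slide (5-fold cross-validation repeated for 100 rounds) can be sketched as follows. This is only an illustration of the protocol, not the authors' pipeline: the slide does not say how folds were drawn, so plain shuffled folds are assumed; the classifier is a simple threshold stand-in rather than a Random Forest; and the function names and toy data are hypothetical.

```python
import random
from statistics import mean

def k_folds(n, k, rng):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def repeated_cv(X, y, fit, predict, k=5, rounds=100, seed=0):
    """Mean accuracy over `rounds` repetitions of k-fold cross-validation."""
    rng = random.Random(seed)
    scores = []
    for _ in range(rounds):
        for test_idx in k_folds(len(y), k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
            hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
            scores.append(hits / len(test_idx))
    return mean(scores)

# Stand-in classifier (NOT a Random Forest): a threshold halfway
# between the mean feature value of each class.
def fit_threshold(X, y):
    active = [x[0] for x, label in zip(X, y) if label == "active"]
    unmaintained = [x[0] for x, label in zip(X, y) if label == "unmaintained"]
    return (mean(active) + mean(unmaintained)) / 2

def predict_threshold(threshold, x):
    return "active" if x[0] > threshold else "unmaintained"

# Hypothetical toy data: one feature (recent commit count) per project.
X = [[c] for c in range(50, 70)] + [[c] for c in range(0, 10)]
y = ["active"] * 20 + ["unmaintained"] * 10
accuracy = repeated_cv(X, y, fit_threshold, predict_threshold)
```

Averaging over many reshuffled rounds reduces the dependence of the reported score on any single random split, which is presumably why the next slide reports means over 100 iterations.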

  21. Results (mean of 100 iterations)

  22. Best Random Forest Model

    Metric: Average
    • Accuracy: 0.92
    • Precision: 0.86
    • Recall: 0.81
    • F-measure: 0.83
    • Kappa: 0.78
    • AUC: 0.88
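
As a reminder of how most of these metrics are defined, the sketch below derives accuracy, precision, recall, F-measure, and Cohen's kappa from a binary confusion matrix. It assumes "unmaintained" is the positive class, and the counts in the example are hypothetical, not the study's data. AUC is left out because it requires ranked scores rather than counts.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard metrics from a binary confusion matrix.

    tp/fp/fn/tn are counts with 'unmaintained' assumed to be the positive class.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: agreement beyond what chance alone would produce.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "kappa": kappa,
    }

# Hypothetical counts, for illustration only.
m = classification_metrics(tp=40, fp=10, fn=10, tn=40)
```
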
  23. EMPIRICAL VALIDATION

  24. Methodology: 5,783 projects (not used in the model construction)

  25. Methodology: the 5,783 projects are given to the best ML model

  26. Methodology: the best ML model classifies each of the 5,783 projects, for example:

    • twbs/bootstrap: active
    • zensh/jsgen: unmaintained
    • torvalds/linux: active
  27. Results: 2,856 unmaintained projects (49%); 2,927 active projects (51%)

  28. 1st validation: unmaintained projects (2,856 unmaintained projects; 2,927 active projects)
  29. Survey with Developers

    • Pilot Study (75 developers):
      ◦ Do you confirm your project is unmaintained?
    • Final Survey (227 developers):
      ◦ What is the status of your project? (a) under maintenance; (b) finished; (c) deprecated; (d) other
  30. Survey Answers

    • 112 answers (response rate of 37%)
    • 21 answers from READMEs
  31. Example of Project’s README: “This repository (deis/deis) is no longer developed or maintained.”
  32. Survey Results

  33. Finished Projects (41%)

    • “It’s just complete, at least for now. I still fix bugs on the rare occasion they are reported.”
    • “I view it as basically ‘done’. I don’t think it needs any new features for the foreseeable future...”
  34. Deprecated Projects (31%)

    • “The project is unmaintained and I’ll archive it.”
    • “It is deprecated and I do not plan to implement new features or fix bugs.”
  35. Maintained Projects (15%)

    • “It’s under maintenance and new features are under implementation.”
    • “This project is under maintenance all the time and open for any new features.”
  36. Other Answers (13%)

    • “I’m planning to fix most important issues, then the project will be finished...I would call this status limbo.”
  37. Final Results

    • Precision: 80%
    • Recall: 96% (based on 112 projects’ READMEs)
  38. When was the last commit? (days before dataset collection date)
  39. How early can we detect unmaintained projects? Most projects are classified as unmaintained despite having recent commits (median of 81 days before the dataset collection date)
  40. LEVEL of MAINTENANCE ACTIVITY (LMA)

  41. Now, the focus is on the 2,927 active projects (rather than the 2,856 unmaintained projects)
  42. Level of Maintenance Activity (LMA)

    • Random Forest probability p: proportion of trees’ votes
    • p ranges from 0.5 (very close to unmaintained) to 1.0 (very active)
    • LMA = 2 * (p - 0.5) * 100
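
The LMA definition on this slide is a linear rescaling of the Random Forest probability from the interval 0.5 to 1.0 onto 0 to 100. A minimal sketch, assuming `p` is the proportion of trees voting "active"; the function names and the example votes are hypothetical.

```python
def vote_proportion(votes, label="active"):
    """Proportion of trees that voted for `label` (the Random Forest probability p)."""
    return sum(v == label for v in votes) / len(votes)

def lma(p):
    """Level of Maintenance Activity: maps p in [0.5, 1.0] onto [0, 100]."""
    if not 0.5 <= p <= 1.0:
        raise ValueError("LMA is defined only for projects predicted as active (0.5 <= p <= 1.0)")
    return 2 * (p - 0.5) * 100

# Hypothetical forest of 100 trees, 75 of which vote "active".
votes = ["active"] * 75 + ["unmaintained"] * 25
score = lma(vote_proportion(votes))  # p = 0.75
```

With these example votes, p = 0.75 sits halfway between the two extremes, so the LMA is 50.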
  43. Level of Maintenance Activity (LMA): scale from 0 (predicted as active, but very close to unmaintained) to 100 (very active), with marks at 25, 50, and 75
  44. Level of Maintenance Activity (LMA)

  45. Level of Maintenance Activity (LMA): most projects predicted as active are indeed under constant maintenance (median LMA of 82)
  46. Spearman’s rank correlation

  48. Spearman’s rank correlation: LMA vs STARS, ρ = 0.10 (very weak)
  50. Spearman’s rank correlation: LMA vs CONTRIBUTORS, ρ = 0.44 (moderate)
  52. Spearman’s rank correlation: LMA vs CORE CONTRIBUTORS, ρ = 0.15 (very weak)
  54. Spearman’s rank correlation: LMA vs LOC, ρ = 0.38 (weak)
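
For readers unfamiliar with the statistic used on these slides, Spearman's ρ is the Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below is a plain-Python illustration; the function names and the sample values are hypothetical and unrelated to the study's data.

```python
from math import sqrt
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical LMA scores and contributor counts for six projects.
lma_scores = [12, 35, 50, 64, 82, 97]
contributors = [3, 1, 8, 5, 20, 14]
rho = spearman(lma_scores, contributors)
```

Because only ranks matter, the measure is insensitive to the heavy skew typical of counts such as stars, contributors, or lines of code.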
  55. CONCLUSION

  56. Conclusion: We proposed a machine learning model to identify unmaintained GitHub projects.
  57. Conclusion: We defined a metric to express the level of maintenance activity of GitHub projects.