Identifying Unmaintained Projects in GitHub (ESEM 2018)

Background: Open source software has an increasing importance in modern software development. However, there is also a growing concern about the sustainability of such projects, which are usually managed by a small number of developers, frequently working as volunteers. Aims: In this paper, we propose an approach to identify GitHub projects that are not actively maintained. Our goal is to alert users about the risks of using these projects and possibly motivate other developers to assume the maintenance of the projects. Method: We train machine learning models to identify unmaintained or sparsely maintained projects, based on a set of features about project activity (commits, forks, issues, etc.). We empirically validate the best-performing model with the principal developers of 129 GitHub projects. Results: The proposed machine learning approach has a precision of 80%, based on the feedback of real open source developers, and a recall of 96%. We also show that our approach can be used to assess the risks of projects becoming unmaintained. Conclusions: The model proposed in this paper can be used by open source users and developers to identify GitHub projects that are no longer actively maintained.

ASERG, DCC, UFMG

October 11, 2018

Transcript

  1. 1.

    Identifying Unmaintained Projects in GitHub
    Jailton Coelho¹, Marco Tulio Valente¹, Luciana Silva², Emad Shihab³
    ¹Federal University of Minas Gerais, Brazil
    ²Federal Institute of Minas Gerais, Brazil
    ³Concordia University, Canada
    ESEM 2018
  2. 4.

    Why do open source projects fail?

    Reason                  Projects
    Usurped by competitor   25%
    Obsolete                18%
    Lack of time            16%
    Lack of interest        16%
    Outdated technologies   13%

    Jailton Coelho, Marco Tulio Valente. Why Modern Open Source Projects Fail, FSE 2017.
  3. 5.

    We propose a machine learning approach to identify GitHub projects that are not actively maintained
  10. 14.

    Our goal: identify, as soon as possible, GitHub projects that are not actively maintained
    "as soon as possible": without having to wait for years of inactivity
    "not actively maintained": projects do not need to be fully dead, deprecated, or archived
  11. 17.

    Dataset: 1,002 projects
    • 754 active projects (at least one release in the last month)
    • 248 unmaintained projects:
      ◦ 104 abandoned projects [FSE 2017 paper]
      ◦ 144 archived projects
  12. 18.

    Features

    Dimension     Feature
    Project       Forks
                  Open issues
                  Closed issues
                  Open pull requests
                  Closed pull requests
                  Merged pull requests
                  Commits
                  Max days without commits
                  Max contributions by developer
    Contributor   New contributors
                  Distinct contributors
    Owner         Projects created by the owner
                  Number of commits of the owner
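
As an illustration of how one of these features might be derived, the sketch below computes "Max days without commits" from a list of commit dates. The slides do not define the feature precisely, so the definition used here (the longest gap between two consecutive commits) is an assumption, as are the function name and the sample history.

```python
from datetime import date


def max_days_without_commits(commit_dates):
    """Return the longest gap, in days, between two consecutive commits.

    Assumed definition: the slides only name the feature.
    """
    ordered = sorted(commit_dates)
    if len(ordered) < 2:
        return 0  # assumption: no gap is defined with fewer than two commits
    return max((later - earlier).days
               for earlier, later in zip(ordered, ordered[1:]))


# Hypothetical commit history, for illustration only.
history = [date(2018, 1, 1), date(2018, 1, 10), date(2018, 3, 1), date(2018, 3, 5)]
print(max_days_without_commits(history))  # 50 (January 10 to March 1)
```

Sorting first means the input does not have to arrive in chronological order.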
  13. 22.

    Best Random Forest Model

    Metric      Average
    Accuracy    0.92
    Precision   0.86
    Recall      0.81
    F-measure   0.83
    Kappa       0.78
    AUC         0.88
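
For readers unfamiliar with these metrics, the following sketch shows how accuracy, precision, recall, F-measure, and Cohen's kappa can be computed from a binary confusion matrix. The counts are hypothetical and "unmaintained" is assumed to be the positive class; neither comes from the slides. AUC is omitted because it needs per-project scores rather than counts.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    chance_positive = ((tp + fp) / total) * ((tp + fn) / total)
    chance_negative = ((fn + tn) / total) * ((fp + tn) / total)
    expected = chance_positive + chance_negative
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "kappa": kappa,
    }


# Hypothetical counts, for illustration only (not the study's data).
metrics = classification_metrics(tp=40, fp=10, fn=10, tn=140)
print(metrics)

# Harmonic mean of the reported average precision and recall.
print(round(2 * 0.86 * 0.81 / (0.86 + 0.81), 2))  # 0.83
```

The last line only shows that the reported F-measure is consistent with the reported precision and recall. An F-measure averaged over several runs need not equal the harmonic mean of the averaged precision and recall.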
  14. 26.

    Methodology
    5,783 projects → ML Model (best) → projects classification

    Projects         Classification
    twbs/bootstrap   active
    zensh/jsgen      unmaintained
    torvalds/linux   active
  15. 29.

    Survey with Developers
    • Pilot Study (75 developers):
      ◦ Do you confirm your project is unmaintained?
    • Final Survey (227 developers):
      ◦ What is the status of your project? (a) under maintenance; (b) finished; (c) deprecated; (d) other
  16. 33.

    Finished Projects (41%)
    “It’s just complete, at least for now. I still fix bugs on the rare occasion they are reported.”
    “I view it as basically “done”. I don’t think it needs any new features for the foreseeable future...”
  17. 34.

    Deprecated Projects (31%)
    “The project is unmaintained and I’ll archive it.”
    “It is deprecated and I do not plan to implement new features or fix bugs.”
  18. 35.

    Maintained Projects (15%)
    “It’s under maintenance and new features are under implementation.”
    “This project is under maintenance all the time and open for any new features.”
  19. 36.

    Other Answers (13%)
    “I’m planning to fix most important issues, then the project will be finished...I would call this status limbo.”
  20. 39.

    How early can we detect unmaintained projects?
    Most projects are classified as unmaintained despite having recent commits (median: 81 days before the dataset collection date)
  21. 42.

    Level of Maintenance Activity (LMA)
    • Random Forest probability
    • p: proportion of trees’ votes
    • ranges from 0.5 (very close to unmaintained) to 1.0 (very active)
    LMA = 2 * (p - 0.5) * 100
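
The LMA formula can be sketched as a small function. Here p stands for the Random Forest probability (the proportion of trees' votes) for a project predicted as active. The function name and the range check are assumptions added for illustration.

```python
def level_of_maintenance_activity(p):
    """Map a Random Forest probability p in [0.5, 1.0] to an LMA score in [0, 100]."""
    if not 0.5 <= p <= 1.0:
        # Assumption: LMA is only defined for projects predicted as active.
        raise ValueError("p must be between 0.5 and 1.0")
    return 2 * (p - 0.5) * 100


print(level_of_maintenance_activity(0.5))   # 0.0, very close to unmaintained
print(level_of_maintenance_activity(0.75))  # 50.0
print(level_of_maintenance_activity(1.0))   # 100.0, very active
```

Subtracting 0.5 and doubling rescales the interval [0.5, 1.0] to [0, 1], and multiplying by 100 gives a score on a 0 to 100 scale.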
  22. 43.

    Level of Maintenance Activity (LMA)
    [Figure: LMA scale from 0 (predicted as active, but very close to unmaintained) to 100 (very active), with ticks at 0, 25, 50, 75, and 100]
  23. 45.

    Level of Maintenance Activity (LMA)
    Most projects predicted as active are indeed under constant maintenance (median LMA: 82)
  24. 57.

    Conclusion
    We define a metric to express the level of maintenance activity of GitHub projects.