
RapidRelease - A Dataset of Projects and Issues on Github with Rapid Releases


A dataset of 994 GitHub repositories with a release frequency of 5 to 35 days, spanning 17 programming languages and about 2 million issues, presented at MSR 2019 in Montreal, Canada. It is intended to facilitate empirical research on topics such as software quality and evolution in the context of rapid releases.

Sridhar Chimalakonda

May 27, 2019


Transcript

  1. RapidRelease - A Dataset of Projects and Issues on Github with Rapid Releases

    Saket Dattatray Joshi, Sridhar Chimalakonda
    Indian Institute of Technology, Tirupati, India
    [email protected]
  2. •  23 IITs

    •  Admission acceptance rate of < 1%
    •  Approx. 11,000 students get in out of 1.3 million aspirants
  3. Motivation from industry

    Source: 13th Annual State of Agile Report, CollabNet VersionOne, May 2019
    https://www.stateofagile.com/#ufh-c-473508-state-of-agile-report
    Do they?
  4. Release Engineering

    A fundamental principle
    •  “Release early, release often”
    •  “early and continuous delivery of software”
    Months?! Weeks?! Days? Hours?!
    Raymond, E. (1999). The cathedral and the bazaar. Knowledge, Technology & Policy, 12(3), 23-49.
    Fowler, M., & Highsmith, J. (2001). The agile manifesto. Software Development, 9(8), 28-35.
  5. Motivation from literature

    “The main take-home message is that, while release engineering technology has flourished tremendously due to industry, empirical validation of best practices and the impact of the release engineering process on (amongst others) software quality is largely missing and provides major research opportunities.”
    Adams, B., & McIntosh, S. (2016, March). Modern release engineering in a nutshell: why researchers should care. In 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER) (Vol. 5, pp. 78-90). IEEE.

    “Although many blogs and papers and some books discuss release engineering for large cloud applications and (to some extent) mobile apps, no thorough treatment exists of today's challenges and solutions for release engineering of the “other 80 percent” of software systems.”
    Adams, B., Bellomo, S., Bird, C., Debić, B., Khomh, F., Moir, K., & O’Duinn, J. (2018). Release Engineering 3.0. IEEE Software, 35(2), 22-25.

    “The analysis revealed that 33 out of 71 primary studies were casual experience reports that had neither an explicit research method nor a data collection approach specified, and 23 out of 38 empirical studies applied qualitative methods, such as interviews, among practitioners. Additionally, 12 studies applied quantitative methods, such as mining of software repositories. Only three empirical studies combined these research approaches”
    Karvonen, T., Behutiye, W., Oivo, M., & Kuvaja, P. (2017). Systematic literature review on the impacts of agile release engineering practices. Information and Software Technology, 86, 87-100.
  6. Two specific examples

    •  Khomh, F., Dhaliwal, T., Zou, Y., & Adams, B. (2012, June). Do faster releases improve software quality?: an empirical case study of Mozilla Firefox. In Proceedings of the 9th IEEE Working Conference on Mining Software Repositories (pp. 179-188). IEEE Press.
    •  da Costa, D. A., McIntosh, S., Treude, C., Kulesza, U., & Hassan, A. E. (2018). The impact of rapid release cycles on the integration delay of fixed issues. Empirical Software Engineering, 1-70.
  7. Rapid Release Dataset

    •  The RapidRelease dataset hosts 994 high-release-frequency, open-source GitHub projects with over 2 million issues

    | Metric | Value |
    |---|---|
    | Number of projects | 994 |
    | Avg number of issues | 2,365 |
    | Avg contributors | 112 |
    | Avg distinct releases | 48 |
    | Avg total time | 976 |
    | Mean time between releases | 22 |
    | Number of issue reports | 2,351,072 |
  8. 17 programming languages

    | Language | Number of repositories | Number of issues |
    |---|---|---|
    | C | 48 | 99,336 |
    | Clojure | 3 | 711 |
    | Java | 69 | 147,551 |
    | Scala | 40 | 50,638 |
    | Python | 70 | 192,138 |
    | Swift | 50 | 58,065 |
    | Javascript | 196 | 568,405 |
    | Viml | 2 | 3,424 |
    | C++ | 67 | 208,357 |
    | Perl | 2 | 6,010 |
    | Lua | 12 | 15,472 |
    | Objective-C | 23 | 42,350 |
    | R | 0 | 0 |
    | Haskell | 7 | 6,163 |
    | C# | 86 | 177,908 |
    | Go | 157 | 379,438 |
    | PHP | 123 | 298,902 |
    | Ruby | 39 | 96,204 |
    | Total | 994 | 2,351,072 |
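As a quick consistency check, the per-language counts above can be summed and compared with the reported totals. This is only a sketch: the numbers are copied from the slide, and the names `COUNTS` and `totals` are illustrative, not part of the dataset's tooling.

```python
# Per-language (repositories, issues) counts, copied from the slide.
COUNTS = {
    "C": (48, 99336), "Clojure": (3, 711), "Java": (69, 147551),
    "Scala": (40, 50638), "Python": (70, 192138), "Swift": (50, 58065),
    "Javascript": (196, 568405), "Viml": (2, 3424), "C++": (67, 208357),
    "Perl": (2, 6010), "Lua": (12, 15472), "Objective-C": (23, 42350),
    "R": (0, 0), "Haskell": (7, 6163), "C#": (86, 177908),
    "Go": (157, 379438), "PHP": (123, 298902), "Ruby": (39, 96204),
}


def totals(counts):
    """Return (total repositories, total issues) over all languages."""
    repos = sum(r for r, _ in counts.values())
    issues = sum(i for _, i in counts.values())
    return repos, issues


repos, issues = totals(COUNTS)
# Languages that contribute at least one repository to the final dataset.
languages_with_data = sum(1 for r, _ in COUNTS.values() if r > 0)
avg_issues = issues / repos

print(repos, issues, languages_with_data, round(avg_issues))
# prints: 994 2351072 17 2365
```

The sums agree with the reported totals. R contributes no repositories, which is consistent with 18 languages being searched and 17 appearing in the final dataset.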
  9. Dataset Construction Process

    STAGE I - Extracting projects from GitHub
    •  Using the GitHub API, we search for repositories in the top 18 programming languages, with filters, to form a base candidate list of 11,980 repositories.
    •  Filter out repositories with fewer than 5 releases to get an intermediary set R1 of repositories.
    STAGE II - Data Cleansing
    •  Releases within 2 days of each other are labelled as non-distinct. Repositories below the 1st quartile of distinct releases and contributors are discarded.
    •  RapidRelease criterion - retain repositories with a mean release time between 5 and 35 days.
    STAGE III - Mining GitHub
    •  Mine all data, most prominently issue reports.
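The release-based filters above can be sketched roughly as follows. This is not the authors' implementation (see the linked code repository for that). It assumes that a release is non-distinct when it falls within 2 days of the previously kept release, that the bounds are inclusive, and that the mean is taken over gaps between distinct releases. The quartile-based filter is omitted because it needs the whole candidate population. All function names are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean

MIN_GAP = timedelta(days=2)  # assumption: gaps of 2 days or less are non-distinct


def distinct_releases(release_dates):
    """Keep a release only if it is more than MIN_GAP after the last kept one."""
    kept = []
    for date in sorted(release_dates):
        if not kept or date - kept[-1] > MIN_GAP:
            kept.append(date)
    return kept


def mean_gap_days(dates):
    """Mean number of days between consecutive dates, or None if fewer than two."""
    gaps = [(b - a).total_seconds() / 86400 for a, b in zip(dates, dates[1:])]
    return mean(gaps) if gaps else None


def is_rapid_release(release_dates, min_releases=5, low=5, high=35):
    """Apply the minimum-release filter, then the 5 to 35 day mean-gap criterion."""
    if len(release_dates) < min_releases:
        return False
    gap = mean_gap_days(distinct_releases(release_dates))
    return gap is not None and low <= gap <= high


start = datetime(2019, 1, 1)
every_10_days = [start + timedelta(days=10 * i) for i in range(6)]
print(is_rapid_release(every_10_days))  # True
```

A repository that releases daily would be rejected under these assumptions, since most of its releases collapse into non-distinct ones and the remaining mean gap falls below 5 days.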
  10. How to use the dataset?

    •  Release engineering
        – What changes in between releases? Code, issues?
        – How many and what type of issues are raised and resolved?
        – How long does it take for an issue to be resolved, integrated and released?
    •  Agile software development - Exploring the viability and effects of a fast development and release cycle model in OSD.
    •  Software evolution - How does software evolve in terms of code quality, change metrics, architecture and documentation in the context of rapid releases?
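One of the questions above, how long an issue takes to be resolved, could be approached along these lines. This is a minimal sketch that assumes each issue record carries `created_at` and `closed_at` timestamps in the ISO 8601 form used by the GitHub API. The actual layout of the published dataset may differ, and the helper names are hypothetical.

```python
from datetime import datetime
from statistics import median

# Assumed timestamp format, e.g. "2019-01-01T00:00:00Z".
FMT = "%Y-%m-%dT%H:%M:%SZ"


def resolution_days(issue):
    """Days between opening and closing an issue, or None if it is still open."""
    if not issue.get("closed_at"):
        return None
    opened = datetime.strptime(issue["created_at"], FMT)
    closed = datetime.strptime(issue["closed_at"], FMT)
    return (closed - opened).total_seconds() / 86400


def median_resolution_days(issues):
    """Median resolution time over closed issues, or None if none are closed."""
    days = [d for d in map(resolution_days, issues) if d is not None]
    return median(days) if days else None


# Made-up records for illustration only, not taken from the dataset.
sample = [
    {"created_at": "2019-01-01T00:00:00Z", "closed_at": "2019-01-02T00:00:00Z"},
    {"created_at": "2019-01-01T00:00:00Z", "closed_at": "2019-01-04T00:00:00Z"},
    {"created_at": "2019-01-01T00:00:00Z", "closed_at": None},
]
print(median_resolution_days(sample))  # 2.0
```

The median is used here rather than the mean because issue lifetimes tend to be skewed by a few very long-lived issues; either statistic could be swapped in.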
  11. Limitations & Future Work

    •  Extend the dataset by incorporating other data, such as pull requests and issue comments (now it is metadata/external)
    •  Methodology could include issues from external issue trackers along with public issues available on GitHub
    •  The dataset can be augmented to ensure conformity to further agile principles along with the rapid release criteria
  12. Code: https://github.com/saketrule/RapidRelease

    Dataset: https://zenodo.org/record/2561335
    Rapid Release, Software Quality, Software Evolution, Issues/Triaging, Pull Requests, Builds
    Comments & Collaborations: [email protected]
    994 repos, 17 languages, 2 million issues, release frequency of 5-35 days