Predicting the Popularity of GitHub Repositories (PROMISE 2016)

Hudson Borges, Andre Hora, Marco Tulio Valente
{hsborges, hora, mtov}@dcc.ufmg.br

Introduction
• 15M users
• 36M repositories

Social Coding Features

Our Goal
• Goal
  ◦ Predict the popularity of GitHub repositories
• Why?
  ◦ Signal of stagnation
  ◦ Comparison

Research Questions
1. What is the accuracy of the generic prediction models?
2. What is the accuracy of the specific prediction models?
3. What is the accuracy of the repository ranking predicted using the generic and specific models?

Data Collection
• April 2016
• Top-5,000 repositories
  ◦ Stars, creation date, language, etc.
• Stars historical data
  ◦ User and date
• Official GitHub API

Data Collection - Filter
• Removed:
  ➖ More than 40K stars
  ➖ No programming language
  ➖ Less than 52 weeks
• Total of 4,248 repositories
• Top and bottom repositories:
  ◦ jquery/jquery ➡ 39,149 stars
  ◦ mikeflynn/egg.js ➡ 1,248 stars
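
The three removal rules above can be sketched as a simple predicate. This is only an illustration, not the authors' script: the `Repo` record, its field names, and the two `example/...` repositories are hypothetical, and the `age_weeks` values are made up (only the star counts of jquery/jquery and mikeflynn/egg.js come from the slide).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Repo:
    # Hypothetical record; field names are assumptions, not GitHub API names.
    name: str
    stars: int
    language: Optional[str]
    age_weeks: int


def keep(repo: Repo, max_stars: int = 40_000, min_weeks: int = 52) -> bool:
    """Apply the slide's filters: drop repositories with more than 40K stars,
    without a programming language, or with less than 52 weeks of history."""
    return (
        repo.stars <= max_stars
        and repo.language is not None
        and repo.age_weeks >= min_weeks
    )


repos = [
    Repo("jquery/jquery", 39_149, "JavaScript", 300),   # age is made up
    Repo("mikeflynn/egg.js", 1_248, "JavaScript", 100),  # age is made up
    Repo("example/too-popular", 50_000, "C", 200),       # hypothetical
    Repo("example/too-young", 2_000, "Go", 30),          # hypothetical
]
kept = [r for r in repos if keep(r)]
print([r.name for r in kept])
```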

Examples of Time Series

Prediction Technique
• Multiple Linear Regression
• Where:
  ◦ Yt → Predicted number of stars at week t
  ◦ bj → Regression coefficients
  ◦ Xj → Stars at week j (0 ≤ j ≤ r < t)
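
A minimal sketch of such a model, fitted by ordinary least squares with NumPy. The function names are hypothetical, the intercept column is an assumption (the transcript does not show the equation), and the data is synthetic, so this illustrates the technique rather than the authors' implementation.

```python
import numpy as np


def fit_star_model(X, y):
    """Fit Y_t from the stars at weeks 0..r by least squares.

    X has shape (n_repos, r + 1): one column per observed week.
    y has shape (n_repos,): stars at the target week t.
    An intercept column is added; that is an assumption of this sketch.
    """
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_stars(coef, X):
    """Apply fitted coefficients to new early-week observations."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef


# Synthetic data: 50 repositories, cumulative stars over 4 observed weeks.
rng = np.random.default_rng(0)
X = np.cumsum(rng.integers(1, 50, size=(50, 4)), axis=1).astype(float)
y = 10.0 + 2.0 * X[:, -1]  # target is exactly linear in the last observed week

coef = fit_star_model(X, y)
pred = predict_stars(coef, X)
print(np.round(pred[:3], 2), y[:3])
```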

Study Design
• Multiple Linear Regression
  ◦ t = 52
• Relative Squared Error (RSE)
• 10-fold cross-validation
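
The slide names the error metric and the validation scheme without defining them. The sketch below assumes the per-repository form RSE = (predicted / actual - 1)^2, which is common in popularity-prediction studies, and averages it into a mean RSE; the helper names and the fold-splitting code are illustrative, not taken from the study.

```python
import numpy as np


def rse(predicted, actual):
    """Relative squared error of one prediction (assumed definition)."""
    return (predicted / actual - 1.0) ** 2


def mrse(predicted, actual):
    """Mean RSE over a set of repositories."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean(rse(predicted, actual)))


def kfold_indices(n, k=10, seed=0):
    """Yield (train, test) index arrays for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test


print(mrse([110, 90], [100, 100]))
splits = list(kfold_indices(25, k=10))
```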

Prediction models
• Generic models
  ◦ Produced from the complete dataset
• Specific models
  ◦ Produced from repositories with similar growth trends

Specific prediction models
• K-Spectral Centroid clustering algorithm
  ◦ Clusters time series with similar shapes
  ◦ Invariant to scaling and shifting
• Cluster growth trends (figure labels): Linear, Nonlinear
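
To illustrate what "invariant to scaling and shifting" means, here is a sketch of a distance of the kind K-Spectral Centroid relies on: for each allowed shift, the second series is optimally rescaled before the residual is measured. This covers only the distance, not the clustering itself; the function names, the zero-padded shift, and the `max_shift` parameter are assumptions of this sketch.

```python
import numpy as np


def shift(y, q):
    """Shift a series by q positions, padding with zeros."""
    out = np.zeros_like(y)
    if q > 0:
        out[q:] = y[:-q]
    elif q < 0:
        out[:q] = y[-q:]
    else:
        out[:] = y
    return out


def scale_shift_distance(x, y, max_shift=0):
    """min over shift q and scale alpha of ||x - alpha * y_q|| / ||x||."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    best = np.inf
    for q in range(-max_shift, max_shift + 1):
        yq = shift(y, q)
        denom = yq @ yq
        # Closed-form optimal scale for a fixed shift.
        alpha = (x @ yq) / denom if denom > 0 else 0.0
        best = min(best, np.linalg.norm(x - alpha * yq) / np.linalg.norm(x))
    return float(best)


same_shape = scale_shift_distance([1, 2, 3, 4], [10, 20, 30, 40])
shifted = scale_shift_distance([0, 1, 2, 3, 0], [5, 10, 15, 0, 0], max_shift=1)
different = scale_shift_distance([1, 2, 3, 4], [4, 3, 2, 1])
print(same_shape, shifted, different)
```

Two series with the same shape but different magnitudes, or the same shape offset by a week, get a distance near zero, which is why repositories of very different sizes can land in the same cluster.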

RQ #1
• What is the accuracy of the generic prediction models?
• Different training windows (prediction week = 52)
• Mean RSE by training window:
  ◦ 26 weeks of training, 26 weeks ahead ➡ 0.43
  ◦ 20 weeks of training, 32 weeks ahead ➡ 1.04
  ◦ 13 weeks of training, 39 weeks ahead ➡ 4.03
  ◦ 8 weeks of training, 44 weeks ahead ➡ 9.62

RQ #1. Prediction Examples
• Stars: 565, Predicted: 667, Difference: +18.05%
• Stars: 18,443, Predicted: 19,373, Difference: +5.04%
• Stars: 6,160, Predicted: 5,369, Difference: -12.84%
• Stars: 9,919, Predicted: 10,082, Difference: +1.64%
• Stars: 584, Predicted: 793, Difference: +35.79%
• Stars: 890, Predicted: 712, Difference: -20.00%
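
The "Difference" values above are consistent with a signed relative error, (predicted - actual) / actual. A short check using the six examples from the slide (the function name is hypothetical):

```python
def percent_difference(actual, predicted):
    """Signed relative error of a prediction, in percent."""
    return (predicted - actual) / actual * 100.0


# (actual stars, predicted stars) pairs taken from the slide.
examples = [
    (565, 667),
    (18_443, 19_373),
    (6_160, 5_369),
    (9_919, 10_082),
    (584, 793),
    (890, 712),
]
differences = [round(percent_difference(a, p), 2) for a, p in examples]
for (a, p), d in zip(examples, differences):
    print(f"{a:>6,} -> {p:>6,}: {d:+.2f}%")
```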

RQ #2
• What is the accuracy of the specific prediction models?
➕ Less data to produce reliable predictions
• mRSE at 26 weeks:
  ◦ Generic model ➡ 0.4
  ◦ Cluster 1 ➡ 0.03
  ◦ Cluster 2 ➡ 0.03
  ◦ Cluster 3 ➡ 0.09
  ◦ Cluster 4 ➡ 0.27

RQ #2. Improvement Median
• Cluster 1 ➡ 15.72%
• Cluster 2 ➡ 1.08%
• Cluster 3 ➡ 2%
• Cluster 4 ➡ 6.66%

RQ #2. Prediction Examples
• Stars: 565, Predicted: 583, Improve: +82.35%
• Stars: 18,443, Predicted: 18,432, Improve: +4.98%
• Stars: 6,160, Predicted: 5,578, Improve: +3.39%
• Stars: 9,919, Predicted: 10,571, Improve: -4.93%
• Stars: 584, Predicted: 523, Improve: +25.34%
• Stars: 890, Predicted: 853, Improve: +14.29%

RQ #3
• What is the accuracy of the repository ranking predicted using the generic and specific models?
• Dataset ➡ Build and Predict (26 weeks ➡ 26 weeks)
  ◦ Predictions 1 (Generic)
  ◦ Predictions 2 (Specific)
• Sort repositories by number of stars
  ◦ Ranking 1 (Generic)
  ◦ Ranking 2 (Specific)
  ◦ Ranking 3 (GitHub - April, 2016)

RQ #3. Rank prediction
• Generic models
➖ Predicted = Real
• Repo: googlesamples/android-testing-templates
  ◦ Real rank: 4,681
  ◦ Predicted rank: 1,188

RQ #3. Correlation test
• Specific models present slightly better results in all cases
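
The slides do not say which correlation coefficient was used. As a sketch, assuming a Spearman rank correlation, repositories can be ranked by star count and the predicted ranking compared with the real one. The function names are hypothetical; the numbers reuse the six RQ #1 prediction examples purely for illustration.

```python
import numpy as np


def rank_positions(stars):
    """Rank 1 = most stars. Ties are broken by input order (stable sort)."""
    order = np.argsort(-np.asarray(stars, dtype=float), kind="stable")
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(1, len(order) + 1)
    return ranks


def spearman_rho(a, b):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(rank_positions(a), rank_positions(b))[0, 1])


real = [18_443, 9_919, 6_160, 890, 584, 565]
predicted = [19_373, 10_082, 5_369, 712, 793, 667]

predicted_ranks = rank_positions(predicted)
rho = spearman_rho(real, predicted)
print(predicted_ranks, round(rho, 4))
```

A value near 1 means the predicted ordering closely matches the real one, even when individual star counts are off.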

Conclusion
Generic Models - RQ #1
➕ 6 months ➡ 6 months
➕ Highly popular repositories

Conclusion
Specific Models - RQ #2
➕ Prediction error reduction
➕ Less data to predict
➕ Common growth trend

Conclusion
Rank Prediction - RQ #3
➕ Accurate results
➖ Tend to overestimate

Conclusion
Future Work
➕ Different prediction approaches
➕ Other measures (e.g., forks)

Thank you!

ASERG, DCC, UFMG