Workload-Aware Reviewer Recommendation using a Multi-objective Search-Based Approach

Wisam Haitham Abbood Al-Zubaidi, Patanamon (Pick) Thongtanunam, Hoa Khanh Dam, Chakkrit (Kla) Tantithamthavorn, Aditya Ghose
@patanamon

Code Review: A method to improve the overall quality of a patch through manual examination
An author submits a patch to a code review tool (e.g., Gerrit), where reviewers examine it. For example:
Reviewer (identifying a defect): "Shouldn't console.log() call the toString() method (where appropriate) on objects?"
Reviewer (suggesting a solution): "I think it's better to do var s = "{}"; console.log(s)"
Effective code review requires active participation [Balachandran ICSE2013; Rigby and Storey ICSE2011].
A patch tends to be less defective when it was reviewed and discussed extensively by many reviewers [Thongtanunam et al. MSR2015; Kononenko et al. ICSME2015].
Finding suitable reviewers is not a trivial task [Thongtanunam et al. SANER2015].

Several Reviewer Recommendation Approaches have been Developed to Improve the Code Review Process
Expertise/Experience-based Approaches: Finding reviewers who reviewed many similar patches in the past [Balachandran ICSE2013, Thongtanunam et al. SANER2015, Zanjani et al. TSE2016, Xia et al. ICSME2016].
Exp. + Past Collaboration Approaches: Finding reviewers who often worked with the author in the past [Yu et al. ICSME2014, Ouni et al. IST2017].
However, requesting only experts or active reviewers for a review could potentially burden them.
Invited reviewers often consider their workload when accepting new invitations [Ruangwan et al. EMSE 2019].
At Google, review tasks are assigned in a round-robin manner [Sadowski et al. ICSE 2018].

WLRRec: Workload-aware Reviewer Recommendation
For a new patch, WLRRec measures reviewer metrics and feeds them to a multi-objective evolutionary search (NSGA-II).
Obj 1 (Experience & Activeness, Past Collaboration): Maximize the chance of participating in a review.
Obj 2 (Workload): Minimize the skewness of the workload.
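The two objectives pull in different directions, so a search such as NSGA-II compares candidates by Pareto dominance rather than by a single score. The following is a minimal sketch of that dominance check only, not of NSGA-II itself; the function names and the sample points are hypothetical and chosen purely for illustration.

```python
from typing import List, Tuple

# A candidate is scored as (obj1, obj2): obj1 is maximized, obj2 is minimized.
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """Return True if a is no worse than b on both objectives and better on one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def non_dominated(points: List[Point]) -> List[Point]:
    """Keep only the candidates that no other candidate dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical scores for four candidate reviewer sets.
candidates = [(0.9, -0.4), (0.8, -0.8), (0.5, -0.9), (0.4, -0.5)]
front = non_dominated(candidates)
print(front)
```

In this made-up example the last candidate is dropped because (0.8, -0.8) is better on both objectives; the remaining three trade one objective against the other.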

WLRRec Uses 4+1 Key Reviewer Metrics
Experience & Activeness:
  Code Ownership: %Commits authored
  Reviewing Experience: %Patches reviewed
  Review Participation Rate: %Invitations accepted
Past Collaboration:
  Familiarity with the Patch Author: Co-reviewing frequency
Workload:
  Remaining Reviews: #Pending review requests
Fitness func. for Obj 1: Weighted Summation. Identify reviewers with maximum experience, activeness and past collaboration.
Fitness func. for Obj 2: Shannon's Entropy. Identify reviewers with minimally skewed workload.

WLRRec identifies reviewers with maximum experience, activeness, and past collaboration (Obj 1)
Example Fitness func. for Obj. 1
                        Pick         Hoa         Kla         Aditya
Code Ownership          CO_Pick      CO_Hoa      CO_Kla      CO_Aditya
Rev. Experience         RE_Pick      RE_Hoa      RE_Kla      RE_Aditya
Rev. Participation      RP_Pick      RP_Hoa      RP_Kla      RP_Aditya
Fam. w/ Patch Author    FP_Pick      FP_Hoa      FP_Kla      FP_Aditya
Weighted Sum            Score_Pick   Score_Hoa   Score_Kla   Score_Aditya
For a solution candidate that selects Pick and Kla, the Objective 1 score is Score_Pick + Score_Kla.
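The weighted-sum example above can be sketched in a few lines. The slides give only symbols (CO, RE, RP, FP), so every metric value below is made up, the equal weights are an assumption, and the function names are hypothetical.

```python
from typing import Dict, Iterable

# Hypothetical metric values in [0, 1]; the slides show symbols only.
METRICS: Dict[str, Dict[str, float]] = {
    "Pick":   {"CO": 0.40, "RE": 0.50, "RP": 0.80, "FP": 0.30},
    "Hoa":    {"CO": 0.10, "RE": 0.20, "RP": 0.60, "FP": 0.10},
    "Kla":    {"CO": 0.30, "RE": 0.20, "RP": 0.70, "FP": 0.40},
    "Aditya": {"CO": 0.20, "RE": 0.10, "RP": 0.50, "FP": 0.20},
}

# Assumed equal weights; the slides do not state the actual weights.
WEIGHTS: Dict[str, float] = {"CO": 0.25, "RE": 0.25, "RP": 0.25, "FP": 0.25}


def reviewer_score(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of one reviewer's four metrics."""
    return sum(weights[name] * metrics[name] for name in weights)


def objective1(candidate: Iterable[str]) -> float:
    """Objective 1 score: sum of the scores of the selected reviewers."""
    return sum(reviewer_score(METRICS[r], WEIGHTS) for r in candidate)


# Solution candidate from the slide: Pick and Kla are selected.
print(round(objective1(["Pick", "Kla"]), 2))
```

With these invented numbers, Score_Pick is 0.5 and Score_Kla is 0.4, so the candidate scores 0.9.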

WLRRec identifies reviewers with minimally skewed workload (Obj 2)
Example Fitness func. for Obj. 2
For a solution candidate, the total workload is derived from each reviewer's #Pending Review Requests, and the Objective 2 score is computed with Shannon's entropy:
$$\frac{1}{\log_2 4}\left(\frac{5}{10}\log_2\frac{5}{10} + 2\times\frac{1}{10}\log_2\frac{1}{10} + \frac{2}{10}\log_2\frac{2}{10}\right) = -0.81$$
The lower the score, the less skewed the workload (the better the distribution of workload).
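The entropy example above can be checked numerically. This is a minimal sketch, assuming the workload shares are exactly the fractions shown on the slide (5/10, 1/10, 1/10, 2/10) and that the normalizer is log2 of the number of reviewers (4); the function name is hypothetical.

```python
import math
from typing import Sequence


def workload_skew_score(shares: Sequence[float], n_reviewers: int) -> float:
    """Compute (1 / log2 n) * sum(p * log2 p), as written on the slide.

    Shares equal to zero are skipped, since p * log2(p) tends to 0 as p -> 0.
    """
    total = sum(p * math.log2(p) for p in shares if p > 0)
    return total / math.log2(n_reviewers)


# Shares copied from the slide as they appear there.
shares = [5 / 10, 1 / 10, 1 / 10, 2 / 10]
score = workload_skew_score(shares, n_reviewers=4)
print(round(score, 2))  # -0.81
```

A perfectly even split across four reviewers (0.25 each) gives -1.0 under the same formula, which is consistent with lower scores meaning a less skewed workload.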

WLRRec selects the solution that is closest to the reference point
NSGA-II generates Pareto optimal solutions of selected reviewers (S1, S2, S3, S4).
[Figure: solutions S1 to S4 plotted against Objective 1 (maximize chance of participating in a review) and Objective 2 (minimize skewness of the workload distribution), with the reference point and each solution's distance to it marked.]
The Knee Point Approach: Measure the distance between each solution and the reference point, then select S3 as it has the closest distance.
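The selection step can be illustrated with a short sketch. The slides do not give coordinates, the distance metric, or how the reference point is defined, so the values below are hypothetical, Euclidean distance is an assumption, and the reference point is assumed to combine the best value seen on each objective.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]  # (obj1 to maximize, obj2 to minimize)


def closest_to_reference(solutions: Dict[str, Point], reference: Point) -> str:
    """Return the name of the solution with the smallest Euclidean distance."""
    return min(solutions, key=lambda name: math.dist(solutions[name], reference))


# Hypothetical Pareto-optimal solutions; the numbers are made up.
pareto = {
    "S1": (0.20, -0.95),
    "S2": (0.50, -0.90),
    "S3": (0.80, -0.85),
    "S4": (0.95, -0.40),
}

# Assumed reference point: best obj1 and best obj2 over the front.
reference = (
    max(p[0] for p in pareto.values()),
    min(p[1] for p in pareto.values()),
)

print(closest_to_reference(pareto, reference))  # S3
```

The coordinates were picked so that S3 comes out closest, mirroring the slide; with real data the chosen solution depends on the shape of the front and on how the objectives are scaled.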

How well can our WLRRec (a multi-objective approach) recommend reviewers for a newly-submitted patch?
Datasets: 36K patches, 2K reviewers; 65K patches, 1.2K reviewers; 108K patches, 3.7K reviewers; 19K patches, 410 reviewers.
Investigation:
  Single-Objective vs. Multiple-Objective: Genetic Algorithm (GA) with Obj 1 (maximize chance of participating in a review); Genetic Algorithm (GA) with Obj 2 (minimize the skewed workload).
  NSGA-II vs. Other Multi-Objective Algorithms: Multi-Objective Cellular Genetic Algorithm (MOCell); Strength-based Evolutionary Algorithm (SPEA2).
Performance Measures: Precision, Recall, F-Measure, Hypervolume.
$$\%\mathrm{Gain} = \frac{\mathrm{WLRRec}_{pm} - Y_{pm}}{Y_{pm}}$$
where pm = performance measure and Y = alternative approach.
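The %Gain formula above is simple enough to sketch directly. The multiplication by 100 (to report a percentage) is an assumption, as are the function name and the sample values, which are not results from the study.

```python
def percent_gain(wlrrec_value: float, alternative_value: float) -> float:
    """Relative improvement of WLRRec over an alternative, in percent."""
    if alternative_value == 0:
        raise ValueError("alternative_value must be non-zero")
    return (wlrrec_value - alternative_value) / alternative_value * 100.0


# Hypothetical precision values, for illustration only.
print(percent_gain(0.50, 0.25))  # 100.0
```

A gain of 100% therefore means WLRRec's value on that measure is twice the alternative's.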

Our WLRRec outperforms the single-objective approaches
[Figure: two bar charts of %Gain in Precision, Recall, and F-Measure, one for WLRRec vs GA-Obj1 and one for WLRRec vs GA-Obj2.]
WLRRec achieves 88%-142% higher precision and 111%-178% higher recall than GA-Obj1.
WLRRec achieves 55%-101% higher precision and 96%-138% higher recall than GA-Obj2.
Considering multiple objectives at the same time allows us to better find reviewers.

Our WLRRec with NSGA-II is better than the other two multi-objective approaches
[Figure: two bar charts of %Gain in Precision, Recall, F-Measure, and Hypervolume, one for WLRRec with NSGA-II vs MOCell and one for WLRRec with NSGA-II vs SPEA2.]
WLRRec achieves 31%-95% higher F-measure and 21%-31% higher hypervolume than MOCell.
WLRRec achieves 19%-95% higher F-measure and 29%-47% higher hypervolume than SPEA2.
The NSGA-II algorithm leveraged by our WLRRec is an appropriate multi-objective approach to find solutions in this problem domain.

Our WLRRec outperforms the four alternative approaches (GA-Obj1, GA-Obj2, MOCell, and SPEA2).
Our work highlights the potential of leveraging a multi-objective algorithm that considers review workload and other important information to find reviewers.
@patanamon http://patanamon.com
