Code review at speed: How can we use data to help developers do code review faster?
Abstract: Code review has become a mandatory practice in the software engineering workflow of many software organizations. While code review has been shown to provide numerous benefits to software teams, it is also considered expensive. Due to its manual and human-intensive nature, code review can delay the software development workflow. In this talk, I will summarize some of our work and that of others on how we can help developers perform code reviews effectively, and what remaining challenges in code review still need to be addressed.

The talk was given at the MSR2021 Mini-keynote https://2021.msrconf.org/track/msr-2021-keynotes
The short video presentation is available at https://youtu.be/K9BfndLKDyY

Transcript

  1. Patanamon (Pick) Thongtanunam patanamon.t@unimelb.edu.au @patanamon ARC DECRA & Lecturer at

    School of Computing and Information Systems (CIS) http://patanamon.com Code Review at Speed: How can we use data to help developers do code review faster? 1
  2. Create tasks Write code Build & test code Integrate Release/

    Deploy A General View of Continuous Integration Code Review: A QA practice that manually examines a new code change Code Review A bug is here… Improve the overall quality of software systems [Thongtanunam et al. 2015, McIntosh et al. 2016] Increase team awareness, transfer knowledge & share code ownership [Bacchelli and Bird 2013, Thongtanunam et al. 2016, Sadowski et al. 2018] 2
  3. Code Review: A QA practice that manually examines a new

    code change An author A code change (1) Uploading the change (2) Inviting reviewers Reviewers (3) Examining the change (4) Automated testing The approved change is integrated into the software system Rejected Accepted Fail Changes are abandoned A revision is required Pass A collaborative code review tool 3
  4. A large number of new code changes can pose challenges

    to perform effective code reviews 100 - 1,000 reviews were performed in a month, and each review took 1 day on average [Rigby and Bird, 2013] ~600 ~400 ~550 #Reviews/ month [Thongtanunam and Hassan, 2020] 4
  5. A large number of new code changes can pose challenges

    to perform effective code reviews Non-responding invited reviewers [Ruangwan et al, 2019] Challenges: 5
  6. Reviewers may not respond to the review invitation An author

    A code change (1) Uploading the change (2) Inviting reviewers Reviewers (3) Examining the change (4) Automated testing The approved change is integrated into the software system Rejected Accepted Fail Changes are abandoned A revision is required Pass A collaborative code review tool 6
  7. 16% - 66% of the studied code changes have at

    least one invited reviewer who did not respond to the invitation %Non-responding invited reviewers in a patch The more the reviewers that were invited, the higher the chance of having a non-responding reviewer 7
  8. Investigating the factors that can be associated with the participation

    decision Experience & Activeness Past Collaboration Workload 13 studied metrics RespondInvitation ∼ x1 + x2 + ….. + xn Use a non-linear logistic regression model Analyze the relationship with the likelihood of responding to the invitation Code Ownership %Commits authored Reviewing Experience %Patches reviewed Review Participation Rate %Invitations Accepted Familiarity with the Patch Author Co-reviewing Freq. Remaining Reviews #Pending Review Requests 5 Significant factors: 8
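The studied factors above are computed from a project's review history. As a minimal sketch (the data layout, names, and the subset of three metrics here are illustrative assumptions, not the study's actual implementation):

```python
# Hypothetical sketch of computing three of the significant factors from
# [Ruangwan et al, 2019]. Data structures are made up for illustration.

def reviewer_metrics(reviewer, commits, invitations, pending):
    """commits: list of commit author names for the changed files' history.
    invitations: list of (invitee, responded) pairs from past reviews.
    pending: dict mapping reviewer -> #pending review requests."""
    my_invites = [resp for invitee, resp in invitations if invitee == reviewer]
    return {
        # Code Ownership: %commits authored by this reviewer
        "code_ownership": sum(a == reviewer for a in commits) / len(commits),
        # Review Participation Rate: %invitations the reviewer responded to
        "participation_rate": sum(my_invites) / len(my_invites),
        # Remaining Reviews: #pending review requests (workload)
        "remaining_reviews": pending.get(reviewer, 0),
    }
```

Each metric is then a candidate explanatory variable (one of the x1..xn) in the logistic regression model above.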
  9. A large number of new code changes can pose challenges

    to perform effective code reviews Non-responding invited reviewers [Ruangwan et al, 2019] Workload-aware Reviewer recommendation [Al-Zubaidi et al, 2020] Challenges: (Possible) Solutions: 9
  10. WLRRec: Workload-aware Reviewer Recommendation A multi-objective evolutionary search (NSGA-II) Experience

    & Activeness Past Collaboration Obj 1: Maximize the chance of participating in a review Workload Obj 2: Minimize the skewness of the workload Measure Reviewer Metrics A new code change 10
  11. WLRRec Uses 4+1 Key Reviewer Metrics Experience & Activeness Past

    Collaboration Workload Code Ownership %Commits authored Reviewing Experience %Patches reviewed Review Participation Rate %Invitations Accepted Familiarity with the Patch Author Co-reviewing Freq. Remaining Reviews #Pending Review Requests Fitness func. for Obj 1: Weighted Summation Identify reviewers with maximum experience, activeness and past collaboration Fitness func. for Obj 2: Shannon's Entropy Identify reviewers with a minimally skewed workload 11
  12. WLRRec identifies reviewers with maximum experience, activeness, and past collaboration (Obj

    1) Example Fitness func. for Obj. 1 Code Ownership COPick COHoa COKla COAditya Rev. Experience REPick REHoa REKla REAditya Rev. Participate RPPick RPHoa RPKla RPAditya Fam. w/ Patch Author FPPick FPHoa FPKla FPAditya Weighted Sum ScorePick ScoreHoa ScoreKla ScoreAditya Solution Candidate Objective 1 score ScorePick + ScoreKla 12
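As a rough illustration of the weighted-summation fitness for Objective 1 (the weights, metric names, and values below are hypothetical, not WLRRec's actual ones):

```python
# Sketch of Obj 1: score each reviewer in a candidate solution by a
# weighted sum of their experience, activeness, and past-collaboration
# metrics, then sum the scores over the candidate set. Higher is better.

def obj1_fitness(candidate, metrics, weights):
    """Total weighted score of the candidate set of reviewers."""
    return sum(
        sum(weights[k] * metrics[reviewer][k] for k in weights)
        for reviewer in candidate
    )
```

In the slide's example, the objective-1 score of the candidate {Pick, Kla} is ScorePick + ScoreKla, exactly what this function computes.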
  13. #Pending Review Requests Solution Candidate Total Workload Objective 2 score

    (Shannon's entropy) WLRRec identifies reviewers with a minimally skewed workload (Obj 2) Example Fitness func. for Obj. 2: (1/log2 4) · ((5/10)·log2(5/10) + 2·(1/10)·log2(1/10) + (2/10)·log2(2/10)) ≈ −0.81 The lower the score, the less skewed the workload (the better the distribution of workload) 13
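The entropy-based fitness for Objective 2 can be sketched in Python as follows (a simplified reading of the slide's formula; the workload vectors are illustrative):

```python
import math

# Sketch of Obj 2: the workload distribution's normalized Shannon entropy,
# negated so that lower scores mean a more even (less skewed) workload.

def obj2_fitness(pending_reviews):
    """pending_reviews: #pending review requests per candidate reviewer."""
    total = sum(pending_reviews)
    probs = [w / total for w in pending_reviews if w > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    # Normalize by the maximum possible entropy, log2(#reviewers)
    return -entropy / math.log2(len(pending_reviews))
```

A perfectly even workload such as [2, 2, 2, 2] yields the minimum score of -1.0, while a skewed one such as [5, 1, 1, 2] scores higher (closer to 0), so NSGA-II is pushed toward candidates that spread the review load evenly.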
  14. Our WLRRec outperforms the single-objective approaches 0% 45% 90% 135%

    180% Precision Recall F1 0% 35% 70% 105% 140% Precision Recall F1 %Gain WLRRec vs GA-Obj1 Precision Recall F-Measure Precision Recall F-Measure %Gain WLRRec vs GA-Obj2 WLRRec achieves 88%-142% higher precision, 111%-178% higher recall than GA-Obj1 WLRRec achieves 55%-101% higher precision, 96%-138% higher recall than GA-Obj2 Considering multiple objectives at the same time allows us to better find reviewers 14
  15. A large number of new code changes can pose challenges

    to perform effective code reviews Non-responding invited reviewers [Ruangwan et al, 2019] Workload-aware Reviewer recommendation [Al-Zubaidi et al, 2020] Suboptimal reviewing [Thongtanunam and Hassan, 2020; Chouchen et al, 2021] Challenges: (Possible) Solutions: 15
  16. Reviewers may have subconscious biases due to the visible information

    in a code review tool An author A code change (1) Uploading the change (2) Inviting reviewers Reviewers (3) Examining the change (4) Automated testings The approved change is integrated into the software system Rejected Accepted Fail Changes are abandoned A revision is required Pass A collaborative code review tool Code review tools often provide a transparent environment 16
  17. Reviewers may have subconscious biases due to the visible information

    in a code review tool Ahmed usually writes good code 17
  18. Investigating the signals of visible information that are associated with

    the review decision of a reviewer Analyze the relationship with the likelihood of giving a positive vote Use a mixed-effects logistic regression model IsPositiveVote ∼ x1 + x2 + ….. + xn + (1 | ReviewerId ) 8 Studied metrics Relationship Status Prior Feedback Confounding factors: Code change characteristics, e.g., #Added Lines 18
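For intuition, the model form IsPositiveVote ∼ x1 + … + xn + (1 | ReviewerId) combines fixed effects with a per-reviewer random intercept under a logistic link. A toy sketch of how such a model produces a probability (the coefficients here are invented, not the study's fitted values):

```python
import math

def p_positive_vote(x, beta, reviewer_intercept):
    """P(IsPositiveVote = 1) for one review: the reviewer-specific random
    intercept (the (1 | ReviewerId) term) shifts the log-odds before the
    fixed effects x1..xn with coefficients beta are applied."""
    log_odds = reviewer_intercept + sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-log_odds))
```

The random intercept lets each reviewer have their own baseline tendency to vote positively, so the fixed-effect estimates are not distorted by reviewers who are simply more (or less) generous overall.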
  19. In addition to patch characteristics, other visible information is associated

    with the review decision Relationship Status Prior Feedback Patch Characteristics Explanatory Power (Log-likelihood ratio test) Association Direction: Higher %Reviewed past patches for the patch author → more likely to give a positive vote; Higher %Prior positive votes → more likely; Lower %Prior comments → more likely Visible information has a stronger association with the review decision than patch characteristics 19
  20. Other suboptimal reviewing practices also exist in the contemporary code

    review process [Chouchen et al, 2021] Identify anti-patterns in code reviews Manually examine code reviews of 100 code changes Confused reviewers Divergent reviewers Shallow review Toxic review 21% 20% 14% 5% Low review participation 32% 20
  21. A large number of new code changes can pose challenges

    to perform effective code reviews Non-responding invited reviewers [Ruangwan et al, 2019] Workload-aware Reviewer recommendation [Al-Zubaidi et al, 2020] Suboptimal reviewing [Thongtanunam and Hassan, 2020; Chouchen et al, 2021] Line-level defect prediction [Wattanakriengkrai et al, 2020] Challenges: (Possible) Solutions: 21
  22. We find that as little as 1%-3% of the lines

    of code in a file are actually defective after release 9 Studied systems: Activemq Camel Derby Groovy Hbase Hive Jruby Lucene Wicket %Defective files 2-7% 2-8% 6-28% 2-4% 7-11% 6-19% 2-13% 2-8% 2-16% %Defective lines in defective files (at the median) 2% 2% 2% 2% 1% 2% 2% 3% 3% **Defective lines are the source code lines that will be changed by bug-fixing commits to fix post-release defects Only 1%-3% of the lines of code in a file are actually defective Predicting defective lines would potentially save reviewer effort on inspecting code 22
  23. if(closure != null){ Object oldCurrent = current; setClosure(closure, node); closure.call();

    current = oldCurrent; } Identified defect-prone lines oldCurrent current node closure Defective (LIME score > 0) Clean (LIME score < 0) 0.8 0.1 -0.3 -0.7 Ranking tokens based on LIME scores Mapping tokens to lines if(closure != null){ Object oldCurrent = current; setClosure(closure, node); closure.call(); current = oldCurrent; } A model-agnostic technique (LIME) Line-DP: Predicting defective lines using a model-agnostic technique (LIME) A file-level defect prediction model Files of Interest Defect-prone files A model-agnostic technique (LIME) Defect-prone lines 23
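The "mapping tokens to lines" step can be sketched as follows, using the slide's example snippet; the LIME scores below match the slide's example but the tokenization and matching logic are simplified assumptions, not Line-DP's actual implementation:

```python
import re

# LIME scores for the tokens in the slide's example: a positive score
# means the token pushes the file-level model toward "defective".
lime_scores = {"oldCurrent": 0.8, "current": 0.1, "node": -0.3, "closure": -0.7}

lines = [
    "if(closure != null){",
    "Object oldCurrent = current;",
    "setClosure(closure, node);",
    "closure.call();",
    "current = oldCurrent;",
    "}",
]

def defect_prone_lines(lines, scores):
    """Indices of lines containing at least one token with a LIME score > 0."""
    risky = {tok for tok, s in scores.items() if s > 0}
    return [i for i, line in enumerate(lines)
            if risky & set(re.findall(r"\w+", line))]
```

With these scores, only the lines containing `oldCurrent` or `current` as whole tokens are flagged as defect-prone, which is how token-level explanations become line-level predictions.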
  24. Our approach achieves an overall predictive accuracy better than baseline

    approaches Our Approach (Line-DP) Line-level Baseline Approaches Recall 0.61 – 0.62 0.01 – 0.51 MCC 0.04 – 0.05 -0.01 – 0.03 False Alarm 0.47 – 0.48 0.01 – 0.54 Distance to Heaven (the root mean square of the (1 − recall) and false alarm values) 0.43 – 0.44 0.52 – 0.70 The higher the better The lower the better Our Line-DP achieves an overall predictive accuracy better than baseline approaches Baselines: Static analysis tools, N-gram Our Line-DP can effectively identify defective lines while requiring a smaller amount of reviewing effort 24
  25. A large number of new code changes can pose challenges

    to perform effective code reviews Non-responding invited reviewers [Ruangwan et al, 2019] Workload-aware Reviewer recommendation [Al-Zubaidi et al, 2020] Suboptimal reviewing [Thongtanunam and Hassan, 2020; Chouchen et al, 2021] Line-level defect prediction [Wattanakriengkrai et al, 2020] Challenges: (Possible) Solutions: Our techniques and empirical findings should help teams speed up the process and save effort, while maintaining the quality of reviews 25
  26. Code Review at Speed: How can we use data to

    help developers do code review faster? Non-responding invited reviewers [Ruangwan et al, 2019] Workload-aware Reviewer recommendation [Al-Zubaidi et al, 2020] Suboptimal reviewing [Thongtanunam and Hassan, 2020; Chouchen et al, 2021] Line-level defect prediction [Wattanakriengkrai et al, 2020] The Impact of Human Factors on the Participation Decision of Reviewers in Modern Code Review S. Ruangwan, P. Thongtanunam, A. Ihara, K. Matsumoto at Journal of EMSE 2019 Workload-Aware Reviewer Recommendation using a Multi-objective Search-Based Approach W. Al-Zubaidi, P. Thongtanunam, H. K. Dam, C. Tantithamthavorn, A. Ghose at PROMISE2020 Review Dynamics and Their Impact on Software Quality P. Thongtanunam and A. E. Hassan at TSE 2020 Anti-patterns in Modern Code Review: Symptoms and Prevalence M. Chouchen, A. Ouni, R. Kula, D. Wang, P. Thongtanunam, M. Mkaouer, K. Matsumoto at SANER2021 Predicting Defective Lines Using a Model-Agnostic Technique S. Wattanakriengkrai, P. Thongtanunam, C. Tantithamthavorn, H. Hata, K. Matsumoto at TSE2020 http://patanamon.com 26