Code review at speed: How can we use data to help developers do code review faster?
Abstract: Code review has become a mandatory practice in the software engineering workflow of many software organizations. While code review has been shown to provide numerous benefits to software teams, it is also considered expensive. Due to its manual and human-intensive nature, code review can delay the software development workflow. In this talk, I will summarize some of our work and that of others on how we can help developers perform code reviews effectively, and what remaining challenges in code review still need to be addressed.

The talk was given at the MSR2021 Mini-keynote https://2021.msrconf.org/track/msr-2021-keynotes
The short video presentation is available at https://youtu.be/K9BfndLKDyY

Transcript

  1. Patanamon (Pick) Thongtanunam patanamon.t@unimelb.edu.au @patanamon ARC DECRA & Lecturer at

    School of Computing and Information Systems (CIS) http://patanamon.com Code Review at Speed: How can we use data to help developers do code review faster? 1
  2. Create tasks Write code Build & test code Integrate Release/

    Deploy A General View of Continuous Integration Code Review: A QA practice that manually examines a new code change Code Review A bug is here… Improve the overall quality of software systems [Thongtanunam et al. 2015, McIntosh et al. 2016] Increase team awareness, transfer knowledge & share code ownership [Bacchelli and Bird 2013, Thongtanunam et al. 2016, Sadowski et al. 2018] 2
  3. Code Review: A QA practice that manually examines a new

    code change An author A code change (1) Uploading the change (2) Inviting reviewers Reviewers (3) Examining the change (4) Automated testing The approved change is integrated into the software system Rejected Accepted Fail Changes are abandoned A revision is required Pass A collaborative code review tool 3
  4. A large number of new code changes can pose challenges

    to perform effective code reviews 100 - 1,000 reviews were performed in a month, and each review took 1 day on average [Rigby and Bird, 2013] ~600 ~400 ~550 #Reviews/ month [Thongtanunam and Hassan, 2020] 4
  5. A large number of new code changes can pose challenges

    to perform effective code reviews Non-responding invited reviewers [Ruangwan et al, 2019] Challenges: 5
  6. Reviewers may not respond to the review invitation An author

    A code change (1) Uploading the change (2) Inviting reviewers Reviewers (3) Examining the change (4) Automated testing The approved change is integrated into the software system Rejected Accepted Fail Changes are abandoned A revision is required Pass A collaborative code review tool 6
  7. 16% - 66% of the studied code changes have at

    least one invited reviewer who did not respond to the invitation %Non-responding invited reviewers in a patch The more the reviewers that were invited, the higher the chance of having a non-responding reviewer 7
  8. Investigating the factors that can be associated with the participation

    decision Experience & Activeness Past Collaboration Workload 13 studied metrics RespondInvitation ∼ x1 + x2 + ….. + xn Use a non-linear logistic regression model Analyze the relationship with the likelihood of responding to the invitation Code Ownership %Commits authored Reviewing Experience %Patches reviewed Review Participation Rate %Invitations Accepted Familiarity with the Patch Author Co-reviewing Freq. Remaining Reviews #Pending Review Requests 5 Significant factors: 8
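The studied factors above are computed from a project's review history. As a minimal sketch (the data layout, names, and the subset of three metrics here are illustrative assumptions, not the study's actual implementation):

```python
# Hypothetical sketch of computing three of the significant factors from
# [Ruangwan et al, 2019]. Data structures are made up for illustration.

def reviewer_metrics(reviewer, commits, invitations, pending):
    """commits: list of commit author names for the changed files' history.
    invitations: list of (invitee, responded) pairs from past reviews.
    pending: dict mapping reviewer -> #pending review requests."""
    my_invites = [resp for invitee, resp in invitations if invitee == reviewer]
    return {
        # Code Ownership: %commits authored by this reviewer
        "code_ownership": sum(a == reviewer for a in commits) / len(commits),
        # Review Participation Rate: %invitations the reviewer responded to
        "participation_rate": sum(my_invites) / len(my_invites),
        # Remaining Reviews: #pending review requests (workload)
        "remaining_reviews": pending.get(reviewer, 0),
    }
```

Each metric is then a candidate explanatory variable (one of the x1..xn) in the logistic regression model above.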
  9. A large number of new code changes can pose challenges

    to perform effective code reviews Non-responding invited reviewers [Ruangwan et al, 2019] Workload-aware Reviewer recommendation [Al-Zubaidi et al, 2020] Challenges: (Possible) Solutions: 9
  10. WLRRec: Workload-aware Reviewer Recommendation A multi-objective evolutionary search (NSGA-II) Experience

    & Activeness Past Collaboration Obj 1: Maximize the chance of participating in a review Workload Obj 2: Minimize the skewness of the workload Measure Reviewer Metrics A new code change 10
  11. WLRRec Uses 4+1 Key Reviewer Metrics Experience & Activeness Past

    Collaboration Workload Code Ownership %Commits authored Reviewing Experience %Patches reviewed Review Participation Rate %Invitations Accepted Familiarity with the Patch Author Co-reviewing Freq. Remaining Reviews #Pending Review Requests Fitness func. for Obj 1: Weighted Summation Identify reviewers with maximum experience, activeness and past collaboration Fitness func. for Obj 2: Shannon's Entropy Identify reviewers with a minimally skewed workload 11
  12. WLRRec identifies reviewers with maximum experience, activeness, and past collaboration (Obj

    1) Example Fitness func. for Obj. 1 Code Ownership COPick COHoa COKla COAditya Rev. Experience REPick REHoa REKla REAditya Rev. Participate RPPick RPHoa RPKla RPAditya Fam. w/ Patch Author FPPick FPHoa FPKla FPAditya Weighted Sum ScorePick ScoreHoa ScoreKla ScoreAditya Solution Candidate Objective 1 score ScorePick + ScoreKla 12
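As a rough illustration of the weighted-summation fitness for Objective 1 (the weights, metric names, and values below are hypothetical, not WLRRec's actual ones):

```python
# Sketch of Obj 1: score each reviewer in a candidate solution by a
# weighted sum of their experience, activeness, and past-collaboration
# metrics, then sum the scores over the candidate set. Higher is better.

def obj1_fitness(candidate, metrics, weights):
    """Total weighted score of the candidate set of reviewers."""
    return sum(
        sum(weights[k] * metrics[reviewer][k] for k in weights)
        for reviewer in candidate
    )
```

In the slide's example, the objective-1 score of the candidate {Pick, Kla} is ScorePick + ScoreKla, exactly what this function computes.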
  13. #Pending Review Requests Solution Candidate Total Workload Objective 2 score

    (Shannon's entropy) WLRRec identifies reviewers with a minimally skewed workload (Obj 2) Example Fitness func. for Obj. 2: (1/log2 4) · ((5/10)·log2(5/10) + 2·(1/10)·log2(1/10) + (2/10)·log2(2/10)) ≈ −0.81 The lower the score, the less skewed the workload (the better the distribution of workload) 13
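The entropy-based fitness for Objective 2 can be sketched in Python as follows (a simplified reading of the slide's formula; the workload vectors are illustrative):

```python
import math

# Sketch of Obj 2: the workload distribution's normalized Shannon entropy,
# negated so that lower scores mean a more even (less skewed) workload.

def obj2_fitness(pending_reviews):
    """pending_reviews: #pending review requests per candidate reviewer."""
    total = sum(pending_reviews)
    probs = [w / total for w in pending_reviews if w > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    # Normalize by the maximum possible entropy, log2(#reviewers)
    return -entropy / math.log2(len(pending_reviews))
```

A perfectly even workload such as [2, 2, 2, 2] yields the minimum score of -1.0, while a skewed one such as [5, 1, 1, 2] scores higher (closer to 0), so NSGA-II is pushed toward candidates that spread the review load evenly.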
  14. Our WLRRec outperforms the single-objective approaches 0% 45% 90% 135%

    180% Precision Recall F1 0% 35% 70% 105% 140% Precision Recall F1 %Gain WLRRec vs GA-Obj1 Precision Recall F-Measure Precision Recall F-Measure %Gain WLRRec vs GA-Obj2 WLRRec achieves 88%-142% higher precision, 111%-178% higher recall than GA-Obj1 WLRRec achieves 55%-101% higher precision, 96%-138% higher recall than GA-Obj2 Considering multiple objectives at the same time allows us to better find reviewers 14
  15. A large number of new code changes can pose challenges

    to perform effective code reviews Non-responding invited reviewers [Ruangwan et al, 2019] Workload-aware Reviewer recommendation [Al-Zubaidi et al, 2020] Suboptimal reviewing [Thongtanunam and Hassan, 2020; Chouchen et al, 2021] Challenges: (Possible) Solutions: 15
  16. Reviewers may have subconscious biases due to the visible information

    in a code review tool An author A code change (1) Uploading the change (2) Inviting reviewers Reviewers (3) Examining the change (4) Automated testings The approved change is integrated into the software system Rejected Accepted Fail Changes are abandoned A revision is required Pass A collaborative code review tool Code review tools often provide a transparent environment 16
  17. Reviewers may have subconscious biases due to the visible information

    in a code review tool Ahmed usually writes good code 17
  18. Investigating the signals of visible information that are associated with

    the review decision of a reviewer Analyze the relationship with the likelihood of giving a positive vote Use a mixed-effects logistic regression model IsPositiveVote ∼ x1 + x2 + ….. + xn + (1 | ReviewerId ) 8 Studied metrics Relationship Status Prior Feedback Confounding factors: Code change characteristics, e.g., #Added Lines 18
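For intuition, the model form IsPositiveVote ∼ x1 + … + xn + (1 | ReviewerId) combines fixed effects with a per-reviewer random intercept under a logistic link. A toy sketch of how such a model produces a probability (the coefficients here are invented, not the study's fitted values):

```python
import math

def p_positive_vote(x, beta, reviewer_intercept):
    """P(IsPositiveVote = 1) for one review: the reviewer-specific random
    intercept (the (1 | ReviewerId) term) shifts the log-odds before the
    fixed effects x1..xn with coefficients beta are applied."""
    log_odds = reviewer_intercept + sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-log_odds))
```

The random intercept lets each reviewer have their own baseline tendency to vote positively, so the fixed-effect estimates are not distorted by reviewers who are simply more (or less) generous overall.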
  19. In addition to patch characteristics, other visible information is associated

    with the review decision Relationship Status Prior Feedback Patch Characteristics Explanatory Power (Log-likelihood ratio test) Association Direction: Higher %Reviewed past patches for the patch author → more likely to give a positive vote; Higher %Prior positive votes → more likely; Lower %Prior comments → more likely Visible information has a stronger association with the review decision than patch characteristics 19
  20. Other suboptimal reviewing practices also exist in the contemporary code

    review process [Chouchen et al, 2021] Identify anti-patterns in code reviews Manually examine code reviews of 100 code changes Confused reviewers Divergent reviewers Shallow review Toxic review 21% 20% 14% 5% Low review participation 32% 20
  21. A large number of new code changes can pose challenges

    to perform effective code reviews Non-responding invited reviewers [Ruangwan et al, 2019] Workload-aware Reviewer recommendation [Al-Zubaidi et al, 2020] Suboptimal reviewing [Thongtanunam and Hassan, 2020; Chouchen et al, 2021] Line-level defect prediction [Wattanakriengkrai et al, 2020] Challenges: (Possible) Solutions: 21
  22. We find that as little as 1%-3% of the lines

    of code in a file are actually defective after release 9 Studied systems: Activemq Camel Derby Groovy Hbase Hive Jruby Lucene Wicket %Defective files 2-7% 2-8% 6-28% 2-4% 7-11% 6-19% 2-13% 2-8% 2-16% %Defective lines in defective files (at the median) 2% 2% 2% 2% 1% 2% 2% 3% 3% **Defective lines are the source code lines that will be changed by bug-fixing commits to fix post-release defects Only 1%-3% of the lines of code in a file are actually defective Predicting defective lines would potentially save reviewer effort on inspecting code 22
  23. if(closure != null){ Object oldCurrent = current; setClosure(closure, node); closure.call();

    current = oldCurrent; } Identified defect-prone lines oldCurrent current node closure Defective (LIME score > 0) Clean (LIME score < 0) 0.8 0.1 -0.3 -0.7 Ranking tokens based on LIME scores Mapping tokens to lines if(closure != null){ Object oldCurrent = current; setClosure(closure, node); closure.call(); current = oldCurrent; } A model-agnostic technique (LIME) Line-DP: Predicting defective lines using a model-agnostic technique (LIME) A file-level defect prediction model Files of Interest Defect-prone files A model-agnostic technique (LIME) Defect-prone lines 23
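The "mapping tokens to lines" step can be sketched as follows, using the slide's example snippet; the LIME scores below match the slide's example but the tokenization and matching logic are simplified assumptions, not Line-DP's actual implementation:

```python
import re

# LIME scores for the tokens in the slide's example: a positive score
# means the token pushes the file-level model toward "defective".
lime_scores = {"oldCurrent": 0.8, "current": 0.1, "node": -0.3, "closure": -0.7}

lines = [
    "if(closure != null){",
    "Object oldCurrent = current;",
    "setClosure(closure, node);",
    "closure.call();",
    "current = oldCurrent;",
    "}",
]

def defect_prone_lines(lines, scores):
    """Indices of lines containing at least one token with a LIME score > 0."""
    risky = {tok for tok, s in scores.items() if s > 0}
    return [i for i, line in enumerate(lines)
            if risky & set(re.findall(r"\w+", line))]
```

With these scores, only the lines containing `oldCurrent` or `current` as whole tokens are flagged as defect-prone, which is how token-level explanations become line-level predictions.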
  24. Our approach achieves an overall predictive accuracy better than baseline

    approaches Our Approach (Line-DP) Line-level Baseline Approaches Recall 0.61 – 0.62 0.01 – 0.51 MCC 0.04 – 0.05 -0.01 – 0.03 False Alarm 0.47 – 0.48 0.01 – 0.54 Distance to Heaven (the root mean square of the (1 − recall) and false alarm values) 0.43 – 0.44 0.52 – 0.70 The higher the better The lower the better Our Line-DP achieves an overall predictive accuracy better than baseline approaches Baselines: Static analysis tools, N-gram Our Line-DP can effectively identify defective lines while requiring a smaller amount of reviewing effort 24
  25. A large number of new code changes can pose challenges

    to perform effective code reviews Non-responding invited reviewers [Ruangwan et al, 2019] Workload-aware Reviewer recommendation [Al-Zubaidi et al, 2020] Suboptimal reviewing [Thongtanunam and Hassan, 2020; Chouchen et al, 2021] Line-level defect prediction [Wattanakriengkrai et al, 2020] Challenges: (Possible) Solutions: Our techniques and empirical findings should help teams speed up the process and save effort, while maintaining the quality of reviews 25
  26. Code Review at Speed: How can we use data to

    help developers do code review faster? Non-responding invited reviewers [Ruangwan et al, 2019] Workload-aware Reviewer recommendation [Al-Zubaidi et al, 2020] Suboptimal reviewing [Thongtanunam and Hassan, 2020; Chouchen et al, 2021] Line-level defect prediction [Wattanakriengkrai et al, 2020] The Impact of Human Factors on the Participation Decision of Reviewers in Modern Code Review S. Ruangwan, P. Thongtanunam, A. Ihara, K. Matsumoto at Journal of EMSE 2019 Workload-Aware Reviewer Recommendation using a Multi-objective Search-Based Approach W. Al-Zubaidi, P. Thongtanunam, H. K. Dam, C. Tantithamthavorn, A. Ghose at PROMISE2020 Review Dynamics and Their Impact on Software Quality P. Thongtanunam and A. E. Hassan at TSE 2020 Anti-patterns in Modern Code Review: Symptoms and Prevalence M. Chouchen, A. Ouni, R. Kula, D. Wang, P. Thongtanunam, M. Mkaouer, K. Matsumoto at SANER2021 Predicting Defective Lines Using a Model-Agnostic Technique S. Wattanakriengkrai, P. Thongtanunam, C. Tantithamthavorn, H. Hata, K. Matsumoto at TSE2020 http://patanamon.com 26