
Using Large-scale Heterogeneous Graph Representation Learning for Code Review Recommendations at Microsoft

Presentation at ICSE SEIP 2023, the Software Engineering in Practice (SEIP) track of the ACM/IEEE International Conference on Software Engineering (ICSE). Melbourne, May 18, 2023.

Authors: Jiyang Zhang, Chandra Maddila, Ram Bairi, Christian Bird, Ujjwal Raizada, Apoorva Agrawal, Yamini Jhawar, Kim Herzig, Arie van Deursen

Code review is an integral part of any mature software development process, and identifying the best reviewer for a code change is a well-accepted problem within the software engineering community. Selecting a reviewer who lacks expertise and understanding can slow development or result in more defects.

To date, most reviewer recommendation systems rely primarily on historical file change and review information; those who changed or reviewed a file in the past are the best positioned to review in the future. We posit that while these approaches are able to identify and suggest qualified reviewers, they may be blind to reviewers who have the needed expertise and have simply never interacted with the changed files before. Fortunately, at Microsoft, we have a wealth of work artifacts across many repositories that can yield valuable information about our developers.

To address the aforementioned problem, we present CORAL, a novel approach to reviewer recommendation that leverages a socio-technical graph built from the rich set of entities (developers, repositories, files, pull requests (PRs), work items, etc.) and their relationships in modern source code management systems. We employ a graph convolutional neural network on this graph and train it on two and a half years of history on 332 repositories within Microsoft. We show that CORAL is able to model the manual history of reviewer selection remarkably well. Further, based on an extensive user study, we demonstrate that this approach identifies relevant and qualified reviewers who traditional reviewer recommenders miss, and that these developers desire to be included in the review process. Finally, we find that "classical" reviewer recommendation systems perform better on smaller (in terms of developers) software projects while CORAL excels on larger projects, suggesting that there is "no one model to rule them all."

Paper on Arxiv: https://arxiv.org/abs/2202.02385
Session at ICSE SEIP: https://conf.researchr.org/details/icse-2023/icse-2023-SEIP/24/Using-Large-scale-Heterogeneous-Graph-Representation-Learning-for-Code-Review-Recomme

Arie van Deursen

May 18, 2023

Transcript

  1. Using Large-scale Heterogeneous Graph Representation Learning for Code Review Recommendations at Microsoft

    Jiyang Zhang (University of Texas at Austin, USA), Chandra Maddila (Microsoft Research -> Meta, USA), Ram Bairi (Microsoft Research, India), Christian Bird (Microsoft Research, USA), Ujjwal Raizada (Microsoft Research, India), Apoorva Agrawal (Microsoft Research, India), Yamini Jhawar (Microsoft Research, India), Kim Herzig (Microsoft, USA), Arie van Deursen (Delft University of Technology, The Netherlands)
  2. Reviewer Recommendation

    • Given a pull request, get better reviewers, faster, to ship better code, faster
    • Hard at scale: large teams, large (mono) repos, people moving around
    • State of the art: heuristics based on (earlier) authorship and reviewership
      • 😢 Does not consider semantic information (pull request title, descriptions, linked tasks, etc.)
      • 😢 Cold-start problem
      • 😢 Insufficient diversity when picking reviewers based on reviewership alone
  3. (No transcribed text on this slide.)

  4. The Nalanda Graph at Microsoft

    (Figure: schema of the socio-technical graph.) Node types: pull request, user, work item, repository, file, iteration, comment. Edge types: contains, creates, reviews, linked, replyTo, reportsTo, addsIteration, has, comments. Nodes carry attributes such as id, RepoId, Title, Status, FilePath, IsEditedALot, IsConcurrentlyEditedInLastNMonths, CreatedDate, and PublishedDate.
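As an illustration of the kind of heterogeneous schema this slide depicts, here is a minimal sketch using networkx; the node and edge types follow the figure, but the attribute values and the library choice are illustrative, not the Nalanda implementation.

```python
import networkx as nx

# Sketch of a heterogeneous socio-technical graph: typed nodes and
# typed edges, mirroring the node/edge kinds in the Nalanda figure.
G = nx.MultiDiGraph()

# Typed nodes with a few of the attributes shown on the slide (values invented).
G.add_node("repo:1", type="repository", Name="contoso", ProjectName="core")
G.add_node("file:1", type="file", FilePath="src/app.cs", IsEditedALot=True)
G.add_node("pr:42", type="pullrequest", Title="Fix login regression", Status="active")
G.add_node("user:alice", type="user")
G.add_node("wi:7", type="workitem", Title="Login fails on retry", Status="open")

# Typed edges mirroring the relationships in the figure.
G.add_edge("repo:1", "file:1", type="contains")
G.add_edge("user:alice", "pr:42", type="creates")
G.add_edge("pr:42", "file:1", type="changes")
G.add_edge("pr:42", "wi:7", type="linked")
G.add_edge("user:alice", "pr:42", type="reviews")
```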
  5. Problem Formulation: Link Prediction

    (Figure: example graph with nodes User 1, User 2, File A, File B, Pull Request 1, and Pull Request 2, connected by "creates", "changes", and "reviews" edges; the task is to predict which missing "reviews" edges to Pull Request 2 should exist.)
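A minimal sketch of this formulation, assuming each node already has a learned embedding: a candidate "reviews" edge is scored by the inner product of the user and pull request embeddings (function and variable names are hypothetical).

```python
import numpy as np

# Link-prediction sketch: score a candidate "reviews" edge between a user
# and a pull request by the inner product of their node embeddings.
def reviews_score(user_emb: np.ndarray, pr_emb: np.ndarray) -> float:
    return float(user_emb @ pr_emb)
```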
  6. Training CORAL: A Two-Layer Graph Convolutional Network

    • At each layer, each node gathers feature information from itself and its neighbours and aggregates it into a representation
    • During training, nodes that are connected in the actual graph are made semantically similar (large inner product)
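A self-contained PyTorch sketch of the idea on this slide: two graph-convolution layers that aggregate neighbour features, trained so that connected node pairs get a large inner product. The architecture, aggregation, and loss are generic illustrations, not CORAL's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerGCN(nn.Module):
    """Generic two-layer graph convolution (illustrative, not CORAL's exact model)."""
    def __init__(self, in_dim: int, hid_dim: int, out_dim: int):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hid_dim)
        self.lin2 = nn.Linear(hid_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # adj: row-normalized adjacency with self-loops, shape (N, N).
        # Each layer aggregates neighbour features (adj @ ...) and transforms.
        h = torch.relu(self.lin1(adj @ x))
        return self.lin2(adj @ h)

def edge_loss(z: torch.Tensor, pos_edges: torch.Tensor, neg_edges: torch.Tensor) -> torch.Tensor:
    # Push embeddings of actually-connected nodes together (large inner
    # product) and randomly sampled negative pairs apart.
    pos = (z[pos_edges[0]] * z[pos_edges[1]]).sum(dim=-1)
    neg = (z[neg_edges[0]] * z[neg_edges[1]]).sum(dim=-1)
    return -(F.logsigmoid(pos).mean() + F.logsigmoid(-neg).mean())
```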
  7. CORAL’s Inductive Inference for Recommendations

    • Given a new pull request, plug a new node into the graph
    • Connect edges to its files, authors, and words
    • Obtain the pull request node embedding by passing it through the two Graph Convolutional Network layers
    • User nodes with the highest inner products with the pull request node are recommended by the model
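A heavily simplified NumPy sketch of this inductive step, assuming mean aggregation and a single weight matrix per layer (the weights stand in for the trained network's parameters; all names are illustrative):

```python
import numpy as np

def embed_new_pr(neighbor_feats: np.ndarray, W1: np.ndarray, W2: np.ndarray) -> np.ndarray:
    # Layer 1: aggregate raw features of the new PR's neighbours
    # (its files, author, and words) and transform, with a ReLU.
    h = np.maximum(W1 @ neighbor_feats.mean(axis=0), 0.0)
    # Layer 2: transform again (the full model would also aggregate
    # layer-1 representations of the two-hop neighbourhood).
    return W2 @ h

def recommend_reviewers(user_embs: np.ndarray, pr_emb: np.ndarray, k: int = 3):
    # Recommend the users whose embeddings have the highest inner
    # product with the new pull request's embedding.
    return np.argsort(-(user_embs @ pr_emb))[:k]
```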
  8. Dataset for Training & Evaluation

    • Graph:
      • File: 2.8M
      • Pull request: 1.3M
      • Text: 1.1M
      • User: 48.5K
      • Work item: 540K
    • Training dataset:
      • 7M <pull request, reviewer> pairs
      • 700M pairs randomly sampled from the graph
    • Testing dataset (not in training):
      • 250K <pull request, reviewer> pairs
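A sketch of how such training pairs could be assembled, under this sketch's assumption (not necessarily the paper's exact procedure) that the 700M randomly sampled pairs serve as negative examples:

```python
import random

def sample_negative_pairs(prs: list, users: list, positive_pairs: set, n: int) -> list:
    # Randomly pair pull requests with users, skipping known positive
    # <pull request, reviewer> pairs, to obtain negative training examples.
    negatives = []
    while len(negatives) < n:
        pair = (random.choice(prs), random.choice(users))
        if pair not in positive_pairs:
            negatives.append(pair)
    return negatives
```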
  9. How well does CORAL model reviewing history?

    • 250K historic pull requests not in the training data
    • See what CORAL would have predicted
    • Top-k accuracy: at least one correct reviewer recommended in the top k
    • Mean Reciprocal Rank: is the correct recommendation at the top of the list?
    • In 73% of cases, the top 3 contains a correct reviewer
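For concreteness, the two evaluation metrics named on this slide, computed in plain Python (function and variable names are illustrative):

```python
def top_k_accuracy(recommendations: list, actual_reviewers: list, k: int = 3) -> float:
    # Fraction of pull requests where at least one actual reviewer
    # appears among the top-k recommendations.
    hits = sum(any(u in actual for u in recs[:k])
               for recs, actual in zip(recommendations, actual_reviewers))
    return hits / len(recommendations)

def mean_reciprocal_rank(recommendations: list, actual_reviewers: list) -> float:
    # Average over pull requests of 1 / rank of the first correct reviewer
    # (0 when no correct reviewer is recommended at all).
    total = 0.0
    for recs, actual in zip(recommendations, actual_reviewers):
        rank = next((i + 1 for i, u in enumerate(recs) if u in actual), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(recommendations)
```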
  10. Ablation: Are All CORAL Features Needed?

    • The graph structure alone performs poorly
    • Words and files both contribute individually
    • Files alone already get a long way
    • Their combination yields the best performance
  11. How does CORAL compare to a rule-based model?

    • The rule-based model is currently in production at Microsoft
    • Zanjani, Kagdi, Bird: “Automatically recommending peer reviewers in modern code review”, TSE 2015
    • It models expertise based on author interactions with files, with time decay
    • Two datasets of 500 pull requests each, drawn from differently sized repositories:
      • 220 from large repos (> 100 devs)
      • 200 from medium repos (25 < devs < 100)
      • 80 from small repos (devs < 25)
    • Devs were asked about the relevance (irrelevant / would like to be informed / will act) of pull requests they were not involved in
  12. CORAL vs Rule-Based Accuracy

    • Accuracy in terms of actual interactions (change status, add comment)
    • Accuracy in terms of devs saying a recommendation is relevant
    • No single clear winner: “no model to rule them all”
    • More training data is available for large repos
    • The social graph is less relevant for small repos

    Repo size | Rule-based model | CORAL
    Large     | 0.19             | 0.37
    Medium    | 0.31             | 0.36
    Small     | 0.35             | 0.23
  13. What do Users Think (I)?

    • “I am lead of this area and would like to review these kinds of PRs, which are likely fixing some regressions.”
    • “This is a PR worked on by my sister team. We have a dependency on them. So, I’d love to review this PR. I was not added when the PR was created. I would have loved to be added when it was active.”
    • “Yes! This PR needs a careful review. I'd love to spend time on this PR.”
  14. What do Users Think (II)?

    • “No longer relevant because this is a repo my team transferred in 2020 to another team.”
    • “I am a PM, so this PR is not relevant to me.”
    • “Not relevant since I no longer work on the team that manages this service.”
  15. Conclusion

    • Explored combining the social graph and semantic information for recommending reviewers
    • Conducted both offline (historic) analysis and online (asking devs) analysis of impact
    • Offline accuracy of 73% in the top 3
    • Online recommendations appreciated by devs (67%)
    • Works better than rule-based recommendations for larger repos
    • Future work: decay, node- and edge-specific features, effect of hyperparameters, applicability to open source, …
  16. (Closing slide: repeats the title and author list from slide 1.)