Slide 1

Slide 1 text

Using Large-scale Heterogeneous Graph Representation Learning for Code Review Recommendations at Microsoft

Jiyang Zhang (University of Texas at Austin, USA), Chandra Maddila (Microsoft Research, now Meta, USA), Ram Bairi (Microsoft Research, India), Christian Bird (Microsoft Research, USA), Ujjwal Raizada (Microsoft Research, India), Apoorva Agrawal (Microsoft Research, India), Yamini Jhawar (Microsoft Research, India), Kim Herzig (Microsoft, USA), Arie van Deursen (Delft University of Technology, The Netherlands)

Slide 2

Slide 2 text

Reviewer Recommendation
• Given a pull request, get better reviewers, faster, to ship better code, faster
• Hard at scale: large teams, large (mono)repos, people moving around
• State of the art: heuristics based on (earlier) authorship and reviewership
  • 😢 Does not consider semantic information (pull request title, description, linked tasks, etc.)
  • 😢 Cold-start problem
  • 😢 Insufficient diversity when picking reviewers based on reviewership alone

Slide 3

Slide 3 text


Slide 4

Slide 4 text

ESEC/FSE 2022 • MSR MIP 2023

Slide 5

Slide 5 text

The Nalanda Graph at Microsoft

[Figure: schema of the Nalanda graph. Node types and key attributes:
• Pull request: id, RepoId, PullRequestId, Status, Title, Iterations, Url, SourceRefName, TargetRefName, FilesEditedALot, FilesEditedConcurrently, # of directories
• File: id, RepoId, FilePath, Type, IsEditedALot, IsConcurrentlyEditedInLastNMonths
• Repository: id, RepoId, Name, OrganizationName, ProjectName, SourceControlSystem
• Work item: id, RepoId, WorkItemId, Type, Title, Status, UpdatedDate
• Comment: id, RepoId, ContentThreadId, CommentId, IsDeleted, ParentCommentId
• Iteration: id, RepoId, Description, CommonRefCommitId, PushId, PullRequestIterationId
• User
Edge types: contains, creates, reviews (Review date, Vote), linked, replyTo, reportsTo, addsIteration, has, comments, iterations; edge attributes include Creation date, Closed date, PublishedDate, LastUpdatedDate, CreatedDate.]

Slide 6

Slide 6 text

Nalanda’s Augmented Socio-Technical Graph

Slide 7

Slide 7 text

Problem formulation: Link Prediction

[Figure: example graph. User 1 creates Pull Request 1 and User 2 reviews it; Pull Requests 1 and 2 both change Files A and B. The task is to predict the missing "reviews ?" edges between users and pull requests.]
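The link-prediction framing above can be sketched in a few lines. This is a hypothetical toy graph (node names and edge labels mirror the figure, not the real Nalanda data): given the observed edges, the candidates for prediction are the user–pull-request pairs without an existing "reviews" edge.

```python
# Toy sketch of the link-prediction setup: which "reviews" edges are missing?
edges = [
    ("User 1", "creates", "Pull Request 1"),
    ("Pull Request 1", "changes", "File A"),
    ("Pull Request 2", "changes", "File A"),
    ("Pull Request 2", "changes", "File B"),
    ("User 2", "reviews", "Pull Request 1"),
]

users = {"User 1", "User 2"}
pull_requests = {"Pull Request 1", "Pull Request 2"}

# Existing review edges are excluded; everything else is a candidate link.
existing = {(src, dst) for src, rel, dst in edges if rel == "reviews"}
candidates = sorted((u, pr) for u in users for pr in pull_requests
                    if (u, pr) not in existing)
print(candidates)
```

The model's job is then to score each candidate pair, which CORAL does with learned node embeddings.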

Slide 8

Slide 8 text

Training CORAL: A Two-Layer Graph Convolutional Network
• At each layer, each node gathers feature information from itself and its neighbours and aggregates it into a representation.
• During training, nodes that are connected in the actual graph are made semantically similar (large inner product of their embeddings).
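A minimal sketch of the two-layer aggregation, assuming mean aggregation over each node and its neighbours (the adjacency matrix, feature dimensions, and weights here are illustrative, not CORAL's actual ones):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: 4 nodes; self-loops so each node keeps its own features.
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
A_hat = A / A.sum(axis=1, keepdims=True)   # mean over node + neighbours

X = rng.normal(size=(4, 8))                # initial node features
W1 = rng.normal(size=(8, 16))              # 1st-layer weights
W2 = rng.normal(size=(16, 16))             # 2nd-layer weights

H1 = np.maximum(A_hat @ X @ W1, 0)         # 1st layer: aggregate, transform, ReLU
Z = A_hat @ H1 @ W2                        # 2nd layer: final node embeddings

# Training (conceptually): push Z[i] @ Z[j] high for nodes i, j connected
# in the actual graph, and low for randomly sampled unconnected pairs.
```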

Slide 9

Slide 9 text

CORAL’s Inductive Inference For Recommendations
• Given a new pull request, plug a new node into the graph.
• Connect edges to its files, author, and words.
• Obtain the pull request node embedding by passing it through the two Graph Convolutional Network layers.
• The user nodes with the highest inner products with the pull request node are recommended by the model.
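The inference steps above can be sketched as follows. All names (`file_feat`, `word_feat`, `user_emb`, the weight matrices) are illustrative stand-ins for quantities the trained model would provide; the point is that a new pull request gets an embedding from its neighbours without retraining, and users are ranked by inner product:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8

# Hypothetical precomputed quantities from a trained model.
file_feat = {"fileA": rng.normal(size=D), "fileB": rng.normal(size=D)}
word_feat = {"fix": rng.normal(size=D), "crash": rng.normal(size=D)}
W1 = rng.normal(size=(D, D))               # trained 1st-layer weights
W2 = rng.normal(size=(D, D))               # trained 2nd-layer weights
user_emb = {f"user{i}": rng.normal(size=D) for i in range(5)}

# New PR node: connect it to its files and title words, aggregate their
# features, then pass through the two (already trained) GCN layers.
neigh = np.stack(list(file_feat.values()) + list(word_feat.values()))
h = neigh.mean(axis=0)
z = np.maximum(h @ W1, 0) @ W2             # new PR embedding, no retraining

# Recommend the users with the highest inner product against the PR.
scores = {u: float(e @ z) for u, e in user_emb.items()}
top3 = sorted(scores, key=scores.get, reverse=True)[:3]
```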

Slide 10

Slide 10 text

Dataset for Training & Evaluation
• Graph:
  • File: 2.8M
  • Pull request: 1.3M
  • Text: 1.1M
  • User: 48.5K
  • Work item: 540K
• Training dataset:
  • 7M pairs
  • 700M pairs randomly sampled from the graph
• Testing dataset (not in training):
  • 250K pairs
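The randomly sampled pairs above serve as contrast for the observed ones. A toy sketch of that sampling (IDs, the helper `sample_negatives`, and the 10:1 ratio here are illustrative; the actual dataset pairs 7M observed with 700M randomly sampled pairs):

```python
import random

random.seed(0)

users = [f"user{i}" for i in range(100)]                 # hypothetical IDs
positives = {("pr1", "user3"), ("pr2", "user7")}         # observed review edges

def sample_negatives(positives, users, ratio=10):
    """Randomly sample (pr, user) pairs that are NOT observed review edges."""
    prs = sorted({pr for pr, _ in positives})
    negatives = set()
    while len(negatives) < ratio * len(positives):
        pair = (random.choice(prs), random.choice(users))
        if pair not in positives:
            negatives.add(pair)
    return negatives

negs = sample_negatives(positives, users)
```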

Slide 11

Slide 11 text

How well does CORAL model reviewing history?
• 250K historic pull requests not in the training data
• See what CORAL would have predicted
• Top-k accuracy: at least one correct reviewer recommended in the top k
• Mean Reciprocal Rank (MRR): is the correct recommendation at the top of the list?
• In 73% of cases, the top 3 contains a correct reviewer
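The two metrics above are standard and easy to state in code. A minimal sketch (the toy recommendation lists and reviewer sets are made up for illustration):

```python
def top_k_accuracy(ranked, actual, k):
    """Fraction of pull requests where at least one actual reviewer
    appears in the top-k recommendation list."""
    hits = sum(1 for r, a in zip(ranked, actual) if set(r[:k]) & set(a))
    return hits / len(ranked)

def mean_reciprocal_rank(ranked, actual):
    """Average of 1/rank of the first correct reviewer per pull request
    (contributes 0 if no correct reviewer is recommended)."""
    total = 0.0
    for r, a in zip(ranked, actual):
        for rank, user in enumerate(r, start=1):
            if user in a:
                total += 1.0 / rank
                break
    return total / len(ranked)

# Toy example: two pull requests with top-5 recommendation lists.
ranked = [["u1", "u2", "u3", "u4", "u5"],
          ["u9", "u8", "u2", "u7", "u6"]]
actual = [["u2"], ["u7"]]
print(top_k_accuracy(ranked, actual, 3))     # hit at rank 2, miss -> 0.5
print(mean_reciprocal_rank(ranked, actual))  # (1/2 + 1/4) / 2 = 0.375
```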

Slide 12

Slide 12 text

Ablation: Are All CORAL Features Needed?
• The graph structure alone does not perform well
• Words and files each contribute individually
• Files alone get a long way
• Their combination yields the best performance

Slide 13

Slide 13 text

How does CORAL compare to a rule-based model?
• Currently in production at Microsoft
• Zanjani, Kagdi, Bird: “Automatically recommending peer reviewers in modern code review”, TSE 2015
• Models expertise based on author interactions with files, with time decay
• Two datasets of 500 pull requests each, drawn from differently sized repositories:
  • 220 from large repos (> 100 devs)
  • 200 from medium repos (25–100 devs)
  • 80 from small repos (< 25 devs)
• Ask devs about relevance (irrelevant / would like to be informed / will act) for pull requests they were not involved in

Slide 14

Slide 14 text

CORAL vs Rule-Based Accuracy
• Accuracy of actual interactions (change status, add comment)
• Accuracy of devs saying a recommendation is relevant
• No single clear winner: “no model to rule them all”
• More training data for large repos
• Social graph less relevant for small repos

Repo size | Rule-based model | CORAL
Large     | 0.19             | 0.37
Medium    | 0.31             | 0.36
Small     | 0.35             | 0.23

Slide 15

Slide 15 text

What do Users Think (I)?
• “I am lead of this area and would like to review these kinds of PRs, which are likely fixing some regressions.”
• “This is a PR worked on by my sister team. We have a dependency on them. So, I’d love to review this PR.”
• “I was not added when the PR was created. I would have loved to be added when it was active.”
• “Yes! This PR needs a careful review. I'd love to spend time on this PR.”

Slide 16

Slide 16 text

What do Users Think (II)?
• “No longer relevant because this is a repo my team transferred in 2020 to another team.”
• “I am a PM, so this PR is not relevant to me.”
• “Not relevant since I no longer work on the team that manages this service.”

Slide 17

Slide 17 text

Conclusion
• Explored combining a social graph with semantic information for recommending reviewers
• Conducted both offline (historic) analysis and online (asking devs) analysis of impact
• Offline accuracy of 73% in the top 3
• Online recommendations appreciated by devs (67%)
• Works better than rule-based recommendations for larger repos
• Future work: decay, node/edge-specific features, effect of hyperparameters, applicability to open source, …

Slide 18

Slide 18 text
