Using Large-scale Heterogeneous Graph Representation Learning for Code Review Recommendations at Microsoft

Presentation in the Software Engineering in Practice (SEIP) track of ICSE 2023, the ACM/IEEE International Conference on Software Engineering. Melbourne, May 18, 2023.

Authors: Jiyang Zhang, Chandra Maddila, Ram Bairi, Christian Bird, Ujjwal Raizada, Apoorva Agrawal, Yamini Jhawar, Kim Herzig, Arie van Deursen

Code review is an integral part of any mature software development process, and identifying the best reviewer for a code change is a well-accepted problem within the software engineering community. Selecting a reviewer who lacks expertise and understanding can slow development or result in more defects.

To date, most reviewer recommendation systems rely primarily on historical file change and review information; those who changed or reviewed a file in the past are the best positioned to review in the future. We posit that while these approaches are able to identify and suggest qualified reviewers, they may be blind to reviewers who have the needed expertise and have simply never interacted with the changed files before. Fortunately, at Microsoft, we have a wealth of work artifacts across many repositories that can yield valuable information about our developers.

To address the aforementioned problem, we present CORAL, a novel approach to reviewer recommendation that leverages a socio-technical graph built from the rich set of entities (developers, repositories, files, pull requests (PRs), work items, etc.) and their relationships in modern source code management systems. We employ a graph convolutional neural network on this graph and train it on two and a half years of history on 332 repositories within Microsoft. We show that CORAL is able to model the manual history of reviewer selection remarkably well. Further, based on an extensive user study, we demonstrate that this approach identifies relevant and qualified reviewers who traditional reviewer recommenders miss, and that these developers desire to be included in the review process. Finally, we find that "classical" reviewer recommendation systems perform better on smaller (in terms of developers) software projects while CORAL excels on larger projects, suggesting that there is "no one model to rule them all."

Paper on Arxiv: https://arxiv.org/abs/2202.02385
Session at ICSE SEIP: https://conf.researchr.org/details/icse-2023/icse-2023-SEIP/24/Using-Large-scale-Heterogeneous-Graph-Representation-Learning-for-Code-Review-Recomme

Arie van Deursen

May 18, 2023

Transcript

  1. Using Large-scale Heterogeneous
    Graph Representation Learning
    for Code Review Recommendations
    at Microsoft
    Jiyang Zhang
    University of Texas at Austin, USA
    Chandra Maddila
    Microsoft Research -> Meta, USA
    Ram Bairi
    Microsoft Research, India
    Christian Bird
    Microsoft Research, USA
    Ujjwal Raizada
    Microsoft Research, India
    Apoorva Agrawal
    Microsoft Research, India
    Yamini Jhawar
    Microsoft Research, India
    Kim Herzig
    Microsoft, USA
    Arie van Deursen
    Delft University of Technology,
    The Netherlands

  2. Reviewer Recommendation
    • Given a pull request, get better reviewers, faster,
    to ship better code, faster
    • Hard at scale: large teams, large (mono) repos, people moving around
    • State of the art:
    • Heuristics based on (earlier) authorship and reviewership
    • 😢 Does not consider semantic information
    (pull request title, descriptions, linked tasks, etc.)
    • 😢 Cold-start problem
• 😢 Insufficient diversity when picking reviewers based on past reviewership


4. ESEC/FSE 2022, MSR MIP 2023

5. The Nalanda Graph at Microsoft

    [Figure: schema of the Nalanda socio-technical graph. Node types, with their attributes:]
    • pull request: id, RepoId, PullRequestId, Status, Title, Iterations, Url, SourceRefName, TargetRefName, CreationDate, ClosedDate
    • user: FilesEditedALot, FilesEditedConcurrently
    • file: id, RepoId, FilePath, Type, IsEditedALot, IsConcurrentlyEditedInLastNMonths
    • repository: id, RepoId, Name, OrganizationName, ProjectName, SourceControlSystem
    • work item: id, RepoId, WorkItemId, Type, Title, Status, UpdatedDate
    • comment: id, RepoId, ContentThreadId, CommentId, IsDeleted, ParentCommentId, PublishedDate, LastUpdatedDate
    • iteration: id, RepoId, Description, CommonRefCommitId, PushId, PullRequestIterationId, CreatedDate
    Edge types: creates, reviews (with review date and vote), contains, linked, has, comments, replyTo, reportsTo, addsIteration, iterations
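    To make the schema concrete, here is a minimal, hypothetical sketch of such a typed graph in Python; the node and edge kinds follow the slide, but the class names and example values are illustrative, not Microsoft's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    id: int
    kind: str                  # "pull_request", "user", "file", "work_item", ...
    attrs: dict = field(default_factory=dict)

@dataclass
class Edge:
    src: int                   # source node id
    dst: int                   # destination node id
    kind: str                  # "creates", "reviews", "contains", "linked", ...
    attrs: dict = field(default_factory=dict)

# Tiny example instance: a user creates a pull request that touches one file.
nodes = [
    Node(0, "user"),
    Node(1, "pull_request", {"Status": "Active", "Title": "Fix cache invalidation"}),
    Node(2, "file", {"FilePath": "src/cache.cs"}),
]
edges = [
    Edge(0, 1, "creates"),
    Edge(1, 2, "contains"),
]
```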

6. Nalanda’s Augmented Socio-Technical Graph

7. Problem formulation: Link Prediction

    [Figure: example graph with two users, two pull requests, and two files, connected by creates/changes/reviews edges; the task is to predict the missing "reviews?" edges between the users and a new pull request.]
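    In other words, the model must score candidate (user, pull request) edges. A minimal sketch of such a scoring function, assuming each node already has a learned embedding vector:

```python
import torch

def edge_score(h_user: torch.Tensor, h_pr: torch.Tensor) -> torch.Tensor:
    """Probability-like score for a 'reviews' edge between a user and a
    pull request: the larger the inner product of their embeddings, the
    more likely the link."""
    return torch.sigmoid(h_user @ h_pr)
```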

8. Training CORAL: A Two-Layer Graph Convolutional Network

    [Figure: node features flowing through a 1st and a 2nd graph-convolution layer.]
    • In each layer, each node gathers feature information from itself and its neighbours and aggregates it into a new representation.
    • During training, nodes that are connected in the actual graph are made semantically similar (large inner product of their embeddings); see the sketch below.
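    A minimal sketch of this two-layer setup and its training objective, assuming a dense, row-normalised adjacency matrix; the paper's actual model and node features are richer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerGCN(nn.Module):
    """Each layer concatenates a node's own features with the mean of its
    neighbours' features, then applies a learned linear map."""
    def __init__(self, in_dim: int, hid_dim: int, out_dim: int):
        super().__init__()
        self.lin1 = nn.Linear(2 * in_dim, hid_dim)
        self.lin2 = nn.Linear(2 * hid_dim, out_dim)

    @staticmethod
    def conv(lin: nn.Linear, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        neigh = adj @ h                                  # mean over neighbours
        return F.relu(lin(torch.cat([h, neigh], dim=1)))

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        return self.conv(self.lin2, self.conv(self.lin1, x, adj), adj)

def link_loss(emb, pos_pairs, neg_pairs):
    """Push connected pairs towards a large inner product and random pairs
    towards a small one (binary cross-entropy on the logits)."""
    pos = (emb[pos_pairs[:, 0]] * emb[pos_pairs[:, 1]]).sum(dim=1)
    neg = (emb[neg_pairs[:, 0]] * emb[neg_pairs[:, 1]]).sum(dim=1)
    logits = torch.cat([pos, neg])
    labels = torch.cat([torch.ones_like(pos), torch.zeros_like(neg)])
    return F.binary_cross_entropy_with_logits(logits, labels)
```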

9. CORAL’s Inductive Inference For Recommendations

    [Figure: a new pull request node plugged into the existing graph and passed through the two layers.]
    • Given a new pull request, plug a new node into the graph.
    • Connect edges to its files, authors, and words.
    • Obtain the pull request's node embedding by passing it through the two Graph Convolutional Network layers.
    • Recommend the user nodes whose embeddings have the highest inner product with the pull request node (sketched below).
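    A sketch of that inference step, reusing the hypothetical TwoLayerGCN from the previous slide (dense matrices kept for brevity):

```python
import torch

def recommend(model, x, adj, pr_feat, neighbour_ids, user_ids, k=3):
    """Attach a new pull request node to the graph, embed it with the trained
    model, and return the k user nodes with the highest inner product."""
    n = x.size(0)
    x2 = torch.cat([x, pr_feat.unsqueeze(0)])           # node n is the new PR
    adj2 = torch.zeros(n + 1, n + 1)
    adj2[:n, :n] = adj
    adj2[n, neighbour_ids] = 1.0 / len(neighbour_ids)   # edges to its files, authors, words
    emb = model(x2, adj2)
    scores = emb[user_ids] @ emb[n]                     # one score per candidate user
    top = scores.topk(min(k, len(user_ids))).indices
    return [user_ids[int(i)] for i in top]
```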

10. Dataset for Training & Evaluation

    Graph:
    ● File: 2.8M
    ● Pull Request: 1.3M
    ● Text: 1.1M
    ● User: 48.5K
    ● Work item: 540K

    Training Dataset (pair construction sketched below):
    ● 7M pairs
    ● 700M pairs randomly sampled from the graph

    Testing Dataset (not in Training):
    ● 250K pairs
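    A rough sketch of how such training pairs could be assembled; the 100:1 ratio of random to connected pairs matches the 7M/700M figures above, but the paper's actual sampling strategy may differ:

```python
import random

def make_training_pairs(edges, num_nodes, neg_ratio=100, seed=0):
    """Positives are node pairs connected in the graph; negatives are
    uniformly random node pairs, which at this scale are almost always
    unconnected."""
    rng = random.Random(seed)
    positives = [(u, v, 1) for (u, v) in edges]
    negatives = [(rng.randrange(num_nodes), rng.randrange(num_nodes), 0)
                 for _ in range(neg_ratio * len(edges))]
    return positives + negatives
```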

11. How well does CORAL model reviewing history?

    • 250K historic pull requests not in the training data
    • See what CORAL would have predicted
    • Top-k accuracy: at least one correct reviewer recommended in the top k
    • Mean Reciprocal Rank (MRR): is the correct recommendation at the top of the list? (both metrics sketched below)
    • In 73% of cases, the top 3 contains a correct reviewer
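    Both metrics are straightforward to compute. A small sketch, assuming `ranked_lists` holds the model's ranked reviewer list per pull request and `actual_sets` the reviewers who actually reviewed it:

```python
def topk_accuracy(ranked_lists, actual_sets, k=3):
    """Fraction of PRs where at least one true reviewer is in the top k."""
    hits = sum(1 for ranked, actual in zip(ranked_lists, actual_sets)
               if set(ranked[:k]) & actual)
    return hits / len(ranked_lists)

def mean_reciprocal_rank(ranked_lists, actual_sets):
    """Average of 1 / (rank of the first correct reviewer); 0 if none found."""
    total = 0.0
    for ranked, actual in zip(ranked_lists, actual_sets):
        total += next((1.0 / (i + 1) for i, u in enumerate(ranked) if u in actual), 0.0)
    return total / len(ranked_lists)

# Example: one PR whose true reviewer appears at rank 2.
print(topk_accuracy([["alice", "bob"]], [{"bob"}], k=3))    # 1.0
print(mean_reciprocal_rank([["alice", "bob"]], [{"bob"}]))  # 0.5
```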

12. Ablation: Are All CORAL Features Needed?

    • The graph structure alone is not much good
    • Words and files each contribute individually
    • Files alone already get a long way
    • Their combination yields the best performance

13. How does CORAL compare to a rule-based model?

    • Baseline currently in production at Microsoft
    • Zanjani, Kagdi, Bird: "Automatically recommending peer reviewers in modern code review", TSE 2015
    • Models expertise from author interactions with files, with time decay (sketched below)
    • Two datasets of 500 pull requests each, from differently sized repositories:
      • 220 large (> 100 devs)
      • 200 medium (25–100 devs)
      • 80 small (< 25 devs)
    • Asked devs about relevance (irrelevant / would like to be informed / will act on it) for pull requests they were not involved in
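    For intuition, a hedged sketch of a time-decayed file-expertise score in the spirit of that baseline; the half-life constant and the exact scoring are illustrative, not the TSE 2015 formulation:

```python
from datetime import datetime, timedelta

def expertise(interactions, pr_files, now, half_life_days=180):
    """Sum an exponentially decayed credit for each past authoring or
    reviewing interaction a developer had with the PR's files."""
    score = 0.0
    for path, when in interactions:            # (file path, interaction date)
        if path in pr_files:
            age_days = (now - when).days
            score += 0.5 ** (age_days / half_life_days)
    return score

# Example: two touches on a changed file, one recent and one a year old.
now = datetime(2023, 5, 18)
touches = [("src/cache.cs", now - timedelta(days=10)),
           ("src/cache.cs", now - timedelta(days=365))]
print(round(expertise(touches, {"src/cache.cs"}, now), 2))   # ~1.21
```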

14. CORAL vs Rule-Based Accuracy

    • Accuracy of actual interactions (change status, add comment)
    • Accuracy of devs saying it is relevant
    • No single clear winner: "no model to rule them all"
    • More training data available for large repos
    • Social graph less relevant for small repos

    Repo size   Rule-based model   CORAL
    Large       0.19               0.37
    Medium      0.31               0.36
    Small       0.35               0.23

15. What do Users Think (I)?

    "I am lead of this area and would like to review these kinds of PRs, which are likely fixing some regressions."

    "This is a PR worked on by my sister team. We have a dependency on them. So, I'd love to review this PR."

    "I was not added when the PR was created. I would have loved to be added when it was active."

    "Yes! This PR needs a careful review. I'd love to spend time on this PR."

16. What do Users Think (II)?

    "No longer relevant because this is a repo my team transferred in 2020 to another team."

    "I am a PM, so this PR is not relevant to me."

    "Not relevant since I no longer work on the team that manages this service."

17. Conclusion

    • Explored combining the social graph and semantic information for recommending reviewers
    • Conducted both offline (historical) analysis and online (asking devs) analysis of impact
    • Offline accuracy of 73% in the top 3
    • Online recommendations appreciated by devs (67%)
    • Works better than rule-based recommendations for larger repos
    • Future work: decay, node- and edge-specific features, effect of hyperparameters, applicability to open source, …

  18. Using Large-scale Heterogeneous
    Graph Representation Learning
    for Code Review Recommendations
    at Microsoft
    Jiyang Zhang
    University of Texas at Austin, USA
    Chandra Maddila
    Microsoft Research -> Meta, USA
    Ram Bairi
    Microsoft Research, India
    Christian Bird
    Microsoft Research, USA
    Ujjwal Raizada
    Microsoft Research, India
    Apoorva Agrawal
    Microsoft Research, India
    Yamini Jhawar
    Microsoft Research, India
    Kim Herzig
    Microsoft, USA
    Arie van Deursen
    Delft University of Technology,
    The Netherlands