

Exploring the Potential of Large Language Models in Fine-Grained Review Comment Classification

Code review is a crucial practice in software development. As code review nowadays is lightweight, reviewers can raise various issues, some of which may be trivial. Research has investigated automated approaches to classify review comments to gauge the effectiveness of code reviews. However, previous studies have primarily relied on supervised machine learning, which requires extensive manual annotation to train the models effectively. To address this limitation, we explore the potential of using Large Language Models (LLMs) to classify code review comments. We assess the performance of LLMs in classifying 17 categories of code review comments. Our results show that LLMs can classify code review comments, outperforming the state-of-the-art approach that uses a trained deep learning model. In particular, LLMs achieve better accuracy in classifying the five most useful categories, which the state-of-the-art approach struggles with due to few training examples. Rather than relying solely on a specific small training data distribution, our results show that LLMs provide balanced performance across high- and low-frequency categories. These results suggest that LLMs could offer a scalable solution for code review analytics to improve the effectiveness of the code review process.

The paper has been accepted at the 2025 IEEE International Conference on Source Code Analysis and Manipulation (SCAM).


Patanamon (Pick) Thongtanunam

September 15, 2025


Transcript

  1. Patanamon (Pick) Thongtanunam [email protected] http://patanamon.com Exploring the Potential of Large Language Models in Fine-Grained Review Comment Classification. The University of Melbourne. Hong Yi (Tom) Lin, Chunhua Liu, Linh Nguyen
  4. Code review serves as a quality assurance gateway for new code changes. Code review is a process where developers inspect each other's code to identify potential issues. Constructive, quality-focused reviews would improve the overall quality of code, while trivial comments can waste developers' time without improving code changes.
  11. Various types of review comments can be raised, but their usefulness varies. Recent work identified 17 types of review comments and their perceived usefulness [Turzo & Bosu, 2024], organized into high-level and fine-grained categories (e.g., Functional defect, Documentation). Automating comment classification could gauge the quality of code review practices.
  17. Several studies explored approaches for review comment classification. Classification: Useful or not. Approach: Feature-based. Techniques: Text similarity, SVM, Random Forest [Pangsakulyanont et al., IWESEP 2024; Bosu et al., MSR 2015; Fregnan et al., EMSE 2022]. Classification: Five high-level types. Approach: Fine-tuned deep learning models. Techniques: CodeBERT + LSTM [Turzo et al., ESEM 2023]. These approaches achieve accurate classification, but require manual annotation for training and are limited to the high-level comment types. Can Large Language Models (LLMs) address these limitations?
  25. Exploring the capability of Large Language Models (LLMs) to analyze and classify code review comments. Flat Strategy: input context + 17 fine-grained categories → LLM → predicted category. Hierarchical Strategy: input context + 5 high-level categories → LLM → predicted high-level category; retrieve subcategories; input context + subcategories of the predicted category → LLM → predicted fine-grained category.
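The two prompting strategies above can be sketched in Python. This is an illustrative sketch, not the paper's implementation: the category names are a small made-up subset of the 17-category taxonomy, and `llm` is passed in as a plain function so any model API could be plugged in.

```python
# Sketch of the flat vs. hierarchical classification strategies.
# FINE_GRAINED maps high-level categories to fine-grained subcategories;
# the entries below are illustrative, not the paper's full taxonomy.

FINE_GRAINED = {
    "Functional": ["Functional defect", "Validation"],   # hypothetical subset
    "Documentation": ["Documentation", "Naming"],
}

def classify_flat(comment, llm):
    """Flat strategy: one call choosing among all fine-grained categories."""
    categories = [c for subs in FINE_GRAINED.values() for c in subs]
    return llm(f"Fine-grained categories: {categories}. Comment: {comment}")

def classify_hierarchical(comment, llm):
    """Hierarchical strategy: predict the high-level category first,
    then classify again among only that category's subcategories."""
    high = llm(f"High-level categories: {sorted(FINE_GRAINED)}. Comment: {comment}")
    subs = FINE_GRAINED.get(high, [])
    return llm(f"Fine-grained categories: {subs}. Comment: {comment}")
```

The hierarchical variant trades one extra LLM call for a much shorter list of options per prompt, which is consistent with the deck's observation that it helps smaller models.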
  28. Experimental design. Manually annotated data [Turzo and Bosu, EMSE 2023]: 2,500 review comments from OpenStack Nova; each category includes a usefulness score ranked by open-source developers. Two families of open-source LLMs: Qwen 2 (small: 7B; medium: 72B) and Llama 3 (small: 8B; medium: 70B; large: 405B). Evaluation metrics: precision, recall, F1-score.
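The weighted-average metrics reported later in the deck can be computed as follows. This is a generic pure-Python sketch of support-weighted F1, not the paper's evaluation code; libraries such as scikit-learn (`f1_score(average="weighted")`) compute the same quantity.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-category F1 averaged with each category's support as the weight,
    matching the 'weighted average over 17 categories' style of reporting."""
    support = Counter(y_true)
    total = 0.0
    for cat, n in support.items():
        tp = sum(t == p == cat for t, p in zip(y_true, y_pred))
        pred_n = sum(p == cat for p in y_pred)
        precision = tp / pred_n if pred_n else 0.0
        recall = tp / n
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        total += n * f1
    return total / len(y_true)
```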
  29. RQ1: Can we use LLMs to classify code review comments? Flat Strategy: input context + 17 fine-grained categories → LLM → predicted category. Hierarchical Strategy: input context + 5 high-level categories → LLM → predicted high-level category; input context + subcategories of the predicted category → LLM. Using all review comments as a test set.
  35. RQ1 results: the weighted average of classification results on 17 categories. LLMs can be used to classify review comments into 17 categories; using the hierarchical approach could help boost the performance of small and medium LLMs.
  36. RQ2: Do LLMs outperform the state-of-the-art approach? Flat and Hierarchical LLM strategies vs. the state of the art (classification: five high-level types; approach: fine-tuned deep learning models; techniques: CodeBERT + LSTM [Turzo et al., ESEM 2023]). Using 10-fold cross-validation: 9 folds for training CodeBERT + LSTM, 1 fold for testing both.
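The 10-fold setup above can be sketched as a generic fold generator; this is an assumption-labeled illustration, not the paper's code (the paper's exact shuffling and stratification are not stated on the slide).

```python
import random

def kfold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.
    Each fold serves once as the test set (evaluating both the fine-tuned
    baseline and the LLMs), while the other k-1 folds train the baseline."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)          # deterministic shuffle for reproducibility
    folds = [idx[i::k] for i in range(k)]     # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```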
  40. RQ2 results: the weighted average of classification results on 17 categories based on 10-fold cross-validation, comparing the SOTA (with fine-tuning) against the LLMs (without fine-tuning). LLMs can outperform the supervised DL approach in classifying review comments into 17 categories; yet, the small LLMs still have lower performance.
  47. RQ3: Which categories can LLMs accurately classify? The category-wise classification results on 17 categories based on 10-fold cross-validation, comparing the SOTA (with fine-tuning) against the LLMs (without fine-tuning) on categories with low training resources. LLMs can classify accurately despite a low number of examples, outperforming the supervised DL approach in both the most useful and the least useful categories.
  50. Potential benefits of LLMs for fine-grained comment classification: code review analytics. Help teams assess code review effectiveness and identify common software quality concerns. Help developers prioritise which review comments to address first: filter out less useful or urgent comments, pinpoint the critical and important ones, and spend less time sifting through all the comments.
  53. Potential benefits of LLMs for fine-grained comment classification: enhance code review automation. Curate the training dataset by removing the less useful review comments. Evaluate and monitor generated reviews, gauging whether the AI provides useful comments or not.
  57. Summary: Exploring the Potential of Large Language Models in Fine-Grained Review Comment Classification. We explore the capabilities of LLMs for 17-category fine-grained review comment classification. LLMs can classify review comments with an average F1 score of 46.2%. LLMs could offer a scalable solution for code review analytics and enhancing code review automation. L. Nguyen, C. Liu, H. Y. Lin, P. Thongtanunam, "Exploring the Potential of Large Language Models in Fine-Grained Review Comment Classification", SCAM 2025.