

Exploring the Potential of Large Language Models in Fine-Grained Review Comment Classification

Code review is a crucial practice in software development. As code review nowadays is lightweight, reviewers can raise various issues, some of which may be trivial. Research has investigated automated approaches to classify review comments to gauge the effectiveness of code reviews. However, previous studies have primarily relied on supervised machine learning, which requires extensive manual annotation to train the models effectively. To address this limitation, we explore the potential of using Large Language Models (LLMs) to classify code review comments. We assess the performance of LLMs in classifying 17 categories of code review comments. Our results show that LLMs can classify code review comments, outperforming the state-of-the-art approach that uses a trained deep learning model. In particular, LLMs achieve better accuracy in classifying the five most useful categories, which the state-of-the-art approach struggles with due to few training examples. Rather than relying solely on a specific small training data distribution, our results show that LLMs provide balanced performance across high- and low-frequency categories. These results suggest that LLMs could offer a scalable solution for code review analytics to improve the effectiveness of the code review process.

The paper has been accepted at the 2025 IEEE International Conference on Source Code Analysis and Manipulation (SCAM).


Patanamon (Pick) Thongtanunam

September 15, 2025


Transcript

  1. Patanamon (Pick) Thongtanunam [email protected] http://patanamon.com Exploring the Potential of Large Language Models in Fine-Grained Review Comment Classification. The University of Melbourne. Hong Yi (Tom) Lin, Chunhua Liu, Linh Nguyen
  4. Code review serves as a quality assurance gateway for new code changes. Code review is a process where developers inspect each other's code to identify potential issues. Constructive, quality-focused reviews would improve the overall quality of code, while trivial comments can waste developers' time without improving code changes.
  11. Various types of review comments can be raised, but their usefulness varies. Recent work identified 17 types of review comments and their perceived usefulness [Turzo & Bosu, 2024], organized into high-level and fine-grained categories (e.g., Functional defect, Documentation). Automating comment classification could gauge the quality of code review practices.
  17. Several studies explored approaches for review comment classification. Classification: Useful or not. Approach: Feature-based. Techniques: Text similarity, SVM, Random Forest [Pangsakulyanont et al., IWESEP 2024; Bosu et al., MSR 2015; Fregnan et al., EMSE 2022]. Classification: Five high-level types. Approach: Fine-tuned deep learning models. Techniques: CodeBERT + LSTM [Turzo et al., ESEM 2023]. These approaches achieve accurate classification, but require manual annotation for training and are limited to the high-level comment types. Can Large Language Models (LLMs) address these limitations?
  25. Exploring the capability of Large Language Models (LLMs) to analyze and classify code review comments. Flat Strategy: input context + 17 fine-grained categories → LLM → predicted category. Hierarchical Strategy: input context + 5 high-level categories → LLM → predicted high-level category; retrieve subcategories; input context + subcategories of the predicted category → LLM → predicted fine-grained category.
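The two prompting strategies above can be sketched in Python. This is an illustrative sketch, not the paper's implementation: the category names are a small made-up subset of the 17-category taxonomy, and `llm` is passed in as a plain function so any model API could be plugged in.

```python
# Sketch of the flat vs. hierarchical classification strategies.
# FINE_GRAINED maps high-level categories to fine-grained subcategories;
# the entries below are illustrative, not the paper's full taxonomy.

FINE_GRAINED = {
    "Functional": ["Functional defect", "Validation"],   # hypothetical subset
    "Documentation": ["Documentation", "Naming"],
}

def classify_flat(comment, llm):
    """Flat strategy: one call choosing among all fine-grained categories."""
    categories = [c for subs in FINE_GRAINED.values() for c in subs]
    return llm(f"Fine-grained categories: {categories}. Comment: {comment}")

def classify_hierarchical(comment, llm):
    """Hierarchical strategy: predict the high-level category first,
    then classify again among only that category's subcategories."""
    high = llm(f"High-level categories: {sorted(FINE_GRAINED)}. Comment: {comment}")
    subs = FINE_GRAINED.get(high, [])
    return llm(f"Fine-grained categories: {subs}. Comment: {comment}")
```

The hierarchical variant trades one extra LLM call for a much shorter list of options per prompt, which is consistent with the deck's observation that it helps smaller models.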
  28. Experimental design. Manually annotated data [Turzo and Bosu, EMSE 2023]: 2,500 review comments from OpenStack Nova; each category includes a usefulness score ranked by open-source developers. Two families of open-source LLMs: Qwen 2 (small: 7B; medium: 72B) and Llama 3 (small: 8B; medium: 70B; large: 405B). Evaluation metrics: precision, recall, F1-score.
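The weighted-average metrics reported later in the deck can be computed as follows. This is a generic pure-Python sketch of support-weighted F1, not the paper's evaluation code; libraries such as scikit-learn (`f1_score(average="weighted")`) compute the same quantity.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-category F1 averaged with each category's support as the weight,
    matching the 'weighted average over 17 categories' style of reporting."""
    support = Counter(y_true)
    total = 0.0
    for cat, n in support.items():
        tp = sum(t == p == cat for t, p in zip(y_true, y_pred))
        pred_n = sum(p == cat for p in y_pred)
        precision = tp / pred_n if pred_n else 0.0
        recall = tp / n
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        total += n * f1
    return total / len(y_true)
```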
  29. RQ1: Can we use LLMs to classify code review comments? Flat Strategy: input context + 17 fine-grained categories → LLM → predicted category. Hierarchical Strategy: input context + 5 high-level categories → LLM → predicted high-level category; input context + subcategories of the predicted category → LLM. Using all review comments as a test set.
  35. RQ1 results: the weighted average of classification results on 17 categories. LLMs can be used to classify review comments into 17 categories; using the hierarchical approach could help boost the performance of small and medium LLMs.
  36. RQ2: Do LLMs outperform the state-of-the-art approach? Flat and Hierarchical LLM strategies vs. the state of the art (classification: five high-level types; approach: fine-tuned deep learning models; techniques: CodeBERT + LSTM [Turzo et al., ESEM 2023]). Using 10-fold cross-validation: 9 folds for training CodeBERT + LSTM, 1 fold for testing both.
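The 10-fold setup above can be sketched as a generic fold generator; this is an assumption-labeled illustration, not the paper's code (the paper's exact shuffling and stratification are not stated on the slide).

```python
import random

def kfold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.
    Each fold serves once as the test set (evaluating both the fine-tuned
    baseline and the LLMs), while the other k-1 folds train the baseline."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)          # deterministic shuffle for reproducibility
    folds = [idx[i::k] for i in range(k)]     # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```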
  40. RQ2 results: the weighted average of classification results on 17 categories based on 10-fold cross-validation, comparing the SOTA (with fine-tuning) against the LLMs (without fine-tuning). LLMs can outperform the supervised DL approach in classifying review comments into 17 categories; yet, the small LLMs still have lower performance.
  47. RQ3: Which categories can LLMs accurately classify? The category-wise classification results on 17 categories based on 10-fold cross-validation, comparing the SOTA (with fine-tuning) against the LLMs (without fine-tuning) on categories with low training resources. LLMs can classify accurately despite a low number of examples, outperforming the supervised DL approach in both the most useful and the least useful categories.
  50. Potential benefits of LLMs for fine-grained comment classification: code review analytics. Help teams assess code review effectiveness and identify common software quality concerns. Help developers prioritise which review comments to address first: filter out less useful or urgent comments, pinpoint the critical and important ones, and spend less time sifting through all the comments.
  53. Potential benefits of LLMs for fine-grained comment classification: enhance code review automation. Curate the training dataset by removing the less useful review comments. Evaluate and monitor generated reviews, gauging whether the AI provides useful comments or not.
  57. Summary: Exploring the Potential of Large Language Models in Fine-Grained Review Comment Classification. We explore the capabilities of LLMs for 17-category fine-grained review comment classification. LLMs can classify review comments with an average F1 score of 46.2%. LLMs could offer a scalable solution for code review analytics and enhancing code review automation. L. Nguyen, C. Liu, H. Y. Lin, P. Thongtanunam, "Exploring the Potential of Large Language Models in Fine-Grained Review Comment Classification", SCAM 2025.