Honours Thesis Defense

Callum MacNeil
January 03, 2024

Honours Thesis Defense Slides from December 2023

Transcript

  1. A Systematic Review of Automated Program Repair using Large Language Models
     Callum MacNeil, BCS student, Faculty of Computer Science, Dalhousie University, Canada
     Supervisor: Dr. Masud Rahman, RAISE Lab (Intelligent Automation in Software EngineeRing)
  2. Agenda
     • Motivation
     • Research Methodology
     • Research Questions
     • Empirical Findings
     • Threats to Validity
     • Take Home Messages
     • Q & A
  3. Motivation
     • Up to 50% of development time is spent fixing bugs
     • LLMs have proven effective for code: GitHub Copilot increases developer productivity by 55%
     • Code generation alone, without bug fixing, does not deliver those productivity gains
     • No existing systematic review on this topic covers all design choices
  4. Research Methodology
     Ask Research Questions → Find related existing reviews → Read guidelines from Kitchenham and Charters →
     Design Review Protocol → Study Collection → Study Filtration → Data Extraction & Analysis
  5. Research Questions
     • RQ1: (a) Which types of pretraining methods are used for LLM-based APR?
            (b) Which types of inputs/prompts are used for LLM-based APR?
            (c) Which types of datasets are used for LLM-based APR?
     • RQ2: Which evaluation methods are used for LLM-based APR?
     • RQ3: (a) What are the strengths and weaknesses of existing approaches?
            (b) What should be the future direction of LLM-based APR?
  6. Search Keywords
     • Question template: “With ‘Population’, what is the effect of ‘Intervention’ on ‘Outcome’?”
     • Query form: (p1 OR p2 OR …) AND (i1 OR i2 OR …) AND (o1 OR o2)
     • Depending on the database, slight modifications are made (a minimal construction is sketched below)
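
The query form above can be assembled mechanically from keyword lists. The following is a minimal sketch, assuming placeholder population/intervention/outcome terms rather than the review's actual search keywords:

```python
# Hypothetical helper that builds the (p1 OR p2 ...) AND (i1 OR ...) AND (o1 OR o2)
# query form; the keyword lists below are illustrative, not the review's terms.
def build_query(population, intervention, outcome):
    group = lambda terms: "(" + " OR ".join(f'"{t}"' for t in terms) + ")"
    return " AND ".join(group(terms) for terms in (population, intervention, outcome))

query = build_query(
    population=["software", "program", "source code"],
    intervention=["large language model", "LLM", "transformer"],
    outcome=["repair", "bug fix"],
)
print(query)
# ("software" OR "program" OR "source code") AND ("large language model" OR "LLM" OR "transformer") AND ("repair" OR "bug fix")
```

Each database's search syntax would then need the "slight modifications" the slide mentions.
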
  7. Review Protocol
     Automatic Exclusion Keywords:
     • IR-based
     • Genetic
     Manual Exclusion Criteria:
     • Does not mention LLMs used for an APR task
     • Not focused on APR leveraging LLMs
     Automatic Exclusion Criteria:
     • Non-English
     • Written before 2017
     • Unpublished
     • Paid access / unavailable
     (An automatic-filter sketch follows this slide.)
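
For illustration only, here is a rough sketch of how the automatic exclusion step could be scripted; the record fields and dictionary format are assumptions, not the tooling actually used in the review:

```python
# Rough sketch of the automatic exclusion step; record fields are assumed.
EXCLUSION_KEYWORDS = {"ir-based", "genetic"}

def passes_automatic_filter(record):
    """record: dict with 'title', 'abstract', 'year', 'language', 'published', 'accessible'."""
    text = f"{record['title']} {record['abstract']}".lower()
    if any(keyword in text for keyword in EXCLUSION_KEYWORDS):
        return False                      # IR-based / genetic-programming APR is out of scope
    if record["language"] != "English" or record["year"] < 2017:
        return False                      # non-English or written before 2017
    return record["published"] and record["accessible"]

candidates = [
    {"title": "Genetic improvement for repair", "abstract": "", "year": 2020,
     "language": "English", "published": True, "accessible": True},
    {"title": "LLM-based program repair", "abstract": "", "year": 2023,
     "language": "English", "published": True, "accessible": True},
]
print([c["title"] for c in candidates if passes_automatic_filter(c)])  # only the LLM study survives
```

The manual exclusion criteria would then be applied by hand to the surviving studies.
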
  8. Empirical Findings
     • RQ1a: Pre-training and Finetuning
     • RQ1b: Inputs and Prompts
     • RQ1c: Datasets
     • RQ2: Evaluation Metrics
  9. Pretraining and Finetuning Examples (grouped on the slide as Pretraining / Finetuning / Neither)
     “CoCoNuT” [4]:
     • Ensemble learning, with separate encoders for the context and the buggy line
     • Aggressive abstraction: reduced the input vocabulary from 1.1M to 110K
     • Requires the model to learn new encodings and understand heavily abstracted code
     “Conversational Automated Program Repair” [6]:
     • Multi-shot prompting
     • Leverages conversational models finetuned for chat: ChatGPT and Codex
     “An empirical study of deep transfer learning-based program repair for Kotlin projects” [5]:
     • Converting their codebase from Java to Kotlin
     • Leverages pretrained models, transfer learning, and a small Kotlin dataset for finetuning (a minimal finetuning sketch follows this slide)
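
As a point of reference for the finetuning/transfer-learning pattern in [5], here is a minimal sketch of finetuning a pretrained code model on a tiny set of (buggy, fixed) pairs. The model name, the toy data, and the hyperparameters are illustrative assumptions, not details from the study:

```python
# Minimal finetuning sketch, assuming the Hugging Face transformers library and a
# pretrained seq2seq code model; all names and hyperparameters are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Salesforce/codet5-small"          # example pretrained code model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Tiny (buggy, fixed) pairs standing in for a small project-specific dataset.
pairs = [("if (i <= list.size())", "if (i < list.size())")]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):
    for buggy, fixed in pairs:
        inputs = tokenizer(buggy, return_tensors="pt")
        labels = tokenizer(fixed, return_tensors="pt").input_ids
        loss = model(**inputs, labels=labels).loss   # standard seq2seq cross-entropy
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```
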
  10. Pretraining and Finetuning
      [Charts: Pretraining Methods Used Across Studies; Distribution of Pretraining Methods Over Time]
  11. Input Representations
      • Abstract Syntax Tree: a new perspective; represents the structure of the code
      • Byte Pair Encoding: addresses the vocabulary problem, e.g. isRunning → is + Running
      • Abstraction: assists in generalization and vocabulary reduction
      (Images from [1] and [2]; an AST example is sketched below.)
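
To make the AST representation concrete, here is a tiny illustration using Python's built-in ast module (the reviewed studies mostly target Java and Kotlin code; the snippet is only meant to show the structural view an AST exposes):

```python
# Tiny AST illustration using Python's standard library (the indent argument
# requires Python 3.9+); the snippet itself is an arbitrary example.
import ast

snippet = "is_running = check_status(job)"
tree = ast.parse(snippet)
print(ast.dump(tree, indent=2))
# The model sees structured nodes (Assign -> Name, Call) instead of a flat token stream.
```
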
  12. Prompt Techniques
      Providing Hints:
      • Labelling the bug location
      • Including the buggy line
      Context:
      • Costs input size
      • Lines before & after the bug
      Multi-Shot:
      • Input/output examples
      • Using the model's output as the next input
      (Examples from [7]; a prompt-construction sketch follows this slide.)
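
Below is a hedged sketch of how these hint, context, and multi-shot ideas could be combined into a single prompt string. The marker format, helper name, and toy examples are assumptions for illustration, not the exact prompts used in [7]:

```python
# Illustrative prompt builder combining few-shot examples, surrounding context,
# and an explicit bug-location hint; the format and names are hypothetical.
def build_prompt(context_before, buggy_line, context_after, examples):
    shots = "\n\n".join(f"### Buggy:\n{b}\n### Fixed:\n{f}" for b, f in examples)
    return (
        f"{shots}\n\n### Buggy:\n"
        f"{context_before}\n"
        f"{buggy_line}  // <-- bug is on this line\n"
        f"{context_after}\n"
        "### Fixed:\n"
    )

examples = [("if (i <= n)", "if (i < n)")]   # one input/output shot
print(build_prompt("int i = 0;", "while (i <= n) {", "    i++;\n}", examples))
```

Adding more surrounding lines strengthens the hint but, as the slide notes, costs input size.
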
  13. Experimental Datasets
      1.) Defects4J: 21 studies
      2.) QuixBugs: 13 studies
      3.) CodeXGlue: 7 studies
      4.) Bugs.jar: 5 studies
      5.) Bugs2Fix: 4 studies
  14. Evaluation Metrics
      • The majority of studies use accuracy: an exact metric for patch correctness (black and white)
      • Studies using non-accuracy metrics: classification such as FixEval, which gives a ‘grey area’ of correctness
      [Chart: Evaluation Metrics Used (accuracy: 49, other: 9, overlap: 5)]
      (An exact-match accuracy sketch follows this slide.)
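
For concreteness, here is a minimal sketch of the exact-match flavour of "accuracy" described above; the whitespace normalization and the toy data are assumptions:

```python
# Exact-match accuracy: a generated patch counts as correct only if it equals
# the reference fix (after simple whitespace normalization, assumed here).
def exact_match_accuracy(predictions, references):
    normalize = lambda s: " ".join(s.split())
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["if (i < n) {", "return x;"]
refs  = ["if (i < n) {", "return x + 1;"]
print(exact_match_accuracy(preds, refs))  # 0.5 -- one of the two patches matches exactly
```

Metrics that grade patches rather than exact-match them are what give the "grey area" of correctness mentioned on the slide.
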
  15. Strengths
      • Datasets
        - FixEval: patch correctness classifier and dataset
        - RunBugRun: bug categories labeled and grouped by difficulty
      • Reinforcement Learning Approaches
        - Execution-based backpropagation
        - Feedback in a conversational approach (sketched below)
      • Input Representation Diversity
        - Primary studies are introducing new input representations
        - E.g. TARE: T-grammar & T-graph
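
The conversational, execution-based feedback idea can be sketched as a simple loop: generate a candidate patch, run the test suite, and feed any failure output back into the next prompt. Everything here (the generate_patch callable, the pytest command, the file name) is a placeholder, not a reconstruction of any specific study:

```python
# Hedged sketch of execution-based, conversational feedback; generate_patch is a
# placeholder for an LLM call, and the test command/file names are assumptions.
import subprocess

def repair_loop(buggy_code, generate_patch, max_rounds=3):
    prompt = f"Fix this code:\n{buggy_code}"
    for _ in range(max_rounds):
        patch = generate_patch(prompt)                  # LLM call (placeholder)
        with open("candidate.py", "w") as f:
            f.write(patch)
        result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        if result.returncode == 0:
            return patch                                # all tests pass: accept the patch
        # Otherwise, turn the failing test output into feedback for the next round.
        prompt = f"The patch failed these tests:\n{result.stdout}\nTry again:\n{patch}"
    return None
```
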
  16. Weaknesses
      • Lack of extensive pretraining:
        - Only 36% of the collected primary studies pretrain
        - Newer studies more often use existing models
      • Lack of certainty:
        - Hard to gain insight from successful and unsuccessful approaches
  17. Future Direction (Roadmap)
      • Advanced Reward Functions
        - Fine-grained feedback
        - Combining measures of success (e.g. BLEU score, code compiling, accuracy); a combined-reward sketch follows this slide
      • Dataset Curation
        - Metadata enables advanced reward functions and curriculum learning
      • Expanding Input Representations
        - Input representations highlight specific aspects of the input
        - More input representations used => more specific aspects of the input shown to the model
      (Figure from [3])
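
As a rough illustration of combining those measures into one reward signal, the sketch below mixes a token-overlap stand-in for BLEU, a compilation flag, and exact match. The weights and helper logic are assumptions, not a proposal from any reviewed study:

```python
# Illustrative combined reward; the token-overlap term is a crude stand-in for
# BLEU, and the weights are arbitrary assumptions.
def combined_reward(candidate, reference, compiles_ok, weights=(0.3, 0.3, 0.4)):
    w_sim, w_compile, w_exact = weights
    cand_tokens, ref_tokens = candidate.split(), reference.split()
    similarity = len(set(cand_tokens) & set(ref_tokens)) / max(len(ref_tokens), 1)
    exact = 1.0 if candidate.strip() == reference.strip() else 0.0
    return w_sim * similarity + w_compile * float(compiles_ok) + w_exact * exact

print(combined_reward("if (i < n) {", "if (i < n) {", compiles_ok=True))    # 1.0
print(combined_reward("if (i <= n) {", "if (i < n) {", compiles_ok=False))  # partial credit
```
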
  18. Threats to Validity (External, Internal, and Conclusion Validity; threats marked '-', mitigations marked '+')
      - Search string changes from database to database
      - Automatic filtering
      + 170 papers manually filtered
      - Surface-level information extracted
      + Categorization of techniques based on extraction tables
      + Conclusions backed by the extracted data
      + Followed the widely used guidelines of Kitchenham and Charters
      - Solo data extraction
  19. Take Home Messages
      • LLMs for programming languages allow the use of methods that are unavailable to natural language tasks
      • APR has the potential to significantly increase productivity
      • LLM-based APR requires:
        - Datasets equipped with appropriate metadata (e.g. executable test suites)
        - Pretrained, state-of-the-art model(s)
        - Advanced prompting, insightful input representations, and fine-grained loss functions
  20. References
      [1] Q. Zhu, Z. Sun, W. Zhang, Y. Xiong, and L. Zhang, “Tare: Type-aware neural program repair,” 2023. [Online]. Available: https://ezproxy.library.dal.ca/login?url=https://www.proquest.com/conference-papers-proceedings/tare-type-aware-neural-program-repair/docview/2837138632/se-2
      [2] K. Kim, M. Kim, and E. Lee, “Systematic analysis of defect-specific code abstraction for neural program repair,” 2022. [Online]. Available: https://ezproxy.library.dal.ca/login?url=https://www.proquest.com/conference-papers-proceedings/systematic-analysis-defect-specific-code/docview/2777587963/se-2
      [3] S. Bhatt, “Reinforcement learning 101,” Medium. [Online]. Available: https://towardsdatascience.com/reinforcement-learning-101-e24b50e1d292 (accessed Dec. 11, 2023).
      [4] T. Lutellier, H. V. Pham, L. Pang, Y. Li, M. Wei, and L. Tan, “CoCoNuT: Combining context-aware neural translation models using ensemble for program repair,” in Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, USA: ACM, Jul. 2020, pp. 101–114. [Online]. Available: https://dl.acm.org/doi/10.1145/3395363.3397369
      [5] M. Kim, Y. Kim, H. Jeong, J. Heo, S. Kim, H. Chung, and E. Lee, “An empirical study of deep transfer learning-based program repair for Kotlin projects,” in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Singapore: ACM, Nov. 2022, pp. 1441–1452. [Online]. Available: https://dl.acm.org/doi/10.1145/3540250.3558967
      [6] C. S. Xia and L. Zhang, “Conversational automated program repair,” arXiv:2301.13246 [cs], Jan. 2023. [Online]. Available: http://arxiv.org/abs/2301.13246
      [7] R. Paul, M. M. Hossain, M. L. Siddiq, M. Hasan, A. Iqbal, and J. C. S. Santos, “Enhancing automated program repair through fine-tuning and prompt engineering,” arXiv:2304.07840 [cs], Jul. 2023. [Online]. Available: http://arxiv.org/abs/2304.07840