Honours Thesis Defense

Callum MacNeil
January 03, 2024

Honours Thesis Defense Slides from December 2023

Transcript

  1. A Systematic Review of Automated Program Repair using Large Language Models
     Callum MacNeil, BCS student, Faculty of Computer Science, Dalhousie University, Canada
     Supervisor: Dr. Masud Rahman, RAISE Lab (Intelligent Automation in Software EngineeRing)
  2. Agenda
     • Motivation
     • Research Methodology
     • Research Questions
     • Empirical Findings
     • Threats to Validity
     • Take Home Messages
     • Q & A
  3. Motivation
     • Up to 50% of development time is spent fixing bugs
     • LLMs have proven effective for code: GitHub Copilot increases developer productivity by 55%
     • Code generation alone, without bug fixing, does not deliver those productivity gains
     • No existing systematic review on this topic covers all design choices
  4. Research Methodology
     Ask Research Questions → Find related existing reviews → Read guidelines from Kitchenham and Charters →
     Design Review Protocol → Study Collection → Study Filtration → Data Extraction & Analysis
  5. Research Questions
     • RQ1: (a) Which types of pretraining methods are used for LLM-based APR?
            (b) Which types of inputs/prompts are used for LLM-based APR?
            (c) Which types of datasets are used for LLM-based APR?
     • RQ2: Which evaluation methods are used for LLM-based APR?
     • RQ3: (a) What are the strengths and weaknesses of existing approaches?
            (b) What should be the future direction of LLM-based APR?
  6. Search Keywords
     • Question template: “With ‘Population’, what is the effect of ‘Intervention’ on ‘Outcome’?”
     • Query form: (p1 OR p2 OR …) AND (i1 OR i2 OR …) AND (o1 OR o2)
     • Depending on the database, slight modifications are made (a minimal construction is sketched below)
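
The query form above can be assembled mechanically from keyword lists. The following is a minimal sketch, assuming placeholder population/intervention/outcome terms rather than the review's actual search keywords:

```python
# Hypothetical helper that builds the (p1 OR p2 ...) AND (i1 OR ...) AND (o1 OR o2)
# query form; the keyword lists below are illustrative, not the review's terms.
def build_query(population, intervention, outcome):
    group = lambda terms: "(" + " OR ".join(f'"{t}"' for t in terms) + ")"
    return " AND ".join(group(terms) for terms in (population, intervention, outcome))

query = build_query(
    population=["software", "program", "source code"],
    intervention=["large language model", "LLM", "transformer"],
    outcome=["repair", "bug fix"],
)
print(query)
# ("software" OR "program" OR "source code") AND ("large language model" OR "LLM" OR "transformer") AND ("repair" OR "bug fix")
```

Each database's search syntax would then need the "slight modifications" the slide mentions.
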
  7. Review Protocol
     Automatic Exclusion Keywords:
     • IR-based
     • Genetic
     Manual Exclusion Criteria:
     • Does not mention LLMs used for an APR task
     • Not focused on APR leveraging LLMs
     Automatic Exclusion Criteria:
     • Non-English
     • Written before 2017
     • Unpublished
     • Paid access / unavailable
     (An automatic-filter sketch follows this slide.)
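
For illustration only, here is a rough sketch of how the automatic exclusion step could be scripted; the record fields and dictionary format are assumptions, not the tooling actually used in the review:

```python
# Rough sketch of the automatic exclusion step; record fields are assumed.
EXCLUSION_KEYWORDS = {"ir-based", "genetic"}

def passes_automatic_filter(record):
    """record: dict with 'title', 'abstract', 'year', 'language', 'published', 'accessible'."""
    text = f"{record['title']} {record['abstract']}".lower()
    if any(keyword in text for keyword in EXCLUSION_KEYWORDS):
        return False                      # IR-based / genetic-programming APR is out of scope
    if record["language"] != "English" or record["year"] < 2017:
        return False                      # non-English or written before 2017
    return record["published"] and record["accessible"]

candidates = [
    {"title": "Genetic improvement for repair", "abstract": "", "year": 2020,
     "language": "English", "published": True, "accessible": True},
    {"title": "LLM-based program repair", "abstract": "", "year": 2023,
     "language": "English", "published": True, "accessible": True},
]
print([c["title"] for c in candidates if passes_automatic_filter(c)])  # only the LLM study survives
```

The manual exclusion criteria would then be applied by hand to the surviving studies.
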
  8. Empirical Findings
     • RQ1a: Pre-training and Finetuning
     • RQ1b: Inputs and Prompts
     • RQ1c: Datasets
     • RQ2: Evaluation Metrics
  9. Pretraining and Finetuning Examples (grouped on the slide as Pretraining / Finetuning / Neither)
     “CoCoNuT” [4]:
     • Ensemble learning, with separate encoders for the context and the buggy line
     • Aggressive abstraction: reduced the input vocabulary from 1.1M to 110K
     • Requires the model to learn new encodings and understand heavily abstracted code
     “Conversational Automated Program Repair” [6]:
     • Multi-shot prompting
     • Leverages conversational models finetuned for chat: ChatGPT and Codex
     “An empirical study of deep transfer learning-based program repair for Kotlin projects” [5]:
     • Converting their codebase from Java to Kotlin
     • Leverages pretrained models, transfer learning, and a small Kotlin dataset for finetuning (a minimal finetuning sketch follows this slide)
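
As a point of reference for the finetuning/transfer-learning pattern in [5], here is a minimal sketch of finetuning a pretrained code model on a tiny set of (buggy, fixed) pairs. The model name, the toy data, and the hyperparameters are illustrative assumptions, not details from the study:

```python
# Minimal finetuning sketch, assuming the Hugging Face transformers library and a
# pretrained seq2seq code model; all names and hyperparameters are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Salesforce/codet5-small"          # example pretrained code model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Tiny (buggy, fixed) pairs standing in for a small project-specific dataset.
pairs = [("if (i <= list.size())", "if (i < list.size())")]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):
    for buggy, fixed in pairs:
        inputs = tokenizer(buggy, return_tensors="pt")
        labels = tokenizer(fixed, return_tensors="pt").input_ids
        loss = model(**inputs, labels=labels).loss   # standard seq2seq cross-entropy
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```
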
  10. Pretraining and Finetuning
      [Charts: Pretraining Methods Used Across Studies; Distribution of Pretraining Methods Over Time]
  11. Input Representations
      • Abstract Syntax Tree: a new perspective; represents the structure of the code
      • Byte Pair Encoding: addresses the vocabulary problem, e.g. isRunning → is + Running
      • Abstraction: assists in generalization and vocabulary reduction
      (Images from [1] and [2]; an AST example is sketched below.)
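
To make the AST representation concrete, here is a tiny illustration using Python's built-in ast module (the reviewed studies mostly target Java and Kotlin code; the snippet is only meant to show the structural view an AST exposes):

```python
# Tiny AST illustration using Python's standard library (the indent argument
# requires Python 3.9+); the snippet itself is an arbitrary example.
import ast

snippet = "is_running = check_status(job)"
tree = ast.parse(snippet)
print(ast.dump(tree, indent=2))
# The model sees structured nodes (Assign -> Name, Call) instead of a flat token stream.
```
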
  12. Prompt Techniques
      Providing Hints:
      • Labelling the bug location
      • Including the buggy line
      Context:
      • Costs input size
      • Lines before & after the bug
      Multi-Shot:
      • Input/output examples
      • Using the model's output as the next input
      (Examples from [7]; a prompt-construction sketch follows this slide.)
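
Below is a hedged sketch of how these hint, context, and multi-shot ideas could be combined into a single prompt string. The marker format, helper name, and toy examples are assumptions for illustration, not the exact prompts used in [7]:

```python
# Illustrative prompt builder combining few-shot examples, surrounding context,
# and an explicit bug-location hint; the format and names are hypothetical.
def build_prompt(context_before, buggy_line, context_after, examples):
    shots = "\n\n".join(f"### Buggy:\n{b}\n### Fixed:\n{f}" for b, f in examples)
    return (
        f"{shots}\n\n### Buggy:\n"
        f"{context_before}\n"
        f"{buggy_line}  // <-- bug is on this line\n"
        f"{context_after}\n"
        "### Fixed:\n"
    )

examples = [("if (i <= n)", "if (i < n)")]   # one input/output shot
print(build_prompt("int i = 0;", "while (i <= n) {", "    i++;\n}", examples))
```

Adding more surrounding lines strengthens the hint but, as the slide notes, costs input size.
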
  13. Experimental Datasets
      1.) Defects4J: 21 studies
      2.) QuixBugs: 13 studies
      3.) CodeXGlue: 7 studies
      4.) Bugs.jar: 5 studies
      5.) Bugs2Fix: 4 studies
  14. Evaluation Metrics
      • The majority of studies use accuracy: an exact metric for patch correctness (black and white)
      • Studies using non-accuracy metrics: classification such as FixEval, which gives a ‘grey area’ of correctness
      [Chart: Evaluation Metrics Used (accuracy: 49, other: 9, overlap: 5)]
      (An exact-match accuracy sketch follows this slide.)
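
For concreteness, here is a minimal sketch of the exact-match flavour of "accuracy" described above; the whitespace normalization and the toy data are assumptions:

```python
# Exact-match accuracy: a generated patch counts as correct only if it equals
# the reference fix (after simple whitespace normalization, assumed here).
def exact_match_accuracy(predictions, references):
    normalize = lambda s: " ".join(s.split())
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["if (i < n) {", "return x;"]
refs  = ["if (i < n) {", "return x + 1;"]
print(exact_match_accuracy(preds, refs))  # 0.5 -- one of the two patches matches exactly
```

Metrics that grade patches rather than exact-match them are what give the "grey area" of correctness mentioned on the slide.
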
  15. Strengths
      • Datasets
        - FixEval: patch correctness classifier and dataset
        - RunBugRun: bug categories labeled and grouped by difficulty
      • Reinforcement Learning Approaches
        - Execution-based backpropagation
        - Feedback in a conversational approach (sketched below)
      • Input Representation Diversity
        - Primary studies are introducing new input representations
        - E.g. TARE: T-grammar & T-graph
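
The conversational, execution-based feedback idea can be sketched as a simple loop: generate a candidate patch, run the test suite, and feed any failure output back into the next prompt. Everything here (the generate_patch callable, the pytest command, the file name) is a placeholder, not a reconstruction of any specific study:

```python
# Hedged sketch of execution-based, conversational feedback; generate_patch is a
# placeholder for an LLM call, and the test command/file names are assumptions.
import subprocess

def repair_loop(buggy_code, generate_patch, max_rounds=3):
    prompt = f"Fix this code:\n{buggy_code}"
    for _ in range(max_rounds):
        patch = generate_patch(prompt)                  # LLM call (placeholder)
        with open("candidate.py", "w") as f:
            f.write(patch)
        result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        if result.returncode == 0:
            return patch                                # all tests pass: accept the patch
        # Otherwise, turn the failing test output into feedback for the next round.
        prompt = f"The patch failed these tests:\n{result.stdout}\nTry again:\n{patch}"
    return None
```
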
  16. Weaknesses
      • Lack of extensive pretraining:
        - Only 36% of the collected primary studies pretrain
        - Newer studies more often use existing models
      • Lack of certainty:
        - Hard to gain insight from successful and unsuccessful approaches
  17. Future Direction (Roadmap)
      • Advanced Reward Functions
        - Fine-grained feedback
        - Combining measures of success (e.g. BLEU score, code compiling, accuracy); a combined-reward sketch follows this slide
      • Dataset Curation
        - Metadata enables advanced reward functions and curriculum learning
      • Expanding Input Representations
        - Input representations highlight specific aspects of the input
        - More input representations used => more specific aspects of the input shown to the model
      (Figure from [3])
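
As a rough illustration of combining those measures into one reward signal, the sketch below mixes a token-overlap stand-in for BLEU, a compilation flag, and exact match. The weights and helper logic are assumptions, not a proposal from any reviewed study:

```python
# Illustrative combined reward; the token-overlap term is a crude stand-in for
# BLEU, and the weights are arbitrary assumptions.
def combined_reward(candidate, reference, compiles_ok, weights=(0.3, 0.3, 0.4)):
    w_sim, w_compile, w_exact = weights
    cand_tokens, ref_tokens = candidate.split(), reference.split()
    similarity = len(set(cand_tokens) & set(ref_tokens)) / max(len(ref_tokens), 1)
    exact = 1.0 if candidate.strip() == reference.strip() else 0.0
    return w_sim * similarity + w_compile * float(compiles_ok) + w_exact * exact

print(combined_reward("if (i < n) {", "if (i < n) {", compiles_ok=True))    # 1.0
print(combined_reward("if (i <= n) {", "if (i < n) {", compiles_ok=False))  # partial credit
```
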
  18. Threats to Validity (External, Internal, and Conclusion Validity; threats marked '-', mitigations marked '+')
      - Search string changes from database to database
      - Automatic filtering
      + 170 papers manually filtered
      - Surface-level information extracted
      + Categorization of techniques based on extraction tables
      + Conclusions backed by the extracted data
      + Followed the widely used guidelines of Kitchenham and Charters
      - Solo data extraction
  19. Take Home Messages
      • LLMs for programming languages allow the use of methods that are unavailable to natural language tasks
      • APR has the potential to significantly increase productivity
      • LLM-based APR requires:
        - Datasets equipped with appropriate metadata (e.g. executable test suites)
        - Pretrained, state-of-the-art model(s)
        - Advanced prompting, insightful input representations, and fine-grained loss functions
  20. References
      [1] Q. Zhu, Z. Sun, W. Zhang, Y. Xiong, and L. Zhang, “Tare: Type-aware neural program repair,” 2023. [Online]. Available: https://ezproxy.library.dal.ca/login?url=https://www.proquest.com/conference-papers-proceedings/tare-type-aware-neural-program-repair/docview/2837138632/se-2
      [2] K. Kim, M. Kim, and E. Lee, “Systematic analysis of defect-specific code abstraction for neural program repair,” 2022. [Online]. Available: https://ezproxy.library.dal.ca/login?url=https://www.proquest.com/conference-papers-proceedings/systematic-analysis-defect-specific-code/docview/2777587963/se-2
      [3] S. Bhatt, “Reinforcement learning 101,” Medium. [Online]. Available: https://towardsdatascience.com/reinforcement-learning-101-e24b50e1d292 (accessed Dec. 11, 2023).
      [4] T. Lutellier, H. V. Pham, L. Pang, Y. Li, M. Wei, and L. Tan, “CoCoNuT: Combining context-aware neural translation models using ensemble for program repair,” in Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, USA: ACM, Jul. 2020, pp. 101–114. [Online]. Available: https://dl.acm.org/doi/10.1145/3395363.3397369
      [5] M. Kim, Y. Kim, H. Jeong, J. Heo, S. Kim, H. Chung, and E. Lee, “An empirical study of deep transfer learning-based program repair for Kotlin projects,” in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Singapore: ACM, Nov. 2022, pp. 1441–1452. [Online]. Available: https://dl.acm.org/doi/10.1145/3540250.3558967
      [6] C. S. Xia and L. Zhang, “Conversational automated program repair,” arXiv:2301.13246 [cs], Jan. 2023. [Online]. Available: http://arxiv.org/abs/2301.13246
      [7] R. Paul, M. M. Hossain, M. L. Siddiq, M. Hasan, A. Iqbal, and J. C. S. Santos, “Enhancing automated program repair through fine-tuning and prompt engineering,” arXiv:2304.07840 [cs], Jul. 2023. [Online]. Available: http://arxiv.org/abs/2304.07840