

Multi-Agent Large Language Models for Code Intelligence: Opportunities, Challenges, and Research Directions

Large Language Models (LLMs) have demonstrated impressive capabilities in tasks such as code generation, completion, and repair. A recent trend involves designing agentic systems, where multiple LLM-powered agents collaborate to accomplish software engineering goals. These multi-agent setups have achieved state-of-the-art performance in several tasks, but their practical adoption faces significant challenges. In this tutorial, we will begin by introducing the foundations of agentic systems and their applications in software engineering. We will then broaden the discussion to cover critical aspects of code intelligence using LLMs and agentic systems that must be addressed for real-world deployment, including code comprehension, effective communication, security concerns, explainability, reasoning capabilities, and computational efficiency. Through this lens, we will analyze the current limitations of LLMs and LLM-based agents, highlighting the gaps and how to adapt the models to our requirements. We will conclude with several directions and considerations for using agentic systems.


Fatemeh Fard (UBC)

November 18, 2025


Transcript

  1. MULTI-AGENT LARGE LANGUAGE MODELS FOR CODE INTELLIGENCE: OPPORTUNITIES, CHALLENGES, AND

    RESEARCH DIRECTIONS. FATEMEH HENDIJANI FARD, UNIVERSITY OF BRITISH COLUMBIA
  2. • AI and SE • Knowledge transfer to new languages

    and domains with limited data • Computational efficiency of models • Explainability and reasoning (AI Safety) • Agent communications and alignment • SE applications • Code generation • Comment generation and documentation • Code clone detection • Code search • Code reviews • Vulnerability detection and repair • Financial domain 2
  3. Code Intelligence • Code generation • Program repair • Program

    comprehension • Code review • Vulnerability detection and repair • … 3
  4. AI4SE Evolution: Machine learning and statistical methods → Deep learning

    models → Language models, LLMs, FMs → AI Agents → AIware, Agentware, Mindware 4
  5. Agenda • What are agents and multi-agent systems? • The

    state of MAS in SE • Challenges • Opportunities 6 The introduction part is mainly from these resources: Liu, Junwei, Kaixin Wang, Yixuan Chen, Xin Peng, Zhenpeng Chen, Lingming Zhang, and Yiling Lou. "Large language model-based agents for software engineering: A survey." arXiv preprint arXiv:2409.02977 (2024). Wang, Yanlin, Wanjun Zhong, Yanxian Huang, Ensheng Shi, Min Yang, Jiachi Chen, Hui Li, Yuchi Ma, Qianxiang Wang, and Zibin Zheng. "Agents in software engineering: Survey, landscape, and vision." Automated Software Engineering 32, no. 2 (2025): 1-36. He, Junda, Christoph Treude, and David Lo. "LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision, and the Road Ahead." ACM Transactions on Software Engineering and Methodology 34.5 (2025): 1-30.
  6. 8 Picture from: Liu, Junwei, Kaixin Wang, Yixuan Chen, Xin

    Peng, Zhenpeng Chen, Lingming Zhang, and Yiling Lou. "Large language model-based agents for software engineering: A survey." arXiv preprint arXiv:2409.02977 (2024).
  7. 9 Picture from: Wang, Yanlin, Wanjun Zhong, Yanxian Huang, Ensheng

    Shi, Min Yang, Jiachi Chen, Hui Li, Yuchi Ma, Qianxiang Wang, and Zibin Zheng. "Agents in software engineering: Survey, landscape, and vision." Automated Software Engineering 32, no. 2 (2025): 1-36.
  8. Perception • Perception: connect to external environment, sense, interpret, and

    understand inputs. • Input modalities: • Textual: tokens, trees, graphs, hybrid • Visual: UML designs, UI in mobile apps • Audio or sensors 10
  9. Memory • Storage for information on historical data, actions, thoughts,

    current state (environment), external information. • Used for revisiting and utilizing previous records for deciding on new actions and refinements. • Semantic memory: world knowledge through RAG, APIs, libraries, … • Episodic memory: records of the current case and experience from previous decisions, historical interactions, past reasoning, … • Procedural memory: implicit knowledge (stored parameters of LLMs), explicit knowledge (how we code agents to perform actions) 11
  10. Memory • Short term (working memory): used for trajectories for

    the current task • Dialog • Action-observation-critique records • Intermediate output (e.g., summaries) • Long term: valuable historical experiences • Distilled trajectories (e.g., summaries) • Filtered information 12 Operations: Read/write Formats: NL, PL, key-value pairs, embeddings, trees
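A minimal sketch of how the short-term / long-term memory split described above could be represented in code. This is illustrative only, not from the tutorial or any specific framework; the class and method names (AgentMemory, write_step, read_context, distill) are hypothetical.

```python
# Minimal sketch of an agent memory store, assuming the short-term / long-term
# split described on the slides. All names are illustrative placeholders.
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    # Short-term (working) memory: trajectory of the current task.
    working: list[dict] = field(default_factory=list)
    # Long-term memory: distilled/filtered records from previous tasks.
    long_term: list[str] = field(default_factory=list)

    def write_step(self, action: str, observation: str, critique: str = "") -> None:
        """Record one action-observation-critique step for the current task."""
        self.working.append(
            {"action": action, "observation": observation, "critique": critique}
        )

    def read_context(self, k: int = 5) -> str:
        """Return long-term notes plus the last k working-memory steps as prompt context."""
        recent = self.working[-k:]
        lines = [f"{s['action']} -> {s['observation']}" for s in recent]
        return "\n".join(self.long_term + lines)

    def distill(self, summary: str) -> None:
        """Promote a summary of the finished trajectory into long-term memory."""
        self.long_term.append(summary)
        self.working.clear()
```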
  11. Action • Executions of the agents • Internal actions: reasoning,

    planning, retrieval, learning (updating memory) • External actions: communication with humans or other agents, interaction with the external environment (e.g., search, compilers, external tools) • Searching tools, file operation, GUI operation, static program analysis, dynamic analysis, testing tools, fault localization, version control tools 13
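To connect internal actions (reasoning, memory updates) with external actions (tool calls), an agent typically runs a decide-act-observe loop. The sketch below reuses the hypothetical memory interface sketched above; llm_decide and the tools dictionary are placeholders, not part of any framework mentioned in the slides.

```python
# Illustrative agent action loop: the LLM proposes an action, an external tool
# executes it, and the observation is written back to memory.
def run_agent(task: str, memory, llm_decide, tools: dict, max_steps: int = 10):
    for _ in range(max_steps):
        context = memory.read_context()
        # llm_decide is assumed to return a dict like
        # {"tool": "run_tests", "args": {...}, "done": False} or {"done": True, "answer": ...}
        decision = llm_decide(task=task, context=context)
        if decision.get("done"):
            return decision.get("answer")
        tool = tools[decision["tool"]]              # e.g., "search", "compile", "run_tests"
        observation = tool(**decision["args"])      # external action
        memory.write_step(decision["tool"], str(observation))  # internal action: update memory
    return None
```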
  12. Human-Agent Collaboration 15 Planning, Requirements, Development, Evaluation; modify

    workflows; clarify ambiguities; guide agents to overcome failures; evaluate results (output correctness and alignment)
  13. Agents in SE • Software development • Requirements engineering: elicitation,

    modeling, negotiation, verification, evolution • Code generation: Plan and refine iteratively • feedback (self- or peer-), tool, human • Testing: generate and refine • Feedback from compilation, test execution, mutation testing, static analysis • Software maintenance • Fault localization with agent roles: code reviewer, architect, controller, expert • Tools: static analysis, retrieval, navigation, dynamic info collection, • Program repair: patch generation, validation, feedback analysis • Compilation, execution, checking • Single tasks and end-to-end pipelines for software development or maintenance 17
  14. Challenges • Hallucinations in LLMs • Dependency on prompts •

    Computational costs • Inference time and number of iterations • Context window size • Handling new versions of LLMs • Security and privacy aspects • Deployment and AI-Ops 18
  15. Research Opportunities 20 (1) Define agent roles mapped to SE roles

    or beyond; (2) identify their capabilities and gaps; (3) develop methods to fill those capability gaps. He, Junda, Christoph Treude, and David Lo. "LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision, and the Road Ahead." ACM Transactions on Software Engineering and Methodology 34.5 (2025): 1-30.
  16. Benchmarks and Metrics 23

    Category | Metrics
    Execution Validation | Pass rate, Pass@K, Executability, #Errors
    Similarity | sketchBLEU, Cosine distance
    Costs | Running time, Token usage, Expenses, #Sprints
    Manual Efforts | Human revision costs
    Generated Code Scale | Lines of code, Code files, Completeness
    Table from: Liu, Junwei, Kaixin Wang, Yixuan Chen, Xin Peng, Zhenpeng Chen, Lingming Zhang, and Yiling Lou. "Large language model-based agents for software engineering: A survey." arXiv preprint arXiv:2409.02977 (2024).
  17. Benchmarks and Metrics Challenges: • Hallucinations in LLMs • Dependence

    on prompts • Computational costs • Inference time and number of iterations • Context window size • Handling new versions of LLMs • Security and privacy aspects 24 (metrics table repeated from the previous slide: Execution Validation, Similarity, Costs, Manual Efforts, Generated Code Scale)
  18. Hidden Costs • LLMs used • Configs • Prompting details

    and their effects • Prompting details and how they are affected by the distributions • Automatic Program Repair example: DeepSeek-V3 (with over 600B total and 30B+ active parameters) reaches 100% Pass@1 on Ruby, while DeepSeek-Coder-6.7B-Instruct reduces performance to 61% 25
  19. LLM for Coding / Software Engineering tasks Software → AIware

    (AI-Powered Software) • Code generated by LLMs will appear in different software and applications 28 https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/
  20. 29

  21. Code Benchmarks 32 Code generation benchmarks HumanEval, MBPP, HumanEval+, EvalPlus

    Code comprehension benchmarks CRUXEval, REval, SpecEval Code repair benchmarks Single-turn bug fixing (e.g., CodeXGLUE code repair), FeedbackEval
  22. Code Reasoning and Real-world Benchmarks 33 Contest problems: CodeContests, APPS

    Algorithmic problems: TACO Software Engineering benchmarks: SWE-bench, SWE-bench Verified, SWE-bench+, SWE-Gym, RepoBench, RepoExec, RepoEval/RepoCoder, CrossCodeEval, EvoCodeBench, DevEval, FullStack Bench, ComplexCodeEval, BioCoder, BigCodeBench, ALE-Bench, SWE-Bench Live, τ-Bench
  23. Evaluation Metrics • Pass@K • Recall@K • Accuracy@K • Exact

    Match • Repair@K • CodeBLEU • Resolution Rate • Other similarity-based scores 34
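Of the metrics listed above, Pass@K is usually computed with the unbiased estimator from the Codex paper (Chen et al., 2021): draw n samples per problem, count the c that pass the tests, and estimate pass@k = 1 - C(n-c, k)/C(n, k), averaged over problems. A minimal sketch:

```python
# Unbiased pass@k estimator (Chen et al., 2021): n samples per problem,
# c of them pass the unit tests.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # every size-k subset contains at least one passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 200 samples for one problem, 31 pass, k = 1
print(pass_at_k(200, 31, 1))  # 0.155
```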
  24. 36

  25. Evaluation Metrics Not Considering … • They evaluate functional correctness •

    But not other aspects rigorously • Security • Optimization • Readability • Performance • Maintainability • … 37
  26. Current Limitations • Context and Scale • Multi-Step Reasoning & Planning • Human-in-the-Loop

    Interaction • Robustness and Generalization • Evaluation Beyond “Pass/Fail” • Integration and Long-Term Tasks • Holistic contextual understanding • Only for the most popular programming languages 38
  27. Good News … EffiBench (NeurIPS 2024) Investigating Software Aging in

    LLM-Generated Software Systems (Oct 2025) SecurityEval dataset (2022) CYBERSECEVAL 1 (2023), 2 (2024), 3 (2024) CWEval (Jan 2025)
  28. 40

  29. 41

  30. 42

  31. 43

  32. 44

  33. Research opportunities • Define agent roles • Identify LLM gaps

    • Develop methods to fill those gaps 46 Jie Wu and Fatemeh Fard. HumanEvalComm: Benchmarking the Communication Competence of Code Generation for LLMs and LLM Agent. ACM Transactions on Software Engineering and Methodology, 2025. Jie JW Wu, Manav Chaudhary, Davit Abrahamyan, Arhaan Khaku, Anjiang Wei, and Fatemeh Fard. ClarifyCoder: Clarification-Aware Fine-Tuning for Programmatic Problem Solving.
  34. Truthfulness of Code LLM Systems Thinking Iceberg Problem (Visible Events):

    Software engineers often ask clarifying questions to reduce ambiguity in both requirements and coding solutions, but LLMs do not do so for coding tasks. Vision (Invisible Structure): For coding tasks, an AI system should proactively recognize which information is missing, and find these missing pieces to be able to complete the task with high quality. 49 Wu, Jie JW, and Fatemeh H. Fard. "Benchmarking the communication competence of code generation for LLMs and LLM agent." TOSEM, 2025.
  35. Evaluating Communication Skills of Code LLM • Communication skills of

    a model = “being able to ask clarifying questions when the description of the code generation problem has issues”. • HumanEvalComm: modified problem descriptions according to three issues: • Inconsistency • Ambiguity • Incompleteness To develop HumanEvalComm, we changed each problem description in HumanEval manually, using a taxonomy of clarification types. 50
  36. 53

  37. Evaluation Flowchart and Metrics 54 1st Round: • Communication Rate:

    The percentage of model responses that ask clarifying questions. • Good Question Rate: The percentage of model responses asking Good questions (quality label=3). 2nd Round: • Pass@k • Test Pass Rate
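As a rough sketch of the first-round metrics (not the benchmark's released code), both rates are simple proportions over judged responses; the dictionary field names below are hypothetical.

```python
# Sketch of the first-round HumanEvalComm metrics, assuming each judged response
# is a dict like {"asks_question": bool, "quality": int}, where quality is the
# evaluator's 1-3 label (3 = Good). Returns fractions rather than percentages.
def communication_rate(responses: list[dict]) -> float:
    """Share of model responses that ask clarifying questions."""
    return sum(r["asks_question"] for r in responses) / len(responses)


def good_question_rate(responses: list[dict]) -> float:
    """Share of model responses whose clarifying question is rated Good (label = 3)."""
    return sum(r["asks_question"] and r["quality"] == 3 for r in responses) / len(responses)
```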
  38. LLM Agent Approach (Okanagan) We propose Okanagan, which leverages a multi-round

    structure and a customized prompt format for asking clarifying questions in code generation tasks. We introduce three rounds in Okanagan (a rough sketch follows below). 55
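The exact design of Okanagan's three rounds is defined in the paper; the sketch below is only one plausible shape of such a multi-round clarifying workflow, with llm() and the prompts as hypothetical placeholders.

```python
# Hedged sketch of a multi-round clarifying workflow in the spirit of Okanagan.
# llm() and answer_questions() are hypothetical stand-ins, not the paper's code.
def clarify_then_code(problem: str, llm, answer_questions) -> str:
    # Round 1: decide whether the problem needs clarification; ask if so.
    reply = llm(
        "If this problem is ambiguous, incomplete, or inconsistent, ask clarifying "
        "questions; otherwise answer 'NO QUESTIONS'.\n\n" + problem
    )
    if "NO QUESTIONS" in reply:
        return llm(f"Write a solution for:\n{problem}")  # no clarification needed
    # Round 2: obtain answers (from a user, or an evaluator LLM in the benchmark).
    answers = answer_questions(reply)
    # Round 3: generate code using the clarified problem description.
    return llm(f"Problem:\n{problem}\n\nClarifications:\n{answers}\n\nWrite the solution.")
```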
  39. How do Code LLMs perform in communication skills? Key takeaway:

    Over 60% of responses from Code LLMs still generate code rather than ask questions when the problem descriptions are manually modified according to different clarification categories. Pass@1 and Test Pass Rate of Code LLMs drop by 35% ∼ 52% and by 17% ∼ 35% respectively, with statistical significance in each category for over 75% of the numbers. 56
  43. Compare different categories? The Incompleteness category has a higher Communication Rate and

    Good Question Rate, but lower Pass@1 and Test Pass Rate than the Ambiguity and Inconsistency categories. A combination of two clarification types leads to much lower Test Pass Rates than one clarification type. 60 (Chart categories: Ambiguity, Inconsistency, Incompleteness)
  44. How does the LLM agent (Okanagan) perform? 61 Key Takeaway: Okanagan

    effectively increases Communication Rate and Good Question Rate by an absolute 58% and 38%, and thus boosts Pass@1 and Test Pass Rate by an absolute 8% and 7%. However, Okanagan still tends to ask questions that appear to be unnecessary for original problems that do not need them.
  45. How reliable is the LLM-based evaluator? We recruited 6 students at

    UBC to manually assess the results of the LLM-based evaluator and mark the quality of the models’ responses: Takeaway: The LLM-based evaluator provides acceptable answers to models’ responses, with a higher than 50% Acceptable Answer Rate for all models and Good Answer Rate for most models. False Recovery Rates need to be further reduced for some models to ensure the reliability of Test Pass Rates and Pass@1. 62 • Question or Not • Question Quality • Answer Quality
  46. How reliable is the LLM-based evaluator? • The Communication Rate aligns

    reasonably well with the manual evaluation • Higher than 50% acceptable answers for all models • But, for the Good Question Rate, the LLM-based evaluator tends to mark more “Good” questions than it should. Takeaway: metrics and answers from the LLM-based evaluator can provide useful insights to guide our experiments, but they may not perfectly align with human judgment. 63
  47. Gap between Code LLM and humans? While LLMs begin to

    ask more clarifying questions after 50% of the description is removed, even with 90% removed, only 54% of model responses prompted questions. Indication: Code Models should increase the ability to ask clarifying questions for coding tasks​, but how? • LLM-based Agent? (Okanagan) • Go deeper at the model level! ◦ Alignment (Fine-tuning) 64
  48. Ambiguous implementations for the problem statement “Return list with elements

    incremented by a number”. Variants demonstrate alternative interpretations of the problem. 66
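The figure on this slide shows the study's concrete variants; as a generic illustration only (not the slide's exact code), two plausible readings of the same statement could be:

```python
# Two plausible readings of "return list with elements incremented by a number"
# (illustrative only; the slide's own variants may differ).
def incremented_by_given(xs: list[int], n: int) -> list[int]:
    """Reading 1: 'a number' is an explicit argument n added to every element."""
    return [x + n for x in xs]


def incremented_by_one(xs: list[int]) -> list[int]:
    """Reading 2: 'a number' is left unspecified and defaults to 1."""
    return [x + 1 for x in xs]


assert incremented_by_given([1, 2, 3], 2) == [3, 4, 5]
assert incremented_by_one([1, 2, 3]) == [2, 3, 4]
```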
  49. Asking Clarifying Questions? • LLMs generate code outputs in over

    63% of ambiguous scenarios without seeking necessary clarifications on the HumanEvalComm benchmark • LLMs asking clarifying questions: • ClarifyGPT • TICODER • ClariGen • Okanagan 67
  50. Asking Clarifying Questions All the Time? 68 User interaction with

    clarifying questions is enforced even for standard coding tasks that could be completed without clarifications. ClarifyGPT, TICODER, and ClariGen explicitly require the user interaction module. Okanagan asks clarifying questions only when necessary, but still asks some unnecessary questions, affecting Pass@1. LLM-based workflows are more complicated and costly than a single LLM: ClariGen and Okanagan need 3 calls to LLMs; ClarifyGPT needs 4 calls. We need a balance between avoiding unnecessary questions and asking the truly needed ones.
  51. 69

  52. 70

  53. Data Synthesis Process • Synthesize data: • Modify problems into

    the three categories • Ask the model to generate a question to clarify the problem description • Using APPS data, ~30K new data points are generated (a sketch of this loop follows below) 71
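A rough sketch of the synthesis loop described above, with the category instructions paraphrasing the three modification types detailed on the next slide. llm() is a hypothetical placeholder, and the prompts are illustrative rather than the ones used to build the ~30K ClarifyCoder examples.

```python
# Sketch of the data synthesis loop: each APPS problem is rewritten into the
# three clarification categories, then the model produces the clarifying
# question that would resolve the introduced issue. llm() is a placeholder.
CATEGORIES = {
    "ambiguous": "Rewrite the problem so it admits multiple valid interpretations.",
    "incomplete": "Rewrite the problem so a key condition needed to solve it is omitted.",
    "inconsistent": "Rewrite the problem so it contains conflicting statements.",
}


def synthesize(problems: list[str], llm) -> list[dict]:
    records = []
    for problem in problems:
        for category, instruction in CATEGORIES.items():
            modified = llm(f"{instruction}\n\nProblem:\n{problem}")
            question = llm(
                f"Identify the {category} part of this problem and ask a clarifying "
                f"question that resolves it:\n\n{modified}"
            )
            records.append({"category": category, "problem": modified, "question": question})
    return records
```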
  54. Data Synthesis 72

    Ambiguous • Modification: Rewrite to introduce ambiguity by creating multiple valid interpretations or leaving key details unspecified. • Question Gen: Identify points of ambiguity and formulate clarifying questions that resolve them.
    Incomplete • Modification: Rewrite to create incompleteness by omitting key concepts or conditions essential for solving. • Question Gen: Identify points of incompleteness and formulate clarifying questions to resolve them.
    Inconsistent • Modification: Rewrite to introduce inconsistency by incorporating conflicting statements. • Question Gen: Identify points of inconsistency and formulate clarifying questions to provide a consistent interpretation.
  55. 73

  56. HumanEvalComm 75

    Model | Comm. Rate | Good Q. Rate | Pass@1 | Test Pass Rate
    CodeLlama | 4.38% | 3.93% | 16.08% | 34.99%
    DeepSeek Chat | 28.25% | 23.03% | 34.76% | 42.42%
    DeepSeek Coder | 24.12% | 21.29% | 41.95% | 56.86%
    ClarifyCoder (w/ DSCoder) | 57.42% | 47.68% | 35.3% | 47.25%
    ClarifyCoder (w/ DSCoder, loss = answer only) | 63.61% | 51.93% | 29.38% | 40.96%
  57. HumanEval 76 ClarifyCoder substantially improves a model’s ability to ask

    effective clarifying questions (more than doubling baseline models), while maintaining competitive code generation performance for standard coding tasks
  59. Categories • Incompleteness is the easiest • Inconsistency is the most challenging

    • Combining clarification types increases communication rates but decreases code generation performance • ClarifyCoder demonstrates its clarify-awareness in particular for Ambiguity and Inconsistency 78
  60. LLM Judgement vs Manual • Comparison of LLM-based and manual

    evaluation metrics for Comm. Rate and Good Q. Rate from 100 ClarifyCoder responses. 79
  61. Training-Free Approaches? • Few-shot prompting • CoT reasoning

    • These have lower communication competence compared to clarify-aware fine-tuning 80
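For context, a training-free baseline typically uses a few-shot prompt that demonstrates asking a question instead of producing code. The prompt below is a minimal, hypothetical illustration, not the prompts used in the study.

```python
# Minimal few-shot prompt for a training-free clarification baseline.
# The demonstration example is illustrative, not taken from the paper.
FEW_SHOT = """You are a coding assistant. If the problem description is ambiguous,
incomplete, or inconsistent, ask a clarifying question instead of writing code.

Problem: Sort the list.
Response: Should the list be sorted in ascending or descending order?

Problem: {problem}
Response:"""


def clarify_or_code_prompt(problem: str) -> str:
    """Build the prompt sent to the (unmodified) base LLM."""
    return FEW_SHOT.format(problem=problem)
```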
  62. What we did … • Define roles • Identify gaps

    • Adapt LLMs in the agentic workflow 82
  63. Takeaways 83 Agentic workflows depend on the capabilities of the

    underlying LLMs. Sometimes we need to align the LLMs to that goal, reduce unnecessary iterations, and improve results. Define evaluation metrics or introduce new ones. Take care of synthesized data. What else can we do to have improved agentic workflows?
  64. Improving Agentic Workflow Results 84 Specialized models and/or smaller LLMs

    Prompting techniques (CoT, Structured CoT, Tree of Thoughts, Hoar, …) RAG Deployment options (NVIDIA® TensorRT™ for high-performance deep learning inference) Merge models (MergeRepair) Reasoning aspects (FinFactChecker)
  66. Improving Agentic Workflow Results 87 Specialized models and/or smaller LLMs

    Prompting techniques (CoT, Structured CoT, Tree of Thoughts, Hoar, …) RAG Deployment options (NVIDIA® TensorRT™ for high-performance deep learning inference) Merge models (MergeRepair) Reasoning aspects (FinFactChecker)
  67. Retrieval Augmented Generation Abrahamyan, Davit, and Fatemeh H. Fard. "StackRAG

    Agent: Improving Developer Answers with Retrieval-Augmented Generation." In 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 893-897. IEEE, 2024. 88
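StackRAG's full pipeline is described in the cited paper; the sketch below only illustrates the generic retrieve-then-generate pattern behind it. embed(), llm(), and the cosine-similarity retrieval are hypothetical stand-ins, not StackRAG's actual code.

```python
# Generic retrieval-augmented generation sketch (illustrative, not StackRAG).
# embed() returns a fixed-size vector; llm() returns generated text.
import numpy as np


def rag_answer(query: str, corpus: list[str], embed, llm, top_k: int = 3) -> str:
    doc_vecs = np.stack([embed(d) for d in corpus])            # (N, d) document embeddings
    q = embed(query)
    # Cosine similarity between the query and every document.
    scores = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    retrieved = [corpus[i] for i in np.argsort(-scores)[:top_k]]
    context = "\n\n".join(retrieved)
    return llm(
        "Answer the developer question using the context.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```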
  68. Code Generation in R vs. Python Zhao, ZiXiao, and Fatemeh

    H. Fard. "Do Current Language Models Support Code Intelligence for R Programming Language?." TOSEM, 2025. 90
  69. Can We Improve Performance with RAG? Zixiao Zhao and Fatemeh

    Fard. Do Current Language Models Support Code Intelligence for R Programming Language? ACM Transactions on Software Engineering and Methodology, 2025. Amirreza Esmaeili, Iman Saberi Tirani, and Fatemeh Fard. Empirical Studies of Parameter Efficient Methods for Large Language Models of Code and Knowledge Transfer to R. Empirical Software Engineering, 2025. 91
  70. Improving Agentic Workflow Results 93 Specialized models and/or smaller LLMs

    Prompting techniques (CoT, Structured CoT, Tree of Thoughts, Hoar, …) RAG Deployment options (NVIDIA® TensorRT™ for high-performance deep learning inference) Merge models (MergeRepair) Reasoning aspects (FinFactChecker)
  71. MergeRepair 94 Model 1 Model 2 Merged Model Meghdad Dehghan,

    Jie Wu, Fatemeh Fard, and Ali Ouni. MergeRepair: An Exploratory Study on Merging Task-Specific Adapters in Code LLMs for Automated Program Repair.
  72. MergeRepair • Development • Automatic Program Repair (APR) • Test

    & QA • Improvement • Improve APR results • The merged model has generalization capability for APR • With continual merging, we examined the importance of the order of the merged models 95 Model 1 Model 2 Merged Model
  73. Improving Agentic Workflow Results 96 Specialized models and/or smaller LLMs

    Prompting techniques (CoT, Structured CoT, Tree of Thoughts, Hoar, …) RAG Deployment options (NVIDIA® TensorRT™ for high-performance deep learning inference) Merge models (MergeRepair) Reasoning aspects (FinFactChecker)
  74. Relying on LLMs’ Judgement? 97 (Diagram: User query, RAG over

    Documents, LLM, Generated Answer, LLM Judge, Response) Rishab Sharma, Iman Saberi, Elham Alipour, Jie JW Wu and Fatemeh Fard. FISCAL: Financial Synthetic Claim–document Augmented Learning for Efficient Fact-Checking. In NeurIPS 2025 Workshop: Generative AI in Finance, 2025.
  75. Relying on LLMs’ Judgement? • Synthesize data (Doc vs. Claims)

    • Train a model (Causal Language Modeling) • Use this new model as judge 98 (Diagram: User query, RAG over Documents, LLM, Generated Answer, LLM Judge, Response)
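The sketch below only illustrates the judging call pattern described above, where the fine-tuned model checks a generated answer against the retrieved documents. judge_llm() and the label set are hypothetical placeholders, not FISCAL's actual interface.

```python
# Sketch of using a fine-tuned model as the fact-checking judge: the generated
# answer (claim) is checked against the retrieved documents.
def judge_answer(judge_llm, documents: str, answer: str) -> str:
    prompt = (
        "Documents:\n" + documents + "\n\n"
        "Claim:\n" + answer + "\n\n"
        "Is the claim supported by the documents? "
        "Answer SUPPORTED, REFUTED, or NOT ENOUGH INFO."
    )
    return judge_llm(prompt).strip()
```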
  76. 100

  77. Challenges of AI Different Dimensions: • Truthfulness • Safety •

    Consistency • Fairness • Robustness • Controllability • Privacy • Machine ethics • Governance and compliance 101
  78. Final Thoughts! The underlying LLMs play an important role in

    Agentic AI. Adapt the LLMs and use a combination of approaches. Take care of compliance, governance, privacy, and security. Evaluate the need for an agentic system (e.g., Agentless). 103