

Multi-Agent Large Language Models for Code Intelligence: Opportunities, Challenges, and Research Directions

Large Language Models (LLMs) have demonstrated impressive capabilities in tasks such as code generation, completion, and repair. A recent trend involves designing agentic systems, where multiple LLM-powered agents collaborate to accomplish software engineering goals. These multi-agent setups have achieved state-of-the-art performance in several tasks, but their practical adoption faces significant challenges. In this tutorial, we will begin by introducing the foundations of agentic systems and their applications in software engineering. We will then broaden the discussion to cover critical aspects of code intelligence using LLMs and agentic systems that must be addressed for real-world deployment, including code comprehension, effective communication, security concerns, explainability, reasoning capabilities, and computational efficiency. Through this lens, we will analyze the current limitations of LLMs and LLM-based agents, highlighting the gaps and how to adapt the models to our requirements. We will conclude with several directions and considerations for using agentic systems.


Fatemeh Fard (UBC)

November 18, 2025


Transcript

  1. MULTI-AGENT LARGE LANGUAGE MODELS FOR CODE INTELLIGENCE: OPPORTUNITIES, CHALLENGES, AND

    RESEARCH DIRECTIONS. FATEMEH HENDIJANI FARD, UNIVERSITY OF BRITISH COLUMBIA
  2. • AI and SE • Knowledge transfer to new languages

    and domains with limited data • Computational efficiency of models • Explainability and reasoning (AI Safety) • Agent communications and alignment • SE applications • Code generation • Comment generation and documentation • Code clone detection • Code search • Code reviews • Vulnerability detection and repair • Financial domain 2
  3. Code Intelligence • Code generation • Program repair • Program

    comprehension • Code review • Vulnerability detection and repair • … 3
  4. AI4SE Evolution: Machine learning and statistical methods → Deep learning

    models → Language models, LLMs, FMs → AI Agents → AIware, Agentware, Mindware 4
  5. Agenda • What are agents and multi-agent systems? • The

    state of MAS in SE • Challenges • Opportunities 6 The introduction part is mainly from these resources: Liu, Junwei, Kaixin Wang, Yixuan Chen, Xin Peng, Zhenpeng Chen, Lingming Zhang, and Yiling Lou. "Large language model-based agents for software engineering: A survey." arXiv preprint arXiv:2409.02977 (2024). Wang, Yanlin, Wanjun Zhong, Yanxian Huang, Ensheng Shi, Min Yang, Jiachi Chen, Hui Li, Yuchi Ma, Qianxiang Wang, and Zibin Zheng. "Agents in software engineering: Survey, landscape, and vision." Automated Software Engineering 32, no. 2 (2025): 1-36. He, Junda, Christoph Treude, and David Lo. "LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision, and the Road Ahead." ACM Transactions on Software Engineering and Methodology 34.5 (2025): 1-30.
  6. 8 Picture from: Liu, Junwei, Kaixin Wang, Yixuan Chen, Xin

    Peng, Zhenpeng Chen, Lingming Zhang, and Yiling Lou. "Large language model-based agents for software engineering: A survey." arXiv preprint arXiv:2409.02977 (2024).
  7. 9 Picture from: Wang, Yanlin, Wanjun Zhong, Yanxian Huang, Ensheng

    Shi, Min Yang, Jiachi Chen, Hui Li, Yuchi Ma, Qianxiang Wang, and Zibin Zheng. "Agents in software engineering: Survey, landscape, and vision." Automated Software Engineering 32, no. 2 (2025): 1-36.
  8. Perception • Perception: connect to external environment, sense, interpret, and

    understand inputs. • Input modalities: • Textual: tokens, trees, graphs, hybrid • Visual: UML designs, UI in mobile apps • Audio or sensors 10
  9. Memory • Storage for information on historical data, actions, thoughts,

    current state (environment), external information. • Used for revisiting and utilizing previous records for deciding on new actions and refinements. • Semantic memory: world knowledge through RAG, APIs, libraries, … • Episodic memory: records of the current case and experience from previous decisions, historical interactions, past reasoning, … • Procedural memory: implicit knowledge (stored parameters of LLMs), explicit knowledge (how we code agents to perform actions) 11
  10. Memory • Short term (working memory): used for trajectories for

    the current task • Dialog • Action-observation-critique records • Intermediate output (e.g., summaries) • Long term: valuable historical experiences • Distilled trajectories (e.g., summaries) • Filtered information 12 Operations: Read/write Formats: NL, PL, key-value pairs, embeddings, trees
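A minimal sketch of how the short-term / long-term memory split described above could be represented in code. This is illustrative only, not from the tutorial or any specific framework; the class and method names (AgentMemory, write_step, read_context, distill) are hypothetical.

```python
# Minimal sketch of an agent memory store, assuming the short-term / long-term
# split described on the slides. All names are illustrative placeholders.
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    # Short-term (working) memory: trajectory of the current task.
    working: list[dict] = field(default_factory=list)
    # Long-term memory: distilled/filtered records from previous tasks.
    long_term: list[str] = field(default_factory=list)

    def write_step(self, action: str, observation: str, critique: str = "") -> None:
        """Record one action-observation-critique step for the current task."""
        self.working.append(
            {"action": action, "observation": observation, "critique": critique}
        )

    def read_context(self, k: int = 5) -> str:
        """Return long-term notes plus the last k working-memory steps as prompt context."""
        recent = self.working[-k:]
        lines = [f"{s['action']} -> {s['observation']}" for s in recent]
        return "\n".join(self.long_term + lines)

    def distill(self, summary: str) -> None:
        """Promote a summary of the finished trajectory into long-term memory."""
        self.long_term.append(summary)
        self.working.clear()
```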
  11. Action • Executions of the agents • Internal actions: reasoning,

    planning, retrieval, learning (updating memory) • External actions: communication with humans or other agents, interaction with the external environment (e.g., search, compilers, external tools) • Searching tools, file operation, GUI operation, static program analysis, dynamic analysis, testing tools, fault localization, version control tools 13
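To connect internal actions (reasoning, memory updates) with external actions (tool calls), an agent typically runs a decide-act-observe loop. The sketch below reuses the hypothetical memory interface sketched above; llm_decide and the tools dictionary are placeholders, not part of any framework mentioned in the slides.

```python
# Illustrative agent action loop: the LLM proposes an action, an external tool
# executes it, and the observation is written back to memory.
def run_agent(task: str, memory, llm_decide, tools: dict, max_steps: int = 10):
    for _ in range(max_steps):
        context = memory.read_context()
        # llm_decide is assumed to return a dict like
        # {"tool": "run_tests", "args": {...}, "done": False} or {"done": True, "answer": ...}
        decision = llm_decide(task=task, context=context)
        if decision.get("done"):
            return decision.get("answer")
        tool = tools[decision["tool"]]              # e.g., "search", "compile", "run_tests"
        observation = tool(**decision["args"])      # external action
        memory.write_step(decision["tool"], str(observation))  # internal action: update memory
    return None
```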
  12. Human-Agent Collaboration 15 Planning, Requirements, Development, Evaluation; modify

    workflows; clarify ambiguities; guide agents to overcome failures; evaluate results (output correctness and alignment)
  13. Agents in SE • Software development • Requirements engineering: elicitation,

    modeling, negotiation, verification, evolution • Code generation: Plan and refine iteratively • feedback (self- or peer-), tool, human • Testing: generate and refine • Feedback from compilation, test execution, mutation testing, static analysis • Software maintenance • Fault localization with agent roles: code reviewer, architect, controller, expert • Tools: static analysis, retrieval, navigation, dynamic info collection, • Program repair: patch generation, validation, feedback analysis • Compilation, execution, checking • Single tasks and end-to-end pipelines for software development or maintenance 17
  14. Challenges • Hallucinations in LLMs • Dependency on prompts •

    Computational costs • Inference time and number of iterations • Context window size • Handling new versions of LLMs • Security and privacy aspects • Deployment and AI-Ops 18
  15. Research Opportunities 20 (1) Define agent roles mapped to SE roles

    or beyond; (2) identify their capabilities and gaps; (3) develop methods to fill those capability gaps. He, Junda, Christoph Treude, and David Lo. "LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision, and the Road Ahead." ACM Transactions on Software Engineering and Methodology 34.5 (2025): 1-30.
  16. Benchmarks and Metrics 23

    Category | Metrics
    Execution Validation | Pass rate, Pass@K, Executability, #Errors
    Similarity | sketchBLEU, Cosine distance
    Costs | Running time, Token usage, Expenses, #Sprints
    Manual Efforts | Human revision costs
    Generated Code Scale | Lines of code, Code files, Completeness
    Table from: Liu, Junwei, Kaixin Wang, Yixuan Chen, Xin Peng, Zhenpeng Chen, Lingming Zhang, and Yiling Lou. "Large language model-based agents for software engineering: A survey." arXiv preprint arXiv:2409.02977 (2024).
  17. Benchmarks and Metrics Challenges: • Hallucinations in LLMs • Dependence

    on prompts • Computational costs • Inference time and number of iterations • Context window size • Handling new versions of LLMs • Security and privacy aspects 24 (metrics table repeated from the previous slide: Execution Validation, Similarity, Costs, Manual Efforts, Generated Code Scale)
  18. Hidden Costs • LLMs used • Configs • Prompting details

    and their effects • Prompting details and how they are affected by the distributions • Automatic Program Repair example: DeepSeek-V3 (with over 600B total and 30B+ active parameters) reaches 100% Pass@1 on Ruby, while DeepSeek-Coder-6.7B-Instruct reduces performance to 61% 25
  19. LLM for Coding / Software Engineering tasks Software → AIware

    (AI-Powered Software) • Code generated by LLMs will appear in different software and applications 28 https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/
  20. 29

  21. Code Benchmarks 32 Code generation benchmarks HumanEval, MBPP, HumanEval+, EvalPlus

    Code comprehension benchmarks CRUXEval, REval, SpecEval Code repair benchmarks Single-turn bug fixing (e.g., CodeXGLUE code repair), FeedbackEval
  22. Code Reasoning and Real-world Benchmarks 33 Contest problems: CodeContests, APPS

    Algorithmic problems: TACO Software Engineering benchmarks: SWE-bench, SWE-bench Verified, SWE-bench+, SWE-Gym, RepoBench, RepoExec, RepoEval/RepoCoder, CrossCodeEval, EvoCodeBench, DevEval, FullStack Bench, ComplexCodeEval, BioCoder, BigCodeBench, ALE-Bench, SWE-Bench Live, τ-Bench
  23. Evaluation Metrics • Pass@K • Recall@K • Accuracy@K • Exact

    Match • Repair@K • CodeBLEU • Resolution Rate • Other similarity-based scores 34
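Of the metrics listed above, Pass@K is usually computed with the unbiased estimator from the Codex paper (Chen et al., 2021): draw n samples per problem, count the c that pass the tests, and estimate pass@k = 1 - C(n-c, k)/C(n, k), averaged over problems. A minimal sketch:

```python
# Unbiased pass@k estimator (Chen et al., 2021): n samples per problem,
# c of them pass the unit tests.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # every size-k subset contains at least one passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 200 samples for one problem, 31 pass, k = 1
print(pass_at_k(200, 31, 1))  # 0.155
```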
  24. 36

  25. Evaluation Metrics Not Considering … • They evaluate functional correctness •

    But not other aspects rigorously • Security • Optimization • Readability • Performance • Maintainability • … 37
  26. Current Limitations • Context and Scale • Multi-Step Reasoning & Planning • Human-in-the-Loop

    Interaction • Robustness and Generalization • Evaluation Beyond “Pass/Fail” • Integration and Long-Term Tasks • Holistic contextual understanding • Only for the most popular programming languages 38
  27. Good News … EffiBench (NeurIPS 2024) Investigating Software Aging in

    LLM-Generated Software Systems (Oct 2025) SecurityEval dataset (2022) CYBERSECEVAL 1 (2023), 2 (2024), 3 (2024) CWEval (Jan 2025)
  28. 40

  29. 41

  30. 42

  31. 43

  32. 44

  33. Research opportunities • Define agent roles • Identify LLM gaps

    • Develop methods to fill those gaps 46 Jie Wu and Fatemeh Fard. HumanEvalComm: Benchmarking the Communication Competence of Code Generation for LLMs and LLM Agent. ACM Transactions on Software Engineering and Methodology, 2025. Jie JW Wu, Manav Chaudhary, Davit Abrahamyan, Arhaan Khaku, Anjiang Wei, and Fatemeh Fard. ClarifyCoder: Clarification-Aware Fine-Tuning for Programmatic Problem Solving.
  34. Truthfulness of Code LLM Systems Thinking Iceberg Problem (Visible Events):

    Software engineers often ask clarifying questions to reduce ambiguity in both requirements and coding solutions, but LLMs do not do so for coding tasks. Vision (Invisible Structure): For coding tasks, an AI system should proactively recognize which information is missing, and find these missing pieces to be able to complete the task with high quality. 49 Wu, Jie JW, and Fatemeh H. Fard. "Benchmarking the communication competence of code generation for LLMs and LLM agent." TOSEM, 2025.
  35. Evaluating Communication Skills of Code LLM • Communication skills of

    a model = “being able to ask clarifying questions when the description of the code generation problem has issues”. • HumanEvalComm: modified problem descriptions according to three issues: • Inconsistency • Ambiguity • Incompleteness To develop HumanEvalComm, we changed each problem description in HumanEval manually, using a taxonomy of clarification types. 50
  36. 53

  37. Evaluation Flowchart and Metrics 54 1st Round: • Communication Rate:

    The percentage of model responses that ask clarifying questions. • Good Question Rate: The percentage of model responses asking Good questions (quality label=3). 2nd Round: • Pass@k • Test Pass Rate
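As a rough sketch of the first-round metrics (not the benchmark's released code), both rates are simple proportions over judged responses; the dictionary field names below are hypothetical.

```python
# Sketch of the first-round HumanEvalComm metrics, assuming each judged response
# is a dict like {"asks_question": bool, "quality": int}, where quality is the
# evaluator's 1-3 label (3 = Good). Returns fractions rather than percentages.
def communication_rate(responses: list[dict]) -> float:
    """Share of model responses that ask clarifying questions."""
    return sum(r["asks_question"] for r in responses) / len(responses)


def good_question_rate(responses: list[dict]) -> float:
    """Share of model responses whose clarifying question is rated Good (label = 3)."""
    return sum(r["asks_question"] and r["quality"] == 3 for r in responses) / len(responses)
```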
  38. LLM Agent Approach (Okanagan) We propose Okanagan, which leverages a multi-round

    structure and a customized prompt format for asking clarifying questions in code generation tasks. We introduce three rounds in Okanagan (a rough sketch follows below). 55
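The exact design of Okanagan's three rounds is defined in the paper; the sketch below is only one plausible shape of such a multi-round clarifying workflow, with llm() and the prompts as hypothetical placeholders.

```python
# Hedged sketch of a multi-round clarifying workflow in the spirit of Okanagan.
# llm() and answer_questions() are hypothetical stand-ins, not the paper's code.
def clarify_then_code(problem: str, llm, answer_questions) -> str:
    # Round 1: decide whether the problem needs clarification; ask if so.
    reply = llm(
        "If this problem is ambiguous, incomplete, or inconsistent, ask clarifying "
        "questions; otherwise answer 'NO QUESTIONS'.\n\n" + problem
    )
    if "NO QUESTIONS" in reply:
        return llm(f"Write a solution for:\n{problem}")  # no clarification needed
    # Round 2: obtain answers (from a user, or an evaluator LLM in the benchmark).
    answers = answer_questions(reply)
    # Round 3: generate code using the clarified problem description.
    return llm(f"Problem:\n{problem}\n\nClarifications:\n{answers}\n\nWrite the solution.")
```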
  39. How do Code LLMs perform in communication skills? Key takeaway:

    Over 60% of responses from Code LLMs still generate code rather than ask questions when the problem descriptions are manually modified according to different clarification categories. Pass@1 and Test Pass Rate of Code LLMs drop by 35% ∼ 52% and by 17% ∼ 35% respectively, with statistical significance in each category for over 75% of the numbers. 56
  43. Compare different categories? The Incompleteness category has a higher Communication Rate and

    Good Question Rate, but lower Pass@1 and Test Pass Rate than the Ambiguity and Inconsistency categories. A combination of two clarification types leads to much lower Test Pass Rates than one clarification type. 60 (Chart categories: Ambiguity, Inconsistency, Incompleteness)
  44. How does the LLM agent (Okanagan) perform? 61 Key Takeaway: Okanagan

    effectively increases Communication Rate and Good Question Rate by an absolute 58% and 38%, and thus boosts Pass@1 and Test Pass Rate by an absolute 8% and 7%. However, Okanagan still tends to ask questions that appear to be unnecessary for original problems that do not need them.
  45. How reliable is the LLM-based evaluator? We recruited 6 students at

    UBC to manually assess the results of the LLM-based evaluator and mark the quality of the models’ responses: Takeaway: The LLM-based evaluator provides acceptable answers to models’ responses, with a higher than 50% Acceptable Answer Rate for all models and Good Answer Rate for most models. False Recovery Rates need to be further reduced for some models to ensure the reliability of Test Pass Rates and Pass@1. 62 • Question or Not • Question Quality • Answer Quality
  46. How reliable is the LLM-based evaluator? • The Communication Rate aligns

    reasonably well with the manual evaluation • Higher than 50% acceptable answers for all models • But, for the Good Question Rate, the LLM-based evaluator tends to mark more “Good” questions than it should. Takeaway: metrics and answers from the LLM-based evaluator can provide useful insights to guide our experiments, but they may not perfectly align with human judgment. 63
  47. Gap between Code LLM and humans? While LLMs begin to

    ask more clarifying questions after 50% of the description is removed, even with 90% removed, only 54% of model responses prompted questions. Indication: Code Models should increase the ability to ask clarifying questions for coding tasks​, but how? • LLM-based Agent? (Okanagan) • Go deeper at the model level! ◦ Alignment (Fine-tuning) 64
  48. Ambiguous implementations for the problem statement “Return list with elements

    incremented by a number”. Variants demonstrate alternative interpretations of the problem. 66
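The figure on this slide shows the study's concrete variants; as a generic illustration only (not the slide's exact code), two plausible readings of the same statement could be:

```python
# Two plausible readings of "return list with elements incremented by a number"
# (illustrative only; the slide's own variants may differ).
def incremented_by_given(xs: list[int], n: int) -> list[int]:
    """Reading 1: 'a number' is an explicit argument n added to every element."""
    return [x + n for x in xs]


def incremented_by_one(xs: list[int]) -> list[int]:
    """Reading 2: 'a number' is left unspecified and defaults to 1."""
    return [x + 1 for x in xs]


assert incremented_by_given([1, 2, 3], 2) == [3, 4, 5]
assert incremented_by_one([1, 2, 3]) == [2, 3, 4]
```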
  49. Asking Clarifying Questions? • LLMs generate code outputs in over

    63% of ambiguous scenarios without seeking necessary clarifications on the HumanEvalComm benchmark • LLMs asking clarifying questions: • ClarifyGPT • TICODER • ClariGen • Okanagan 67
  50. Asking Clarifying Questions All the Time? 68 User interaction with

    clarifying questions is enforced even for standard coding tasks that could be completed without clarifications. ClarifyGPT, TICODER, and ClariGen explicitly require the user interaction module. Okanagan asks clarifying questions only when necessary, but still asks some unnecessary questions, affecting Pass@1. LLM-based workflows are more complicated and costly than a single LLM: ClariGen and Okanagan need 3 calls to LLMs; ClarifyGPT needs 4 calls. We need a balance between avoiding unnecessary questions and asking the truly needed ones.
  51. 69

  52. 70

  53. Data Synthesis Process • Synthesize data: • Modify problems into

    the three categories • Ask the model to generate a question to clarify the problem description • Using APPS data, ~30K new data points are generated (a sketch of this loop follows below) 71
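A rough sketch of the synthesis loop described above, with the category instructions paraphrasing the three modification types detailed on the next slide. llm() is a hypothetical placeholder, and the prompts are illustrative rather than the ones used to build the ~30K ClarifyCoder examples.

```python
# Sketch of the data synthesis loop: each APPS problem is rewritten into the
# three clarification categories, then the model produces the clarifying
# question that would resolve the introduced issue. llm() is a placeholder.
CATEGORIES = {
    "ambiguous": "Rewrite the problem so it admits multiple valid interpretations.",
    "incomplete": "Rewrite the problem so a key condition needed to solve it is omitted.",
    "inconsistent": "Rewrite the problem so it contains conflicting statements.",
}


def synthesize(problems: list[str], llm) -> list[dict]:
    records = []
    for problem in problems:
        for category, instruction in CATEGORIES.items():
            modified = llm(f"{instruction}\n\nProblem:\n{problem}")
            question = llm(
                f"Identify the {category} part of this problem and ask a clarifying "
                f"question that resolves it:\n\n{modified}"
            )
            records.append({"category": category, "problem": modified, "question": question})
    return records
```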
  54. Data Synthesis 72

    Ambiguous • Modification: Rewrite to introduce ambiguity by creating multiple valid interpretations or leaving key details unspecified. • Question Gen: Identify points of ambiguity and formulate clarifying questions that resolve them.
    Incomplete • Modification: Rewrite to create incompleteness by omitting key concepts or conditions essential for solving. • Question Gen: Identify points of incompleteness and formulate clarifying questions to resolve them.
    Inconsistent • Modification: Rewrite to introduce inconsistency by incorporating conflicting statements. • Question Gen: Identify points of inconsistency and formulate clarifying questions to provide a consistent interpretation.
  55. 73

  56. HumanEvalComm 75

    Model | Comm. Rate | Good Q. Rate | Pass@1 | Test Pass Rate
    CodeLlama | 4.38% | 3.93% | 16.08% | 34.99%
    DeepSeek Chat | 28.25% | 23.03% | 34.76% | 42.42%
    DeepSeek Coder | 24.12% | 21.29% | 41.95% | 56.86%
    ClarifyCoder (w/ DSCoder) | 57.42% | 47.68% | 35.3% | 47.25%
    ClarifyCoder (w/ DSCoder, loss = answer only) | 63.61% | 51.93% | 29.38% | 40.96%
  57. HumanEval 76 ClarifyCoder substantially improves a model’s ability to ask

    effective clarifying questions (more than doubling baseline models), while maintaining competitive code generation performance for standard coding tasks
  59. Categories • Incompleteness is the easiest • Inconsistency is the most challenging

    • Combining clarification types increases communication rates but decreases code generation performance • ClarifyCoder demonstrates its clarify-awareness in particular for Ambiguity and Inconsistency 78
  60. LLM Judgement vs Manual • Comparison of LLM-based and manual

    evaluation metrics for Comm. Rate and Good Q. Rate from 100 ClarifyCoder responses. 79
  61. Training-Free Approaches? • Few-shot prompting • CoT reasoning

    • These have lower communication competence compared to clarify-aware fine-tuning 80
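For context, a training-free baseline typically uses a few-shot prompt that demonstrates asking a question instead of producing code. The prompt below is a minimal, hypothetical illustration, not the prompts used in the study.

```python
# Minimal few-shot prompt for a training-free clarification baseline.
# The demonstration example is illustrative, not taken from the paper.
FEW_SHOT = """You are a coding assistant. If the problem description is ambiguous,
incomplete, or inconsistent, ask a clarifying question instead of writing code.

Problem: Sort the list.
Response: Should the list be sorted in ascending or descending order?

Problem: {problem}
Response:"""


def clarify_or_code_prompt(problem: str) -> str:
    """Build the prompt sent to the (unmodified) base LLM."""
    return FEW_SHOT.format(problem=problem)
```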
  62. What we did … • Define roles • Identify gaps

    • Adapt LLMs in the agentic workflow 82
  63. Takeaways 83 Agentic workflows depend on the capabilities of the

    underlying LLMs. Sometimes we need to align the LLMs to that goal, reduce unnecessary iterations, and improve results. Define evaluation metrics or introduce new ones. Take care of synthesized data. What else can we do to have improved agentic workflows?
  64. Improving Agentic Workflow Results 84 Specialized models and/or smaller LLMs

    Prompting techniques (CoT, Structured CoT, Tree of Thoughts, Hoar, …) RAG Deployment options (NVIDIA® TensorRT™ for high-performance deep learning inference) Merge models (MergeRepair) Reasoning aspects (FinFactChecker)
  66. Improving Agentic Workflow Results 87 Specialized models and/or smaller LLMs

    Prompting techniques (CoT, Structured CoT, Tree of Thoughts, Hoar, …) RAG Deployment options (NVIDIA® TensorRT™ for high-performance deep learning inference) Merge models (MergeRepair) Reasoning aspects (FinFactChecker)
  67. Retrieval Augmented Generation Abrahamyan, Davit, and Fatemeh H. Fard. "StackRAG

    Agent: Improving Developer Answers with Retrieval-Augmented Generation." In 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 893-897. IEEE, 2024. 88
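StackRAG's full pipeline is described in the cited paper; the sketch below only illustrates the generic retrieve-then-generate pattern behind it. embed(), llm(), and the cosine-similarity retrieval are hypothetical stand-ins, not StackRAG's actual code.

```python
# Generic retrieval-augmented generation sketch (illustrative, not StackRAG).
# embed() returns a fixed-size vector; llm() returns generated text.
import numpy as np


def rag_answer(query: str, corpus: list[str], embed, llm, top_k: int = 3) -> str:
    doc_vecs = np.stack([embed(d) for d in corpus])            # (N, d) document embeddings
    q = embed(query)
    # Cosine similarity between the query and every document.
    scores = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    retrieved = [corpus[i] for i in np.argsort(-scores)[:top_k]]
    context = "\n\n".join(retrieved)
    return llm(
        "Answer the developer question using the context.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```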
  68. Code Generation in R vs. Python Zhao, ZiXiao, and Fatemeh

    H. Fard. "Do Current Language Models Support Code Intelligence for R Programming Language?." TOSEM, 2025. 90
  69. Can We Improve Performance with RAG? Zixiao Zhao and Fatemeh

    Fard. Do Current Language Models Support Code Intelligence for R Programming Language? ACM Transactions on Software Engineering and Methodology, 2025. Amirreza Esmaeili, Iman Saberi Tirani, and Fatemeh Fard. Empirical Studies of Parameter Efficient Methods for Large Language Models of Code and Knowledge Transfer to R. Empirical Software Engineering, 2025. 91
  70. Improving Agentic Workflow Results 93 Specialized models and/or smaller LLMs

    Prompting techniques (CoT, Structured CoT, Tree of Thoughts, Hoar, …) RAG Deployment options (NVIDIA® TensorRT™ for high-performance deep learning inference) Merge models (MergeRepair) Reasoning aspects (FinFactChecker)
  71. MergeRepair 94 Model 1 Model 2 Merged Model Meghdad Dehghan,

    Jie Wu, Fatemeh Fard, and Ali Ouni. MergeRepair: An Exploratory Study on Merging Task-Specific Adapters in Code LLMs for Automated Program Repair.
  72. MergeRepair • Development • Automatic Program Repair (APR) • Test

    & QA • Improvement • Improve APR results • The merged model has generalization capability for APR • With continual merging, we examined the importance of the order of the merged models 95 Model 1 Model 2 Merged Model
  73. Improving Agentic Workflow Results 96 Specialized models and/or smaller LLMs

    Prompting techniques (CoT, Structured CoT, Tree of Thoughts, Hoar, …) RAG Deployment options (NVIDIA® TensorRT™ for high-performance deep learning inference) Merge models (MergeRepair) Reasoning aspects (FinFactChecker)
  74. Relying on LLMs’ Judgement? 97 (Diagram: User query, RAG over

    Documents, LLM, Generated Answer, LLM Judge, Response) Rishab Sharma, Iman Saberi, Elham Alipour, Jie JW Wu and Fatemeh Fard. FISCAL: Financial Synthetic Claim–document Augmented Learning for Efficient Fact-Checking. In NeurIPS 2025 Workshop: Generative AI in Finance, 2025.
  75. Relying on LLMs’ Judgement? • Synthesize data (Doc vs. Claims)

    • Train a model (Causal Language Modeling) • Use this new model as judge 98 (Diagram: User query, RAG over Documents, LLM, Generated Answer, LLM Judge, Response)
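The sketch below only illustrates the judging call pattern described above, where the fine-tuned model checks a generated answer against the retrieved documents. judge_llm() and the label set are hypothetical placeholders, not FISCAL's actual interface.

```python
# Sketch of using a fine-tuned model as the fact-checking judge: the generated
# answer (claim) is checked against the retrieved documents.
def judge_answer(judge_llm, documents: str, answer: str) -> str:
    prompt = (
        "Documents:\n" + documents + "\n\n"
        "Claim:\n" + answer + "\n\n"
        "Is the claim supported by the documents? "
        "Answer SUPPORTED, REFUTED, or NOT ENOUGH INFO."
    )
    return judge_llm(prompt).strip()
```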
  76. 100

  77. Challenges of AI Different Dimensions: • Truthfulness • Safety •

    Consistency • Fairness • Robustness • Controllability • Privacy • Machine ethics • Governance and compliance 101
  78. Final Thoughts! The underlying LLMs play an important role in

    Agentic AI. Adapt the LLMs and use a combination of approaches. Take care of compliance, governance, privacy, and security. Evaluate the need for an agentic system (e.g., Agentless). 103