[mercari GEARS 2025] Techniques for Reliable Code Generation Using AI Agents

mercari

November 14, 2025

Transcript

1. Lukas Appelhans
Software Engineer, AI Task Force
• Background in C++ & Android
• Started working at Mercari in 2022
• Worked on the Design System & Client Architecture teams

2. How I started working with AI
1. Migrating the old Design System to the new Design System
◦ Before agents got popular
◦ Plain calls to OpenAI’s Chat API
2. Accelerating feature development
◦ Spec-driven development

3. What these projects have in common
• Reusable agentic workflows
• AI agents need a lot of information, for example
◦ How to migrate each Design System component
◦ Architectural guidelines
[Diagram: Prompt → Specialised AI Agent ↔ Environment → Result]

4. Can’t AI solve this already?
• Isn’t it enough to fill CLAUDE.md with all necessary information & ask Claude Code to do the job?
• Typical results (if the task is large enough)
◦ Code that doesn’t work
◦ No tests are written
◦ Project guidelines are not followed
◦ Low reliability (low success rate)

5. Motivation of this talk
We need agentic workflows that generate working, well-tested code that follows guidelines and is ready to be merged.

6. Outline of this talk
• Getting started: initial context & prompt
• Runtime: using tools to steer the agent
• Testing: automatically evaluating the output

7. Limitations of Coding Agents
• Coding agents don’t perform well when given too much information
◦ The amount of information they can use effectively is much less than what fits into the context window

8. Initial decision
Decide whether the problem can be solved in a single prompt. If not, think about how it can be decomposed into subtasks.

9. Deciding the problem size
• Decomposing the problem
◦ Reduces the information necessary to solve each task
• Each AI agent should only work in a relatively limited scope
[Diagram: Coordinator → Task 1 Agent, Task 2 Agent]

10. AI agents that work on limited tasks
Two effects:
1. Less context is accumulated per agent
2. Each later agent invocation can run a new prompt (= more control), as sketched below
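
A minimal sketch of this pattern in Python, assuming Claude Code’s print mode (claude -p, mentioned later in this talk) as the per-task agent; the subtask prompts and MIGRATION.md are illustrative:

import subprocess

# Illustrative subtasks; in practice a coordinator (a script or a parent
# agent) derives these from the overall problem.
subtasks = [
    "Migrate the Button component to the new Design System (see MIGRATION.md).",
    "Migrate the Checkbox component to the new Design System (see MIGRATION.md).",
]

for prompt in subtasks:
    # Each invocation starts a fresh agent: no context accumulates across
    # tasks, and every task gets its own tailored prompt.
    result = subprocess.run(
        ["claude", "-p", prompt],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)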

11. Deciding the problem size
• Rule of thumb: ~200 lines of instructions per agent
◦ Depends on information density & how concrete the instructions are
◦ Probably also depends on the model that is used
[Diagram: Coordinator → Task 1 Agent, Task 2 Agent]

12. Splitting a problem into smaller tasks
• The strategy depends on the problem to be solved
• Examples
◦ Library migrations: divide the API surface into batches (sketched below)
◦ Product development: split by architectural component (module setup, screen, logic, logging)
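
A small sketch of the migration case, with a hypothetical list of deprecated components and an arbitrary batch size:

# Hypothetical deprecated API surface; the batch size is arbitrary and should
# be tuned so each batch fits the agent's instruction budget.
deprecated_components = ["OldButton", "OldCheckbox", "OldDialog", "OldToast", "OldTabBar"]
BATCH_SIZE = 2

batches = [
    deprecated_components[i:i + BATCH_SIZE]
    for i in range(0, len(deprecated_components), BATCH_SIZE)
]

prompts = [
    "Migrate the following components to the new Design System: "
    + ", ".join(batch)
    + ". Follow the per-component instructions in MIGRATION.md."
    for batch in batches
]
# Each prompt is then handed to a fresh agent invocation, as in the
# coordinator sketch above.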

13. Tools
• AI agents call tools while they are running to interact with the environment
◦ Reviewing tool calls is often useful to understand what the agent is doing & why
• Think about which tools are necessary to solve the task at hand (a minimal tool-use loop is sketched below)
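
A minimal tool-use loop, assuming the Anthropic Python SDK; the single read_file tool, its schema, and the model name are illustrative. Printing each call keeps the agent’s behaviour reviewable:

import json
import anthropic

client = anthropic.Anthropic()

# Illustrative single tool; real workflows would expose exactly the tools
# the task needs and nothing more.
TOOLS = [{
    "name": "read_file",
    "description": "Read a file from the repository.",
    "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}]

def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

messages = [{"role": "user", "content": "Summarise what src/main.py does."}]
while True:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model choice
        max_tokens=1024, tools=TOOLS, messages=messages,
    )
    if response.stop_reason != "tool_use":
        break
    messages.append({"role": "assistant", "content": response.content})
    results = []
    for block in response.content:
        if block.type == "tool_use":
            # Log tool calls to see what the agent is doing and why.
            print(f"tool call: {block.name}({json.dumps(block.input)})")
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": read_file(**block.input),
            })
    messages.append({"role": "user", "content": results})

print("".join(b.text for b in response.content if b.type == "text"))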

14. Limiting which tools to use
• Tools provided to the agent are part of the context
• Limiting the number of provided tools can
◦ Reduce the amount of irrelevant context
◦ Guide the agent to use the correct tools
[Diagram: the initial prompt sent to the LLM consists of the system prompt, the user prompt, and the tool definitions (Tool 1, Tool 2, Tool 3)]

15. Limiting which tools to use: Example
• Example: code review
◦ Limit access to viewing the diff of a specific pull request
◦ Prevent the agent from using tools to fetch irrelevant information (see the sketch below)
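
One way to enforce this, assuming Claude Code’s --allowedTools flag and its tool-permission syntax; the PR number, the prompt, and the gh command are hypothetical:

import subprocess

# Restrict the review agent to reading files and fetching one specific PR's
# diff via the GitHub CLI; the exact flag and permission syntax may differ
# between Claude Code versions.
pr = 123  # hypothetical pull request number
result = subprocess.run(
    [
        "claude", "-p", f"Review the diff of pull request #{pr}.",
        "--allowedTools", "Read", "Grep", f"Bash(gh pr diff {pr})",
    ],
    capture_output=True, text=True, check=True,
)
print(result.stdout)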

16. Self-correction mechanisms
• Self-correction mechanisms help AI agents fix hallucinations & other errors, for example
◦ Building
◦ Running test cases
◦ Linting
◦ Code review (using another sub-agent)
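
A sketch of such a feedback loop, assuming Gradle as the build tool and claude -p as the agent; the commands, prompts, and retry limit are illustrative:

import subprocess

def run_checks() -> str:
    """Run build, tests and lint; return the combined error output ("" if clean)."""
    errors = []
    for cmd in (["./gradlew", "build"], ["./gradlew", "test"], ["./gradlew", "lint"]):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode != 0:
            errors.append(proc.stderr)
    return "\n".join(errors)

MAX_ATTEMPTS = 3
prompt = "Implement the feature described in SPEC.md."
for _ in range(MAX_ATTEMPTS):
    subprocess.run(["claude", "-p", prompt], check=True)
    feedback = run_checks()
    if not feedback:
        break  # build, tests and lint all pass
    # Feed the errors back so a fresh agent invocation can correct them.
    prompt = f"The previous change fails our checks. Fix these errors:\n{feedback}"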

17. Self-correction mechanisms
• Feedback needs to be accurate
◦ Avoid misleading error messages
◦ Code review must not produce false positives
• Feedback should be concise
◦ E.g. build tools often emit far more output than the agent needs
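
One way to keep build feedback concise is to strip the output down to its error lines before handing it back to the agent; the "error:" marker is compiler-specific and illustrative:

def concise_feedback(build_output: str, limit: int = 50) -> str:
    """Keep only error lines so the agent isn't flooded with irrelevant logs."""
    errors = [line for line in build_output.splitlines() if "error:" in line.lower()]
    return "\n".join(errors[:limit])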

18. Testing
• Agentic systems are complex & non-deterministic
• Automated tests can help to
◦ Understand reliability & correctness
◦ Find regressions (when updating any part of the system)

19. Evaluations
• Evaluations: automatic tests for agentic systems
◦ Test the system’s performance on a specific problem
◦ E.g. for a given prompt, automatically judge whether the output is correct
[Diagram: Prompt → Specialised AI Agent → Result]

20. Evaluations: Check result
• How to check the result of a test case with a non-deterministic output?
a. Unit tests (often used in benchmarks)
b. Score outputs using another LLM (LLM-as-a-judge)
[Diagram: Result → LLM-as-a-judge → Score; fail if score < threshold]
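
A minimal LLM-as-a-judge sketch, assuming the OpenAI Python SDK; the rubric, model name, and threshold are illustrative:

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Rate from 0.0 to 1.0 how well the generated code satisfies
the task description. Answer with a single number only.

Task: {task}

Generated code:
{code}
"""

def judge(task: str, code: str, threshold: float = 0.8) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(task=task, code=code)}],
    )
    score = float(response.choices[0].message.content.strip())
    return score >= threshold  # the eval fails if the judge scores below threshold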

21. Evaluations: Example
• Create: git worktree to separate the execution environment
◦ Allows for parallel runs
• Setup: copy test data into the existing codebase
◦ For example, test modules
• Run prompt: trigger the workflow that should be tested
◦ E.g. using Claude Code’s print mode
• Evaluate: run tests & verify that they pass
◦ LLM-as-a-judge to evaluate the code output (for example using deepeval’s G-Eval metric)
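
The four phases as a Python sketch; the paths, prompt, and G-Eval criteria are assumptions, while the worktree, print-mode, and deepeval G-Eval calls follow the steps above:

import shutil
import subprocess
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

worktree = "../eval-run-1"  # hypothetical path

# Create: a git worktree isolates this run, so several evals can run in parallel.
subprocess.run(["git", "worktree", "add", worktree, "HEAD"], check=True)

# Setup: copy fixture test modules into the isolated checkout.
shutil.copytree("evals/fixtures/test_module", f"{worktree}/test_module")

# Run prompt: trigger the workflow under test via Claude Code's print mode.
prompt = "Migrate test_module to the new Design System."
run = subprocess.run(
    ["claude", "-p", prompt],
    cwd=worktree, capture_output=True, text=True, check=True,
)

# Evaluate (1/2): the generated tests must pass.
subprocess.run(["pytest", "test_module"], cwd=worktree, check=True)

# Evaluate (2/2): LLM-as-a-judge on the agent's output via deepeval's G-Eval.
metric = GEval(
    name="Guideline adherence",
    criteria="The output should follow the project's architectural guidelines.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,  # illustrative threshold
)
metric.measure(LLMTestCase(input=prompt, actual_output=run.stdout))
assert metric.is_successful(), f"G-Eval score {metric.score} is below the threshold"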

22. Summary
1. Decompose the problem based on what the agent can understand
2. Review how the agent interacts with its environment, and use that to your advantage
3. Build evaluations to verify reliability and correctness