[mercari GEARS 2025] Techniques for Reliable Code Generation Using AI Agents

mercari

November 14, 2025

Transcript

1. Lukas Appelhans
Software Engineer, AI Task Force
• Background in C++ & Android
• Started working at Mercari in 2022
• Worked on the Design System & Client Architecture teams

2. How I started working with AI
1. Migrating the old Design System to the new Design System
◦ Before agents got popular
◦ Plain calls to OpenAI’s Chat API
2. Accelerating feature development
◦ Spec-driven development

3. What these projects have in common
• Reusable agentic workflows
• AI agents need a lot of information, for example
◦ How to migrate each Design System component
◦ Architectural guidelines
[Diagram: Prompt → Specialised AI Agent ↔ Environment → Result]

4. Can’t AI solve this already?
• Isn’t it enough to fill CLAUDE.md with all necessary information & ask Claude Code to do the job?
• Typical results (if the task is large enough)
◦ Code that doesn’t work
◦ No tests are written
◦ Project guidelines are not followed
◦ Low reliability (low success rate)

5. Motivation of this talk
We need agentic workflows that generate working, well-tested code that follows guidelines and is ready to be merged.

6. Outline of this talk
• Getting started: initial context & prompt
• Runtime: using tools to steer the agent
• Testing: automatically evaluating the output

7. Limitations of Coding Agents
• Coding agents don’t perform well when given too much information
◦ The amount of information they can use effectively is much less than what fits into the context window

8. Initial decision
Decide whether the problem can be solved in a single prompt. If not, think about how it can be decomposed into subtasks.

9. Deciding the problem size
• Decomposing the problem
◦ Reduces the information necessary to solve each task
• Each AI agent should only work in a relatively limited scope
[Diagram: Coordinator → Task 1 Agent, Task 2 Agent]

10. AI agents that work on limited tasks
Two effects:
1. Less context is accumulated per agent
2. Each later agent invocation can run a new prompt (= more control), as sketched below
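
A minimal sketch of this pattern in Python, assuming Claude Code’s print mode (claude -p, mentioned later in this talk) as the per-task agent; the subtask prompts and MIGRATION.md are illustrative:

import subprocess

# Illustrative subtasks; in practice a coordinator (a script or a parent
# agent) derives these from the overall problem.
subtasks = [
    "Migrate the Button component to the new Design System (see MIGRATION.md).",
    "Migrate the Checkbox component to the new Design System (see MIGRATION.md).",
]

for prompt in subtasks:
    # Each invocation starts a fresh agent: no context accumulates across
    # tasks, and every task gets its own tailored prompt.
    result = subprocess.run(
        ["claude", "-p", prompt],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)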

11. Deciding the problem size
• Rule of thumb: ~200 lines of instructions per agent
◦ Depends on information density & how concrete the instructions are
◦ Probably also depends on the model that is used
[Diagram: Coordinator → Task 1 Agent, Task 2 Agent]

12. Splitting a problem into smaller tasks
• The strategy depends on the problem to be solved
• Examples
◦ Library migrations: divide the API surface into batches (sketched below)
◦ Product development: split by architectural component (module setup, screen, logic, logging)
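
A small sketch of the migration case, with a hypothetical list of deprecated components and an arbitrary batch size:

# Hypothetical deprecated API surface; the batch size is arbitrary and should
# be tuned so each batch fits the agent's instruction budget.
deprecated_components = ["OldButton", "OldCheckbox", "OldDialog", "OldToast", "OldTabBar"]
BATCH_SIZE = 2

batches = [
    deprecated_components[i:i + BATCH_SIZE]
    for i in range(0, len(deprecated_components), BATCH_SIZE)
]

prompts = [
    "Migrate the following components to the new Design System: "
    + ", ".join(batch)
    + ". Follow the per-component instructions in MIGRATION.md."
    for batch in batches
]
# Each prompt is then handed to a fresh agent invocation, as in the
# coordinator sketch above.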

13. Tools
• AI agents call tools while they are running to interact with the environment
◦ Reviewing tool calls is often useful to understand what the agent is doing & why
• Think about which tools are necessary to solve the task at hand (a minimal tool-use loop is sketched below)
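
A minimal tool-use loop, assuming the Anthropic Python SDK; the single read_file tool, its schema, and the model name are illustrative. Printing each call keeps the agent’s behaviour reviewable:

import json
import anthropic

client = anthropic.Anthropic()

# Illustrative single tool; real workflows would expose exactly the tools
# the task needs and nothing more.
TOOLS = [{
    "name": "read_file",
    "description": "Read a file from the repository.",
    "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}]

def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

messages = [{"role": "user", "content": "Summarise what src/main.py does."}]
while True:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model choice
        max_tokens=1024, tools=TOOLS, messages=messages,
    )
    if response.stop_reason != "tool_use":
        break
    messages.append({"role": "assistant", "content": response.content})
    results = []
    for block in response.content:
        if block.type == "tool_use":
            # Log tool calls to see what the agent is doing and why.
            print(f"tool call: {block.name}({json.dumps(block.input)})")
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": read_file(**block.input),
            })
    messages.append({"role": "user", "content": results})

print("".join(b.text for b in response.content if b.type == "text"))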

14. Limiting which tools to use
• Tools provided to the agent are part of the context
• Limiting the number of provided tools can
◦ Reduce the amount of irrelevant context
◦ Guide the agent to use the correct tools
[Diagram: the initial prompt sent to the LLM consists of the system prompt, the user prompt, and the tool definitions (Tool 1, Tool 2, Tool 3)]

15. Limiting which tools to use: Example
• Example: code review
◦ Limit access to viewing the diff of a specific pull request
◦ Prevent the agent from using tools to fetch irrelevant information (see the sketch below)
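
One way to enforce this, assuming Claude Code’s --allowedTools flag and its tool-permission syntax; the PR number, the prompt, and the gh command are hypothetical:

import subprocess

# Restrict the review agent to reading files and fetching one specific PR's
# diff via the GitHub CLI; the exact flag and permission syntax may differ
# between Claude Code versions.
pr = 123  # hypothetical pull request number
result = subprocess.run(
    [
        "claude", "-p", f"Review the diff of pull request #{pr}.",
        "--allowedTools", "Read", "Grep", f"Bash(gh pr diff {pr})",
    ],
    capture_output=True, text=True, check=True,
)
print(result.stdout)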

16. Self-correction mechanisms
• Self-correction mechanisms help AI agents fix hallucinations & other errors, for example
◦ Building
◦ Running test cases
◦ Linting
◦ Code review (using another sub-agent)
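
A sketch of such a feedback loop, assuming Gradle as the build tool and claude -p as the agent; the commands, prompts, and retry limit are illustrative:

import subprocess

def run_checks() -> str:
    """Run build, tests and lint; return the combined error output ("" if clean)."""
    errors = []
    for cmd in (["./gradlew", "build"], ["./gradlew", "test"], ["./gradlew", "lint"]):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode != 0:
            errors.append(proc.stderr)
    return "\n".join(errors)

MAX_ATTEMPTS = 3
prompt = "Implement the feature described in SPEC.md."
for _ in range(MAX_ATTEMPTS):
    subprocess.run(["claude", "-p", prompt], check=True)
    feedback = run_checks()
    if not feedback:
        break  # build, tests and lint all pass
    # Feed the errors back so a fresh agent invocation can correct them.
    prompt = f"The previous change fails our checks. Fix these errors:\n{feedback}"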

17. Self-correction mechanisms
• Feedback needs to be accurate
◦ Avoid misleading error messages
◦ Code review must not produce false positives
• Feedback should be concise
◦ E.g. build tools often emit far more output than the agent needs
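
One way to keep build feedback concise is to strip the output down to its error lines before handing it back to the agent; the "error:" marker is compiler-specific and illustrative:

def concise_feedback(build_output: str, limit: int = 50) -> str:
    """Keep only error lines so the agent isn't flooded with irrelevant logs."""
    errors = [line for line in build_output.splitlines() if "error:" in line.lower()]
    return "\n".join(errors[:limit])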

18. Testing
• Agentic systems are complex & non-deterministic
• Automated tests can help to
◦ Understand reliability & correctness
◦ Find regressions (when updating any part of the system)

19. Evaluations
• Evaluations: automatic tests for agentic systems
◦ Test the system’s performance on a specific problem
◦ E.g. for a given prompt, automatically judge whether the output is correct
[Diagram: Prompt → Specialised AI Agent → Result]

20. Evaluations: Check result
• How to check the result of a test case with a non-deterministic output?
a. Unit tests (often used in benchmarks)
b. Score outputs using another LLM (LLM-as-a-judge)
[Diagram: Result → LLM-as-a-judge → Score; fail if score < threshold]
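
A minimal LLM-as-a-judge sketch, assuming the OpenAI Python SDK; the rubric, model name, and threshold are illustrative:

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Rate from 0.0 to 1.0 how well the generated code satisfies
the task description. Answer with a single number only.

Task: {task}

Generated code:
{code}
"""

def judge(task: str, code: str, threshold: float = 0.8) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(task=task, code=code)}],
    )
    score = float(response.choices[0].message.content.strip())
    return score >= threshold  # the eval fails if the judge scores below threshold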

21. Evaluations: Example
• Create: git worktree to separate the execution environment
◦ Allows for parallel runs
• Setup: copy test data into the existing codebase
◦ For example, test modules
• Run prompt: trigger the workflow that should be tested
◦ E.g. using Claude Code’s print mode
• Evaluate: run tests & verify that they pass
◦ LLM-as-a-judge to evaluate the code output (for example using deepeval’s G-Eval metric)
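
The four phases as a Python sketch; the paths, prompt, and G-Eval criteria are assumptions, while the worktree, print-mode, and deepeval G-Eval calls follow the steps above:

import shutil
import subprocess
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

worktree = "../eval-run-1"  # hypothetical path

# Create: a git worktree isolates this run, so several evals can run in parallel.
subprocess.run(["git", "worktree", "add", worktree, "HEAD"], check=True)

# Setup: copy fixture test modules into the isolated checkout.
shutil.copytree("evals/fixtures/test_module", f"{worktree}/test_module")

# Run prompt: trigger the workflow under test via Claude Code's print mode.
prompt = "Migrate test_module to the new Design System."
run = subprocess.run(
    ["claude", "-p", prompt],
    cwd=worktree, capture_output=True, text=True, check=True,
)

# Evaluate (1/2): the generated tests must pass.
subprocess.run(["pytest", "test_module"], cwd=worktree, check=True)

# Evaluate (2/2): LLM-as-a-judge on the agent's output via deepeval's G-Eval.
metric = GEval(
    name="Guideline adherence",
    criteria="The output should follow the project's architectural guidelines.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,  # illustrative threshold
)
metric.measure(LLMTestCase(input=prompt, actual_output=run.stdout))
assert metric.is_successful(), f"G-Eval score {metric.score} is below the threshold"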

22. Summary
1. Decompose the problem based on what the agent can understand
2. Review how the agent interacts with its environment, and use that to your advantage
3. Build evaluations to verify reliability and correctness