[mercari GEARS 2025] PJ Aurora’s Vision and Automated UI Quality Evaluation Agents

mercari

November 14, 2025
Transcript

  1. Asuke Yasukuni
 AI Management Office / Enablement Engineer
 PJ Aurora’s Vision and Automated UI Quality Evaluation Agents

  2. Asuke Yasukuni
 AI Task Force / Enablement Engineer
 I work as an Enabler (Engineer) in the UX/Creative field on the AI Task Force.
 I originally worked as a backend engineer on Merpay's deferred payment service.
 My hobby is creating things with AI, and I'm active in a variety of creative areas, including illustration, singing, and novels.

  3. PJ-Aurora
 Mission
 By changing Mercari's approach to building products, we aim to maximize the potential of both creators and customers.
 The UX AI project has been underway since 2024. Its goal is to enable anyone to create a Mercari-like UX.

  4. PJ-Aurora’s Vision
 Changing the approach
 Achieving true Scrum development with the participation of all stakeholders, by shifting from theory-driven to experiment-driven development
 Changing Mercari employees
 By changing the approach, anyone with an idea can create a UX and find the optimal solution
 Changing customers
 Eliminating customers’ "can't do" situations by transitioning from traditional proposal-based UX to AI-autonomous UX

  5. Short-term Workflow Goals for Project Aurora
 (workflow diagram)
 • Idea
 • Design
 • Design and Implementation (web)
 • Design system compliance
 • UI quality verification
 • UX Prompt Template Completed
 Areas to discuss today

  6. Why develop UX verification (Brand QA Agent) in-house?
 • UI generation lets us stand on the shoulders of giants.
 ◦ Examples: Figma Make, coding agents.
 • However, only Mercari can evaluate “Mercari-ness” (i.e., uniqueness as Mercari).
 ◦ Unique context
 ◦ Unique evaluation criteria
 ◦ Unique design philosophy

  7. Only Mercari can evaluate Mercari-ness
 • That's why we need a Brand QA Agent to review the massively generated UI and content for Mercari-ness.
 • We believe that by having it interact with the UI generation agent in the future, we can continuously improve generation accuracy.

  8. Defining Mercari-ness: How do we measure “Mercari-ness”?
 Brand Core Personality
 "Mercari-ness" is defined by three personalities: Inquisitiveness and Humor, Open Minded, and Sincerity. These are the guidelines that serve as the starting point for creating all brand experiences.
 • Code of conduct based on the Brand Core, to be observed in daily customer communications
 • Communication Principle
 • Design System Principle

  9. What Brand QA Agents Evaluate
 “Mercari-ness”
 • Evaluated based on the Brand Core
 UI quality
 • Evaluated based on each principle (Design System Principle, Communication Principle)
 • Evaluated on rules more detailed than the Brand Core
 Wording and its quality
 • Evaluated based on internally managed wording rules

  10. Brand QA Agent evaluation image
 (diagram: UI Creator / AI Agent → Brand QA Agent, which reads the Brand Core / Principles / Design System)
 ① Input UI
 ② Read the Brand Core and guidelines
 ③ Evaluate whether the UI complies with the Brand Core / Design System Principle / …
 ④ Feedback on evaluation results

  11. Brand QA Agent Evaluation Results
 • A five-point overall evaluation
 • Feedback on 21 items, including strengths, areas for improvement, and the severity of violations for each principle
 • Supports UI display of evaluation results and downloads in Markdown format
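The evaluation-result shape described above (a five-point overall score, per-item feedback with strengths, areas for improvement, and violation severity, plus Markdown export) can be sketched as a small data model. This is a minimal illustration, not Mercari's actual schema; all class and field names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ItemFeedback:
    # One of the per-principle evaluation items (names are illustrative)
    principle: str
    strengths: str
    improvements: str
    severity: str  # e.g. "none", "minor", "major"

@dataclass
class EvaluationResult:
    overall_score: int  # five-point overall evaluation (1-5)
    items: list[ItemFeedback] = field(default_factory=list)

    def to_markdown(self) -> str:
        # Renders the result for the Markdown-download feature
        lines = ["# Brand QA Evaluation",
                 f"**Overall score:** {self.overall_score}/5", ""]
        for item in self.items:
            lines += [f"## {item.principle}",
                      f"- Strengths: {item.strengths}",
                      f"- Areas for improvement: {item.improvements}",
                      f"- Severity: {item.severity}", ""]
        return "\n".join(lines)
```

The same structure can back both the UI display and the Markdown download, since the renderer is just a view over the dataclass.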

  12. Brand QA Agent Architecture
 (diagram: Input → Design System Principle QA Agent / Communication Principle QA Agent / Wording QA Agent (agent or logic) → Overall Evaluation)
 • Other agents may be added as needed.
 • Each evaluation point is processed independently, and the results are then integrated into an overall evaluation.
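The fan-out-then-integrate pattern the architecture slide describes, where each sub-agent evaluates its own axis independently before the results are merged, can be sketched as follows. The sub-agent functions here are placeholders returning fixed scores (the real ones would call an LLM); the integration rule (a simple average) is an assumption for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder sub-agents; each scores one evaluation axis on a 1-5 scale.
def design_system_qa(ui: str) -> int:
    return 4

def communication_qa(ui: str) -> int:
    return 5

def wording_qa(ui: str) -> int:
    return 3

SUB_AGENTS = [design_system_qa, communication_qa, wording_qa]

def evaluate(ui: str) -> dict:
    # Each evaluation point is processed independently (in parallel here),
    # then the per-agent scores are integrated into an overall evaluation.
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(lambda agent: agent(ui), SUB_AGENTS))
    per_agent = {agent.__name__: s for agent, s in zip(SUB_AGENTS, scores)}
    overall = round(sum(scores) / len(scores))
    return {"per_agent": per_agent, "overall": overall}
```

Because the sub-agents share no state, adding another agent (as the slide anticipates) only means appending it to the list.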

  13. Brand QA Agent Architecture
 (diagram)
 • Model: Responses API (GPT-5) or Gemini
 • Input: Brand Core, Principles, Guidelines, and Rules retrieved via file search, plus assessment-based prompts
 • Output: evaluation results with agent-specific details
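As a rough sketch of how one sub-agent's request to the Responses API with the built-in file-search tool could look: the request is built as a plain dict here so the shape is visible; the vector-store ID, prompt text, and helper name are all hypothetical, not Mercari's actual configuration.

```python
def build_request(ui_description: str, vector_store_id: str) -> dict:
    # Hypothetical request payload for one Brand QA sub-agent.
    # The guideline documents (Brand Core, principles, rules) are assumed
    # to live in a vector store queried via the file_search tool.
    return {
        "model": "gpt-5",
        "tools": [{"type": "file_search",
                   "vector_store_ids": [vector_store_id]}],
        "input": (
            "You are a Brand QA sub-agent. Retrieve the relevant Brand Core, "
            "principles, guidelines, and rules via file search, then assess "
            "the UI below against them. Return strengths, areas for "
            "improvement, and violation severity.\n\nUI:\n" + ui_description
        ),
    }

# The actual call would then be something like:
#   client.responses.create(**build_request(ui, store_id))
```

Separating payload construction from the API call also makes the assessment prompts easy to version and test on their own.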

  14. Human review ensures the validity of Brand QA Agent ratings
 • The Brand Core, principles, and guidelines are abstract and broad concepts.
 • We need to ensure that the AI can correctly understand and evaluate them.
 • It's important that all Mercari UIs can be correctly evaluated using the same standards.
 (diagram: input UI → Brand QA Agent → evaluation results → designer review)

  15. Human review of the Brand QA Agent
 • Understanding of the Brand Core: highly accurate and appropriate
 • Presentation of evidence: sufficient and specific
 • Clarity of comments: some room for improvement, but generally understandable
 • Comprehensiveness: covers almost all necessary aspects, with occasional omissions
 • Overall evaluation: the AI evaluation is already "fairly good" for practical use
 ◦ Errors are at a level that a human reviewer can easily pick out and discard
 ◦ Rated "good enough to be incorporated into a design review flow"

  16. Measuring consistency of Brand QA Agent ratings
 • Unstable evaluations cannot be used for business purposes.
 • GPT-5 models do not support the temperature parameter.
 • Accuracy must be maintained continuously, even after agent updates.
 • A dedicated agent will be developed to compare consistency and perform continuous automatic evaluations.

  17. Mechanism for automating consistency measurement
 Create a gold standard from human evaluations, then rerun the AI evaluation on the same UI and measure its consistency against the gold standard.
 (diagram: input UI → Brand QA Agent → evaluation results → consistency measurement agent, using the gold standard as evaluation criteria → measurement results)
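The core of the consistency-measurement step above is an agreement rate between the human-built gold standard and a rerun of the AI evaluation on the same UI. A minimal sketch, assuming both evaluations are reduced to per-item ratings (the item names and rating values below are illustrative, not Mercari's actual rubric):

```python
def consistency_rate(gold: dict, rerun: dict) -> float:
    # Fraction of evaluation items on which the rerun AI evaluation
    # agrees with the human-built gold standard.
    matched = sum(1 for item, rating in gold.items()
                  if rerun.get(item) == rating)
    return matched / len(gold)
```

In practice the comparison agent itself is an LLM judging semantic agreement rather than exact matches, but the reported percentage reduces to this kind of matched-over-total ratio.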

  18. Brand QA Agent Evaluation Consistency Results
 Consistency with the gold standard (evaluation criteria):
 • First test: 62% consistency
 (after several iterations and refinements)
 • Seventh test: 76% consistency
 • Eighth test: 79.5% consistency
 • Ninth test: 87% consistency
 Consistency has steadily improved through model changes, prompt adjustments, and improvements to the comparison agent itself.
 We ultimately aim for a 95% consistency rate!

  19. Brand QA Agent Summary
 What we've accomplished so far
 • We've built a prototype of the evaluation agent!
 • The initial evaluation design is complete!
 • Human review has validated the quality of the AI's evaluations!
 • The consistency match rate is good!
 What we'll do next
 • We'll design the system for actual use!
 • Then feedback, improvement, and operation await!
 In other words...