Can AI Make Fair Evaluations and Decisions? The Limits of LLM-as-a-Judge and Decision-Making Bias

三ツ井智哉, Development Department, Neurogica Inc.

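What Is LLM as a Judge? (Introduction)
Evaluating natural language generation (NLG)
• Human evaluation: costly and time-consuming
• Evaluation by Large Language Models (LLMs): low cost and fast
G-Eval (EMNLP 2023) [1]
• Uses LLMs such as GPT-4 as NLG evaluators
• Achieves high correlation with human evaluation by using Chain-of-Thought (CoT) and related techniques
Evaluating 1,000 model responses
• Human: several days to several weeks, hundreds of thousands of yen, results vary from person to person
• LLM: several minutes, tens of thousands of yen, the same criteria applied throughout
[1] Liu, Yang, et al. "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment." Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023.
Sections: Introduction, Problems, Proposed Solutions, Agents, Summary
© Neurogica Inc.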

Problems with LLM as a Judge (1)
Large Language Models are not Fair Evaluators (ACL 2024) [2]
• Demonstrates position bias in LLMs
• In comparisons of the form "Which output is better, A or B?", merely swapping the order in which the options (A and B) are presented changes the verdict
• A workaround is needed: average the results over both presentation orders
Figure: the same two responses are given to the LLM judge in both orders under the prompt "Which response is better? Response 1: … Response 2: …"
[2] Wang, Peiyi, et al. "Large Language Models are not Fair Evaluators." Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024.
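
The order-averaging workaround can be sketched as follows. This is a minimal illustration, not the procedure from [2]: the `Judge` signature and the toy `first_slot_biased_judge` (which stands in for a real LLM call and is deliberately biased toward the first slot) are assumptions introduced here.

```python
from typing import Callable

# Hypothetical judge signature: takes (prompt, first_response, second_response)
# and returns the probability (0.0 to 1.0) that the FIRST response shown is better.
Judge = Callable[[str, str, str], float]


def debiased_preference(judge: Judge, prompt: str, a: str, b: str) -> float:
    """Return the order-averaged probability that response `a` beats `b`."""
    a_shown_first = judge(prompt, a, b)         # P(a wins) with a in slot 1
    a_shown_second = 1.0 - judge(prompt, b, a)  # P(a wins) with a in slot 2
    return (a_shown_first + a_shown_second) / 2.0


def first_slot_biased_judge(prompt: str, first: str, second: str) -> float:
    """Toy stand-in for an LLM call: prefers longer text, plus a bonus for slot 1."""
    base = 0.5 + 0.1 * (len(first) > len(second)) - 0.1 * (len(first) < len(second))
    return min(1.0, base + 0.3)  # +0.3 models a bias toward the first slot
```

With this toy judge, whichever response is shown first wins in a single call, so the raw verdict flips when the order is swapped. Averaging the two orders cancels the slot bonus and leaves only the underlying preference.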

Problems with LLM as a Judge (2)
Self-Preference Bias in LLM-as-a-Judge (NeurIPS 2024 Workshop) [3]
• LLMs tend to rate their own outputs unfairly high
• Text that is familiar and easy for the model to predict receives higher ratings
• The stylistic affinity between the generating model and the evaluating model leaks into the evaluation
[3] Wataoka, Koki, Tsubasa Takahashi, and Ryokan Ri. "Self-Preference Bias in LLM-as-a-Judge." NeurIPS Safe Generative AI Workshop 2024.
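
One rough way to check for this tendency is to compare how often the judge agrees with human annotators when they favor the judge's own output versus when they favor the other model's output. The sketch below is illustrative only and is not claimed to be the metric defined in [3]; the `Comparison` record and `self_preference_gap` are hypothetical names, and human labels are assumed to be available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    judge_picked_own: bool  # judge chose the response written by the judge model itself
    human_picked_own: bool  # human annotators chose that same response


def self_preference_gap(comparisons: list[Comparison]) -> float:
    """Agreement with humans when they favor the judge's own output,
    minus agreement when they favor the other model's output."""
    own = [c for c in comparisons if c.human_picked_own]
    other = [c for c in comparisons if not c.human_picked_own]
    if not own or not other:
        raise ValueError("need comparisons where humans favor each side")
    agree_own = sum(c.judge_picked_own for c in own) / len(own)
    agree_other = sum(not c.judge_picked_own for c in other) / len(other)
    return agree_own - agree_other
```

A gap near zero means the judge follows human preference equally well on both sides. A clearly positive gap means it sides with its own output more readily than the human labels justify.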

Proposed Solutions
Pairwise or Pointwise? (COLM 2025) [4]
• Compares how susceptible each evaluation protocol is to bias
• Pairwise (A/B comparison): the verdict flips in about 35% of cases under changes such as swapping the order, which is high
• Pointwise (absolute evaluation, scoring): variation in the evaluation stays at about 9%, so it is more robust to noise
Figure: Pairwise prompt ("Which response is better? Response 1: … Response 2: …") next to Pointwise prompt ("Scoring this response. Response 1: …")
[4] Tripathi, Tuhina, et al. "Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation." Second Conference on Language Modeling. 2025.
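
A flip rate of this kind can be computed as the share of items whose verdict changes after a perturbation. The sketch below is a minimal illustration under assumptions made here, not the protocol of [4]: verdicts are plain labels ("A", "B", "tie"), and for the pointwise protocol a preference is derived by comparing two independently assigned scores. Both function names are hypothetical.

```python
def flip_rate(before: list[str], after: list[str]) -> float:
    """Fraction of items whose verdict changed after a perturbation."""
    if len(before) != len(after) or not before:
        raise ValueError("verdict lists must be non-empty and the same length")
    flips = sum(b != a for b, a in zip(before, after))
    return flips / len(before)


def verdict_from_scores(score_a: float, score_b: float) -> str:
    """Turn two independent pointwise scores into a preference label."""
    if score_a == score_b:
        return "tie"
    return "A" if score_a > score_b else "B"
```

For the pairwise protocol, `before` and `after` hold the judge's direct A/B answers with and without the perturbation. For the pointwise protocol, each list is built by passing the score pairs through `verdict_from_scores`, which lets the two protocols be compared on the same scale.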

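From Evaluation to Decision-Making (Agents)
• LLM-as-a-Judge: evaluates outputs
• LLM Agent: search, browser operation, external tool execution, purchasing, booking, research
• Claude offers Computer Use, which performs screen operations, clicks, and text input
• OpenAI Operator has been released as an agent that carries out tasks using a browser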

Bias in Agents (1)
Actions Speak Louder than Words [5]
• Even LLMs tuned not to give discriminatory answers carry latent social biases in the decisions they make as agents
• LLMs are given personas and compared across decision-making scenarios such as evacuation, lending, and hiring
• Significant decision disparities were observed in nearly all simulations
[5] Li, Yuxuan, Hirokazu Shirado, and Sauvik Das. "Actions Speak Louder than Words: Agent Decisions Reveal Implicit Biases in Language Models." Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency. 2025.
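
To make "significant decision disparity" concrete, the sketch below compares the rate of a favorable decision (for example, approving a loan) between two persona groups. The slide does not say which statistical test [5] uses, so a standard two-proportion z-test is swapped in here purely as an illustration. The function name and its inputs are assumptions.

```python
import math


def two_proportion_z_test(yes_a: int, n_a: int, yes_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between two decision rates."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both groups need at least one trial")
    rate_a, rate_b = yes_a / n_a, yes_b / n_b
    pooled = (yes_a + yes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0  # every decision was identical, so no detectable gap
    z = (rate_a - rate_b) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value
```

Here `yes_a` out of `n_a` would be the favorable decisions the agent made for one persona group, and `yes_b` out of `n_b` those for the other. A small p-value indicates that the gap between the two rates is unlikely to be chance alone.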

Bias in Agents (2)
What Is Your AI Agent Buying? [6]
• Demonstrates model-specific preference biases in LLM shopping agents
• Sensitivity to product position, price, number of reviews, the presence of ad tags, and similar factors differs from model to model and from generation to generation
Figures: the left figure shows position bias; the right figure shows the influence of tags and ratings
[6] Allouah, Amine, et al. "What Is Your AI Agent Buying? Evaluation, Biases, Model Dependence, & Emerging Implications of Agentic E-Commerce." Proceedings of the ACM Web Conference 2026.
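
Sensitivity to product position can be probed by logging which slot the agent buys from across many trials. The sketch below is a hypothetical measurement, not the evaluation setup of [6]. It assumes product placement is shuffled at random in every trial, so that without position bias each slot's share would be close to 1 / `num_positions`.

```python
from collections import Counter


def selection_share_by_position(chosen_positions: list[int], num_positions: int) -> list[float]:
    """Share of trials in which the agent bought the item shown at each position (0-based)."""
    if not chosen_positions:
        raise ValueError("need at least one trial")
    counts = Counter(chosen_positions)
    total = len(chosen_positions)
    return [counts.get(pos, 0) / total for pos in range(num_positions)]
```

Running the same trials with different models and comparing the resulting share vectors gives a simple view of how position sensitivity varies by model.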

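Summary
LLM as a Judge
• Delivers evaluations close to human judgment in far less time and at far lower cost than human evaluation
Problems with LLM evaluation
• Bias exists with respect to the order of the options
• Outputs generated by the same model as the evaluator are rated higher
Evolution toward agents
• Beyond text evaluation, LLMs now also perform operations and make decisions
• Bias also exists in LLM decision-making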
