[Paper introduction] ChatGPT Goes Shopping: LLMs Can Predict Relevance in eCommerce Search

Paper: Beatriz Soviero, Daniel Kuhn, Alexandre Salle, Viviane P. Moreira. ECIR 2024
Presenter: 茂手木太一, Kato Laboratory, University of Tsukuba

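Paper overview
• Proposes an LLM-based relevance judgment method for the e-commerce domain
[Figure: a query (e.g., "wooden table"), product documents, and a guideline are given to an LLM, which outputs a Qrel (relevance judgment dataset)]
Example Qrel:
doc id   query id   relevance
a        1          1
b        1          1
c        1          0
d        1          1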

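Background (1/2)
• Datasets are built by hand
  ◦ Google uses guidelines of more than 170 pages [1]
[Figure: annotators follow a guideline to judge each query-document pair, producing a Qrel (relevance judgment dataset)]
Example Qrel:
doc id   query id   relevance
a        1          1
b        1          0
c        1          0
d        1          1
[1] Google LLC, General Guidelines. https://static.googleusercontent.com/media/guidelines.raterhub.com/ja//searchqualityevaluatorguidelines.pdf
(Slide footer: IR Reading 2024)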

Background (2/2)
• Building these datasets is costly
  ◦ 73,000 relevance judgments require more than 600 hours [2]
  ◦ Building WANDS is estimated to have taken more than 3,500 hours [3]
  ◦ Writing the guidelines also requires deep domain knowledge
• Methods have been proposed to obtain high-quality relevance judgments at low cost
  ◦ Using crowd workers [4]
  ◦ Using LLMs for relevance judgment
[2] Sanderson. Test collection based evaluation of information retrieval systems. Foundations and Trends in Information Retrieval. p.247-375. 2010
[3] Chen et al. WANDS: dataset for product search relevance assessment. ECIR 2022
[4] Blanco et al. Repeatable and reliable search system evaluation using crowdsourcing. SIGIR 2011
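To make the cost figure concrete, the numbers quoted from [2] can be converted into time per judgment. This is only a back-of-the-envelope sketch: the slide says "more than 600 hours", so the result is a lower bound, and the function name is illustrative.

```python
# Figures quoted on the slide from [2]; "more than 600 hours" makes this a lower bound.
JUDGMENTS = 73_000
HOURS = 600


def seconds_per_judgment(n_judgments: int, hours: float) -> float:
    """Average annotation time per relevance judgment, in seconds."""
    return hours * 3600 / n_judgments


per_item = seconds_per_judgment(JUDGMENTS, HOURS)
print(f"at least {per_item:.1f} s per judgment")  # at least 29.6 s per judgment
```

Roughly half a minute per query-document pair is small on its own, but it adds up quickly at the scale of a full test collection.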

Related work
Studies that examine how well LLMs perform relevance judgment
• Survey of the advantages and disadvantages of using LLMs for relevance judgment [5]
• Examination of how prompt features affect relevance judgment [6]
  ◦ Role, Description, Narrative, Aspect, Multiple
• Relevance judgment in a domain that requires advanced knowledge [7]
  ◦ LeCaRD: a Chinese legal case retrieval dataset
  ◦ Legal Fact and Material Fact are extracted from each case
  ◦ A case is judged relevant when its Legal Fact and Material Fact are similar
[5] Faggioli et al. Perspectives on large language models for relevance judgment. ICTIR 2023
[6] Thomas et al. Large Language Models can Accurately Predict Searcher Preferences. arXiv:2309.10621, 2023
[7] Ma et al. Leveraging Large Language Models for Relevance Judgments in Legal Case Retrieval. arXiv:2403.18405, 2024

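Research objectives
1. Evaluate how closely LLM relevance judgments agree with human relevance judgments on e-commerce datasets
2. Examine whether an LLM can generate guidelines that are useful for relevance judgment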

Characteristics of e-commerce datasets (WANDS)
• Queries and documents are short
• They contain many attributes such as brand, size, material, and color
• Ex. 1
  ◦ Query: Kitchen wooden stand
  ◦ Document: Kitchen cart solid wood
• Ex. 2
  ◦ Query: 70s inspired furniture
  ◦ Document: ambesonne blue curtains , ocean inspired waves pattern with japanese influences nautical maritime aquatic , window treatments 2 panel set for living room bedroom decor , 56 '' x 63 '' , sky blue white

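Proposed method 1: Human Baseline
• The guideline is written by hand, and the LLM performs the relevance judgment
[Figure: a human writes the guideline; the query (e.g., "wooden table"), product documents, and the guideline are given to an LLM, which outputs a Qrel (relevance judgment dataset)]
Example Qrel:
doc id   query id   relevance
a        1          1
b        1          1
c        1          0
d        1          1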

Proposed method 1: Human Baseline
• The human-written guideline (excerpt)＊

Definition of Relevant:
Relevant: this label represents the products that directly match the query. The query and the name of a relevant product may not share words. So if the product is in the same category as the query but has different specifications such as brand, it is still relevant.

Example of Relevant:
Example of a relevant judgment: the product 'Gifford Desk' is relevant to the query 'writing desk'. The user searched for a writing desk and the returned product is a desk from the brand Gifford that can be used to write. Therefore, the product matches the user's intent.

Examples of relevance judgments:
('dining table vinyl cloth', 'horwich dining table', 'Not relevant')
('togo chair', 'evendale upholstered side chair', 'Relevant')

＊https://github.com/danimtk/chatGPT-goes-shopping/blob/main/data/guidelines/human_baseline_ten_shot_WANDS.txt
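A minimal sketch of how a guideline like the one above could be combined with a query-product pair and how the model's answer could be mapped to a binary label. The prompt wording and the function names `build_judgment_prompt` and `parse_label` are assumptions for illustration; the slides do not show the exact prompt used in the paper, and the call to the LLM itself is left out.

```python
from typing import Optional


def build_judgment_prompt(guideline: str, query: str, product: str) -> str:
    """Assemble a judgment prompt: guideline first, then the pair to judge.

    The wording is hypothetical, not the prompt from the paper.
    """
    return (
        f"{guideline.strip()}\n\n"
        f"Query: {query}\n"
        f"Product: {product}\n"
        "Answer with exactly one label: Relevant or Not relevant."
    )


def parse_label(response: str) -> Optional[int]:
    """Map a free-text answer to 1 (Relevant), 0 (Not relevant), or None.

    Negative forms are checked first because "relevant" is a substring
    of both "not relevant" and "irrelevant".
    """
    text = response.strip().lower()
    if "not relevant" in text or "irrelevant" in text:
        return 0
    if "relevant" in text:
        return 1
    return None  # unparseable answer; the caller decides how to handle it


prompt = build_judgment_prompt(
    "Relevant: this label represents the products that directly match the query.",
    "togo chair",
    "evendale upholstered side chair",
)
print(prompt)
print(parse_label("Not relevant."))  # 0
```

Returning `None` for answers that contain neither label keeps parsing failures visible instead of silently counting them as one class.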

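Proposed method 2: LLM-Generated
• The guideline is generated by an LLM, and the LLM performs the relevance judgment
• The guideline is generated by giving the LLM 200 query-document pairs that already have relevance judgments
[Figure: judged examples (e.g., product document "side table": Relevant; "legless chair": Not relevant) are given to an LLM, which generates the guideline; the query (e.g., "wooden table"), product documents, and the generated guideline are then given to an LLM, which outputs a Qrel (relevance judgment dataset)]
Example Qrel:
doc id   query id   relevance
a        1          1
b        1          1
c        1          0
d        1          1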
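The guideline-generation step can be sketched as a prompt built from already-judged pairs. This is an illustration under assumptions: the instruction text, the function name `build_guideline_prompt`, and the random sampling are hypothetical, and only the count of 200 examples comes from the slide.

```python
import random
from typing import List, Tuple

Judged = Tuple[str, str, str]  # (query, product name, label)


def build_guideline_prompt(examples: List[Judged], n: int = 200, seed: int = 0) -> str:
    """Build a prompt asking an LLM to write a relevance guideline.

    Up to `n` judged pairs are included (200 on the slide). The instruction
    wording is hypothetical, not taken from the paper.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    chosen = examples if len(examples) <= n else rng.sample(examples, n)
    lines = [f"({q!r}, {p!r}, {label!r})" for q, p, label in chosen]
    return (
        "Below are query-product pairs with relevance labels.\n"
        "Write a guideline that explains when a product is Relevant "
        "or Not relevant to a query.\n\n"
        + "\n".join(lines)
    )


examples = [
    ("dining table vinyl cloth", "horwich dining table", "Not relevant"),
    ("togo chair", "evendale upholstered side chair", "Relevant"),
]
print(build_guideline_prompt(examples))
```

The examples are formatted as tuples to match the style of the judged pairs shown in the guideline excerpts.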

Proposed method 2: LLM-Generated
• The LLM-generated guideline (excerpt)＊

Definition of Relevant:
1. Relevance:
- A product is considered 'Relevant' if it directly matches the user's intent or closely aligns with the query's context.
- Look for products that fulfill the purpose or function described in the query.

Example of Relevant:
Examples:
- In the query 'hardwood beds,' the 'Meryl Solid Wood Platform Bed' would be relevant as it matches the material (hardwood) and the product type (bed).

Examples of relevance judgments:
('dining table vinyl cloth', 'horwich dining table', 'Not relevant')
('togo chair', 'evendale upholstered side chair', 'Relevant')

＊https://github.com/danimtk/chatGPT-goes-shopping/blob/main/data/guidelines/LLM-generated_ten-shot_WANDS.txt

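Experimental setup
• LLMs
  ◦ GPT-3.5-turbo
  ◦ GPT-4
• Datasets
  ◦ WANDS: home goods (public)
  ◦ Pharma: online pharmacy (private)
  ◦ Each is split into Easy and Hard by difficulty
• Evaluation
  ◦ The gold labels are compared with the relevance labels generated by the LLM
  ◦ Accuracy and Kappa are used
[Table: dataset statistics, quoted from Table 1 of the paper]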
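The two evaluation measures can be sketched as follows, assuming "Kappa" refers to Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. The label lists are toy data for illustration, not results from the paper.

```python
from collections import Counter
from typing import Sequence


def accuracy(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Fraction of items where the predicted label equals the gold label."""
    if len(gold) != len(pred) or len(gold) == 0:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def cohen_kappa(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(gold)
    p_o = accuracy(gold, pred)  # observed agreement
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    # Chance agreement: product of the two marginal label distributions.
    p_e = sum(gold_counts[c] * pred_counts[c] for c in set(gold) | set(pred)) / (n * n)
    if p_e == 1.0:  # both sides used a single identical label
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy labels only: gold (human) vs. LLM judgments for four products.
gold = [1, 0, 0, 1]
pred = [1, 1, 0, 1]
print(accuracy(gold, pred))     # 0.75
print(cohen_kappa(gold, pred))  # 0.5
```

In the toy example three of four labels agree, but half of that agreement is expected by chance given the label distributions, so kappa is lower than accuracy.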

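Experimental results (1/2)
• The LLMs produce relevance judgments comparable to those of humans
[Table: agreement with the gold labels, quoted from Table 2 of the paper]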

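Experimental results (2/2)
• Effect of query difficulty on accuracy
  ◦ Easy: ~90%
  ◦ Hard: ~52%
• Examples where the LLM's relevance judgment failed
  ◦ The adjective differs between the query and the product name (Hard Positive)
    ('card table', 'rian coffee table', 'Relevant')
  ◦ The query refers to something general while the product name is specific (Easy Positive)
    ('flamingo', 'palm sprints flamingo graphic art', 'Relevant')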

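Summary and discussion
• Proposal of an LLM-based relevance judgment method
  ◦ Examined how well LLMs can judge relevance in e-commerce
  ◦ Examined whether LLMs can generate guidelines that are useful for relevance judgment
• Q. Considering cost, which LLM should be used?

                     GPT-3.5-turbo   GPT-4-turbo   GPT-4o
Input (1M tokens)    $0.5            $10.0         $5.0
Output (1M tokens)   $1.5            $30.0         $15.0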
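As a starting point for the cost question, the prices in the table can be turned into a rough estimate per model. The prices are the ones on the slide; the number of judgments and the tokens per judgment are hypothetical placeholders that would need to be measured for a real guideline and dataset.

```python
# USD per 1M tokens as listed on the slide: (input, output).
PRICES = {
    "GPT-3.5-turbo": (0.5, 1.5),
    "GPT-4-turbo": (10.0, 30.0),
    "GPT-4o": (5.0, 15.0),
}


def estimate_cost(model: str, n_judgments: int,
                  input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of n_judgments calls with fixed token counts per call."""
    in_price, out_price = PRICES[model]
    total_in = n_judgments * input_tokens
    total_out = n_judgments * output_tokens
    return (total_in * in_price + total_out * out_price) / 1_000_000


# Hypothetical workload: 10,000 judgments, 1,000 prompt tokens
# (guideline + pair) and 5 answer tokens per judgment.
for model in PRICES:
    print(f"{model}: ${estimate_cost(model, 10_000, 1_000, 5):.2f}")
```

Because both the input and the output price of GPT-4-turbo are 20 times those of GPT-3.5-turbo (10 times for GPT-4o), the cost ratio between models does not depend on the token counts assumed here.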
