[Reading-group slides] Mining Query Subtopics from Search Log Data [SIGIR 2012]

Mining Query Subtopics from Search Log Data
Yunhua Hu, Yanan Qian, Hang Li, Daxin Jiang, Jian Pei, and Qinghua Zheng
http://research.microsoft.com/apps/pubs/default.aspx?id=168006
International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) 2012
Reading-group slides, 2013-04-24

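Introduction
• Understanding the user's search intent is essential to satisfying search needs.
⇒ Research on understanding the intent behind search queries has looked at:
- informational, navigational and transactional intent
- semantic categories or topics
- subtopics (query) # this work
⇒ Subtopics arise for ambiguous queries with several meanings, and for queries with several facets※.
※ Facet: a set of divisions produced by applying a single characteristic to a class; an attribute used to group concepts according to their properties.
Example (ambiguous): a user searching for "ダンカン" may mean
- ダンカン: a comedian (Office Kitano)
- ダンカン: a monster (appears in Ultraseven episode 33, "侵略する死者たち")
Example (multifaceted): "xbox" has facets such as online game, marketplace, homepage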

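Introduction
This paper studies two phenomena in how users search.
(1) one subtopic per search (OSS)
⇒ Even when a query has several meanings, the searcher has one fixed intent.
(2) subtopic clarification by additional keyword (SCAK)
⇒ During a search session, users add keywords related to the subtopic to the query to make the subtopic explicit.
Example user goals: "I'm searching for ダンカン." / "I want to download the latest songs from the xbox marketplace."
Search order  Query
1             xbox
2             xbox marketplace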

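Introduction / OSS
[Figure: search results (1) to (5), each about either the Microsoft executive or the actor in FOX's glee]
People looking for the actor click (2) and (4) together, whereas people looking for the Microsoft executive click (1), (3) and (5) together.
⇒ one subtopic per search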

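Introduction / SCAK
[Figure: search results (1) to (5), each about either the Microsoft executive or the actor in FOX's glee]
People looking for the Microsoft executive, on the other hand, expand the query: Harry Shum microsoft
⇒ "microsoft" is the information that determines the subtopic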

Introduction
• This work extracts subtopics by clustering and reports the following accuracy:
- ambiguous subtopic ⇒ B-cubed F1 0.925 ～ 0.956
- multifaceted subtopic ⇒ B-cubed F1 0.896 ～ 0.930
• Effects in two applications:
◦ search results clustering
- B-cubed precision +5.4 [%]
- B-cubed recall +6.1 [%]
◦ search results re-ranking
- Δ = 0.61

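One Subtopic per Search
• one subtopic per search
⇒ Are several URLs that share a subtopic clicked by the same person?
⇒ Measuring how accurately this holds shows:
◦ The fewer the multi-clicks, the higher the accuracy.
◦ The higher the click frequency, the higher the accuracy.
Search users are more deliberate than most people assume, and they do not click at random.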
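
The OSS idea can be made concrete by counting how often two URLs are clicked within the same search. The following is a minimal sketch, not the paper's implementation: the function name `co_click_counts` and the toy click log are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def co_click_counts(searches):
    """Count how often two URLs are clicked within the same search.

    `searches` is a list of click lists, one list per search (hypothetical format).
    """
    counts = Counter()
    for clicked in searches:
        # Single-click searches yield no pairs, so only multi-clicks contribute.
        for a, b in combinations(sorted(set(clicked)), 2):
            counts[(a, b)] += 1
    return counts

# Toy log (made up): a.com/c.com/e.com share one subtopic, b.com/d.com another.
searches = [
    ["a.com", "c.com"],
    ["a.com", "c.com", "e.com"],
    ["b.com"],
    ["b.com", "d.com"],
]
counts = co_click_counts(searches)
```

Under OSS, pairs with high counts such as `("a.com", "c.com")` are expected to belong to the same subtopic, while pairs from different subtopics should rarely or never be co-clicked.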

Subtopic clarification by Additional Keyword
• subtopic clarification by additional keyword
⇒ To make the subtopic explicit, users append additional words to the main query during a search session. Looking at the URLs clicked after each search, the main keyword and the added keywords tend to point to the same subtopic.
n  Query
1  xbox
2  xbox marketplace
3  xbox marketplace BOF3
• Is that really so?

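Subtopic clarification by Additional Keyword
Q → a single term, a noun
W → an expansion word added to Q
Query patterns of the form "Q" and "Q+W" were sampled, and the overlap between clicked URLs was compared with the overlap between subtopics.
⇒ As much as 42 [%]! The two are closely related.
Examples where the URLs do not overlap → the subtopics differ (these are pruned later):
- beijing / beijing duck
- fast / fast food
- computer science / computer science department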

Clustering Method (overview)
- all <query,URL> pairs are stored
- unnecessary expanded queries (those that do not overlap) have been pruned

Clustering Method (indexing) → an implementation device
(Q,Q+W) → prefix tree ex) harry shum / harry shum jr
(Q,W+Q) → suffix tree ex) harry shum / microsoft harry shum
This gives efficient access to the data for expansion words.
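
As an illustration of the indexing idea, here is a minimal sketch using token-level tries. It is an assumption-laden stand-in, not the paper's data structure: `QueryTrie`, `build_indexes` and `expanded_queries` are hypothetical names, and the "suffix tree" is approximated by a trie over reversed tokens.

```python
class QueryTrie:
    """Token-level trie; each node keeps the queries that end at that node."""

    def __init__(self):
        self.children = {}
        self.queries = []

    def insert(self, tokens, query):
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, QueryTrie())
        node.queries.append(query)

    def expansions(self, tokens):
        """Return stored queries that strictly extend the given token path."""
        node = self
        for tok in tokens:
            node = node.children.get(tok)
            if node is None:
                return []
        found, stack = [], list(node.children.values())
        while stack:
            cur = stack.pop()
            found.extend(cur.queries)
            stack.extend(cur.children.values())
        return sorted(found)

def build_indexes(queries):
    prefix, suffix = QueryTrie(), QueryTrie()
    for q in queries:
        toks = q.split()
        prefix.insert(toks, q)        # used to find Q+W
        suffix.insert(toks[::-1], q)  # reversed tokens, used to find W+Q
    return prefix, suffix

def expanded_queries(prefix, suffix, q):
    toks = q.split()
    return prefix.expansions(toks), suffix.expansions(toks[::-1])

queries = ["harry shum", "harry shum jr", "microsoft harry shum", "harry potter"]
prefix, suffix = build_indexes(queries)
q_plus_w, w_plus_q = expanded_queries(prefix, suffix, "harry shum")
```

Looking up "harry shum" walks only the matching branch of each trie, so unrelated queries such as "harry potter" are never visited.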

Clustering Method (pruning)
For (Q,Q+W) pairs such as fast / fast food, records whose clicked URLs do not overlap are pruned.
Note that this is only a heuristic rule.
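
The pruning heuristic can be sketched as a simple set-intersection test. This is a minimal sketch under assumptions: the function name `prune_expansions`, the "at least one shared URL" criterion and the example URLs are all hypothetical, since the slide does not give the exact rule.

```python
def prune_expansions(clicks, query, expanded):
    """Keep only expanded queries that share at least one clicked URL with `query`.

    `clicks` maps a query string to the set of URLs clicked for it (assumed format).
    """
    base = clicks.get(query, set())
    return [e for e in expanded if base & clicks.get(e, set())]

# Made-up click sets: "fast food" shares no URL with "fast", so it is dropped.
clicks = {
    "fast": {"speedtest.example", "fast.example"},
    "fast food": {"burgers.example"},
    "fast internet": {"speedtest.example", "isp.example"},
}
kept = prune_expansions(clicks, "fast", ["fast food", "fast internet"])
```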

Clustering (similarity)
We conduct clustering on the clicked URLs of each query and its expanded queries.
▪Similarity Function
S(ui, uj) = α S1(ui, uj) + β S2(ui, uj) + γ S3(ui, uj)
(OSS term: α S1, SCAK term: β S2, string sim term: γ S3)
S1 is a similarity function based on the OSS phenomenon, S2 is based on the SCAK phenomenon, S3 is based on string similarities, with α, β, and γ as weights.
▪S1(OSS) term
mui: the set of URLs co-clicked with ui for a given search query in the user search logs, with counts. Example:
ui: http://www.a.com/
http://www.b.com/: 2
http://www.c.com/: 23
http://www.d.com/: 10
http://www.e.com/: 20
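
The S1 term and the weighting can be sketched as follows. Two assumptions are made here that the slide does not state: S1 is taken to be the cosine similarity between co-click count vectors, and the three terms are combined as a weighted sum. The default weights are the α, β, γ values the deck reports in its experiments; the second vector `m_uj` is made up.

```python
import math

def cosine(v, w):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v[k] * w[k] for k in v.keys() & w.keys())
    norm = math.sqrt(sum(x * x for x in v.values())) * math.sqrt(sum(x * x for x in w.values()))
    return dot / norm if norm else 0.0

def combined_similarity(s1, s2, s3, alpha=0.35, beta=0.4, gamma=0.25):
    """Weighted sum of the OSS, SCAK and string-similarity terms (assumed form)."""
    return alpha * s1 + beta * s2 + gamma * s3

# Co-click vector of http://www.a.com/ from the slide, and a hypothetical second URL.
m_ui = {"http://www.b.com/": 2, "http://www.c.com/": 23,
        "http://www.d.com/": 10, "http://www.e.com/": 20}
m_uj = {"http://www.c.com/": 10, "http://www.e.com/": 5}
s1 = cosine(m_ui, m_uj)
```

Two URLs that are mostly co-clicked with the same other URLs get an S1 close to 1, which is the behaviour the OSS phenomenon suggests.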

Clustering (similarity & Algorithm)
▪S2(SCAK) term
w_ui and w_uj denote the vectors of keywords associated with ui and uj.
Example keyword vector for u1: u1 = {q=>1.0, q+w1=>1.0, w2+q=>1.0}
▪S3(string) term
Measures the similarity of ui and uj as strings, although the method appears to be somewhat involved.
↓
M. Kan and H. Thi. Fast webpage classification using url features. In Proceedings of the 14th ACM
※ Estimates the relevancy of a URL from its surface form. feature -> URL length, URI components
※ I have not read this in detail; if you are interested, please read it and tell me.
▪Clustering algorithm
Hierarchical clustering (agglomerative).
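
A minimal sketch of agglomerative clustering over URLs is given below. Several choices are assumptions, because the slide does not specify them: average linkage, stopping when the best similarity falls below a threshold θ, and using only an S2-like term (cosine over 0/1 keyword vectors) as the similarity. The names `keyword_similarity`, `agglomerative` and the keyword sets are hypothetical.

```python
import math

def keyword_similarity(a, b):
    """Cosine similarity between two keyword sets treated as 0/1 vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def agglomerative(items, sim, theta):
    """Average-link agglomerative clustering; stop when best similarity < theta."""
    clusters = [[x] for x in items]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(sim(a, b) for a in clusters[i] for b in clusters[j])
                avg = total / (len(clusters[i]) * len(clusters[j]))
                if avg > best:
                    best, pair = avg, (i, j)
        if best < theta:
            break
        i, j = pair
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters

# Made-up keyword sets per URL for the query "harry shum".
keywords = {
    "u1": {"harry shum", "harry shum microsoft"},
    "u2": {"harry shum", "harry shum microsoft", "microsoft harry shum"},
    "u3": {"harry shum", "harry shum jr"},
    "u4": {"harry shum jr", "harry shum glee"},
}
clusters = agglomerative(
    list(keywords),
    lambda x, y: keyword_similarity(keywords[x], keywords[y]),
    theta=0.3,
)
```

With these toy inputs the URLs split into a "microsoft" group and a "jr / glee" group, mirroring the two subtopics used as the running example in the deck.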

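Clustering (post process)
The clustering result is a data structure in which URLs are grouped by subtopic, so the corresponding expansion words are extracted to build the following data structure.
[Figure: one subtopic for the Microsoft executive, one subtopic for the actor]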

EXPERIMENTS ON ACCURACY
※ One third of the data was used for parameter tuning and the remaining data for evaluation.
Parameters: α, β, γ, θ (clustering parameter) = 0.35, 0.4, 0.25, 0.3
In short, the method works really well.

EXPERIMENTS ON ACCURACY
[Figure: accuracy by OSS term, SCAK term and string sim term]
Combining all three terms gives quite good accuracy, and including OSS/SCAK as sub-functions has a clearly positive effect on accuracy.

APPLICATIONS OF SUBTOPIC MINING
• Search results clustering
⇒ Group the retrieved results (URLs) by subtopic and present them.
※ There is a great deal of prior work on this; Wang and Zhai's work [1] is used as the baseline for comparison.
[1] X. Wang and C. Zhai. Learn from web search logs to organize search results. In Proceedings SIGIR'07, pages 87–94, 2007.
Improvement over the baseline: 5.4% in terms of B-cubed precision, 6.1% in terms of B-cubed recall, 5.9% in terms of B-cubed F1

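APPLICATIONS OF SUBTOPIC MINING
• Search results Re-Ranking
[Figure: this part of the result page is rewritten dynamically depending on the subtopic (re-ranked)]
query: "harry shum"
The position of the last click was compared between the existing ranking and the proposed one.
⇒ The difference in position is 0.61.
The UI reduces the cost of the user's search behavior.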
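
One plausible reading of Δ is the difference between the mean last-click positions under the two rankings. The sketch below assumes that reading; the function name and the session data are made up and are not meant to reproduce the reported 0.61.

```python
def mean_last_click_position(sessions):
    """Mean 1-based rank of the last clicked result, over sessions with clicks.

    Each session is a list of clicked ranks in click order (assumed format).
    """
    last = [clicks[-1] for clicks in sessions if clicks]
    return sum(last) / len(last)

# Hypothetical sessions: ranks clicked under the existing and re-ranked result lists.
baseline = [[1, 4], [3], [2, 5]]
reranked = [[1, 3], [2], [4]]
delta = mean_last_click_position(baseline) - mean_last_click_position(reranked)
```

A positive `delta` means users reached their last click higher up the page after re-ranking, which is the sense in which the slide speaks of reduced search cost.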

CONCLUSION
This work tackled query subtopic mining through two user behaviors.
⇒ F1-measure 0.925 (for finding ambiguous subtopics)
⇒ F1-measure 0.896 (for finding multifaceted subtopics)
As applications, the following two were built and their effect was measured:
⇒ search result clustering - improve precision by 5.4 [%] and recall by 6.1 [%]
⇒ search result re-ranking - Δ = 0.61
・Subtopic clustering improved on existing methods with only three new features.
・Only a simple clustering algorithm was tried, but the approach looks applicable to many other algorithms.
・The subtopic clustering results were validated through two applications and shown to be effective. Much more seems possible from here.

Supplementary material

Purity and Inverse Purity
• Purity
• Inverse Purity
⇒ Metrics that quantify how free of impurities the clusters are
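
The formulas on this slide were not preserved in the transcript, so the sketch below uses the standard definitions as an assumption: purity is the fraction of items that fall in the majority category of their cluster, and inverse purity swaps the roles of clusters and categories. The function names and toy data are hypothetical.

```python
from collections import Counter

def purity(clusters, labels):
    """clusters: item -> cluster id; labels: item -> gold category."""
    by_cluster = {}
    for item, c in clusters.items():
        by_cluster.setdefault(c, Counter())[labels[item]] += 1
    # For each cluster, count the items of its majority category.
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(clusters)

def inverse_purity(clusters, labels):
    """Same computation with clusters and categories exchanged."""
    return purity(labels, clusters)

# Toy data (made up): item "c" sits in the wrong cluster.
clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
labels = {"a": "x", "b": "x", "c": "y", "d": "y", "e": "y"}
p = purity(clusters, labels)
ip = inverse_purity(clusters, labels)
```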

B-Cubed precision, recall
Let L(e) and C(e) denote the category and the cluster of an item e. We can denote the correctness of the relation between e and e' in the distribution as:
Correctness(e, e') = 1 if L(e) = L(e') ⟺ C(e) = C(e'), and 0 otherwise
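
B-cubed precision and recall can be computed per item and then averaged. The sketch below assumes the standard definitions: for an item e, precision averages correctness over the items in the same cluster as e, and recall averages it over the items in the same category. The function name `b_cubed` and the toy data are hypothetical.

```python
def b_cubed(clusters, labels):
    """Return (precision, recall, F1); inputs map item -> cluster id / gold category."""
    items = list(clusters)
    p_sum = r_sum = 0.0
    for e in items:
        same_cluster = [x for x in items if clusters[x] == clusters[e]]
        same_label = [x for x in items if labels[x] == labels[e]]
        # Items that share both the cluster and the category with e (includes e itself).
        correct = sum(1 for x in same_cluster if labels[x] == labels[e])
        p_sum += correct / len(same_cluster)
        r_sum += correct / len(same_label)
    precision, recall = p_sum / len(items), r_sum / len(items)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy data (made up): item "c" sits in the wrong cluster.
clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
labels = {"a": "x", "b": "x", "c": "y", "d": "y", "e": "y"}
precision, recall, f1 = b_cubed(clusters, labels)
```

Because every item is compared with itself, each per-item score is strictly positive, so the F1 denominator cannot be zero for non-empty input.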

Yuichiro SEKIGUCHI
