Literature Review: A Document Descriptor using Covariance of Word Vectors

Slide 1

Slide 1 text

A Document Descriptor using Covariance of Word Vectors
Literature review, 2019/02/27
Natural Language Processing Laboratory, Nagaoka University of Technology
稲岡夢人

Slide 2

Slide 2 text

Literature
Title: A Document Descriptor using Covariance of Word Vectors
Author: Marwan Torki
Volume: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 527-532, 2018.

Slide 3

Slide 3 text

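Abstract
- Proposes a fixed-length document representation built from word vectors (Document-Covariance Descriptor; DoCoV)
  → Easy to use in both supervised and unsupervised applications
- Performance comparable to the state of the art (SoTA) on a variety of tasks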

Slide 4

Slide 4 text

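Introduction
- Vector-based document retrieval has a long history
  ← Bag-of-Words, Latent Semantic Indexing (LSI)
- In recent years, word embeddings have been learned with neural language models
- Distributed representations of sentences, paragraphs, and documents, not only words, are also attracting attention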

Slide 5

Slide 5 text

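vs. DoCoV
- doc2vec and FastSent represent a document in the same space as the words
- The covariance encodes the shape of the density formed by the document's words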

Slide 6

Slide 6 text

vs. DoCoV
- doc2vec and FastSent take time to train
- Computing DoCoV (a covariance) is highly parallelizable and fast

Slide 7

Slide 7 text

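DoCoV
- Document Observation Matrix: for d-dimensional word embeddings and a document of n words, define $O \in \mathbb{R}^{n \times d}$ (rows are words, columns are the embedding dimensions)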
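The construction of the observation matrix can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding table `EMBEDDINGS`, the helper `observation_matrix`, the matrix name `O`, and the choice to skip out-of-vocabulary tokens are all assumptions made for this example.

```python
# Hypothetical toy embedding table with d = 3 (real models typically use d = 300).
EMBEDDINGS = {
    "the": [0.1, 0.0, 0.2],
    "movie": [0.7, 0.3, 0.1],
    "was": [0.0, 0.1, 0.0],
    "great": [0.9, 0.8, 0.4],
}


def observation_matrix(tokens, embeddings):
    """Stack the embedding of each known token as one row (n rows, d columns)."""
    # Skipping out-of-vocabulary tokens is an assumption of this sketch.
    return [list(embeddings[t]) for t in tokens if t in embeddings]


# "!" has no embedding here, so the matrix has n = 4 rows and d = 3 columns.
O = observation_matrix(["the", "movie", "was", "great", "!"], EMBEDDINGS)
print(len(O), len(O[0]))
```

The number of rows varies with document length, which is why a further step is needed to obtain a fixed-length descriptor.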

Slide 8

Slide 8 text

DoCoV
- Covariance Matrix
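A minimal sketch of the covariance step, assuming the usual covariance between the columns (embedding dimensions) of an n × d observation matrix. The function name `covariance_matrix` and the toy values are hypothetical, and the 1/n normalization is an assumption; 1/(n-1) is an equally common choice.

```python
def covariance_matrix(O):
    """Return the d x d covariance of the columns of an n x d matrix O."""
    n, d = len(O), len(O[0])
    # Mean of each embedding dimension over the words of the document.
    means = [sum(row[j] for row in O) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in O]
    # Population normalization (1/n) is an assumption; 1/(n-1) is also common.
    return [
        [sum(r[p] * r[q] for r in centered) / n for q in range(d)]
        for p in range(d)
    ]


# Toy document: n = 3 words, d = 2 embedding dimensions (made-up values).
O = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
C = covariance_matrix(O)
print(len(C), len(C[0]))  # always d x d, whatever the document length n
```

Each entry of the result depends only on two columns of the input, so the entries can be computed independently of one another.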

Slide 9

Slide 9 text

DoCoV
- Vectorized representation
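One way to vectorize a symmetric covariance matrix is to keep only its upper triangle, giving d(d+1)/2 values. The sketch below assumes the common convention of scaling off-diagonal entries by √2 so that the Euclidean norm of the vector equals the Frobenius norm of the matrix; the function name `vectorize` and the example matrix are hypothetical.

```python
import math


def vectorize(C):
    """Flatten the upper triangle of a symmetric d x d matrix."""
    d = len(C)
    v = []
    for p in range(d):
        for q in range(p, d):
            # Off-diagonal entries appear twice in C, hence the sqrt(2) weight.
            v.append(C[p][q] if p == q else math.sqrt(2) * C[p][q])
    return v


# Made-up symmetric 3 x 3 matrix standing in for a covariance matrix.
C = [[2.0, 0.5, 0.0],
     [0.5, 1.0, 0.3],
     [0.0, 0.3, 4.0]]
v = vectorize(C)
print(len(v))  # d * (d + 1) / 2 entries
```

With this scaling, distances between two descriptor vectors match the Frobenius distance between the underlying matrices, so standard vector-based classifiers and similarity measures can be applied directly.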

Slide 10

Slide 10 text

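Evaluation
- Measures how the choice of word vectors affects classification performance on IMDB movie reviews
- The document vectors are classified with a linear SVM
- Each review consists of multiple sentences
- Train/Test/Unlabeled: 25K/25K/50K
- Compares pre-trained word2vec and GloVe with word2vec trained on the Train and Unlabeled sets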

Slide 11

Slide 11 text

Result

Slide 12

Slide 12 text

Result

Slide 13

Slide 13 text

Result

Slide 14

Slide 14 text

Result

Slide 15

Slide 15 text

Evaluation
- Evaluates the document vectors on the sentence semantic relatedness datasets SICK and STS 2014
- Uses pre-trained word embeddings (dim=300)
- Scored with Pearson correlation and Spearman correlation
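The two correlation measures can be sketched in plain Python as below. This only illustrates the metrics, not the paper's evaluation script: the function names and the `gold` / `predicted` scores are made up for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up scores: predictions are monotone in the gold scores but not linear.
gold = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [0.1, 0.2, 0.4, 0.8, 1.6]
print(pearson(gold, predicted), spearman(gold, predicted))
```

In this toy case the ordering is perfectly preserved, so the Spearman value is 1 while the Pearson value is lower; reporting both separates ranking quality from linear fit.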

Slide 16

Slide 16 text

Result
Results are comparable to those of other methods that require training

Slide 17

Slide 17 text

Evaluation
- Uses word embeddings pre-trained on Google News
- Datasets: Movie Reviews (MR), Subjectivity (Subj), Customer Reviews (CR), TREC Question (TREC)

Slide 18

Slide 18 text

Result

Slide 19

Slide 19 text

Result

Slide 20

Slide 20 text

Result

Slide 21

Slide 21 text

Result

Slide 22

Slide 22 text

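Conclusions
- Proposes a new vector representation for sentences, paragraphs, and documents
- Requires no iterative training, unlike other methods
- Its usefulness is confirmed on supervised and unsupervised tasks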