Literature Review_20181024_An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation

- Literature Review 2018/10/24 - An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation
Taro Tada, Natural Language Processing Laboratory, Nagaoka University of Technology

About the paper
Authors: Jey Han Lau, Timothy Baldwin
IBM Research
Conference: Proceedings of the 1st Workshop on Representation Learning for NLP, pages 78–86, 2016, Association for Computational Linguistics

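Abstract
・The performance reported in the original doc2vec paper has been difficult to reproduce
・doc2vec is evaluated experimentally on two tasks
・Strong performance is confirmed for models trained on large external corpora and for models using pre-trained word embeddings
・Recommended hyper-parameter values are proposed for general-purpose use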

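Introduction
The study focuses on the following questions:
(1) How effective is doc2vec on different tasks?
(2) Which is better, dmpv or dbow?
(3) Can doc2vec be improved by hyper-parameter optimisation or by pre-trained word embeddings?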

Evaluation Tasks
1. Forum Question Duplication
2. Semantic Textual Similarity
Models are trained on small in-domain document collections

Evaluation Tasks
1. Forum Question Duplication
　12 subforums extracted from StackExchange
　Training: 50M to 1B question pairs
　Test: 30M to 300M question pairs
2. Semantic Textual Similarity

Evaluation Tasks 1. Forum Question Duplication

Evaluation Tasks
1. Forum Question Duplication
2. Semantic Textual Similarity
　A shared task that is part of *SEM and SemEval
　The task is to measure the similarity of sentence pairs
　5 domains, each with 375 to 750 annotated pairs
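The pairwise similarity scoring in this task can be sketched as follows. This is a minimal sketch, assuming each sentence has already been mapped to an embedding vector and that cosine similarity is used as the score. The slide does not state the scoring function, and the function name and example vectors below are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: an all-zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional embeddings for one sentence pair.
sentence_a = [0.2, 0.1, 0.7]
sentence_b = [0.1, 0.0, 0.9]
score = cosine_similarity(sentence_a, sentence_b)
```

A score close to 1 means the two vectors point in nearly the same direction; the scores for all pairs in a domain could then be compared against the annotated similarity values.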

Evaluation Tasks 2. Semantic Textual Similarity

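Optimal Hyper-parameter Settings
Training with Large External Corpora
The experiments use dbow, which gave the better results in the experiments so far
Hyper-parameters are optimised on the development data with the following parameters fixed
　・initial learning rate: 0.025
　・minimum learning rate: 0.0001
The effectiveness of training on large external corpora is examined
　・English Wikipedia
　・Associated Press English news articles from 2009 to 2015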
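To illustrate how the two fixed learning-rate values relate during training, here is a minimal sketch, assuming the rate decays linearly from the initial value to the minimum value over training, as word2vec-style trainers commonly do. The slide does not describe the schedule, and the function name and step counts are hypothetical.

```python
def linear_learning_rate(step, total_steps, initial=0.025, minimum=0.0001):
    """Learning rate at `step`, decayed linearly from `initial` to `minimum`."""
    if total_steps <= 1:
        return initial
    # Clamp the step into the valid range, then map it to [0, 1].
    clamped = min(max(step, 0), total_steps - 1)
    progress = clamped / (total_steps - 1)
    return initial - (initial - minimum) * progress


# Hypothetical 5-step schedule: starts at 0.025 and ends at about 0.0001.
schedule = [linear_learning_rate(s, 5) for s in range(5)]
```

Fixing both endpoints leaves the remaining hyper-parameters as the only quantities tuned on the development data.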

Optimal Hyper-parameter Settings

Improving doc2vec with Pre-trained Word Embeddings

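Conclusion
・Document embeddings were evaluated on two tasks
・dbow gave better results than dmpv
・Recommended hyper-parameter values are proposed for general-purpose applications
・Training on large external corpora and using pre-trained models gives robust performance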
