Paper Introduction: SCDV: Sparse Composite Document Vectors using soft clustering over distributional representations

T.Tada
August 23, 2018


Transcript

  1. Paper Introduction (2018/Aug/23) SCDV: Sparse Composite Document Vectors using soft clustering over

    distributional representations. Nagaoka University of Technology, Natural Language Processing Laboratory, Taro Tada
  2. About the paper Authors: Dheeraj Mekala (IIT Kanpur), Vivek

    Gupta (Microsoft Research), Bhargavi Paranjape (Microsoft Research), Harish Karnick (IIT Kanpur). Conference: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 659–669, 2017, Association for Computational Linguistics
  3. Abstract • They present a feature vector formation technique for

    documents: the Sparse Composite Document Vector (SCDV). • They outperform the previous state-of-the-art method, NTSG (Liu et al., 2015). • They achieve significant reductions in training and prediction times compared to other representation methods.
  4. Introduction • Distributed word embeddings represent words as dense, low-dimensional,

    real-valued vectors that can capture their semantic and syntactic properties. • Representations based on neural network language models (Mikolov et al., 2013) can overcome the bag-of-words model's flaws (it does not account for word ordering or long-distance semantic relations) and further reduce the dimensionality of the vectors. • However, there is a need to extend word embeddings to entire paragraphs and documents for tasks such as document and short-text classification.
  5. Introduction • Representing entire documents in a dense, low-dimensional space

    is a challenge. • Vectors of two documents that contain the same word in two distinct senses need to account for this distinction for an accurate semantic representation of the documents. • They propose the Sparse Composite Document Vector (SCDV) representation learning technique to address these challenges and create efficient, accurate, and robust semantic representations of large texts for document classification tasks.
  6. Sparse Composite Document Vectors (SCDV) The feature formation algorithm can be

    divided into the following three steps. • Word Vector Clustering • Document Topic-vector Formation • Sparse Document Vectors
  7. Word Vector Clustering Learn d-dimensional word vector representations for

    every word in the vocabulary V using the skip-gram algorithm with negative sampling (SGNS) (Mikolov et al., 2013). Then cluster these word embeddings using the Gaussian Mixture Model (GMM) (Reynolds, 2015) soft clustering technique. The number of clusters, K, to be formed is a parameter of the SCDV model. Each word belongs to every cluster with some probability P(ck|wi).
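The clustering step above can be sketched with scikit-learn's `GaussianMixture` (a minimal sketch, not the authors' code; the random matrix below stands in for SGNS embeddings, and the sizes are toy values rather than the paper's settings):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy stand-in for SGNS embeddings: |V| = 100 words, d = 50 dimensions.
word_vectors = rng.normal(size=(100, 50))

K = 5  # number of clusters, a parameter of the SCDV model
gmm = GaussianMixture(n_components=K, covariance_type="tied", random_state=0)
gmm.fit(word_vectors)

# Soft assignments P(c_k | w_i): one row per word, one column per cluster,
# each row summing to 1 -- every word belongs to every cluster with some probability.
probs = gmm.predict_proba(word_vectors)  # shape (|V|, K)
```

Because the assignment is soft, a polysemous word can carry non-trivial probability mass in several clusters at once, which is what the later steps exploit.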
  8. Document Topic-vector Formation For each word wi, create K

    different d-dimensional word-cluster vectors (→wcvik) by weighting wi's word vector with P(ck|wi). Concatenate all K word-cluster vectors (→wcvik) into a K×d-dimensional embedding and weight it with the inverse document frequency of wi to form a word-topics vector (→wtvi). Finally, for all words appearing in document Dn, sum their word-topics vectors →wtvi to obtain the document vector →dvDn.
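The formation step above can be written out directly in NumPy (a minimal sketch under the slide's definitions; the tiny vocabulary, probabilities, and idf values below are made-up illustration data):

```python
import numpy as np

def word_topic_vectors(word_vectors, probs, idf):
    """wtv_i = idf(w_i) * concat_k( P(c_k | w_i) * wv_i ), shape (|V|, K*d)."""
    V, d = word_vectors.shape
    K = probs.shape[1]
    # wcv[i, k] = P(c_k | w_i) * wv_i  -> shape (V, K, d)
    wcv = probs[:, :, None] * word_vectors[:, None, :]
    # Concatenate the K word-cluster vectors and weight by inverse document frequency.
    return idf[:, None] * wcv.reshape(V, K * d)

def document_vector(doc_word_ids, wtv):
    """dv_Dn: sum of word-topics vectors of the words appearing in document Dn."""
    return wtv[doc_word_ids].sum(axis=0)

# Tiny worked example: |V| = 3 words, d = 2, K = 2 clusters.
wv = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
probs = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]])  # P(c_k | w_i), rows sum to 1
idf = np.array([1.0, 2.0, 1.0])
wtv = word_topic_vectors(wv, probs, idf)      # shape (3, 4)
dv = document_vector([0, 1], wtv)             # document containing words 0 and 1
```

Note how word 0, split evenly between both clusters, contributes to both halves of the K×d vector, while words 1 and 2 each contribute to only one half.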
  9. Sparse Document Vectors Most values in →dvDn are very close

    to zero. SCDV exploits this fact by zeroing out values whose absolute value falls below a threshold (specified as a parameter), which results in the Sparse Composite Document Vector →SCDVDn.
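The sparsification step can be sketched as follows (a simplified reading of the paper's thresholding, where the cutoff is a percentage of the average magnitude of the per-document minimum and maximum values; the input matrix is made-up illustration data):

```python
import numpy as np

def sparsify(dvs, p=4.0):
    """Zero every entry of the document vectors whose absolute value falls below
    t = (p/100) * (|a_min| + |a_max|) / 2, where a_min and a_max are the
    per-document min and max entries averaged over all documents."""
    a_min = np.abs(dvs.min(axis=1)).mean()
    a_max = np.abs(dvs.max(axis=1)).mean()
    t = (p / 100.0) * (a_min + a_max) / 2.0
    out = dvs.copy()
    out[np.abs(out) < t] = 0.0
    return out

# One toy document vector: the near-zero entries get truncated to exactly zero.
dvs = np.array([[0.001, 1.0, -1.0, 0.0005]])
sparse = sparsify(dvs, p=4.0)
```

Making the small values exactly zero is what allows the compact storage and fast linear classification reported later in the slides.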
  10. Sparse Document Vectors (Figures: word-topics vector formation; Sparse Composite Document Vector formation.)
  11. Experiments • Baselines • Text Classification ・Multi-class classification ・Multi-label classification

    ・Effect of Hyper-Parameters • Topic Coherence • Context-Sensitive Learning • Information Retrieval
  12. Baselines Use the best parameter settings as reported: Bag-of-Words (BoW) model

    (Harris, 1954), Bag of Word Vectors (BoWV) (Gupta et al., 2016) model, paragraph vector models (Le and Mikolov, 2014), Topical Word Embeddings (TWE-1) (Liu et al., 2015b), Neural Tensor Skip-Gram Model (NTSG-1 to NTSG-3) (Liu et al., 2015a), tf-idf weighted average word-vector model (Singh and Mukerjee, 2015), and weighted Bag of Concepts (weight-BoC) (Kim et al., 2017).
  13. Multi-class classification Dataset: 20NewsGroup. For SCDV, they set the

    word-vector dimension to 200, the sparsity threshold parameter to 4%, and the number of mixture components in the GMM to 60. They learn word embeddings using Skip-Gram with Negative Sampling (SGNS) with a window size of 10 and a minimum word frequency of 20, and use 5-fold cross-validation on F1 score to tune parameter C of the SVM. The current state of the art: NTSG (Neural Tensor Skip-Gram).
  14. Multi-label classification Dataset: Reuters-21578. They use LinearSVM for multi-class classification

    and logistic regression with a OneVsRest setting for multi-label classification, for both the baselines and SCDV.
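The classifier setup described on the two slides above can be sketched with scikit-learn (a minimal sketch: the random matrix stands in for SCDV document vectors, the label arrays are made up, and C is fixed rather than tuned by the paper's 5-fold cross-validation):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 12))  # stand-in for SCDV document vectors

# Multi-class setting: a linear SVM over the document vectors.
y_multiclass = rng.integers(0, 3, size=40)
clf = LinearSVC(C=1.0).fit(X, y_multiclass)

# Multi-label setting: logistic regression in a one-vs-rest arrangement,
# with one binary indicator column per label.
Y_multilabel = rng.integers(0, 2, size=(40, 4))
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y_multilabel)
preds = ovr.predict(X)  # shape (n_documents, n_labels)
```

Because SCDV vectors are sparse and high-dimensional, linear models like these are a natural fit, which is one of the efficiency points made in the analysis slides.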
  15. Context-Sensitive Learning They select some words (wj) with multiple senses from

    20Newsgroup and examine their cluster assignments (ci) to check whether different clusters capture different senses.
  16. Analysis and Discussion SCDV overcomes several challenges encountered while training

    document vectors. 1. Clustering word embeddings to discover topics improves classification performance and also generates coherent clusters of words. Clustering gives more discriminative representations of documents than paragraph vectors. This enables SCDV to represent complex documents.
  17. Analysis and Discussion Visualization of paragraph vectors (left) and SCDV (right) using

    t-SNE
  18. Analysis and Discussion 2. Semantically different words are assigned to

    different topics, and a single document can contain words from multiple different topics. 3. Sparsity also enables linear SVMs to scale to large dimensions. On 20NewsGroups, the BoWV model takes up 1.1 GB while SCDV takes up only 236 MB (an 80% decrease).
  19. Analysis and Discussion 4. SCDV reduces document vector formation, training, and

    prediction time significantly.
  20. Conclusion • They propose a document feature formation technique for

    topic-based document representation. • SCDV outperforms state-of-the-art models in multi-class and multi-label classification tasks. • They show that fuzzy GMM clustering on word vectors leads to more coherent topics than LDA and can also be used to detect polysemous words. • SCDV is simple, efficient, and creates a more accurate semantic representation of documents.