Paper Introduction: SCDV: Sparse Composite Document Vectors using soft clustering over distributional representations

T.Tada
August 23, 2018


Transcript

  1. Paper Introduction (2018/Aug/23) SCDV: Sparse Composite Document Vectors using soft clustering over

    distributional representations. Nagaoka University of Technology, Natural Language Processing Laboratory, Taro Tada
  2. About the paper Authors: Dheeraj Mekala (IIT Kanpur), Vivek

    Gupta (Microsoft Research), Bhargavi Paranjape (Microsoft Research), Harish Karnick (IIT Kanpur). Conference: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 659–669, 2017, Association for Computational Linguistics
  3. Abstract • They present a feature vector formation technique for

    documents: the Sparse Composite Document Vector (SCDV). • They outperform the previous state-of-the-art method, NTSG (Liu et al., 2015). • They achieve significant reductions in training and prediction times compared to other representation methods.
  4. Introduction • Distributed word embeddings represent words as dense, low-dimensional,

    real-valued vectors that can capture their semantic and syntactic properties. • Representations based on neural network language models (Mikolov et al., 2013) can overcome the bag-of-words model's flaws (it does not account for word ordering or long-distance semantic relations) and further reduce the dimensionality of the vectors. • However, there is a need to extend word embeddings to entire paragraphs and documents for tasks such as document and short-text classification.
  5. Introduction • Representing entire documents in a dense, low-dimensional space

    is a challenge. • Vectors of two documents that contain the same word in two distinct senses need to account for this distinction for an accurate semantic representation of the documents. • They propose the Sparse Composite Document Vector (SCDV) representation learning technique to address these challenges and create efficient, accurate, and robust semantic representations of large texts for document classification tasks.
  6. Sparse Composite Document Vectors (SCDV) The feature formation algorithm can be

    divided into the following three steps. • Word Vector Clustering • Document Topic-vector Formation • Sparse Document Vectors
  7. Word Vector Clustering Learn d-dimensional word vector representations for

    every word in the vocabulary V using the skip-gram algorithm with negative sampling (SGNS) (Mikolov et al., 2013). Then cluster these word embeddings using the Gaussian Mixture Model (GMM) (Reynolds, 2015) soft clustering technique. The number of clusters, K, to be formed is a parameter of the SCDV model. Each word belongs to every cluster with some probability P(ck|wi).
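The clustering step above can be sketched with scikit-learn's `GaussianMixture` (a minimal sketch, not the authors' code; the random matrix below stands in for SGNS embeddings, and the sizes are toy values rather than the paper's settings):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy stand-in for SGNS embeddings: |V| = 100 words, d = 50 dimensions.
word_vectors = rng.normal(size=(100, 50))

K = 5  # number of clusters, a parameter of the SCDV model
gmm = GaussianMixture(n_components=K, covariance_type="tied", random_state=0)
gmm.fit(word_vectors)

# Soft assignments P(c_k | w_i): one row per word, one column per cluster,
# each row summing to 1 -- every word belongs to every cluster with some probability.
probs = gmm.predict_proba(word_vectors)  # shape (|V|, K)
```

Because the assignment is soft, a polysemous word can carry non-trivial probability mass in several clusters at once, which is what the later steps exploit.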
  8. Document Topic-vector Formation For each word wi, create K

    different d-dimensional word-cluster vectors (→wcvik) by weighting wi's word vector with P(ck|wi). Concatenate all K word-cluster vectors (→wcvik) into a K×d-dimensional embedding and weight it with the inverse document frequency of wi to form a word-topics vector (→wtvi). Finally, for all words appearing in document Dn, sum their word-topics vectors →wtvi to obtain the document vector →dvDn.
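The formation step above can be written out directly in NumPy (a minimal sketch under the slide's definitions; the tiny vocabulary, probabilities, and idf values below are made-up illustration data):

```python
import numpy as np

def word_topic_vectors(word_vectors, probs, idf):
    """wtv_i = idf(w_i) * concat_k( P(c_k | w_i) * wv_i ), shape (|V|, K*d)."""
    V, d = word_vectors.shape
    K = probs.shape[1]
    # wcv[i, k] = P(c_k | w_i) * wv_i  -> shape (V, K, d)
    wcv = probs[:, :, None] * word_vectors[:, None, :]
    # Concatenate the K word-cluster vectors and weight by inverse document frequency.
    return idf[:, None] * wcv.reshape(V, K * d)

def document_vector(doc_word_ids, wtv):
    """dv_Dn: sum of word-topics vectors of the words appearing in document Dn."""
    return wtv[doc_word_ids].sum(axis=0)

# Tiny worked example: |V| = 3 words, d = 2, K = 2 clusters.
wv = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
probs = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]])  # P(c_k | w_i), rows sum to 1
idf = np.array([1.0, 2.0, 1.0])
wtv = word_topic_vectors(wv, probs, idf)      # shape (3, 4)
dv = document_vector([0, 1], wtv)             # document containing words 0 and 1
```

Note how word 0, split evenly between both clusters, contributes to both halves of the K×d vector, while words 1 and 2 each contribute to only one half.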
  9. Sparse Document Vectors Most values in →dvDn are very close

    to zero. SCDV exploits this fact by zeroing out values whose absolute value falls below a threshold (specified as a parameter), which results in the Sparse Composite Document Vector →SCDVDn.
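The sparsification step can be sketched as follows (a simplified reading of the paper's thresholding, where the cutoff is a percentage of the average magnitude of the per-document minimum and maximum values; the input matrix is made-up illustration data):

```python
import numpy as np

def sparsify(dvs, p=4.0):
    """Zero every entry of the document vectors whose absolute value falls below
    t = (p/100) * (|a_min| + |a_max|) / 2, where a_min and a_max are the
    per-document min and max entries averaged over all documents."""
    a_min = np.abs(dvs.min(axis=1)).mean()
    a_max = np.abs(dvs.max(axis=1)).mean()
    t = (p / 100.0) * (a_min + a_max) / 2.0
    out = dvs.copy()
    out[np.abs(out) < t] = 0.0
    return out

# One toy document vector: the near-zero entries get truncated to exactly zero.
dvs = np.array([[0.001, 1.0, -1.0, 0.0005]])
sparse = sparsify(dvs, p=4.0)
```

Making the small values exactly zero is what allows the compact storage and fast linear classification reported later in the slides.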
  10. Sparse Document Vectors (Figures: word-topics vector formation; Sparse Composite Document Vector formation.)
  11. Experiments • Baselines • Text Classification ・Multi-class classification ・Multi-label classification

    ・Effect of Hyper-Parameters • Topic Coherence • Context-Sensitive Learning • Information Retrieval
  12. Baselines Use the best parameter settings as reported: Bag-of-Words (BoW) model

    (Harris, 1954), Bag of Word Vectors (BoWV) (Gupta et al., 2016) model, paragraph vector models (Le and Mikolov, 2014), Topical Word Embeddings (TWE-1) (Liu et al., 2015b), Neural Tensor Skip-Gram Model (NTSG-1 to NTSG-3) (Liu et al., 2015a), tf-idf weighted average word-vector model (Singh and Mukerjee, 2015), and weighted Bag of Concepts (weight-BoC) (Kim et al., 2017).
  13. Multi-class classification Dataset: 20NewsGroup. For SCDV, they set the

    word-vector dimension to 200, the sparsity threshold parameter to 4%, and the number of mixture components in the GMM to 60. They learn word embeddings using Skip-Gram with Negative Sampling (SGNS) with a window size of 10 and a minimum word frequency of 20, and use 5-fold cross-validation on F1 score to tune parameter C of the SVM. The current state of the art: NTSG (Neural Tensor Skip-Gram).
  14. Multi-label classification Dataset: Reuters-21578. They use LinearSVM for multi-class classification

    and logistic regression with a OneVsRest setting for multi-label classification, for both the baselines and SCDV.
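The classifier setup described on the two slides above can be sketched with scikit-learn (a minimal sketch: the random matrix stands in for SCDV document vectors, the label arrays are made up, and C is fixed rather than tuned by the paper's 5-fold cross-validation):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 12))  # stand-in for SCDV document vectors

# Multi-class setting: a linear SVM over the document vectors.
y_multiclass = rng.integers(0, 3, size=40)
clf = LinearSVC(C=1.0).fit(X, y_multiclass)

# Multi-label setting: logistic regression in a one-vs-rest arrangement,
# with one binary indicator column per label.
Y_multilabel = rng.integers(0, 2, size=(40, 4))
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y_multilabel)
preds = ovr.predict(X)  # shape (n_documents, n_labels)
```

Because SCDV vectors are sparse and high-dimensional, linear models like these are a natural fit, which is one of the efficiency points made in the analysis slides.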
  15. Context-Sensitive Learning They select some words (wj) with multiple senses from

    20Newsgroup and examine their cluster assignments (ci) to check whether different clusters capture different senses.
  16. Analysis and Discussion SCDV overcomes several challenges encountered while training

    document vectors. 1. Clustering word embeddings to discover topics improves classification performance and also generates coherent clusters of words. Clustering gives more discriminative representations of documents than paragraph vectors. This enables SCDV to represent complex documents.
  17. Analysis and Discussion Visualization of paragraph vectors (left) and SCDV (right) using

    t-SNE
  18. Analysis and Discussion 2. Semantically different words are assigned to

    different topics, and a single document can contain words from multiple different topics. 3. Sparsity also enables linear SVMs to scale to large dimensions. On 20NewsGroups, the BoWV model takes up 1.1 GB while SCDV takes up only 236 MB (an 80% decrease).
  19. Analysis and Discussion 4. SCDV reduces document vector formation, training, and

    prediction time significantly.
  20. Conclusion • They propose a document feature formation technique for

    topic-based document representation. • SCDV outperforms state-of-the-art models in multi-class and multi-label classification tasks. • They show that fuzzy GMM clustering on word vectors leads to more coherent topics than LDA and can also be used to detect polysemous words. • SCDV is simple, efficient, and creates a more accurate semantic representation of documents.