Paper Introduction: SCDV: Sparse Composite Document Vectors using soft clustering over distributional representations

T.Tada
August 23, 2018


  1. Paper Introduction (2018/Aug/23)
    SCDV: Sparse Composite Document Vectors using soft clustering
    over distributional representations
    Nagaoka University of Technology, Natural Language Processing Laboratory
    Taro Tada


  2. About the paper
    Authors:
    Dheeraj Mekala : IIT Kanpur
    Vivek Gupta : Microsoft Research
    Bhargavi Paranjape : Microsoft Research
    Harish Karnick : IIT Kanpur
    Conference:
    Proceedings of the 2017 Conference on Empirical Methods in Natural
    Language Processing, pages 659–669,
    Association for Computational Linguistics, 2017


  3. Abstract
    ● They present a feature-vector formation technique for documents,
    the Sparse Composite Document Vector (SCDV).
    ● They outperform the previous state-of-the-art method,
    NTSG (Liu et al., 2015).
    ● They achieve a significant reduction in training and prediction times
    compared to other representation methods.


  4. Introduction
    ● Distributed word embeddings represent words as dense, low-dimensional,
    real-valued vectors that can capture their semantic and syntactic properties.
    ● Representations based on neural network language models (Mikolov et al.,
    2013) can overcome the bag-of-words model's flaws (it does not account for
    word ordering or long-distance semantic relations) and further reduce the
    dimensionality of the vectors.
    ● However, there is a need to extend word embeddings to entire paragraphs
    and documents for tasks such as document and short-text classification.


  5. Introduction
    ● Representing entire documents in a dense, low-dimensional space is a
    challenge.
    ● The vectors of two documents that contain the same word in two distinct
    senses need to account for this distinction for an accurate semantic
    representation of the documents.
    ● They propose the Sparse Composite Document Vector (SCDV) representation
    learning technique to address these challenges and create efficient, accurate
    and robust semantic representations of large texts for document classification
    tasks.


  6. Sparse Composite Document Vectors (SCDV)
    The feature formation algorithm can be divided into the following three steps.
    ● Word Vector Clustering
    ● Document Topic-vector Formation
    ● Sparse Document Vectors


  7. Word Vector Clustering
    Learn d-dimensional word vector representations for every word in the
    vocabulary V using the skip-gram algorithm with negative sampling (SGNS)
    (Mikolov et al., 2013).
    Then cluster these word embeddings using the Gaussian Mixture Model (GMM)
    (Reynolds, 2015) soft clustering technique.
    The number of clusters, K, to be formed is a parameter of the SCDV model.
    Each word belongs to every cluster with some probability P(ck|wi).
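
    A minimal sketch of this step in Python (an assumed setup, not the
    authors' code): `sentences` stands for a tokenized corpus, and the
    hyper-parameter values mirror the 20NewsGroup settings reported later.

    import numpy as np
    from gensim.models import Word2Vec          # gensim >= 4.0
    from sklearn.mixture import GaussianMixture

    # Learn d-dimensional SGNS word vectors (sg=1: skip-gram; negative sampling).
    model = Word2Vec(sentences, vector_size=200, sg=1, negative=5,
                     window=10, min_count=20)
    vocab = list(model.wv.index_to_key)
    word_vectors = np.vstack([model.wv[w] for w in vocab])   # |V| x d

    # Soft clustering: every word gets a membership probability for each cluster.
    K = 60
    gmm = GaussianMixture(n_components=K, random_state=0).fit(word_vectors)
    p_ck_given_wi = gmm.predict_proba(word_vectors)          # |V| x K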


  8. Document Topic-vector Formation
    For each word wi, they create K different word-cluster vectors of d
    dimensions (→wcvik) by scaling wi's word vector with P(ck|wi).
    They concatenate all K word-cluster vectors (→wcvik) into a K×d dimensional
    embedding and weight it with the inverse document frequency of wi to form a
    word-topics vector (→wtvi).
    Finally, for all words appearing in document Dn, they sum the word-topics
    vectors →wtvi to obtain the document vector →dvDn.
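
    Continuing the sketch, a direct (unoptimized) rendering of this step;
    `idf` is an assumed precomputed dict mapping each word to its inverse
    document frequency.

    d = word_vectors.shape[1]
    word_idx = {w: i for i, w in enumerate(vocab)}

    def word_topics_vector(w):
        # wtv_i = idf(w_i) * concat_k( wv_i * P(c_k|w_i) ): a K*d vector.
        i = word_idx[w]
        wcv = np.concatenate([word_vectors[i] * p_ck_given_wi[i, k]
                              for k in range(K)])
        return idf[w] * wcv

    def document_vector(doc_tokens):
        # dv_Dn = sum of wtv_i over all in-vocabulary words of document Dn.
        dv = np.zeros(K * d)
        for w in doc_tokens:
            if w in word_idx:
                dv += word_topics_vector(w)
        return dv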


  9. Sparse Document Vectors
    Most values in →dvDn are very close to zero.
    They utilize this fact by zeroing out entries below a threshold (specified
    as a parameter), which results in the Sparse Composite Document Vector
    →SCDVDn.
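
    A simplified sketch of the thresholding; note the paper derives the
    threshold from corpus-level statistics, whereas this version uses only the
    current vector's largest magnitude.

    def make_sparse(dv, p=4.0):
        # Zero out near-zero entries; p is the sparsity threshold in percent.
        t = (p / 100.0) * np.abs(dv).max()
        out = dv.copy()
        out[np.abs(out) < t] = 0.0
        return out

    scdv = make_sparse(document_vector(doc_tokens))   # doc_tokens: one document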


  10. Sparse Document Vectors
    [Figures: Word-topics vector formation; Sparse Composite Document Vector
    formation]


  11. Experiments
    ● Baselines
    ● Text Classification
    ・Multi-class classification
    ・Multi-label classification
    ・Effect of Hyper-Parameters
    ● Topic Coherence
    ● Context-Sensitive Learning
    ● Information Retrieval


  12. Baselines
    They use the best parameter settings as reported for each baseline:
    Bag-of-Words (BoW) model (Harris, 1954),
    Bag of Word Vectors (BoWV) model (Gupta et al., 2016),
    paragraph vector models (Le and Mikolov, 2014),
    Topical Word Embeddings (TWE-1) (Liu et al., 2015b),
    Neural Tensor Skip-Gram Model (NTSG-1 to NTSG-3) (Liu et al., 2015a),
    tf-idf weighted average word-vector model (Singh and Mukerjee, 2015),
    weighted Bag of Concepts (weight-BoC) (Kim et al., 2017).


  13. Multi-class classification
    Dataset: 20NewsGroup.
    In SCDV, they set
    the word-vector dimension to 200,
    the sparsity threshold parameter to 4%, and
    the number of mixture components in GMM to 60.
    They learn word vector embeddings using
    Skip-Gram with Negative Sampling (SGNS) with a window size of 10 and a
    minimum word frequency of 20.
    They use 5-fold cross-validation on F1 score to
    tune the parameter C of the SVM.
    The current state of the art:
    NTSG (Neural-Tensor-Skip-Gram)
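
    A sketch of this setup with scikit-learn; `X_train`/`y_train` are assumed
    to hold the SCDV vectors and 20NewsGroup labels, and macro-averaged F1 is
    an assumption (the slide only says "F1 score").

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import LinearSVC

    # Tune C by 5-fold cross-validation on F1, then evaluate on the test split.
    grid = GridSearchCV(LinearSVC(), param_grid={"C": [0.01, 0.1, 1, 10]},
                        scoring="f1_macro", cv=5)
    grid.fit(X_train, y_train)
    print(grid.best_params_, grid.score(X_test, y_test))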


  14. Multi-label classification
    Dataset: Reuters-21578.
    They use a linear SVM for multi-class classification and logistic
    regression in a one-vs-rest setting for multi-label classification, for
    both the baselines and SCDV.
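
    A sketch of the multi-label setup, again with scikit-learn and assumed
    variable names; `train_labels` is a list of label sets, one per document.

    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    # One binary logistic-regression classifier per Reuters-21578 label.
    mlb = MultiLabelBinarizer()
    Y_train = mlb.fit_transform(train_labels)   # label sets -> indicator matrix
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    clf.fit(X_train, Y_train)
    Y_pred = clf.predict(X_test)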


  15. Context-Sensitive Learning
    They select some words (wj) with multiple senses from 20Newsgroup and
    examine, for each cluster (ci) the word belongs to, which sense of the
    word that cluster captures.


  16. Analysis and Discussion
    SCDV overcomes several challenges encountered while training document
    vectors.
    1. Clustering word embeddings to discover topics improves classification
    performance and also generates coherent clusters of words.
    Clustering gives more discriminative representations of documents than
    paragraph vectors. This enables SCDV to represent complex documents.


  17. Analysis and Discussion
    [Figure: Visualization of paragraph vectors (left) and SCDV (right) using
    t-SNE]


  18. Analysis and Discussion
    2. Semantically different words are assigned to different topics, and a
    single document can contain words from multiple different topics.
    3. Sparsity also enables linear SVM to scale to large dimensions.
    On 20NewsGroups, the BoWV model takes up 1.1 GB while SCDV takes up only
    236 MB (an 80% decrease).
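
    The saving follows from storing only the non-zero entries; a small sketch
    with scipy, reusing the earlier make_sparse/document_vector helpers
    (`corpus` is an assumed list of tokenized documents; sizes will vary).

    from scipy import sparse

    dense = np.vstack([make_sparse(document_vector(doc)) for doc in corpus])
    csr = sparse.csr_matrix(dense)   # keeps only the non-zero entries
    dense_mb = dense.nbytes / 1e6
    csr_mb = (csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes) / 1e6
    print(f"dense: {dense_mb:.1f} MB, sparse: {csr_mb:.1f} MB")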


  19. Analysis and Discussion
    4. SCDV significantly reduces document vector formation, training, and
    prediction times.


  20. Conclusion
    ● They propose a document feature formation technique for topic-based
    document representation.
    ● SCDV outperforms state-of-the-art models in multi-class and multi-label
    classification tasks.
    ● They show that fuzzy GMM clustering on word vectors leads to more
    coherent topics than LDA and can also be used to detect polysemous words.
    ● SCDV is simple and efficient, and creates a more accurate semantic
    representation of documents.
