Text summarization Phase 1 evaluation 2

Phase 2 evaluation presentation of the text summarization final-year project, carried out under Professor U. A. Deshpande in collaboration with TCS.

In Phase 2 we studied how word embeddings are computed to represent an arbitrary word as a fixed-length numerical vector (CBOW and Skip-gram), and read up on the skip-thought architecture and GRUs.

We also built a basic text summarization prototype using k-means clustering, which achieves roughly 60% to 75% accuracy when evaluated against the BBC News dataset (an example can be found in the slides).

Phase 1 presentation: https://speakerdeck.com/gautamabhishek46/text-summarization-phase-1-evaluation

Team:
Abhishek Gautam
Atharva Parwatkar
Sharvil Nagarkar

Professor in-charge: U. A. Deshpande
TCS Mentor: Dr. Sagar Sunkle

Abhishek Gautam

December 01, 2018

Transcript

  1. Text Summarization
    Abhishek Gautam (BT15CSE002)
    Atharva Parwatkar (BT15CSE015)
    Sharvil Nagarkar (BT15CSE052)
    Under Prof. U. A. Deshpande

  2. Problem
    ● Neural networks cannot work with raw text directly; they need numerical input.
    ● Sentences and documents are of variable length.
    ● How can words or sentences be represented as fixed-length vectors?

  3. Word Embeddings
    ● Word embeddings are representations of plain-text words as fixed-size
    numerical vectors.
    ● They can capture the context of a word in a document, semantic and
    syntactic similarity, relations with other words, etc.

  4. Types of word embeddings
    ● Frequency based
    ● Prediction based

  5. Frequency based word embeddings (Overview)

  6. Advantages
    ● Fast computation
    ● Preserves the semantic relationship between words.
    Disadvantages
    ● Huge memory requirement.
    ● The resulting embeddings are generally of lower quality than those from prediction-based methods.
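
    As a rough, hypothetical illustration of the frequency-based idea (not taken
    from the slides): each word can be described by its co-occurrence counts with
    every other vocabulary word, which is why memory grows with vocabulary size.
    A minimal sketch using scikit-learn, with a made-up toy corpus:

    from sklearn.feature_extraction.text import CountVectorizer
    import numpy as np

    corpus = ["the quick brown fox", "the lazy brown dog"]   # toy corpus
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)        # document-term count matrix
    cooc = (X.T @ X).toarray()                  # word-word co-occurrence counts
    np.fill_diagonal(cooc, 0)                   # ignore self co-occurrence
    print(vectorizer.get_feature_names_out())
    print(cooc)                                 # each row is a word's count-based "embedding"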

  7. Prediction based word embeddings

  8. Different prediction based word embeddings
    1. word2vec
    2. doc2vec
    3. FastText
    doc2vec and FastText are extensions of word2vec.

  9. One-Hot-Encoding
    [Diagram: representation of the word “sample”: the word is mapped to its
    vocabulary index, and the one-hot array contains a 1 at that index and 0 at
    every other index]
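
    A minimal sketch of one-hot encoding; the vocabulary and the word are made up
    for illustration:

    import numpy as np

    vocab = ["this", "is", "a", "sample", "sentence"]    # toy vocabulary
    word_to_index = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        # Array of zeros with a single 1 at the word's vocabulary index.
        vec = np.zeros(len(vocab))
        vec[word_to_index[word]] = 1.0
        return vec

    print(one_hot("sample"))   # [0. 0. 0. 1. 0.]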

  10. word2vec
    word2vec uses one of the following self-supervised model architectures to
    produce word embeddings:
    1. CBOW (continuous bag of words)
    2. Skip-gram model

  11. CBOW
    ● This architecture predicts the probability of a word given a
    context window.
    ● It takes one-hot encoded vectors of the context words as input.
    Example: try to predict “Fox” given “Quick”, “Brown”, “Jump” and “Over”.

  12. CBOW architecture (for one word context)
    [Architecture diagram: one-hot encoded input vector → hidden layer →
    softmax probabilities for each vocabulary word]

  13. CBOW
    ● One hidden layer is used.
    ● No activation function is used in the hidden layer.
    ● A softmax activation function is used in the output layer.
    ● The error is calculated by subtracting the target one-hot encoded array
    from the softmax probabilities.
    ● The error is back-propagated using gradient descent.
    ● The size of the hidden layer equals the length of the fixed-size word
    embedding vector.
    (A small NumPy sketch of this network follows the slide.)

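    A minimal NumPy sketch of the single-word-context network described above,
    with toy sizes, random weights and one made-up (context, target) pair; a real
    implementation would loop over an entire corpus:

    import numpy as np

    V, N = 10, 4                             # vocabulary size, embedding (hidden layer) size
    W_in = np.random.randn(V, N) * 0.01      # input -> hidden weights (the word embeddings)
    W_out = np.random.randn(N, V) * 0.01     # hidden -> output weights
    lr = 0.05

    def one_hot(i, size):
        v = np.zeros(size)
        v[i] = 1.0
        return v

    context_idx, target_idx = 2, 7           # toy (context word, target word) pair
    x = one_hot(context_idx, V)
    t = one_hot(target_idx, V)

    # Forward pass: linear hidden layer (no activation), softmax output layer.
    h = W_in.T @ x                           # equals the embedding row W_in[context_idx]
    u = W_out.T @ h
    y = np.exp(u - u.max()); y /= y.sum()    # softmax probabilities over the vocabulary

    # Error: softmax probabilities minus the target one-hot array.
    e = y - t

    # Gradient-descent update (back-propagation through both weight matrices).
    W_out -= lr * np.outer(h, e)
    W_in[context_idx] -= lr * (W_out @ e)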

  14. CBOW architecture (for multi-word context)
    ● Calculate the one-hot encoded vector of each context word.
    ● Concatenate them in the same order as in the sentence.
    ● Pass this vector as the input to the CBOW network.

  15. CBOW Advantages
    ● Lower storage requirement than frequency-based word embeddings.
    ● Can perform analogy reasoning such as: King - Man + Woman ≈ Queen.
    CBOW Disadvantages
    ● Takes the average of a word's contexts, so different senses get mixed
    (e.g. “Apple” the company vs. “apple” the fruit).
    ● Long training time.

  16. Skip-gram model
    ● This architecture predicts the context of a given word.
    ● It takes the one-hot encoded vector of the word as input.
    Example: try to predict “Quick”, “Brown”, “Jump” and “Over” given “Fox”.

  17. Skip-gram model
    ● One hidden layer is used.
    ● No activation function is used in the hidden layer.
    ● A softmax activation function is used in the output layer.
    ● The error is calculated by subtracting the target one-hot encoded array
    from the softmax probabilities.
    ● The error is back-propagated using gradient descent.
    ● The size of the hidden layer equals the length of the fixed-size word
    embedding vector.
    (A Gensim usage sketch covering both architectures follows the slide.)

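    In practice both architectures are available through Gensim's Word2Vec class,
    selected by the sg flag; a minimal sketch with a toy, pre-tokenized corpus and
    illustrative parameter values (Gensim 4.x argument names):

    from gensim.models import Word2Vec

    sentences = [["quick", "brown", "fox", "jumps", "over", "lazy", "dog"],
                 ["quick", "brown", "dog"]]          # toy tokenized corpus

    # sg=0 selects the CBOW architecture, sg=1 selects skip-gram.
    cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
    skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

    print(skipgram.wv["fox"].shape)   # a fixed-length vector, here (50,)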

  18. Skip-gram model Advantages
    ● The skip-gram model can capture more than one semantics for a single word
    (e.g. separate representations for “Apple” the company and “apple” the fruit).
    Skip-gram model Disadvantages
    ● Fails to identify combined words, e.g. “New York”.

  19. Working Model
    ● Workflow Diagram
    ● Data Preprocessing
    ● Creation of Word Embeddings
    ● Clusterization
    ● Summarization

  20. Workflow Diagram
    [Flowchart: Document → Data Preprocessing → Building Word Embeddings →
    Sentence Embeddings → Clusterization → Summarization]

  21. Data Collection and Preprocessing
    ● This step involves collecting news articles from files.
    ● After collection, the following preprocessing steps are performed:
    ○ Tokenization
    ○ Normalization
    ■ Removal of non-ASCII characters, punctuation and stopwords
    ■ Lemmatization
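
    A minimal NLTK-based sketch of the preprocessing steps listed above; the exact
    pipeline used in the project may differ, and the sample sentence is made up:

    import string
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    # One-time downloads: nltk.download("punkt"), nltk.download("stopwords"),
    # nltk.download("wordnet")
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def preprocess(sentence):
        # Keep ASCII characters only, then tokenize.
        sentence = sentence.encode("ascii", "ignore").decode()
        tokens = nltk.word_tokenize(sentence.lower())
        # Drop punctuation and stopwords, then lemmatize what remains.
        tokens = [t for t in tokens
                  if t not in string.punctuation and t not in stop_words]
        return [lemmatizer.lemmatize(t) for t in tokens]

    print(preprocess("The cats were sitting on the mats!"))   # e.g. ['cat', 'sitting', 'mat']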

  22. Example

  23. Building Word Embeddings
    ● Word embeddings are created using the Word2Vec class of the Gensim
    library.
    ● An object of this class is trained on the data (news articles) to build
    a vocabulary and a word embedding for each word.
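
    A minimal sketch of that step; the token lists below are toy stand-ins for the
    preprocessed news articles, and the parameter values are illustrative:

    from gensim.models import Word2Vec

    tokenized_articles = [
        ["market", "profit", "rise", "share"],
        ["market", "share", "fall", "profit"],
        ["election", "vote", "result"],
    ]

    model = Word2Vec(vector_size=50, window=3, min_count=1, workers=1)
    model.build_vocab(tokenized_articles)            # build the vocabulary
    model.train(tokenized_articles,
                total_examples=model.corpus_count,
                epochs=model.epochs)                 # learn one vector per vocabulary word

    print(model.wv["market"])                        # fixed-length embedding of "market"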

  24. Example

  25. Use of Word Embeddings
    ● Word embeddings can be used by deep learning models to represent
    words.
    ● One interesting use of word embeddings is finding the words most
    similar to a given word.
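
    With a trained Gensim model (such as the one sketched above), this lookup is a
    one-liner; the toy corpus and query word are made up for illustration:

    from gensim.models import Word2Vec

    sentences = [["market", "profit", "rise", "share"],
                 ["market", "share", "fall", "profit"],
                 ["election", "vote", "result"]]
    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1)

    # Vocabulary words whose embeddings are closest (by cosine similarity) to "market".
    print(model.wv.most_similar("market", topn=3))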

  26. Example

  27. Generation of Sentence Embeddings
    ● Sentence embeddings are created as a weighted average of the word
    embeddings of the words in the sentence.
    ● The notion here is that a frequent word in a sentence should carry less
    weight.
    ● Hence, the weight of a word embedding is inversely proportional to the
    word's frequency in the sentence.
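
    A minimal sketch of that weighting scheme; word_vectors is assumed to map words
    to NumPy vectors (for example a trained model's wv), and the toy vectors below
    are made up:

    from collections import Counter
    import numpy as np

    def sentence_embedding(tokens, word_vectors, dim):
        # Weighted average of word vectors; a word's weight is the inverse of its
        # frequency in the sentence, so repeated words contribute less.
        counts = Counter(tokens)
        total, weight_sum = np.zeros(dim), 0.0
        for word in tokens:
            if word in word_vectors:
                w = 1.0 / counts[word]
                total += w * word_vectors[word]
                weight_sum += w
        return total / weight_sum if weight_sum > 0 else total

    toy = {"stock": np.array([1.0, 0.0, 0.0]),
           "market": np.array([0.0, 1.0, 0.0])}
    print(sentence_embedding(["stock", "market", "market"], toy, dim=3))   # [0.5 0.5 0. ]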

  28. Example

  29. Clusterization and Summarization
    ● Clusters of the sentences in the input document are created using the
    K-means clustering algorithm.
    ● The sentences closest to the cluster centers are then selected for the
    summary.
    ● The average sentence index of each cluster is computed, and the selected
    sentences are ordered on that basis.
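
    A minimal scikit-learn sketch of this step; the sentences and their embeddings
    are random toy stand-ins, and the number of clusters is an illustrative choice:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import pairwise_distances_argmin_min

    sentences = ["Sentence one.", "Sentence two.", "Sentence three.", "Sentence four."]
    sentence_vectors = np.random.rand(len(sentences), 50)   # stand-in sentence embeddings

    n_clusters = 2                                           # summary length (illustrative)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(sentence_vectors)

    # For each cluster, pick the sentence closest to the cluster center.
    closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, sentence_vectors)

    # Order the chosen sentences by the average index of each cluster's members,
    # so the summary roughly follows the original document order.
    avg_index = [np.mean(np.where(kmeans.labels_ == c)[0]) for c in range(n_clusters)]
    ordering = sorted(range(n_clusters), key=lambda c: avg_index[c])
    print(" ".join(sentences[closest[c]] for c in ordering))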

  30. Example

  31. Example Continued

  32. Skip Thought Vectors
    ● The Skip-thought model was inspired by the skip-gram
    structure used in word2vec, which is based on the idea
    that a word’s meaning is embedded by the surrounding
    words.
    ● Similarly, in contiguous text, nearby sentences provide
    rich semantic and contextual information.

  33. Skip Thought Model
    ● The model is based on an encoder-decoder architecture.
    ● All variants of this architecture share a common goal: encoding source
    inputs into fixed-length vector representations, and then feeding such
    vectors through a “narrow passage” to decode into a target output.
    ● In the case of Neural Machine Translation, the input sentence is in a
    source language (English), and the target sentence is its translation in a
    target language (German).

  34. Skip Thought Vectors
    ● With the Skip-thought model, the encoding of a source sentence is
    mapped to two target sentences: one as the preceding sentence, the
    other as the subsequent sentence.
    ● Unlike the previous (weighted-average) method, skip-thought encoders
    take the sequence of words in the sentence into account.
    ● This avoids the undesirable loss of information that averaging incurs.

  35. Encoder Decoder Network

  36. Skip Thought Architecture
    ● An encoder built from recurrent neural network layers (GRUs) captures
    the patterns in the sequence of word vectors. The hidden state of the
    encoder is fed, as a representation of the input, into two separate
    decoders (one predicting the preceding sentence, the other the
    subsequent sentence).
    ● Intuitively speaking, the encoder generates a representation of the
    input sentence itself. Back-propagating costs from the decoders during
    training enables the encoder to capture the relationship of the input
    sentence to its surrounding sentences as well.
    (A simplified PyTorch sketch of this setup follows the slide.)
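
    A highly simplified PyTorch sketch of that encoder-decoder setup (toy sizes,
    random token ids, no training loop); the real skip-thought model uses
    conditioned GRU decoders with a vocabulary softmax and is trained on large
    ordered corpora:

    import torch
    import torch.nn as nn

    vocab_size, emb_dim, hid_dim = 100, 32, 64            # toy sizes

    embed = nn.Embedding(vocab_size, emb_dim)
    encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
    prev_decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)   # reconstructs previous sentence
    next_decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)   # reconstructs next sentence
    to_vocab = nn.Linear(hid_dim, vocab_size)

    # Toy token ids for the (previous, current, next) sentences, batch of 1, length 5.
    prev_ids, cur_ids, next_ids = (torch.randint(0, vocab_size, (1, 5)) for _ in range(3))

    # Encode the current sentence; its final hidden state is the sentence representation.
    _, h = encoder(embed(cur_ids))                        # h: (1, batch, hid_dim)

    # Each decoder starts from the encoder's hidden state and is trained (teacher
    # forcing, loss omitted here) to predict the neighbouring sentence.
    prev_logits = to_vocab(prev_decoder(embed(prev_ids), h)[0])
    next_logits = to_vocab(next_decoder(embed(next_ids), h)[0])
    print(h.squeeze(0).shape)                             # the skip-thought style sentence vector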

  37. Gated Recurrent Unit (GRU)

  38. GRUs
    ● GRUs are an improved version of the standard recurrent neural network.
    ● To solve the vanishing gradient problem of a standard RNN, a GRU uses an
    update gate and a reset gate.
    ● Update gate: the update gate helps the model determine how much of
    the past information (from previous time steps) needs to be passed
    along to the future.

  39. GRUs
    ● Reset gate: this gate is used by the model to decide how much of
    the past information to forget.
    ● Current memory content: a new memory content is introduced, which
    uses the reset gate to store the relevant information from the past.

  40. GRUs
    ● Final memory at the current time step: as a last step, the network
    computes h(t), the vector that holds the information for the current unit
    and passes it down the network.
    (A NumPy sketch of a single GRU step follows the slide.)
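
    A minimal NumPy sketch of a single GRU step, following one common formulation
    of the gate equations (toy sizes, random weights, bias terms omitted):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x_dim, h_dim = 3, 4                                  # toy input / hidden sizes
    rng = np.random.default_rng(0)
    W_z, W_r, W_h = (rng.standard_normal((h_dim, x_dim)) for _ in range(3))
    U_z, U_r, U_h = (rng.standard_normal((h_dim, h_dim)) for _ in range(3))

    def gru_step(x_t, h_prev):
        z = sigmoid(W_z @ x_t + U_z @ h_prev)            # update gate: how much past to keep
        r = sigmoid(W_r @ x_t + U_r @ h_prev)            # reset gate: how much past to forget
        h_tilde = np.tanh(W_h @ x_t + U_h @ (r * h_prev))   # current memory content
        return z * h_prev + (1.0 - z) * h_tilde          # final memory h(t) at this time step

    h = gru_step(rng.standard_normal(x_dim), np.zeros(h_dim))
    print(h)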



  42. Contextual Summarization
    ● Query-based summarization
    ● Extending the existing models

  43. Our Approach
    ● The model created above can be extended to generate a summary for a
    requested context (query) by leveraging the similarity, computed via
    sentence embeddings, between the query and the document.
    ● The sentences most similar to the supplied context are returned by
    the model.
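
    A minimal sketch of that extension: embed the query the same way as the
    sentences and rank sentences by cosine similarity. The sentences, embeddings
    and query vector below are random toy stand-ins:

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    sentences = ["Sentence about markets.", "Sentence about sport.", "Sentence about elections."]
    sentence_vectors = np.random.rand(len(sentences), 50)   # stand-in sentence embeddings
    query_vector = np.random.rand(1, 50)                    # stand-in embedding of the query

    # Rank sentences by cosine similarity to the query and keep the top few.
    scores = cosine_similarity(query_vector, sentence_vectors)[0]
    top = np.argsort(scores)[::-1][:2]
    print(" ".join(sentences[i] for i in sorted(top)))      # keep original document order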

  44. References
    ● https://arxiv.org/pdf/1411.2 - Word2vec Parameter Learning Explained
    ● https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/ -
    Word embeddings
    ● https://arxiv.org/pdf/1506.06726.pdf - Skip-Thought Vectors
    ● https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be -
    Understanding GRU Networks
    ● https://medium.com/jatana/unsupervised-text-summarization-using-sentence-embeddings-adb15ce83db1 -
    Unsupervised Text Summarization using Sentence Embeddings

  45. Thank you!
