Text summarization Phase 1 evaluation 2

Phase 2 evaluation presentation of the text summarization final year project, carried out under Professor U. A. Deshpande in collaboration with TCS.

In Phase 2 we studied how word embeddings are computed to represent an arbitrary word as a fixed-length numerical vector (CBOW and Skip-gram), and read about the skip-thought architecture and GRUs.

We also built a basic prototype of text summarization using k-means clustering, which achieves roughly 60% to 75% accuracy on the BBC News dataset (an example can be found in the slides).

Phase 1 presentation: https://speakerdeck.com/gautamabhishek46/text-summarization-phase-1-evaluation

Team:
Abhishek Gautam
Atharva Parwatkar
Sharvil Nagarkar

Professor in-charge: U. A. Deshpande
TCS Mentor: Dr. Sagar Sunkle

Abhishek Gautam

December 01, 2018

Transcript

  1. Problem
     • Neural networks are inefficient with raw text input.
     • Sentences and documents are of variable length.
     • How do we represent words or sentences in a fixed-length form?
  2. Word Embeddings
     • Word embeddings are representations of plain-text words as fixed-size numerical vectors.
     • They can capture the context of a word in a document, semantic and syntactic similarity, relations with other words, etc.
  3. Advantages
     • Fast computation.
     • Preserves the semantic relationships between words.
     Disadvantages
     • Huge memory requirement.
     • Good results are not always obtained.
  4. Different prediction-based word embeddings
     1. word2vec
     2. doc2vec
     3. FastText
     Doc2Vec and FastText are extensions of word2vec.
  5. word2vec
     word2vec uses one of the following self-supervised model architectures to produce word embeddings:
     1. CBOW (continuous bag of words)
     2. Skip-gram
  6. CBOW
     • This architecture predicts the probability of a word given a context window.
     • It takes one-hot encoded vectors of the context words as input.
     Example: try to predict “Fox” given “Quick”, “Brown”, “Jump” and “Over”.
  7. CBOW architecture (for a one-word context)
     [Diagram: one-hot encoded input vector → softmax probabilities for each vocabulary word]
  8. CBOW
     • One hidden layer is used.
     • No activation function is used in the hidden layer.
     • A softmax activation function is used in the output layer.
     • The error is calculated by subtracting the target one-hot encoded vector from the softmax probabilities.
     • The error is back-propagated using gradient descent.
     • The size of the hidden layer equals the dimensionality of the word embeddings.
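As a rough illustration of the training step described above, here is a minimal single-context CBOW sketch in NumPy. The toy vocabulary, embedding size and learning rate are made up for the example; real implementations (e.g. Gensim) add optimizations such as negative sampling.

```python
import numpy as np

# Toy setup (illustrative only): tiny vocabulary and one (context, target) pair.
vocab = ["quick", "brown", "fox", "jumps", "over"]
V, N = len(vocab), 3                       # vocabulary size, embedding size
word_to_idx = {w: i for i, w in enumerate(vocab)}

W_in = np.random.randn(V, N) * 0.01        # input -> hidden weights (the embeddings)
W_out = np.random.randn(N, V) * 0.01       # hidden -> output weights
lr = 0.05

def one_hot(word):
    x = np.zeros(V)
    x[word_to_idx[word]] = 1.0
    return x

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One CBOW training step for a single-word context: predict "fox" from "quick".
context, target = "quick", "fox"
x = one_hot(context)
h = W_in.T @ x                             # hidden layer: no activation
y = softmax(W_out.T @ h)                   # softmax over the vocabulary
err = y - one_hot(target)                  # softmax probabilities minus target one-hot

# Gradient-descent update of both weight matrices.
grad_h = W_out @ err
W_out -= lr * np.outer(h, err)
W_in -= lr * np.outer(x, grad_h)

print("P(fox | quick) after one step:",
      softmax(W_out.T @ (W_in.T @ one_hot("quick")))[word_to_idx["fox"]])
```

After many such steps over a corpus, the rows of `W_in` are the word embeddings.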
  9. CBOW architecture (for a multi-word context)
     • Compute the one-hot encoded vector for each context word.
     • Concatenate them in the same order as they appear in the sentence.
     • Pass this vector as the input to the CBOW network.
  10. CBOW Advantages
      • Lower storage requirement than frequency-based word embeddings.
      • Can perform analogy reasoning such as: King - Man + Woman ≈ Queen.
      CBOW Disadvantages
      • Takes the average over all contexts of a word, mixing its senses (e.g. “Apple” the company vs. “apple” the fruit).
      • Long training time.
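The analogy above can be reproduced with Gensim on any word2vec-format vectors; the file name below is hypothetical, and the result depends on the training corpus.

```python
from gensim.models import KeyedVectors

# Hypothetical path to pre-trained vectors in word2vec format.
wv = KeyedVectors.load_word2vec_format("pretrained_vectors.bin", binary=True)

# "king" - "man" + "woman" should rank "queen" near the top.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```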
  11. Skip-gram model
      • This architecture predicts the context of a given word.
      • It takes the one-hot encoded vector of the word as input.
      Example: try to predict “Quick”, “Brown”, “Jump” and “Over” given “Fox”.
  12. Skip-gram model
      • One hidden layer is used.
      • No activation function is used in the hidden layer.
      • A softmax activation function is used in the output layer.
      • The error is calculated by subtracting the target one-hot encoded vector from the softmax probabilities.
      • The error is back-propagated using gradient descent.
      • The size of the hidden layer equals the dimensionality of the word embeddings.
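Since the network itself mirrors the CBOW sketch above (with the roles of word and context swapped), the part worth illustrating is how (center, context) training pairs are generated with a sliding window; the window size below is an arbitrary choice for the example.

```python
# Generate (center, context) training pairs for a skip-gram model.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        # Every word within `window` positions of the center is a context word.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "quick", "brown", "fox", "jumps"]))
# e.g. ('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ...
```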
  13. Skip-gram model Advantages
      • The skip-gram model can capture more than one meaning for a single word.
      Disadvantages
      • Fails to identify combined words (e.g. “New York”).
  14. Working Model
      • Workflow diagram
      • Data preprocessing
      • Creation of word embeddings
      • Clustering
      • Summarization
  15. Data Collection and Preprocessing
      • This step involves collecting news articles from files.
      • After collection, the following preprocessing steps are performed:
        ◦ Tokenization
        ◦ Normalization
          ▪ Removal of non-ASCII characters, punctuation and stopwords
          ▪ Lemmatization
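A minimal sketch of that preprocessing pipeline, assuming NLTK is used (the deck does not name the library, so this is only one plausible implementation):

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the required NLTK resources.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

def preprocess(text):
    """Tokenize, normalize (ASCII, punctuation, stopwords) and lemmatize."""
    text = text.encode("ascii", "ignore").decode()        # drop non-ASCII characters
    tokens = nltk.word_tokenize(text.lower())              # tokenization
    stop = set(stopwords.words("english"))
    tokens = [t for t in tokens
              if t not in string.punctuation and t not in stop]
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]

print(preprocess("The quick brown foxes are jumping over the lazy dogs!"))
```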
  16. Building Word Embeddings
      • Word embeddings are created using the Word2Vec class of the Gensim library.
      • An object of this class is trained on the data (news articles) to build a vocabulary and a word embedding for each word.
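A minimal training sketch with Gensim's Word2Vec; the toy corpus and hyperparameters are illustrative, not the project's (note that Gensim 4.x uses vector_size where older versions used size):

```python
from gensim.models import Word2Vec

# `articles` stands for the preprocessed news articles: one token list per sentence.
articles = [
    ["government", "announces", "new", "budget"],
    ["budget", "focuses", "on", "health", "and", "education"],
]

# sg=0 selects CBOW, sg=1 selects skip-gram.
model = Word2Vec(sentences=articles, vector_size=100, window=5, min_count=1, sg=0)

vector = model.wv["budget"]      # the 100-dimensional embedding for "budget"
print(vector.shape)              # (100,)
```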
  17. Use of Word Embeddings
      • Word embeddings can be used by deep learning models to represent words.
      • One interesting use of word embeddings is finding words similar to a given word.
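Continuing the Gensim sketch above, similar words can be queried with most_similar; the query word is just an example and meaningful neighbours require a real corpus.

```python
from gensim.models import Word2Vec

articles = [
    ["government", "announces", "new", "budget"],
    ["budget", "focuses", "on", "health", "and", "education"],
]
model = Word2Vec(sentences=articles, vector_size=100, window=5, min_count=1)

# Words closest to "budget" in the learned embedding space.
print(model.wv.most_similar("budget", topn=3))
```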
  18. Generation of Sentence Embeddings
      • Sentence embeddings are created as a weighted average of the word embeddings of the words in the sentence.
      • The intuition is that a frequent word in the sentence should carry less weight.
      • Hence, the weight of a word embedding is inversely proportional to the word's frequency in the sentence.
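A minimal sketch of that weighting scheme; the deck does not spell out the exact formula, so inverse within-sentence frequency is used literally here, and `wv` is assumed to be a Gensim KeyedVectors object.

```python
from collections import Counter
import numpy as np

def sentence_embedding(tokens, wv):
    """Weighted average of word vectors, weight = 1 / frequency in the sentence."""
    counts = Counter(tokens)
    vectors, weights = [], []
    for token in tokens:
        if token in wv:                        # skip out-of-vocabulary words
            vectors.append(wv[token])
            weights.append(1.0 / counts[token])
    if not vectors:
        return np.zeros(wv.vector_size)
    return np.average(vectors, axis=0, weights=weights)
```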
  19. Clustering and Summarization
      • Sentences of the input document are clustered using the K-means algorithm.
      • The sentences closest to the cluster centres are then selected for summary generation.
      • The average sentence index of each cluster is computed, and the selected sentences are ordered on that basis.
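A compact sketch of this clustering step with scikit-learn; the number of clusters is an arbitrary default, and the sentence vectors are assumed to come from a function like sentence_embedding above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

def summarize(sentences, sentence_vectors, n_clusters=3):
    """Pick one representative sentence per cluster, ordered by average sentence index."""
    X = np.vstack(sentence_vectors)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)

    # Index of the sentence closest to each cluster centre.
    closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X)

    # Order the chosen sentences by the average sentence index of their cluster.
    avg_index = {c: np.mean(np.where(kmeans.labels_ == c)[0]) for c in range(n_clusters)}
    ordering = sorted(range(n_clusters), key=lambda c: avg_index[c])
    return [sentences[closest[c]] for c in ordering]
```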
  20. Skip-Thought Vectors
      • The skip-thought model was inspired by the skip-gram structure used in word2vec, which is based on the idea that a word's meaning is embedded in its surrounding words.
      • Similarly, in contiguous text, nearby sentences provide rich semantic and contextual information.
  21. Skip-Thought Model
      • The model is based on an encoder-decoder architecture.
      • All variants of this architecture share a common goal: encoding source inputs into fixed-length vector representations, then feeding those vectors through a “narrow passage” to decode a target output.
      • In neural machine translation, for example, the input sentence is in a source language (e.g. English) and the target sentence is its translation in a target language (e.g. German).
  22. Skip-Thought Vectors
      • In the skip-thought model, the encoding of a source sentence is mapped to two target sentences: the preceding sentence and the subsequent sentence.
      • Unlike the previous method (weighted averaging), skip-thought encoders take the order of words in the sentence into account.
      • This prevents the loss of information that word order carries.
  23. Skip-Thought Architecture
      • An encoder built from recurrent neural network layers (GRUs) captures the patterns in the sequence of word vectors. The hidden states of the encoder are fed, as representations of the input, into two separate decoders that predict the preceding and the subsequent sentence.
      • Intuitively, the encoder generates a representation of the input sentence itself; back-propagating the decoders' costs during training lets the encoder also capture the relationship of the input sentence to its surrounding sentences.
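A rough sketch of this "one encoder, two decoders" layout, assuming PyTorch; the layer sizes are arbitrary, and vocabulary handling, teacher forcing and the training loop are omitted.

```python
import torch
import torch.nn as nn

class SkipThought(nn.Module):
    """Minimal skip-thought sketch: a GRU encoder whose final hidden state
    conditions two GRU decoders (previous and next sentence)."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.dec_prev = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.dec_next = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, cur, prev, nxt):
        # Encode the current sentence; h is the sentence representation.
        _, h = self.encoder(self.embed(cur))
        # Decode the neighbouring sentences conditioned on h.
        prev_out, _ = self.dec_prev(self.embed(prev), h)
        next_out, _ = self.dec_next(self.embed(nxt), h)
        return self.out(prev_out), self.out(next_out)

model = SkipThought(vocab_size=1000)
cur = torch.randint(0, 1000, (4, 12))    # batch of 4 sentences, 12 token ids each
prev = torch.randint(0, 1000, (4, 12))
nxt = torch.randint(0, 1000, (4, 12))
logits_prev, logits_next = model(cur, prev, nxt)   # fed to a cross-entropy loss during training
```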
  24. GRUs
      • GRUs are an improved version of the standard recurrent neural network.
      • To solve the vanishing gradient problem of a standard RNN, a GRU uses an update gate and a reset gate.
      • Update gate: helps the model determine how much of the past information (from previous time steps) should be passed along to the future.
  25. GRUs
      • Reset gate: used by the model to decide how much of the past information to forget.
      • Current memory content: a new memory content that uses the reset gate to keep only the relevant information from the past.
  26. GRUs
      • Final memory at the current time step: as a last step, the network calculates h(t), the vector that holds the information for the current unit and passes it on to the next time step.
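The four quantities described on slides 24-26 are commonly written as follows (following the convention of the "Understanding GRU Networks" reference, where sigma is the sigmoid function and ⊙ the element-wise product):

```latex
\begin{aligned}
z_t &= \sigma\!\left(W_z x_t + U_z h_{t-1}\right) && \text{update gate} \\
r_t &= \sigma\!\left(W_r x_t + U_r h_{t-1}\right) && \text{reset gate} \\
\tilde{h}_t &= \tanh\!\left(W x_t + U\,(r_t \odot h_{t-1})\right) && \text{current memory content} \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t && \text{final memory at time } t
\end{aligned}
```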
  27. Our Approach
      • The model above can be extended to generate a summary for a requested context (query) by computing similarities between the sentence embeddings of the query and of the document.
      • The sentences most similar to the given context are returned by the model.
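A minimal sketch of that query-driven selection, assuming cosine similarity between the query embedding and each sentence embedding (the deck does not specify the similarity measure, and the embeddings are assumed to be precomputed, e.g. with sentence_embedding above):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def query_summary(query_vec, sentences, sentence_vecs, top_k=3):
    """Return the sentences whose embeddings are most similar to the query embedding."""
    scores = [cosine(query_vec, v) for v in sentence_vecs]
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [sentences[i] for i in sorted(best)]   # keep original document order
```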
  28. References
      • https://arxiv.org/pdf/1411.2 - Word2vec Parameter Learning Explained
      • https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/ - Word embeddings
      • https://arxiv.org/pdf/1506.06726.pdf - Skip-Thought Vectors
      • https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be - Understanding GRU Networks
      • https://medium.com/jatana/unsupervised-text-summarization-using-sentence-embeddings-adb15ce83db1 - Unsupervised Text Summarization using Sentence Embeddings