Slide 1

Text Summarization
Abhishek Gautam (BT15CSE002)
Atharva Parwatkar (BT15CSE015)
Sharvil Nagarkar (BT15CSE052)
Under Prof. U. A. Deshpande

Slide 2

Problem
● Neural networks are inefficient with raw text input.
● Sentences and documents are of variable length.
● How can we represent words or sentences as fixed-length vectors?

Slide 3

Word Embeddings
● Word embeddings are representations of plain-text words as fixed-size numerical vectors.
● They can capture the context of a word in a document, semantic and syntactic similarity, relations with other words, etc.

Slide 4

Types of word embeddings
● Frequency based
● Prediction based

Slide 5

Frequency based word embeddings (Overview)

Slide 6

Advantages
● Fast computation.
● Preserves the semantic relationship between words.
Disadvantages
● Huge memory requirement.
● Does not give good results in practice.

Slide 7

Prediction based word embeddings

Slide 8

Different prediction based word embeddings
1. word2vec
2. doc2vec
3. FastText
doc2vec and FastText are extensions of word2vec.

Slide 9

One-Hot Encoding
Representation of the word “sample”:
(Table: the word, its index in the vocabulary, and the corresponding one-hot array.)
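As a minimal illustration of the idea, with a small hypothetical vocabulary: the one-hot vector has the size of the vocabulary, with a 1 at the word's index and 0 everywhere else.

```python
# Minimal one-hot encoding sketch with a hypothetical five-word vocabulary.
vocab = ["this", "is", "a", "sample", "sentence"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)          # vector of vocabulary size, all zeros
    vec[word_to_index[word]] = 1    # 1 at the word's index
    return vec

print(word_to_index["sample"])  # 3
print(one_hot("sample"))        # [0, 0, 0, 1, 0]
```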

Slide 10

word2vec
word2vec uses one of the following self-supervised model architectures to produce word embeddings:
1. CBOW (continuous bag of words)
2. Skip-gram model

Slide 11

CBOW
● This architecture predicts the probability of a word given its context window.
● It takes the one-hot encoded vectors of the context words as input.
Example: try to predict “Fox” given “Quick”, “Brown”, “Jump” and “Over”.

Slide 12

CBOW architecture (for one-word context)
(Diagram: a one-hot encoded input vector is mapped through a hidden layer to softmax probabilities over each vocabulary word.)

Slide 13

CBOW
● One hidden layer is used, with no activation function.
● A softmax activation function is used in the output layer.
● The error is calculated by subtracting the target one-hot encoded array from the softmax probabilities.
● The error is back-propagated and the weights are updated using gradient descent.
● The size of the hidden layer is equal to the length of the fixed-length word embedding vectors.
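A minimal NumPy sketch of this one-word-context forward and backward pass; the vocabulary size, embedding size, word indices and learning rate are hypothetical, and this illustrates the description on the slide rather than the original word2vec implementation.

```python
import numpy as np

V, N = 10, 4                                  # vocabulary size, embedding (hidden layer) size
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, N))     # input -> hidden weights (the word embeddings)
W_out = rng.normal(scale=0.1, size=(N, V))    # hidden -> output weights

def one_hot(idx, size):
    v = np.zeros(size)
    v[idx] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

context_idx, target_idx = 3, 7                # hypothetical context and target word indices
x = one_hot(context_idx, V)

h = W_in.T @ x                                # hidden layer: no activation, just a row lookup
y = softmax(W_out.T @ h)                      # softmax probabilities over the vocabulary

# Error = softmax probabilities minus the target one-hot array (as on the slide)
e = y - one_hot(target_idx, V)

# One gradient-descent update of both weight matrices
lr = 0.1
W_out -= lr * np.outer(h, e)
W_in[context_idx] -= lr * (W_out @ e)
```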

Slide 14

CBOW architecture (for multi-word context)
● Compute the one-hot encoded vector of each context word.
● Concatenate them in the same order as in the sentence.
● Pass this vector as the input to the CBOW network.

Slide 15

CBOW Advantages
● Lower storage requirement than frequency based word embeddings.
● Can perform reasoning such as: King − man + woman ≈ Queen (see the sketch below).
CBOW Disadvantages
● Takes the average of all contexts of a word (e.g. “Apple” the company and “apple” the fruit share a single vector).
● Long training time.
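As an illustration of the analogy reasoning, a short sketch using small pretrained vectors from Gensim's downloader; the "glove-wiki-gigaword-50" vectors are just a convenient example, and any trained word vectors expose the same most_similar() call.

```python
import gensim.downloader as api

# Load small pretrained word vectors (download happens on first use).
wv = api.load("glove-wiki-gigaword-50")

# king - man + woman ~= queen
result = wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically [('queen', <similarity score>)]
```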

Slide 16

Skip-gram model
● This architecture predicts the context words of a given word.
● It takes the one-hot encoded vector of the word as input.
Example: try to predict “Quick”, “Brown”, “Jump” and “Over” given “Fox”.

Slide 17

Skip-gram model
● One hidden layer is used, with no activation function.
● A softmax activation function is used in the output layer.
● The error is calculated by subtracting the target one-hot encoded array from the softmax probabilities.
● The error is back-propagated and the weights are updated using gradient descent.
● The size of the hidden layer is equal to the length of the fixed-length word embedding vectors.

Slide 18

Skip-gram model Advantages
● The skip-gram model can capture more than one semantic for a single word (e.g. “Apple” the company and “apple” the fruit).
Skip-gram model Disadvantages
● Fails to identify combined words. Example: New York.

Slide 19

Working Model
● Workflow Diagram
● Data Preprocessing
● Creation of Word Embeddings
● Clusterization
● Summarization

Slide 20

Workflow Diagram
Document → Data Preprocessing → Building Word Embeddings → Sentence Embeddings → Clusterization → Summarization

Slide 21

Data Collection and Preprocessing
● This step involves the collection of news articles from files.
● After collection, the following preprocessing steps are performed (a sketch follows the list):
  ○ Tokenization
  ○ Normalization
    ■ Removal of non-ASCII characters, punctuation and stopwords
    ■ Lemmatization
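A minimal sketch of these preprocessing steps using NLTK, assuming NLTK and its tokenizer, stopword and WordNet resources are installed; the project's exact choices may differ.

```python
import string
import unicodedata
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(article: str):
    """Return a list of token lists, one per sentence of the article."""
    processed = []
    for sentence in sent_tokenize(article):
        tokens = word_tokenize(sentence)          # tokenization
        cleaned = []
        for tok in tokens:
            # Removal of non-ASCII characters
            tok = unicodedata.normalize("NFKD", tok).encode("ascii", "ignore").decode("ascii")
            tok = tok.lower()
            # Removal of punctuation and stopwords
            if not tok or tok in string.punctuation or tok in stop_words:
                continue
            # Lemmatization
            cleaned.append(lemmatizer.lemmatize(tok))
        if cleaned:
            processed.append(cleaned)
    return processed

print(preprocess("The quick brown fox jumps over the lazy dog."))
```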

Slide 22

Example

Slide 23

Building Word Embeddings
● Word embeddings are created using the Word2Vec class of the Gensim library.
● An object of this class is trained on the data (news articles) to build a vocabulary and a word embedding for each word.
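A minimal training sketch with Gensim 4.x; `processed_articles` is a small hypothetical stand-in for the preprocessed news sentences, and the hyperparameter values are illustrative rather than the ones used in the project.

```python
from gensim.models import Word2Vec

# Hypothetical preprocessed news sentences (token lists from the previous step).
processed_articles = [
    ["government", "announce", "new", "education", "policy"],
    ["new", "policy", "affect", "local", "school"],
    ["school", "welcome", "education", "reform"],
    ["minister", "defend", "education", "reform"],
    ["local", "business", "react", "new", "policy"],
]

model = Word2Vec(
    sentences=processed_articles,
    vector_size=100,   # dimensionality of the word embeddings
    window=5,          # context window size
    min_count=1,       # keep even rare words in this toy example
    sg=0,              # 0 = CBOW, 1 = skip-gram
)

# Each vocabulary word now has a fixed-length vector.
print(model.wv["policy"].shape)   # (100,)
```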

Slide 24

Example

Slide 25

Use of Word Embeddings
● Word embeddings can be used by deep learning models to represent words.
● One interesting use of word embeddings is finding the words most similar to a given word, as sketched below.
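Continuing the training sketch above, the model can be queried for the most similar words; the neighbours returned depend entirely on the training corpus.

```python
# Find the five words most similar to "policy" in the trained model.
similar = model.wv.most_similar("policy", topn=5)
for word, score in similar:
    print(f"{word}\t{score:.3f}")
```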

Slide 26

Example

Slide 27

Generation of Sentence Embeddings
● Sentence embeddings are created as a weighted average of the word embeddings of the words in the sentence.
● The notion here is that a word that is frequent in the sentence should get less weight.
● Hence, the weight of a word embedding is inversely proportional to the word's frequency in the sentence.
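A sketch of this weighting scheme, reusing `model` and `processed_articles` from the earlier sketches; `sentence_embedding` is a hypothetical helper name, not the project's API.

```python
import numpy as np
from collections import Counter

def sentence_embedding(tokens, wv, dim=100):
    """Weighted average of word vectors; weight = 1 / in-sentence frequency."""
    counts = Counter(tokens)
    vectors, weights = [], []
    for tok in tokens:
        if tok in wv:                         # skip out-of-vocabulary words
            vectors.append(wv[tok])
            weights.append(1.0 / counts[tok]) # frequent words get less weight
    if not vectors:
        return np.zeros(dim)
    return np.average(vectors, axis=0, weights=weights)

sentence_vectors = [sentence_embedding(s, model.wv) for s in processed_articles]
```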

Slide 28

Example

Slide 29

Clusterization and Summarization
● Clusters of the sentences in the input document are created using the K-means clustering algorithm.
● The sentence closest to each cluster centre is then selected for summary generation.
● The average sentence index of each cluster is computed, and the selected sentences are ordered on that basis.
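A sketch of this step with scikit-learn's KMeans, reusing `sentence_vectors` and `processed_articles` from the previous sketches; `original_sentences` stands in for the raw sentences in document order, and the number of clusters is an illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

original_sentences = [" ".join(toks) for toks in processed_articles]  # stand-in for raw sentences

n_clusters = min(3, len(sentence_vectors))        # desired summary length
X = np.vstack(sentence_vectors)
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)

# Index of the sentence closest to each cluster centre.
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X)

# Order the chosen sentences by the average sentence index of their cluster.
avg_index = {c: np.mean(np.where(kmeans.labels_ == c)[0]) for c in range(n_clusters)}
ordered = sorted(range(n_clusters), key=lambda c: avg_index[c])

summary = " ".join(original_sentences[closest[c]] for c in ordered)
print(summary)
```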

Slide 30

Example

Slide 31

Example Continued

Slide 32

Skip Thought Vectors
● The skip-thought model was inspired by the skip-gram structure used in word2vec, which is based on the idea that a word's meaning is embedded in the surrounding words.
● Similarly, in contiguous text, nearby sentences provide rich semantic and contextual information.

Slide 33

Skip Thought Model
● The model is based on an encoder-decoder architecture.
● All variants of this architecture share a common goal: encoding source inputs into fixed-length vector representations, and then feeding these vectors through a “narrow passage” to decode them into a target output.
● In the case of neural machine translation, the input sentence is in a source language (e.g. English) and the target sentence is its translation in a target language (e.g. German).

Slide 34

Skip Thought Vectors
● With the skip-thought model, the encoding of a source sentence is mapped to two target sentences: one is the preceding sentence, the other the subsequent sentence.
● Unlike the previous methods, skip-thought encoders take the order of the words in the sentence into account.
● This prevents the undesirable information losses that would otherwise be incurred.

Slide 35

Encoder-Decoder Network

Slide 36

Skip Thought Architecture
● An encoder built using recurrent neural network layers (GRUs) captures the patterns in the sequence of word vectors. The hidden states of the encoder are fed, as representations of the input, into two separate decoders (one predicting the preceding sentence, the other the subsequent sentence).
● Intuitively speaking, the encoder generates a representation of the input sentence itself. Back-propagating the decoders' costs during training enables the encoder to also capture the relationship of the input sentence to its surrounding sentences.
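A highly simplified PyTorch sketch of a GRU encoder whose final hidden state conditions two GRU decoders, one for the previous sentence and one for the next. PyTorch, the class name and all sizes are assumptions for illustration; the original skip-thought implementation uses its own framework and conditioning scheme.

```python
import torch
import torch.nn as nn

class SkipThought(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.dec_prev = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.dec_next = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out_prev = nn.Linear(hid_dim, vocab_size)
        self.out_next = nn.Linear(hid_dim, vocab_size)

    def forward(self, sent, prev_sent, next_sent):
        # Encode the current sentence; h is its fixed-length representation.
        _, h = self.encoder(self.embed(sent))
        # Decode the preceding and subsequent sentences, conditioned on h.
        out_p, _ = self.dec_prev(self.embed(prev_sent), h)
        out_n, _ = self.dec_next(self.embed(next_sent), h)
        return self.out_prev(out_p), self.out_next(out_n)

skip_thought = SkipThought()
sent = torch.randint(0, 10000, (2, 12))       # batch of 2 sentences, 12 token ids each
prev_sent = torch.randint(0, 10000, (2, 10))
next_sent = torch.randint(0, 10000, (2, 11))
logits_prev, logits_next = skip_thought(sent, prev_sent, next_sent)
```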

Slide 37

Gated Recurrent Unit (GRU)

Slide 38

GRUs
● GRUs are an improved version of the standard recurrent neural network.
● To solve the vanishing gradient problem of a standard RNN, a GRU uses an update gate and a reset gate.
● Update gate: helps the model determine how much of the past information (from previous time steps) needs to be passed along to the future.

Slide 39

GRUs
● Reset gate: used by the model to decide how much of the past information to forget.
● Current memory content: a new memory content that uses the reset gate to store the relevant information from the past.

Slide 40

GRUs
● Final memory at the current time step: as a last step, the network calculates h(t), the vector that holds the information for the current unit and passes it down the network.
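For reference, one common formulation of the four quantities described on these GRU slides, where σ is the sigmoid function and ⊙ denotes element-wise multiplication; conventions differ on which term the update gate multiplies.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1}\right) && \text{(update gate)} \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1}\right) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\left(W x_t + r_t \odot U h_{t-1}\right) && \text{(current memory content)} \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t && \text{(final memory)}
\end{aligned}
```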


Slide 42

Contextual Summarization
● Query-based summarization
● Extending the existing models

Slide 43

Our Approach
● The model created above can be extended to generate a summary for a given context (query) by leveraging the similarity between the query and the document sentences, computed with their sentence embeddings.
● The sentences most similar to the given context are returned by the model.
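A sketch of this extension, ranking sentences by cosine similarity between the query embedding and each sentence embedding; it reuses `sentence_embedding`, `model`, `sentence_vectors` and `original_sentences` from the earlier sketches, and the query string is hypothetical.

```python
import numpy as np

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

query = "education policy"                                   # hypothetical context/query
q_vec = sentence_embedding(query.lower().split(), model.wv)  # embed the query like a sentence

# Score every sentence against the query and keep the top matches.
scores = [cosine(q_vec, s_vec) for s_vec in sentence_vectors]
top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:3]

# Re-order the selected sentences by their position in the document.
contextual_summary = " ".join(original_sentences[i] for i in sorted(top))
print(contextual_summary)
```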

Slide 44

References
● https://arxiv.org/pdf/1411.2 - Word2vec Parameter Learning Explained
● https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/ - Word embeddings
● https://arxiv.org/pdf/1506.06726.pdf - Skip-Thought Vectors
● https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be - Understanding GRU Networks
● https://medium.com/jatana/unsupervised-text-summarization-using-sentence-embeddings-adb15ce83db1 - Unsupervised Text Summarization Using Sentence Embeddings

Slide 45

Thank you!