Slide 1

Text Summarization
Abhishek Gautam (BT15CSE002)
Atharva Parwatkar (BT15CSE015)
Sharvil Nagarkar (BT15CSE052)
Under Prof. U. A. Deshpande

Slide 2

Recap
● We created an unsupervised model using sentence embeddings.
● The model was extractive in nature.

Slide 3

Abstractive summarisation
● Limitations of extractive summarisation
● What an abstractive model does differently
● Our approaches

Slide 4

Our Approaches
● A domain-specific abstractive model
● A neural network using reinforcement learning
● A Seq2Seq neural attention model

Slide 5

Domain-specific abstractive model
● Data
  ○ CNN/Daily Mail dataset
  ○ Movie subtitles and their corresponding plots
    ■ Scraped movie subtitle data from https://yifysubtitles.org
    ■ Scraped Wikipedia articles to extract movie plots (a hypothetical scraping sketch follows)
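A hypothetical sketch of the Wikipedia half of this scraping step, assuming requests and BeautifulSoup; the example URL, the "Plot" heading check and the tag selection are illustrative only, and the real scraper's selectors may differ.

import requests
from bs4 import BeautifulSoup

def scrape_plot(wikipedia_url):
    # Collect the paragraphs that sit under a Wikipedia article's "Plot" heading.
    soup = BeautifulSoup(requests.get(wikipedia_url).text, "html.parser")
    paragraphs, in_plot = [], False
    for tag in soup.find_all(["h2", "h3", "p"]):
        if tag.name in ("h2", "h3"):
            # Start collecting after a heading that begins with "Plot",
            # and stop at the next heading.
            in_plot = tag.get_text().strip().startswith("Plot")
        elif in_plot:
            paragraphs.append(tag.get_text().strip())
    return "\n".join(paragraphs)

# Example (hypothetical article choice):
print(scrape_plot("https://en.wikipedia.org/wiki/Inception")[:300])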

Slide 6

Domain-specific abstractive model
● An extractive model first pulls out most of the relevant information.
● Key-phrase extraction is then applied to the intermediate text to generate a domain-specific summary.

Slide 7

Key Phrase Extraction
● Automatic keyphrase extraction is typically a two-step process:
  ○ First, a set of candidate words and phrases that could convey the topical content of the document is identified.
  ○ Then these candidates are scored/ranked, and the "best" are selected as the document's keyphrases.
● Keyphrase extraction is implemented with spaCy.

Slide 8

1. Candidate Identification
● Common heuristics include filtering for words with certain parts of speech or, for multi-word phrases, certain POS patterns, and using external knowledge bases like WordNet or Wikipedia as a reference source of good/bad keyphrases.
● We keep noun phrases matching the POS pattern {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+} (a regular expression over part-of-speech tags, written in a simplified format), as in the sketch below.
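A minimal sketch of this candidate-identification step, assuming NLTK's RegexpParser with the POS pattern above (the chunk label "NP" and the example sentence are illustrative; the project's spaCy pipeline exposes equivalent POS tags).

import nltk  # requires the punkt and averaged_perceptron_tagger resources

GRAMMAR = "NP: {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}"

def candidate_phrases(text):
    chunker = nltk.RegexpParser(GRAMMAR)
    candidates = []
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        # Keep every chunk that matches the noun-phrase pattern.
        for subtree in chunker.parse(tagged).subtrees(lambda t: t.label() == "NP"):
            candidates.append(" ".join(word for word, tag in subtree.leaves()))
    return candidates

print(candidate_phrases("Automatic keyphrase extraction finds topical phrases in a document."))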

Slide 9

2. Keyphrase Selection
● We use a graph-based ranking method, in which the importance of a candidate is determined by its relatedness to other candidates; "relatedness" may be measured by two terms' frequency of co-occurrence or by their semantic relatedness.
● The method assumes that more important candidates are related to a greater number of other candidates, and that more of those related candidates are also considered important (see the sketch below).
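A minimal, TextRank-style sketch of this ranking step, assuming networkx; the window size, tokens and weighting are illustrative, not the project's exact settings.

import networkx as nx

def rank_candidates(tokens, window=4, top_k=5):
    # Build a co-occurrence graph: words co-occurring within `window` share an edge.
    graph = nx.Graph()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if word != other:
                weight = graph.get_edge_data(word, other, {"weight": 0})["weight"]
                graph.add_edge(word, other, weight=weight + 1)
    # PageRank scores each candidate by how strongly it relates to other candidates.
    scores = nx.pagerank(graph, weight="weight")
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

tokens = "automatic keyphrase extraction ranks candidate phrases in a long document".split()
print(rank_candidates(tokens))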

Slide 10

Model using reinforcement learning
● Sequence-to-sequence sentence generation
● Reinforcement learning

Slide 11

Sequence-to-sequence sentence generation
● Used for various NLP tasks such as machine translation, question answering, etc.
● An encoder-decoder architecture is used (a sketch follows below):
  ○ It consists of LSTM or bidirectional LSTM layers.
  ○ Word embeddings are fed to the encoder at each timestep.
  ○ The encoder builds a context vector and passes it to the decoder once it receives an EOS (end-of-sentence) symbol.
  ○ At each timestep the decoder predicts the next word from its previous hidden state and the previously predicted word, until it predicts an EOS symbol.
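A minimal PyTorch sketch of this encoder-decoder setup (an assumed framework choice, not the project's exact model): the encoder's final LSTM state serves as the context passed to the decoder, which predicts one word per timestep.

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        _, context = self.encoder(self.embed(src_ids))            # final (h, c) acts as the context vector
        dec_out, _ = self.decoder(self.embed(tgt_ids), context)   # teacher forcing during training
        return self.out(dec_out)                                  # per-timestep vocabulary logits

model = Seq2Seq(vocab_size=10000)
logits = model(torch.randint(0, 10000, (2, 12)), torch.randint(0, 10000, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 10000])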

Slide 12

Sequence-to-sequence sentence generation (figure)

Slide 13

Encoder using a bidirectional LSTM (figure)

Slide 14

Reinforcement learning
● Extractor: a CNN-then-RNN network.
● The extractor generates representations of important words, phrases and sentences.
● Using the extractor's output, important sentences are selected with a Pointer Network (not shown in the image); a simplified sketch of the sentence encoder follows below.
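A rough PyTorch sketch of the CNN-then-RNN idea, simplified from the cited approach (dimensions and kernel size are illustrative): a 1-D convolution encodes each sentence from its word embeddings, and an LSTM runs over the resulting sentence representations; its outputs would feed the Pointer Network.

import torch
import torch.nn as nn

class ConvThenRNNExtractor(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, conv_dim=100, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, conv_dim, kernel_size=3, padding=1)
        self.rnn = nn.LSTM(conv_dim, hidden_dim, batch_first=True)

    def forward(self, doc_ids):
        # doc_ids: (num_sentences, words_per_sentence)
        emb = self.embed(doc_ids).transpose(1, 2)                 # (sents, emb_dim, words)
        sent_repr = torch.relu(self.conv(emb)).max(dim=2).values  # max-pool over words
        doc_states, _ = self.rnn(sent_repr.unsqueeze(0))          # LSTM over the sentence sequence
        return doc_states.squeeze(0)  # one state per sentence, for the Pointer Network

extractor = ConvThenRNNExtractor(vocab_size=10000)
print(extractor(torch.randint(0, 10000, (5, 20))).shape)  # torch.Size([5, 256])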

Slide 15

Reinforcement learning
● The abstractor network then compresses and rewrites each extracted document sentence into a concise summary sentence.
● A ROUGE score is computed on the result and used to train the extractor (a toy version of the reward is sketched below).
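A toy sketch of the reward signal: a unigram-overlap (ROUGE-1 style) F1 score between the generated summary and the reference. The real training uses a standard ROUGE implementation and a policy-gradient update, but the reward idea is the same.

from collections import Counter

def rouge1_f1(generated: str, reference: str) -> float:
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((gen & ref).values())     # unigrams shared by summary and reference
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# The extractor's sentence-selection policy receives this score as its reward,
# so choosing sentences that lead to better rewrites is reinforced.
print(rouge1_f1("police arrested the suspect", "the suspect was arrested by police"))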

Slide 16

Reinforcement learning: the Extractor (figure)

Slide 17

What is attention?
● What does the current model do?
● How does attention help?

Slide 18

What does the current model do?
(Figure: an encoder-decoder fed the words "...front against russian terrorism defence...")

Slide 19

What does the current model do?
● In the picture, the words "front", "against" and "terrorism" are fed into an encoder, and after a special signal the decoder starts producing a translated (simplified) sentence.
● The decoder is expected to generate the translation based solely on the last hidden state from the encoder.
● It is unreasonable to assume that all the information in a potentially very long sentence can be encoded into a single vector, and that the decoder can then produce a good translation from that vector alone.
● The attention model addresses this problem.

Slide 20

Attention Model
● With an attention mechanism we no longer try to encode the full source sentence into a fixed-length vector.
● Rather, we allow the decoder to "attend" to different parts of the source sentence at each step of output generation.
● We let the model learn what to attend to, based on the input sentence and what it has produced so far.
● Each decoder output word depends on a weighted combination of all the encoder hidden states, not just the last one (see the sketch below).
● The weights are updated during training.
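A minimal PyTorch sketch (assumed framework, simple dot-product scoring) of that weighted combination: the decoder state scores every encoder hidden state, the scores are softmax-normalised into attention weights, and the context vector is the weighted sum.

import torch
import torch.nn.functional as F

def attend(decoder_state, encoder_states):
    # decoder_state: (batch, hidden); encoder_states: (batch, src_len, hidden)
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2)).squeeze(2)  # (batch, src_len)
    weights = F.softmax(scores, dim=1)          # one attention weight per source position
    context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)       # (batch, hidden)
    return context, weights

encoder_states = torch.randn(2, 12, 256)   # 12 source positions
decoder_state = torch.randn(2, 256)        # current decoder state
context, weights = attend(decoder_state, encoder_states)
print(context.shape, weights.shape)  # torch.Size([2, 256]) torch.Size([2, 12])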

Slide 21

Cost of using the Attention Model
● We need to calculate an attention value for each combination of input and output word.
● With a 50-word input sequence and a 50-word generated output sequence, that is 50 × 50 = 2,500 attention values.

Slide 22

An example visualization of the attention model (figure)

Slide 23

Generating Summaries
● Beam Search

Slide 24

Generating Summaries
At each step, the decoder outputs a probability distribution over the target vocabulary. To pick the output word at that step we can do the following:
● Greedy sampling (sketched below)
  ○ Choose the word with the highest probability at each timestep.
  ○ This sometimes produces incorrect results, because an early greedy choice can never be revised.
● A better approach is to use Beam Search.
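A minimal sketch of greedy sampling; decoder_step is a hypothetical function returning (vocabulary logits, new state), standing in for one decoder timestep.

import torch

def greedy_decode(decoder_step, init_state, eos_id, max_len=50):
    tokens, state, current = [], init_state, None
    for _ in range(max_len):
        logits, state = decoder_step(current, state)
        current = int(torch.argmax(logits))   # always take the single most likely word
        if current == eos_id:
            break
        tokens.append(current)
    return tokens

# Toy decoder_step: favours token 7 for three steps, then EOS (id 1).
def toy_step(prev_token, step):
    logits = torch.zeros(10)
    logits[7 if step < 3 else 1] = 5.0
    return logits, step + 1

print(greedy_decode(toy_step, init_state=0, eos_id=1))  # [7, 7, 7]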

Slide 25

Beam Search
● Ideally, every possible branch should be checked to find the best result.
● However, this is not feasible, as the number of possible hypotheses grows exponentially.
● Hence, Beam Search is a compromise between the exact solution and the greedy approach.
● Essentially, Beam Search maintains the top k hypotheses for the summary.
● At each step it prunes the expanded candidates back down to the top k.
● This gives each promising target word a fair shot at contributing to the summary (see the sketch below).
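A minimal sketch of beam search; decoder_step is a hypothetical function mapping a partial sequence to a {token_id: log_probability} dict. At every step each hypothesis is expanded and the candidates are pruned back to the k best by cumulative score.

import math

def beam_search(decoder_step, bos_id, eos_id, k=3, max_len=20):
    beams = [([bos_id], 0.0)]                       # (tokens, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos_id:                # finished hypotheses carry over unchanged
                candidates.append((tokens, score))
                continue
            for token, logp in decoder_step(tokens).items():
                candidates.append((tokens + [token], score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]   # prune to top k
        if all(tokens[-1] == eos_id for tokens, _ in beams):
            break
    return beams[0][0]

# Toy step: EOS (id 1) stays unlikely until the hypothesis is five tokens long.
def toy_step(tokens):
    if len(tokens) > 4:
        return {1: math.log(0.8), 7: math.log(0.1), 8: math.log(0.1)}
    return {1: math.log(0.05), 7: math.log(0.6), 8: math.log(0.35)}

print(beam_search(toy_step, bos_id=2, eos_id=1))  # best-scoring hypothesis, ending in EOS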

Slide 26

Beam Search example (figure)

Slide 27

Future Approach
● Implement the Seq2Seq model, attention model, etc.
● Test with the CNN/DM dataset and the scraped movies dataset.
● Tune the model, under supervision, to gain deeper insights.

Slide 28

References
● https://arxiv.org/pdf/1805.11080.pdf
● http://bdewilde.github.io/blog/2014/09/23/intro-to-automatic-keyphrase-extraction/
● http://home.iitk.ac.in/~soumye/cs498a/pres.pdf
● https://github.com/icoxfog417/awesome-text-summarization
● https://www.aclweb.org/anthology/D/D15/D15-1044.pdf
● https://www.cs.cmu.edu/~bhiksha/courses/deeplearning/Fall.2015/slides/lec14.neubig.seq_to_seq.pdf

Slide 29

Thank you!