Text Summarization
Abhishek Gautam (BT15CSE002)
Atharva Parwatkar (BT15CSE015)
Sharvil Nagarkar (BT15CSE052)
Under Prof. U. A. Deshpande
Slide 2
Recap
● We created an unsupervised model using sentence
embeddings
● The model was extractive in nature
Slide 3
Abstractive summarisation
● Limitations of extractive summarisation
● What abstractive model does differently
● Our approaches
Slide 4
Our Approaches
● A domain specific abstractive model
● A neural network using reinforcement learning
● A Seq2Seq Neural Attention model
Slide 5
Domain specific abstractive model
● Data
○ CNN/DailyMail dataset
○ Movie subtitles and their corresponding plots
■ Scraped movie subtitles data from https://yifysubtitles.org
■ Scraped Wikipedia articles for extracting plots of movies
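A minimal sketch of how the Wikipedia plot scraping could look; the article title, the "Plot" heading, and the parsing logic are illustrative assumptions, not the exact scraper used:

```python
# Sketch: pull the "Plot" section of a movie's Wikipedia article.
# Article title and heading name are assumptions; adjust per movie.
import requests
from bs4 import BeautifulSoup

def fetch_plot(article_title):
    url = f"https://en.wikipedia.org/wiki/{article_title}"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    paragraphs, in_plot = [], False
    for tag in soup.find_all(["h2", "p"]):
        if tag.name == "h2":
            # enter the plot section, leave it at the next heading
            in_plot = tag.get_text().strip().startswith("Plot")
        elif in_plot:
            paragraphs.append(tag.get_text().strip())
    return "\n".join(paragraphs)

print(fetch_plot("Inception")[:500])  # hypothetical example article
```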
Slide 6
Domain specific abstractive model
● Extractive model for pulling out most of the relevant information
● Applying key-phrase extraction on the intermediate text to generate a
domain-specific summary
Slide 7
Key Phrase Extraction
● Automatic keyphrase extraction is typically a two-step process:
○ A set of words and phrases that could convey the topical content of a
document are identified.
○ Then these candidates are scored/ranked and the “best” are selected as a
document’s keyphrases.
● Key phrase extraction is performed using spaCy.
Slide 8
1. Candidate Identification
● Common heuristics include filtering for words with certain parts of speech or,
for multi-word phrases, certain POS patterns; and using external knowledge
bases like WordNet or Wikipedia as a reference source of good/bad keyphrases.
● Noun phrases matching the POS pattern {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}
(a regular expression written in a simplified format)
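A minimal sketch of this candidate-identification step with spaCy; the pattern below (optional adjectives followed by nouns, spaCy v3 Matcher API) only approximates the POS pattern above and is an illustrative assumption:

```python
# Sketch: candidate keyphrase identification with spaCy.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# optional adjectives followed by one or more nouns
matcher.add("CANDIDATE", [[{"POS": "ADJ", "OP": "*"}, {"POS": "NOUN", "OP": "+"}]])

def candidates(text):
    doc = nlp(text)
    # keep the longest non-overlapping matches as candidate phrases
    spans = spacy.util.filter_spans([doc[s:e] for _, s, e in matcher(doc)])
    return [span.text.lower() for span in spans]

print(candidates("Automatic keyphrase extraction identifies topical noun phrases."))
```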
Slide 9
2. Keyphrase Selection
● Graph-based ranking method, in which the importance of a candidate is
determined by its relatedness to other candidates, where “relatedness” may be
measured by two terms’ frequency of co-occurrence or semantic relatedness.
● This method assumes that more important candidates are related to a greater
number of other candidates, and that more of those related candidates are also
considered important.
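A minimal sketch of this graph-based ranking idea (TextRank-style), assuming candidates have already been grouped per sentence; the co-occurrence weighting and the networkx PageRank call are illustrative choices:

```python
# Sketch: rank candidate keyphrases by PageRank over a co-occurrence graph.
import itertools
import networkx as nx

def rank_keyphrases(candidate_lists, top_k=5):
    graph = nx.Graph()
    for terms in candidate_lists:
        # candidates co-occurring in the same sentence are considered related
        for a, b in itertools.combinations(set(terms), 2):
            if graph.has_edge(a, b):
                graph[a][b]["weight"] += 1
            else:
                graph.add_edge(a, b, weight=1)
    scores = nx.pagerank(graph, weight="weight")
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

docs = [["keyphrase extraction", "document"],
        ["document", "topical content"],
        ["keyphrase extraction", "topical content", "ranking"]]
print(rank_keyphrases(docs, top_k=3))
```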
Slide 10
Model using reinforcement
learning
● Sequence to sequence sentence generation
● Reinforcement learning
Slide 11
Sequence to sequence sentence generation
● Used for various NLP tasks such as machine translation, Q&A, etc.
● An encoder-decoder architecture is used.
○ It consists of LSTM or bidirectional LSTM units.
○ Word embeddings are fed to the encoder at each timestep.
○ The encoder creates a context vector and passes it to the decoder when it receives an
EOS (end-of-sentence) symbol.
○ At each timestep, the decoder predicts the next word from its previous hidden state
and the previously predicted word, until it predicts an EOS symbol.
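A minimal PyTorch sketch of the encoder-decoder loop described above; the vocabulary size, hidden size, and greedy decoding are illustrative assumptions, not the exact model:

```python
# Sketch: LSTM encoder-decoder for sequence-to-sequence generation.
import torch
import torch.nn as nn

VOCAB, EMB, HID, EOS = 10_000, 128, 256, 1   # illustrative sizes

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.encoder = nn.LSTM(EMB, HID, batch_first=True)
        self.decoder = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, src, max_len=30):
        # the encoder reads the whole source and hands its final state to the decoder
        _, state = self.encoder(self.embed(src))
        token = torch.full((src.size(0), 1), EOS, dtype=torch.long)  # start symbol
        outputs = []
        for _ in range(max_len):
            dec_out, state = self.decoder(self.embed(token), state)
            logits = self.out(dec_out[:, -1])
            token = logits.argmax(dim=-1, keepdim=True)  # greedy choice of next word
            outputs.append(token)
        return torch.cat(outputs, dim=1)

summary_ids = Seq2Seq()(torch.randint(0, VOCAB, (2, 20)))  # two dummy source sequences
print(summary_ids.shape)  # (2, 30)
```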
Slide 12
Sequence to sequence
sentence generation
Slide 13
Encoder using a bidirectional LSTM (diagram)
Slide 14
Reinforcement learning
● Extractor: CNN-then-RNN.
● The extractor generates representations of important words, phrases and sentences.
● Using the extractor’s output, important sentences are selected using a Pointer Network
(not shown in the image).
Slide 15
Reinforcement learning
● The abstractor network then compresses and rewrites the extracted document sentences
into concise summary sentences.
● The ROUGE score is calculated and the extractor is then trained using it as the reward.
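A minimal sketch of a ROUGE-1-recall-style reward that could drive the extractor's training; a real setup would use a full ROUGE implementation, this only makes the reward signal concrete:

```python
# Sketch: unigram-overlap (ROUGE-1 recall style) reward between a generated
# summary sentence and a reference sentence.
from collections import Counter

def rouge1_recall(generated, reference):
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(gen[w], ref[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)

reward = rouge1_recall("russia calls for joint front against terrorism",
                       "russia calls for joint front with rebels against terrorism")
print(round(reward, 3))  # fraction of reference words recovered by the summary
```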
Slide 16
Reinforcement learning (Extractor)
Slide 17
What is attention?
● What does the current model do?
● How does attention help?
Slide 18
What does the current model do?
Diagram: the input words "... front against russian terrorism defence ..." fed into the encoder-decoder.
Slide 19
What does the current model do?
● In the picture, “front”, “against” and “terrorism” words are fed into an encoder,
and after a special signal the decoder starts producing a translated (simplified)
sentence.
● The decoder is supposed to generate a translation solely based on the last hidden
state from the encoder.
● It seems unreasonable to assume that we can encode all information about a
potentially very long sentence into a single vector and then have the decoder
produce a good translation based on only that.
● The attention model solves this problem.
Slide 20
Attention Model
● With an attention mechanism we no longer try to encode the full source sentence
into a fixed-length vector.
● Rather, we allow the decoder to “attend” to different parts of the source sentence
at each step of the output generation.
● We let the model learn what to attend to based on the input sentence and what it
has produced so far.
● Each decoder output word depends on a weighted combination of all the input hidden
states, not just the last one.
● Weights are updated during training.
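A minimal sketch of this weighted combination using simple dot-product scores; the actual scoring function (e.g. an additive, Bahdanau-style score) may differ:

```python
# Sketch: dot-product attention over encoder hidden states.
# encoder_states: (seq_len, hidden), decoder_state: (hidden,)
import torch
import torch.nn.functional as F

def attend(decoder_state, encoder_states):
    scores = encoder_states @ decoder_state   # one score per input position
    weights = F.softmax(scores, dim=0)        # attention distribution over the input
    context = weights @ encoder_states        # weighted combination of hidden states
    return context

enc = torch.randn(50, 256)   # 50 input words, hidden size 256
dec = torch.randn(256)
print(attend(dec, enc).shape)  # torch.Size([256])
```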
Slide 21
Cost of using Attention Model
● We need to calculate an attention value for each
combination of input and output word.
● If you have a 50-word input sequence and generate a
50-word output sequence that would be 2500 attention
values.
Slide 22
An Example: Visualization of the attention model
Slide 23
Generating Summaries
● Beam Search
Slide 24
Generating Summaries
At each step, the decoder outputs a probability distribution over the target vocabulary.
To get the output word at this step we can do the following:
● Greedy Sampling
○ Choose the word with highest probability at each timestep.
○ Often produces suboptimal output, since an early greedy choice cannot be revised later.
● A better approach is to use Beam Search.
Slide 25
Beam Search
● Ideally, all possible branches should be checked for the best result.
● But this is not feasible, as the number of possible hypotheses is exponential.
● Hence, we compromise between an exact solution and the greedy approach using
Beam Search.
● Essentially, Beam Search maintains the top k hypotheses for the summary.
● It uses pruning to retain the top k results.
● This ensures that each candidate word gets a fair chance of appearing in the summary.
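A minimal sketch of beam search over a decoder's next-word log-probabilities; the toy step function and the beam width are illustrative assumptions:

```python
# Sketch: beam search over a step function that returns log-probabilities of the
# next token given a prefix. `step` is a stand-in for the trained decoder.
import math
import heapq

def beam_search(step, eos, beam_width=3, max_len=20):
    beams = [(0.0, [])]                      # (cumulative log-prob, token sequence)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq and seq[-1] == eos:       # finished hypotheses are carried over
                candidates.append((score, seq))
                continue
            for tok, logp in step(seq).items():
                candidates.append((score + logp, seq + [tok]))
        # pruning: keep only the top-k hypotheses
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beams, key=lambda c: c[0])[1]

def toy_step(prefix):
    # toy distribution: prefer ending the sentence after three words
    if len(prefix) >= 3:
        return {"<eos>": math.log(0.7), "summary": math.log(0.3)}
    return {"summary": math.log(0.6), "model": math.log(0.4)}

print(beam_search(toy_step, eos="<eos>"))
```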
Slide 26
Beam Search Example
Slide 27
Future Approach
● Implementation of Seq2Seq model, attention model, etc.
● Testing with CNN/DM dataset, scraped movies dataset.
● Tuning the model, under supervision, to gain deep insights.