Text summarization Phase 2 Evaluation 1

Presentation for Phase 2, Evaluation 1 of our final-year text summarization project, carried out under Professor U. A. Deshpande in collaboration with TCS.

Up to this evaluation, we studied the sequence-to-sequence (Seq2Seq) encoder-decoder architecture, the attention model, and a reinforcement learning model for abstractive summarization.

Phase 1 evaluation 2 presentation: https://speakerdeck.com/gautamabhishek46/text-summarization-phase-1-evaluation-2

Abhishek Gautam
Atharva Parwatkar
Sharvil Nagarkar

Professor in-charge: U. A. Deshpande
TCS Mentor: Dr. Sagar Sunkle

February 19, 2019


  1. Our Approaches
    • A domain-specific abstractive model
    • A neural network using reinforcement learning
    • A Seq2Seq neural attention model
  2. Domain-specific abstractive model
    • Data
      ◦ CNN/Daily Mail dataset
      ◦ Movie subtitles and their corresponding plots
        ▪ Scraped movie subtitle data from https://yifysubtitles.org
        ▪ Scraped Wikipedia articles to extract movie plots
  3. Domain-specific abstractive model
    • An extractive model pulls out most of the relevant information.
    • Key-phrase extraction is then applied to the intermediate text to generate the domain-specific summary.
  4. Key Phrase Extraction
    • Automatic keyphrase extraction is typically a two-step process:
      ◦ A set of words and phrases that could convey the topical content of a document is identified.
      ◦ These candidates are then scored/ranked, and the “best” are selected as the document’s keyphrases.
    • Keyphrase extraction is implemented with spaCy.
  5. Step 1: Candidate Identification
    • Common heuristics include filtering for words with certain parts of speech or, for multi-word phrases, certain POS patterns, and using external knowledge bases such as WordNet or Wikipedia as a reference source of good/bad keyphrases.
    • A typical choice is noun phrases matching the POS pattern {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+} (a regular expression over POS tags, written in a simplified format).
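The POS pattern above can be tried out with a minimal sketch like the following. It assumes the tokens are already tagged with Penn Treebank tags; the tags here are written by hand, whereas in practice they would come from a POS tagger such as the spaCy pipeline mentioned earlier. The helper names `tag_to_symbol` and `candidate_phrases` are hypothetical, and mapping each tag to a single character is just one convenient way to reuse Python's `re` module.

```python
import re

def tag_to_symbol(tag):
    """Collapse a Penn Treebank tag to one character (hypothetical helper)."""
    if tag == "JJ":
        return "J"          # <JJ>: adjective
    if tag.startswith("NN"):
        return "N"          # <NN.*>: NN, NNS, NNP, NNPS
    if tag == "IN":
        return "I"          # <IN>: preposition
    return "O"              # anything else breaks a candidate

# {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+} rewritten over the one-character symbols
PATTERN = re.compile(r"(?:J*N+I)?J*N+")

def candidate_phrases(tagged_tokens):
    """Return every token span whose tag sequence matches the pattern."""
    symbols = "".join(tag_to_symbol(tag) for _, tag in tagged_tokens)
    return [
        " ".join(word for word, _ in tagged_tokens[m.start():m.end()])
        for m in PATTERN.finditer(symbols)
    ]

# Hand-tagged example sentence (the tags are an assumption, not tagger output)
tagged = [
    ("automatic", "JJ"), ("keyphrase", "NN"), ("extraction", "NN"),
    ("is", "VBZ"), ("a", "DT"), ("process", "NN"), ("of", "IN"),
    ("statistical", "JJ"), ("ranking", "NN"),
]
candidates = candidate_phrases(tagged)
```

For this sentence the matcher yields two candidates: "automatic keyphrase extraction" and "process of statistical ranking".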
  6. Step 2: Keyphrase Selection
    • Graph-based ranking method: the importance of a candidate is determined by its relatedness to other candidates, where “relatedness” may be measured by two terms’ frequency of co-occurrence or by their semantic relatedness.
    • This method assumes that more important candidates are related to a greater number of other candidates, and that more of those related candidates are themselves important.
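One way to realise graph-based ranking is a PageRank-style iteration over a co-occurrence graph, as in TextRank. The sketch below is an illustration under assumptions, not necessarily the ranking used in the project: the window size, damping factor, iteration count and the function names `cooccurrence_graph` and `rank_candidates` are all chosen for the example.

```python
from collections import defaultdict

def cooccurrence_graph(tokens, window=2):
    """Link each candidate to the candidates that follow it within `window` positions."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph

def rank_candidates(graph, damping=0.85, iterations=50):
    """PageRank-style scoring: a node is important if its neighbours are important."""
    nodes = list(graph)
    score = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        updated = {}
        for node in nodes:
            # Each neighbour shares its score equally among its own neighbours.
            incoming = sum(score[nb] / len(graph[nb]) for nb in graph[node])
            updated[node] = (1 - damping) / len(nodes) + damping * incoming
        score = updated
    return sorted(score.items(), key=lambda item: item[1], reverse=True)

# Toy candidate sequence: "model" co-occurs with every other candidate
tokens = ["summary", "model", "attention", "model", "decoder", "model", "encoder"]
ranked = rank_candidates(cooccurrence_graph(tokens, window=1))
```

Here "model" ends up with the highest score because it is connected to all other candidates, which matches the assumption stated on the slide.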
  7. Sequence to sequence sentence generation
    • Used for various NLP tasks such as machine translation, Q&A, etc.
    • An encoder-decoder architecture is used.
      ◦ It consists of LSTMs or bidirectional LSTMs.
      ◦ Word embeddings are fed to the encoder at each timestep.
      ◦ The encoder builds a context vector and passes it to the decoder when it receives an EOS (end-of-sentence) symbol.
      ◦ At each timestep, the decoder predicts the next word from its previous hidden state and the previously predicted word, until it predicts an EOS symbol.
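The data flow described above can be sketched in a few lines. This is a toy under loud assumptions: it uses a plain tanh RNN cell instead of an LSTM, the weights are random and untrained (so the generated words are meaningless), the vocabulary is made up, and the decoder is started with the EOS symbol as its first input. Only the control flow (encode up to EOS, hand over the context vector, decode word by word until EOS) reflects the slide.

```python
import math
import random

VOCAB = ["<eos>", "front", "against", "terrorism", "fight"]  # toy vocabulary (assumption)
HIDDEN = 4
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Random, untrained parameters: embeddings, encoder, decoder and output layer
EMBED = {word: [rng.uniform(-1, 1) for _ in range(HIDDEN)] for word in VOCAB}
W_ENC_IN, W_ENC_H = rand_matrix(HIDDEN, HIDDEN), rand_matrix(HIDDEN, HIDDEN)
W_DEC_IN, W_DEC_H = rand_matrix(HIDDEN, HIDDEN), rand_matrix(HIDDEN, HIDDEN)
W_OUT = rand_matrix(len(VOCAB), HIDDEN)

def rnn_step(w_in, w_h, x, h):
    """One tanh RNN step (stand-in for an LSTM cell)."""
    return [math.tanh(a + b) for a, b in zip(matvec(w_in, x), matvec(w_h, h))]

def encode(words):
    """Feed one embedding per timestep; the state after EOS is the context vector."""
    h = [0.0] * HIDDEN
    for word in words + ["<eos>"]:
        h = rnn_step(W_ENC_IN, W_ENC_H, EMBED[word], h)
    return h

def decode(context, max_len=10):
    """Predict words from the previous state and previous word until EOS."""
    h, prev, output = context, "<eos>", []
    for _ in range(max_len):
        h = rnn_step(W_DEC_IN, W_DEC_H, EMBED[prev], h)
        scores = matvec(W_OUT, h)
        prev = VOCAB[scores.index(max(scores))]  # pick the highest-scoring word
        if prev == "<eos>":
            break
        output.append(prev)
    return output

context = encode(["front", "against", "terrorism"])
summary = decode(context)
```

Note that the decoder sees the input only through `context`, a single fixed-length vector.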
  8. Reinforcement learning
    • Extractor: CNN-then-RNN.
    • The extractor generates representations of important words, phrases and sentences.
    • Using the extractor’s output, important sentences are selected with a Pointer Network (not shown in the image).
  9. Reinforcement learning
    • The abstractor network then compresses and rewrites the extracted document sentences into concise summary sentences.
    • The ROUGE score of the result is calculated, and the extractor is then trained using it.
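To make the training signal concrete, here is a simplified ROUGE-L F1 computation based on the longest common subsequence of tokens. The slide does not say which ROUGE variant is used, so ROUGE-L is an assumption, and this sketch (lowercasing and whitespace tokenisation only) is not the official ROUGE implementation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 between a generated and a reference sentence."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# The score of a rewritten sentence against the reference can serve as a reward
reward = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
```

In this example the longest common subsequence is "the cat on the mat" (5 of 6 tokens on each side), so precision, recall and F1 are all 5/6.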
  10. What does the current model do?
    • In the picture, the words “front”, “against” and “terrorism” are fed into an encoder, and after a special signal the decoder starts producing a translated (simplified) sentence.
    • The decoder is supposed to generate a translation based solely on the last hidden state of the encoder.
    • It seems unreasonable to assume that we can encode all the information about a potentially very long sentence into a single vector and then have the decoder produce a good translation from that alone.
    • The attention model addresses this problem.
  11. Attention Model
    • With an attention mechanism, we no longer try to encode the full source sentence into a fixed-length vector.
    • Rather, we allow the decoder to “attend” to different parts of the source sentence at each step of the output generation.
    • We let the model learn what to attend to based on the input sentence and what it has produced so far.
    • Each decoder output word depends on a weighted combination of all the encoder hidden states, not just the last one.
    • The parameters that produce these weights are learned during training.
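The "weighted combination of all the hidden states" can be illustrated for a single decoder step. The sketch assumes dot-product scoring between the decoder state and each encoder state, which is only one of several scoring functions in use, and the vectors are made-up numbers rather than outputs of a trained model.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Return attention weights and the resulting context vector for one decoder step."""
    # Dot-product score between the decoder state and every encoder state (assumption)
    scores = [sum(d * h for d, h in zip(decoder_state, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(decoder_state)
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context

# Three encoder hidden states (one per input word) and one decoder state
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
decoder_state = [0.0, 2.0]
weights, context = attend(decoder_state, encoder_states)
```

The second encoder state is the most similar to the decoder state, so it receives the largest weight. One weight is computed per input word at every decoder step.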
  12. Cost of using Attention Model
    • We need to calculate an attention value for each combination of input and output word.
    • With a 50-word input sequence and a 50-word output sequence, that is 50 × 50 = 2,500 attention values.
  13. Generating Summaries
    At each step, the decoder outputs a probability distribution over the target vocabulary. To get the output word at this step, we can do the following:
    • Greedy sampling
      ◦ Choose the word with the highest probability at each timestep.
      ◦ This can produce poor results, since the locally best word may not lead to the best overall sequence.
    • A better approach is to use Beam Search.
  14. Beam Search
    • Ideally, all possible branches should be checked to find the best result.
    • But this is not feasible, as the number of possible hypotheses is exponential in the output length.
    • Hence, Beam Search is a compromise between an exact solution and the greedy approach.
    • Essentially, Beam Search maintains the top k hypotheses for the summary.
    • It uses pruning to retain only the top k results at each step.
    • This gives words that are not the single most probable choice at a step a chance to contribute to the summary.
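The pruning idea can be sketched as follows. The next-word distribution `toy_model` is invented for the example and stands in for the decoder's output; the sketch also omits length normalisation, which practical implementations often add. With a beam width of 1 the procedure reduces to the greedy approach.

```python
import math

EOS = "<eos>"

def toy_model(prefix):
    """Hypothetical next-word distribution standing in for the decoder."""
    table = {
        (): {"a": 0.6, "the": 0.4},
        ("a",): {"cat": 0.4, "dog": 0.3, EOS: 0.3},
        ("the",): {"summary": 0.9, EOS: 0.1},
    }
    return table.get(prefix, {EOS: 1.0})

def beam_search(next_word_probs, beam_width=2, max_len=10):
    """Keep the `beam_width` most probable partial hypotheses at every step."""
    beams = [((), 0.0)]                      # (tokens so far, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            for word, p in next_word_probs(tokens).items():
                if p > 0:
                    candidates.append((tokens + (word,), logp + math.log(p)))
        # Prune: only the top-k extensions survive this step
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, logp in candidates[:beam_width]:
            (finished if tokens[-1] == EOS else beams).append((tokens, logp))
        if not beams:
            break
    finished.extend(beams)                   # hypotheses cut off by max_len
    return max(finished, key=lambda c: c[1])

best_tokens, best_logp = beam_search(toy_model, beam_width=2)
greedy_tokens, greedy_logp = beam_search(toy_model, beam_width=1)
```

In this toy setting the greedy run commits to "a" (probability 0.6) and ends with "a cat" at overall probability 0.24, while the width-2 beam keeps "the" alive and finds "the summary" at 0.36.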
  15. Future Approach
    • Implementation of the Seq2Seq model, attention model, etc.
    • Testing with the CNN/DM dataset and the scraped movies dataset.
    • Tuning the model, under supervision, to gain deeper insights.