Slide 1

Slide 1 text

Text Summarization Abhishek Gautam (BT15CSE002) Atharva Parwatkar (BT15CSE015) Sharvil Nagarkar (BT15CSE052) Under the guidance of Prof. U. A. Deshpande

Slide 2

Slide 2 text

Problem Statement Build a mechanism to reduce the size of a document, give it some structure, and process it so that it conveys most of the information in the original text.

Slide 3

Slide 3 text

Motivation ● Reduced reading time. ● Digital documents are generated every day ○ 2 million articles (news or blog posts) are published daily. (Source) ○ 4.3 billion messages are generated on Facebook daily. (Source) ● Textual data produced at this speed can quickly accumulate into huge volumes. ● Most of this data is unstructured.

Slide 4

Slide 4 text

Motivation ● Text summarization improves the effectiveness of indexing. ● Personalized summaries for question-answering systems. ● Summarization algorithms are less biased than humans.

Slide 5

Slide 5 text

What is text summarization

Slide 6

Slide 6 text

What is text summarization Producing a new, small piece of text from the original long text(s) that contains a significant portion of the information in the original text(s) and is significantly smaller than them. This new piece of text is called a summary.

Slide 7

Slide 7 text

Use cases

Slide 8

Slide 8 text

Use cases ● Analysing reviews of a product ● Generate news headlines ● Generate notes for students ● Generate minutes (of a meeting) ● Generate previews (of books or movies)

Slide 9

Slide 9 text

Types of text summarization processes

Slide 10

Slide 10 text

Types of text summarization processes ● Compressive text summarization ● Extractive text summarization ● Abstractive text summarization

Slide 11

Slide 11 text

Compressive text summarization A compressive summary is produced by deleting specific words, phrases or sentences from the original input text while preserving word order, so that most of the information in the original text is retained.

Slide 12

Slide 12 text

Extractive text summarization An extractive summary is produced by pulling specific words, phrases or sentences from the original input text, without considering word order, so that most of the information in the original text is retained in the summary.

Slide 13

Slide 13 text

Example of text Source: https://en.wikipedia.org/wiki/A._P._J._Abdul_Kalam

Slide 14

Slide 14 text

Example of extractive summary Source: Link

Slide 15

Slide 15 text

Abstractive summary An abstractive summary may use any words; its wording is not restricted to the input text. The summary is produced in such a way that it conveys most of the information while being significantly smaller in size.

Slide 16

Slide 16 text

Example summary An innocent hobbit of The Shire journeys with eight companions to the fires of Mount Doom to destroy the One Ring and the dark lord Sauron forever.

Slide 17

Slide 17 text

Text Summarization Techniques ● Supervised ● Unsupervised

Slide 18

Slide 18 text

Unsupervised Techniques ● TextRank - Traditional Approach

Slide 19

Slide 19 text

TextRank ● This approach models the document as a graph and uses an algorithm similar to Google’s PageRank algorithm to find top-ranked sentences. ● The PageRank value of a page is essentially the probability of a user visiting that page. ● Finds how similar each sentence is to all other sentences in the text. ● The most important sentence is the one that is most similar to all the others.
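The idea above can be sketched in a few lines of plain Python. This is a minimal, illustrative implementation, not the exact algorithm from any library: sentence similarity is word overlap normalized by sentence lengths (as in the original TextRank formulation), and scores are refined with PageRank-style iterations. The toy sentences and function names are assumptions for illustration.

```python
import math

def similarity(s1, s2):
    # Word overlap normalized by the log of the sentence lengths,
    # following the normalization used in the TextRank paper.
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    overlap = len(w1 & w2)
    if overlap == 0:
        return 0.0
    return overlap / (math.log(len(w1) + 1) + math.log(len(w2) + 1))

def textrank(sentences, damping=0.85, iterations=50):
    n = len(sentences)
    # Build the weighted similarity graph (no self-loops).
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sim[i][j] = similarity(sentences[i], sentences[j])
    # PageRank-style iteration: a sentence's score is fed by the
    # scores of the sentences similar to it.
    scores = [1.0] * n
    for _ in range(iterations):
        new = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if sim[j][i] > 0:
                    rank += sim[j][i] / sum(sim[j]) * scores[j]
            new.append((1 - damping) + damping * rank)
        scores = new
    return scores
```

On a toy input, the off-topic sentence ends up with the lowest score, so picking the top-scoring sentences yields an extractive summary.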

Slide 20

Slide 20 text

Limitations of TextRank ● Rule-based ranking. ● A slight change in the similarity function can dramatically affect the generated summary.

Slide 21

Slide 21 text

Supervised Techniques ● Supervised techniques make use of a collection of documents and their corresponding human-generated summaries to train the model. ● Features (e.g. the number of words in a sentence, the presence of keywords) are taken into account when deciding whether or not to include a sentence in the summary.

Slide 22

Slide 22 text

Supervised Technique : Example

Slide 23

Slide 23 text

Supervised Technique Approaches Judging the importance of a sentence using feature categories: ● Surface features - position and length of the sentence ● Content features - statistics of content-bearing words ● Relevance features - exploiting inter-sentence relationships Sentences are then ranked accordingly, and the top-scoring ones become part of the summary.

Slide 24

Slide 24 text

Rise of Deep Learning Techniques ● Deep learning and machine learning techniques have made breakthroughs in text summarization. ● Deep learning can be used to determine whether or not a sentence, based on several key features from the text, should be a part of the summary. ● By tokenizing the text into paragraphs and sentences, and analyzing each sentence through a neural network, one can create a comprehensive summary.

Slide 25

Slide 25 text

Supervised Techniques ● Using Convolutional Neural Networks - For extractive text summarization ● Using Recurrent Neural Networks - For abstractive text summarization

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

Supervised Technique Drawbacks ● Training data is not readily available and is expensive to generate. ● The available human-generated summaries are quite abstractive in nature.

Slide 28

Slide 28 text

Our Approach

Slide 29

Slide 29 text

Our Approach ● Use of a deep learning model for extractive summarization. ● Using the BBC news summary dataset from kaggle.com. ● Selection of features that will be used by the model to decide whether a sentence should be part of the summary or not. ● Statistical analysis of results.

Slide 30

Slide 30 text

Workflow Gathering Dataset → Feature Selection → Feature Extraction → Training the model → Evaluation and Fine-tuning → Finish

Slide 31

Slide 31 text

Gathering Datasets ● BBC news summary dataset taken from kaggle.com (for now) ● Spread across domains, e.g. Business, Sport, Politics. ● More than 400 news report-summary pairs for each domain.

Slide 32

Slide 32 text

Features ● Sentiment Difference: Sentences whose sentiment is similar to that of the overall text belong in the summary. ● Proper Noun Ratio: Sentences with more proper nouns are more likely to be pivotal to the summary. ● Stat Ratio: Sentences containing statistics give important information that should typically be included in the summary.

Slide 33

Slide 33 text

Features ● Keyword Score: More keywords in a sentence indicate that the sentence is important. ● Sentence Length: The more words in a sentence, the more important the sentence is. ● Sentence Position: Sentences at the beginning or end of paragraphs should be considered more important. ● Quotes and Dialogues: Sentences containing quotes and dialogues should generally be considered more important.
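A few of the features listed above can be approximated with simple heuristics. The sketch below is illustrative only: the function names, the capitalization heuristic for proper nouns, and the digit test for statistics are assumptions, not the project's actual implementation (which could use NLTK POS tagging instead).

```python
import re

def proper_noun_ratio(sentence):
    # Crude heuristic: count capitalized words that are not
    # sentence-initial (a POS tagger would be more accurate).
    words = sentence.split()
    if len(words) <= 1:
        return 0.0
    proper = sum(1 for w in words[1:] if w[:1].isupper())
    return proper / (len(words) - 1)

def stat_ratio(sentence):
    # Fraction of tokens containing a digit (dates, counts, percentages).
    words = sentence.split()
    if not words:
        return 0.0
    return sum(1 for w in words if re.search(r"\d", w)) / len(words)

def keyword_score(sentence, keywords):
    # Number of distinct keywords appearing in the sentence.
    words = {w.strip(".,").lower() for w in sentence.split()}
    return len(words & keywords)
```

For example, `stat_ratio("Profits rose 12% in 2004")` counts two of five tokens as statistical.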

Slide 34

Slide 34 text

Feature Extraction ● Feature extraction involves assigning numerical values to the above features for each sentence. ● First, the input text is tokenized; these tokens are then processed to represent each sentence as a vector. ● These vectors are then fed as input parameters to the model.
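The pipeline above can be sketched as follows. This is a minimal illustration, not the project's code: plain string splitting stands in for NLTK tokenization, and the three features chosen (sentence length, normalized position, keyword count) are a small subset of those listed on the previous slides.

```python
def extract_features(text, keywords):
    # Tokenize the text into sentences (naive split on periods;
    # a real pipeline would use NLTK's sentence tokenizer).
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    n = len(sentences)
    vectors = []
    for i, sent in enumerate(sentences):
        words = sent.lower().split()
        length = len(words)                  # sentence length feature
        position = i / max(n - 1, 1)         # 0.0 = first, 1.0 = last
        kw = sum(1 for w in words if w in keywords)  # keyword score
        vectors.append([length, position, kw])
    return sentences, vectors
```

Each sentence becomes a fixed-length numeric vector, and these vectors form the input matrix fed to the model for training or prediction.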

Slide 35

Slide 35 text

Goals for this semester ● To study text summarization and the theory around it. ● To gather a dataset and decide on the features that will be used to train our model. ● To start the implementation and come up with a prototype.

Slide 36

Slide 36 text

Challenges ● To identify potential relevant features. ● To assign these features a numeric value for every sentence (feature extraction). ● To keep the model lightweight.

Slide 37

Slide 37 text

Technology Stack ● Keras ● NLTK (Natural Language Toolkit) ● Python 3 ● Google Natural Language Processing API

Slide 38

Slide 38 text

Conclusion ● Current summarization systems are widely used to summarize news and other online articles. ● Most of the current research is based on extractive summarization. ● Abstractive summarization has not reached a mature stage because allied problems such as semantic representation, inference and natural language generation are relatively harder.

Slide 39

Slide 39 text

Thanks!