
Text summarization Phase 1 evaluation

Phase 1 evaluation of the text summarization final-year project under Prof. U. A. Deshpande, in collaboration with TCS.

In Phase 1 we studied the reference material provided by our TCS mentor, surveyed different methods of text summarization, including IR (information retrieval) approaches, and defined goals to be completed by the next evaluation.

Team:
Abhishek Gautam
Atharva Parwatkar
Sharvil Nagarkar

Professor in-charge: U. A. Deshpande
TCS Mentor: Sagar Sunkle

Abhishek Gautam

October 25, 2018

Transcript

  1. Text Summarization
    Abhishek Gautam (BT15CSE002)
    Atharva Parwatkar (BT15CSE015)
    Sharvil Nagarkar (BT15CSE052)
    Under Prof. U. A. Deshpande

  2. Problem Statement
    Build a mechanism to reduce the size of a document, give it some
    structure, and process it so that the result conveys most of the
    information in the original text.

  3. Motivation
    ● Reduced reading time.
    ● Digital documents are generated every day:
    ○ 2 million articles (news or blog posts) are published daily. (Source)
    ○ 4.3 billion messages are generated on Facebook daily. (Source)
    ● Textual data produced at this rate quickly accumulates into huge volumes.
    ● Most of this data is unstructured.

  4. Motivation
    ● Text summarization improves the effectiveness of indexing.
    ● Personalized summaries for question-answering systems.
    ● Summarization algorithms are less biased than humans.

  5. What is text summarization

  6. What is text summarization
    Producing a new, short piece of text from the original long text(s) that
    contains a significant portion of the information in the original while
    being significantly smaller than it.
    The new piece of text is called a summary.

  7. Use cases

  8. Use cases
    ● Analysing product reviews
    ● Generating news headlines
    ● Generating notes for students
    ● Generating minutes of meetings
    ● Generating previews of books or movies

  9. Types of text summarization
    processes

  10. Types of text summarization processes
    ● Compressive text summarization
    ● Extractive text summarization
    ● Abstractive text summarization

  11. Compressive text summarization
    A compressive summary is produced by deleting specific words, phrases or
    sentences from the original input text while preserving the order of the
    remaining words, so that most of the information in the original text is retained.

  12. Extractive text summarization
    An extractive summary is produced by pulling specific words, sentences or
    phrases from the original input text, without considering word order, so that
    most of the information in the original text is retained in the summary.
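    The extractive idea can be illustrated with a minimal word-frequency
    sketch (a deliberate simplification, not the approach the project itself
    takes): sentences whose words occur often in the whole document are
    pulled out as the summary.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    """Pick the sentences whose words occur most often in the whole text."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    # Score each sentence by the total frequency of its words.
    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    ranked = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Emit the selected sentences in their original order.
    return [s for s in sentences if s in ranked]
```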

  13. Example of text
    Source: https://en.wikipedia.org/wiki/A._P._J._Abdul_Kalam

  14. Example of extractive summary
    Source: Link

  15. Abstractive summary
    An abstractive summary may be produced using any words; the choice of
    words does not depend on the input text. The summary is produced so that
    it conveys most of the information while being significantly smaller in size.

  16. Example summary
    An innocent hobbit of The Shire
    journeys with eight companions
    to the fires of Mount Doom to
    destroy the One Ring and the
    dark lord Sauron forever.

  17. Text Summarization Techniques
    ● Supervised
    ● Unsupervised

  18. Unsupervised Techniques
    ● TextRank - Traditional Approach

  19. TextRank
    ● This approach models the document as a graph and uses an algorithm
    similar to Google’s PageRank algorithm to find top-ranked sentences.
    ● The PageRank value of a page is essentially the probability of a user
    visiting that page.
    ● Finds how similar each sentence is to all other sentences in the text.
    ● The most important sentence is the one that is most similar to all the
    others.
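    A minimal sketch of this idea, assuming a simple word-overlap
    similarity and plain power iteration in place of a full PageRank
    implementation (real TextRank variants use richer similarity functions):

```python
import re

def textrank_summary(text, num_sentences=2, damping=0.85, iterations=50):
    """Extractive summary via a PageRank-style ranking of sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    token_sets = [set(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    n = len(sentences)

    # Edge weight: word overlap, normalised by the two sentence lengths.
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and token_sets[i] and token_sets[j]:
                sim[i][j] = (len(token_sets[i] & token_sets[j])
                             / (len(token_sets[i]) + len(token_sets[j])))
    out_weight = [sum(row) for row in sim]

    # Power iteration, analogous to PageRank with a damping factor.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                sim[j][i] / out_weight[j] * scores[j]
                for j in range(n) if out_weight[j] > 0
            )
            for i in range(n)
        ]

    # Return the top-ranked sentences in their original order.
    top = sorted(range(n), key=scores.__getitem__, reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(top)]
```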

  20. Limitations of TextRank
    ● Rule-based ranking.
    ● A slight change in the similarity function can dramatically affect
    summary generation.

  21. Supervised Techniques
    ● Supervised techniques make use of a collection of documents and their
    corresponding human-generated summaries to train the model.
    ● Features (e.g. number of words in a sentence, presence of keywords
    in a sentence) are taken into account when deciding whether or not to
    include the sentence in the summary.

  22. Supervised Technique : Example

  23. Supervised Technique Approaches
    Judging the importance of a sentence using feature categories:
    ● Surface features - position and length of the sentence
    ● Content features - statistics of content-bearing words
    ● Relevance features - exploiting inter-sentence relationships
    Sentences are then ranked accordingly, and the top-ranked ones
    become part of the summary.
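    The ranking step can be sketched as a weighted combination of feature
    values. The feature names, values and weights below are purely
    illustrative; in an actual supervised system the weights would be
    learned from document-summary pairs rather than set by hand:

```python
def rank_sentences(feature_vectors, weights, top_k=3):
    """Score each sentence by a weighted sum of its feature values and
    return the indices of the top_k sentences in original order."""
    scores = [
        sum(weights.get(name, 0.0) * value for name, value in fv.items())
        for fv in feature_vectors
    ]
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return sorted(ranked[:top_k])  # restore original sentence order
```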

  24. Rise of Deep Learning Techniques
    ● Techniques based on machine learning and deep learning have made
    breakthroughs in text summarization.
    ● Deep learning can be used to determine whether or not a sentence,
    based on several key features from the text, should be a part of the
    summary.
    ● By tokenizing the text into paragraphs and sentences, and analysing
    each sentence with a neural network, one might be able to create a
    comprehensive summary.

  25. Supervised Techniques
    ● Using Convolutional Neural Networks - For extractive text
    summarization
    ● Using Recurrent Neural Networks - For abstractive text
    summarization

  26. (image-only slide)

  27. Supervised Technique Drawbacks
    ● Training data is not readily available and is expensive to generate.
    ● Moreover, the available human-generated summaries are quite
    abstractive in nature.

  28. Our Approach

  29. Our Approach
    ● Use of a suitable deep learning model for extractive summarization.
    ● Use of the BBC news summary dataset from kaggle.com.
    ● Selection of the features the model will use to decide whether a
    sentence should be part of the summary or not.
    ● Statistical analysis of results.

  30. Workflow
    Gathering Dataset → Feature Selection → Feature Extraction →
    Training the Model → Evaluation and Fine-tuning → Finish

  31. Gathering Datasets
    ● BBC news summary dataset taken from kaggle.com (for now)
    ● Spread across domains e.g. Business, Sport, Politics.
    ● More than 400 news report-summary pairs for each domain.

  32. Features
    ● Sentiment Difference: Sentences whose sentiment is similar to that of
    the overall text belong in the summary.
    ● Proper Noun Ratio: Sentences with a larger number of proper nouns
    are more likely to be pivotal to the summary.
    ● Stat Ratio: Sentences containing statistics give important
    information that should typically be included in the summary.

  33. Features
    ● Keyword Score: More keywords in a sentence indicate that the
    sentence is important.
    ● Sentence Length: The more words a sentence has, the more
    important it is.
    ● Sentence Position: Sentences at the beginning or end of
    paragraphs should be considered more important.
    ● Quotes and Dialogues: Sentences containing quotes or dialogue
    should generally be considered more important.

  34. Feature Extraction
    ● Feature extraction involves assigning numerical values to the above
    features for each sentence.
    ● First, the input text is tokenized; the tokens are then processed to
    represent each sentence as a vector.
    ● These vectors are then fed as input parameters to the model.
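    A sketch of this step for a few of the listed features (sentence length,
    position, keyword score), using simple regex tokenization in place of a
    full NLP toolkit; keywords are naively taken to be the most frequent
    words, where a real system would filter stopwords first:

```python
import re
from collections import Counter

def extract_features(text, num_keywords=5):
    """Represent each sentence as a vector of numeric feature values."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    all_words = re.findall(r"[a-z']+", text.lower())
    keywords = {w for w, _ in Counter(all_words).most_common(num_keywords)}

    vectors = []
    for pos, sentence in enumerate(sentences):
        words = re.findall(r"[a-z']+", sentence.lower())
        vectors.append([
            len(words),                              # sentence length
            1.0 - pos / max(len(sentences) - 1, 1),  # position: earlier = higher
            sum(w in keywords for w in words) / max(len(words), 1),  # keyword score
        ])
    return vectors
```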

  35. Goals for this semester
    ● To study text summarization and learn theories around it.
    ● To gather a dataset and decide on the potential features that will be
    used to train our model.
    ● To start with implementation and come up with a prototype.

  36. Challenges
    ● To identify potential relevant features.
    ● To assign these features a numeric value for every sentence (feature
    extraction).
    ● To keep the model lightweight.

  37. Technology Stack
    ● Keras
    ● NLTK (Natural Language Toolkit)
    ● Python 3
    ● Google Natural Language Processing API

  38. Conclusion
    ● Current summarization systems are widely used to summarize news
    and other online articles.
    ● Most of the current research is based on extractive summarization.
    ● Abstractive summarization has not reached a mature stage because
    allied problems such as semantic representation, inference and natural
    language generation are relatively harder.

  39. Thanks!
