Phase 1 evaluation of the text summarization final-year project under Professor U. A. Deshpande, in collaboration with TCS.
In Phase 1 we studied the reference material provided by our TCS mentor, surveyed different methods of text summarization, including IR (information retrieval) approaches, and defined goals to be completed by the next evaluation.
Professor in-charge: U. A. Deshpande
TCS Mentor: Sagar Sunkle
Abhishek Gautam (BT15CSE002)
Atharva Parwatkar (BT15CSE015)
Sharvil Nagarkar (BT15CSE052)
Build a mechanism to reduce the size of a document, give it some structure, and process it so that it conveys most of the information in the original text.
● Reduced reading time.
● Digital documents are generated every day
○ 2 million articles (news or blog) are published daily. (Source)
○ 4.3 billion messages are generated on Facebook daily. (Source)
● Textual data generated at this speed quickly accumulates into huge volumes.
● Most of this data is unstructured.
● Text summarization improves the effectiveness of indexing.
● Personalized summaries for question-answering systems.
● Summarization algorithms are less biased than humans.
What is text summarization
Producing a new, short piece of text from the original long text(s) that contains a significant portion of the information in the original text(s) and is significantly smaller than them.
The new piece of text is called a summary.
● Analysing reviews of a product
● Generate news headlines
● Generate notes for students
● Generate minutes (of a meeting)
● Generate previews (of books or movies)
Types of text summarization
● Compressive text summarization
● Extractive text summarization
● Abstractive text summarization
Compressive text summarization
A compressive summary is produced by deleting specific words, phrases, or sentences from the original input text while preserving the order of the remaining words, so that most of the information in the original text is retained.
Extractive text summarization
An extractive summary is produced by pulling specific words, sentences, or phrases from the original input text, without considering word order, so that most of the information in the original text is retained in the summary.
Example of text
Example of extractive summary
An abstractive summary may use any words; the choice of words is not restricted to those in the input text. The summary is produced so that it conveys most of the information while being significantly smaller in size.
An innocent hobbit of The Shire
journeys with eight companions
to the fires of Mount Doom to
destroy the One Ring and the
dark lord Sauron forever.
Text Summarization Techniques
● TextRank - Traditional Approach
● This approach models the document as a graph and uses an algorithm
similar to Google’s PageRank algorithm to find top-ranked sentences.
● The PageRank value of a page is essentially the probability of a user
visiting that page.
● Finds how similar each sentence is to all other sentences in the text.
● The most important sentence is the one that is most similar to all the others.
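The graph-based ranking above can be sketched in a few lines. This is an assumed minimal illustration, not the project's code: sentences are nodes, edges are weighted by word overlap, and a PageRank-style power iteration assigns each sentence a score. The similarity function and damping factor follow the usual TextRank conventions.

```python
import math

def similarity(s1, s2):
    """Word-overlap similarity, normalized by sentence lengths."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / (math.log(len(w1) + 1) + math.log(len(w2) + 1))

def textrank(sentences, d=0.85, iterations=50):
    """Return sentence indices ranked from most to least important."""
    n = len(sentences)
    # Build the weighted similarity graph (no self-loops).
    w = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]   # total outgoing weight per node
    scores = [1.0] * n
    for _ in range(iterations):
        # PageRank-style update: each sentence passes score to its neighbors.
        scores = [(1 - d) + d * sum(w[j][i] / out_sum[j] * scores[j]
                                    for j in range(n) if out_sum[j] > 0)
                  for i in range(n)]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)
```

A summary would then keep the top-k indices, restored to their original order in the text.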
Limitations of TextRank
● Rule-based ranking.
● A slight change in the similarity function can dramatically affect the resulting ranking.
● Supervised techniques make use of a collection of documents and their
corresponding human-generated summaries to train the model.
● Features (e.g., the number of words in the sentence, presence of keywords in the sentence) are taken into account when deciding whether or not to include the sentence in the summary.
Supervised Technique: Example
Supervised Technique Approaches
Judging importance of a sentence using feature categories:
● Surface categories - position, length of sentence
● Content features - stats of content-bearing words
● Relevance features - Exploiting inter-sentence relationship
Sentences are then ranked accordingly, and the highest-ranked ones become part of the summary.
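The ranking step above can be sketched with hand-weighted surface and content features (position, length, keyword hits). The weights and the keyword set below are illustrative assumptions, not values from the project.

```python
KEYWORDS = {"profit", "growth", "market"}   # hypothetical content words

def score(sentence, index, total):
    """Weighted combination of surface and content features."""
    words = sentence.lower().split()
    position = 1.0 if index in (0, total - 1) else 0.0    # surface: position
    length = min(len(words) / 20.0, 1.0)                  # surface: length
    keywords = sum(w.strip(".,") in KEYWORDS for w in words) / max(len(words), 1)
    return 0.3 * position + 0.3 * length + 0.4 * keywords

def summarize(sentences, k=2):
    """Keep the k highest-scoring sentences, in their original order."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i], i, len(sentences)),
                    reverse=True)
    keep = sorted(ranked[:k])
    return [sentences[i] for i in keep]
```

In a supervised system, the hand-picked weights would instead be learned from report-summary pairs.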
Rise of Deep Learning Techniques
● Techniques based on deep learning and machine learning have made breakthroughs in text summarization.
● Deep learning is used to determine, based on several key features of the text, whether or not a sentence should be a part of the summary.
● By tokenizing the text into paragraphs and sentences, and analyzing each sentence through a neural network, one might be able to create a summary.
● Using Convolutional Neural Networks - for extractive text summarization.
● Using Recurrent Neural Networks - for abstractive text summarization.
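At its simplest, "analyzing each sentence through a neural network" means mapping a sentence's feature vector to an inclusion probability. The tiny feed-forward scorer below is a hypothetical sketch with random, untrained weights; training and the actual architecture (CNN/RNN) are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(6, 4))   # 6 input features -> 4 hidden units
W2 = rng.normal(size=4)        # hidden units -> single inclusion score

def score_sentence(features):
    """Map a 6-dimensional feature vector to an inclusion probability."""
    h = np.tanh(features @ W1)             # hidden activations
    return float(1 / (1 + np.exp(-(h @ W2))))  # sigmoid probability

vec = np.array([0.5, 1.0, 0.2, 0.1, 0.0, 0.0])  # example feature vector
p = score_sentence(vec)   # probability that the sentence joins the summary
```

Sentences whose probability exceeds a threshold would be kept as the extractive summary.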
Supervised Technique Drawbacks
● Training data is not readily available and is expensive to generate.
● The available human-generated summaries are quite abstractive in nature.
● Use of a deep learning model for extractive summarization.
● Using BBC news summary dataset from kaggle.com.
● Selection of features that will be used by the model to decide whether a
sentence should be part of the summary or not.
● Statistical analysis of results.
● BBC news summary dataset taken from kaggle.com (for now)
● Spread across domains e.g. Business, Sport, Politics.
● More than 400 news report-summary pairs for each domain.
● Sentiment Difference: Sentences with sentiment that are similar to the
overall text belong in the summary.
● Proper Noun Ratio: Sentences with a greater number of proper nouns are more likely to be pivotal to the summary.
● Stat Ratio: Sentences containing statistics give important information that should typically be included in the summary.
● Keyword Score: A sentence containing more keywords is more likely to be important.
● Sentence Length: The more words that are in a sentence, the more
important the sentence is.
● Sentence Position: Sentences that are at the beginning or the end of
paragraphs should be considered more important.
● Quotes and Dialogues: Sentences containing quotes and dialogues
should be generally considered as more important.
● Process of feature extraction involves giving numerical values to the
above features for each sentence.
● First, the input text is tokenized; the tokens are then processed to represent each sentence as a vector.
● These vectors are then fed as input parameters to the model.
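The extraction step might look like the sketch below: each sentence becomes one numeric vector covering the features listed above. Plain string handling stands in for NLTK so the example is self-contained, and the keyword set is a made-up placeholder.

```python
KEYWORDS = {"economy", "election", "goal"}   # hypothetical keyword list

def sentence_vector(sentence, index, total):
    """Turn one sentence into a 6-dimensional feature vector."""
    words = sentence.split()
    n = max(len(words), 1)
    return [
        len(words),                                           # sentence length
        1.0 if index in (0, total - 1) else 0.0,              # sentence position
        sum(w.lower().strip('.,"') in KEYWORDS for w in words) / n,  # keyword score
        sum(w[:1].isupper() for w in words[1:]) / n,          # proper-noun ratio (rough)
        float(any(ch.isdigit() for ch in sentence)),          # statistics present
        float('"' in sentence),                               # quotes/dialogue
    ]

text = 'The economy grew 3% in May, said John Smith: "A strong result."'
vec = sentence_vector(text, 0, 5)
```

Each vector is then fed to the model, which decides whether that sentence belongs in the summary.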
Goals for this semester
● To study text summarization and learn theories around it.
● To gather dataset and decide on the potential features that will be used
to train our model.
● To start with implementation and come up with a prototype.
● To identify potential relevant features.
● To assign these features a numeric value for every sentence (feature extraction).
● To keep the model lightweight.
● NLTK (Natural Language Toolkit)
● Python 3
● Google Natural Language Processing API
● Current summarization systems are widely used to summarize news
and other online articles.
● Most of the current research is based on extractive summarization.
● Abstractive summarization has not reached a mature stage because
allied problems such as semantic representation, inference and natural
language generation are relatively harder.