
Text summarization Phase 1 evaluation

Phase 1 evaluation of our text summarization final-year project, carried out under Professor U. A. Deshpande in collaboration with TCS.

In Phase 1 we studied reference material provided by our TCS mentor, surveyed different methods of text summarization, including IR (information retrieval) approaches, and defined goals to be completed by the next evaluation.

Team:
Abhishek Gautam
Atharva Parwatkar
Sharvil Ngarkar

Professor in-charge: U. A. Deshpande
TCS mentor: Sagar Sunkle

Abhishek Gautam

October 25, 2018

Transcript

  1. Problem Statement: Build a mechanism to reduce the size of a document, give it some structure, and process it so that it conveys most of the information in the original text.
  2. Motivation
     • Reduced reading time.
     • Digital documents are generated every day:
       ◦ 2 million articles (news or blog) are published daily. (Source)
       ◦ 4.3 billion messages are generated on Facebook daily. (Source)
     • Textual data at this rate quickly accumulates into huge volumes.
     • Most of this data is unstructured.
  3. Motivation
     • Text summarization improves the effectiveness of indexing.
     • Personalized summaries for question-answering systems.
     • Summarization algorithms are less biased than humans.
  4. What is text summarization: Producing a new piece of text from the original long text(s) that contains a significant portion of the information in the original text(s) and is significantly smaller than the original text(s). The new piece of text is called a summary.
  5. Use cases
     • Analyse reviews of a product
     • Generate news headlines
     • Generate notes for students
     • Generate minutes (of a meeting)
     • Generate previews (of books or movies)
  6. Types of text summarization processes
     • Compressive text summarization
     • Extractive text summarization
     • Abstractive text summarization
  7. Compressive text summarization: A compressive summary is produced by deleting specific words, sentences or phrases from the original input text while preserving word order, so that most of the information in the original text is retained.
  8. Extractive text summarization: An extractive summary is produced by pulling specific words, sentences or phrases from the original input text, without regard to word order, so that most of the information in the original text is present in the summary.
  9. Abstractive summary: An abstractive summary may be produced using any words; the choice of words does not depend on the input text. The summary is produced so that it conveys most of the information while being significantly smaller in size.
  10. Example summary: An innocent hobbit of The Shire journeys with eight companions to the fires of Mount Doom to destroy the One Ring and the dark lord Sauron forever.
  11. TextRank
      • Models the document as a graph and uses an algorithm similar to Google’s PageRank to find top-ranked sentences.
      • The PageRank value of a page is essentially the probability of a user visiting that page.
      • Computes how similar each sentence is to all other sentences in the text.
      • The most important sentence is the one that is most similar to all the others.
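The graph-plus-power-iteration idea above can be sketched in plain Python. This is a minimal illustration, not the exact TextRank formulation: the similarity function here is simple word overlap normalized by log sentence lengths, and the damping factor and iteration count are assumed defaults.

```python
import math

def similarity(s1, s2):
    """Word-overlap similarity between two sentences, length-normalized."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    denom = math.log(len(w1) + 1) + math.log(len(w2) + 1)
    return len(w1 & w2) / denom

def textrank(sentences, damping=0.85, iters=50):
    """Score sentences with a PageRank-style power iteration."""
    n = len(sentences)
    # Edge weights: similarity between every pair of distinct sentences.
    sim = [[similarity(a, b) if i != j else 0.0
            for j, b in enumerate(sentences)]
           for i, a in enumerate(sentences)]
    # Column sums normalize each sentence's outgoing weight.
    col = [sum(sim[i][j] for i in range(n)) or 1.0 for j in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - damping) / n
                  + damping * sum(sim[i][j] * scores[j] / col[j]
                                  for j in range(n))
                  for i in range(n)]
    return scores

sentences = [
    "The cat sat on the mat.",
    "The dog chased the cat around the mat.",
    "Stock prices rose sharply on Monday.",
]
scores = textrank(sentences)
top = max(range(len(sentences)), key=scores.__getitem__)
```

As the slide says, the off-topic third sentence, being least similar to the others, ends up ranked lowest, while the two mutually similar sentences score highest.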
  12. Limitations of TextRank
      • Rule-based ranking.
      • A slight change in the similarity function can dramatically affect summary generation.
  13. Supervised Techniques
      • Supervised techniques use a collection of documents and their corresponding human-generated summaries to train the model.
      • Features (e.g. number of words in the sentence, presence of keywords in the sentence) are taken into account when deciding whether or not to include a sentence in the summary.
  14. Supervised Technique Approaches: Judging the importance of a sentence using feature categories:
      • Surface features - position, length of sentence
      • Content features - statistics of content-bearing words
      • Relevance features - exploiting inter-sentence relationships
      Sentences are then ranked accordingly, and the highest-scoring ones become part of the summary.
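A toy sketch of this score-and-rank step, with illustrative feature values and weights (both assumed for the example, not taken from any real system):

```python
# Hypothetical per-feature weights; a trained model would learn these.
WEIGHTS = {"position": 2.0, "length": 0.1, "keywords": 3.0}

def score(features):
    """Weighted sum of a sentence's feature values."""
    return sum(WEIGHTS[name] * value for name, value in features.items())

# Toy feature values for two sentences.
sentences = {
    "Profits rose 10% this quarter.": {"position": 1.0, "length": 5, "keywords": 2},
    "The weather was mild.":          {"position": 0.2, "length": 4, "keywords": 0},
}
ranked = sorted(sentences, key=lambda s: score(sentences[s]), reverse=True)
summary = ranked[:1]  # the top-ranked sentence becomes the summary
```

The keyword-rich, well-positioned sentence outscores the other (8.5 vs 0.8) and is selected.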
  15. Rise of Deep Learning Techniques
      • Deep learning and machine learning techniques have made breakthroughs in text summarization.
      • Deep learning can decide, based on several key features from the text, whether or not a sentence should be a part of the summary.
      • By tokenizing the text into paragraphs and sentences, and analyzing each sentence through a neural network, one might be able to create a comprehensive summary.
  16. Supervised Techniques
      • Convolutional neural networks - for extractive text summarization
      • Recurrent neural networks - for abstractive text summarization
  17. Supervised Technique Drawbacks
      • Training data is not readily available and is expensive to generate.
      • The available human-generated summaries are quite abstractive in nature.
  18. Our Approach
      • Use a suitable deep learning model for extractive summarization.
      • Use the BBC news summary dataset from kaggle.com.
      • Select the features the model will use to decide whether a sentence should be part of the summary.
      • Statistically analyse the results.
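As a minimal stand-in for the sentence-selection model described above, a plain-Python logistic-regression classifier over sentence feature vectors; the eventual model would be a Keras network, and the two-feature training data (keyword score, sentence position) below is toy data, not from the BBC dataset:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by stochastic gradient descent on logistic loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of logistic loss w.r.t. the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

# Toy labels: keyword-rich, well-positioned sentences are "in summary" (1).
X = [[0.9, 1.0], [0.8, 0.9], [0.1, 0.2], [0.0, 0.1]]
y = [1, 1, 0, 0]
w, b = train(X, y)

def predict(x):
    """True if the sentence with feature vector x belongs in the summary."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) > 0.5
```

The design point this illustrates: once sentences are reduced to feature vectors, "should this sentence be in the summary" is an ordinary binary classification problem, so any classifier (here logistic regression, in the project a deep network) can be plugged in.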
  19. Gathering Datasets
      • BBC news summary dataset taken from kaggle.com (for now).
      • Spread across domains, e.g. Business, Sport, Politics.
      • More than 400 news report-summary pairs for each domain.
  20. Features
      • Sentiment Difference: Sentences whose sentiment is similar to that of the overall text belong in the summary.
      • Proper Noun Ratio: Sentences with more proper nouns are more likely to be pivotal to the summary.
      • Stat Ratio: Sentences containing statistics give important information that should typically be included in the summary.
  21. Features
      • Keyword Score: More keywords indicate that a sentence is important.
      • Sentence Length: The more words in a sentence, the more important it is.
      • Sentence Position: Sentences at the beginning or end of a paragraph should be considered more important.
      • Quotes and Dialogues: Sentences containing quotes or dialogue should generally be considered more important.
  22. Feature Extraction
      • Feature extraction assigns numerical values to the above features for each sentence.
      • First, the input text is tokenized; the tokens are then processed to represent each sentence as a vector.
      • These vectors are fed as input to the model.
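The tokenize-then-vectorize step can be sketched as below. Everything here is a simplified illustration: the keyword set is assumed, whitespace splitting stands in for NLTK tokenization, and capitalization is a crude proxy for the proper-noun tagging a real pipeline would do with a POS tagger.

```python
import re

# Assumed keyword set for illustration; the real system would derive
# keywords from the corpus.
KEYWORDS = {"economy", "growth", "profit"}

def sentence_features(sentence, index, total):
    """Map one sentence to a numeric vector over the features above."""
    words = [re.sub(r"\W", "", t) for t in sentence.split()]
    length = len(words)
    # Sentence Position: 1.0 at the start or end of the text, lower in between.
    position = max(index, total - 1 - index) / max(total - 1, 1)
    # Proper Noun Ratio: crude proxy counting capitalized non-initial words.
    proper = sum(1 for w in words[1:] if w[:1].isupper()) / max(length, 1)
    # Stat Ratio: 1.0 if the sentence contains a number.
    stat = float(any(w.isdigit() for w in words))
    # Keyword Score: fraction of words that are keywords.
    keyword = sum(1 for w in words if w.lower() in KEYWORDS) / max(length, 1)
    # Quotes and Dialogues: 1.0 if the sentence contains a quote mark.
    quote = float('"' in sentence)
    return [length, position, proper, stat, keyword, quote]

doc = [
    "Acme Corp reported 12% profit growth.",
    "Analysts said the result beat expectations.",
]
vectors = [sentence_features(s, i, len(doc)) for i, s in enumerate(doc)]
```

Each sentence thus becomes a fixed-length numeric vector, which is exactly the form the model expects as input.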
  23. Goals for this semester
      • Study text summarization and the theory around it.
      • Gather a dataset and decide on the features that will be used to train our model.
      • Start the implementation and come up with a prototype.
  24. Challenges
      • Identifying potentially relevant features.
      • Assigning these features a numeric value for every sentence (feature extraction).
      • Keeping the model lightweight.
  25. Technology Stack
      • Keras
      • NLTK (Natural Language Toolkit)
      • Python 3
      • Google Natural Language Processing API
  26. Conclusion
      • Current summarization systems are widely used to summarize news and other online articles.
      • Most current research is based on extractive summarization.
      • Abstractive summarization has not reached a mature stage because allied problems such as semantic representation, inference and natural language generation are relatively harder.