
Text summarization Phase 1 evaluation

Phase 1 evaluation of the text summarization final-year project under Prof. U. A. Deshpande, in collaboration with TCS.

In Phase 1 we studied the reference material provided by our TCS mentor, surveyed different methods of text summarization, including IR (information retrieval) approaches, and defined goals to be completed by the next evaluation.

Team:
Abhishek Gautam
Atharva Parwatkar
Sharvil Nagarkar

Professor in-charge: U. A. Deshpande
TCS Mentor: Sagar Sunkle

Abhishek Gautam

October 25, 2018

Transcript

  1. Text Summarization
    Abhishek Gautam (BT15CSE002)
    Atharva Parwatkar (BT15CSE015)
    Sharvil Nagarkar (BT15CSE052)
    Under Prof. U. A. Deshpande

  2. Problem Statement
    Build a mechanism to reduce the size of a document, give it some
    structure, and process it so that the result conveys most of the
    information in the original text.

  3. Motivation
    ● Reduced reading time.
    ● Digital documents are generated every day:
    ○ 2 million articles (news or blog posts) are published daily. (Source)
    ○ 4.3 billion messages are generated on Facebook daily. (Source)
    ● Textual data produced at this rate quickly accumulates into huge volumes.
    ● Most of this data is unstructured.

  4. Motivation
    ● Text summarization improves the effectiveness of indexing.
    ● Personalized summaries for question-answering systems.
    ● Summarization algorithms are less biased than humans.

  5. What is text summarization

  6. What is text summarization
    Producing a new, short piece of text from the original long text(s) that
    contains a significant portion of the information in the original while
    being significantly smaller than it.
    The new piece of text is called a summary.

  7. Use cases

  8. Use cases
    ● Analysing product reviews
    ● Generating news headlines
    ● Generating notes for students
    ● Generating minutes of meetings
    ● Generating previews of books or movies

  9. Types of text summarization
    processes

  10. Types of text summarization processes
    ● Compressive text summarization
    ● Extractive text summarization
    ● Abstractive text summarization

  11. Compressive text summarization
    A compressive summary is produced by deleting specific words, phrases or
    sentences from the original input text while preserving the order of the
    remaining words, so that most of the information in the original text is retained.

  12. Extractive text summarization
    An extractive summary is produced by pulling specific words, sentences or
    phrases from the original input text, without considering word order, so that
    most of the information in the original text is retained in the summary.
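    The extractive idea can be illustrated with a minimal word-frequency
    sketch (a deliberate simplification, not the approach the project itself
    takes): sentences whose words occur often in the whole document are
    pulled out as the summary.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    """Pick the sentences whose words occur most often in the whole text."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    # Score each sentence by the total frequency of its words.
    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    ranked = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Emit the selected sentences in their original order.
    return [s for s in sentences if s in ranked]
```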

  13. Example of text
    Source: https://en.wikipedia.org/wiki/A._P._J._Abdul_Kalam

  14. Example of extractive summary
    Source: Link

  15. Abstractive summary
    An abstractive summary may be produced using any words; the choice of
    words does not depend on the input text. The summary is produced so that
    it conveys most of the information while being significantly smaller in size.

  16. Example summary
    An innocent hobbit of The Shire
    journeys with eight companions
    to the fires of Mount Doom to
    destroy the One Ring and the
    dark lord Sauron forever.

  17. Text Summarization Techniques
    ● Supervised
    ● Unsupervised

  18. Unsupervised Techniques
    ● TextRank - Traditional Approach

  19. TextRank
    ● This approach models the document as a graph and uses an algorithm
    similar to Google’s PageRank algorithm to find top-ranked sentences.
    ● The PageRank value of a page is essentially the probability of a user
    visiting that page.
    ● Finds how similar each sentence is to all other sentences in the text.
    ● The most important sentence is the one that is most similar to all the
    others.
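    A minimal sketch of this idea, assuming a simple word-overlap
    similarity and plain power iteration in place of a full PageRank
    implementation (real TextRank variants use richer similarity functions):

```python
import re

def textrank_summary(text, num_sentences=2, damping=0.85, iterations=50):
    """Extractive summary via a PageRank-style ranking of sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    token_sets = [set(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    n = len(sentences)

    # Edge weight: word overlap, normalised by the two sentence lengths.
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and token_sets[i] and token_sets[j]:
                sim[i][j] = (len(token_sets[i] & token_sets[j])
                             / (len(token_sets[i]) + len(token_sets[j])))
    out_weight = [sum(row) for row in sim]

    # Power iteration, analogous to PageRank with a damping factor.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                sim[j][i] / out_weight[j] * scores[j]
                for j in range(n) if out_weight[j] > 0
            )
            for i in range(n)
        ]

    # Return the top-ranked sentences in their original order.
    top = sorted(range(n), key=scores.__getitem__, reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(top)]
```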

  20. Limitations of TextRank
    ● Rule-based ranking.
    ● A slight change in the similarity function can dramatically affect
    summary generation.

  21. Supervised Techniques
    ● Supervised techniques make use of a collection of documents and their
    corresponding human-generated summaries to train the model.
    ● Features (e.g. number of words in a sentence, presence of keywords
    in a sentence) are taken into account when deciding whether or not to
    include the sentence in the summary.

  22. Supervised Technique : Example

  23. Supervised Technique Approaches
    Judging the importance of a sentence using feature categories:
    ● Surface features - position and length of the sentence
    ● Content features - statistics of content-bearing words
    ● Relevance features - exploiting inter-sentence relationships
    Sentences are then ranked accordingly, and the top-ranked ones
    become part of the summary.
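    The ranking step can be sketched as a weighted combination of feature
    values. The feature names, values and weights below are purely
    illustrative; in an actual supervised system the weights would be
    learned from document-summary pairs rather than set by hand:

```python
def rank_sentences(feature_vectors, weights, top_k=3):
    """Score each sentence by a weighted sum of its feature values and
    return the indices of the top_k sentences in original order."""
    scores = [
        sum(weights.get(name, 0.0) * value for name, value in fv.items())
        for fv in feature_vectors
    ]
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return sorted(ranked[:top_k])  # restore original sentence order
```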

  24. Rise of Deep Learning Techniques
    ● Techniques based on machine learning and deep learning have made
    breakthroughs in text summarization.
    ● Deep learning can be used to determine whether or not a sentence,
    based on several key features from the text, should be a part of the
    summary.
    ● By tokenizing the text into paragraphs and sentences, and analysing
    each sentence with a neural network, one might be able to create a
    comprehensive summary.

  25. Supervised Techniques
    ● Using Convolutional Neural Networks - For extractive text
    summarization
    ● Using Recurrent Neural Networks - For abstractive text
    summarization

  26. (image-only slide)

  27. Supervised Technique Drawbacks
    ● Training data is not readily available and is expensive to generate.
    ● Moreover, the available human-generated summaries are quite
    abstractive in nature.

  28. Our Approach

  29. Our Approach
    ● Use of a suitable deep learning model for extractive summarization.
    ● Use of the BBC news summary dataset from kaggle.com.
    ● Selection of the features the model will use to decide whether a
    sentence should be part of the summary or not.
    ● Statistical analysis of results.

  30. Workflow
    Gathering Dataset → Feature Selection → Feature Extraction →
    Training the Model → Evaluation and Fine-tuning → Finish

  31. Gathering Datasets
    ● BBC news summary dataset taken from kaggle.com (for now)
    ● Spread across domains e.g. Business, Sport, Politics.
    ● More than 400 news report-summary pairs for each domain.

  32. Features
    ● Sentiment Difference: Sentences whose sentiment is similar to that of
    the overall text belong in the summary.
    ● Proper Noun Ratio: Sentences with a larger number of proper nouns
    are more likely to be pivotal to the summary.
    ● Stat Ratio: Sentences containing statistics give important
    information that should typically be included in the summary.

  33. Features
    ● Keyword Score: More keywords in a sentence indicate that the
    sentence is important.
    ● Sentence Length: The more words a sentence has, the more
    important it is.
    ● Sentence Position: Sentences at the beginning or end of
    paragraphs should be considered more important.
    ● Quotes and Dialogues: Sentences containing quotes or dialogue
    should generally be considered more important.

  34. Feature Extraction
    ● Feature extraction involves assigning numerical values to the above
    features for each sentence.
    ● First, the input text is tokenized; the tokens are then processed to
    represent each sentence as a vector.
    ● These vectors are then fed as input parameters to the model.
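    A sketch of this step for a few of the listed features (sentence length,
    position, keyword score), using simple regex tokenization in place of a
    full NLP toolkit; keywords are naively taken to be the most frequent
    words, where a real system would filter stopwords first:

```python
import re
from collections import Counter

def extract_features(text, num_keywords=5):
    """Represent each sentence as a vector of numeric feature values."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    all_words = re.findall(r"[a-z']+", text.lower())
    keywords = {w for w, _ in Counter(all_words).most_common(num_keywords)}

    vectors = []
    for pos, sentence in enumerate(sentences):
        words = re.findall(r"[a-z']+", sentence.lower())
        vectors.append([
            len(words),                              # sentence length
            1.0 - pos / max(len(sentences) - 1, 1),  # position: earlier = higher
            sum(w in keywords for w in words) / max(len(words), 1),  # keyword score
        ])
    return vectors
```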

  35. Goals for this semester
    ● To study text summarization and learn theories around it.
    ● To gather a dataset and decide on the potential features that will be
    used to train our model.
    ● To start with implementation and come up with a prototype.

  36. Challenges
    ● To identify potential relevant features.
    ● To assign these features a numeric value for every sentence (feature
    extraction).
    ● To keep the model lightweight.

  37. Technology Stack
    ● Keras
    ● NLTK (Natural Language Toolkit)
    ● Python 3
    ● Google Natural Language Processing API

  38. Conclusion
    ● Current summarization systems are widely used to summarize news
    and other online articles.
    ● Most of the current research is based on extractive summarization.
    ● Abstractive summarization has not reached a mature stage because
    allied problems such as semantic representation, inference and natural
    language generation are relatively harder.

  39. Thanks!
