Slide 1

Slide 1 text

TECH TALK
AUTOMATIC TEXT SUMMARIZATION: Maximum Marginal Relevance (MMR) Technique
Fajri Koto, Analytic Team
June 3rd 2016, Jakarta, Indonesia
PT Kreatif Media Karya, www.kmklabs.com

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

Outline
1. Why Do We Need a Summarization Engine?
2. Text Summarization Overview
3. Vector Space Model
4. MMR Algorithm
5. Overall Summarization Stages
6. Demo

Slide 4

Slide 4 text

1. Why do we need a Summarization Engine?
Definition: automatic text summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document.

Slide 5

Slide 5 text

1. Why do we need a Summarization Engine?
- Storage keeps getting cheaper and larger.
- The volume of available documents (image, video, text) keeps growing.
- We want to obtain information quickly.
Which summary?
- A high-quality, informative summary.

Slide 6

Slide 6 text

2. Text Summarization Overview
Abstractive Summarization
- Builds new sentences as a summary of the whole text.
Extractive Summarization
- Selects the most representative (informative) sentences from the text / document itself as the summary.
- The selection is done by scoring each sentence.

Slide 7

Slide 7 text

2. Text Summarization Overview
Extractive Summarization (cont'd)
Sentence 1 -> score 1
Sentence 2 -> score 2
Sentence 3 -> score 3
Sentence 4 -> score 4
Sentence 5 -> score 5
...
Sentence n -> score n
Select the sentences with the highest scores.
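The selection step can be sketched in a few lines of Python. This is only an illustration: the function name `select_top_sentences`, the made-up scores, and the choice to return the picked sentences in their original document order are assumptions, not something stated on the slide.

```python
def select_top_sentences(sentences, scores, k):
    """Return the k highest-scoring sentences, kept in original document order."""
    if len(sentences) != len(scores):
        raise ValueError("need exactly one score per sentence")
    # Rank sentence indices by score, highest first.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the top k, then restore document order (an assumption of this sketch).
    chosen = sorted(ranked[:k])
    return [sentences[i] for i in chosen]


sentences = ["Sentence 1", "Sentence 2", "Sentence 3", "Sentence 4", "Sentence 5"]
scores = [0.2, 0.9, 0.4, 0.7, 0.1]  # made-up scores, for illustration only
summary = select_top_sentences(sentences, scores, k=2)
print(summary)  # ['Sentence 2', 'Sentence 4']
```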

Slide 8

Slide 8 text

3. Vector Space Model
- We want to do scoring, so we have to convert the text representation into a numeric representation.
- Term Frequency (TF) - Vector Space Model
Sentence 1 -> Saya pergi ke pasar ("I go to the market")
Sentence 2 -> Ibu pergi ke rumah ("Mother goes to the house")
Bag of unique words: saya, pergi, ke, pasar, ibu, rumah

Slide 9

Slide 9 text

3. Vector Space Model
- Term Frequency (TF) - Vector Space Model
Sentence 1 -> Saya pergi ke pasar
Sentence 2 -> Ibu pergi ke rumah
Vector dimensions: saya, pergi, ke, pasar, ibu, rumah
Sentence 1 -> [1, 1, 1, 1, 0, 0]
Sentence 2 -> [0, 1, 1, 0, 1, 1]
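As a rough sketch, the TF vectors for the two example sentences can be built as follows. The helper names `build_vocabulary` and `tf_vector` are hypothetical, and whitespace tokenization is an assumption made to keep the example short.

```python
from collections import Counter


def build_vocabulary(sentences):
    """Collect the unique words in order of first appearance."""
    vocab = []
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab.append(word)
    return vocab


def tf_vector(sentence, vocab):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocab]


sentences = ["Saya pergi ke pasar", "Ibu pergi ke rumah"]
vocab = build_vocabulary(sentences)
vectors = [tf_vector(s, vocab) for s in sentences]
print(vocab)    # ['saya', 'pergi', 'ke', 'pasar', 'ibu', 'rumah']
print(vectors)  # [[1, 1, 1, 1, 0, 0], [0, 1, 1, 0, 1, 1]]
```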

Slide 10

Slide 10 text

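3. Vector Space Model
- Now we can compute a similarity score between two vectors.
Similarity score: Cosine Similarity
$$\cos(D_1, D_2) = \frac{\sum_i t_{1i}\, t_{2i}}{\sqrt{\sum_i t_{1i}^2}\,\sqrt{\sum_i t_{2i}^2}}$$
where $t_{ki}$ is the i-th element (TF) of the vector of document $D_k$, and $D_1$ and $D_2$ are document vectors (here, sentence vectors).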
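Cosine similarity between two TF vectors can be sketched as below, using the vectors of the two example sentences. The function name and the decision to return 0.0 for an all-zero vector are assumptions of this sketch.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: an empty sentence is similar to nothing
    return dot / (norm1 * norm2)


d1 = [1, 1, 1, 1, 0, 0]  # Saya pergi ke pasar
d2 = [0, 1, 1, 0, 1, 1]  # Ibu pergi ke rumah
score = cosine_similarity(d1, d2)
print(score)  # 0.5 (two shared words, each vector has length 2)
```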

Slide 11

Slide 11 text

4. Maximum Marginal Relevance Carbonell, J., & Goldstein, J. (1998, August). The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval (pp. 335-336). ACM.

Slide 12

Slide 12 text

4. Maximum Marginal Relevance
Maximum Marginal Relevance (MMR)
- MMR is widely used in text summarization because of its simplicity and efficiency.
- MMR re-ranks the sentences according to their relevance scores.
- The formula also detects and penalizes redundant sentences.

Slide 13

Slide 13 text

4. Maximum Marginal Relevance
Vector space: TF
- TF (Term Frequency) is the standard vector representation for summarization.
Another vector space: TF-IDF
- TFIDF(t) = TF(t) * IDF(t, D)
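A minimal TF-IDF sketch follows. The slide does not define IDF, so this assumes the common form IDF(t, D) = log(N / df(t)), where N is the number of sentences and df(t) is the number of sentences containing t; the natural logarithm, the absence of smoothing, and the function name `tfidf_vectors` are likewise assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_sentences):
    """Return (vocabulary, one TF-IDF vector per sentence)."""
    n = len(tokenized_sentences)
    vocab = sorted({w for sent in tokenized_sentences for w in sent})
    # Document frequency: in how many sentences does each word occur?
    df = Counter(w for sent in tokenized_sentences for w in set(sent))
    # Assumed IDF definition: log(N / df), no smoothing.
    idf = {w: math.log(n / df[w]) for w in vocab}
    vectors = []
    for sent in tokenized_sentences:
        tf = Counter(sent)
        vectors.append([tf[w] * idf[w] for w in vocab])
    return vocab, vectors


docs = [["saya", "pergi", "ke", "pasar"], ["ibu", "pergi", "ke", "rumah"]]
vocab, vectors = tfidf_vectors(docs)
```

With this definition, words that occur in every sentence ("pergi", "ke") get weight 0, while words unique to one sentence ("pasar", "rumah") get weight log(2).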

Slide 14

Slide 14 text

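4. Maximum Marginal Relevance
Similarity score: Cosine Similarity
$$\cos(D_1, D_2) = \frac{\sum_i t_{1i}\, t_{2i}}{\sqrt{\sum_i t_{1i}^2}\,\sqrt{\sum_i t_{2i}^2}}$$
where $t_{ki}$ is the i-th element (TF or TF-IDF) of the vector of document $D_k$, and $D_1$ and $D_2$ are document vectors (here, sentence vectors).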

Slide 15

Slide 15 text

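4. Maximum Marginal Relevance
MMR Process in Summarization
Document passages and their scores:
Passage 1: 0.8
Passage 2: 0.7
Passage 3: 0.6
Passage 4: 0.5
Passage 5: 0.4
Steps:
1. Choose the passage with the highest score (Passage 1: 0.8) and move it into the summary.
2. Re-calculate the scores of the remaining passages and re-rank them.
3. Repeat with the re-ranked passages.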
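The re-ranking loop can be sketched as a greedy selection. It follows the MMR criterion from Carbonell and Goldstein: each candidate is scored as lam * sim(candidate, query) - (1 - lam) * max sim(candidate, already selected). Several details here are assumptions: the query vector is taken to be the sum of all sentence vectors (the talk does not say what plays the role of the query), lam defaults to 0.7, and the names `mmr_select` and `cosine` are hypothetical.

```python
import math


def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def mmr_select(sentence_vectors, query_vector, k, lam=0.7):
    """Greedy MMR: return the indices of k sentences in selection order."""
    relevance = [cosine(v, query_vector) for v in sentence_vectors]
    selected = []
    candidates = list(range(len(sentence_vectors)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            # Redundancy: similarity to the closest already-selected sentence.
            redundancy = max(
                (cosine(sentence_vectors[i], sentence_vectors[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy
        # Re-calculate scores for the remaining candidates and take the best.
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy TF vectors: sentence 1 is an exact duplicate of sentence 0.
sentence_vectors = [[1, 1, 0], [1, 1, 0], [0, 1, 1]]
# Assumption: the whole document (sum of sentence vectors) acts as the query.
query_vector = [sum(col) for col in zip(*sentence_vectors)]
summary_indices = mmr_select(sentence_vectors, query_vector, k=2)
print(summary_indices)  # [0, 2]: the duplicate is skipped
```

With lam=1.0 the redundancy term vanishes and the duplicate sentence is picked second, which is the behaviour MMR is meant to avoid.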

Slide 16

Slide 16 text

5. Overall Summarization Stages
Documents -> Sentences (Sentence 1, Sentence 2, ..., Sentence n) -> Preprocessing -> Building Vector Space -> Scoring -> Selecting top-x
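The stages can be strung together in one small function. This is a simplified sketch, not the implementation shown in the demo: it assumes a naive sentence splitter, TF vectors stored as `Counter` objects, and plain relevance scoring (cosine similarity to the whole document) without the MMR redundancy penalty.

```python
import math
import re
from collections import Counter


def summarize(text, top_x=2):
    # 1. Documents -> sentences (naive split after '.', '!' or '?').
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # 2. Preprocessing: lower-case and keep only alphanumeric tokens.
    tokens = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    # 3. Building the vector space: TF per sentence and for the whole document.
    sentence_tf = [Counter(t) for t in tokens]
    document_tf = Counter(w for t in tokens for w in t)

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a)
        na = math.sqrt(sum(c * c for c in a.values()))
        nb = math.sqrt(sum(c * c for c in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    # 4. Scoring: similarity of each sentence to the document as a whole.
    scores = [cosine(tf, document_tf) for tf in sentence_tf]
    # 5. Selecting top-x, returned in original document order.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_x])]


text = "Cats eat fish. Cats eat fish and cats sleep. Dogs bark."
result = summarize(text, top_x=2)
print(result)  # ['Cats eat fish.', 'Cats eat fish and cats sleep.']
```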

Slide 17

Slide 17 text

6. DEMO
Preprocessing used in my implementation:
- Removing special characters
- Converting all text to lower case
- Stemming, e.g. membawa -> bawa, bernyanyi -> nyanyi
- Removing stopwords, e.g. "saya, dia, ke, apa, apakah", etc.
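A rough sketch of these preprocessing steps is shown below. The stopword set contains only the examples listed on the slide (a real list is much longer), the function name `preprocess` is hypothetical, and stemming is left as a comment because it needs a language-specific stemmer that this sketch does not assume.

```python
import re

# Only the example stopwords from the slide; a real list is much longer.
STOPWORDS = {"saya", "dia", "ke", "apa", "apakah"}


def preprocess(sentence, stopwords=STOPWORDS):
    # 1. Remove special characters (anything that is not a letter, digit or space).
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", sentence)
    # 2. Convert all text to lower case.
    tokens = cleaned.lower().split()
    # 3. Stemming (e.g. membawa -> bawa) would go here; omitted in this sketch.
    # 4. Remove stopwords.
    return [t for t in tokens if t not in stopwords]


tokens = preprocess("Saya pergi ke pasar!")
print(tokens)  # ['pergi', 'pasar']
```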

Slide 18

Slide 18 text

Thank You. Any Questions?

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

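Evaluation Technique
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation.
ROUGE-N uses n-gram overlap to do the evaluation:
$$\text{ROUGE-N} = \frac{\sum_{S \in \text{References}} \sum_{gram_n \in S} \text{Count}_{match}(gram_n)}{\sum_{S \in \text{References}} \sum_{gram_n \in S} \text{Count}(gram_n)}$$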
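ROUGE-N can be sketched as n-gram recall against a reference summary. This simplified version handles a single reference only; the function names and the English example sentences are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(candidate_tokens, reference_tokens, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand = Counter(ngrams(candidate_tokens, n))
    ref = Counter(ngrams(reference_tokens, n))
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clipped matches: an n-gram counts at most as often as the candidate has it.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
rouge_1 = rouge_n(candidate, reference, n=1)  # 5 of 6 unigrams match
rouge_2 = rouge_n(candidate, reference, n=2)  # 3 of 5 bigrams match
```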