TECH TALK
AUTOMATIC TEXT SUMMARIZATION:
Maximum Marginal Relevance (MMR) Technique
Fajri Koto
Analytic Team
June 3rd 2016 Jakarta, Indonesia
PT Kreatif Media Karya
www.kmklabs.com
[email protected]
Slide 3
Slide 3 text
Outline
1. Why Do We Need a Summarization Engine?
2. Text Summarization Overview
3. Vector Space Model
4. MMR Algorithm
5. Overall Summarization Stages
6. Demo
Slide 4
Slide 4 text
1. Why do we need a Summarization Engine?
Definition
Automatic summarization is the process of reducing a text document with a computer
program in order to create a summary that retains the
most important points of the original document.
Slide 5
Slide 5 text
1. Why do we need a Summarization Engine?
Storage is becoming cheaper and
larger.
The volume of available documents
keeps growing
(image, video, text)
To obtain information quickly
Which kind of summary?
A high-quality, informative summary
Slide 6
Slide 6 text
2. Text Summarization Overview
Abstractive Summarization
Building new sentences as a summary of the whole text
Extractive Summarization
Selecting the most representative (informative)
sentences from the text / document itself as the
summary, by scoring each sentence
Slide 7
Slide 7 text
2. Text Summarization Overview
Extractive Summarization (cont'd)
Sentence 1 → score 1
Sentence 2 → score 2
Sentence 3 → score 3
Sentence 4 → score 4
Sentence 5 → score 5
…
Sentence n → score n
Select the sentences with the highest scores
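The selection step can be sketched in a few lines of Python. This is only an illustration: the function name `select_top_sentences` and the choice to return the picked sentences in their original document order are assumptions, not something stated on the slide.

```python
def select_top_sentences(sentences, scores, k):
    """Return the k highest-scoring sentences, kept in original order."""
    # Rank sentence indices by score, highest first, and keep the top k.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    # Restore document order so the summary reads naturally (an assumption).
    return [sentences[i] for i in sorted(ranked)]


sentences = ["Sentence 1", "Sentence 2", "Sentence 3", "Sentence 4"]
scores = [0.2, 0.9, 0.5, 0.7]
summary = select_top_sentences(sentences, scores, k=2)
```

Here `summary` holds Sentence 2 and Sentence 4, the two highest-scoring sentences.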
Slide 8
Slide 8 text
3. Vector Space Model
We want to score sentences, so we have to convert the text
representation into a numeric representation
Term Frequency (TF) - Vector Space Model
Sentence 1: Saya pergi ke pasar ("I go to the market")
Sentence 2: Ibu pergi ke rumah ("Mother goes to the house")
Bag of unique words:
saya, pergi, ke, pasar, ibu, rumah
Slide 9
Slide 9 text
3. Vector Space Model
Term Frequency (TF) - Vector Space Model
Sentence 1: Saya pergi ke pasar
Sentence 2: Ibu pergi ke rumah
              saya  pergi  ke  pasar  ibu  rumah
Sentence 1      1     1     1    1     0     0
Sentence 2      0     1     1    0     1     1
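As a rough sketch, the TF vectors for the two example sentences can be built as follows. The helper names `build_vocabulary` and `tf_vector` are hypothetical, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter


def build_vocabulary(sentences):
    """Collect unique lower-cased words in first-seen order."""
    vocab, seen = [], set()
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in seen:
                seen.add(word)
                vocab.append(word)
    return vocab


def tf_vector(sentence, vocab):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocab]


sentences = ["Saya pergi ke pasar", "Ibu pergi ke rumah"]
vocab = build_vocabulary(sentences)
vectors = [tf_vector(s, vocab) for s in sentences]
```

Each sentence becomes a vector with one position per vocabulary word, so two sentences of different wording can be compared numerically.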
Slide 10
Slide 10 text
3. Vector Space Model
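Now we can compute a similarity score between two vectors
Similarity score: Cosine Similarity
$$\mathrm{Sim}(D_1, D_2) = \frac{\sum_i t_{1i}\, t_{2i}}{\sqrt{\sum_i t_{1i}^2}\,\sqrt{\sum_i t_{2i}^2}}$$
where t_{1i} and t_{2i} are the i-th elements (TF values)
of the vectors of D1 and D2
D1 and D2 are document vectors
(here, sentence vectors)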
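A minimal sketch of cosine similarity in plain Python, applied to TF vectors of the two example sentences (vocabulary order: saya, pergi, ke, pasar, ibu, rumah). Returning 0.0 for an all-zero vector is an assumption made here to avoid division by zero.

```python
import math


def cosine_similarity(d1, d2):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: an empty sentence is similar to nothing
    return dot / (norm1 * norm2)


s1 = [1, 1, 1, 1, 0, 0]  # Saya pergi ke pasar
s2 = [0, 1, 1, 0, 1, 1]  # Ibu pergi ke rumah
sim = cosine_similarity(s1, s2)  # 2 shared words / (2 * 2) = 0.5
```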
Slide 11
Slide 11 text
4. Maximum Marginal Relevance
Carbonell, J., & Goldstein, J. (1998, August). The use of MMR, diversity-based
reranking for reordering documents and producing summaries. In Proceedings of the
21st annual international ACM SIGIR conference on Research and development in
information retrieval (pp. 335-336). ACM.
Slide 12
Slide 12 text
4. Maximum Marginal Relevance
Maximum Marginal Relevance
MMR has been widely used in text summarization because
of its simplicity and efficiency.
MMR re-ranks sentences according to their relevance
scores.
The formula also detects and penalizes redundant sentences.
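For reference, the MMR criterion as defined by Carbonell and Goldstein (1998) can be written as below. The symbols are not given on the slides: R is the set of candidate sentences, S the set already selected, Q the query (in summarization, often a vector for the whole document, which is an assumption here), and λ a weight between 0 and 1 that trades relevance against redundancy.

```latex
\mathrm{MMR} = \arg\max_{D_i \in R \setminus S}
  \left[ \lambda \, \mathrm{Sim}_1(D_i, Q)
       - (1 - \lambda) \max_{D_j \in S} \mathrm{Sim}_2(D_i, D_j) \right]
```

With λ = 1 the second term vanishes and the ranking is by relevance alone; lower values of λ penalize sentences that resemble ones already in the summary.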
Slide 13
Slide 13 text
4. Maximum Marginal Relevance
Vector space: TF
TF (Term Frequency) is the standard vector representation for summarization
Another vector space: TF-IDF
TF-IDF(t) = TF(t) * IDF(t, D)
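A small sketch of TF-IDF weighting over a set of tokenized sentences. The slide does not define IDF, so the common form IDF(t, D) = log(N / df(t)) is assumed here, where N is the number of sentences and df(t) the number of sentences containing t. The function name `tf_idf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(tokenized_sentences):
    """Return (vocab, vectors) with TF * log(N / df) weights."""
    n = len(tokenized_sentences)
    vocab = sorted({w for s in tokenized_sentences for w in s})
    # Document frequency: in how many sentences does each word appear?
    df = Counter(w for s in tokenized_sentences for w in set(s))
    idf = {w: math.log(n / df[w]) for w in vocab}
    vectors = []
    for s in tokenized_sentences:
        tf = Counter(s)
        vectors.append([tf[w] * idf[w] for w in vocab])
    return vocab, vectors


tokens = [["saya", "pergi", "ke", "pasar"], ["ibu", "pergi", "ke", "rumah"]]
vocab, vectors = tf_idf_vectors(tokens)
```

Under this definition, words that occur in every sentence ("pergi", "ke") get weight 0, while words unique to one sentence get weight log(2).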
Slide 14
Slide 14 text
4. Maximum Marginal Relevance
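Similarity score: Cosine Similarity
$$\mathrm{Sim}(D_1, D_2) = \frac{\sum_i t_{1i}\, t_{2i}}{\sqrt{\sum_i t_{1i}^2}\,\sqrt{\sum_i t_{2i}^2}}$$
where t_{1i} and t_{2i} are the i-th elements (TF or TF-IDF values) of the vectors of D1 and D2
D1 and D2 are document vectors (here, sentence vectors)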
Slide 15
Slide 15 text
4. Maximum Marginal Relevance
MMR Process in Summarization
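Candidate passages, ranked by score:
Passage 1 : 0.8
Passage 2 : 0.7
Passage 3 : 0.6
Passage 4 : 0.5
Passage 5 : 0.4
Choose the passage with the highest score (Passage 1: 0.8) and add it to the document summary
Re-calculate the scores of the remaining passages and re-rank them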
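The select, re-score, re-rank loop can be sketched as a greedy procedure. This is an illustrative version, not the implementation used in the demo: the function names, the default `lam=0.7`, and the use of a caller-supplied `query` vector for relevance are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_select(vectors, query, k, lam=0.7):
    """Greedily pick up to k vector indices by the MMR criterion."""
    selected = []
    candidates = list(range(len(vectors)))
    while candidates and len(selected) < k:
        best, best_score = None, float("-inf")
        for i in candidates:
            relevance = cosine(vectors[i], query)
            # Redundancy: similarity to the closest already-selected passage.
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected), default=0.0
            )
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)      # move the winner into the summary
        candidates.remove(best)    # remaining passages are re-scored next round
    return selected


vectors = [[2, 2, 0], [2, 1, 0], [0, 0, 1]]
query = [1, 1, 1]
picked = mmr_select(vectors, query, k=2)
relevance_only = mmr_select(vectors, query, k=2, lam=1.0)
```

In this toy input, passage 1 is more relevant than passage 2 but nearly duplicates passage 0, so `picked` contains indices 0 and 2, while `relevance_only` (λ = 1) contains 0 and 1.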
Slide 16
Slide 16 text
5. Overall Summarization Stages
1. Split documents into sentences (Sentence 1, Sentence 2, …, Sentence n)
2. Preprocessing
3. Building Vector Space
4. Scoring
5. Selecting top-x
Slide 17
Slide 17 text
6. DEMO
Preprocessing used in my implementation
Removing special characters
Converting all text to lower case
Stemming, e.g.:
membawa → bawa
bernyanyi → nyanyi
Removing stopwords, e.g.:
“saya, dia, ke, apa, apakah, etc.”
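The four preprocessing steps can be sketched as below. This is not the demo's code: the stopword set contains only the words quoted on the slide, and `STEM_LOOKUP` is a hypothetical two-entry table standing in for a real Indonesian stemmer.

```python
import re

# Only the examples from the slide; a real stopword list is much longer.
STOPWORDS = {"saya", "dia", "ke", "apa", "apakah"}

# Hypothetical stand-in for a real stemmer, covering the slide's two examples.
STEM_LOOKUP = {"membawa": "bawa", "bernyanyi": "nyanyi"}


def preprocess(sentence):
    """Clean a sentence and return its list of content tokens."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", sentence)   # remove special characters
    text = text.lower()                               # convert to lower case
    tokens = [STEM_LOOKUP.get(t, t) for t in text.split()]  # stemming
    return [t for t in tokens if t not in STOPWORDS]  # remove stopwords


tokens = preprocess("Saya membawa buku ke pasar!")
```

For the sample sentence, `tokens` is `["bawa", "buku", "pasar"]`.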
Slide 18
Slide 18 text
Thank You
Any Question?
Slide 21
Slide 21 text
Evaluation Technique
ROUGE stands for Recall-Oriented Understudy for
Gisting Evaluation
ROUGE-N uses n-gram overlap to do the evaluation
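A minimal sketch of a ROUGE-N style recall score: the fraction of the reference summary's n-grams that also appear in the candidate summary. It assumes a single reference and whitespace tokenization; the function names are hypothetical, and published ROUGE tooling has further options not shown here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Overlapping n-grams divided by total n-grams in the reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each match by how often the n-gram occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


reference = "ibu pergi ke pasar"
candidate = "ibu pergi ke rumah"
rouge_1 = rouge_n_recall(candidate, reference, n=1)  # 3 of 4 unigrams match
rouge_2 = rouge_n_recall(candidate, reference, n=2)  # 2 of 3 bigrams match
```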