
KMKLabs
June 16, 2016

AUTOMATIC TEXT SUMMARIZATION : Maximum Marginal Relevance (MMR) Technique

This tech talk dissects an automatic text summarization algorithm commonly discussed in the field of NLP (Natural Language Processing). A text summarization system condenses a document into the sentences that capture its essence or main topic. It lets us get the key information from a document quickly, without having to read the whole document manually. The summarization approach covered in this tech talk is Maximum Marginal Relevance (MMR), which belongs to the extractive category (selecting sentences already present in the document as the core sentences of its content). MMR works by assigning a weight to every sentence in the document. In this tech talk, the weighting uses a vector space model based on Term Frequency (TF), with Cosine Similarity as the similarity measure.


Transcript

  1. TECH TALK AUTOMATIC TEXT SUMMARIZATION: Maximum Marginal Relevance (MMR) Technique. Fajri Koto, Analytic Team. June 3rd 2016, Jakarta, Indonesia. PT Kreatif Media Karya, www.kmklabs.com, [email protected]
  2. Outline: 1. Why Do We Need a Summarization Engine? 2. Text Summarization Overview 3. Vector Space Model 4. MMR Algorithm 5. Overall Summarization Stages 6. Demo
  3. 1. Why do we need a Summarization Engine? Definition: the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document.
  4. 1. Why do we need a Summarization Engine? Our storage becomes cheaper and larger. The number of available documents grows larger and larger (image, video, text). We want to obtain information quickly. Which summary? A quality, informative summary.
  5. 2. Text Summarization Overview. Abstractive Summarization: building new sentences as a summary of the whole text. Extractive Summarization: selecting the most representative (informative) sentences from the text/document itself as the summary, by scoring.
  6. 2. Text Summarization Overview. Extractive Summarization (cont'd): Sentence 1 → score 1, Sentence 2 → score 2, Sentence 3 → score 3, Sentence 4 → score 4, Sentence 5 → score 5, ..., Sentence n → score n. Select the sentences with the highest scores.
  7. 3. Vector Space Model. We want to do scoring, so we have to turn the text representation into a number representation: the Term Frequency (TF) Vector Space Model. Sentence 1: Saya pergi ke pasar. Sentence 2: Ibu pergi ke rumah. Bag of unique words: saya, pergi, ke, pasar, ibu, rumah.
  8. 3. Vector Space Model. Term Frequency (TF) Vector Space Model. Sentence 1: Saya pergi ke pasar. Sentence 2: Ibu pergi ke rumah. Bag of unique words: saya, pergi, ke, pasar, ibu, rumah.
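The TF vectors for the two example sentences can be built directly from the bag of unique words. A minimal sketch in Python (the helper name `tf_vector` is mine, not from the talk):

```python
# Build Term Frequency (TF) vectors over the slide's bag of unique words.

def tf_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    words = sentence.lower().split()
    return [words.count(term) for term in vocabulary]

vocabulary = ["saya", "pergi", "ke", "pasar", "ibu", "rumah"]

s1 = tf_vector("Saya pergi ke pasar", vocabulary)
s2 = tf_vector("Ibu pergi ke rumah", vocabulary)

print(s1)  # [1, 1, 1, 1, 0, 0]
print(s2)  # [0, 1, 1, 0, 1, 1]
```

Each sentence becomes a vector with one dimension per vocabulary word, which is what makes the numeric similarity scoring on the next slide possible.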
  9. 3. Vector Space Model. Now we can compute a similarity score between two vectors. Similarity score: Cosine Similarity, where t is an element of the TF vector of document D, and D1 and D2 are the document vectors (sentence vectors).
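The cosine similarity the slide refers to (its formula image did not survive in the transcript) is the standard dot product over the product of the vector norms. A sketch using the TF vectors of the two example sentences, with variable names of my own choosing:

```python
import math

def cosine_similarity(d1, d2):
    """sim(D1, D2) = sum(t1 * t2) / (|D1| * |D2|), over TF elements t."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = [1, 1, 1, 1, 0, 0]  # TF vector of "Saya pergi ke pasar"
d2 = [0, 1, 1, 0, 1, 1]  # TF vector of "Ibu pergi ke rumah"

print(cosine_similarity(d1, d2))  # 0.5
```

The two sentences share two of their four words ("pergi", "ke"), which is why the score comes out at 0.5.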
  10. 4. Maximum Marginal Relevance. Carbonell, J., & Goldstein, J. (1998, August). The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 335-336). ACM.
  11. 4. Maximum Marginal Relevance. MMR has been widely used in text summarization because of its simplicity and efficiency. MMR re-ranks the sentences according to their relevance scores. The formula looks at, and handles, redundant sentences.
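The formula shown on this slide was not preserved in the transcript; in the cited Carbonell & Goldstein (1998) paper, MMR is defined as:

```latex
\mathrm{MMR} = \arg\max_{D_i \in R \setminus S}
  \Big[\, \lambda \,\mathrm{Sim}_1(D_i, Q)
        - (1 - \lambda) \max_{D_j \in S} \mathrm{Sim}_2(D_i, D_j) \,\Big]
```

where \(R\) is the set of candidate sentences, \(S\) the sentences already selected into the summary, \(Q\) the query (here, the document itself), and \(\lambda \in [0, 1]\) trades off relevance against redundancy.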
  12. 4. Maximum Marginal Relevance. Vector space: TF. TF (Term Frequency) is the standard vector representation for summarization. Another vector space: TF-IDF, where TFIDF(t, D) = TF(t) * IDF(t, D).
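The TF-IDF weighting can be sketched as follows. The slide only gives TFIDF = TF(t) * IDF(t, D), so the IDF definition below (log of the number of sentences over a term's document frequency) is the standard textbook form, not necessarily the exact variant used in the talk:

```python
import math

def tfidf_vectors(sentences, vocabulary):
    """TF-IDF vectors, treating each sentence as a 'document'."""
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    # Document frequency: in how many sentences does the term occur?
    df = {t: sum(1 for d in docs if t in d) for t in vocabulary}
    # Standard idf = log(N / df); terms absent everywhere get weight 0.
    idf = {t: math.log(n / df[t]) if df[t] else 0.0 for t in vocabulary}
    return [[d.count(t) * idf[t] for t in vocabulary] for d in docs]

vocabulary = ["saya", "pergi", "ke", "pasar", "ibu", "rumah"]
vectors = tfidf_vectors(["Saya pergi ke pasar", "Ibu pergi ke rumah"],
                        vocabulary)
print([round(v, 3) for v in vectors[0]])  # [0.693, 0.0, 0.0, 0.693, 0.0, 0.0]
```

With only two sentences, the words shared by both ("pergi", "ke") get an IDF of log(1) = 0 and drop out entirely; the rarer words carry the weight.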
  13. 4. Maximum Marginal Relevance. Similarity score: Cosine Similarity, where t is an element of the vector (TF or TF-IDF) of document D, and D1 and D2 are the document vectors (sentence vectors).
  14. 4. Maximum Marginal Relevance. MMR Process in Summarization. Document: Passage 1: 0.8, Passage 2: 0.7, Passage 3: 0.6, Passage 4: 0.5, Passage 5: 0.4. Choose the highest score (Passage 1: 0.8) into the Summary, then re-calculate the scores and re-rank.
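The pick-then-re-rank loop on this slide can be sketched as a small greedy function. The vectors, relevance scores, and lambda value below are illustrative, not from the talk:

```python
import math

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(vectors, relevance, k, lam=0.7):
    """Greedy MMR: repeatedly pick the candidate whose relevance,
    minus its similarity to what is already selected, is highest."""
    selected, candidates = [], list(range(len(vectors)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            redundancy = max((cosine(vectors[i], vectors[j])
                              for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Passages 0 and 1 are near-duplicates; MMR skips the redundant one.
vectors = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
relevance = [0.8, 0.7, 0.6]
print(mmr_select(vectors, relevance, k=2))  # [0, 2]
```

Plain top-k by relevance would pick passages 0 and 1; the redundancy penalty is what makes MMR prefer the less similar passage 2 instead.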
  15. 5. Overall Summarization Stages. Documents → Sentence 1, Sentence 2, Sentence 3, Sentence 4, Sentence 5, ..., Sentence n → Preprocessing → Building the Vector Space → Scoring → Selecting the top-x.
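The stages above can be tied together in a short end-to-end sketch. Everything here (the sentence splitting, the choice to score each sentence against the whole-document vector, the function names) is a simplified illustration, not the talk's actual implementation:

```python
import math
import re

def summarize(document, top_x=2):
    """Split into sentences, preprocess, build TF vectors, score each
    sentence against the whole document, and keep the top-x sentences."""
    sentences = [s.strip() for s in re.split(r"[.!?]", document) if s.strip()]
    tokens = [re.sub(r"[^a-z ]", "", s.lower()).split() for s in sentences]
    vocab = sorted({w for t in tokens for w in t})
    vectors = [[t.count(w) for w in vocab] for t in tokens]
    # Score relevance as similarity to the document-wide TF vector.
    doc_vector = [sum(v[i] for v in vectors) for i in range(len(vocab))]

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    scores = [cosine(v, doc_vector) for v in vectors]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:top_x]]

doc = "Saya pergi ke pasar. Ibu pergi ke rumah. Kucing tidur."
print(summarize(doc, top_x=2))
```

A full implementation would plug the MMR re-ranking into the final selection step instead of taking the raw top-x, so that near-duplicate sentences are not picked together.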
  16. 6. DEMO. Preprocessing used in my implementation: removing special characters; converting all text to lower case; stemming, e.g. membawa → bawa, bernyanyi → nyanyi; removing stopwords, e.g. "saya, dia, ke, apa, apakah", etc.
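The preprocessing steps listed on this slide can be sketched as below. The stopword list is just the sample given on the slide, and the stemming step (e.g. membawa → bawa) is omitted, since it would need a full Indonesian stemmer:

```python
import re

# Sample stopwords from the slide; a real implementation would use a
# full Indonesian stopword list.
STOPWORDS = {"saya", "dia", "ke", "apa", "apakah"}

def preprocess(text):
    text = re.sub(r"[^a-zA-Z\s]", " ", text)  # remove special characters
    words = text.lower().split()              # lower-case and tokenize
    return [w for w in words if w not in STOPWORDS]

print(preprocess("Saya pergi ke pasar!"))  # ['pergi', 'pasar']
```

After this step each sentence is a clean list of content words, ready for the vector-space and scoring stages described earlier.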