Summarization by Analogy: An Example-based Approach for News Articles

Megumi Makino and Kazuhide Yamamoto. Summarization by Analogy: An Example-based Approach for News Articles. Proceedings of The Third International Joint Conference on Natural Language Processing (IJCNLP2008), pp.739-744 (2008.1)



January 31, 2008


  1. Summarization by Analogy: An Example-based Approach for News Articles

     Megumi Makino and Kazuhide Yamamoto (Nagaoka University of Technology, Japan)
  2. Previous Works Compute the Importance of Words or Sentences

     • Previous works
     – There are many sentence extraction and sentence compression methods.
     – They compute the importance of each word or sentence, using term frequency, the title, sentence location, etc.
     – They then extract sentences, or compress a single sentence.
     – However, this is not how we generate a summary.
  3. Measuring the Importance of Words or Sentences Is Impossible

     • We generate a summary
     – using the rich knowledge and experience in our minds; we cannot measure the importance of each word or sentence.
     – by combining phrases from several sentences; it has been difficult for previous methods to generate a summary that combines phrases drawn from several sentences.
  4. We Summarize a Text as a Human Does

     • Goal
     – To generate summaries the way a human does: by selecting and combining phrases from the whole text.
     – To generate summaries without using an importance measure.
  5. We Summarize a Text by Imitating an Instance

     • Idea
     – Example-based approach
       • We use a collection of human-written summaries as instances.
       • We generate a summary by imitating a similar instance and combining phrases from the whole input.
     – Because the instances were written by humans, they embody human knowledge and experience.
  6. Advantages of the Example-based Approach

     1. High modularity – we can improve the system simply by adding or changing instances.
     2. Similarity rather than importance – we substitute the similarity between two phrases for an importance measure.
     3. Good fit to local context – using instances similar to the input increases the fitness to the input contents.
  7. System Overview of Example-based Summarization

     [Figure: system overview]
     – Step 1: Compare the input text to each instance in the instance collection (news headlines) and retrieve the most similar instance.
     – Step 2: Align phrases between the similar instance and the input; one instance phrase may correspond to several input phrases.
     – Step 3: Combine the corresponding phrases along the highest-score path into one output sentence.
  8. Retrieval of Similar Instance

     • Compute a similarity
     – Aim: to obtain an instance whose contents are similar to the input.
     – Compute Sim(E, I) between the input I and each instance E in the instance collection.
     – Select the instance with the highest similarity Sim(E, I).
  9. Retrieval of Similar Instance

     Sim(E, I) = Σ_{i=1}^{n} Score(i) · { w · |T_p(E) ∩ T_pi(I)| + |T_c(E) ∩ T_ci(I)| }

     – n : the number of sentences in the input
     – Score(i) and w : weights for the main topic of the input
     – T_pi(·) : the set of predicate words in the i-th sentence
     – T_ci(·) : the set of content words in the i-th sentence
     – |T_p(E) ∩ T_pi(I)| : the number of overlapping words
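     As a rough sketch, the retrieval score can be coded as below. The paper only calls Score(i) a weight for the main topic of the input, so it is modeled here as a simple lead-sentence bias; all helper names and the value of w are illustrative.

     ```python
     # A minimal sketch of the instance-retrieval score Sim(E, I).
     # Assumption: each sentence is pre-split into a set of predicate
     # words and a set of content words.

     def sim(instance, input_sentences, w=2.0):
         """Sum, over input sentences, of the weighted word overlap
         between the instance E and the i-th input sentence."""
         tp_e, tc_e = instance  # predicate / content words of E
         total = 0.0
         for i, (tp_i, tc_i) in enumerate(input_sentences):
             score_i = 1.0 if i == 0 else 0.5  # assumed topic weight
             total += score_i * (w * len(tp_e & tp_i) + len(tc_e & tc_i))
         return total
     ```

     Retrieval then simply keeps the instance with the highest score, e.g. `max(instances, key=lambda e: sim(e, input_sentences))`.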
  10. Phrase Alignment: One-to-many Correspondence

     • Compare and align the phrases
     – One-to-many correspondence: we link one phrase in the similar instance to several similar phrases in the input.
     – Four alignment measures:
       • Agreement of grammatical case
       • Agreement of named-entity tag
       • Enhanced edit distance
       • Word similarity based on mutual information
  11. Phrase Alignment Using 4 Measures

     (1) Agreement of grammatical case: 私が (I subj) ↔ 彼が (he subj); 計画を (plan obj) ↔ 予定を (schedule obj) [subj, obj: subject or object case marker]
     (2) Agreement of named entity: Panasonic ↔ SONY (ORGANIZATION); 24日 ↔ 15日 (DATE)
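     The first two measures are simple equality tests. A toy sketch follows; the attribute names `case` and `ne` are illustrative, not from the paper.

     ```python
     # Toy versions of alignment measures (1) and (2): two phrases may
     # be linked when their grammatical case markers agree, or when
     # both carry the same named-entity tag.

     def case_agrees(a, b):
         """(1) e.g. 私が (I subj) and 彼が (he subj) both carry subj."""
         return a.get("case") is not None and a.get("case") == b.get("case")

     def ne_agrees(a, b):
         """(2) e.g. Panasonic and SONY both tagged ORGANIZATION."""
         return a.get("ne") is not None and a.get("ne") == b.get("ne")
     ```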
  12. Phrase Alignment Using 4 Measures

     (3) Enhanced edit distance [Yamamoto et al. 03]
     – Links abbreviated phrases: 日銀 ↔ 日本銀行 (Bank of Japan)
     – We link the three input phrases with the smallest distance to each instance phrase.
     (4) Similarity based on mutual information [Lin 98]
     – Links syntactically similar phrases: 大会を開く (to hold a convention) ↔ 会議を開く (to hold a meeting)
     – We link the three most similar input phrases to each instance phrase.
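     The top-3 linking of measure (3) can be sketched with plain Levenshtein distance; the enhanced variant of the paper [Yamamoto et al. 03] additionally handles abbreviations such as 日銀 / 日本銀行, which this simplification does not.

     ```python
     # A sketch of the top-3 linking for measure (3), using plain edit
     # distance in place of the paper's enhanced variant.

     def edit_distance(a, b):
         """Standard Levenshtein distance via dynamic programming."""
         dp = list(range(len(b) + 1))
         for i, ca in enumerate(a, 1):
             prev, dp[0] = dp[0], i
             for j, cb in enumerate(b, 1):
                 prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                          dp[j - 1] + 1,      # insertion
                                          prev + (ca != cb))  # substitution
         return dp[len(b)]

     def top3_by_distance(instance_phrase, input_phrases):
         """Link the 3 input phrases closest to the instance phrase."""
         return sorted(input_phrases,
                       key=lambda p: edit_distance(instance_phrase, p))[:3]
     ```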
  13. Combine the Phrases Using Dynamic Programming

     [Figure: lattice of candidate phrases between <s> and </s>]
     – Nodes are the input phrases corresponding to each phrase of the similar instance (e.g. the phrases corresponding to phrase "a" in the instance).
     – We search for the best path through the lattice, e.g. best path: <s> A L O D E </s>.
  14. Combine the Phrases: Node Score

     • Aim
     – The summary should consist of phrases similar to those of the similar instance.
     – The summary should be readable.
     • Node score N(w_i) – indicates the reliability of the phrase similarity:
       N(w_i) = max { 0.5 (if the grammatical case or NE tag is matched), 1/rank (otherwise) }
       where rank is the similarity rank order of the phrase, obtained from the edit distance or the MI-based similarity.
  15. Combine the Phrases: Edge Score

     • Edge score E(w_{i-1}, w_i) – indicates the adequacy of the phrase connection:
       E(w_{i-1}, w_i) = 1 / ( |loc(w_i) − loc(w_{i-1})| + 1 )
       where loc(w_i) is the position of the input sentence containing phrase w_i.
     • Search for the maximum-score path:
       Score(W) = α Σ_{i=0}^{m} N(w_i) + (1 − α) Σ_{i=1}^{m} E(w_{i-1}, w_i)
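     Putting the node and edge scores together, the path search can be sketched as a Viterbi-style dynamic program. The candidate format and the value of α are illustrative, not from the paper.

     ```python
     # A sketch of the combination step: one candidate phrase is
     # chosen per instance slot so that
     # Score(W) = a * sum N(w_i) + (1 - a) * sum E(w_{i-1}, w_i)
     # is maximized.

     def best_path(candidates, alpha=0.5):
         """candidates: one list per instance phrase, each entry a
         (phrase, node_score, loc) triple, loc being the index of the
         input sentence containing the phrase."""
         def edge(prev_loc, loc):
             return 1.0 / (abs(loc - prev_loc) + 1)  # E(w_{i-1}, w_i)

         # (best score so far, path) for each candidate of slot 0
         best = [(alpha * n, [p]) for (p, n, _) in candidates[0]]
         locs = [loc for (_, _, loc) in candidates[0]]
         for slot in candidates[1:]:
             new_best, new_locs = [], []
             for (p, n, loc) in slot:
                 score, path = max(
                     (s + alpha * n + (1 - alpha) * edge(pl, loc), pa)
                     for (s, pa), pl in zip(best, locs))
                 new_best.append((score, path + [p]))
                 new_locs.append(loc)
             best, locs = new_best, new_locs
         return max(best)[1]  # phrases along the highest-score path
     ```

     Candidates from the same or adjacent input sentences are favored by the edge score, which keeps the combined sentence coherent.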
  16. Sectional Evaluation

     • Test corpus
     – Instances: 26,764 news headlines
     – Input: 134 news articles from the Nihon Keizai Shimbun (a Japanese newspaper)
     – Training: 150 news articles and their summaries (used to tune the parameter α)
     • Evaluation
     – A human judge evaluated each part of our system:
       • the retrieval of the similar instance
       • the phrase alignment and combination
  17. Result of Retrieval Process of Similar Instance: 57% Accuracy

     • How similar are the input and the retrieved instance?
       1. quite similar: 40
       2. slightly similar: 37
       3. not very similar: 29
       4. not similar: 28
       total: 134
     – 77 / 134 tests are good (ratings 1–2); the accuracy is 57%.
     – The current similarity matches content words; our plan is another measure focused on similar words.
  18. Result of Output Summary: 62% Accuracy

     • Evaluated on the 77 tests judged to have good similar instances in the retrieval evaluation.
     • How proper is the output summary?
       1. quite proper: 33
       2. slightly proper: 15
       3. not very proper: 22
       4. not proper: 7
       total: 77
     – 48 / 77 tests are good (ratings 1–2); the accuracy is 62%.
  19. Output Example 1

     Input: 神奈川県警の一連の不祥事のうち、厚木署集団警ら隊の集団暴行事件で起訴された元巡査部長、川野優被告の論告求刑公判が二十一日、横浜地裁で開かれた。検察側はひまを持て余して部下に短銃を突き付けるなど、組織における地位の高さに乗じた悪質な行為などと理不尽な暴力を指弾し、川野被告に懲役一年六月を求刑した。判決は一月十一日に言い渡される。
     (Among a series of scandals in the Kanagawa Prefectural Police, the closing arguments in the trial of the defendant Kawano, a former police sergeant indicted in the group-assault case of the Atsugi station patrol unit, were heard on the 21st at the Yokohama District Court. The prosecution denounced his unreasonable violence, such as pointing a handgun at a subordinate out of boredom, as vicious acts exploiting his rank in the organization, and demanded one year and six months in prison. The verdict will be handed down on January 11th.)
     Similar instance: 大阪地裁で22日、8人が犠牲となった池田小児童殺傷事件の論告求刑が開かれ、検察側は宅間被告に死刑を求刑した。
     (The prosecution made Takuma's closing arguments on the 22nd in the trial at the Osaka District Court, and asked for the death penalty.)
     Output summary: 横浜地裁で二十一日、論告求刑が開かれ、検察側は川野被告に懲役一年六月を求刑した。
     (The prosecution made Kawano's closing arguments on the 21st in the trial at the Yokohama District Court and demanded one and a half years in prison.)
     – Phrases from across the whole text are picked and combined into one sentence.
  20. Output Example 2

     Input: 十四日の東京株式市場でソフトバンク株が急伸し、株式時価総額でトヨタ自動車を抜いて第三位に浮上した。インターネット関連の中核銘柄として、国内外の機関投資家や個人投資家の買いが集まった結果だ。日本を代表するメーカーであるトヨタの時価総額を抜いたことについて、市場では日本の産業構造の変化を象徴しているとの声も出ている。(The rest is omitted.)
     (On the 14th, Softbank shares surged on the Tokyo stock market, overtaking Toyota Motor to rise to third place in total market value. This is the result of buying by institutional and individual investors at home and abroad, for whom Softbank is a core Internet-related stock. Some in the market say that surpassing the market value of Toyota, a maker representative of Japan, symbolizes a change in Japan's industrial structure.)
     Similar instance: 株式時価総額でキャノンが9日、ソニーを抜いて電気機器業界トップに。
     (Canon beats Sony in total market value on the 9th and takes the No. 1 position in the electrical equipment industry.)
     Output summary: 株式時価総額でソフトバンク株が十四日、トヨタ自動車を抜いて第三位に。
     (Softbank Corp. beats Toyota Motor in total market value on the 14th and takes the No. 3 position.)
     – Imitating similar instances yields readable, compressed summaries.
  21. Conclusion

     • Our method generates a summary by imitating a similar instance.
     – We compare phrases in the input directly with those of its similar instance.
       • No need to compute the importance of sentences or words.
       • High fitness to local context.
  22. Conclusion

     • Our method can summarize a long text into one sentence by picking and combining phrases from several sentences.
     – We can produce a summary that covers content from the whole text.
     – Sentence extraction and sentence compression methods cannot generate summaries like our outputs.