
Natural Language Processing (10) Summarization

自然言語処理研究室

November 22, 2013

Transcript

  1. 1 / 30 Natural Language Processing (10) Summarization

    Kazuhide Yamamoto
    Dept. of Electrical Engineering, Nagaoka University of Technology
  2. 2 / 30 Two kinds of summaries

    Informative summary
    • made as a substitute for the original;
    • information is preserved as much as possible.

    Indicative summary
    • made to help judge whether we should read the original;
    • less important information is dropped.

    The summary to generate depends on which kind we need.
  3. 3 / 30 Summarization method

    • abstraction / generalization of words, concepts, etc.
    • paraphrasing
    • selection of important parts of the input (in particular, sentence selection)

    Most of the proposed summarization methods work by selecting sentences.
  4. 4 / 30 Sentence selection (1) word frequencies

    • long history, going back to 1958
    • criterion: words that appear frequently are important
    • the sum of TF*IDF over the words of a sentence is taken as the importance of the sentence (1996)
      – similar to the result of human selection
      – better than head selection (taking the leading sentences) in newspaper articles.
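The scoring rule above can be sketched as follows. This is only an illustration: the whitespace tokenizer, the function names, and the toy word weights are assumptions, and in practice the weights would come from TF*IDF.

```python
def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w]


def select_sentences(sentences, weight, k):
    """Pick the k sentences with the largest summed word weight.

    The selected sentences are returned in their original order.
    """
    scores = [sum(weight.get(w, 0.0) for w in tokenize(s)) for s in sentences]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


sentences = [
    "The committee met on Monday.",
    "The earthquake damaged the bridge.",
    "Repairs to the bridge start soon.",
]
# Toy weights standing in for TF*IDF values (hypothetical numbers).
weight = {"earthquake": 3.0, "bridge": 2.0, "repairs": 1.5, "committee": 0.5}
summary = select_sentences(sentences, weight, 2)
```

Here the two sentences about the earthquake and the bridge are kept, and the low-scoring first sentence is dropped.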
  5. 5 / 30 TF*IDF

    TF(w): term frequency
    • frequency of the target word w in the target document

    IDF(w): inverse document frequency
    • IDF(w) = log(N / DF(w)) + 1
    • N: total number of documents
    • DF(w): number of documents that include w

    TF*IDF measures how strongly the target word is concentrated in a particular document. In order to compute TF*IDF, we need to define what a "document" is in advance.
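A minimal sketch of this computation, using the IDF formula given on the slide. It assumes that each document is already a list of words, that TF is counted within one target document, and that the logarithm is natural (the slide does not state the base). The function name and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(documents, target_index):
    """Return {word: TF*IDF} for the words of documents[target_index].

    IDF(w) = log(N / DF(w)) + 1, where N is the number of documents and
    DF(w) is the number of documents that contain w.
    """
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # count each word once per document
    tf = Counter(documents[target_index])
    return {w: tf[w] * (math.log(n_docs / df[w]) + 1) for w in tf}


docs = [
    ["summary", "of", "the", "news"],
    ["the", "news", "of", "the", "day"],
    ["the", "summary", "summary"],
]
weights = tf_idf(docs, 2)
```

In the third document, "the" appears in every document and gets weight 1, while "summary" appears twice and in fewer documents, so it gets a larger weight.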
  6. 6 / 30 Sentence selection (2) clue words

    Clue words are used as a key for deciding whether or not to select a sentence.
    • "for example"
      – may be followed by examples, which are considered not important
    • "therefore", "in summary", "consequently"
      – may be followed by conclusions, which are important.
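One way to turn clue words into a score is sketched below. The clue lists are taken from the slide; the bonus and penalty values and the simple substring matching are assumptions made for illustration.

```python
POSITIVE_CLUES = ("therefore", "in summary", "consequently")
NEGATIVE_CLUES = ("for example",)


def clue_score(sentence, bonus=1.0, penalty=1.0):
    """Add a bonus per positive clue and subtract a penalty per negative clue."""
    text = sentence.lower()
    score = 0.0
    score += bonus * sum(1 for clue in POSITIVE_CLUES if clue in text)
    score -= penalty * sum(1 for clue in NEGATIVE_CLUES if clue in text)
    return score
```

A sentence such as "Therefore, the plan was dropped." gets a positive score and "For example, apples are cheap." gets a negative one; the result would typically be added to a frequency-based score.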
  7. 7 / 30 Sentence selection (3) position

    Sentence position can also be a key for sentence extraction when the given documents are, for example, newspaper articles or editorials.
    • newspaper articles: the head is important.
    • editorials: not only the head but also the end is important, since the conclusion may be written at the end.

    We assign an importance weight to each sentence according to its position in the document.
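The position weighting can be sketched like this. The genre labels, the number of boosted sentences, and the boost value are all hypothetical parameters; the slide only says that weights depend on position.

```python
def position_weights(n_sentences, genre, head=3, tail=2, boost=2.0):
    """Return one weight per sentence position.

    The first `head` sentences are boosted for every genre; for editorials
    the last `tail` sentences are boosted as well.
    """
    weights = [1.0] * n_sentences
    for i in range(min(head, n_sentences)):
        weights[i] = boost
    if genre == "editorial":
        for i in range(max(0, n_sentences - tail), n_sentences):
            weights[i] = boost
    return weights
```

For a six-sentence editorial with `head=2` and `tail=1`, the first two sentences and the last one receive the boosted weight.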
  8. 8 / 30 Sentence selection (4) the title

    The title of a document contains some important words, so it can be used for selection. However, I personally think that using it is not fair.
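A minimal sketch of using the title: count how many title words a sentence contains. The tokenizer and the small stopword list are assumptions for illustration.

```python
STOPWORDS = frozenset({"a", "an", "the", "of", "in", "on", "and", "to"})


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w]


def title_score(sentence, title):
    """Number of distinct non-stopword title words that occur in the sentence."""
    title_words = set(tokenize(title)) - STOPWORDS
    return sum(1 for w in set(tokenize(sentence)) if w in title_words)
```

With the title "Earthquake damages bridge", the sentence "The bridge was closed after the earthquake." shares two title words.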
  9. 9 / 30 Sentence selection (5) discourse structure

    To summarize a document in a real sense, we need to analyze its structure. However, this is not easy for the time being.
  10. 10 / 30 Sentence selection (6) cohesion

    Cohesion is the degree of connectivity between words and phrases. If we measure the cohesion between two given sentences, we can see which sentences to cut.
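The slide does not say how cohesion is measured. As a stand-in, the sketch below uses word overlap (Jaccard similarity) between two sentences; this choice of measure and the tokenizer are assumptions, not the method of the lecture.

```python
def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w]


def cohesion(sentence_a, sentence_b):
    """Jaccard overlap of the word sets of two sentences (0.0 to 1.0)."""
    words_a, words_b = set(tokenize(sentence_a)), set(tokenize(sentence_b))
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)
```

A pair of sentences with a low value shares little vocabulary, which under this crude measure suggests a weak connection between them.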
  11. 11 / 30 Problems in (current) sentence selection

    • less coherence throughout the text
    • no reference resolution is conducted.
    • the longer, the better.
      – since scores are more or less summed up.
    • chunk: is the sentence the best chunk for selection?
    • duplication: if sentence A is important, a sentence A' that is close to A is likely to be important too (and may also be selected for the summary).
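The "the longer, the better" problem can be seen in a small example. The word weights are invented, and the averaged score is shown only as one possible normalization, not as a fix proposed in the lecture.

```python
def summed_score(words, weight):
    """Sentence score as the plain sum of word weights (favors long sentences)."""
    return sum(weight.get(w, 0.0) for w in words)


def mean_score(words, weight):
    """Length-normalized variant: average weight per word (an assumed alternative)."""
    return summed_score(words, weight) / len(words) if words else 0.0


short = ["earthquake", "hits"]
long_ = "the committee discussed the budget and the schedule at length".split()

# Hypothetical weights: every word of the long sentence is only mildly important.
weight = {w: 0.5 for w in long_}
weight.update({"earthquake": 3.0, "hits": 1.0})
```

With summed scores the long sentence wins (5.0 against 4.0) although none of its words is important; with the average the short sentence wins (2.0 against 0.5).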
  12. 12 / 30 Summarization by abstraction

    • Use of a thesaurus
      – I bought an apple, an orange, and a grape.
      – I bought some fruit.
    • Use of a dictionary
      – I gave him a good reason and caused him to stop hiking.
      – I persuaded him to stop hiking.

    Both attempts are still experimental. The difficulty lies in constructing the language resources, such as the thesaurus and the dictionary.
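The thesaurus case can be sketched with a toy lookup table. The table and the function name are hypothetical; a real system would need a full thesaurus, which is exactly the resource problem mentioned on the slide.

```python
# Toy thesaurus mapping a word to its broader term (hypothetical entries).
THESAURUS = {"apple": "fruit", "orange": "fruit", "grape": "fruit"}


def generalize(items):
    """Return the shared broader term of several items, or None if there is none."""
    parents = {THESAURUS.get(item) for item in items}
    if len(items) > 1 and len(parents) == 1 and None not in parents:
        return parents.pop()
    return None
```

`generalize(["apple", "orange", "grape"])` gives "fruit", so the list in the first sentence can be replaced by one word; a mixed list such as `["apple", "car"]` gives `None` and is left alone.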
  13. 13 / 30 Multiple document summarization

    For about the past 10 years, multiple-document summarization has been attempted in order to meet the following demands:
    • One may want to get an overview of an accident or an event, such as an earthquake.
    • One may want to pick out the core description from among many articles.
    • One may want to read about the same event from a different point of view.
    • One may want to remove duplication.
  14. 14 / 30 Evaluation of Summarization

    How should we evaluate summaries?
    • (automatic) comparison with a human-written summary
    • ranking by humans
    • task-based evaluation: reading comprehension, etc.

    Evaluation criteria
    • readability: how natural the summary is.
    • coverage of important words: how many of the important words the summary contains.
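As one illustration of automatic comparison with a human-written summary, the sketch below computes word-level recall: the fraction of reference words that also appear in the system summary. The choice of measure and the tokenizer are assumptions; the slide does not name a specific metric.

```python
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w]


def unigram_recall(candidate, reference):
    """Fraction of reference words (with multiplicity) covered by the candidate."""
    cand, ref = Counter(tokenize(candidate)), Counter(tokenize(reference))
    if not ref:
        return 0.0
    overlap = sum(min(cand[w], count) for w, count in ref.items())
    return overlap / sum(ref.values())
```

Such a measure addresses only the coverage of words; readability still has to be judged separately.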
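  15. 15 / 30 Sentence contraction

    • Sentence contraction shortens a sentence, not a document.
    • It is applied to automatic narration generation (of one's speech).
    • It can also be used for newswire services on mobile phones.

    「首相がスキャンダルの責任を取って辞意。来月中に解散総選挙へ」
    ("Prime Minister signals intention to resign, taking responsibility for the scandal. Dissolution and a general election to come within the next month.")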
  16. 16 / 30 “Shinkansen summary” generation (1) shortening expressions

    山本 和英, 池田 諭史, 大橋 一輝. 「新幹線要約」のための文末の整形. 自然言語処理, Vol.12, No.6, pp.85-111, 言語処理学会 (2005.11)

    Satoshi Ikeda and Kazuhide Yamamoto. Transforming a Sentence End into News Headline Style. Proceedings of The Third International Workshop on Paraphrasing (IWP2005), pp.41-48 (2005.10)
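  17. 17 / 30 What is a Shinkansen summary?

    • そごうは西武百貨店から新たに代表取締役2人を迎える人事を固めた。基幹店店長に起用しノウハウ導入、立て直し急ぐ。
      (Sogo has finalized personnel changes that bring in two new representative directors from Seibu Department Stores. They will be appointed as managers of flagship stores to introduce know-how and speed up the turnaround.)
    • 政府は2001年度の実質経済成長率1.7%と見込む政府経済見通しを閣議決定した。公共投資は実質3.2%減と想定。
      (The Cabinet approved the government economic outlook, which projects real economic growth of 1.7% for fiscal 2001. Public investment is assumed to fall 3.2% in real terms.)
    • 米の2000年10―12月期のGDP成長率の速報値は1.4%に。個人消費など落ち込み、米経済の急ブレーキを確認。
      (The preliminary figure for US GDP growth in October-December 2000 came to 1.4%. Personal consumption and other items fell, confirming the sudden braking of the US economy.)
    • 鉄鋼大手3社は、雇用延長制度を2001年度から前倒しで導入する。公的年金の変更に対応、自動車や電機大手と足並み。
      (The three major steelmakers will introduce an extended-employment system ahead of schedule, from fiscal 2001. This responds to changes in the public pension and keeps step with the major automobile and electronics makers.)
    • 金融庁は、大手銀行に債券や外為運用状況を四半期ごとに開示するよう要請へ。市場による経営監視を強化するのがねらい。
      (The Financial Services Agency is to ask major banks to disclose their bond and foreign-exchange positions every quarter. The aim is to strengthen monitoring of management by the market.)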
  18. 18 / 30 Shinkansen summary

    • (Japanese) short news messages shown on the Shinkansen.
    • Each article is written in 60 characters.
    • The same messages were obtained through an e-mail service.
      – 3 times a day, 5 days a week.
  19. 19 / 30 Shinkansen summary: characteristics

    • Very short and simple.
    • Most of them are one- or two-sentence summaries.
    • Omission of the expression at the end of the sentence.
      – 「... 実質3.2%減と想定」 ("... assumed to fall 3.2% in real terms")
    • A particle (carrying a special meaning) at the end of the sentence
      – 「... 四半期ごとに開示するよう要請へ」 ("... to request disclosure every quarter")
    • Many Chinese-derived words
      – 「決める」→「決定」 ("decide"), 「選ぶ」→「選出」 ("select")
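  20. 20 / 30 Sentence compression (1) deletion and paraphrasing

    • Expressions at the end of the sentence
      – the assertive 「だ」, the polite 「です」/「ます」, 「...てしまう」
    • Particles
      – 「遺体を発見」→「遺体発見」 ("found a body")
    • Function words
      – 「協議する意向を示す」 ("indicate an intention to hold talks")
      – 「開催することで合意」 ("agree to hold it")
    • Paraphrasing into shorter words
      – 「...が見つかった」→「...を発見」 ("... was found" → "found ...")
      – 「...を調べている」→「...を調査中」 ("is investigating ..." → "... under investigation")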
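The paraphrasing and particle-deletion examples above can be written as simple string rules. This is only a sketch built from the example pairs on the slide; the rule tables and function names are assumptions, and a real system would need morphological analysis rather than plain string replacement.

```python
# Rules that replace a longer expression with a shorter one (from the slide).
PARAPHRASE_RULES = [
    ("が見つかった", "を発見"),
    ("を調べている", "を調査中"),
]

# Particle deletion before a verbal noun (only the slide's example is covered).
PARTICLE_RULES = [
    ("を発見", "発見"),
]


def apply_rules(text, rules):
    """Apply each (long, short) replacement in order."""
    for long_form, short_form in rules:
        text = text.replace(long_form, short_form)
    return text


def paraphrase(text):
    return apply_rules(text, PARAPHRASE_RULES)


def drop_particle(text):
    return apply_rules(text, PARTICLE_RULES)
```

For example, `paraphrase("遺体が見つかった")` gives 「遺体を発見」, and `drop_particle` then shortens it to 「遺体発見」.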
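  21. 21 / 30 Sentence compression (2) particle at end

    • implying presumption
      – 「自首したとみられる」→「自首か」 ("is believed to have turned himself in" → "turned himself in?")
    • implying expected events
      – 「来年から実施する予定だ」→「...実施へ」 ("is scheduled to be implemented from next year" → "... toward implementation")
    • deletion of functional verbs
      – 「ぎりぎりの選択となった」→「...選択に」 ("it came down to a last-minute choice")
      – 「辞任することを明らかにした」→「辞任を明らかに」 ("announced that he would resign")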
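The sentence-end transformations above can be sketched as suffix rules. The patterns are taken directly from the four example pairs on the slide; the rule list and the function name are assumptions, and this is not the implementation described in the cited papers.

```python
import re

# (pattern at the end of the sentence, replacement), tried in order.
END_RULES = [
    (re.compile(r"したとみられる$"), "か"),              # presumption
    (re.compile(r"する予定だ$"), "へ"),                  # expected event
    (re.compile(r"することを明らかにした$"), "を明らかに"),  # functional verb
    (re.compile(r"となった$"), "に"),                    # functional verb
]


def compress_end(sentence):
    """Rewrite the sentence end in headline style using the first matching rule.

    A trailing 「。」 is removed; if no rule matches, the rest is returned unchanged.
    """
    body = sentence.rstrip("。")
    for pattern, replacement in END_RULES:
        new_body, count = pattern.subn(replacement, body)
        if count:
            return new_body
    return body
```

For example, `compress_end("自首したとみられる。")` gives 「自首か」 and `compress_end("来年から実施する予定だ")` gives 「来年から実施へ」.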
  22. 22 / 30 “Shinkansen summary” generation (2) identifying important expressions

    山本 和英, 牧野 恵. 要約事例を用例として模倣利用したニュース記事要約. 自然言語処理, Vol.15, No.3, pp.115-158, 言語処理学会 (2008.7)

    Megumi Makino and Kazuhide Yamamoto. Summarization by Analogy: An Example-based Approach for News Articles. Proceedings of The Third International Joint Conference on Natural Language Processing (IJCNLP2008), pp.739-744 (2008.1)
  23. 23 / 30 What we learn from the summary

    • Expressions are well refined (or compressed).
      – how to express things simply
    • Expressions are well selected.
      – what we should express with priority
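  26. 29 / 30 [Original]

    三十日午後二時十分ごろ、剣淵町の国道40号で、旭川市東旭川町下兵村二二八、農業南部正さんの乗用車と、旭川市流通団地二条二ノ四三、運転手原政運さんのトラックが正面衝突した。乗用車の四人のうち、南部さんと妻の喜美子さん、士別市東山町三〇二、無職池沢一郎さんの三人が頭を打つなどして死亡、旭川市東旭川北一条四ノ一ノ二八、無職真岩高子さんも左足の骨を折る重傷を負った。原さんにけがはなかった。
    (At about 2:10 p.m. on the 30th, on National Route 40 in Kenbuchi, a passenger car driven by 南部正, a farmer from Asahikawa, collided head-on with a truck driven by 原政運, a driver from Asahikawa. Of the four people in the car, three died from head injuries and other causes: 南部, his wife 喜美子, and 池沢一郎, unemployed, from Shibetsu. 真岩高子, unemployed, from Asahikawa, was also seriously injured with a broken left leg. 原 was not injured.)

    [Reference]
    イラク中部で28日深夜、油送管が爆発し74人が死亡
    (Late at night on the 28th in central Iraq, an oil pipeline exploded and 74 people died.)

    [Summary]
    剣淵町の国道40号で三十日午後二時十分ごろ、旭川市流通団地二条二ノ四三、運転手原政運さんのトラックが正面衝突し南部さんと妻の喜美子さん、士別市東山町三〇二、無職池沢一郎さんの三人が死亡
    (On National Route 40 in Kenbuchi at about 2:10 p.m. on the 30th, a truck driven by 原政運, a driver from Asahikawa, collided head-on, and three people died: 南部, his wife 喜美子, and 池沢一郎, unemployed, from Shibetsu.)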