Natural Language Processing (10) Summarization

1 / 30 Natural Language Processing (10) Summarization Kazuhide Yamamoto
Dept. of Electrical Engineering Nagaoka University of Technology

2 / 30 Two kinds of summaries
Informative summary
• made as a substitute for the original;
• information is preserved as much as possible.
Indicative summary
• made to help judge whether we should read the original;
• less important information is dropped.
The summary to generate depends on which kind we need.

3 / 30 Summarization method
• abstraction / generalization of words, concepts, etc.
• paraphrasing
• selection of important parts of the input (in particular, sentence selection)
Most proposed summarization methods work by selecting sentences.

4 / 30 Sentence selection (1) word frequencies
• long history, dating back to 1958
• criterion: words that appear frequently are important
• the sum of TF*IDF over the words of a sentence is taken as the importance of the sentence (1996)
  – similar to human selection results
  – better than head selection in newspaper articles.

5 / 30 TF*IDF
TF(w): term frequency
• frequency of the target word w in the document under consideration
IDF(w): inverse document frequency
• IDF(w) = log(N / DF(w)) + 1
• N: number of all documents
• DF(w): number of documents that include w
TF*IDF measures how strongly the target word is concentrated in a particular document. To compute TF*IDF, we need to define what a "document" is in advance.
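As a minimal sketch of how the sentence importance from the previous slide could be computed, the Python below sums TF*IDF over the words of each sentence, using IDF(w) = log(N / DF(w)) + 1. The function names, the toy data, and the choice to count TF over the document being summarized are assumptions made for illustration, not something the slides specify.

```python
import math
from collections import Counter

def idf(word, documents):
    """IDF(w) = log(N / DF(w)) + 1; returns 0.0 for a word seen in no document."""
    n = len(documents)
    df = sum(1 for doc in documents if word in doc)
    return math.log(n / df) + 1 if df else 0.0

def sentence_scores(target, documents):
    """Score each sentence of `target` by the sum of TF*IDF over its words.

    `target` is a list of sentences, each a list of words.
    `documents` is a list of word sets, one per document (target included).
    """
    # Assumption: TF is counted over the whole target document.
    tf = Counter(word for sentence in target for word in sentence)
    return [sum(tf[w] * idf(w, documents) for w in sentence) for sentence in target]

# Toy data (hypothetical): one target document and two background documents.
target = [["tokyo", "quake", "hits"], ["the", "quake", "was", "strong"]]
documents = [
    {w for s in target for w in s},
    {"the", "market", "was", "calm"},
    {"the", "election", "was", "close"},
]
scores = sentence_scores(target, documents)
```

Words such as "the" that occur in every document get the minimum IDF of 1, so they contribute little compared with words concentrated in the target document.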

6 / 30 Sentence selection (2) clue words
Clue words are used as a key for deciding whether or not to select a sentence.
• "for example"
  – may be followed by examples, which are considered not important
• "therefore", "in summary", "consequently"
  – may be followed by conclusions, which are important.
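The clue-word idea could be sketched as a score adjustment added to whatever base score a sentence already has. The clue lists come from the slide; the function name, the plain substring matching, and the bonus and penalty values are assumptions chosen for illustration.

```python
BONUS_CLUES = ("therefore", "in summary", "consequently")  # signal conclusions
PENALTY_CLUES = ("for example",)                           # signal examples

def clue_adjustment(sentence, bonus=1.0, penalty=1.0):
    """Return a score adjustment for one sentence based on clue words.

    Matching is a plain case-insensitive substring test (an assumption;
    a real system would probably match on tokens).
    """
    text = sentence.lower()
    score = 0.0
    if any(clue in text for clue in BONUS_CLUES):
        score += bonus
    if any(clue in text for clue in PENALTY_CLUES):
        score -= penalty
    return score
```

A sentence without any clue word gets an adjustment of 0.0 and keeps its base score unchanged.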

7 / 30 Sentence selection (3) position
Sentence position can also be a key for sentence extraction when the given documents are newspaper articles or editorials.
• newspaper: the head is important.
• editorials: not only the head but also the end is important, since a conclusion may be written at the end.
We give each sentence an importance weight according to its position in the document.
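One possible way to turn sentence position into weights is shown below. The exponential decay, the parameter names, and the genre labels are assumptions for illustration; the slide only says that the head matters for newspapers and that both head and end matter for editorials.

```python
def position_weights(num_sentences, genre, head=1.0, tail=1.0, decay=0.5):
    """Return one weight per sentence position.

    Every genre favours the head of the document; for "editorial" the
    closing sentences receive a symmetric bonus as well.
    """
    weights = []
    for i in range(num_sentences):
        w = head * decay ** i  # leading sentences weigh most
        if genre == "editorial":
            w += tail * decay ** (num_sentences - 1 - i)  # closing sentences too
        weights.append(w)
    return weights
```

These weights can then be added to, or multiplied with, a content-based score such as the TF*IDF sum.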

8 / 30 Sentence selection (4) the title
The title of a document contains some important words, so it can be used as a clue. However, I personally think that using it is not fair.
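A title-based clue could be sketched as the share of title words that a sentence contains. The function name and the use of a simple set overlap are assumptions; the slide does not specify how the title would be used.

```python
def title_overlap(sentence_words, title_words):
    """Fraction of distinct title words that also occur in the sentence."""
    title = set(title_words)
    if not title:
        return 0.0
    return len(title & set(sentence_words)) / len(title)
```

In practice, function words would likely be removed from the title first so that only content words count.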

9 / 30 Sentence selection (5) discourse structure
We need to analyze the structure of the document if we are to summarize it in a real sense. However, this is not easy for the time being.

10 / 30 Sentence selection (6) cohesion
Cohesion is the degree of connectivity between words and phrases. If we measure the cohesion between two given sentences, we can see which sentences can be cut off.
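As a rough stand-in for cohesion, the sketch below measures word overlap (Jaccard similarity) between adjacent sentences. This is an assumed proxy, not the measure used in the lecture; a low value between two neighbours suggests a weak connection and therefore a possible cut point.

```python
def jaccard(a, b):
    """Jaccard similarity between two word collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def adjacent_cohesion(sentences):
    """Overlap between each sentence and the one that follows it."""
    return [jaccard(s, t) for s, t in zip(sentences, sentences[1:])]

# Toy example (hypothetical): the third sentence shares no words with the second.
sentences = [
    ["bank", "sells", "checks"],
    ["checks", "come", "in", "euros"],
    ["weather", "is", "fine"],
]
cohesion = adjacent_cohesion(sentences)
```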

11 / 30 Problems in (current) sentence selection
• less coherence throughout the text
• no reference resolution is conducted.
• the longer, the better.
  – since scores are accumulated over the words of a sentence, longer sentences tend to score higher.
• chunk: is the sentence the best chunk for selection?
• duplication: if sentence A is important, a sentence A' that is close to A will also be scored as important (and may be selected into the summary as well).
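Two of the problems above, the length bias and duplication, can be illustrated with a small selection routine. The sketch below is one possible mitigation under stated assumptions (division by sentence length, a Jaccard overlap threshold of 0.5, greedy selection); none of these choices come from the slides.

```python
def select(sentences, scores, k, max_overlap=0.5):
    """Pick up to k sentence indices, in document order.

    `sentences` is a list of word lists and `scores` the raw additive scores.
    """
    def overlap(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    # Normalise by length so that long sentences are not favoured automatically.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: scores[i] / max(len(sentences[i]), 1),
        reverse=True,
    )
    chosen = []
    for i in ranked:
        if len(chosen) == k:
            break
        # Skip a sentence that is too close to one already selected.
        if all(overlap(sentences[i], sentences[j]) <= max_overlap for j in chosen):
            chosen.append(i)
    return sorted(chosen)
```

With raw scores the longer near-duplicate would win; after normalization and the overlap check, only one of the two similar sentences is kept.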

12 / 30 Summarization by abstraction
• Use of a thesaurus
  – I bought an apple, an orange, and a grape.
  – I bought some fruits.
• Use of a dictionary
  – I give him a good reason and cause him to stop hiking.
  – I persuaded him to stop hiking.
Both attempts are still experimental. The difficulty lies in constructing language resources such as thesauri and dictionaries.
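The thesaurus example can be sketched as a lookup of a shared broader term. The tiny `THESAURUS` table below is a hypothetical toy resource; building a real one is exactly the difficulty the slide points out.

```python
# Toy thesaurus (hypothetical): maps a word to its broader term.
THESAURUS = {"apple": "fruit", "orange": "fruit", "grape": "fruit"}

def generalize(items):
    """Return the broader term shared by all items, or None if there is none."""
    hypernyms = {THESAURUS.get(item) for item in items}
    if len(hypernyms) == 1 and None not in hypernyms:
        return hypernyms.pop()
    return None
```

When `generalize` returns a term, the enumeration "an apple, an orange, and a grape" can be replaced by that single term; when it returns None, the list is left as it is.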

13 / 30 Multiple document summarization
For about the past 10 years, multiple document summarization has been attempted in order to meet the following demands:
• One may want to get an overview of an accident or an event, such as an earthquake.
• One may want to pick out the core description from many articles.
• One may want to read about the same event from a different point of view.
• One may want to remove duplication.

14 / 30 Evaluation of Summarization
How should we evaluate summaries?
• (automatic) comparison with a human-written summary
• ranking by humans
• task-based evaluation: reading comprehension, etc.
Evaluation criteria
• readability: how natural the summary is.
• coverage of important words: to what degree important words are included.
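Automatic comparison with a human-written summary could be sketched as word-level recall: the share of reference words that the system summary also contains. The function name and the choice of unigram recall are assumptions for illustration; the slide does not name a specific metric.

```python
from collections import Counter

def unigram_recall(candidate, reference):
    """Share of reference words (with multiplicity) found in the candidate."""
    cand, ref = Counter(candidate), Counter(reference)
    if not ref:
        return 0.0
    matched = sum(min(count, cand[w]) for w, count in ref.items())
    return matched / sum(ref.values())

# Toy example (hypothetical sentences).
reference = "the bank starts selling euro checks".split()
candidate = "the bank sells euro checks".split()
recall = unigram_recall(candidate, reference)
```

A score of this kind only reflects how many important words are covered; it says nothing about readability, which still needs human judgment.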

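15 / 30 Sentence contraction
• Sentence contraction shortens a sentence, not a document.
• It is applied to automatic narration generation (of one's speech).
• It can also be used for newswire delivered to mobile phones.
  – 「首相がスキャンダルの責任を取って辞意。来月中に解散総選挙へ」 ("Prime minister intends to resign, taking responsibility for scandal. Dissolution and general election within next month.")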

16 / 30 “Shinkansen summary” generation (1) shortening expressions
山本和英, 池田諭史, 大橋一輝. 「新幹線要約」のための文末の整形 [Shaping sentence endings for "Shinkansen summaries"]. 自然言語処理 (Journal of Natural Language Processing), Vol.12, No.6, pp.85-111, 言語処理学会 (The Association for Natural Language Processing) (2005.11)
Satoshi Ikeda and Kazuhide Yamamoto. Transforming a Sentence End into News Headline Style. Proceedings of The Third International Workshop on Paraphrasing (IWP2005), pp.41-48 (2005.10)

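17 / 30 What is a Shinkansen summary
• Sogo has finalized a personnel plan to bring in two new representative directors from Seibu Department Stores. Appointed as managers of flagship stores to bring in know-how; turnaround to be hurried.
• The Cabinet approved the government economic outlook, which projects real economic growth of 1.7% for fiscal 2001. Public investment assumed to fall 3.2% in real terms.
• Preliminary US GDP growth for October-December 2000 comes to 1.4%. Consumer spending and other items fall, confirming a sudden braking of the US economy.
• Three major steelmakers will introduce an extended-employment system ahead of schedule, from fiscal 2001. Responding to changes in the public pension, in step with major automakers and electronics makers.
• The Financial Services Agency is to request that major banks disclose their bond and foreign-exchange operations every quarter. The aim is to strengthen monitoring of management by the market.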

18 / 30 Shinkansen summary
• Short news messages (in Japanese) displayed on the Shinkansen.
• Each article is written in 60 characters.
• The same messages were available through an e-mail service.
  – 3 times a day, 5 days a week.

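19 / 30 Shinkansen summary: characteristics
• Very short and simple.
• Most of them are one- or two-sentence summaries.
• Omission of the expression at the end of a sentence.
  – 「... 実質３．２％減と想定」 ("... assumed to fall 3.2% in real terms"; the sentence stops at the noun 想定 and the closing verb is omitted)
• A particle (carrying a special meaning) at the end of a sentence.
  – 「... 四半期ごとに開示するよう要請へ」 ("... to request quarterly disclosure"; the sentence ends with the particle へ)
• Many Chinese-derived words.
  – 「決める」→「決定」 (both "decide"), 「選ぶ」→「選出」 (both "select")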

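20 / 30 Sentence compression (1) deletion and paraphrasing
• Expressions at the end of a sentence
  – the assertive 「だ」, the polite 「です」/「ます」, 「...てしまう」 ("end up ...ing")
• Particles
  – 「遺体を発見」→「遺体発見」 ("body found"; the object particle を is dropped)
• Functional words
  – 「協議する意向を示す」 ("indicate an intention to hold talks")
  – 「開催することで合意」 ("agree to hold ...")
• Paraphrasing into shorter words
  – 「...が見つかった」→「...を発見」 ("... was found" → "found ...")
  – 「...を調べている」→「...を調査中」 ("is investigating ..." → "... under investigation")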

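21 / 30 Sentence compression (2) particle at end
• implying presumption
  – 「自首したとみられる」→「自首か」 ("is believed to have surrendered to police" → "surrendered?")
• implying expected events
  – 「来年から実施する予定だ」→「...実施へ」 ("is scheduled to be implemented from next year" → "to be implemented")
• deletion of functional verbs
  – 「ぎりぎりの選択となった」→「...選択に」 ("it came down to a last-minute choice" → "a last-minute choice")
  – 「辞任することを明らかにした」→「辞任を明らかに」 ("made it clear that he would resign" → "resignation made clear")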
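The sentence-end rewrites on this slide can be sketched as a list of suffix rules. The rule pairs are taken directly from the examples above; the rule table name, the function name, and the first-match strategy are assumptions, and this is not the method of the cited papers.

```python
# (sentence-final expression, headline-style replacement), from the slide examples.
END_RULES = [
    ("したとみられる", "か"),                  # presumption
    ("する予定だ", "へ"),                      # expected event
    ("となった", "に"),                        # functional verb dropped
    ("することを明らかにした", "を明らかに"),  # functional verb dropped
]

def compress_ending(sentence):
    """Apply the first matching rule to the end of the sentence.

    A trailing 「。」 is removed; a sentence that matches no rule is
    returned without its final 「。」 but otherwise unchanged.
    """
    body = sentence.rstrip("。")
    for old, new in END_RULES:
        if body.endswith(old):
            return body[: -len(old)] + new
    return body
```

For instance, `compress_ending("自首したとみられる。")` yields 「自首か」, matching the first example on the slide.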

22 / 30 “Shinkansen summary” generation (2) identifying important expressions
山本和英, 牧野恵. 要約事例を用例として模倣利用したニュース記事要約 [News article summarization that imitates summary instances as examples]. 自然言語処理 (Journal of Natural Language Processing), Vol.15, No.3, pp.115-158, 言語処理学会 (The Association for Natural Language Processing) (2008.7)
Megumi Makino and Kazuhide Yamamoto. Summarization by Analogy: An Example-based Approach for News Articles. Proceedings of The Third International Joint Conference on Natural Language Processing (IJCNLP2008), pp.739-744 (2008.1)

23 / 30 What we learn from the summary
• Expressions are well refined (or compressed).
  – how to express things simply
• Expressions are well selected.
  – what we should express with priority

24 / 30 [ Original ] With European monetary union starting next January, Juhachi Bank (十八銀行) will begin handling euro-denominated traveler's checks on January 4. Three denominations will be handled, 50, 100, and 200 euros, and for the time being the head office business department will serve as the sales counter. The number of branches handling them will be increased depending on sales.

25 / 30 [ Reference ] Fujitsu will begin a mobile communications business.

26 / 30 [ Summary ] Juhachi Bank will begin handling euro-denominated traveler's checks.

27 / 30 [ Original ] At about 2:10 p.m. on the 30th, on National Route 40 in Kenbuchi (剣淵町), the passenger car of 南部正, a farmer, of 旭川市東旭川町下兵村二二八, collided head-on with the truck of 原政運, a driver, of 旭川市流通団地二条二ノ四三. Of the four people in the car, three died of head injuries and other causes: 南部, his wife 喜美子, and 池沢一郎, unemployed, of 士別市東山町三〇二. 真岩高子, unemployed, of 旭川市東旭川北一条四ノ一ノ二八, was also seriously injured with a broken left leg. 原 was not injured.

28 / 30 [ Reference ] Late at night on the 28th in central Iraq, an oil pipeline exploded and 74 people died.

29 / 30 [ Summary ] On National Route 40 in Kenbuchi (剣淵町) at about 2:10 p.m. on the 30th, the truck of 原政運, a driver, of 旭川市流通団地二条二ノ四三, was in a head-on collision, and three people died: 南部, his wife 喜美子, and 池沢一郎, unemployed, of 士別市東山町三〇二.

30 / 30 Summary: today's key words
• summarization
• multiple document summarization
