Natural Language Processing (8) Language resources

1 / 28 Natural Language Processing (8) Language resources
Kazuhide Yamamoto
Dept. of Electrical Engineering, Nagaoka University of Technology

2 / 28 Outline
Language resources include:
• dictionary (cf. semantics)
• rewriting rules (cf. parsing)
• corpus
• thesaurus
Current natural language processing relies on language resources to some extent. Moreover, the performance of a system depends heavily on the quality and quantity of the language resources it uses. To improve performance, we need to build "better" resources ourselves.

3 / 28 Dictionary

4 / 28 Dictionary
What is the difference between a word dictionary for humans and one for computers?
• A human dictionary is used mainly to look up definitions.
• A computer dictionary is used to distinguish word senses as well as to obtain usage information.
Hence we do not use a human dictionary directly for computer processing.

5 / 28 Dictionary: requirements for language processing
Knowledge of language
• syntax: inflection, part of speech, etc.
• semantics: definitions, etc.
Knowledge of situation
• what has been said so far, where, and when
Knowledge of the world (= common sense)
• what we all know; e.g., an umbrella is used on a rainy day, a teacher is at school to teach something, a red traffic light means stop.
Technical knowledge
• knowledge shared within a specialized community, such as electronics, law, or cooking.

6 / 28 Dictionary: entry unit problem
• compound noun/verb
  – 「通りすぎる」 (to pass by) vs. 「通り」「すぎる」
  – swimming pool, hair cut, mother-in-law
• proper noun
  – 「モーニング娘。」 (Morning Musume, a pop group) vs. 「モーニング」「娘」「。」
  – War & Peace, the Industrial Revolution
• 「したそうだ」 vs. 「した」「そうだ」
  – 伝聞 (hearsay)
  – 願望 (desire) + 推量 (guess)
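The effect of the entry unit on segmentation can be sketched with a greedy longest-match segmenter. This is only an illustration: the two tiny lexicons and the function name `longest_match` are assumptions made for this example, and real morphological analyzers use more elaborate methods.

```python
def longest_match(text, lexicon):
    """Greedy left-to-right segmentation.

    At each position, take the longest lexicon entry that matches;
    fall back to a single character when nothing matches.
    """
    tokens = []
    max_len = max(len(entry) for entry in lexicon)
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon or length == 1:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical lexicons: one without and one with the longer entries.
small = {"通り", "すぎる", "モーニング", "娘", "。"}
large = small | {"通りすぎる", "モーニング娘。"}

print(longest_match("通りすぎる", small))      # ['通り', 'すぎる']
print(longest_match("通りすぎる", large))      # ['通りすぎる']
print(longest_match("モーニング娘。", small))  # ['モーニング', '娘', '。']
print(longest_match("モーニング娘。", large))  # ['モーニング娘。']
```

The same input is split differently depending only on which units the dictionary registers, which is the problem this slide points at.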

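7 / 28 What should be one word?
Expressions that are not synthetic (the whole cannot simply be composed from its parts)
• 走り書き (scribble), 生年月日 (date of birth), 株式会社 (joint-stock company)
Expressions that play important roles in analysis
• によって (by, according to), はずがない (cannot be)
Abbreviations
• 冷暖房 (heating and cooling), 松竹梅 (pine, bamboo, and plum)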

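8 / 28 Variation of description
• orthographic variation
  – 林檎・りんご・リンゴ (apple), %・パーセント (percent)
• difference within the same orthography
  – 付属・附属 (attached), バイオリン・ヴァイオリン (violin)
• mixture of Kanji and Kana
  – 洗たく (laundry), けい光灯 (fluorescent lamp), 消火せん (fire hydrant), き裂 (crack), は握 (grasp)
• inflection (okurigana variants)
  – 行う・行なう (to carry out), 受付・受付け・受け付け (reception)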
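One simple way to cope with such variation is to map every variant to a single canonical form before dictionary lookup. The sketch below assumes a hand-written table; which spelling is treated as canonical is an arbitrary choice made for this example, not something the slides specify.

```python
# Variant -> canonical form. The canonical choices are assumptions.
VARIANTS = {
    "林檎": "りんご",
    "リンゴ": "りんご",
    "パーセント": "%",
    "附属": "付属",
    "ヴァイオリン": "バイオリン",
    "行なう": "行う",
    "受付け": "受付",
    "受け付け": "受付",
}


def normalize(word):
    """Return the canonical spelling, or the word itself if no variant is known."""
    return VARIANTS.get(word, word)


print(normalize("リンゴ"))        # りんご
print(normalize("ヴァイオリン"))  # バイオリン
print(normalize("りんご"))        # りんご (already canonical)
```

With this table, 林檎, りんご, and リンゴ all reach the same dictionary entry.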

9 / 28 Conjugation problem
• Usually only the basic form is listed.
• Conjugation is described by rules. However, not every conjugated form is actually possible.
• ゆく (to go): the -た form changes; we say いった instead of ゆった.
• ある (to exist): no negative form; we never say あらない.
• かもしれない (may, might): only the negative form; we never say かもしれる.
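A common way to combine rules with such gaps is to consult an exception table before applying the general rule. The sketch below is a strong simplification: the regular rule covers only verbs ending in く (such as 書く), and the form names `"ta"`, `"negative"`, and `"affirmative"` are labels invented for this example.

```python
def ta_form_regular(verb):
    """Simplified regular rule: for a verb ending in く, replace く with いた."""
    if not verb.endswith("く"):
        raise ValueError("this sketch only covers verbs ending in く")
    return verb[:-1] + "いた"


# (base form, form name) -> surface form. None marks a form that does not exist.
EXCEPTIONS = {
    ("ゆく", "ta"): "いった",
    ("ある", "negative"): None,             # we never say あらない
    ("かもしれない", "affirmative"): None,  # we never say かもしれる
}


def conjugate(verb, form):
    """Look up exceptions first, then fall back to the regular rule."""
    key = (verb, form)
    if key in EXCEPTIONS:
        return EXCEPTIONS[key]
    if form == "ta":
        return ta_form_regular(verb)
    raise ValueError(f"no rule for form {form!r} in this sketch")


print(conjugate("書く", "ta"))        # 書いた
print(conjugate("ゆく", "ta"))        # いった
print(conjugate("ある", "negative"))  # None
```

Storing `None` lets the dictionary state explicitly that a form is missing, instead of letting a rule generate a form that nobody says.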

10 / 28 "Number" problem
• Numbers are not registered in the dictionary since there are infinitely many of them.
• However, some words whose written form contains a numeral are recorded in the dictionary.
  – 一人、五分: different pronunciation
  – 八戸、四郎丸、六本木: place name
  – 一郎、百合子: person name
  – 一杯、十分: adverb
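One possible lookup order is to check the dictionary first and to treat a token as a plain number only when no entry matches. The sketch below assumes a tiny lexicon and category labels taken from this slide; the function name `classify` and the numeral pattern are illustrative assumptions.

```python
import re

# A few registered words that contain numerals (labels follow the slide).
LEXICON = {
    "一人": "different pronunciation",
    "六本木": "place name",
    "一郎": "person name",
    "十分": "adverb",
}

# Kanji numerals; tokens made only of these are treated as numbers.
KANJI_NUMERAL = re.compile(r"[〇一二三四五六七八九十百千万億兆]+")


def classify(token):
    """Dictionary entries take priority over the generic number pattern."""
    if token in LEXICON:
        return LEXICON[token]
    if KANJI_NUMERAL.fullmatch(token) or token.isdigit():
        return "number"
    return "unknown"


print(classify("六本木"))  # place name
print(classify("三百"))    # number
print(classify("42"))      # number
print(classify("十分"))    # adverb
```

Note that 十分 consists only of numeral-compatible characters in part, and a word such as 十 alone would still be classified as a number; the dictionary check comes first precisely so that registered words are not swallowed by the number pattern.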

11 / 28 Corpus

12 / 28 Corpus, pl. corpora
• A large collection of text
• Originally used in linguistics to study word usage (corpus linguistics).
• Recently, corpora have become widely available and are widely used for language processing.

13 / 28 Corpus: three kinds
• text corpus (= raw corpus)
  – No information is tagged.
• tagged corpus
  – Morphological information is tagged.
• bracketed corpus, treebank
  – Syntactic information is tagged as well.

14 / 28 Other corpora
Spoken language corpora
• collected as paired speech data and language data
• mainly used for speech recognition
Parallel corpora
• sentences in two or more languages that are aligned with each other
• mainly used for machine translation and cross-language information retrieval

15 / 28 Newspaper corpus 郵政省は２０００年度から、全国の小中学校、高校６００校を対象にインターネットを利用した動画情報の配信実験を開始する。ネットの利用方法が文字情報の閲覧や電子メールのやりとりにとどまらず、配信される動画像を楽しむ方向に変化していることを受け、文部省と協力して教育現場でも遠隔授業など、動画利用のネット教材の充実を図ることにした。
１９９９年の原油の平均価格は、前年に比べ３５％上昇したことが３１日までに、明らかになった。石油輸出国機構（ＯＰＥＣ）諸国などによる原油の協調減産が影響した。ＯＰＥＣは現在の減産期限である今年３月末以降も減産を維持する構えで、２０００年も原油価格が高水準のまま推移すれば、世界的なインフレの原因になる可能性もある。年明け交渉、最有力候補に。

16 / 28 Minutes of Diet 国務大臣の演説 ◦議長（綿貫民輔君）　内閣総理大臣から施政方針に関する演説、外務大臣から外交に関する演説、財務大臣から財政に関する演説、竹中国務大臣から経済に関する演説のため、発言を求められております。順次これを許します。内閣総理大臣小泉純一郎君。
　〔内閣総理大臣小泉純一郎君登壇〕 ◦内閣総理大臣（小泉純一郎君）　天皇陛下におかれましては、御病気御療養中であります。陛下の一日も早い御快癒を、国民とともに心からお祈り申し上げます。（拍手）　内閣総理大臣として、今、私に与えられた職責は、我が国の経済と社会の再生です。小泉内閣として、「聖域なき構造改革」を推進するとの考えのもと、今後の国政に当たる基本方針を申し述べ、国民の皆様の御理解と御協力を得たいと思います。　日本経済は、世界的規模での社会経済変動の中、単なる景気循環ではなく、複合的な構造要因による停滞に直面しています。不良債権や財政赤字など負の遺産を抱え、戦後経験したことのないデフレ状態が継続し、経済活動と国民生活に大きな影響を与えています。...

17 / 28 Spoken language corpus
0001 00000.218-00003.351 L:
日本女子大学の & ニホンジョシダイガクノ
(R ××××)です & (R ××××××)デス
よろしく & ヨロシク
お願いいたします & オネガイイタシマス
0002 00004.158-00004.530 L:
(F えっと) & (F エット)
0003 00004.762-00007.091 L:
(F んえー) & (F ンエー)
まず & マズ
初めに & ハジメニ
ちょっと & チョット
訂正なんですが & テーセーナンデスガ
0004 00007.516-00009.501 L:
(F えーっと) & (F エーット)
予稿集の & ヨコーシューノ
(F えーっと) & (F エーット)
0005 00009.201-00009.539 L:<雑音>
0006 00011.019-00012.161 L:
九十一ページ & キュージュイチページ

18 / 28 Travel conversation corpus
# S-ID:11
* 1D <文頭><述並終点><助詞><体言><係:ノ格><区切:0-4>
そちら そちら そちら 指示詞 7 名詞形態指示詞 1 * 0 * 0 NIL <文頭><自立>
の の の 助詞 9 接続助詞 3 * 0 * 0 NIL <付属>
* 3D <外の関係><ガ><助詞><体言><係:ガ格><区切:0-0>
方 ほう 方 名詞 6 副詞的名詞 9 * 0 * 0 NIL <漢字><自立>
が が が 助詞 9 格助詞 1 * 0 * 0 NIL <付属>
* 3D <副詞><用言:弱><係:連用><区切:0-4>
ゆっくり ゆっくり ゆっくり 副詞 8 * 0 * 0 * 0 NIL <自立>
* 4D <ト><助詞><用言:強:動><係:ト格><レベル:C><区切:3-5><ID:〜と（引用）><提題受:15>
くつろいで くつろいで くつろぐ 動詞 2 * 0 子音動詞ガ行 4 タ系連用テ形 11 NIL <自立>
いただける いただける いただける 動詞 2 * 0 母音動詞 1 基本形 2 NIL <付属動詞候補タ系><付属>
と と と 助詞 9 格助詞 1 * 0 * 0 NIL <付属>
* -1D <文末><句点><助詞><用言:強:動><レベル:C><区切:5-5><ID:（文末）><提題受:10>
思い おもい 思う 動詞 2 * 0 子音動詞ワ行 12 基本連用形 7 NIL <自立>
ます ます ます 接尾辞 14 動詞性接尾辞 7 動詞性接尾辞ます型 31 基本形 2 NIL <付属>
し し し 助詞 9 接続助詞 3 * 0 * 0 NIL <付属>
。 。 。 特殊 1 句点 1 * 0 * 0 NIL <文末><付属>
EOS
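The dependency structure in this sample can be read off the lines that begin with `* `. The sketch below assumes one record per line (as in the original tool output) and assumes the usual reading of a field such as `3D`: the phrase depends on phrase number 3 with dependency type D, and `-1` marks the root. The feature tags in the sample lines are abbreviated, and the function name `read_dependencies` is made up for this example.

```python
def read_dependencies(lines):
    """Return (phrase_index, head_index, dependency_type) for each phrase line."""
    deps = []
    for line in lines:
        if line.startswith("* "):
            head_field = line.split()[1]  # e.g. "3D" or "-1D"
            deps.append((len(deps), int(head_field[:-1]), head_field[-1]))
    return deps


# Abbreviated version of the sample above (most feature tags omitted).
sample = [
    "# S-ID:11",
    "* 1D <文頭>",
    "そちら そちら そちら 指示詞 7 名詞形態指示詞 1 * 0 * 0 NIL",
    "* 3D <ガ>",
    "方 ほう 方 名詞 6 副詞的名詞 9 * 0 * 0 NIL",
    "* 3D <副詞>",
    "ゆっくり ゆっくり ゆっくり 副詞 8 * 0 * 0 * 0 NIL",
    "* 4D <ト>",
    "くつろいで くつろいで くつろぐ 動詞 2 * 0 子音動詞ガ行 4 タ系連用テ形 11 NIL",
    "* -1D <文末>",
    "思い おもい 思う 動詞 2 * 0 子音動詞ワ行 12 基本連用形 7 NIL",
    "EOS",
]

for index, head, dep_type in read_dependencies(sample):
    print(index, "->", head, dep_type)
```

A morpheme whose surface form is itself `*` would be misread as a phrase line by this sketch; a robust reader would need to handle that case.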

19 / 28 Treebank
( (CODE SpeakerB3 .))
( (SBARQ (INTJ Well) (WHNP-1 what) (SQ do (NP-SBJ you) (VP think (NP *T*-1) (PP about (NP (NP the idea) (PP of, (INTJ uh) , (S-NOM (NP-SBJ-2 kids) (VP having (S (NP-SBJ *-2) (VP to (VP do (NP public service work)))) (PP-TMP for (NP a year))))))))) ? E_S))
( (SQ Do (NP-SBJ you) (VP think (SBAR 0 (S (NP-SBJ it) (VP 's (NP-PRD-UNF a))))) , N_S))
( (CODE SpeakerA4 .))
( (S (INTJ Well) , (EDITED (RM [) (NP-SBJ I) , (IP +)) (NP-SBJ I) (RS ]) (VP think (SBAR 0 (S (NP-SBJ it) (VP 's (NP-PRD a (ADJP pretty good) idea))))) . E_S))
http://www.cis.upenn.edu/~treebank/switch-samp-bkt.html
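Bracketed trees like these can be read with a small recursive parser. The sketch below is a minimal illustration, not the official treebank tooling: it assumes every opening bracket is followed by a label, so the outer unlabeled wrapper `( ... )` around each tree in the sample would have to be stripped first. The function names are invented for this example, and the input is a fragment that appears in the sample above.

```python
def tokenize(text):
    """Split a bracketed string into '(', ')', and symbol tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens, pos=0):
    """Parse one constituent starting at tokens[pos].

    Returns (node, next_pos). A node is [label, child, ...]; a leaf is a str.
    """
    assert tokens[pos] == "("
    node = [tokens[pos + 1]]  # the constituent label
    pos += 2
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
        else:
            child, pos = tokens[pos], pos + 1
        node.append(child)
    return node, pos + 1


def leaves(node):
    """Collect the words of a tree from left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


fragment = "(S (NP-SBJ it) (VP 's (NP-PRD a (ADJP pretty good) idea)))"
tree, _ = parse(tokenize(fragment))
print(tree[0])       # S
print(leaves(tree))  # ['it', "'s", 'a', 'pretty', 'good', 'idea']
```

Once the tree is a nested list, tasks such as extracting all noun phrases or counting grammar rules become simple traversals.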

20 / 28 Web corpus 全身にこだわりが息づくモバイルマシン！小さくてかわいいNEWバイオUもエントリースタート！予約商品は発売日にお届け！容易に予想できる結果であったが、調査結果から、特に自分が外国人に近いと判断する際に滞在期間が大きく影響していることが示唆されている。リモートメンテナンスホスト(tftpクライアント)が、133.176.200.42である。
ワールドパークス会員番号をご入力ください。そういうふうに具体的にやっていかないと、大ざっぱに十束一からげでやるとほとんど議論が進まないし、そんなふうな感じがします。これだけではなかったと思うのですけれどもね。僕はアパート住まいですが、仮想現実の世界では、ゆったりした所で暮らしたいと思い、塀のある家を建てました。根拠は見事になし。暫くして下女は細君に命ぜられて、二階に洋燈を点けに行つたが、下りて来る時、一通の手紙を持つて来て、時雄に渡した。アナログ回線で最大12Mbps（モア）、最大8Mbps（8Mプラン）、最大1.5Mbps（1.5Mプラン）の高速＆つなぎほうだいを実現するインターネット向けアクセスサービスです。工事料はかかりません。

21 / 28 Balanced corpus
• A balanced corpus is a collection of text that is intended to represent the language as a whole. A wide range of domains, topics, and materials is covered, and sentences are selected at random to avoid bias.
• Example: BCCWJ (Balanced Corpus of Contemporary Written Japanese)
  – released in 2011; 100M words
  – Publication subcorpus: 35M words
    • books, magazines, and newspapers published during 2001-2005
  – Library subcorpus: 30M words
    • books cataloged at more than 13 public libraries in the Tokyo area and published after 1985
  – Special purpose subcorpus: 35M words
    • government white papers, textbooks, laws, Internet text (Yahoo! Q&A), Diet minutes, best-selling books, etc.
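The idea of random selection within fixed proportions can be sketched as quota sampling per domain. This is not how BCCWJ was actually compiled; the pools below are dummy data, the function name `balanced_sample` is hypothetical, and the quotas merely mirror the 35 : 30 : 35 ratio of the three subcorpora.

```python
import random


def balanced_sample(sentences_by_domain, quotas, seed=0):
    """Draw quotas[domain] sentences at random from each domain's pool."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sample = []
    for domain, quota in quotas.items():
        pool = sentences_by_domain[domain]
        sample.extend((domain, s) for s in rng.sample(pool, quota))
    return sample


# Dummy pools standing in for real text.
pools = {
    "publication": [f"pub-{i}" for i in range(1000)],
    "library": [f"lib-{i}" for i in range(1000)],
    "special": [f"spc-{i}" for i in range(1000)],
}
quotas = {"publication": 35, "library": 30, "special": 35}

sample = balanced_sample(pools, quotas)
print(len(sample))  # 100
```

Fixing the quota per domain keeps one large source, for example the Web, from dominating the sample.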

22 / 28 When is a corpus used?
• term collection
  – new words for a machine translation dictionary
• statistics
  – n-gram statistics and collocation statistics
• instance collection
  – case frames and example-based machine translation
• knowledge construction
  – thesaurus construction and grammar rule construction
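As an illustration of the statistics use case, n-gram counts can be collected from a tokenized corpus with a counter. The two-sentence corpus below is a toy example invented for this sketch, and the probability shown is a plain maximum-likelihood estimate without smoothing.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Toy corpus (hypothetical); a real corpus would have millions of sentences.
corpus = [
    ["I", "think", "it", "is", "a", "good", "idea"],
    ["I", "think", "it", "is", "late"],
]

unigrams = Counter(word for sent in corpus for word in sent)
bigrams = Counter(bg for sent in corpus for bg in ngrams(sent, 2))


def bigram_probability(prev, word):
    """Estimate P(word | prev) as count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]


print(bigrams[("think", "it")])       # 2
print(bigram_probability("is", "a"))  # 0.5
```

`bigram_probability` raises `ZeroDivisionError` for a word that never occurs in the corpus, which is one reason why the quantity of text matters.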

23 / 28 Corpus: problems
Copyright
• It is not easy to obtain newspaper corpora and the like, even when they are used only for statistical processing.
Quality
• It is hard to maintain the quality of tags.
Quantity
• It is hard to obtain a huge amount of data.

24 / 28 Thesaurus

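25 / 28 Thesaurus, pl. thesauri
A thesaurus is a hierarchical classification of words.
(Figure: example thesaurus tree. Node labels include 社会 (society), 施設 (facility), 公共施設 (public facility), 公民館 (community center), 図書館 (library), 社寺 (shrines and temples), 稲荷 (Inari shrine), 集団 (group), 軍隊 (military), 陸軍 (army), 家庭 (home), 所帯 (household), 自然 (nature), 物品 (goods), 人物 (person).)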

26 / 28 How do we use a thesaurus?
Similarity calculation
• to measure how similar two given words, A and B, are.
Generalization / normalization
• to express two words, A and B, with a single word; e.g., apple and orange can both be rephrased as fruit.
A thesaurus is the only language resource we can use for lexical semantic analysis.
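Both uses can be sketched on a toy hierarchy stored as child-to-parent links. The hierarchy below is hypothetical apart from the apple/orange/fruit example, and the similarity score is a Wu-Palmer-style depth ratio, which is one common choice and not necessarily the measure intended in the lecture.

```python
# Toy thesaurus: each word points to its parent class; the root has None.
PARENT = {
    "apple": "fruit",
    "orange": "fruit",
    "carrot": "vegetable",
    "fruit": "food",
    "vegetable": "food",
    "food": None,
}


def path_to_root(word):
    """Return [word, parent, grandparent, ..., root]."""
    path = []
    while word is not None:
        path.append(word)
        word = PARENT[word]
    return path


def common_ancestor(a, b):
    """Generalization: the most specific class that covers both words."""
    ancestors_a = path_to_root(a)
    for node in path_to_root(b):
        if node in ancestors_a:
            return node
    return None


def depth(word):
    """Depth in the hierarchy, counting the root as depth 1."""
    return len(path_to_root(word))


def similarity(a, b):
    """Wu-Palmer-style score: 2 * depth(ancestor) / (depth(a) + depth(b))."""
    ancestor = common_ancestor(a, b)
    if ancestor is None:
        return 0.0
    return 2 * depth(ancestor) / (depth(a) + depth(b))


print(common_ancestor("apple", "orange"))  # fruit
print(common_ancestor("apple", "carrot"))  # food
print(similarity("apple", "orange") > similarity("apple", "carrot"))  # True
```

Words that share a deeper common class receive a higher score, so apple is judged closer to orange than to carrot.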

27 / 28 Thesaurus: problems
Construction cost
• takes much time and expense
• the work is not easy to share
Definition
• A thesaurus represents only one classification of entities; many others are possible.
Association
• A thesaurus cannot represent associative relations; e.g., summer and watermelon, hair and shampoo.

28 / 28 Summary: today's key words
• dictionary
• corpus
• thesaurus
