
Natural Language Processing Laboratory: Research Overview (2015)


Natural Language Processing Laboratory

February 28, 2017


Transcript

  1. Nagaoka University of Technology
    Natural Language Processing Laboratory
    Research Overview (2015)


  2. Construction of Datasets for Evaluating Japanese Lexical Simplification
    http://www.jnlp.org/resources

    [Background] Lexical simplification is a technique that replaces difficult words in a sentence with simpler synonyms. It supports the reading comprehension of a wide range of readers, including children and language learners.

    [Contribution] We provide an automatic-evaluation framework that computes the precision, recall, and F-measure of lexical paraphrasing and lexical simplification algorithms → the research cycle can turn faster!

    Scale: the slide tabulates, for English version 1, English version 2, and the Japanese version, the total number of sentences and the counts of nouns, verbs, adjectives, and adverbs.

    Context dependence of the datasets (English version 1 vs. the Japanese version):
    ① pairs in which the target word appears in the same context
    ② pairs in ① whose paraphrase lists are identical
    ③ pairs in ② whose difficulty rankings differ
    ④ pairs in ③ whose simplest word differs

    Target-word selection: content words (nouns, verbs, adjectives, adverbs) in both the IPA dictionary and the JUMAN dictionary, after
    • removing simple words (those in the Basic Vocabulary for Learning, the comprehension vocabulary of elementary-school students),
    • removing words without paraphrases (only words in the content-word paraphrase dictionary are kept), and
    • removing low-frequency words (keeping words above a frequency threshold in several years' worth of newspaper articles).

    Lexical paraphrase dataset — format: ID, target word, paraphrase 1 with vote count; paraphrase 2 with vote count; …
    • For each target word, lexical paraphrases were collected in several contexts.
    • Five workers per item were recruited by crowdsourcing; each listed as many paraphrases of the target word in context as they could think of (consulting dictionaries allowed, leaving a blank allowed, asking other people not allowed).
    • Five newly recruited crowdsourced workers then judged the collected candidates; a paraphrase was adopted when three or more workers answered that it was an appropriate paraphrase.
    • Inter-worker agreement was sufficiently high.

    Lexical simplification dataset — format: ID, target word, {simplest word} … { } … {most difficult word}
    • Five workers per item, recruited by crowdsourcing, reordered the target word and its paraphrases from simplest to most difficult in context.
    • The data were merged by averaging the difficulty ranks; words with the same average rank are assigned the same rank.

    Example entries for the target word 悪気 "ill will":
    • …, 悪気, 親は悪気で言ったわけではなく、子供をあやすということを本当に知らなかった様子。(context sentence: "The parent did not say it out of ill will; they genuinely did not seem to know how to soothe a child.")
    • …, 悪気, 意地悪; 悪い考え; 悪意; … (paraphrases with votes)
    • …, 悪気, {意地悪, 悪意} {悪気} {悪い考え} (difficulty-ordered sets)
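The rank-merging step described above (average each word's rank across workers, then tie words with equal averages) can be sketched as follows. This is a minimal illustration, not the lab's actual code; the word lists are invented.

```python
from collections import defaultdict

def merge_rankings(rankings):
    """Merge several workers' difficulty rankings (simplest first) into one
    ranking by averaging each word's rank; words whose average ranks are
    equal end up in the same tier."""
    avg = {w: sum(r.index(w) + 1 for r in rankings) / len(rankings)
           for w in rankings[0]}
    # group words by their average rank, then order tiers from simplest up
    tiers = defaultdict(list)
    for w, a in avg.items():
        tiers[a].append(w)
    return [sorted(tiers[a]) for a in sorted(tiers)]

# three hypothetical workers rank the same four words
workers = [
    ["easy", "plain", "hard", "obscure"],
    ["plain", "easy", "hard", "obscure"],
    ["easy", "plain", "obscure", "hard"],
]
print(merge_rankings(workers))  # [['easy'], ['plain'], ['hard'], ['obscure']]
```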


  3. Comparing the Effectiveness of Part-of-Speech and Semantic-Type Information in Sentiment Analysis

    Overview: Predicates divide broadly into verbs and adjectives; conventionally, verbs express the motion of people and things, while adjectives express the features and properties of things. In practice, however, there are verbs such as 優れる "to excel" that can serve as descriptive expressions. To classify predicates semantically, earlier work in our laboratory proposed semantic types (action, change, sensation/emotion, description). This study uses sentiment analysis as an example to show the usefulness of classifying predicates by semantic type. Specifically, from the semantic type of the last predicate in a sentence, we perform binary classification of whether the sentence expresses a sentiment about something, and confirm that accuracy improves compared with classification using part-of-speech information.

    Examples of sentiment expressions that become extractable with semantic types:
    • ひと味違ったお店です "a restaurant with a distinctive flavor": POS = verb; semantic type = description
    • 悲しさを感じる "to feel sadness": POS = verb; semantic type = sensation/emotion
    → Information such as "a verb used descriptively" or "a verb expressing the work of the senses" becomes usable.

    POS vs. semantic type:
    • POS: a morphological classification; in principle unique per word; handles morphological features systematically (e.g., changes of inflected form).
    • Semantic type: a semantic classification (judged by humans); may differ by context even for the same word; handles word meaning more precisely.

    Application to sentiment analysis: both the accuracy of classifying sentences as sentiment vs. non-sentiment and the F-measure of sentiment-sentence extraction rose when semantic types were used instead of POS alone.
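The binary decision above (classify a sentence by the semantic type of its final predicate) can be sketched as follows. The tiny lexicon is hypothetical; the real semantic types are assigned by human judgment and may vary with context.

```python
# Hypothetical semantic-type lexicon (the actual resource is human-judged
# and context-dependent); types follow the slide: action, change,
# sensation/emotion, description.
SEMANTIC_TYPE = {
    "違っ": "description",        # 違う used descriptively
    "感じる": "sensation/emotion",
    "歩く": "action",
}
SENTIMENT_TYPES = {"description", "sensation/emotion"}

def is_sentiment(predicates):
    """Binary classification from the semantic type of the last predicate."""
    return SEMANTIC_TYPE.get(predicates[-1]) in SENTIMENT_TYPES

print(is_sentiment(["違っ"]))   # True
print(is_sentiment(["歩く"]))   # False
```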






  4. (figure-only slide: a parse-tree fragment with an IP-MAT matrix clause, an IP-REL relative clause, and the empty subjects *pro*(SBJ) and *T*(SBJ))

  5. What We Need Is the Word, Not the Morpheme:
    Constructing a Word Analyzer for Japanese
    • We constructed the word analyzer SNOWMAN.
    • SNOWMAN addresses two problems in Japanese:
    • Orthographic variants: same pronunciation and same sense, but different notations (e.g., りんご・リンゴ・林檎 "apple").
    • Multiword expressions: a word made up of two or more words (e.g., idioms, nominal verbs).
    • Constructing the SNOWMAN dictionary:
    • We gathered dictionaries related to the two problems.
    • We checked the entries by hand.
    • Processing pipeline: morphological analysis → merging orthographic variants → connecting multiple morphemes.
    • We show two comparisons between UniDic and SNOWMAN:
    • Merging orthographic variants: synonyms in UniDic are not always orthographic variants, and UniDic entries are merged while ignoring sense distinctions.
    • Coverage by frequency: the difference between UniDic and SNOWMAN is 0.7 points; UniDic cannot recognize the variants as the same entry.
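The morpheme-connection step above (joining the pieces of a multiword expression into one word) can be sketched as a greedy rule over adjacent morphemes. The rule table is a toy stand-in for SNOWMAN's hand-checked dictionary.

```python
# Toy multiword-expression table; SNOWMAN's real dictionary is built from
# gathered resources and checked by hand. Here a nominal verb is joined.
MWE = {("勉強", "する"): "勉強する"}

def connect(morphemes):
    """Greedily join adjacent morpheme pairs listed in the MWE table."""
    out, i = [], 0
    while i < len(morphemes):
        pair = tuple(morphemes[i:i + 2])
        if pair in MWE:
            out.append(MWE[pair])
            i += 2
        else:
            out.append(morphemes[i])
            i += 1
    return out

print(connect(["私", "は", "勉強", "する"]))  # ['私', 'は', '勉強する']
```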


  6. Construction of a Japanese Semantically Compatible Words Resource

    Image of semantically compatible words: "Fido is a dog, I will also conclude that he's an animal, that he is not a cat, and that he might or might not be a puppy" (Kruszewski et al., 2015).

    Is it a synonym? Semantically compatible words and synonyms are similar; however, the words refer to the same thing or stand in a hyponym-and-hypernym relation.
    Examples: Cat ⊂ (Kitty, Meow); Baby, Child, Infant; to assist, to lend a hand.

    The reason we collect them: we expect that semantically compatible words can ease the data-sparseness problem by reducing the number of distinct words.

    Collection: gathered from existing resources, automatically and manually.

    Classification | Word types | Concepts
    Hyponymy | 1,196 | 343
    Synonymy | 57,178 | 21,784

    Future work: we will evaluate this resource on some NLP tasks.
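The compatibility idea above (Fido is a dog, hence an animal) can be sketched as a lookup over toy hyponymy links and synonym sets; the entries are illustrative, not the resource's actual contents.

```python
# Toy resource: hyponym -> hypernym links, and synonym-set ids.
HYPERNYM = {"dog": "animal", "puppy": "dog"}
SYNSET = {"baby": 0, "child": 0, "infant": 0}

def compatible(a, b):
    """Two words are semantically compatible if they are synonyms or
    linked by a hyponym–hypernym chain (cf. dog -> animal)."""
    if a == b or SYNSET.get(a, -1) == SYNSET.get(b, -2):
        return True
    def ancestors(w):
        seen = set()
        while w in HYPERNYM:
            w = HYPERNYM[w]
            seen.add(w)
        return seen
    return b in ancestors(a) or a in ancestors(b)

print(compatible("puppy", "animal"))  # True
print(compatible("dog", "cat"))       # False
```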


  7. (figure-only slide; no transcript text)

  8. Construction of a Japanese Lexical Simplification System
    http://www.jnlp.org/resources

    [Background] Lexical simplification is a technique that replaces difficult words in a sentence with simpler synonyms. It supports the reading comprehension of a wide range of readers, including children and language learners.

    [Problem] For Japanese, nothing in this field — no corpus, no other language resource, no system — had been publicly released.

    [Contribution] We released the first Japanese lexical simplification system on the Web → it reaches users, and it can also serve as a research baseline!

    [Features] We built a "proper" system equipped with the four typical components of lexical simplification (figure below).

    Example run:
    入力文 (input): 未来は若者が担う "The future is borne by the young."
    出力文 (output): 未来は若者が支える "The future is supported by the young."

    ① Detection of difficult words: the input sentence is morphologically analyzed with MeCab, and content words (nouns, verbs, adjectives, adverbs) not in the simple-word list (the Basic Vocabulary for Learning) are extracted as difficult words. (Here, 担う is flagged.)

    ② Generation of lexical paraphrases: for each difficult word, the lexical paraphrases that serve as simplification candidates are enumerated. These are collected from a case base of basic semantic relations, the content-word paraphrase dictionary, a database of verb entailment relations, and the Japanese WordNet synonym database. (Here: 担う → 伝承する, 引継ぐ, 支える, 受け継ぐ.)

    ③ Word sense disambiguation: if the predicate–argument relations obtained by predicate–argument structure analysis (SynCha) are described in a case-frame dictionary (the Kyoto University case frames), the sentence is judged well-formed. (Here, 引継ぐ, 支える, 受け継ぐ survive.)

    ④ Reordering by difficulty: each simplification candidate is assigned a difficulty using a word-familiarity database. The word with the highest familiarity (the simplest word) replaces the difficult word in the input sentence and is output. (Here: 支える, then 受け継ぐ, then 引継ぐ.)

    Further substitution examples:
    海外からの{訪問者→お客さん}に{配る→渡す}ほか、神戸市の海外事務所に送付する。
    学習指導要領の{枠→フレーム}にとらわれない…
    北陸銀の高木{頭取→社長}も「リストラ策と収…
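The four components above can be sketched end to end as follows. This is a schematic only: the dictionaries below are toy stand-ins for MeCab's analysis, the paraphrase resources, the SynCha/case-frame check, and the word-familiarity database, and the familiarity scores are invented.

```python
# Toy stand-ins for the system's real resources (values are invented).
SIMPLE_WORDS = {"未来", "若者", "支える"}
PARAPHRASES = {"担う": ["伝承する", "引継ぐ", "支える", "受け継ぐ"]}
SENSE_OK = {"担う": {"引継ぐ", "支える", "受け継ぐ"}}   # survives WSD
FAMILIARITY = {"伝承する": 4.1, "引継ぐ": 5.0, "支える": 6.3, "受け継ぐ": 5.5}

def simplify(tokens):
    out = []
    for t in tokens:
        if t in SIMPLE_WORDS or t not in PARAPHRASES:    # (1) detection
            out.append(t)
            continue
        cands = PARAPHRASES[t]                           # (2) generation
        cands = [c for c in cands if c in SENSE_OK[t]]   # (3) disambiguation
        out.append(max(cands, key=FAMILIARITY.get))      # (4) pick simplest
    return out

print(simplify(["未来", "若者", "担う"]))  # ['未来', '若者', '支える']
```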


  9. Creating a Resource of Synonym Sets for Vocabulary Control

    Vocabulary control and overview: Treating different but nearly synonymous words, such as 赤ちゃん and 赤ん坊 (both "baby"), as identical is called vocabulary control. We perform vocabulary control to reduce the enormous number of word combinations produced by the expressive variety of natural language. We classified nearly synonymous words into hypernym–hyponym pairs and synonyms, and turned them into a resource.

    Example of vocabulary control: 解析する and 分析する (both "to analyze") are treated as one synonym set. Identifying nearly synonymous words with each other can be expected to improve performance on downstream tasks.

    Classification | Words | Concepts
    Hypernym–hyponym | 1,196 | 343
    Synonyms | 57,178 | 21,784

    Collection method and results: collected from existing resources — hypernym–hyponym pairs mainly by hand, synonyms automatically.
    Examples: hypernym–hyponym 猫:子猫、黒猫、親猫 ("cat: kitten, black cat, parent cat"); synonyms やかん、ケトル (both "kettle").
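Vocabulary control as described above can be sketched as mapping each word to a concept label so that near-synonyms collapse to one symbol. The table and labels are illustrative, not entries from the actual resource.

```python
# Toy synonym sets; the real resource covers 57,178 synonym words
# over 21,784 concepts.
CONCEPT = {"解析する": "analyze", "分析する": "analyze",
           "やかん": "kettle", "ケトル": "kettle"}

def control(tokens):
    """Replace each word by its concept label so that near-synonyms
    are treated as identical in downstream tasks."""
    return [CONCEPT.get(t, t) for t in tokens]

print(control(["データ", "を", "解析する"]))
print(control(["データ", "を", "分析する"]))  # same as above
```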


  10. Detecting Japanese Character-Conversion Errors for Proofreading Insurance Documents

    Purpose: We build a system to support proofreading by comparing a basic document (an article, etc.) with a derived document (a pamphlet, etc.).

    Method: Content words are extracted from the basic document and from each input sentence of the derived document. The basic-document sentence containing the most of the same content words is taken as the sentence corresponding to the input. If a content word is contained in the input sentence but not in the corresponding sentence, it is detected as an error word.

    Example:
    input sentence: 保健証券等に記載の自動車をいいます ("Indicates the car described in the health policy")
    content words: 保健証券等 (health policy), 記載 (described), 自動車 (car), いい (indicate)
    corresponding sentence: 保険証券等に記載の自動車をいいます ("Indicates the car described in the insurance policy")
    content words: 保険証券等 (insurance policy), 記載 (described), 自動車 (car), いい (indicate)
    → 保健証券等 is detected as a character-conversion error for 保険証券等.

    Pipeline: input sentence → extract content words; basic document → extract content words → extract corresponding sentence → error detection.

    Experiment: We made a test set by replacing content words in the basic document, and evaluated the system by extracting the errors in the test set.

    Result: Precision of sentence association is 77.7%; recall of error detection is 99.6%.
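The association and detection steps above can be sketched as set operations over content words. A minimal sketch, not the evaluated system; the second basic-document sentence is invented filler.

```python
def correspond(input_words, basic_sentences):
    """Pick the basic-document sentence sharing the most content words."""
    return max(basic_sentences, key=lambda s: len(set(s) & set(input_words)))

def detect_errors(input_words, basic_sentences):
    corr = correspond(input_words, basic_sentences)
    # words present in the input but absent from the corresponding sentence
    return [w for w in input_words if w not in corr]

basic = [["保険証券等", "記載", "自動車", "いい"],   # from the slide's example
         ["契約", "保険料", "支払"]]                  # invented filler sentence
inp = ["保健証券等", "記載", "自動車", "いい"]        # 保健 is the conversion error
print(detect_errors(inp, basic))  # ['保健証券等']
```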


  11. The Effect of Paraphrases in Statistical Machine Translation

    Background: In statistical machine translation, translation quality depends mainly on the corpus. Out-of-vocabulary words — words not in the corpus, which appear untranslated — are considered a consequence of data sparseness in the corpus. In related work, Ullman and Nivre [1] paraphrase highly frequent compound nouns; in their result, the BLEU value, a quantitative score, was lowered by the paraphrased corpus. Looking at the graph of token frequency (fig. 1), tokens of frequency 1 occupy the majority. Knowing this, we investigate how much reducing the number of 1-frequency tokens would affect BLEU values.

    Experiment: Instead of deleting the 1-frequency words, we paraphrase 1-frequency verbs following Ullman's method. Paraphrasing low-frequency words into more frequent words not only eliminates the low-frequency words but also makes the paraphrase verb more frequent (fig. 2). The corpus is the KFTT corpus, which consists of 440k training sentences. We paraphrased 200 randomly selected 1-frequency verbs into other, more common verbs. Fig. 3 shows a paraphrasing example:
    "It prevented the enemies from listening." / "It prevented the enemies from eavesdropping."

    For the experimental setup, MOSES [2] is used. Paraphrasing is done in both the training set and the test set. For evaluation, we conducted both a quantitative evaluation (BLEU) and a subjective evaluation. For the subjective evaluation, we rated translations on a 4-point scale: 0 means ungrammatical and not retaining the sense, and 3 the opposite. Fig. 4 shows the result. BLEU drops in the open experiments, matching Ullman's result. The subjective evaluation shows an increase at scale 0 — paraphrasing increased the number of low-quality translations — but also some increase at scale 3.

    1. E. Ullman and J. Nivre, "Paraphrasing Swedish compound nouns in machine translation," EACL 2014, p. 99, 2014.
    2. P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, "Moses: Open source toolkit for statistical machine translation," in Proc. 45th ACL, Companion Volume, pp. 177–180, 2007.
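The corpus treatment above (replace verbs that occur exactly once with a more frequent paraphrase) can be sketched as follows. A minimal sketch under the assumption of a pre-built paraphrase table; the toy corpus is invented.

```python
from collections import Counter

def paraphrase_hapax_verbs(corpus, paraphrase):
    """Replace words that occur exactly once in the corpus (hapaxes) with
    their listed, more frequent paraphrase."""
    counts = Counter(w for sent in corpus for w in sent)

    def sub(w):
        return paraphrase[w] if counts[w] == 1 and w in paraphrase else w

    return [[sub(w) for w in sent] for sent in corpus]

corpus = [["it", "prevented", "the", "enemies", "from", "eavesdropping"],
          ["they", "were", "listening"]]
print(paraphrase_hapax_verbs(corpus, {"eavesdropping": "listening"}))
```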


  12. The Snowman (雪だるま) Project
    Introducing the word analyzer Snowman: on notation control and morpheme connection

    Design policy:
    • Do not force decisions that cannot be made reliably.
    • Tolerate ambiguity.

    The word analyzer Snowman:
    • Notation-control component: merging orthographic variants.
    • Morpheme-connection component: connecting and adding multiple morphemes.
    • Sense-control component: merging synonyms.

    Construction method:
    • Merging orthographic variants: information obtained from various resources; about 25,000 words checked by hand.
    • Connecting and adding multiple morphemes: additions to the dictionary (resources + manual work); rule-based connection.

    Outlook: alleviating data sparseness in statistical methods → to be verified on downstream tasks.

    Pipeline: morphological analysis (UniDic) → notation control → morpheme connection → sense control.
