Natural Language Processing Laboratory: Research Overview (2015)


Natural Language Processing Laboratory

February 28, 2017

Transcript

  1. Nagaoka University of Technology, Natural Language Processing Laboratory: Research Overview (2015)

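  2. Building a Dataset for Evaluating Japanese Lexical Simplification (http://www.jnlp.org/resources)

    [Background] Lexical simplification is a technique that replaces difficult words in a sentence with simpler synonyms. It supports reading comprehension for a wide range of readers, including children and language learners.
    [Contribution] We provide an automatic evaluation framework that computes the precision and recall of lexical substitution and lexical simplification algorithms and derives the F-measure, so that the research cycle turns quickly.

    Datasets for evaluating lexical substitution and lexical simplification
    • Target words: content words (nouns, verbs, adjectives, adverbs) found in both the IPA dictionary and the JUMAN dictionary
    • Remove simple words, using the Basic Vocabulary for Learning (学習基本語彙), a comprehension vocabulary for elementary school students
    • Remove words that have no paraphrase, keeping only words listed in the content-word paraphrase dictionary
    • Remove low-frequency words, keeping only words that reach a minimum frequency in newspaper articles
    • Collect lexical paraphrases for each target word in several contexts

    ① Lexical substitution dataset
    • Recruit 5 annotators per item through crowdsourcing
    • Annotators list every paraphrase of the target word in context that they can think of
    • Consulting a dictionary is allowed, leaving the answer blank is allowed, asking other people is not allowed
    • Recruit 5 new annotators through crowdsourcing to judge the collected paraphrases
    • Adopt the paraphrases that 3 or more annotators judged to be "an appropriate paraphrase"
    • Inter-annotator agreement was sufficiently high
    • Format: ID, target word, paraphrase vote count; paraphrase vote count; ...

    ② Lexical simplification dataset
    • Annotators reorder the target word and its paraphrases, in context, from simplest to most difficult
    • Recruit 5 annotators per item through crowdsourcing
    • Integrate the data by averaging the difficulty ranks
    • Words with the same average are assigned the same rank
    • Format: ID, target word, {simplest words} ... { } ... {most difficult words}

    Example (target word 悪気, "ill intent"; the numeric IDs and vote counts are not preserved in this transcript)
    • Context: 悪気, 親は悪気で言ったわけではなく、子供をあやすということを本当に知らなかった様子。
    • Substitution entry: 悪気, 意地悪; 悪い考え; 悪意;
    • Simplification entry: 悪気, {意地悪, 悪意} {悪気} {悪い考え}

    Dataset size (values not preserved in this transcript): total sentences and counts of nouns, verbs, adjectives, and adverbs for English version 1, English version 2, and the Japanese version.

    Context dependence of the dataset (English version 1 vs. Japanese version; counts not preserved in this transcript)
    ①: pairs of contexts that share the same target word
    ②: among ①, pairs whose paraphrase lists are equal
    ③: among ②, pairs whose difficulty rankings differ
    ④: among ③, pairs whose simplest word differs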
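The integration step for the simplification dataset (average the difficulty ranks, then give words with the same average the same rank) can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `merge_rankings` and the two annotator rankings are invented for the example, and every annotator is assumed to rank every word.

```python
from collections import defaultdict


def merge_rankings(rankings):
    """Merge per-annotator difficulty rankings into tied rank groups.

    rankings: list of dicts, one per annotator, mapping word -> rank
              (1 = simplest). Assumes every annotator ranks every word,
              so equal rank sums mean equal average ranks.
    Returns a list of word groups, simplest group first.
    """
    rank_sums = defaultdict(int)
    for ranking in rankings:
        for word, rank in ranking.items():
            rank_sums[word] += rank

    # Words with the same sum (hence the same mean) share one group.
    groups = defaultdict(list)
    for word, total in rank_sums.items():
        groups[total].append(word)
    return [sorted(groups[total]) for total in sorted(groups)]


# Invented rankings for the target word 悪気 and its paraphrases.
example_rankings = [
    {"意地悪": 1, "悪意": 2, "悪気": 3, "悪い考え": 4},
    {"悪意": 1, "意地悪": 2, "悪気": 3, "悪い考え": 4},
]
merged = merge_rankings(example_rankings)
```

With these invented inputs, 意地悪 and 悪意 tie and form the simplest group, which corresponds to the brace notation {意地悪, 悪意} {悪気} {悪い考え} used in the dataset format.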
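  3. Comparing the Effectiveness of Part-of-Speech Information and Semantic-Type Information in Opinion Analysis

    Overview
    Predicates are broadly divided into verbs and adjectives. Verbs are generally said to express the movement of people and things, and adjectives to express the characteristics and properties of things. In practice, however, some verbs, such as 優れる ("to excel"), can act as descriptive expressions. To classify predicates semantically, earlier work in our laboratory proposed a set of semantic types (action, change, perception/emotion, description). In this study we take opinion analysis as an example and show that classifying predicates by semantic type is useful. Specifically, we perform binary classification of whether a sentence states an opinion about something, based on the semantic type of the last predicate in the sentence, and confirm that accuracy improves compared with classification that uses part-of-speech information.

    Comparison of part of speech and semantic type
    • Part of speech: a morphological classification; in principle unique for each word; handles morphological features systematically (e.g. changes in conjugated form)
    • Semantic type: a semantic classification (by human judgment); can differ with context even for the same word; handles word meaning more precisely

    Examples of opinion expressions that became extractable by using semantic types
    • ひと味違ったお店です ("It is a shop with a difference"): part of speech: verb; semantic type: description
    • 悲しさを感じる ("I feel sadness"): part of speech: verb; semantic type: perception/emotion
    → Information such as "a verb used as a descriptive expression" or "a verb expressing the working of the senses" becomes available.

    Results of applying semantic types to opinion analysis (part of speech only → with semantic types; the values are not preserved in this transcript)
    • Classification accuracy for opinion vs. non-opinion sentences: increased (in points)
    • F-measure of opinion sentence extraction: increased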
  4. (Syntax-tree figure; surviving labels: IP-MAT, *pro*(SBJ), *T*(SBJ), IP-REL)

  5. What We Need is Word, Not Morpheme; Constructing Word Analyzer for Japanese

    • We constructed a word analyzer, SNOWMAN
    • SNOWMAN addresses two problems in Japanese
      • Orthographical variants
      • Multiword expressions
    • Constructing the SNOWMAN dictionary
      • Gathered dictionaries related to the two problems
      • Checked the entries by hand
    • We show two comparisons between UniDic and SNOWMAN
      • Merging orthographical variants
        • Synonyms in UniDic are not always orthographical variants
        • UniDic entries are merged without regard to differences in sense
      • Frequency coverage
        • The difference between UniDic and SNOWMAN is 0.7 points
        • UniDic does not always recognize variants as the same entry

    Orthographical variants
    • Same pronunciation and same sense
    • Different notations
    • e.g. りんご・リンゴ・林檎

    Multiword expressions
    • A word made up of two or more words
    • e.g. idioms, nominal verbs

    Figure labels: morphological analysis, merging orthographical variants, connecting multiple morphemes
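The two operations named on this slide, merging orthographical variants and connecting multiple morphemes, can be illustrated with a small dictionary-lookup sketch. It is only an illustration under stated assumptions and not the SNOWMAN implementation: the tables `VARIANTS` and `MWES`, the choice of 林檎 as the representative entry, and both function names are hypothetical, and the input is assumed to be already segmented into morphemes.

```python
# Hypothetical variant table: surface form -> representative entry.
VARIANTS = {"りんご": "林檎", "リンゴ": "林檎", "林檎": "林檎"}
# Hypothetical multiword expressions, stored as morpheme tuples.
MWES = {("手", "を", "貸す")}


def merge_variants(tokens, variants=VARIANTS):
    """Replace each token by its representative form; unknown tokens pass through."""
    return [variants.get(token, token) for token in tokens]


def connect_morphemes(tokens, mwes=MWES):
    """Greedily join the longest run of morphemes that matches a listed expression."""
    max_len = max((len(m) for m in mwes), default=1)
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first, down to two morphemes.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in mwes:
                merged.append("".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No expression starts here: keep the single morpheme.
            merged.append(tokens[i])
            i += 1
    return merged
```

For example, `merge_variants(["リンゴ", "を", "食べる"])` maps リンゴ to the shared entry, and `connect_morphemes(["彼", "に", "手", "を", "貸す"])` joins the last three morphemes into one word.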
  6. Construction of Japanese Semantically Compatible Words Resource

    Image of semantically compatible words
    "Fido is a dog, I will also conclude that he's an animal, that he is not a cat, and that he might or might not be a puppy" (Kruszewski et al. 2015).
    Figure labels: animal, Fido

    Is it a synonym?
    Semantically compatible words and synonyms are similar. However, semantically compatible words either refer to the same thing or stand in a hyponym-hypernym relation.

    The reason we collect them
    We expect semantically compatible words to ease the data sparseness problem by reducing the number of distinct words.

    Collection
    Collected from existing resources, both automatically and manually.

    Classification   Word Types   Concepts
    Hyponymy              1,196        343
    Synonymy             57,178     21,784

    Examples: Cat ⊂ (Kitty, Meow); Baby, Child, Infant; to assist, to lend a hand

    Future work: we will evaluate this resource on several NLP tasks.
  7. None
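  8. Building a Japanese Lexical Simplification System (http://www.jnlp.org/resources)

    [Background] Lexical simplification is a technique that replaces difficult words in a sentence with simpler synonyms. It supports reading comprehension for a wide range of readers, including children and language learners.
    [Problem] For Japanese, no corpus, no other language resource, and no system in this field has been made public.
    [Contribution] We release a Japanese lexical simplification system on the Web for the first time, so that it reaches users and can also serve as a research baseline.
    [Features] We built a "proper" system equipped with the four typical mechanisms of lexical simplification (shown in the figure).

    Output examples ({original → simplified})
    • 海外からの{訪問者→お客さん}に{配る→渡す}ほか、神戸市の海外事務所に送付する。
    • 学習指導要領の{枠→フレーム}にとらわれない…
    • 北陸銀の高木{頭取→社長}も「リストラ策と収…

    The four mechanisms
    ① Detecting difficult words: the input sentence is morphologically analyzed (MeCab), and content words (nouns, verbs, adjectives, adverbs) that are not in the list of simple words (Basic Vocabulary for Learning, 学習基本語彙) are extracted as difficult words.
    ② Generating lexical paraphrases: for each difficult word, the lexical paraphrases that serve as simplification candidates are listed. These paraphrases are collected from the case base of basic semantic relations, the content-word paraphrase dictionary, the verb entailment relation database, and the Japanese WordNet synonym database.
    ③ Word sense disambiguation: each candidate is evaluated in context. If the relation between the predicate and its arguments obtained by predicate-argument structure analysis (SynCha) is described in the case frame dictionary (Kyoto University Case Frames), the sentence is considered well-formed.
    ④ Reordering by difficulty: each simplification candidate is assigned a difficulty using a word familiarity database. The word with the highest familiarity (the simplest word) replaces the difficult word in the input sentence and is output.

    Worked example (figure)
    • Input sentence: 未来は若者が担う
    • Detecting difficult words: ×: 担う
    • Generating lexical paraphrases: 担う → 伝承する, 引継ぐ, 支える, 受け継ぐ
    • Word sense disambiguation: 引継ぐ, 支える, 受け継ぐ
    • Reordering by difficulty: 支える, 受け継ぐ, 引継ぐ
    • Output sentence: 未来は若者が支える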
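The four mechanisms on this slide can be traced with a toy pipeline. The sketch below is an assumption-laden illustration, not the released system: MeCab, SynCha, the case frame dictionary, and the word familiarity database are replaced by small hand-made tables (`BASIC_VOCAB`, `PARAPHRASES`, `CASE_FRAMES`, `FAMILIARITY`), the familiarity scores are invented, the input is assumed to be already tokenized with part-of-speech tags, and the well-formedness check is reduced to looking up the nearest preceding noun together with the candidate verb.

```python
CONTENT_POS = {"名詞", "動詞", "形容詞", "副詞"}
# Stand-in for the list of simple words (hypothetical contents).
BASIC_VOCAB = {"未来", "若者"}
# Stand-in for the paraphrase resources.
PARAPHRASES = {"担う": ["伝承する", "引継ぐ", "支える", "受け継ぐ"]}
# Stand-in for the case-frame check: acceptable (argument, predicate) pairs.
CASE_FRAMES = {("若者", "引継ぐ"), ("若者", "支える"), ("若者", "受け継ぐ")}
# Stand-in for word familiarity (invented scores; higher = simpler).
FAMILIARITY = {"支える": 6.0, "受け継ぐ": 5.5, "引継ぐ": 5.0}


def simplify(tokens):
    """tokens: list of (surface, part_of_speech) pairs for one sentence."""
    surfaces = [surface for surface, _ in tokens]
    for i, (surface, pos) in enumerate(tokens):
        # 1. Detect difficult words: content words outside the basic vocabulary.
        if pos not in CONTENT_POS or surface in BASIC_VOCAB:
            continue
        # 2. Generate lexical paraphrases as simplification candidates.
        candidates = PARAPHRASES.get(surface, [])
        # 3. Disambiguate: keep candidates that fit the nearest preceding noun.
        noun = next((s for s, p in reversed(tokens[:i]) if p == "名詞"), None)
        candidates = [c for c in candidates if (noun, c) in CASE_FRAMES]
        # 4. Rank by familiarity and substitute the simplest candidate.
        if candidates:
            surfaces[i] = max(candidates, key=lambda c: FAMILIARITY.get(c, 0.0))
    return "".join(surfaces)


sentence = [("未来", "名詞"), ("は", "助詞"), ("若者", "名詞"),
            ("が", "助詞"), ("担う", "動詞")]
output = simplify(sentence)
```

On the slide's example, 担う is detected as difficult, 伝承する is filtered out in step 3, and 支える is chosen in step 4, giving 未来は若者が支える.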
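  9. Building a Resource of Synonym Sets for Vocabulary Control

    Vocabulary control and overview
    Vocabulary control means treating different words with nearly the same meaning, such as 赤ちゃん and 赤ん坊 (both "baby"), as identical. We apply vocabulary control to reduce the enormous number of word combinations that arise from the variety of expression in natural language. We classified nearly synonymous words into hypernym-hyponym pairs and synonyms and built a resource from them.

    Example of vocabulary control
    解析する and 分析する (both "to analyze") are treated as one synonym set. Treating nearly synonymous words as identical is expected to improve performance on downstream tasks.

    Collection method and results
    We collected the words from existing resources. Hypernym-hyponym relations were collected mainly by hand, and synonyms were collected automatically.

    Classification              Words   Concepts
    Hypernym-hyponym relation   1,196        343
    Synonym                    57,178     21,784

    Examples: hypernym-hyponym relation 猫: 子猫、黒猫、親猫 (cat: kitten, black cat, parent cat); synonyms やかん、ケトル (both "kettle")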
  10. Detecting Japanese character conversion errors for proofreading insurance documents

    Purpose: I build a system to support proofreading that compares a basic document (articles and so on) with a derived document (pamphlets and so on).
    Method: Content words are extracted from the basic document and from each sentence of the derived document (the input). The basic-document sentence that shares the most content words with the input sentence is taken as its corresponding sentence. A content word that appears in the input sentence but not in the corresponding sentence is detected as an error word.
    Experiment: I made a test set by replacing content words in the basic document, and evaluated the system on extracting the errors in the test set.
    Result: In association (finding the corresponding sentence), precision is 77.7%. In error detection, recall is 99.6%.

    Example
    • Input sentence: 保健証券等に記載の自動車をいいます (Indicates the car described in the health policy)
    • Extracted content words: 保健証券等 (health policy), 記載 (described), 自動車 (car), いい (indicate)
    • Corresponding sentence in the basic document: 保険証券等に記載の自動車をいいます (Indicates the car described in the insurance policy)
    • Extracted content words: 保険証券等 (insurance policy), 記載 (described), 自動車 (car), いい (indicate)

    Figure labels: input sentence, basic document, extract content words, extract corresponding sentence, error detection
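The method on this slide (pick the basic-document sentence with the largest content-word overlap, then flag input words missing from it) can be sketched with set operations. This is a simplified illustration, not the author's system: content-word extraction is assumed to be done already, the function names are hypothetical, and the second basic-document sentence is invented so that there is something to choose between.

```python
def find_corresponding(input_words, base_sentences):
    """Return the basic-document sentence sharing the most content words.

    Each sentence is represented as a set of content words.
    """
    return max(base_sentences, key=lambda words: len(input_words & words))


def detect_errors(input_words, base_sentences):
    """Content words in the input that the corresponding sentence lacks."""
    corresponding = find_corresponding(input_words, base_sentences)
    return input_words - corresponding


base_sentences = [
    {"保険証券等", "記載", "自動車", "いい"},
    {"保険金", "支払い", "場合"},  # invented second sentence
]
input_words = {"保健証券等", "記載", "自動車", "いい"}
errors = detect_errors(input_words, base_sentences)
```

Here the first basic-document sentence shares three content words with the input, so it is selected, and the mistyped 保健証券等 is the only word left over.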
  11. The Effect of Paraphrases in Statistical Machine Translation

    Background
    In statistical machine translation, translation quality depends mainly on the corpus. Out-of-vocabulary words, that is, words absent from the corpus that are left untranslated, are considered to be caused by data sparseness in the corpus. In related work, Ullman et al. [1] paraphrase high-frequency compound nouns. In their results, the BLEU value, a quantitative score, dropped with the paraphrased corpus. In the graph of token frequency (fig. 1), tokens of frequency 1 make up the majority. Given this, we investigate how much reducing the number of frequency-1 tokens affects BLEU values.

    Experiment
    Instead of deleting the frequency-1 words, we paraphrase frequency-1 verbs following Ullman's method. Paraphrasing low-frequency words into more frequent words not only eliminates the low-frequency words but also makes the paraphrase verb more frequent (fig. 2). The corpus is the KFTT corpus, which consists of 440k sentences for training. We paraphrased 200 randomly selected frequency-1 verbs into other, more common verbs. Fig. 3 shows a paraphrasing example:
      It prevented the enemies from listening .
      It prevented the enemies from eavesdropping .

    For the experimental setup, MOSES [2] is used. Paraphrasing is applied to both the training set and the test set. For evaluation, we conducted both a quantitative evaluation (BLEU) and a subjective evaluation. In the subjective evaluation, we rated translations on a 4-point scale, where 0 means grammatically incorrect and not retaining the sense, and 3 means the opposite. Fig. 4 shows the results.

    As a result, BLEU drops in the open experiments, in line with the result by Ullman. In the subjective evaluation, scale 0 increases with paraphrasing, meaning more low-quality translations, but scale 3 also increases somewhat.

    Figures: fig. 1 (Token Frequency), fig. 2, fig. 3, fig. 4

    1. E. Ullman and J. Nivre, "Paraphrasing Swedish compound nouns in machine translation," EACL 2014, p. 99, 2014.
    2. P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. Moses: Open source toolkit for statistical machine translation. In Proc. 45th ACL, Companion Volume, pages 177–180, 2007.
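The preprocessing idea, replacing frequency-1 words with more frequent paraphrases before training, can be sketched as below. This is an illustrative assumption and not the experiment's actual script: the function name, the toy corpus, and the paraphrase table are invented, the restriction to verbs is left to whatever the paraphrase table contains, and the direction of the example (eavesdropping → listening) is assumed.

```python
from collections import Counter


def paraphrase_singletons(sentences, paraphrases):
    """Replace tokens that occur exactly once in the corpus by a paraphrase.

    sentences:   list of token lists.
    paraphrases: dict mapping a rare word to a more common word
                 (hypothetical table; tokens without an entry are kept).
    """
    counts = Counter(token for sentence in sentences for token in sentence)
    return [
        [paraphrases.get(token, token) if counts[token] == 1 else token
         for token in sentence]
        for sentence in sentences
    ]


corpus = [
    ["it", "prevented", "the", "enemies", "from", "eavesdropping", "."],
    ["they", "kept", "listening", "."],
    ["she", "was", "listening", "."],
]
paraphrased = paraphrase_singletons(corpus, {"eavesdropping": "listening"})
```

After the replacement, the rare token disappears from the corpus and the count of its paraphrase rises by one, which is the effect described for fig. 2. The same table would also have to be applied to the test set, as the slide notes.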
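  12. Introducing the Word Analyzer 雪だるま (Snowman): Orthographic Control and Morpheme Connection

    The 雪だるま (Snowman) project
    • Do not force a decision on what cannot be decided
    • Tolerate ambiguity

    The word analyzer 雪だるま
    • Orthographic control component: merges orthographic variants
    • Morpheme connection component: connects and adds multi-morpheme units
    • Sense control component: merges synonyms

    Construction method
    • Merging orthographic variants
      • Information gathered from various resources
      • About 25,000 words checked by hand
    • Connecting and adding multi-morpheme units
      • Additions to the dictionary (resources plus manual work)
      • Rule-based connection

    Outlook
    • Alleviating data sparseness in statistical methods → to be verified on downstream tasks

    Figure labels: morphological analysis (UniDic), orthographic control, morpheme connection, sense control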