
Natural Language Processing Laboratory: Research Overview (2015)


Natural Language Processing Laboratory

February 28, 2017


Transcript

  1. Nagaoka University of Technology
    Natural Language Processing Laboratory
    Research Overview (2015)


  2. Construction of Datasets for Evaluating Japanese Lexical Simplification
    http://www.jnlp.org/resources

    [Background] Lexical simplification is a technique that replaces difficult words in a sentence with simpler synonyms. It supports the reading comprehension of a wide range of readers, including children and language learners.

    [Contribution] We provide an automatic-evaluation framework that computes the precision, recall, and F-measure of lexical paraphrasing and lexical simplification algorithms → the research cycle can turn faster!

    Scale: the slide tabulates, for English version 1, English version 2, and the Japanese version, the total number of sentences and the counts of nouns, verbs, adjectives, and adverbs.

    Context dependence of the datasets (English version 1 vs. the Japanese version):
    ① pairs in which the target word appears in the same context
    ② pairs in ① whose paraphrase lists are identical
    ③ pairs in ② whose difficulty rankings differ
    ④ pairs in ③ whose simplest word differs

    Target-word selection: content words (nouns, verbs, adjectives, adverbs) in both the IPA dictionary and the JUMAN dictionary, after
    • removing simple words (those in the Basic Vocabulary for Learning, the comprehension vocabulary of elementary-school students),
    • removing words without paraphrases (only words in the content-word paraphrase dictionary are kept), and
    • removing low-frequency words (keeping words above a frequency threshold in several years' worth of newspaper articles).

    Lexical paraphrase dataset — format: ID, target word, paraphrase 1 with vote count; paraphrase 2 with vote count; …
    • For each target word, lexical paraphrases were collected in several contexts.
    • Five workers per item were recruited by crowdsourcing; each listed as many paraphrases of the target word in context as they could think of (consulting dictionaries allowed, leaving a blank allowed, asking other people not allowed).
    • Five newly recruited crowdsourced workers then judged the collected candidates; a paraphrase was adopted when three or more workers answered that it was an appropriate paraphrase.
    • Inter-worker agreement was sufficiently high.

    Lexical simplification dataset — format: ID, target word, {simplest word} … { } … {most difficult word}
    • Five workers per item, recruited by crowdsourcing, reordered the target word and its paraphrases from simplest to most difficult in context.
    • The data were merged by averaging the difficulty ranks; words with the same average rank are assigned the same rank.

    Example entries for the target word 悪気 "ill will":
    • …, 悪気, 親は悪気で言ったわけではなく、子供をあやすということを本当に知らなかった様子。(context sentence: "The parent did not say it out of ill will; they genuinely did not seem to know how to soothe a child.")
    • …, 悪気, 意地悪; 悪い考え; 悪意; … (paraphrases with votes)
    • …, 悪気, {意地悪, 悪意} {悪気} {悪い考え} (difficulty-ordered sets)
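The rank-merging step described above (average each word's rank across workers, then tie words with equal averages) can be sketched as follows. This is a minimal illustration, not the lab's actual code; the word lists are invented.

```python
from collections import defaultdict

def merge_rankings(rankings):
    """Merge several workers' difficulty rankings (simplest first) into one
    ranking by averaging each word's rank; words whose average ranks are
    equal end up in the same tier."""
    avg = {w: sum(r.index(w) + 1 for r in rankings) / len(rankings)
           for w in rankings[0]}
    # group words by their average rank, then order tiers from simplest up
    tiers = defaultdict(list)
    for w, a in avg.items():
        tiers[a].append(w)
    return [sorted(tiers[a]) for a in sorted(tiers)]

# three hypothetical workers rank the same four words
workers = [
    ["easy", "plain", "hard", "obscure"],
    ["plain", "easy", "hard", "obscure"],
    ["easy", "plain", "obscure", "hard"],
]
print(merge_rankings(workers))  # [['easy'], ['plain'], ['hard'], ['obscure']]
```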


  3. Comparing the Effectiveness of Part-of-Speech and Semantic-Type Information in Sentiment Analysis

    Overview: Predicates divide broadly into verbs and adjectives; conventionally, verbs express the motion of people and things, while adjectives express the features and properties of things. In practice, however, there are verbs such as 優れる "to excel" that can serve as descriptive expressions. To classify predicates semantically, earlier work in our laboratory proposed semantic types (action, change, sensation/emotion, description). This study uses sentiment analysis as an example to show the usefulness of classifying predicates by semantic type. Specifically, from the semantic type of the last predicate in a sentence, we perform binary classification of whether the sentence expresses a sentiment about something, and confirm that accuracy improves compared with classification using part-of-speech information.

    Examples of sentiment expressions that become extractable with semantic types:
    • ひと味違ったお店です "a restaurant with a distinctive flavor": POS = verb; semantic type = description
    • 悲しさを感じる "to feel sadness": POS = verb; semantic type = sensation/emotion
    → Information such as "a verb used descriptively" or "a verb expressing the work of the senses" becomes usable.

    POS vs. semantic type:
    • POS: a morphological classification; in principle unique per word; handles morphological features systematically (e.g., changes of inflected form).
    • Semantic type: a semantic classification (judged by humans); may differ by context even for the same word; handles word meaning more precisely.

    Application to sentiment analysis: both the accuracy of classifying sentences as sentiment vs. non-sentiment and the F-measure of sentiment-sentence extraction rose when semantic types were used instead of POS alone.
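The binary decision above (classify a sentence by the semantic type of its final predicate) can be sketched as follows. The tiny lexicon is hypothetical; the real semantic types are assigned by human judgment and may vary with context.

```python
# Hypothetical semantic-type lexicon (the actual resource is human-judged
# and context-dependent); types follow the slide: action, change,
# sensation/emotion, description.
SEMANTIC_TYPE = {
    "違っ": "description",        # 違う used descriptively
    "感じる": "sensation/emotion",
    "歩く": "action",
}
SENTIMENT_TYPES = {"description", "sensation/emotion"}

def is_sentiment(predicates):
    """Binary classification from the semantic type of the last predicate."""
    return SEMANTIC_TYPE.get(predicates[-1]) in SENTIMENT_TYPES

print(is_sentiment(["違っ"]))   # True
print(is_sentiment(["歩く"]))   # False
```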






  4. (figure-only slide: a parse-tree fragment with an IP-MAT matrix clause, an IP-REL relative clause, and the empty subjects *pro*(SBJ) and *T*(SBJ))

  5. What We Need Is the Word, Not the Morpheme:
    Constructing a Word Analyzer for Japanese
    • We constructed the word analyzer SNOWMAN.
    • SNOWMAN addresses two problems in Japanese:
    • Orthographic variants: same pronunciation and same sense, but different notations (e.g., りんご・リンゴ・林檎 "apple").
    • Multiword expressions: a word made up of two or more words (e.g., idioms, nominal verbs).
    • Constructing the SNOWMAN dictionary:
    • We gathered dictionaries related to the two problems.
    • We checked the entries by hand.
    • Processing pipeline: morphological analysis → merging orthographic variants → connecting multiple morphemes.
    • We show two comparisons between UniDic and SNOWMAN:
    • Merging orthographic variants: synonyms in UniDic are not always orthographic variants, and UniDic entries are merged while ignoring sense distinctions.
    • Coverage by frequency: the difference between UniDic and SNOWMAN is 0.7 points; UniDic cannot recognize the variants as the same entry.
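The morpheme-connection step above (joining the pieces of a multiword expression into one word) can be sketched as a greedy rule over adjacent morphemes. The rule table is a toy stand-in for SNOWMAN's hand-checked dictionary.

```python
# Toy multiword-expression table; SNOWMAN's real dictionary is built from
# gathered resources and checked by hand. Here a nominal verb is joined.
MWE = {("勉強", "する"): "勉強する"}

def connect(morphemes):
    """Greedily join adjacent morpheme pairs listed in the MWE table."""
    out, i = [], 0
    while i < len(morphemes):
        pair = tuple(morphemes[i:i + 2])
        if pair in MWE:
            out.append(MWE[pair])
            i += 2
        else:
            out.append(morphemes[i])
            i += 1
    return out

print(connect(["私", "は", "勉強", "する"]))  # ['私', 'は', '勉強する']
```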


  6. Construction of a Japanese Semantically Compatible Words Resource

    Image of semantically compatible words: "Fido is a dog, I will also conclude that he's an animal, that he is not a cat, and that he might or might not be a puppy" (Kruszewski et al., 2015).

    Is it a synonym? Semantically compatible words and synonyms are similar; however, the words refer to the same thing or stand in a hyponym-and-hypernym relation.
    Examples: Cat ⊂ (Kitty, Meow); Baby, Child, Infant; to assist, to lend a hand.

    The reason we collect them: we expect that semantically compatible words can ease the data-sparseness problem by reducing the number of distinct words.

    Collection: gathered from existing resources, automatically and manually.

    Classification | Word types | Concepts
    Hyponymy | 1,196 | 343
    Synonymy | 57,178 | 21,784

    Future work: we will evaluate this resource on some NLP tasks.
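The compatibility idea above (Fido is a dog, hence an animal) can be sketched as a lookup over toy hyponymy links and synonym sets; the entries are illustrative, not the resource's actual contents.

```python
# Toy resource: hyponym -> hypernym links, and synonym-set ids.
HYPERNYM = {"dog": "animal", "puppy": "dog"}
SYNSET = {"baby": 0, "child": 0, "infant": 0}

def compatible(a, b):
    """Two words are semantically compatible if they are synonyms or
    linked by a hyponym–hypernym chain (cf. dog -> animal)."""
    if a == b or SYNSET.get(a, -1) == SYNSET.get(b, -2):
        return True
    def ancestors(w):
        seen = set()
        while w in HYPERNYM:
            w = HYPERNYM[w]
            seen.add(w)
        return seen
    return b in ancestors(a) or a in ancestors(b)

print(compatible("puppy", "animal"))  # True
print(compatible("dog", "cat"))       # False
```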


  7. (figure-only slide; no transcript text)

  8. Construction of a Japanese Lexical Simplification System
    http://www.jnlp.org/resources

    [Background] Lexical simplification is a technique that replaces difficult words in a sentence with simpler synonyms. It supports the reading comprehension of a wide range of readers, including children and language learners.

    [Problem] For Japanese, nothing in this field — no corpus, no other language resource, no system — had been publicly released.

    [Contribution] We released the first Japanese lexical simplification system on the Web → it reaches users, and it can also serve as a research baseline!

    [Features] We built a "proper" system equipped with the four typical components of lexical simplification (figure below).

    Example run:
    入力文 (input): 未来は若者が担う "The future is borne by the young."
    出力文 (output): 未来は若者が支える "The future is supported by the young."

    ① Detection of difficult words: the input sentence is morphologically analyzed with MeCab, and content words (nouns, verbs, adjectives, adverbs) not in the simple-word list (the Basic Vocabulary for Learning) are extracted as difficult words. (Here, 担う is flagged.)

    ② Generation of lexical paraphrases: for each difficult word, the lexical paraphrases that serve as simplification candidates are enumerated. These are collected from a case base of basic semantic relations, the content-word paraphrase dictionary, a database of verb entailment relations, and the Japanese WordNet synonym database. (Here: 担う → 伝承する, 引継ぐ, 支える, 受け継ぐ.)

    ③ Word sense disambiguation: if the predicate–argument relations obtained by predicate–argument structure analysis (SynCha) are described in a case-frame dictionary (the Kyoto University case frames), the sentence is judged well-formed. (Here, 引継ぐ, 支える, 受け継ぐ survive.)

    ④ Reordering by difficulty: each simplification candidate is assigned a difficulty using a word-familiarity database. The word with the highest familiarity (the simplest word) replaces the difficult word in the input sentence and is output. (Here: 支える, then 受け継ぐ, then 引継ぐ.)

    Further substitution examples:
    海外からの{訪問者→お客さん}に{配る→渡す}ほか、神戸市の海外事務所に送付する。
    学習指導要領の{枠→フレーム}にとらわれない…
    北陸銀の高木{頭取→社長}も「リストラ策と収…
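The four components above can be sketched end to end as follows. This is a schematic only: the dictionaries below are toy stand-ins for MeCab's analysis, the paraphrase resources, the SynCha/case-frame check, and the word-familiarity database, and the familiarity scores are invented.

```python
# Toy stand-ins for the system's real resources (values are invented).
SIMPLE_WORDS = {"未来", "若者", "支える"}
PARAPHRASES = {"担う": ["伝承する", "引継ぐ", "支える", "受け継ぐ"]}
SENSE_OK = {"担う": {"引継ぐ", "支える", "受け継ぐ"}}   # survives WSD
FAMILIARITY = {"伝承する": 4.1, "引継ぐ": 5.0, "支える": 6.3, "受け継ぐ": 5.5}

def simplify(tokens):
    out = []
    for t in tokens:
        if t in SIMPLE_WORDS or t not in PARAPHRASES:    # (1) detection
            out.append(t)
            continue
        cands = PARAPHRASES[t]                           # (2) generation
        cands = [c for c in cands if c in SENSE_OK[t]]   # (3) disambiguation
        out.append(max(cands, key=FAMILIARITY.get))      # (4) pick simplest
    return out

print(simplify(["未来", "若者", "担う"]))  # ['未来', '若者', '支える']
```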


  9. Creating a Resource of Synonym Sets for Vocabulary Control

    Vocabulary control and overview: Treating different but nearly synonymous words, such as 赤ちゃん and 赤ん坊 (both "baby"), as identical is called vocabulary control. We perform vocabulary control to reduce the enormous number of word combinations produced by the expressive variety of natural language. We classified nearly synonymous words into hypernym–hyponym pairs and synonyms, and turned them into a resource.

    Example of vocabulary control: 解析する and 分析する (both "to analyze") are treated as one synonym set. Identifying nearly synonymous words with each other can be expected to improve performance on downstream tasks.

    Classification | Words | Concepts
    Hypernym–hyponym | 1,196 | 343
    Synonyms | 57,178 | 21,784

    Collection method and results: collected from existing resources — hypernym–hyponym pairs mainly by hand, synonyms automatically.
    Examples: hypernym–hyponym 猫:子猫、黒猫、親猫 ("cat: kitten, black cat, parent cat"); synonyms やかん、ケトル (both "kettle").
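Vocabulary control as described above can be sketched as mapping each word to a concept label so that near-synonyms collapse to one symbol. The table and labels are illustrative, not entries from the actual resource.

```python
# Toy synonym sets; the real resource covers 57,178 synonym words
# over 21,784 concepts.
CONCEPT = {"解析する": "analyze", "分析する": "analyze",
           "やかん": "kettle", "ケトル": "kettle"}

def control(tokens):
    """Replace each word by its concept label so that near-synonyms
    are treated as identical in downstream tasks."""
    return [CONCEPT.get(t, t) for t in tokens]

print(control(["データ", "を", "解析する"]))
print(control(["データ", "を", "分析する"]))  # same as above
```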


  10. Detecting Japanese Character-Conversion Errors for Proofreading Insurance Documents

    Purpose: We build a system to support proofreading by comparing a basic document (an article, etc.) with a derived document (a pamphlet, etc.).

    Method: Content words are extracted from the basic document and from each input sentence of the derived document. The basic-document sentence containing the most of the same content words is taken as the sentence corresponding to the input. If a content word is contained in the input sentence but not in the corresponding sentence, it is detected as an error word.

    Example:
    input sentence: 保健証券等に記載の自動車をいいます ("Indicates the car described in the health policy")
    content words: 保健証券等 (health policy), 記載 (described), 自動車 (car), いい (indicate)
    corresponding sentence: 保険証券等に記載の自動車をいいます ("Indicates the car described in the insurance policy")
    content words: 保険証券等 (insurance policy), 記載 (described), 自動車 (car), いい (indicate)
    → 保健証券等 is detected as a character-conversion error for 保険証券等.

    Pipeline: input sentence → extract content words; basic document → extract content words → extract corresponding sentence → error detection.

    Experiment: We made a test set by replacing content words in the basic document, and evaluated the system by extracting the errors in the test set.

    Result: Precision of sentence association is 77.7%; recall of error detection is 99.6%.
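The association and detection steps above can be sketched as set operations over content words. A minimal sketch, not the evaluated system; the second basic-document sentence is invented filler.

```python
def correspond(input_words, basic_sentences):
    """Pick the basic-document sentence sharing the most content words."""
    return max(basic_sentences, key=lambda s: len(set(s) & set(input_words)))

def detect_errors(input_words, basic_sentences):
    corr = correspond(input_words, basic_sentences)
    # words present in the input but absent from the corresponding sentence
    return [w for w in input_words if w not in corr]

basic = [["保険証券等", "記載", "自動車", "いい"],   # from the slide's example
         ["契約", "保険料", "支払"]]                  # invented filler sentence
inp = ["保健証券等", "記載", "自動車", "いい"]        # 保健 is the conversion error
print(detect_errors(inp, basic))  # ['保健証券等']
```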


  11. The Effect of Paraphrases in Statistical Machine Translation

    Background: In statistical machine translation, translation quality depends mainly on the corpus. Out-of-vocabulary words — words not in the corpus, which appear untranslated — are considered a consequence of data sparseness in the corpus. In related work, Ullman and Nivre [1] paraphrase highly frequent compound nouns; in their result, the BLEU value, a quantitative score, was lowered by the paraphrased corpus. Looking at the graph of token frequency (fig. 1), tokens of frequency 1 occupy the majority. Knowing this, we investigate how much reducing the number of 1-frequency tokens would affect BLEU values.

    Experiment: Instead of deleting the 1-frequency words, we paraphrase 1-frequency verbs following Ullman's method. Paraphrasing low-frequency words into more frequent words not only eliminates the low-frequency words but also makes the paraphrase verb more frequent (fig. 2). The corpus is the KFTT corpus, which consists of 440k training sentences. We paraphrased 200 randomly selected 1-frequency verbs into other, more common verbs. Fig. 3 shows a paraphrasing example:
    "It prevented the enemies from listening." / "It prevented the enemies from eavesdropping."

    For the experimental setup, MOSES [2] is used. Paraphrasing is done in both the training set and the test set. For evaluation, we conducted both a quantitative evaluation (BLEU) and a subjective evaluation. For the subjective evaluation, we rated translations on a 4-point scale: 0 means ungrammatical and not retaining the sense, and 3 the opposite. Fig. 4 shows the result. BLEU drops in the open experiments, matching Ullman's result. The subjective evaluation shows an increase at scale 0 — paraphrasing increased the number of low-quality translations — but also some increase at scale 3.

    1. E. Ullman and J. Nivre, "Paraphrasing Swedish compound nouns in machine translation," EACL 2014, p. 99, 2014.
    2. P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, "Moses: Open source toolkit for statistical machine translation," in Proc. 45th ACL, Companion Volume, pp. 177–180, 2007.
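The corpus treatment above (replace verbs that occur exactly once with a more frequent paraphrase) can be sketched as follows. A minimal sketch under the assumption of a pre-built paraphrase table; the toy corpus is invented.

```python
from collections import Counter

def paraphrase_hapax_verbs(corpus, paraphrase):
    """Replace words that occur exactly once in the corpus (hapaxes) with
    their listed, more frequent paraphrase."""
    counts = Counter(w for sent in corpus for w in sent)

    def sub(w):
        return paraphrase[w] if counts[w] == 1 and w in paraphrase else w

    return [[sub(w) for w in sent] for sent in corpus]

corpus = [["it", "prevented", "the", "enemies", "from", "eavesdropping"],
          ["they", "were", "listening"]]
print(paraphrase_hapax_verbs(corpus, {"eavesdropping": "listening"}))
```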


  12. The Snowman (雪だるま) Project
    Introducing the word analyzer Snowman: on notation control and morpheme connection

    Design policy:
    • Do not force decisions that cannot be made reliably.
    • Tolerate ambiguity.

    The word analyzer Snowman:
    • Notation-control component: merging orthographic variants.
    • Morpheme-connection component: connecting and adding multiple morphemes.
    • Sense-control component: merging synonyms.

    Construction method:
    • Merging orthographic variants: information obtained from various resources; about 25,000 words checked by hand.
    • Connecting and adding multiple morphemes: additions to the dictionary (resources + manual work); rule-based connection.

    Outlook: alleviating data sparseness in statistical methods → to be verified on downstream tasks.

    Pipeline: morphological analysis (UniDic) → notation control → morpheme connection → sense control.
