snlp-jp-2017

namgiH
September 12, 2017

Presentation slides for the 9th SNLP (State-of-the-Art NLP) study group.
The paper covered is Mimicking Word Embeddings using Subword RNNs [Pinter+ 2017], from EMNLP 2017.
The slides may be revised later.

Transcript

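  1. Mimicking Word Embeddings using Subword RNNs

    Yuval Pinter, Robert Guthrie, Jacob Eisenstein
    School of Interactive Computing, Georgia Institute of Technology
    {uvp,rguthrie3,jacobe}@gatech.edu
    Presenter: Han Namgi (D1, Miyao Lab, Department of Informatics, School of Multidisciplinary Sciences, SOKENDAI)
    Note: equations, figures, and charts are, as a rule, taken from the paper or from the authors' EMNLP presentation slides.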
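  2. Research background

    • The out-of-vocabulary (OOV) word problem
      • Learning every word is impossible to begin with
      • Low-frequency words, entity names, newly coined words, typos, etc.
      • The problem is more severe for languages without a large corpus
    • Existing approach: replace such words with a label such as <UNK>
      • Every OOV word is then handled with one and the same vector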
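
The <UNK> fallback described above can be sketched as follows. This is a minimal illustration and not code from the paper: the toy vocabulary, the 2-dimensional vectors, and the `lookup` helper are all hypothetical.

```python
# Toy embedding table; the words and vectors are made up for illustration.
embeddings = {
    "<UNK>": [0.0, 0.0],
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.2],
}


def lookup(word):
    """Return the vector for `word`, falling back to the shared <UNK> vector."""
    return embeddings.get(word, embeddings["<UNK>"])


# Two unrelated OOV words collapse onto the same vector,
# which is the weakness the slide points out.
same_vector = lookup("squishing") == lookup("Eisenstein")
```

With this table, `same_vector` is true even though the two words have nothing in common.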
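  3. Prior work

    • Composing word embeddings from morpheme-level embeddings
      • [Luong+ 2013], [Botha & Blunsom 2014], [Bhatia+ 2016]
      • e.g. v(imperfection) = v(im) + v(perfect) + v(ion)
      • Weak on names, foreign words, etc., because such words have no morphemes or cannot be analyzed
    • Composing word embeddings from character-level embeddings
      • [Kim+ 2016], [Wieting+ 2016]
    • Prior work learns character-level embeddings from raw corpora
    • MIMICK makes use of an already trained model, which simplifies many things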
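
The additive example v(imperfection) = v(im) + v(perfect) + v(ion) can be written out as a short sketch. The morpheme table and its 3-dimensional values are hypothetical, and the cited models differ in how they actually combine morphemes; only the element-wise sum from the slide's example is shown here.

```python
# Hypothetical morpheme vectors (made-up values).
morpheme_vectors = {
    "im": [0.1, -0.2, 0.0],
    "perfect": [0.5, 0.4, 0.3],
    "ion": [0.0, 0.1, -0.1],
}


def compose(morphemes):
    """Sum morpheme vectors element-wise.

    Raises KeyError when a morpheme is missing from the table, which mirrors
    the weakness on names and foreign words noted above.
    """
    vectors = [morpheme_vectors[m] for m in morphemes]
    return [sum(components) for components in zip(*vectors)]


v_imperfection = compose(["im", "perfect", "ion"])
```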
  4. Overview of the experiments

    • Pretrained embedding model: Polyglot [Al-Rfou+ 2013]
    • Nearest-neighbors examinations
      • Commonly used for qualitative analysis of embeddings
    • Stanford RareWord Dataset [Luong+ 2013]
      • A word-similarity evaluation dataset that contains many OOV words
    • POS/Feature tagging for Universal Dependencies [De Marneffe+ 2014]
      • Application to a real NLP task, on 23 languages
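
A nearest-neighbor examination can be sketched as below. The slides do not state which similarity measure is used, so cosine similarity is an assumption here, and the toy table and the helper names `cosine` and `nearest_neighbors` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(query_vec, table, k=2):
    """Return the k words in `table` whose vectors are most similar to `query_vec`."""
    ranked = sorted(table, key=lambda w: cosine(query_vec, table[w]), reverse=True)
    return ranked[:k]


# Made-up 2-dimensional vectors for illustration only.
table = {"cat": [0.9, 0.1], "dog": [0.8, 0.3], "car": [0.1, 0.9]}
neighbors = nearest_neighbors([1.0, 0.2], table, k=2)
```

For a qualitative check, the query vector would be one generated for an OOV word, and the returned in-vocabulary neighbors are inspected by eye.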
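  5. Experiment: Polyglot [Al-Rfou+ 2013]

    • Pretrained embeddings for 137 languages
    • Trained with the model of Bengio+ [2009]
      • Not Word2vec or GloVe: the training method is different
    • Trained on Wikipedia; the 100,000 most frequent words; 64 dimensions
      • All other words are treated as <UNK>
    • Available at https://github.com/aboSamoor/polyglot
    • 1% of the words are sampled at random, and the MIMICK model is trained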
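  6. Experiment: Stanford RareWord [Luong+ 2013]

    • Checks how well low-frequency words are handled
    • Words were selected by their Wikipedia frequency, keeping those that have a synset in WordNet
    • Ten annotators rated similarity on a 0-10 scale; the mean is used as the similarity of the word pair
    • English, 2,034 pairs in total
    • Example pairs (the scores were lost in extraction): squishing / squirt, undated / undatable, circumvents / beat, circumvents / ebb, dispossess / deprive, …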
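
The evaluation procedure above can be sketched as follows: annotator ratings are averaged into a gold score per pair, and the model's similarity scores are compared against the gold scores. Using Spearman rank correlation for that comparison is an assumption, since the slides do not name the metric; all numbers and helper names below are made up for illustration.

```python
def average_rating(ratings):
    """Mean of the annotators' 0-10 ratings for one word pair."""
    return sum(ratings) / len(ratings)


def ranks(values):
    """Ranks starting at 1; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither list is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical ratings for three word pairs and hypothetical model similarities.
gold = [average_rating(r) for r in ([7, 8, 9], [2, 3, 1], [5, 5, 6])]
model = [0.71, 0.12, 0.40]
rho = spearman(gold, model)
```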
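  7. Experiment: Stanford RareWord [Luong+ 2013]

    • VarEmbed [Bhatia+ 2016]
      • A morpheme-level model
      • 128 dimensions, 100,000 words
    • fastText [Bojanowski+ 2016]
      • An ordinary pretrained model
      • 300 dimensions, 2.5 million words
      • Has almost no OOV words to begin with: only 34 pairs contained an OOV word
    • Note on the results table: the column is labeled as vectors, but it is actually the number of words in the training corpus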
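  8. Experiment: Tagging

    • What is Universal Dependencies [De Marneffe+ 2014]?
      • "... an effort to define syntactic structures and tag sets that are consistent across many languages" [Kanayama+ 2015]
      • As of March 2017, 70 treebanks in 50 languages are available
      • Available at https://github.com/UniversalDependencies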
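  9. Experiment: Tagging

    • Embedding settings compared
      • NONE: use Polyglot as is; OOV words also use the UNK embedding that Polyglot provides (= baseline)
      • MIMICK: use Polyglot, but generate the embeddings of OOV words with MIMICK
      • CHAR2TAG: add a model that computes a word embedding from character-level embeddings with a BiRNN, and concatenate its output to NONE
      • BOTH: same as CHAR2TAG, but concatenate to MIMICK instead of NONE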
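
The four settings can be summarized as one lookup function with two optional components. This is only a sketch of the wiring described above: `fake_mimick` and `fake_char_birnn` are hypothetical stand-ins that return dummy vectors, not the trained networks, and the table values are made up.

```python
def word_input(word, polyglot, mimick_fn=None, char_fn=None):
    """Build the tagger's input vector for `word` under one of the four settings."""
    if word in polyglot:
        base = polyglot[word]            # in-vocabulary: pretrained vector
    elif mimick_fn is not None:
        base = mimick_fn(word)           # MIMICK / BOTH: generated vector
    else:
        base = polyglot["<UNK>"]         # NONE / CHAR2TAG: shared UNK vector
    if char_fn is not None:
        return base + char_fn(word)      # CHAR2TAG / BOTH: concatenate
    return base


# Hypothetical stand-ins for the trained models (dummy outputs).
def fake_mimick(word):
    return [len(word) / 10.0, 0.5]


def fake_char_birnn(word):
    return [float(len(word))]


polyglot = {"<UNK>": [0.0, 0.0], "cat": [0.9, 0.1]}

settings = {
    "NONE": word_input("zyx", polyglot),
    "MIMICK": word_input("zyx", polyglot, mimick_fn=fake_mimick),
    "CHAR2TAG": word_input("zyx", polyglot, char_fn=fake_char_birnn),
    "BOTH": word_input("zyx", polyglot, mimick_fn=fake_mimick, char_fn=fake_char_birnn),
}
```

Note that the concatenated settings produce longer input vectors than NONE and MIMICK.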
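  10. Experiment: Tagging

    • CHAR2TAG [Plank+ 2016]
      • That work also includes a byte-level variant
    • The word embedding produced by the BiRNN is concatenated with the embedding obtained from Polyglot, UNK, MIMICK, etc.
      • The number of dimensions naturally increases
      • Computation time also increases: 3x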
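  11. Experiment: Tagging

    • Actual occurrence rate of OOV words
    • The training set is only 10k
    • Bold indicates statistical significance
    • OOV means words that do not appear in the training set but do appear in the evaluation set
    • The harsher the training conditions, the more MIMICK helps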
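
The OOV definition above (absent from the training set, present in the evaluation set) can be computed as in this sketch. Counting by token rather than by word type is an assumption, and the sentences are made up.

```python
def oov_rate(train_tokens, eval_tokens):
    """Fraction of evaluation tokens that never appear in the training tokens."""
    train_vocab = set(train_tokens)
    oov_tokens = [t for t in eval_tokens if t not in train_vocab]
    return len(oov_tokens) / len(eval_tokens)


train = "the cat sat".split()
evaluation = "the dog sat down".split()
rate = oov_rate(train, evaluation)  # "dog" and "down" are OOV
```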
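  12. Personal impressions

    • Strong points
      • It appears to be usable for a wide range of languages
      • It looks useful even when the corpus is small
      • The model itself is not very complex, so it should be easy to apply to other tasks
      • It is evaluated on a downstream task, not only on word-similarity analysis
      • The code and models are properly released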
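  13. Personal impressions

    • Weaker points
      • Syntactic information seems quite important for POS tagging, so the choice of downstream task may be favorable to MIMICK
      • If it is hard for MIMICK to properly capture the meaning of OOV words, there are probably not many downstream tasks that can make use of it
      • There is still room for further experiments, such as using models other than Polyglot, or comparing results when part of the pretrained embeddings is replaced with MIMICK output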
  14. References

    • Pinter, Y., Guthrie, R., & Eisenstein, J. (2017). Mimicking Word Embeddings using Subword RNNs. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 102-112).
    • https://www.cc.gatech.edu/~ypinter3/papers/EMNLP-2017-yuvalpinter.pdf
    • https://github.com/yuvalpinter/Mimick