
IALP2017

gumigumi7
December 11, 2017


Transcript

  1. Analysis of Japanese WSD with Hiragana-Kanji conversion and Context Word Embeddings. Yuki Gumizawa, Kazuhide Yamamoto. Natural Language Processing Lab, Nagaoka University of Technology. IALP 2017, Singapore.
  2. Content: Analysis of Japanese WSD with Hiragana-Kanji conversion and Context Word Embeddings. 1. Introduction 2. Hiragana-Kanji conversion task 3. Our approach to WSD 4. The effect of window size 5. The effect of data size
  5. Introduction • Word sense disambiguation (WSD) is very important for enabling computers to understand the sense of words. • Because of ambiguity, a decline in accuracy occurs in many tasks such as machine translation. Ex.) That new bike is cool. (cool = awesome) Ex.) My ice cream is cool. (cool = not warm, but not cold)
  7. Introduction • The field seems to be slowing down [Raganato et al., 2017] • The lack of groundbreaking improvements • The difficulty of integrating current WSD systems into downstream NLP applications. The system output is not directly useful in downstream NLP applications.
  8. Introduction • WSD systems output a sense id. • Input: My ice cream is cool. • System output: “sense id = 2” • Although this information is indeed useful, it is not easy to utilize it in the following procedures.
  9. Introduction • We proposed a new word sense disambiguation task called the Hiragana-Kanji conversion task [Yamamoto et al., 2016] • The system input is a sentence, and the output is a word, not a sense id. • The task uses the characteristics of the Japanese language.
  11. Introduction • Today, we are talking about • the window size problem in the Japanese WSD task • a new WSD method using word embeddings • changes in accuracy when changing the size of the training data. In Japanese, there has been no research on window size; most studies use the words that appear within two words to the left and right.
  12. Introduction • Today, we are talking about • the window size problem in the Japanese WSD task • a new WSD method using word embeddings and PMI • changes in accuracy when changing the size of the training data.
  13. Introduction • Today, we are talking about • the window size problem in the Japanese WSD task • a new WSD method using word embeddings • changes in accuracy when changing the size of the training data. Because of the cost of building data, there has been no research on how accuracy changes as the training data size increases.
  14. Content: Analysis of Japanese WSD with Hiragana-Kanji conversion and Context Word Embeddings. 1. Introduction 2. Hiragana-Kanji conversion task 3. Our approach to WSD 4. The effect of window size 5. The effect of data size
  16. Hiragana-Kanji conversion task • Japanese has three types of scripts. • ひらがな Hiragana • かたかな Katakana • 漢字 Kanji. Hiragana: characters that are originally Japanese (Ex., あ, い, う)
  17. Hiragana-Kanji conversion task • Japanese has three types of scripts. • ひらがな Hiragana • かたかな Katakana • 漢字 Kanji. Kanji: characters that are originally Chinese (Ex., 漢, 字, 意)
  18. Hiragana-Kanji conversion task • We combine these scripts to assemble sentences in Japanese. I go to Tokyo and I buy a game there. 私は東京に行ってゲームをかう。 (漢字 Kanji: 東京; カタカナ Katakana: ゲーム; ひらがな Hiragana: かう)
  19. Hiragana-Kanji conversion task • Many hiragana words have multiple senses and multiple corresponding Kanji words. I go to Tokyo and I buy a game there. 私は東京に行ってゲームをかう。 かう kau: 買う kau (to buy something), 飼う kau (to have pets). Hiragana: phonographic, ambiguous. Kanji: ideographic, clear sense.
  20. Hiragana-Kanji conversion task • Converting Hiragana words to Kanji words can be regarded as a WSD task. • We get a Kanji word as the WSD result, and we can use it in the following procedures without any changes. かう kau: 買う kau (to buy something), 飼う kau (to have pets). Hiragana: phonographic, ambiguous. Kanji: ideographic, clear sense.
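The framing above can be sketched as a classification problem over kanji candidates; the dictionary and helper below are hypothetical illustrations, not artifacts from the paper.

```python
# Toy sense inventory for the Hiragana-Kanji conversion task: each ambiguous
# hiragana word maps to its kanji candidates, which act as the sense labels.
KANJI_CANDIDATES = {
    "かう": ["買う", "飼う"],  # kau: "to buy" vs. "to keep (a pet)"
}

def sense_inventory(hiragana_word):
    """Return the kanji candidates for a hiragana word; words with no
    registered kanji form are returned unchanged."""
    return KANJI_CANDIDATES.get(hiragana_word, [hiragana_word])

# A WSD system for this task takes a sentence such as
# 私は東京に行ってゲームをかう。 and outputs one candidate, e.g. 買う,
# which downstream applications can use directly, unlike a sense id.
```

Because the output is itself a word, no sense-id-to-surface mapping is needed in the following procedures.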
  21. Hiragana-Kanji conversion task • Advantages of the Hiragana-Kanji conversion task • The ease of making data sets: we can make data sets from a large amount of raw corpus automatically, since converting Kanji to Hiragana is very easy. • The ease of analysis: since we can create data sets freely, we can create training data that matches the problem we want to investigate. Ex.) the window size problem, the data size problem
  24. Content: Analysis of Japanese WSD with Hiragana-Kanji conversion and Context Word Embeddings. 1. Introduction 2. Hiragana-Kanji conversion task 3. Our approach to WSD 4. The effect of window size 5. The effect of data size
  25. Approach • We proposed a method using Pointwise Mutual Information (PMI) last year [Yamamoto et al., 2016]. • We now propose a new method using PMI and word embeddings.
  27. Approach • Sugawara et al. proposed the methods CWE and AVE, which employ word embeddings [Sugawara et al., 2015]. • In CWE, similarity between words is not reflected when they appear at different positions, because each word's position in the window is fixed. [CWE and AVE formulas shown on the slide]
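The two feature constructions can be sketched as follows. This is a minimal reconstruction under assumptions (the function names, the embedding dictionary, and zero-padding for missing positions are illustrative choices, not the paper's exact formulation):

```python
import numpy as np

def cwe_feature(embeddings, tokens, target_idx, window=2):
    """CWE: concatenate the embeddings of the words at fixed offsets around
    the target word, so each window position gets its own feature slot.
    Positions outside the sentence (or unknown words) are zero-padded."""
    dim = len(next(iter(embeddings.values())))
    parts = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # skip the target word itself
        i = target_idx + offset
        if 0 <= i < len(tokens) and tokens[i] in embeddings:
            parts.append(np.asarray(embeddings[tokens[i]], dtype=float))
        else:
            parts.append(np.zeros(dim))
    return np.concatenate(parts)

def ave_feature(embeddings, tokens, target_idx):
    """AVE: average the embeddings of every other word in the sentence,
    ignoring word order and position entirely."""
    dim = len(next(iter(embeddings.values())))
    vecs = [np.asarray(embeddings[t], dtype=float)
            for i, t in enumerate(tokens)
            if i != target_idx and t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```

Because CWE keys each context word to a fixed slot, the same word contributes differently depending on where it appears; AVE discards position, which is what the CWE+AVE combination exploits.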
  28. Approach • We propose the new methods CWE+AVE and CWE+AVE+PMI. • CWE+AVE is a method that adds the average vector (AVE) to the surrounding word vectors used in the CWE method.
  29. Approach • We propose the new methods CWE+AVE and CWE+AVE+PMI. • CWE+AVE+PMI is a method that additionally joins the embedding of the word with the maximum PMI as a feature. • PMI is calculated between the target Kanji and all the words in the sentence.
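The PMI selection step could look like the sketch below; the count-based estimate and the handling of unseen word-kanji pairs are assumptions for illustration, not the paper's exact procedure.

```python
import math

def pmi(cooc, count_w, count_k, total):
    """PMI(w, k) = log( P(w, k) / (P(w) * P(k)) ), estimated from corpus
    counts: cooc = count(w, k), total = number of counting units."""
    if cooc == 0:
        return float("-inf")  # unseen pair: never selected
    return math.log(cooc * total / (count_w * count_k))

def max_pmi_word(tokens, kanji, cooc_counts, word_counts, kanji_counts, total):
    """Return the sentence word with the highest PMI against the target
    kanji; its embedding is then joined to the CWE+AVE feature vector."""
    return max(tokens, key=lambda w: pmi(cooc_counts.get((w, kanji), 0),
                                         word_counts.get(w, 1),
                                         kanji_counts.get(kanji, 1),
                                         total))
```

Unlike the fixed window of CWE, this picks the most strongly associated word anywhere in the sentence, which is what lets distant clue words contribute.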
  30. Content: Analysis of Japanese WSD with Hiragana-Kanji conversion and Context Word Embeddings. 1. Introduction 2. Hiragana-Kanji conversion task 3. Our approach to WSD 4. The effect of window size 5. The effect of data size
  31. Experiment • We used a data set created from the Balanced Corpus of Contemporary Written Japanese (BCCWJ). • 200 sentences per sense (Kanji) were sampled from BCCWJ as training data. • The corpus used for constructing the word embeddings was created from Japanese Wikipedia. • LinearSVC, as implemented in scikit-learn, was used as the classifier.
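The classification setup described above can be sketched as follows, with synthetic feature vectors standing in for the real CWE+AVE+PMI features; the data, dimensionality, and hyperparameters are illustrative assumptions, not the paper's values.

```python
import numpy as np
from sklearn.svm import LinearSVC

# One classifier per ambiguous hiragana word: feature vectors (random
# stand-ins here) labeled with the correct kanji, 200 examples per sense.
rng = np.random.default_rng(0)
X_buy = rng.normal(loc=+1.0, size=(200, 8))   # synthetic "買う" features
X_keep = rng.normal(loc=-1.0, size=(200, 8))  # synthetic "飼う" features
X = np.vstack([X_buy, X_keep])
y = ["買う"] * 200 + ["飼う"] * 200

clf = LinearSVC(C=1.0)
clf.fit(X, y)

# Disambiguating a new occurrence yields a kanji word, not a sense id.
prediction = clf.predict(rng.normal(loc=+1.0, size=(1, 8)))[0]
```

A linear SVM is a reasonable default here because the concatenated embedding features are dense and moderately high-dimensional.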
  32. Experiment • The CWE+AVE method achieved better performance than CWE, AVE, and CWE+BoW. • The CWE+AVE+PMI method provided the best accuracy.
  33. Error Analysis • The CWE method is not capable of deriving the sense “playing” from the musical term “Adagio”. • By adding AVE as a feature, CWE+AVE solved the problem caused by the restriction on word position and derived the correct Kanji.
  34. Error Analysis • By using PMI, the word embedding of the word “idea” is added as a feature. • The CWE+AVE method does not utilize “idea”, which indicates the sense “devote”, at window size 10; nevertheless, this is solved by using PMI, which adds the embedding of “idea”.
  35. Content: Analysis of Japanese WSD with Hiragana-Kanji conversion and Context Word Embeddings. 1. Introduction 2. Hiragana-Kanji conversion task 3. Our approach to WSD 4. The effect of window size 5. The effect of data size
  36. Experiment • We sampled about 5,000 training examples per Kanji word from the text archive of the Japanese Web Corpus 2010. • We used the same test data as in the window size experiment.
  37. Experiment • The graph shows that accuracy changes remarkably as the training data increases. • It also shows very little change once the training data per word exceeds roughly 5,000 examples.
  38. Conclusion • We showed that it is important to consider the words that appear in the whole sentence. • We also showed that the data required to obtain high accuracy increases exponentially. • Future work: developing a WSD tool with higher accuracy.
  39. Appendix • Statistics of BCCWJ • 5,913,714 sentences are contained. • There are 28,467,950 Hiragana words. • BCCWJ contains 438,360 Hiragana words that are targets of the Hiragana-Kanji conversion task. • One target hiragana word appears in about every 13 sentences.
  40. Appendix • Details of the window size experiment • CWE+AVE+PMI achieved about two percentage points higher accuracy than CWE. • Considering a window size of 5 to 7 or more is important for accurate word sense disambiguation.