
IALP2017

gumigumi7
December 11, 2017


Transcript

  1. Analysis of Japanese WSD with Hiragana-Kanji conversion and Context Word Embeddings. Yuki Gumizawa, Kazuhide Yamamoto. Natural Language Processing Lab, Nagaoka University of Technology. IALP 2017, Singapore.
  2. Content: Analysis of Japanese WSD with Hiragana-Kanji conversion and Context Word Embeddings. 1. Introduction 2. Hiragana-Kanji conversion task 3. Our approach to WSD 4. The effect of window size 5. The effect of data size
  5. Introduction • Word sense disambiguation (WSD) is very important for enabling computers to understand the sense of words. • Because of ambiguity, a decline in accuracy occurs in many tasks such as machine translation. Ex.) That new bike is cool. (cool = awesome) Ex.) My ice cream is cool. (cool = not warm, but not cold)
  7. Introduction • The field seems to be slowing down [Raganato et al., 2017] • The lack of groundbreaking improvements • The difficulty of integrating current WSD systems into downstream NLP applications. The system output is not directly useful in downstream NLP applications.
  8. Introduction • WSD systems output a sense id. • Input: My ice cream is cool. • System output: “sense id = 2” • Although this information is indeed useful, it is not easy to utilize it in the following procedures.
  9. Introduction • We proposed a new word sense disambiguation task called the Hiragana-Kanji conversion task [Yamamoto et al., 2016] • The system input is a sentence, and the output is a word, not a sense id. • The task uses the characteristics of the Japanese language.
  11. Introduction • Today, we are talking about • the window size problem in the Japanese WSD task • a new WSD method using word embeddings • changes in accuracy when changing the size of the training data. In Japanese, there has been no research on window size; most studies use the words that appear within two words to the left and right.
  12. Introduction • Today, we are talking about • the window size problem in the Japanese WSD task • a new WSD method using word embeddings and PMI • changes in accuracy when changing the size of the training data.
  13. Introduction • Today, we are talking about • the window size problem in the Japanese WSD task • a new WSD method using word embeddings • changes in accuracy when changing the size of the training data. Because of the cost of building data, there has been no research on how accuracy changes as the training data size increases.
  14. Content: Analysis of Japanese WSD with Hiragana-Kanji conversion and Context Word Embeddings. 1. Introduction 2. Hiragana-Kanji conversion task 3. Our approach to WSD 4. The effect of window size 5. The effect of data size
  16. Hiragana-Kanji conversion task • Japanese has three types of scripts. • ひらがな Hiragana • かたかな Katakana • 漢字 Kanji. Hiragana: characters that are originally Japanese (Ex., あ, い, う)
  17. Hiragana-Kanji conversion task • Japanese has three types of scripts. • ひらがな Hiragana • かたかな Katakana • 漢字 Kanji. Kanji: characters that are originally Chinese (Ex., 漢, 字, 意)
  18. Hiragana-Kanji conversion task • We combine these scripts to assemble sentences in Japanese. I go to Tokyo and I buy a game there. 私は東京に行ってゲームをかう。 (漢字 Kanji: 東京; カタカナ Katakana: ゲーム; ひらがな Hiragana: かう)
  19. Hiragana-Kanji conversion task • Many hiragana words have multiple senses and multiple corresponding Kanji words. I go to Tokyo and I buy a game there. 私は東京に行ってゲームをかう。 かう kau: 買う kau (to buy something), 飼う kau (to have pets). Hiragana: phonographic, ambiguous. Kanji: ideographic, clear sense.
  20. Hiragana-Kanji conversion task • Converting Hiragana words to Kanji words can be regarded as a WSD task. • We get a Kanji word as the WSD result, and we can use it in the following procedures without any changes. かう kau: 買う kau (to buy something), 飼う kau (to have pets). Hiragana: phonographic, ambiguous. Kanji: ideographic, clear sense.
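The framing above can be sketched as a classification problem over kanji candidates; the dictionary and helper below are hypothetical illustrations, not artifacts from the paper.

```python
# Toy sense inventory for the Hiragana-Kanji conversion task: each ambiguous
# hiragana word maps to its kanji candidates, which act as the sense labels.
KANJI_CANDIDATES = {
    "かう": ["買う", "飼う"],  # kau: "to buy" vs. "to keep (a pet)"
}

def sense_inventory(hiragana_word):
    """Return the kanji candidates for a hiragana word; words with no
    registered kanji form are returned unchanged."""
    return KANJI_CANDIDATES.get(hiragana_word, [hiragana_word])

# A WSD system for this task takes a sentence such as
# 私は東京に行ってゲームをかう。 and outputs one candidate, e.g. 買う,
# which downstream applications can use directly, unlike a sense id.
```

Because the output is itself a word, no sense-id-to-surface mapping is needed in the following procedures.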
  21. Hiragana-Kanji conversion task • Advantages of the Hiragana-Kanji conversion task • The ease of making data sets: we can make data sets from a large amount of raw corpus automatically, since converting Kanji to Hiragana is very easy. • The ease of analysis: since we can create data sets freely, we can create training data that matches the problem we want to investigate. Ex.) the window size problem, the data size problem
  24. Content: Analysis of Japanese WSD with Hiragana-Kanji conversion and Context Word Embeddings. 1. Introduction 2. Hiragana-Kanji conversion task 3. Our approach to WSD 4. The effect of window size 5. The effect of data size
  25. Approach • We proposed a method using Pointwise Mutual Information (PMI) last year [Yamamoto et al., 2016]. • We now propose a new method using PMI and word embeddings.
  27. Approach • Sugawara et al. proposed the methods CWE and AVE, which employ word embeddings [Sugawara et al., 2015]. • In CWE, similarity between words is not reflected when they appear at different positions, because each word's position in the window is fixed. [CWE and AVE formulas shown on the slide]
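The two feature constructions can be sketched as follows. This is a minimal reconstruction under assumptions (the function names, the embedding dictionary, and zero-padding for missing positions are illustrative choices, not the paper's exact formulation):

```python
import numpy as np

def cwe_feature(embeddings, tokens, target_idx, window=2):
    """CWE: concatenate the embeddings of the words at fixed offsets around
    the target word, so each window position gets its own feature slot.
    Positions outside the sentence (or unknown words) are zero-padded."""
    dim = len(next(iter(embeddings.values())))
    parts = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # skip the target word itself
        i = target_idx + offset
        if 0 <= i < len(tokens) and tokens[i] in embeddings:
            parts.append(np.asarray(embeddings[tokens[i]], dtype=float))
        else:
            parts.append(np.zeros(dim))
    return np.concatenate(parts)

def ave_feature(embeddings, tokens, target_idx):
    """AVE: average the embeddings of every other word in the sentence,
    ignoring word order and position entirely."""
    dim = len(next(iter(embeddings.values())))
    vecs = [np.asarray(embeddings[t], dtype=float)
            for i, t in enumerate(tokens)
            if i != target_idx and t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```

Because CWE keys each context word to a fixed slot, the same word contributes differently depending on where it appears; AVE discards position, which is what the CWE+AVE combination exploits.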
  28. Approach • We propose the new methods CWE+AVE and CWE+AVE+PMI. • CWE+AVE is a method that adds the average vector (AVE) to the surrounding word vectors used in the CWE method.
  29. Approach • We propose the new methods CWE+AVE and CWE+AVE+PMI. • CWE+AVE+PMI is a method that additionally joins the embedding of the word with the maximum PMI as a feature. • PMI is calculated between the target Kanji and all the words in the sentence.
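The PMI selection step could look like the sketch below; the count-based estimate and the handling of unseen word-kanji pairs are assumptions for illustration, not the paper's exact procedure.

```python
import math

def pmi(cooc, count_w, count_k, total):
    """PMI(w, k) = log( P(w, k) / (P(w) * P(k)) ), estimated from corpus
    counts: cooc = count(w, k), total = number of counting units."""
    if cooc == 0:
        return float("-inf")  # unseen pair: never selected
    return math.log(cooc * total / (count_w * count_k))

def max_pmi_word(tokens, kanji, cooc_counts, word_counts, kanji_counts, total):
    """Return the sentence word with the highest PMI against the target
    kanji; its embedding is then joined to the CWE+AVE feature vector."""
    return max(tokens, key=lambda w: pmi(cooc_counts.get((w, kanji), 0),
                                         word_counts.get(w, 1),
                                         kanji_counts.get(kanji, 1),
                                         total))
```

Unlike the fixed window of CWE, this picks the most strongly associated word anywhere in the sentence, which is what lets distant clue words contribute.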
  30. Content: Analysis of Japanese WSD with Hiragana-Kanji conversion and Context Word Embeddings. 1. Introduction 2. Hiragana-Kanji conversion task 3. Our approach to WSD 4. The effect of window size 5. The effect of data size
  31. Experiment • We used a data set created from the Balanced Corpus of Contemporary Written Japanese (BCCWJ). • 200 sentences per sense (Kanji) were sampled from BCCWJ as training data. • The corpus used for constructing the word embeddings was created from Japanese Wikipedia. • LinearSVC, as implemented in scikit-learn, was used as the classifier.
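The classification setup described above can be sketched as follows, with synthetic feature vectors standing in for the real CWE+AVE+PMI features; the data, dimensionality, and hyperparameters are illustrative assumptions, not the paper's values.

```python
import numpy as np
from sklearn.svm import LinearSVC

# One classifier per ambiguous hiragana word: feature vectors (random
# stand-ins here) labeled with the correct kanji, 200 examples per sense.
rng = np.random.default_rng(0)
X_buy = rng.normal(loc=+1.0, size=(200, 8))   # synthetic "買う" features
X_keep = rng.normal(loc=-1.0, size=(200, 8))  # synthetic "飼う" features
X = np.vstack([X_buy, X_keep])
y = ["買う"] * 200 + ["飼う"] * 200

clf = LinearSVC(C=1.0)
clf.fit(X, y)

# Disambiguating a new occurrence yields a kanji word, not a sense id.
prediction = clf.predict(rng.normal(loc=+1.0, size=(1, 8)))[0]
```

A linear SVM is a reasonable default here because the concatenated embedding features are dense and moderately high-dimensional.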
  32. Experiment • The CWE+AVE method achieved better performance than CWE, AVE, and CWE+BoW. • The CWE+AVE+PMI method provided the best accuracy.
  33. Error Analysis • The CWE method is not capable of deriving the sense “playing” from the musical term “Adagio”. • By adding AVE as a feature, CWE+AVE solved the problem caused by the restriction on word position and derived the correct Kanji.
  34. Error Analysis • By using PMI, the word embedding of the word “idea” is added as a feature. • The CWE+AVE method does not utilize “idea”, which indicates the sense “devote”, at window size 10; nevertheless, this is solved by using PMI, which adds the embedding of “idea”.
  35. Content: Analysis of Japanese WSD with Hiragana-Kanji conversion and Context Word Embeddings. 1. Introduction 2. Hiragana-Kanji conversion task 3. Our approach to WSD 4. The effect of window size 5. The effect of data size
  36. Experiment • We sampled about 5,000 training examples per Kanji word from the text archive of the Japanese Web Corpus 2010. • We used the same test data as in the window size experiment.
  37. Experiment • The graph shows that accuracy changes remarkably as the training data increases. • It also shows very little change once the training data per word exceeds roughly 5,000 examples.
  38. Conclusion • We showed that it is important to consider the words that appear in the whole sentence. • We also showed that the data required to obtain high accuracy increases exponentially. • Future work: developing a WSD tool with higher accuracy.
  39. Appendix • Statistics of BCCWJ • 5,913,714 sentences are contained. • There are 28,467,950 Hiragana words. • BCCWJ contains 438,360 Hiragana words that are targets of the Hiragana-Kanji conversion task. • One target hiragana word appears in about every 13 sentences.
  40. Appendix • Details of the window size experiment • CWE+AVE+PMI achieved about two percentage points higher accuracy than CWE. • Considering a window size of 5 to 7 or more is important for accurate word sense disambiguation.