Word Analyzer SNOWMAN for Japanese
• We constructed the word analyzer SNOWMAN
• SNOWMAN addresses two problems in Japanese:
  • Orthographical variants: words with the same pronunciation and the same sense but different notations (e.g. りんご・リンゴ・林檎)
  • Multiword expressions: a word made up of two or more words (e.g. idioms, nominal verbs)
• Constructing the SNOWMAN dictionary
  • We gathered dictionaries related to the two problems
  • We checked the entries by hand
• We show two comparisons between UniDic and SNOWMAN
  • Merging orthographical variants
    • Synonyms in UniDic are not always orthographical variants
    • UniDic entries are merged while ignoring sense differences
  • Coverage by frequency
    • The difference between UniDic and SNOWMAN is 0.7 points
    • UniDic cannot recognize orthographical variants as the same entry
• Processing pipeline: morphological analysis → merging orthographical variants → connecting multiple morphemes
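The merging step above can be sketched with a small variant dictionary that maps every notation of a word to one canonical form. The entries and function name below are illustrative examples, not data or code from the actual SNOWMAN dictionary.

```python
# Illustrative variant dictionary: every notation maps to one
# canonical form (entries are examples, not SNOWMAN data).
VARIANTS = {
    "リンゴ": "りんご",
    "林檎": "りんご",
}

def normalize(tokens):
    """Replace each token by its canonical orthographical variant."""
    return [VARIANTS.get(t, t) for t in tokens]

print(normalize(["林檎", "を", "食べる"]))  # → ['りんご', 'を', '食べる']
```

With such a table, りんご, リンゴ, and 林檎 are counted as one dictionary entry, which is what UniDic alone cannot do.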
Collecting Semantically Compatible Words
• Purpose: collecting semantically compatible words from existing resources, both automatically and manually
• Motivation: we expect that semantically compatible words can mitigate the data sparseness problem by reducing the number of distinct words
• Semantically compatible words and synonyms are similar; however, compatible words do not have to refer to the same thing or stand in a hyponym–hypernym relation
• Image of semantically compatible words: knowing that Fido is a dog, I will also conclude that he is an animal, that he is not a cat, and that he might or might not be a puppy (Kruszewski et al. 2015)
• Is it a synonym? e.g. Cat ⊂ (Kitty, Meow); Baby, Child, Infant; to assist, to lend a hand

  Classification  Word Types  Concepts
  Hyponymy             1,196       343
  Synonymy            57,178    21,784

• Future work: we will evaluate this resource on some NLP tasks
Proofreading Support by Comparing a Basic Document and a Derived Document
Example:
  保健証券等に記載の自動車をいいます (the car described in the health policy) — content words: 保健証券等 (health policy), 記載 (described), 自動車 (car), いい (indicate)
  保険証券等に記載の自動車をいいます (the car described in the insurance policy) — content words: 保険証券等 (insurance policy), 記載 (described), 自動車 (car), いい (indicate)
Pipeline: input sentence → extract content words → extract the corresponding sentence (against the basic document's content words) → error detection
Purpose: We build a system to support proofreading that compares a basic document (an article, etc.) with a derived document (a pamphlet, etc.).
Method: Content words are extracted from the basic document and from each input sentence of the derived document. The basic-document sentence containing the most of the input's content words is taken as the corresponding sentence. A content word that appears in the input sentence but not in its corresponding sentence is detected as an error word.
Experiment: We made a test set by replacing content words in the basic document, and evaluated the system by extracting the errors in this test set.
Result: Precision of sentence association is 77.7%; recall of error detection is 99.6%.
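The association and error-detection steps can be sketched as follows. `extract_content_words` is a stand-in for a real morphological analyzer (e.g. MeCab); here it simply splits on whitespace, and the sample sentences are invented for illustration.

```python
def extract_content_words(sentence):
    # Placeholder: a real system would use a morphological analyzer.
    return set(sentence.split())

def find_corresponding(input_sent, basic_sents):
    """Pick the basic-document sentence sharing the most content words."""
    words = extract_content_words(input_sent)
    return max(basic_sents, key=lambda s: len(words & extract_content_words(s)))

def detect_errors(input_sent, basic_sents):
    corr = find_corresponding(input_sent, basic_sents)
    # Content words in the input but absent from its corresponding
    # sentence are flagged as error candidates.
    return extract_content_words(input_sent) - extract_content_words(corr)

basic = ["the insurance policy lists the car", "claims are paid monthly"]
print(detect_errors("the health policy lists the car", basic))  # → {'health'}
```

This mirrors the health/insurance example above: "health" appears in the input but not in the associated basic-document sentence, so it is flagged.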
In statistical machine translation, translation quality depends mainly on the corpus. Out-of-vocabulary words (words not in the corpus) appear untranslated, which is considered to be caused by data sparseness in the corpus. In related work, Ullman and Nivre [1] paraphrase highly frequent compound nouns; in their result, the BLEU score, a quantitative measure, dropped with the paraphrased corpus. Looking at the token-frequency graph (fig. 1), 1-frequency tokens occupy the majority. Knowing this, we investigate how much reducing the number of 1-frequency tokens affects the BLEU score.

Experiment
Instead of deleting the 1-frequency words, we paraphrase 1-frequency verbs following Ullman's method. Paraphrasing low-frequency words into more frequent words not only eliminates the low-frequency words but also makes the paraphrase verb more frequent (fig. 2). The corpus is the KFTT corpus, whose training set consists of 440k sentences. We paraphrased 200 randomly selected 1-frequency verbs into other, more common verbs. Fig. 3 shows a paraphrasing example:
  It prevented the enemies from listening . / It prevented the enemies from eavesdropping .
For the experimental setup, MOSES [2] is used. Paraphrasing is applied to both the training set and the test set. For evaluation, we conducted both a quantitative evaluation (BLEU) and a subjective evaluation. The subjective evaluation uses a 4-point scale: 0 means the translation is grammatically incorrect and does not retain the sense, and 3 means the opposite. Fig. 4 shows the result.

(fig. 1: token frequency distribution; fig. 2: paraphrasing illustration; fig. 3: paraphrasing example)

References:
1. E. Ullman and J. Nivre, "Paraphrasing Swedish compound nouns in machine translation," EACL 2014, p. 99, 2014.
2. P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, "Moses: open source toolkit for statistical machine translation," in Proc. 45th ACL, Companion Volume, pp. 177–180, 2007.
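The selection of paraphrase candidates can be sketched by counting token frequencies and picking the hapaxes (1-frequency tokens). The toy corpus and the paraphrase mapping below are illustrative stand-ins, not the KFTT data or Ullman's actual paraphrase pairs.

```python
from collections import Counter

# Toy tokenized corpus standing in for the 440k-sentence KFTT training set.
corpus = [
    "it prevented the enemies from eavesdropping",
    "it prevented the enemies from attacking",
]

# Count token frequencies and collect the 1-frequency tokens (hapaxes).
counts = Counter(tok for sent in corpus for tok in sent.split())
hapaxes = {tok for tok, c in counts.items() if c == 1}
print(sorted(hapaxes))  # → ['attacking', 'eavesdropping']

# Paraphrase a hapax verb to a more frequent one (mapping is illustrative).
PARAPHRASE = {"eavesdropping": "listening"}
paraphrased = [" ".join(PARAPHRASE.get(t, t) for t in s.split()) for s in corpus]
```

In the real setup this replacement is applied to both the training and test sets before running MOSES.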
Result: BLEU drops in the open experiment, consistent with Ullman's result. In the subjective evaluation, scale 0 increases with the paraphrased corpus, meaning more low-quality translations, but scale 3 also shows some increase (fig. 4).