Japanese Orthographical Normalization Does Not Work for Statistical Machine Translation

What are Japanese orthographical variants?
Orthographical variants are different written forms of the same word: they refer to the same word and have the same pronunciation, but their notation differs. For example, "apple" can be written in four ways: “りんご”, “リンゴ”, “林檎”, “苹果”.

Why did we investigate?
About 10% of Japanese words in a corpus have more than one orthographical variant. If the variants are normalized to a single form, the data sparseness problem should be alleviated.

Why does normalization not work?
Our statistics show that a real corpus contains only a few orthographical variants, so the impact of normalization is weak.

Table I. Evaluation of Japanese-to-English translation
              BLEU   RIBES
  Baseline    19.3   66.4
  Normalized  19.7   66.2

Figure I. The number of N-gram types in the Japanese training corpus (x-axis: N-gram order, 1 to 5; y-axis: number of N-gram types, log scale from 1 to 10,000,000; series: Normalized, Baseline)

*All experimental scripts are available at https://github.com/kanjirz50/mt_ialp2016
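The comparison behind Figure I can be illustrated with a minimal sketch: map every variant to one canonical spelling, then count distinct N-gram types before and after. The variant table, the toy corpus, and the function names below are assumptions made for illustration; they are not taken from the authors' scripts.

```python
# Hypothetical variant table: each spelling maps to one canonical form.
VARIANTS = {"リンゴ": "りんご", "林檎": "りんご", "苹果": "りんご"}


def normalize(tokens):
    """Replace each token by its canonical spelling if it has one."""
    return [VARIANTS.get(token, token) for token in tokens]


def ngram_types(sentences, n):
    """Count the distinct n-grams (types) over tokenized sentences."""
    types = set()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            types.add(tuple(tokens[i:i + n]))
    return len(types)


# Toy pre-tokenized corpus (illustrative only).
corpus = [
    ["りんご", "を", "食べる"],
    ["リンゴ", "を", "食べる"],
    ["林檎", "が", "好き"],
]
normalized = [normalize(sentence) for sentence in corpus]

baseline_counts = {n: ngram_types(corpus, n) for n in (1, 2, 3)}
normalized_counts = {n: ngram_types(normalized, n) for n in (1, 2, 3)}
print(baseline_counts)
print(normalized_counts)
```

In this toy corpus every sentence contains a variant, so the type counts drop sharply. In a real corpus, where variants are rare, the two curves stay close together, which is the effect the poster reports.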
Fundamental Tools and Resources are Available for Vietnamese Analysis

We believe that our tools and resources could boost Vietnamese NLP.

Joint Word Segmentation and POS Tagging
Word segmentation and POS tagging are a necessary first step for Vietnamese NLP. The figure shows an example of word segmentation and POS tagging in the web demonstration.

Diacritics Restoration Tool
Words whose diacritics have been dropped are ambiguous. For example, the syllable “cho” has 16 possible notations, such as cho (give), chó (dog), and chờ (wait). Our tool restores the diacritic marks, which is useful as pre-processing for Vietnamese NLP.

Normalization Dictionary and Script
The dictionary normalizes orthographical variants. The script performs Vietnamese Unicode normalization.

Figure: Web demonstration of word segmentation and POS tagging.

All tools and resources are available from https://github.com/kanjirz50/vnlp-outline
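As a minimal sketch of the two text-level operations mentioned above, the following uses Python's standard `unicodedata` module. It assumes that "Vietnamese Unicode normalization" means composing characters into NFC form, and it simulates diacritic dropping to show why restoration is ambiguous. The function names are illustrative and do not come from the vnlp-outline repository.

```python
import unicodedata
from collections import defaultdict


def nfc(text):
    """Compose base letters and combining marks into precomposed form."""
    return unicodedata.normalize("NFC", text)


def strip_diacritics(text):
    """Simulate diacritic-dropped input by removing combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # The letter đ has no canonical decomposition, so map it explicitly.
    return stripped.replace("đ", "d").replace("Đ", "D")


# The same syllable typed with marks in two different orders
# becomes identical after NFC normalization.
typed_a = "cho\u031b\u0300"  # o + horn + grave
typed_b = "cho\u0300\u031b"  # o + grave + horn
print(nfc(typed_a) == nfc(typed_b))

# Several distinct words collapse to one stripped form, which is
# the ambiguity a restoration tool has to resolve.
candidates = defaultdict(list)
for word in ["cho", "chó", "chờ"]:
    candidates[strip_diacritics(word)].append(word)
print(dict(candidates))
```

A restoration tool has to choose among such candidates from context; this sketch only shows where the ambiguity comes from, not how the authors' tool resolves it.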
(IALP2016) Yemane Tedla and Kazuhide Yamamoto, Nagaoka University of Technology

1 The Tigrinya Language
• Semitic
• Native to Eritrea and Ethiopia
• Over 7 million speakers
• Root-template morphology

2 Word alignment problem
- Tigrinya token complexity makes word alignment difficult
  Tigrinya (unsegmented): InItezeyIHatetIkayo
  English (translation): If you did not ask him
  Tigrinya (segmented): InIte zeyI HatetI ka yo
- Alignment is difficult with unsegmented text; segmentation improves alignment

3 Translation system
- Moses translation system with segmented and unsegmented text

4 Effect of word segmentation
- Language model improved
- Perplexity decreased
- BLEU: System-1 = 19.8, System-2 = 20.9
- TER: System-1 = 71.0, System-2 = 72.7

(Figure: tokens and n-grams)
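One rough way to see why segmentation helps alignment is to compare token counts on both sides of the poster's example pair. The sketch below uses only that example and plain whitespace tokenization. The length ratio is an illustrative proxy chosen here, not a measure reported by the authors.

```python
def length_ratio(source, target):
    """Target tokens per source token, using whitespace tokenization."""
    return len(target.split()) / len(source.split())


unsegmented = "InItezeyIHatetIkayo"
segmented = "InIte zeyI HatetI ka yo"
english = "If you did not ask him"

# One unsegmented token must align to six English words, while the
# segmented form spreads the same content over five tokens.
ratio_unsegmented = length_ratio(unsegmented, english)
ratio_segmented = length_ratio(segmented, english)

# Segmentation only inserts boundaries; the surface string is unchanged.
same_surface = "".join(segmented.split()) == unsegmented

print(ratio_unsegmented, ratio_segmented, same_surface)
```

A ratio closer to 1 means the aligner mostly has to find one-to-one links rather than one-to-many links, which is generally an easier problem for statistical word alignment.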
Then we apply discriminative preordering to the augmented constituent tree, in which empty categories are treated as if they were normal lexical symbols. We find that it is effective to filter empty categories based on the confidence of estimation. Our experiments show that, for the IWSLT dataset consisting of short travel conversations, the insertion of empty categories alone improves the BLEU score from [...] to [...] and the RIBES score from [...] to [...], which implies that reordering has improved. For the KFTT dataset consisting of Wikipedia sentences, the proposed preordering method considering empty categories improves the BLEU score from [...] to [...] and the RIBES score from [...] to [...], which shows that both translation and reordering have improved slightly.

Example
  Input: 家 に は 早く 帰る ほう が よい 。
  EC detection (Takeno+2015): (pro)1 (pro)2 家 に は 早く 帰る ほう が よい 。
  Proposal: Reordering (Hoshino+2015): (pro)1 よい が ほう (pro)2 帰る 早く は に 家 。
  Translation: It's better if you come home early.

• The preordering model alleviates the word order problem with ECs (empty categories)
  - Plain insertion of ECs improves translation only slightly because of the word order problem involving ECs
  - Word alignments for ECs are needed to build the model
• Elimination of unreliable ECs refines EC detection
  - The accuracy of structural parsing is insufficient for practical use
  - Cutting low-confidence ECs alleviates the problem
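The confidence-based filtering step can be sketched as follows. This is a minimal illustration, assuming a detector that returns each empty category with an insertion position and a confidence score. The `EmptyCategory` class, the positions, the scores, and the threshold are all hypothetical and are not taken from the authors' system.

```python
from dataclasses import dataclass


@dataclass
class EmptyCategory:
    position: int      # token index before which the EC is inserted
    label: str         # symbol inserted into the sentence
    confidence: float  # detector's confidence (hypothetical scale 0..1)


def insert_reliable_ecs(tokens, ecs, threshold):
    """Insert only the ECs whose confidence reaches the threshold."""
    kept = [ec for ec in ecs if ec.confidence >= threshold]
    out = list(tokens)
    # Insert from right to left so earlier positions stay valid; the
    # stable sort keeps ECs that share a position in their given order.
    for ec in reversed(sorted(kept, key=lambda e: e.position)):
        out.insert(ec.position, ec.label)
    return out


tokens = "家 に は 早く 帰る ほう が よい 。".split()
# Hypothetical detector output for the example sentence.
detected = [
    EmptyCategory(0, "(pro)1", 0.9),
    EmptyCategory(0, "(pro)2", 0.4),
]

all_inserted = insert_reliable_ecs(tokens, detected, threshold=0.0)
filtered = insert_reliable_ecs(tokens, detected, threshold=0.5)
print(" ".join(all_inserted))
print(" ".join(filtered))
```

With the threshold at 0.0 both ECs are inserted, as in the EC detection row of the example. Raising it to 0.5 drops the low-confidence one before preordering sees the sentence.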