Development of a Japanese-English Software Manual Paralell Corpus
Tatsuya Ishisaka, Masao Utiyama, Eiichiro Sumita and Kazuhide Yamamoto. Development of a Japanese-English Software Manual Paralell Corpus. Proceedings of the Twelfth Machine Translation Summit ( MT Summit XII), no page numbers (2009.8)
Japanese-English parallel corpora are scarce. Domains of current corpora are limited. Patent has 18 million parallel sentences. Newspaper has 180 thousand parallel sentences. Goal We develop and publish Large-scale open source parallel corpus. Background
ex.) Command name Format differs from English to Japanese ex.) HTML files and Text file Manuals are being updated The latest original document version may be newer than the translated document version.
Web search engines manually to search for open source software manuals. We searched for Japanese Web pages containing phrases such as ༁ ϓϩδΣΫτ (translation project). If the Web pages are found, we downloaded manuals.
Following example licenses allow redistribution and modification. • MIT License • GNU Free Documentation License • FreeBSD Documentation License • Creative Commons We limited manuals to which such licenses were applied. Target license
We used Utiyama and Isahara’s (2007) alignment method. Overview of the algorithm 1. We translate words, using dictionary. 2. We calculate the similarity based on word overlap. 3. We calculate the maximum similarity sentences, using DP matching.
how errors are handled. ৽͍͠ϑΝΠϧ͕XMLΤσΟλͰ։͖·͢ɻ ˱The new file opens in the XML editor. ϝοηʔδͷHTTPϓϩτίϧόʔδϣϯΛઃఆ͠·͢ɻ ˱ Set the HTTP Protocol version of the message. ը૾ͷϚοτνϟωϧΛઃఆ͠·͢ɻ ˱ Sets the image matte channel. ͜ΕΒͦΕͧΕ௨ৗϢʔβʔͱroot ͷσϑΥϧτύεͰ͢ɻ ˱ That will be a default path for normal and root users respectively.
We simulated a situation where SMT systems helps translators. We only translated from English. Test data is 500 sentences. We extracted from the aligned JF sentences. The highest BLEU score was 44.36.
Japanese- English corpus. But we have to be careful about our experimental result. Reason: Our test sentences might not be representative samples. Sentences were relatively short. Our Japanese word segmenter segmented ASCII words into characters. ex.) “word” was segmented into “w o r d”