Development of a Japanese-English Software Manual Parallel Corpus

Tatsuya Ishisaka, Masao Utiyama, Eiichiro Sumita and Kazuhide Yamamoto. Development of a Japanese-English Software Manual Parallel Corpus. Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), no page numbers (2009.8)



August 31, 2009


  1. Development of a Japanese-English Software Manual Parallel Corpus

    Tatsuya Ishisaka (1), Masao Utiyama (2), Eiichiro Sumita (2), and Kazuhide Yamamoto (1)
    (1) Nagaoka University of Technology (Japan)
    (2) National Institute of Information and Communications Technology, MASTAR Project (Japan)
  2. Background

    • Machine translation (MT) needs parallel corpora.
    • Large-scale Japanese-English parallel corpora are scarce.
    • The domains of current corpora are limited.
      • Patents: 18 million parallel sentences.
      • Newspapers: 180 thousand parallel sentences.
    Goal: We develop and publish a large-scale open-source parallel corpus.
  3. Strategy for making a parallel corpus that can be distributed

    • The Web has many translated documents.
    • Volunteer translators have made them.
    We collected translated documents from the Web: open source software manuals.
      • They are large in scale.
      • The quality of translation is high.
  4. Characteristics of manuals

    • Japanese manuals may contain English (e.g., command names).
    • The format may differ between the English and Japanese versions (e.g., HTML files vs. text files).
    • Manuals are continually updated.
    • The latest version of the original document may be newer than the version that was translated.
  5. Searching for manuals on the Web

    • We manually used Web search engines to search for open source software manuals.
    • We searched for Japanese Web pages containing phrases such as 翻訳プロジェクト (translation project).
    • When such Web pages were found, we downloaded the manuals.
  6. Target license

    We will publish a parallel corpus constructed from manuals.
    The following example licenses allow redistribution and modification:
    • MIT License
    • GNU Free Documentation License
    • FreeBSD Documentation License
    • Creative Commons
    We limited the manuals to those released under such licenses.
  7. Cleaning up documents

    We clean up the documents before alignment.
    1. We deleted HTML tags.
    2. We deleted newlines that break sentences.
      • Each line then contains one sentence.
      • We checked sentences by using simple pattern-matching rules.
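The two clean-up steps above can be sketched as follows. This is a minimal illustration, not the authors' actual rules: the tag regex, the sentence-final punctuation sets, and the function names `strip_tags` and `one_sentence_per_line` are all assumptions made for this example.

```python
import re
from html import unescape
from typing import List

# Assumption: anything between "<" and ">" is an HTML tag.
TAG_RE = re.compile(r"<[^>]+>")

def strip_tags(html: str) -> str:
    """Step 1: delete HTML tags and decode entities such as &amp;."""
    return unescape(TAG_RE.sub("", html))

def one_sentence_per_line(text: str, japanese: bool) -> List[str]:
    """Step 2: remove newlines that break sentences, then split so that
    each returned item holds one sentence."""
    # Japanese text is joined without spaces, English with a single space.
    joiner = "" if japanese else " "
    joined = joiner.join(ln.strip() for ln in text.splitlines() if ln.strip())
    if japanese:
        # Assumed rule: split right after a Japanese sentence-final mark.
        parts = re.split(r"(?<=[。！？])", joined)
    else:
        # Assumed rule: split at whitespace that follows ., ! or ?.
        parts = re.split(r"(?<=[.!?])\s+", joined)
    return [p.strip() for p in parts if p.strip()]

en_html = "<p>This section describes\nhow errors are handled. The new file opens.</p>"
en_sentences = one_sentence_per_line(strip_tags(en_html), japanese=False)

ja_html = "<p>この節では\nエラー処理方法について説明します。新しいファイルが開きます。</p>"
ja_sentences = one_sentence_per_line(strip_tags(ja_html), japanese=True)
```

Simple punctuation rules like these will mis-split abbreviations such as "e.g. foo", which is one way noisy sentences can survive the clean-up.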
  8. Aligning sentences

    We align Japanese sentences with English sentences.
    • We used Utiyama and Isahara's (2007) alignment method.
    • Overview of the algorithm:
      1. We translate words, using a dictionary.
      2. We calculate the similarity of sentence pairs based on word overlap.
      3. We find the sentence alignment with maximum total similarity, using DP matching.
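The three steps in the overview can be illustrated with a small sketch. It is a simplification and not Utiyama and Isahara's actual method: it assumes pre-tokenized sentences, a toy bilingual dictionary, a Dice coefficient as the overlap measure, and one-to-one alignments with skips only. All function names here are hypothetical.

```python
from typing import Dict, List, Set, Tuple

def translate(ja_tokens: List[str], dictionary: Dict[str, List[str]]) -> Set[str]:
    """Step 1: map Japanese words to candidate English words via a dictionary."""
    out: Set[str] = set()
    for tok in ja_tokens:
        out.update(dictionary.get(tok, ()))
    return out

def similarity(ja_tokens, en_tokens, dictionary) -> float:
    """Step 2: word-overlap similarity (Dice coefficient, an assumed choice)."""
    ja_as_en = translate(ja_tokens, dictionary)
    en_set = {w.lower() for w in en_tokens}
    if not ja_as_en or not en_set:
        return 0.0
    return 2.0 * len(ja_as_en & en_set) / (len(ja_as_en) + len(en_set))

def align(ja_sents, en_sents, dictionary) -> List[Tuple[int, int]]:
    """Step 3: DP matching that maximizes total similarity over monotone
    one-to-one alignments; unmatched sentences on either side are skipped."""
    n, m = len(ja_sents), len(en_sents)
    sim = [[similarity(j, e, dictionary) for e in en_sents] for j in ja_sents]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for k in range(1, m + 1):
            best[i][k] = max(best[i - 1][k], best[i][k - 1],
                             best[i - 1][k - 1] + sim[i - 1][k - 1])
    # Trace back from the end to recover the aligned index pairs.
    pairs = []
    i, k = n, m
    while i > 0 and k > 0:
        s = sim[i - 1][k - 1]
        if s > 0 and best[i][k] == best[i - 1][k - 1] + s:
            pairs.append((i - 1, k - 1))
            i, k = i - 1, k - 1
        elif best[i][k] == best[i - 1][k]:
            i -= 1
        else:
            k -= 1
    pairs.reverse()
    return pairs

toy_dict = {"エラー": ["error", "errors"], "説明": ["describes", "explain"],
            "ファイル": ["file"], "開き": ["opens", "open"]}
ja = [["この", "節", "では", "エラー", "処理", "方法", "について", "説明", "し", "ます"],
      ["新しい", "ファイル", "が", "XML", "エディタ", "で", "開き", "ます"]]
en = [["This", "section", "describes", "how", "errors", "are", "handled", "."],
      ["Copyright", "notice", "."],
      ["The", "new", "file", "opens", "in", "the", "XML", "editor", "."]]
pairs = align(ja, en, toy_dict)  # pairs of (Japanese index, English index)
```

In this toy input the English side has an extra sentence with no Japanese counterpart, and the DP skips it rather than forcing a match.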
  9. Number of aligned sentences

    • JF: 120,000
    • PHP: 70,000
    • JM: 40,000
    • NetBeans: 30,000
    • Python: 30,000
    • PEAR: 20,000
    • PostgreSQL: 20,000
    • FreeBSD: 10,000
    • XFree86: 10,000
    • Gentoo Linux: 10,000
    • RFC: 130,000
    The total is approx. 500,000 sentences.
  10. Examples of parallel sentences

    • この節ではエラー処理方法について説明します。
      This section describes how errors are handled.
    • 新しいファイルがXMLエディタで開きます。
      The new file opens in the XML editor.
    • メッセージのHTTPプロトコルバージョンを設定します。
      Set the HTTP Protocol version of the message.
    • 画像のマットチャネルを設定します。
      Sets the image matte channel.
    • これらはそれぞれ通常ユーザーとroot のデフォルトパスです。
      That will be a default path for normal and root users respectively.
  11. Alignment accuracy

    • Over 80% of the sentence alignments were correct.
    • Further improvement is possible, since we failed to clean up some noisy sentences.
    • The alignment accuracy would improve if we removed such noisy sentences.
  12. MT experiments

    Purpose: to verify the usefulness of the corpus for statistical machine translation (SMT).
    • We simulated a situation where SMT systems help translators.
    • We translated only from English to Japanese.
    • The test data are 500 sentences extracted from the aligned JF sentences.
    The highest BLEU score was 44.36.
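For readers unfamiliar with the metric, the sketch below shows how a corpus-level BLEU score on the 0 to 100 scale is computed from clipped n-gram precisions and a brevity penalty. It is a simplified single-reference version for illustration, not the scoring tool used in these experiments; the function names and the toy sentences are assumptions.

```python
import math
from collections import Counter
from typing import List

def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses: List[List[str]], references: List[List[str]],
                max_n: int = 4) -> float:
    """Corpus-level BLEU with one reference per sentence, scaled to 0-100."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            # Counter intersection keeps the minimum count: clipped matches.
            matches[n - 1] += sum((h & r).values())
            totals[n - 1] += sum(h.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing in this sketch
    # Geometric mean of the n-gram precisions.
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for hypotheses shorter than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * bp * math.exp(log_prec)

ref = [["the", "new", "file", "opens", "in", "the", "xml", "editor", "."]]
hyp = [["the", "new", "file", "opens", "in", "an", "xml", "editor", "."]]
perfect = corpus_bleu(ref, ref)
partial = corpus_bleu(hyp, ref)
```

Because the score depends on tokenization, values are only comparable when the hypothesis and reference are segmented the same way.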
  13. Discussion

    • The BLEU score was relatively high for a Japanese-English corpus.
    • But we have to be careful about our experimental result. Reasons:
      • Our test sentences might not be representative samples.
      • The sentences were relatively short.
      • Our Japanese word segmenter segmented ASCII words into characters.
        e.g., “word” was segmented into “w o r d”.
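One way to avoid splitting ASCII words into characters is to separate ASCII runs from the Japanese text before segmentation. The sketch below illustrates that idea only; it is not what the authors did. The function `segment_keep_ascii` is hypothetical, and the default Japanese segmenter here is a stand-in that splits into single characters.

```python
import re
from typing import Callable, List

# Runs of printable, non-space ASCII characters (e.g. "XML", "root").
ASCII_RUN = re.compile(r"[!-~]+")

def segment_keep_ascii(text: str,
                       segment_japanese: Callable[[str], List[str]] = list
                       ) -> List[str]:
    """Keep each ASCII run as one token; pass everything else to the
    Japanese segmenter (stand-in default: one token per character)."""
    tokens: List[str] = []
    pos = 0
    for m in ASCII_RUN.finditer(text):
        chunk = text[pos:m.start()]
        tokens.extend(t for t in segment_japanese(chunk) if t.strip())
        tokens.append(m.group())
        pos = m.end()
    tokens.extend(t for t in segment_japanese(text[pos:]) if t.strip())
    return tokens

tokens = segment_keep_ascii("新しいファイルがXMLエディタで開きます。")
root_tokens = segment_keep_ascii("root のパス")
```

In practice a real morphological analyzer would be passed in as `segment_japanese` in place of the character-level stand-in.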
  14. Conclusion

    • We built a Japanese-English parallel corpus from software manuals.
    • The corpus has approx. 500,000 sentence pairs.
    • The corpus will be available at: / .
  15. Tools used in the MT experiments

    • Moses system (Koehn et al., 2007)
    • GIZA++ (Och and Ney, 2003)
    • SRILM (Stolcke, 2002)
      • We used a 5-gram language model.
    • BLEU (Papineni et al., 2002)