Tatsuya Ishisaka, Masao Utiyama, Eiichiro Sumita and Kazuhide Yamamoto. Development of a Japanese-English Software Manual Parallel Corpus. Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), no page numbers (2009.8)



August 31, 2009


    Tatsuya Ishisaka (1), Masao Utiyama (2), Eiichiro Sumita (2), and Kazuhide Yamamoto (1)
    (1) Nagaoka University of Technology (Japan)
    (2) National Institute of Information and Communications Technology, MASTAR Project (Japan)
  Background
    • Machine translation (MT) needs parallel corpora.
    • Large-scale Japanese-English parallel corpora are scarce.
    • The domains of current corpora are limited.
    • The patent domain has 18 million parallel sentences.
    • The newspaper domain has 180 thousand parallel sentences.
    Goal: We develop and publish a large-scale open source parallel corpus.
  Strategy to make a parallel corpus that can be distributed
    • The Web has many translated documents.
    • Volunteer translators have made them.
    We collected translated documents from the Web: open source software manuals.
        • Large scale
        • The quality of translation is high
  Characteristics of manuals
    • Japanese manuals may contain English.
      ex.) Command names
    • Formats differ between the English and Japanese versions.
      ex.) HTML files and text files
    • Manuals are being updated.
    • The latest original document version may be newer than the translated document version.
  Searching for manuals on the Web
    • We manually used Web search engines to search for open source software manuals.
    • We searched for Japanese Web pages containing phrases such as 翻訳 プロジェクト (translation project).
    • When such Web pages were found, we downloaded the manuals.
  Target license
    We will publish a parallel corpus constructed from manuals.
    The following example licenses allow redistribution and modification:
    • MIT License
    • GNU Free Documentation License
    • FreeBSD Documentation License
    • Creative Commons
    We limited the manuals to those to which such licenses were applied.
  Cleaning up documents
    We clean up documents before alignment.
    1. We deleted HTML tags.
    2. We deleted newlines that break sentences.
        • One line has one sentence.
        • We checked sentences by using simple pattern match rules.
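
The two cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the script used for the corpus: the function name `clean_document` and the regular expressions are assumptions, and the sentence-boundary rule is deliberately simple (it would wrongly split after abbreviations such as "e.g.").

```python
import re
from html import unescape

# Assumed patterns, for illustration only.
TAG_RE = re.compile(r"<[^>]+>")  # any HTML tag
# A sentence ends after ASCII . ! ? followed by whitespace, or after Japanese "。".
SENT_END_RE = re.compile(r"(?<=[.!?])\s+|(?<=。)\s*")


def clean_document(text):
    """Return a list with one sentence per entry."""
    # Step 1: delete HTML tags and decode entities such as &amp;.
    text = unescape(TAG_RE.sub(" ", text))
    # Step 2: delete newlines that break sentences by collapsing all whitespace.
    text = re.sub(r"\s+", " ", text).strip()
    # Split at sentence ends so that one line holds one sentence.
    return [s.strip() for s in SENT_END_RE.split(text) if s.strip()]


if __name__ == "__main__":
    raw = "<p>This section describes\nhow errors are handled. The new file\nopens.</p>"
    for sentence in clean_document(raw):
        print(sentence)
```

Replacing tags with a space keeps adjacent English paragraphs from being glued together; for Japanese text, where words are not separated by spaces, replacing tags with the empty string may be the better choice.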
  Aligning sentences
    We align Japanese sentences and English sentences.
    • We used Utiyama and Isahara's (2007) alignment method.
    • Overview of the algorithm
        1. We translate words, using a dictionary.
        2. We calculate the similarity based on word overlap.
        3. We find the sentence pairs with the maximum similarity, using DP matching.
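
The three steps of the overview can be illustrated with a simplified sketch. It is not Utiyama and Isahara's actual method: it assumes pre-tokenized sentences, uses the Dice coefficient as the word-overlap similarity, and allows only one-to-one matches or skipped sentences. The function names and the dictionary format are hypothetical.

```python
def translate(ja_words, dictionary):
    """Step 1: map Japanese words to English via a dictionary (word -> list of words)."""
    out = set()
    for w in ja_words:
        # Unknown words (e.g. command names left in English) pass through unchanged.
        out.update(x.lower() for x in dictionary.get(w, [w]))
    return out


def similarity(ja_words, en_words, dictionary):
    """Step 2: word-overlap similarity (Dice coefficient, an assumption of this sketch)."""
    a = translate(ja_words, dictionary)
    b = {w.lower() for w in en_words}
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def align(ja_sents, en_sents, dictionary):
    """Step 3: DP matching that maximizes the total similarity of 1-to-1 pairs."""
    n, m = len(ja_sents), len(en_sents)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                sim = similarity(ja_sents[i - 1], en_sents[j - 1], dictionary)
                if sim > 0:  # never pair sentences that share no words
                    candidates.append((score[i - 1][j - 1] + sim, "match"))
            if i > 0:
                candidates.append((score[i - 1][j], "skip_ja"))
            if j > 0:
                candidates.append((score[i][j - 1], "skip_en"))
            score[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
    # Trace back from the end to recover the aligned (ja_index, en_index) pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_ja":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


if __name__ == "__main__":
    toy_dict = {"節": ["section"], "エラー": ["error"], "説明": ["describes"],
                "新しい": ["new"], "ファイル": ["file"], "開く": ["opens"]}
    ja = [["この", "節", "は", "エラー", "処理", "を", "説明"],
          ["新しい", "ファイル", "が", "開く"]]
    en = [["This", "section", "describes", "error", "handling"],
          ["An", "unrelated", "sentence"],
          ["The", "new", "file", "opens"]]
    print(align(ja, en, toy_dict))
```

Letting unknown words pass through unchanged means that English command names embedded in Japanese manuals contribute directly to the overlap.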
  Number of aligned sentences
    • JF             120,000
    • PHP             70,000
    • JM              40,000
    • NetBeans        30,000
    • Python          30,000
    • PEAR            20,000
    • PostgreSQL      20,000
    • FreeBSD         10,000
    • XFree86         10,000
    • Gentoo Linux    10,000
    • RFC            130,000
    Total is approx. 500,000 sentences.
  Example of parallel sentences
    • この節ではエラー処理方法について説明します。
      ⇔ This section describes how errors are handled.
    • 新しいファイルがXMLエディタで開きます。
      ⇔ The new file opens in the XML editor.
    • メッセージのHTTPプロトコルバージョンを設定します。
      ⇔ Set the HTTP Protocol version of the message.
    • 画像のマットチャネルを設定します。
      ⇔ Sets the image matte channel.
    • これらはそれぞれ通常ユーザーとroot のデフォルトパスです。
      ⇔ That will be a default path for normal and root users respectively.
  The alignment accuracy
    • Over 80% of the sentence alignments were correct.
    • Further improvements are possible, since we failed to clean up some noisy sentences.
    • The alignment accuracy would improve if we removed such noisy sentences.
  MT Experiments
    To verify the usefulness of the corpus for SMT:
    • We simulated a situation where SMT systems help translators.
    • We only translated from English.
    • The test data is 500 sentences, which we extracted from the aligned JF sentences.
    The highest BLEU score was 44.36.
  Discussion
    • The BLEU scores were relatively high for a Japanese-English corpus.
    • But we have to be careful about our experimental results.
    Reasons:
        • Our test sentences might not be representative samples.
        • Sentences were relatively short.
        • Our Japanese word segmenter segmented ASCII words into characters.
          ex.) "word" was segmented into "w o r d"
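
One possible reading of the last caveat is that splitting ASCII words into characters adds many short, easy-to-match tokens, which may raise n-gram precision. The sketch below illustrates this on a toy pair; the helper names and the example sentences are hypothetical, and the function computes plain clipped n-gram precision, not the full BLEU score.

```python
from collections import Counter


def split_ascii(tokens):
    """Mimic the segmenter behaviour: ASCII alphabetic words become single characters."""
    out = []
    for t in tokens:
        if t.isascii() and t.isalpha():
            out.extend(t)  # "word" -> "w", "o", "r", "d"
        else:
            out.append(t)
    return out


def ngram_precision(hyp, ref, n):
    """Clipped n-gram precision of a hypothesis against a single reference."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(hyp_ngrams.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_ngrams[g]) for g, count in hyp_ngrams.items())
    return matched / total


if __name__ == "__main__":
    ref = ["php", "の", "関数"]      # toy reference
    hyp = ["php", "の", "メソッド"]  # toy system output with one wrong word
    print(ngram_precision(hyp, ref, 2))                            # word-level bigrams
    print(ngram_precision(split_ascii(hyp), split_ascii(ref), 2))  # after character splitting
```

In this toy pair the bigram precision is 1/2 at the word level and 3/4 after character splitting, even though the translation itself is unchanged.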
  Conclusion
    • A Japanese-English parallel corpus was made from software manuals.
    • The corpus has approx. 500,000 sentence pairs.
    • The corpus will be available at: / .
  Tools for MT Experiments
    • Moses system (Koehn et al., 2007)
    • GIZA++ (Och and Ney, 2003)
    • SRILM (Stolcke, 2002)
        • We used a 5-gram language model.
    • BLEU (Papineni et al., 2002)