Development of a Japanese-English Software Manual Parallel Corpus

1 Development of a Japanese-English Software Manual Parallel Corpus
Tatsuya Ishisaka1, Masao Utiyama2, Eiichiro Sumita2, and Kazuhide Yamamoto1
1 Nagaoka University of Technology (Japan)
2 National Institute of Information and Communications Technology, MASTAR Project (Japan)

2 Background
Machine translation (MT) needs parallel corpora.
Large-scale Japanese-English parallel corpora are scarce, and the domains of current corpora are limited.
• Patents: 18 million parallel sentences
• Newspapers: 180 thousand parallel sentences
Goal: We develop and publish a large-scale open source parallel corpus.

3 Strategy to make a parallel corpus that can be distributed
The Web has many translated documents made by volunteer translators.
We collected translated documents from the Web: open source software manuals.
• Large scale
• The quality of translation is high

4 Characteristics of manuals
Japanese manuals may contain English.
  ex.) Command names
The format differs between the English and Japanese versions.
  ex.) HTML files and text files
Manuals are being updated.
  The latest original document version may be newer than the translated document version.

5 Searching for manuals on the Web
We manually used Web search engines to search for open source software manuals.
We searched for Japanese Web pages containing phrases such as 翻訳 プロジェクト (translation project).
When such Web pages were found, we downloaded the manuals.

6 Target license
We will publish a parallel corpus constructed from manuals.
The following example licenses allow redistribution and modification.
• MIT License
• GNU Free Documentation License
• FreeBSD Documentation License
• Creative Commons
We limited the manuals to those to which such licenses were applied.

7 Cleaning up documents
We clean up documents before alignment.
1. We deleted HTML tags.
2. We deleted newlines that break sentences, so that one line has one sentence.
We checked sentences by using simple pattern match rules.
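The two cleanup steps above can be sketched as follows. This is a minimal illustration, not the scripts used for the corpus: the helper names (`strip_tags`, `one_sentence_per_line`) and the sentence-end rule are assumptions, and the `joiner` argument would need to be an empty string for Japanese text, which has no spaces between words.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Step 1: delete HTML tags, keeping only the text between them."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Assumed rule: a sentence ends at '.', '!' or '?' followed by whitespace,
# or at the Japanese full stop (which needs no following whitespace).
SENTENCE_END = re.compile(r"(?<=[.!?])\s+|(?<=。)\s*")


def one_sentence_per_line(text, joiner=" "):
    """Step 2: join lines broken mid-sentence, then split at sentence ends."""
    joined = re.sub(r"\s*\n\s*", joiner, text).strip()
    return [s.strip() for s in SENTENCE_END.split(joined) if s.strip()]


page = "<p>This section describes\nhow errors are handled. The new file opens\nin the XML editor.</p>"
for sentence in one_sentence_per_line(strip_tags(page)):
    print(sentence)
```

A rule this simple will split at abbreviations such as "e.g. ", which is one way noisy sentences can survive cleanup.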

8 Aligning sentences
We align Japanese sentences and English sentences.
We used Utiyama and Isahara's (2007) alignment method.
Overview of the algorithm
1. We translate words using a dictionary.
2. We calculate the similarity based on word overlap.
3. We find the sentence pairs with maximum similarity using DP matching.
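The three steps can be illustrated with a simplified sketch. It is not Utiyama and Isahara's scoring function: here the similarity is assumed to be a plain Dice coefficient over word sets, the DP allows only 1-to-1 pairs or skipped sentences, the Japanese input is assumed to be already segmented, and the dictionary is a toy example.

```python
def translate(tokens, dictionary):
    """Step 1: map Japanese tokens to English words with a dictionary.
    Tokens missing from the dictionary (e.g. 'XML') pass through lowercased."""
    out = set()
    for tok in tokens:
        out.update(dictionary.get(tok, [tok.lower()]))
    return out


def similarity(ja_tokens, en_tokens, dictionary):
    """Step 2: word-overlap similarity (Dice coefficient, an assumption)."""
    ja = translate(ja_tokens, dictionary)
    en = {w.lower() for w in en_tokens}
    if not ja or not en:
        return 0.0
    return 2 * len(ja & en) / (len(ja) + len(en))


def align(ja_sents, en_sents, dictionary):
    """Step 3: DP matching. Finds the monotone set of 1-to-1 sentence pairs
    with maximum total similarity; either side may skip sentences."""
    n, m = len(ja_sents), len(en_sents)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:
                candidates.append((score[i - 1][j], "skip_ja"))
            if j > 0:
                candidates.append((score[i][j - 1], "skip_en"))
            if i > 0 and j > 0:
                s = similarity(ja_sents[i - 1], en_sents[j - 1], dictionary)
                if s > 0:
                    candidates.append((score[i - 1][j - 1] + s, "pair"))
            score[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
    # Trace back from the end to recover the aligned index pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "pair":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_ja":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


# Toy dictionary and pre-segmented sentences (illustrative only).
toy_dict = {"節": ["section"], "エラー": ["error", "errors"],
            "説明": ["describe", "describes"], "新しい": ["new"],
            "ファイル": ["file"], "開き": ["open", "opens"]}
ja = [["この", "節", "では", "エラー", "処理", "方法", "について", "説明", "し", "ます"],
      ["新しい", "ファイル", "が", "XML", "エディタ", "で", "開き", "ます"]]
en = [["This", "section", "describes", "how", "errors", "are", "handled"],
      ["Copyright", "notice"],
      ["The", "new", "file", "opens", "in", "the", "XML", "editor"]]
print(align(ja, en, toy_dict))
```

In this toy input the unrelated English line has no word overlap with either Japanese sentence, so the DP skips it.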

9 Number of aligned sentences
JF             120,000
PHP             70,000
JM              40,000
NetBeans        30,000
Python          30,000
PEAR            20,000
PostgreSQL      20,000
FreeBSD         10,000
XFree86         10,000
Gentoo Linux    10,000
RFC            130,000
Total is approx. 500,000 sentences

10 Example of parallel sentences
この節ではエラー処理方法について説明します。 ⇔ This section describes how errors are handled.
新しいファイルがXMLエディタで開きます。 ⇔ The new file opens in the XML editor.
メッセージのHTTPプロトコルバージョンを設定します。 ⇔ Set the HTTP Protocol version of the message.
画像のマットチャネルを設定します。 ⇔ Sets the image matte channel.
これらはそれぞれ通常ユーザーとroot のデフォルトパスです。 ⇔ That will be a default path for normal and root users respectively.

11 The alignment accuracy
Over 80% of the sentence alignments were precisely aligned.
Further improvement is possible, since we failed to clean up some noisy sentences.
The alignment accuracy would improve if we removed such noisy sentences.

12 MT Experiments
Purpose: to verify the usefulness of the corpus for SMT.
We simulated a situation where SMT systems help translators.
We only translated from English.
The test data is 500 sentences, which we extracted from the aligned JF sentences.
The highest BLEU score was 44.36.

13 Discussion
The BLEU scores were relatively high for a Japanese-English corpus.
But we have to be careful about our experimental results.
Reasons:
• Our test sentences might not be representative samples.
• Sentences were relatively short.
• Our Japanese word segmenter segmented ASCII words into characters.
  ex.) "word" was segmented into "w o r d"
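Why the segmentation issue calls for caution can be seen in a small sketch of BLEU's clipped n-gram precision. The sentences below are made up for illustration and are not from the experiment, and `split_ascii` is a hypothetical stand-in for the segmenter's behavior. When an ASCII word is split into characters, one matched word contributes several matched higher-order n-grams, which may raise the score.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hyp, ref, n):
    """Modified n-gram precision used in BLEU, for a single sentence pair."""
    hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return matched / total


def split_ascii(tokens):
    """Hypothetical stand-in for the segmenter: ASCII words become characters."""
    out = []
    for tok in tokens:
        out.extend(list(tok) if tok.isascii() and tok.isalpha() else [tok])
    return out


# Made-up hypothesis and reference that share only the prefix "word を".
hyp = ["word", "を", "入力", "する"]
ref = ["word", "を", "指定", "し", "ます"]

print(clipped_precision(hyp, ref, 4))                            # word-level tokens
print(clipped_precision(split_ascii(hyp), split_ascii(ref), 4))  # ASCII split into characters
```

With word-level tokens the only hypothesis 4-gram does not match the reference; after splitting, "w o r d" and "o r d を" both match, so the 4-gram precision rises from 0.0 to 0.5.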

14 Conclusion
A Japanese-English parallel corpus was made from software manuals.
The corpus has approx. 500,000 sentence pairs.
The corpus will be available at: http://www2.nict.go.jp/x/x161/members/mutiyama/

15 Tools for MT Experiments
• Moses system (Koehn et al., 2007)
• GIZA++ (Och and Ney, 2003)
• SRILM (Stolcke, 2002): we used a 5-gram language model
• BLEU (Papineni et al., 2002)

