
Natural Language Processing (12) Machine Translation (2)

自然言語処理研究室

December 05, 2013

Transcript

  1. 1 / 24 Natural Language Processing (12) Machine Translation (2) Kazuhide Yamamoto, Dept. of Electrical Engineering, Nagaoka University of Technology
  2. 2 / 24 Summary of the last week Conventional syntactic/semantic transfer methods and interlingual methods have the following problems: • it is hard (or even impossible) to create rules, or to maintain them as the rule set grows. • there are so many exceptions that it is difficult to describe some phenomena as rules.
  3. 3 / 24 MT evaluation (1) The basic evaluation method is evaluation by humans. • Until the late 1990s, all evaluation was done by humans. Evaluation items include: • readability: the quality of the expression • informativeness: how much of the content is conveyed. However, human evaluation is expensive and time-consuming, which is a problem in the research & development phase, where evaluations must be repeated many times.
  4. 4 / 24 MT evaluation (2) Automatic evaluation methods were proposed more recently (2002); they compute the similarity between the system output and previously prepared reference translation(s). BLEU (BiLingual Evaluation Understudy) • compares the translation output with one or more references using a modified form of n-gram precision (see the sketch below). • is reported to correlate well with human judgment. • has also been criticized.
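
    As an illustration, the following is a minimal sentence-level sketch of BLEU's two ingredients: clipped (modified) n-gram precision and the brevity penalty. It is a toy version; real toolkits work at the corpus level and add smoothing, so treat it as a reading aid rather than a reference implementation.

    import math
    from collections import Counter

    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def bleu(candidate, references, max_n=4):
        log_prec_sum = 0.0
        for n in range(1, max_n + 1):
            cand = ngrams(candidate, n)
            max_ref = Counter()  # per-n-gram maximum count over all references
            for ref in references:
                for ng, c in ngrams(ref, n).items():
                    max_ref[ng] = max(max_ref[ng], c)
            # clipping: a candidate n-gram is credited at most as often
            # as it occurs in some single reference
            clipped = sum(min(c, max_ref[ng]) for ng, c in cand.items())
            total = max(sum(cand.values()), 1)
            log_prec_sum += math.log(clipped / total) if clipped else float("-inf")
        # brevity penalty: punish candidates shorter than the closest reference
        c = len(candidate)
        r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
        bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))
        return bp * math.exp(log_prec_sum / max_n)

    print(bleu("there is a cat on the mat".split(),
               ["there is a cat sitting on the mat".split()]))  # ≈ 0.52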
  5. 5 / 24 Statistical machine translation, SMT • Statistical machine translation – is a translation method proposed in the early 1990s. – replaces words in the source language with the corresponding words in the target language, and – reorders them using statistics of likelihood. • It is a kind of direct translation method.
  6. 6 / 24 SMT : ideas SMT uses the idea of Shannon's noisy channel model. • Idea: a sentence S in the source language goes through a noisy channel and is observed as a sentence T in the target language. • The task is then to infer S from the observed T (see the decomposition below). SMT consists of two modules: a translation model and a language model.
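
    In textbook notation (S the source sentence to be recovered, T the observed target sentence), the noisy-channel decomposition follows from Bayes' rule, since P(T) is constant during the search:

    \hat{S} = \arg\max_S P(S \mid T)
            = \arg\max_S \frac{P(T \mid S)\, P(S)}{P(T)}
            = \arg\max_S \underbrace{P(T \mid S)}_{\text{translation model}} \, \underbrace{P(S)}_{\text{language model}}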
  7. 7 / 24 SMT : features Advantages • makes full use of computing power. • makes full use of corpora. Disadvantages • requires a lot of computation. • performance depends on the size of the corpus; a large parallel corpus is required for better performance. • everything is done by statistics, which makes it difficult to improve performance other than by enlarging the corpus.
  8. 8 / 24 Language model • In 1990, Brown conducted an experiment to recover English word order given the words of a sentence. • The results show that 63% of sentences can be recovered correctly, and 84% can be understood. But what about other languages such as Japanese?
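
    To make this concrete, here is a toy sketch (not Brown's actual setup) of recovering word order with a bigram language model: score every permutation of the given words and keep the most probable one. The probabilities are invented for illustration.

    import itertools

    # hypothetical bigram log-probabilities log P(w2 | w1)
    BIGRAM_LOGP = {
        ("<s>", "the"): -0.5, ("the", "cat"): -1.0,
        ("cat", "sleeps"): -1.2, ("sleeps", "</s>"): -0.3,
    }
    UNSEEN = -10.0  # crude floor for unseen bigrams

    def score(order):
        seq = ["<s>"] + list(order) + ["</s>"]
        return sum(BIGRAM_LOGP.get(pair, UNSEEN) for pair in zip(seq, seq[1:]))

    words = ["cat", "sleeps", "the"]
    print(max(itertools.permutations(words), key=score))
    # -> ('the', 'cat', 'sleeps')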
  9. 9 / 24 Translation model The translation model requires a collection of word-to-word correspondences (one way to learn them is sketched below). • In some cases word-to-phrase or phrase-to-phrase correspondences are required instead, e.g. 中学生 vs. junior high school student. • Moreover, null-to-word alignment (a word on one side with no counterpart on the other) may be necessary, e.g. the particle は.
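
    One classic way to learn such word-to-word correspondences from a parallel corpus is IBM Model 1 trained with EM. Below is a minimal sketch on textbook-style toy data (not data from this lecture); real systems also add a NULL token on one side to capture the null alignments mentioned above.

    from collections import defaultdict

    corpus = [  # toy (English, German) sentence pairs
        ("the house".split(), "das haus".split()),
        ("the book".split(), "das buch".split()),
        ("a book".split(), "ein buch".split()),
    ]

    t = defaultdict(lambda: 0.25)  # t(f | e), uniform initialization

    for _ in range(10):  # EM iterations
        count = defaultdict(float)
        total = defaultdict(float)
        for e_sent, f_sent in corpus:
            for f in f_sent:
                z = sum(t[(f, e)] for e in e_sent)  # E-step normalizer
                for e in e_sent:
                    c = t[(f, e)] / z  # expected alignment count
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():  # M-step: renormalize
            t[(f, e)] = c / total[e]

    print(round(t[("haus", "house")], 2))  # approaches 1.0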
  10. 10 / 24 SMT : translation examples (English input / SMT output / reference translation)

    Example 1
    • English: after nakamaro fell from the position in 764 , the titles were returned to the former names , however , the tang name was often used as another name or an elegant name of the government post .
    • SMT output: 764 年 ( 天平宝字 8 年 ) 、 仲麻呂 は 唐 の 別称 と し て 用い られ る こと が 多 かっ た が 、 名 は 元 に 戻 っ た 後 の 立場 から の 称号 は 、 優美 な 官職 名
    • Reference: 天平 宝字 8 年 ( 764 年 ) 仲麻呂 失脚 後 は 旧 に 戻 さ れ た が 、 その 後 も 官職 の 別名 ・ 雅称 と し て 用い られ る こと が 多 かっ た 。
    Example 2
    • English: various ryoge no kan ( posts outside the original ritsuryo code created by imperial edicts ) , which were created from the late nara period to the heian period , had tang names .
    • SMT output: 本来 の 律令 で は 、 奈良 時代 後期 から 平安 時代 に 創作 さ れ た 勅令 に よ っ て 様々 な 令外 官 ( を 作 っ て 、 唐 の 名 。
    • Reference: 奈良 時代 後半 から 平安 時代 に かけ て 生 じ た 様々 な 令外 官 に つ い て も 、 唐名 が つけ られ た 。
    Example 3
    • English: ( the emperor saga who set the posts of ryoge no kan such as kurodo no to [ head chamberlain ] and kebiishi [ officials with judicial and police powers ] also idolized the tang culture . )
    • SMT output: 嵯峨 天皇 の 蔵人 頭 に 設置 さ れ た 令外 官 の 役職 で 、 蔵人 頭 ( 司法 ・ 警察 権 も 唐 の 文化 の idolized 〕 〕 ( 検非 違使 など の 官人 。
    • Reference: ( 蔵人 頭 ・ 検非違使 など の 令外 官 を 置 い た 嵯峨 天皇 も 唐風 文化 の 心酔 者 で あ っ た 。 )
    Example 4
    • English: these tang names do not completely match the office organizations originated in chinese successive dynasties , so that they cannot be always replaced by one-to-one correspondence .
    • SMT output: これ は 、 中国 の 歴代 王朝 に 対応 でき な い と い う もの で あ り 、 唐 と は 全 く な い 職制 に 取 っ て 代わ っ た one-to-one 。
    • Reference: これ ら の 唐名 は 、 本家 中国 歴代 王朝 の 職制 と 完全 に 一致 する わけ で は な い ため 、 必ず しも 一 対 一 で 置換 が でき る もの で は な い 。
    Example 5
    • English: therefore , some organizations overlapped in using the tang name ; to the contrary , there were many cases in which a single organization had multiple tang names .
    • SMT output: その ため 、 組織 で あ る が 、 逆 に 、 唐 の 組織 で あ る 場合 が 多 く 、 唐 の 名 を 一 名 。
    • Reference: その ため いく つ か の 職 に お い て は 重複 する もの あ り 、 逆 に ひと つ の 職 に 対 し 複数 の 唐名 が あ る もの も 少な く な い 。
  11. 11 / 24 SMT : problems • The larger, the better: a huge parallel corpus with word-to-word alignment is necessary. • The current model uses n-gram statistics for the language model, which captures only local information. • Frequent patterns get priority, while infrequent patterns are difficult to translate correctly. • It is difficult to build human wisdom into the engine. Some researchers are skeptical of statistical MT since it has many problems to solve, but it attracts more attention as time goes on.
  12. 12 / 24 Example-based machine translation Example-based MT was proposed by Professor Nagao of Kyoto University in 1984. • Imagine how a human translates a sentence. Do we use rules for translation? Statistics? Anything else? • The basic idea of example-based machine translation is that when we translate a sentence, we retrieve somewhat similar sentences from memory (= the brain) and modify them to fit the given sentence.
  13. 13 / 24 Example-based MT : mechanism Example-based MT performs a process similar to the one explained in the next slides: • segments an input expression into several short phrases, • looks for similar phrases for each phrase, and • transforms and combines phrases according to the similar phrases.
  14. 14 / 24 Translation by instances : example Input (Chinese): 「下午可能会下雨」 Instances: • 明年可能会去北京 / 来年は北京に行くかもしれません • 下午 / 午後 • 下雨 / 雨が降る Even though we do not understand Chinese, we can translate the input sentence by the following procedure (sketched in code below): (1) choose the first instance, 「来年は北京に行くかもしれません」; (2) replace the words in red with their corresponding translations, again by using the instances: 「午後は雨が降るかもしれません」.
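
    A minimal sketch of the substitution step, with the phrase alignments written out by hand (a real example-based system would induce them from its instance database):

    # toy data from this slide: the base instance's translation, plus the
    # sub-phrase replacements licensed by the two smaller instances
    BASE_TRANSLATION = "来年は北京に行くかもしれません"  # from 明年可能会去北京
    SUBSTITUTIONS = [
        ("来年", "午後"),            # 明年 -> 下午, via the instance 下午/午後
        ("北京に行く", "雨が降る"),  # 去北京 -> 下雨, via the instance 下雨/雨が降る
    ]

    def translate_by_instance(base: str) -> str:
        for old, new in SUBSTITUTIONS:
            base = base.replace(old, new)
        return base

    print(translate_by_instance(BASE_TRANSLATION))
    # -> 午後は雨が降るかもしれません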
  15. 15 / 24 Example: 「東京の会議」 X の Y → Y' of X' (京都のツアー、講演の日付、...), Y' in X' (京都の会議、...), Y' on X' (電話の会議、...), Y' for X' (ホテルの登録、...) Although "の" has many possible translations, it can be translated correctly if we imitate the translation of an instance similar to the input 「東京の会議」. In this case 「京都の会議」 is chosen as the most similar instance, so the input is translated as "meeting IN Tokyo".
  16. 16 / 24 Rules vs. examples: what's different? [Figure: one diagram shows an input surrounded by the regions of rule 1, rule 2, and rule 3; the other shows the same input surrounded by instance 1, instance 2, and instance 3.] In rule-based MT we define the regions where each rule is applied, while in example-based MT the regions are defined automatically by the instances.
  17. 17 / 24 Example-based MT: problems Many problems remain: • instances: – How many examples should be provided? • thesaurus: – What type of thesaurus is good? – How do we cope when unknown words are given? • essential question: – Can we translate by similarity alone?
  18. 18 / 24 Speech translation • Speech translation takes speech as input and outputs speech in a different language. • Several organizations in Japan, including NICT, NEC, Panasonic, and universities such as NAIST, are working on this.
  19. 19 / 24 Modules in speech translation Theoretically, three modules are necessary to achieve speech translation: • speech recognition – given speech, produces its transcription. • machine translation – converts the text into a different language. • speech synthesis – speaks the translation. However, simply pipelining them (as sketched below) is not enough.
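
    A minimal sketch of the naive pipeline this slide warns about; all three functions are hypothetical toy stand-ins. The point is that each stage sees only the previous stage's single best output, so a recognition error can never be repaired downstream and prosody is lost between stages.

    def recognize(audio: bytes) -> str:
        return "こんにちは"  # toy stand-in for a speech recognizer

    def translate(text: str) -> str:
        return {"こんにちは": "hello"}.get(text, text)  # toy stand-in for MT

    def synthesize(text: str) -> bytes:
        return text.encode("utf-8")  # toy stand-in for a speech synthesizer

    def naive_speech_translation(audio: bytes) -> bytes:
        # one-way flow: no feedback from translation back to recognition
        return synthesize(translate(recognize(audio)))

    print(naive_speech_translation(b""))  # b'hello'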
  20. 20 / 24 Speech recognition • Speech recognition is the most difficult of the three processes in speech translation. • It also takes processing time. When the speech contains noise, determining the words essentially requires language processing even at this early stage. Humans use the situation, linguistic knowledge, facial expressions, prosody, and so on to disambiguate speech into words, but current speech recognizers use no such information at all.
  21. 21 / 24 Machine Translation • This module requires skill in translating spoken language, whose inputs tend to be shorter but depend heavily on context. • Real-time processing is required: no one can wait ten seconds in a conversation. • Giving up on a translation is not allowed. • Neither pre-editing nor post-editing is possible, which is a big difference from automatic document translation.
  22. 22 / 24 Speech synthesis Speech synthesis is thought to be easier than the other two processes, since the output is given to a human, who adapts to the machine's unnaturalness.
  23. 23 / 24 Other difficulties in speech translation (1) use of phonetic information, e.g. 「ももももも」 and 「明日映画行く」 (2) deciding when to start translating (3) recognizing whether the hearer has understood the utterance
  24. 24 / 24 Summary: today's key words • evaluation of MT • statistical MT • example-based MT • spoken language translation (speech MT)