
Natural Language Processing (12) Machine Translation (2)

自然言語処理研究室

December 05, 2013

Transcript

  1. 1 / 24 Natural Language Processing (12) Machine Translation (2) Kazuhide Yamamoto, Dept. of Electrical Engineering, Nagaoka University of Technology
  2. 2 / 24 Summary of the last week Conventional syntactic/semantic transfer methods and interlingual methods have the following problems: • it is hard (or even impossible) to create rules, or to maintain them as the rule set grows. • there are so many exceptions that it is difficult to describe some phenomena as rules.
  3. 3 / 24 MT evaluation (1) The basic evaluation method is evaluation by humans. • Until the late 1990s, all evaluation was done by humans. Evaluation items include: • readability: the quality of the expression • informativeness: how much of the content is conveyed. However, human evaluation is expensive and time-consuming, which is a problem in the research & development phase, where evaluations must be repeated many times.
  4. 4 / 24 MT evaluation (2) Automatic evaluation methods were proposed more recently (2002); they compute the similarity between the system output and previously prepared reference translation(s). BLEU (BiLingual Evaluation Understudy) • compares the translation output with one or more references using a modified form of n-gram precision (see the sketch below). • is reported to correlate well with human judgment. • has also been criticized.
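
    As an illustration, the following is a minimal sentence-level sketch of BLEU's two ingredients: clipped (modified) n-gram precision and the brevity penalty. It is a toy version; real toolkits work at the corpus level and add smoothing, so treat it as a reading aid rather than a reference implementation.

    import math
    from collections import Counter

    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def bleu(candidate, references, max_n=4):
        log_prec_sum = 0.0
        for n in range(1, max_n + 1):
            cand = ngrams(candidate, n)
            max_ref = Counter()  # per-n-gram maximum count over all references
            for ref in references:
                for ng, c in ngrams(ref, n).items():
                    max_ref[ng] = max(max_ref[ng], c)
            # clipping: a candidate n-gram is credited at most as often
            # as it occurs in some single reference
            clipped = sum(min(c, max_ref[ng]) for ng, c in cand.items())
            total = max(sum(cand.values()), 1)
            log_prec_sum += math.log(clipped / total) if clipped else float("-inf")
        # brevity penalty: punish candidates shorter than the closest reference
        c = len(candidate)
        r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
        bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))
        return bp * math.exp(log_prec_sum / max_n)

    print(bleu("there is a cat on the mat".split(),
               ["there is a cat sitting on the mat".split()]))  # ≈ 0.52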
  5. 5 / 24 Statistical machine translation, SMT • Statistical machine translation – is a translation method proposed in the early 1990s. – replaces words in the source language with the corresponding words in the target language, and – reorders them using statistics of likelihood. • It is a kind of direct translation method.
  6. 6 / 24 SMT : ideas SMT uses the idea of Shannon's noisy channel model. • Idea: a sentence S in the source language goes through a noisy channel and is observed as a sentence T in the target language. • The task is then to infer S from the observed T (see the decomposition below). SMT consists of two modules: a translation model and a language model.
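
    In textbook notation (S the source sentence to be recovered, T the observed target sentence), the noisy-channel decomposition follows from Bayes' rule, since P(T) is constant during the search:

    \hat{S} = \arg\max_S P(S \mid T)
            = \arg\max_S \frac{P(T \mid S)\, P(S)}{P(T)}
            = \arg\max_S \underbrace{P(T \mid S)}_{\text{translation model}} \, \underbrace{P(S)}_{\text{language model}}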
  7. 7 / 24 SMT : features Advantages • makes full use of computing power. • makes full use of corpora. Disadvantages • requires a lot of computation. • performance depends on the size of the corpus; a large parallel corpus is required for better performance. • everything is done by statistics, which makes it difficult to improve performance other than by enlarging the corpus.
  8. 8 / 24 Language model • In 1990, Brown conducted an experiment to recover English word order given the words of a sentence. • The results show that 63% of sentences can be recovered correctly, and 84% can be understood. But what about other languages such as Japanese?
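
    To make this concrete, here is a toy sketch (not Brown's actual setup) of recovering word order with a bigram language model: score every permutation of the given words and keep the most probable one. The probabilities are invented for illustration.

    import itertools

    # hypothetical bigram log-probabilities log P(w2 | w1)
    BIGRAM_LOGP = {
        ("<s>", "the"): -0.5, ("the", "cat"): -1.0,
        ("cat", "sleeps"): -1.2, ("sleeps", "</s>"): -0.3,
    }
    UNSEEN = -10.0  # crude floor for unseen bigrams

    def score(order):
        seq = ["<s>"] + list(order) + ["</s>"]
        return sum(BIGRAM_LOGP.get(pair, UNSEEN) for pair in zip(seq, seq[1:]))

    words = ["cat", "sleeps", "the"]
    print(max(itertools.permutations(words), key=score))
    # -> ('the', 'cat', 'sleeps')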
  9. 9 / 24 Translation model The translation model requires a collection of word-to-word correspondences (one way to learn them is sketched below). • In some cases word-to-phrase or phrase-to-phrase correspondences are required instead, e.g. 中学生 vs. junior high school student. • Moreover, null-to-word alignment (a word on one side with no counterpart on the other) may be necessary, e.g. the particle は.
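
    One classic way to learn such word-to-word correspondences from a parallel corpus is IBM Model 1 trained with EM. Below is a minimal sketch on textbook-style toy data (not data from this lecture); real systems also add a NULL token on one side to capture the null alignments mentioned above.

    from collections import defaultdict

    corpus = [  # toy (English, German) sentence pairs
        ("the house".split(), "das haus".split()),
        ("the book".split(), "das buch".split()),
        ("a book".split(), "ein buch".split()),
    ]

    t = defaultdict(lambda: 0.25)  # t(f | e), uniform initialization

    for _ in range(10):  # EM iterations
        count = defaultdict(float)
        total = defaultdict(float)
        for e_sent, f_sent in corpus:
            for f in f_sent:
                z = sum(t[(f, e)] for e in e_sent)  # E-step normalizer
                for e in e_sent:
                    c = t[(f, e)] / z  # expected alignment count
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():  # M-step: renormalize
            t[(f, e)] = c / total[e]

    print(round(t[("haus", "house")], 2))  # approaches 1.0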
  10. 10 / 24 SMT : translation examples (English input / SMT output / reference translation)

    Example 1
    • English: after nakamaro fell from the position in 764 , the titles were returned to the former names , however , the tang name was often used as another name or an elegant name of the government post .
    • SMT output: 764 年 ( 天平宝字 8 年 ) 、 仲麻呂 は 唐 の 別称 と し て 用い られ る こと が 多 かっ た が 、 名 は 元 に 戻 っ た 後 の 立場 から の 称号 は 、 優美 な 官職 名
    • Reference: 天平 宝字 8 年 ( 764 年 ) 仲麻呂 失脚 後 は 旧 に 戻 さ れ た が 、 その 後 も 官職 の 別名 ・ 雅称 と し て 用い られ る こと が 多 かっ た 。
    Example 2
    • English: various ryoge no kan ( posts outside the original ritsuryo code created by imperial edicts ) , which were created from the late nara period to the heian period , had tang names .
    • SMT output: 本来 の 律令 で は 、 奈良 時代 後期 から 平安 時代 に 創作 さ れ た 勅令 に よ っ て 様々 な 令外 官 ( を 作 っ て 、 唐 の 名 。
    • Reference: 奈良 時代 後半 から 平安 時代 に かけ て 生 じ た 様々 な 令外 官 に つ い て も 、 唐名 が つけ られ た 。
    Example 3
    • English: ( the emperor saga who set the posts of ryoge no kan such as kurodo no to [ head chamberlain ] and kebiishi [ officials with judicial and police powers ] also idolized the tang culture . )
    • SMT output: 嵯峨 天皇 の 蔵人 頭 に 設置 さ れ た 令外 官 の 役職 で 、 蔵人 頭 ( 司法 ・ 警察 権 も 唐 の 文化 の idolized 〕 〕 ( 検非 違使 など の 官人 。
    • Reference: ( 蔵人 頭 ・ 検非違使 など の 令外 官 を 置 い た 嵯峨 天皇 も 唐風 文化 の 心酔 者 で あ っ た 。 )
    Example 4
    • English: these tang names do not completely match the office organizations originated in chinese successive dynasties , so that they cannot be always replaced by one-to-one correspondence .
    • SMT output: これ は 、 中国 の 歴代 王朝 に 対応 でき な い と い う もの で あ り 、 唐 と は 全 く な い 職制 に 取 っ て 代わ っ た one-to-one 。
    • Reference: これ ら の 唐名 は 、 本家 中国 歴代 王朝 の 職制 と 完全 に 一致 する わけ で は な い ため 、 必ず しも 一 対 一 で 置換 が でき る もの で は な い 。
    Example 5
    • English: therefore , some organizations overlapped in using the tang name ; to the contrary , there were many cases in which a single organization had multiple tang names .
    • SMT output: その ため 、 組織 で あ る が 、 逆 に 、 唐 の 組織 で あ る 場合 が 多 く 、 唐 の 名 を 一 名 。
    • Reference: その ため いく つ か の 職 に お い て は 重複 する もの あ り 、 逆 に ひと つ の 職 に 対 し 複数 の 唐名 が あ る もの も 少な く な い 。
  11. 11 / 24 SMT : problems • The larger, the better: a huge parallel corpus with word-to-word alignment is necessary. • The current model uses n-gram statistics for the language model, which captures only local information. • Frequent patterns get priority, while infrequent patterns are difficult to translate correctly. • It is difficult to build human wisdom into the engine. Some researchers are skeptical of statistical MT since it has many problems to solve, but it attracts more attention as time goes on.
  12. 12 / 24 Example-based machine translation Example-based MT was proposed by Professor Nagao of Kyoto University in 1984. • Imagine how a human translates a sentence. Do we use rules for translation? Statistics? Anything else? • The basic idea of example-based machine translation is that when we translate a sentence, we retrieve somewhat similar sentences from memory (= the brain) and modify them to fit the given sentence.
  13. 13 / 24 Example-based MT : mechanism Example-based MT performs a process similar to the one explained in the next slides: • segments an input expression into several short phrases, • looks for similar phrases for each phrase, and • transforms and combines phrases according to the similar phrases.
  14. 14 / 24 Translation by instances : example Input (Chinese): 「下午可能会下雨」 Instances: • 明年可能会去北京 / 来年は北京に行くかもしれません • 下午 / 午後 • 下雨 / 雨が降る Even though we do not understand Chinese, we can translate the input sentence by the following procedure (sketched in code below): (1) choose the first instance, 「来年は北京に行くかもしれません」; (2) replace the words in red with their corresponding translations, again by using the instances: 「午後は雨が降るかもしれません」.
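
    A minimal sketch of the substitution step, with the phrase alignments written out by hand (a real example-based system would induce them from its instance database):

    # toy data from this slide: the base instance's translation, plus the
    # sub-phrase replacements licensed by the two smaller instances
    BASE_TRANSLATION = "来年は北京に行くかもしれません"  # from 明年可能会去北京
    SUBSTITUTIONS = [
        ("来年", "午後"),            # 明年 -> 下午, via the instance 下午/午後
        ("北京に行く", "雨が降る"),  # 去北京 -> 下雨, via the instance 下雨/雨が降る
    ]

    def translate_by_instance(base: str) -> str:
        for old, new in SUBSTITUTIONS:
            base = base.replace(old, new)
        return base

    print(translate_by_instance(BASE_TRANSLATION))
    # -> 午後は雨が降るかもしれません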
  15. 15 / 24 Example: 「東京の会議」 X の Y → Y' of X' (京都のツアー、講演の日付、...), Y' in X' (京都の会議、...), Y' on X' (電話の会議、...), Y' for X' (ホテルの登録、...) Although "の" has many possible translations, it can be translated correctly if we imitate the translation of an instance similar to the input 「東京の会議」. In this case 「京都の会議」 is chosen as the most similar instance, so the input is translated as "meeting IN Tokyo".
  16. 16 / 24 Rules vs. examples: what's different? [Figure: one diagram shows an input surrounded by the regions of rule 1, rule 2, and rule 3; the other shows the same input surrounded by instance 1, instance 2, and instance 3.] In rule-based MT we define the regions where each rule is applied, while in example-based MT the regions are defined automatically by the instances.
  17. 17 / 24 Example-based MT: problems Many problems remain: • instances: – How many examples should be provided? • thesaurus: – What type of thesaurus is good? – How do we cope when unknown words are given? • essential question: – Can we translate by similarity alone?
  18. 18 / 24 Speech translation • Speech translation takes speech as input and outputs speech in a different language. • Several organizations in Japan, including NICT, NEC, Panasonic, and universities such as NAIST, are working on this.
  19. 19 / 24 Modules in speech translation Theoretically, three modules are necessary to achieve speech translation: • speech recognition – given speech, produces its transcription. • machine translation – converts the text into a different language. • speech synthesis – speaks the translation. However, simply pipelining them (as sketched below) is not enough.
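
    A minimal sketch of the naive pipeline this slide warns about; all three functions are hypothetical toy stand-ins. The point is that each stage sees only the previous stage's single best output, so a recognition error can never be repaired downstream and prosody is lost between stages.

    def recognize(audio: bytes) -> str:
        return "こんにちは"  # toy stand-in for a speech recognizer

    def translate(text: str) -> str:
        return {"こんにちは": "hello"}.get(text, text)  # toy stand-in for MT

    def synthesize(text: str) -> bytes:
        return text.encode("utf-8")  # toy stand-in for a speech synthesizer

    def naive_speech_translation(audio: bytes) -> bytes:
        # one-way flow: no feedback from translation back to recognition
        return synthesize(translate(recognize(audio)))

    print(naive_speech_translation(b""))  # b'hello'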
  20. 20 / 24 Speech recognition • Speech recognition is the most difficult of the three processes in speech translation. • It also takes processing time. When the speech contains noise, determining the words essentially requires language processing even at this early stage. Humans use the situation, linguistic knowledge, facial expressions, prosody, and so on to disambiguate speech into words, but current speech recognizers use no such information at all.
  21. 21 / 24 Machine Translation • This module requires skill in translating spoken language, whose inputs tend to be shorter but depend heavily on context. • Real-time processing is required: no one can wait ten seconds in a conversation. • Giving up on a translation is not allowed. • Neither pre-editing nor post-editing is possible, which is a big difference from automatic document translation.
  22. 22 / 24 Speech synthesis Speech synthesis is thought to be easier than the other two processes, since the output is given to a human, who adapts to the machine's unnaturalness.
  23. 23 / 24 Other difficulties in speech translation (1) use of phonetic information, e.g. 「ももももも」 and 「明日映画行く」 (2) deciding when to start translating (3) recognizing whether the hearer has understood the utterance
  24. 24 / 24 Summary: today's key words • evaluation of MT • statistical MT • example-based MT • spoken language translation (speech MT)