2.3.3 BLEU

BLEU evaluates the n-gram matching rate between r (reference) and e (translated text). ☆ N-gram positions are ignored.

Let $c_n(e) = |\{g_n(e)\}|$ be the number of n-grams of e, and $m_n(r, e) = |\{g_n(e)\} \cap \{g_n(r)\}|$ the number of n-grams shared by the reference and the translated text.

BLEU takes the geometric mean of the match rates from 1-gram to 4-gram:

$$\mathrm{BLEU}(R, E) = BP(R, E) \cdot \left( \prod_{n=1}^{4} \frac{\sum_i m_n(\{r_i^{(1)}, \ldots, r_i^{(M)}\}, e_i)}{\sum_i c_n(e_i)} \right)^{1/4} \quad (2.9)$$

If you have M references per e, choose the maximum match count:

$$m_n(\{r^{(1)}, \ldots, r^{(M)}\}, e) = \max\left( m_n(r^{(1)}, e),\ m_n(r^{(2)}, e),\ \ldots,\ m_n(r^{(M)}, e) \right)$$
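A minimal Python sketch of Eq. (2.9), assuming whitespace-tokenized sentences; the function names (`ngrams`, `match_count`, `bleu`) are illustrative, not from the text:

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list (positions are ignored)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def match_count(ref, hyp, n):
    """m_n(r, e): number of n-grams shared by reference and translation,
    counting each n-gram at most as often as it occurs in the reference."""
    r_counts = ngrams(ref, n)
    return sum(min(c, r_counts[g]) for g, c in ngrams(hyp, n).items())

def bleu(refs_per_sent, hyps, max_n=4, bp=1.0):
    """Corpus-level BLEU per Eq. (2.9): geometric mean of the n-gram
    match rates for n = 1..4, multiplied by the brevity penalty.
    refs_per_sent: for each sentence, a list of reference token lists."""
    score = 1.0
    for n in range(1, max_n + 1):
        # With M references, take the maximum match count over references.
        m = sum(max(match_count(r, e, n) for r in refs)
                for refs, e in zip(refs_per_sent, hyps))
        c = sum(max(len(e) - n + 1, 0) for e in hyps)
        score *= m / c
    return bp * score ** (1.0 / max_n)
```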
2.3.3 BLEU example

r1 : I 'd like to stay there for five nights , from June sixth .
r2 : I want to stay for five nights , from June sixth .
e : Well , I 'd like to stay five nights beginning June sixth .

Matched n-grams $m_n$ and n-gram counts $c_n$ of e (n: n-gram order):

$$m_1 = 11,\ m_2 = 7,\ m_3 = 4,\ m_4 = 2 \qquad c_1 = 13,\ c_2 = 12,\ c_3 = 11,\ c_4 = 10$$

Since $|r_2| < |e| < |r_1|$, the length is accepted and $BP(R, E) = 1$. Then by Eq. (2.9):

$$\mathrm{BLEU}(R, E) = \left( \frac{11}{13} \cdot \frac{7}{12} \cdot \frac{4}{11} \cdot \frac{2}{10} \right)^{1/4} \cdot BP(R, E) \approx 0.4353 \cdot 1$$
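The worked example can be checked with the sketch above (the tokenization, with 'd and punctuation as separate tokens, follows the slide):

```python
r1 = "I 'd like to stay there for five nights , from June sixth .".split()
r2 = "I want to stay for five nights , from June sixth .".split()
e  = "Well , I 'd like to stay five nights beginning June sixth .".split()

print(bleu([[r1, r2]], [e]))                  # ≈ 0.4353
print((11/13 * 7/12 * 4/11 * 2/10) ** 0.25)   # same value from the counts
```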
2.3.3 brevity penalty

$$BP(R, E) = \min\left( 1,\ \exp\left( 1 - \frac{\sum_{i=1}^{N} |\tilde{r}^{(i)}|}{\sum_{i=1}^{N} |e^{(i)}|} \right) \right) \quad (2.10)$$

※ $\tilde{r}^{(i)}$ : choose the reference whose length is closest to that of e (the shorter one on ties).

- If $\sum_i |e^{(i)}| \ll \sum_i |\tilde{r}^{(i)}|$, then $BP(R, E) \approx 0$.
- If $\sum_i |e^{(i)}| > \sum_i |\tilde{r}^{(i)}|$, then $BP(R, E) = 1$.
- If $\sum_i |e^{(i)}| \approx \sum_i |\tilde{r}^{(i)}|$, then $BP(R, E) \approx 1$.

BP penalizes translated text that is too short relative to the reference.
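A sketch of Eq. (2.10), reusing `r1`, `r2`, and `e` from the example above and assuming ties on length are broken toward the shorter reference, as the note says:

```python
import math

def brevity_penalty(refs_per_sent, hyps):
    """BP per Eq. (2.10): for each sentence pick the reference whose
    length is closest to the translation's (the shorter one on ties)."""
    def closest_len(refs, e):
        # min over (distance, length) tuples picks the shorter reference on ties
        return min((abs(len(r) - len(e)), len(r)) for r in refs)[1]
    r_total = sum(closest_len(refs, e) for refs, e in zip(refs_per_sent, hyps))
    e_total = sum(len(e) for e in hyps)
    return min(1.0, math.exp(1.0 - r_total / e_total))

print(brevity_penalty([[r1, r2]], [e]))   # 1.0, since |r2| = 12 <= |e| = 13
```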
2.3.4 METEOR

There are problems with using BLEU naively (※ ref → 134):

- Lack of recall: the brevity penalty does not adequately compensate for the lack of recall [Lavie 2004]. Explicit word matching is required.
- Fluency and grammaticality are measured only indirectly, by geometric averaging, and geometric averaging results in a score of zero whenever one of the component n-gram scores is zero.

METEOR (Metric for Evaluation of Translation with Explicit Ordering) addresses these problems.
2.3.4 METEOR

To make word matching explicit, METEOR takes an alignment between r and e.

Ex)
r : I 'd like to stay there for five nights , from June sixth .   (14 words)
e : Well , I 'd like to stay five nights beginning June sixth .   (13 words)
→ 11 alignments

Precision P = (number of words aligned) / (number of words in e) = 11/13
Recall R = (number of words aligned) / (number of words in r) = 11/14

F-measure:

$$F_\alpha = \frac{P \cdot R}{\alpha P + (1 - \alpha) R} \quad (2.11)$$

(if α = 0.5, this is the harmonic mean of P and R)
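A one-function sketch of Eq. (2.11) as reconstructed above; the name `meteor_f` is illustrative:

```python
def meteor_f(aligned, len_e, len_r, alpha=0.5):
    """F-measure per Eq. (2.11); alpha = 0.5 gives the harmonic mean."""
    p = aligned / len_e   # precision
    r = aligned / len_r   # recall
    return p * r / (alpha * p + (1 - alpha) * r)

print(meteor_f(11, 13, 14))   # ≈ 0.815 for the example alignment
```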
2.3.4 METEOR

FP : fragmentation penalty, computed from the number of groups ("chunks") of sequential aligned words:

$$FP = \gamma \left( \frac{\#\mathrm{chunks}}{\#\mathrm{aligned\ words}} \right)^{\beta}, \qquad \mathrm{METEOR} = (1 - FP) \cdot F_\alpha \quad (2.12)$$

Ex)
r : I 'd like to stay there for five nights , from June sixth .
e : Well , I 'd like to stay five nights beginning June sixth .
In this example, the aligned words form four groups of sequential words, labeled (1)-(4).

Summary of METEOR
・High precision and high recall are desirable.
・FP favors alignments whose matched words form a few long contiguous chunks.
・It is necessary to tune the hyperparameters α, β, γ.
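A sketch combining the F-measure with the fragmentation penalty. The values β = 3 and γ = 0.5 are commonly cited METEOR defaults, assumed here rather than taken from the slide:

```python
def meteor_score(aligned, chunks, len_e, len_r,
                 alpha=0.5, beta=3.0, gamma=0.5):
    """METEOR sketch: F-measure discounted by the fragmentation penalty
    FP = gamma * (chunks / aligned) ** beta.  beta = 3 and gamma = 0.5
    are assumed defaults, not values given in the text."""
    fp = gamma * (chunks / aligned) ** beta
    return (1.0 - fp) * meteor_f(aligned, len_e, len_r, alpha)

print(meteor_score(11, 4, 13, 14))   # 11 aligned words in 4 chunks
```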
2.3.5 RIBES

There are problems with using BLEU naively for scoring Japanese-to-English translation (※ ref → 111).

RIBES (Rank-based Intuitive Bilingual Evaluation Score) addresses this problem.

http://www.researchgate.net/profile/Katsuhito_Sudoh/publication/221012636_Automatic_Evaluation_of_Translation_Quality_for_Distant_Language_Pairs/links/00b4952d8d9f8ab140000000.pdf
2.3.5 RIBES

Translation between distant language pairs requires extreme reordering, so RIBES scores word order with a rank correlation coefficient.

Ex)
r : I 'd like to stay there for five nights , from June sixth .
    (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14)   ← position numbers
e : Well , I 'd like to stay five nights beginning June sixth .

Aligning each word of e to its position in r gives (10) (1) (2) (3) (4) (5) (8) (9) (12) (13) (14); converting these positions to ranks gives the rank vector = 8, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11.

Scoring uses a rank correlation coefficient over this rank vector:
- Spearman's ρ
- Kendall's τ
The coefficient is taken as the score.
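A sketch of the rank-vector construction, assuming each matched word of e aligns to its first occurrence in r (real RIBES alignment is more careful about duplicated words):

```python
def rank_vector(r, e):
    """Map each word of e to its (first) position in r, skipping words
    that do not occur in r, then convert the positions to ranks."""
    positions = [r.index(w) for w in e if w in r]
    order = sorted(positions)
    return [order.index(p) + 1 for p in positions]

r = "I 'd like to stay there for five nights , from June sixth .".split()
e = "Well , I 'd like to stay five nights beginning June sixth .".split()
print(rank_vector(r, e))   # [8, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11]
```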
2.3.5 RIBES

Rank correlation coefficients of the rank vector (of length n) aligned by r and e:

$$\rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)} \quad \text{(Spearman)} \qquad \tau = \frac{\#\text{concordant pairs} - \#\text{discordant pairs}}{n(n-1)/2} \quad \text{(Kendall)}$$

where $d_i$ is the difference between the i-th rank and i. The coefficient is combined with unigram precision P and the brevity penalty, with $\mathrm{NKT} = (\tau + 1)/2$ the normalized Kendall's τ:

$$\mathrm{RIBES}(r, e) = \mathrm{NKT}(r, e) \cdot P(r, e)^{\alpha} \cdot BP(r, e)^{\beta} \quad (2.15)$$

$BP(r, e) \approx 1$ is better, and $\tau(r, e) = 1$ is desirable (∵ ρ, τ ≤ 1).

Summary of RIBES
・A rank correlation coefficient is useful for evaluating translation between distant language pairs.
・The Spearman score is almost equal to the Kendall score.
・It is necessary to tune the hyperparameters α, β.
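Both coefficients can be computed directly from the rank vector; on the example they come out almost identical, as the summary notes:

```python
def spearman(ranks):
    """Spearman's rho of a rank vector against the identity order 1..n."""
    n = len(ranks)
    d2 = sum((rk - (i + 1)) ** 2 for i, rk in enumerate(ranks))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def kendall(ranks):
    """Kendall's tau: (concordant - discordant pairs) / (n(n-1)/2)."""
    n = len(ranks)
    conc = sum(1 if ranks[i] < ranks[j] else -1
               for i in range(n) for j in range(i + 1, n))
    return conc / (n * (n - 1) / 2)

rv = [8, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11]
print(spearman(rv), kendall(rv))   # both ≈ 0.745: almost equal here
```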
2.4 Statistical Testing

How can we judge which evaluation is the best?

Problem
- The score may differ under another system or other evaluators.
- Our test resources (data, human evaluators) are limited.

→ Statistical testing: calculate a confidence interval.
"You can get a score that lies outside the confidence interval with probability p."
Bootstrapping

Make N test sets from the whole corpus as below.

Ex) From 200 texts, randomly choose 100 texts for each set:
    1st: 100 texts, 2nd: 100 texts, …, Nth: 100 texts

Feed each test set to the statistical machine translation system and get the scores s_1, s_2, …, s_N.

After sorting the s_i in ascending order, delete the extreme scores; the range of the remaining scores gives the confidence interval.
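A sketch of the resampling loop, assuming `metric` scores a list of texts and returns a float; `random.choices` samples with replacement, the usual bootstrap choice, though the slide only says "choose randomly":

```python
import random

def bootstrap_interval(texts, metric, n_sets=1000, set_size=100, trim=0.025):
    """Draw N test sets of size set_size, score each, sort the scores,
    and delete the extremes; the remaining range is the interval."""
    scores = sorted(metric(random.choices(texts, k=set_size))
                    for _ in range(n_sets))
    k = int(len(scores) * trim)        # e.g. trim 2.5% on each side
    return scores[k], scores[-k - 1]   # ≈ a 95% confidence interval
```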
References

[Lavie 2004] Lavie, Alon, Kenji Sagae, and Shyamsundar Jayaraman. "The significance of recall in automatic metrics for MT evaluation." Machine Translation: From Real Users to Research. Springer Berlin Heidelberg, 2004. 134-143.

111) Isozaki, Hideki, et al. "Automatic evaluation of translation quality for distant language pairs." Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2010.

134) Lavie, Alon, and Michael J. Denkowski. "The METEOR metric for automatic evaluation of machine translation." Machine Translation 23.2-3 (2009): 105-115.