
「機械翻訳」 Chapter 2: Automatic Evaluation of Machine Translation and Statistical Testing

Slides from a machine translation study group held in 2015 at the (former) Matsumoto lab, NAIST.

Seitaro Shinagawa

August 12, 2020


Transcript

  1. 2.3.3 BLEU
    2.3.4 METEOR
    2.3.5 RIBES
    2.3.6 Meta Evaluation
    2.4 Statistical Testing
    MT study / May 14 , 2015
    Seitaro Shinagawa , AHC-lab
    機械翻訳 Chapter 2


  2. 2.3.3 BLEU
    Evaluate the matching rate of n-grams between r (reference) and e (translated text).
    ☆ N-gram positions are ignored.
    c_n(e) = |{n-grams of e}| : the number of n-grams in e
    m_n(r, e) = |{n-grams of r} ∩ {n-grams of e}| : the number of n-gram matches between the reference and the translated text
    Calculate a geometric mean over the 1-gram to 4-gram matching rates:
    BLEU(R, E) = BP(R, E) \cdot \left( \prod_{n=1}^{4} \frac{\sum_i m_n(\{r_1^{(i)}, \ldots, r_M^{(i)}\}, e^{(i)})}{\sum_i c_n(e^{(i)})} \right)^{1/4}    (2.9)
    BP(R, E) : brevity penalty
    If you have M references per e, choose the maximum match count:
    m_n(\{r_1, \ldots, r_M\}, e) = \max\left( m_n(r_1, e), m_n(r_2, e), \ldots, m_n(r_M, e) \right)


  3. 2.3.3 BLEU example
    r1 : I ‘d like to stay there for five nights , from June sixth .
    r2 : I want to stay for five nights , from June sixth .
    e  : Well , I ‘d like to stay five nights beginning June sixth .
    n-gram match counts (n : n-gram order, n = 1, 2, 3, 4):
    m_n(r1, e) : 11  7  4  2
    m_n(r2, e) :  9  4  1  0
    c_n(e)     : 13 12 11 10
    |r2| < |e| < |r1|, so the chosen reference is \tilde{r} = r2 and BP(R, E) = 1 (accepted, no penalty).
    Plugging into (2.9):
    BLEU(R, E) = ( 11/13 · 7/12 · 4/11 · 2/10 )^{1/4} · 1 ≅ 0.4353 · 1

  4. 2.3.3 brevity penalty
    BP(R, E) = \min\left( 1, \exp\left( 1 - \frac{\sum_{i=1}^{N} |\tilde{r}^{(i)}|}{\sum_{i=1}^{N} |e^{(i)}|} \right) \right)    (2.10)
    ※ \tilde{r}^{(i)} : the reference whose length is closest to e^{(i)} (choosing the shorter one in a tie).
    If \sum_i |e^{(i)}| \ge \sum_i |\tilde{r}^{(i)}| , then BP(R, E) = 1.
    If \sum_i |e^{(i)}| \ll \sum_i |\tilde{r}^{(i)}| , then BP(R, E) \to 0.
    BP penalizes a translated text that is too short compared with the reference.
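A small sketch of (2.10), again assuming tokenized sentences; `closest_ref_len` is an illustrative helper implementing the selection rule on this slide (closest length to e, the shorter one on ties).

```python
from math import exp

def closest_ref_len(refs, e):
    """Length of the reference closest in length to e (the shorter one on ties)."""
    return min((abs(len(r) - len(e)), len(r)) for r in refs)[1]

def brevity_penalty(refs_list, hyps):
    """BP(R, E) per (2.10)."""
    r_total = sum(closest_ref_len(refs, e) for refs, e in zip(refs_list, hyps))
    e_total = sum(len(e) for e in hyps)
    return min(1.0, exp(1.0 - r_total / e_total))
```

On the example of the previous slide, `closest_ref_len` picks r2 (12 tokens) and e has 13 tokens, so BP = 1.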


  5. 2.3.4 METEOR
    There are problems with using BLEU naively (※ ref -> 134):
    ・Lack of recall: the brevity penalty does not adequately compensate for the lack of recall. [Lavie 2004]
    ・Fluency and grammaticality are only measured indirectly: explicit word matching is required.
    ・Geometric averaging: the geometric average becomes zero whenever one of the component n-gram scores is zero.
    METEOR (Metric for Evaluation of Translation with Explicit Ordering) addresses these problems.


  6. 2.3.4 METEOR
    For explicit word matching, take an alignment between r and e.
    Ex)
    r : I ‘d like to stay there for five nights , from June sixth .    (14 words)
    e : Well , I ‘d like to stay five nights beginning June sixth .    (13 words)
    11 alignments between r and e.
    P = (the number of words aligned) / (the number of words in e) = 11 / 13
    R = (the number of words aligned) / (the number of words in r) = 11 / 14
    F-measure:  F_\alpha = \frac{P \cdot R}{\alpha P + (1 - \alpha) R}    (2.11)
    (the harmonic mean of P and R if \alpha = 0.5)
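A minimal sketch of (2.11), assuming it is the standard weighted harmonic mean used by METEOR; the alignment itself is taken as given, and `meteor_fmean` is an illustrative name.

```python
def meteor_fmean(aligned, e_len, r_len, alpha=0.5):
    """Weighted harmonic mean of precision and recall, as in (2.11)."""
    p = aligned / e_len   # precision: aligned words / words in e
    r = aligned / r_len   # recall:    aligned words / words in r
    return p * r / (alpha * p + (1 - alpha) * r)

# Example from the slide: 11 alignments, |e| = 13, |r| = 14
print(round(meteor_fmean(11, 13, 14), 4))  # -> about 0.8148 with alpha = 0.5
```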


  7. The harmonic mean is desirable for METEOR (2.11):
    both high precision and high recall are essential.


  8. 2.3.4 METEOR
    Extend (2.11) with a fragmentation penalty Pen.
    ch : the number of groups (chunks) of sequential aligned words
    Ex)
    r : I ‘d like to stay there for five nights , from June sixth .
    e : Well , I ‘d like to stay five nights beginning June sixth .
    The aligned words of e form 4 chunks: (1) (2) (3) (4).
    Summary of METEOR
    ・High precision and high recall are desirable.
    ・The fragmentation penalty favors translations whose matched words form a few long contiguous chunks.
    ・It is necessary to tune the hyperparameters α, β, γ.
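A sketch of the full score under the standard METEOR formulation, in which the penalty Pen = γ·(chunks/matches)^β discounts the F-measure; the book's exact parameterization is not recoverable from the slide, so treat the constants and the function name as placeholders.

```python
def meteor_score(aligned, chunks, e_len, r_len, alpha=0.5, beta=3.0, gamma=0.5):
    """METEOR-style score: F-measure discounted by a fragmentation penalty.
    alpha, beta, gamma are the hyperparameters that must be tuned."""
    p = aligned / e_len
    r = aligned / r_len
    f_mean = p * r / (alpha * p + (1 - alpha) * r)
    penalty = gamma * (chunks / aligned) ** beta   # fewer, longer chunks -> smaller penalty
    return f_mean * (1 - penalty)

# Example from the slide: 11 aligned words in 4 chunks, |e| = 13, |r| = 14
print(round(meteor_score(11, 4, 13, 14), 4))  # -> about 0.795 with these illustrative defaults
```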


  9. 2.3.5 RIBES
    For scoring Japanese-to-English translation, there are problems with using BLEU naively. (※ ref -> 111)
    RIBES (Rank-based Intuitive Bilingual Evaluation Score) addresses this problem.
    http://www.researchgate.net/profile/Katsuhito_Sudoh/publication/221012636_Automatic_Evaluation_of_Translation_Quality_for_Distant_Language_Pairs/links/00b4952d8d9f8ab140000000.pdf


  10. 2.3.5 RIBES
    Translation between distant language pairs involves drastic word reordering,
    so the word order is scored with a rank correlation coefficient.
    Ex)
    r : I ‘d like to stay there for five nights , from June sixth .
        Position number : (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14)
    e : Well , I ‘d like to stay five nights beginning June sixth .
        Aligned by r    :     (10) (1) (2) (3) (4) (5) (8) (9)     (12) (13) (14)
    Rank vector = 8 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 9 , 10 , 11
    Rank correlation coefficients: Spearman's ρ, Kendall's τ
    Consider the coefficients as the score.


  11. If the rank vector H = (h_1, h_2, ⋯, h_|H|) = (8, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11) is given …
    Spearman's ρ : calculate the distance between H and (1, 2, ⋯, |H|)    (2.13)
        ρ = 1 - \frac{6 \sum_i (h_i - i)^2}{|H| (|H|^2 - 1)}
    Kendall's τ : for each pair i < j, count ○ if h_i < h_j (ascending), × otherwise    (2.14)
        τ = \frac{(\text{number of ○ pairs}) - (\text{number of × pairs})}{|H| (|H| - 1) / 2}
    h_i \ h_j | 8  1  2  3  4
        8     |    ×  ×  ×  ×
        1     |       ○  ○  ○
        2     |          ○  ○
        3     |             ○
        4     |
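A small sketch of (2.13) and (2.14) applied to the rank vector from the slide (plain Python, no libraries, no tied ranks); the normalized variants used by RIBES would map these coefficients into [0, 1].

```python
from itertools import combinations

def spearman_rho(h):
    """Spearman's rho between h and the identity ranking (1, 2, ..., |h|)."""
    n = len(h)
    d2 = sum((h_i - i) ** 2 for i, h_i in enumerate(h, start=1))
    return 1 - 6 * d2 / (n * (n * n - 1))

def kendall_tau(h):
    """Kendall's tau: (concordant pairs - discordant pairs) / all pairs."""
    pairs = list(combinations(h, 2))
    concordant = sum(1 for a, b in pairs if a < b)
    return (2 * concordant - len(pairs)) / len(pairs)

h = [8, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11]
print(spearman_rho(h), kendall_tau(h))  # both about 0.745 for this vector
```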


  12. RIBES combines the rank correlation coefficient (Spearman's ρ or Kendall's τ), computed from the
    rank vectors aligned by r and e, with a brevity penalty.    (2.15)
    RIBES(r, e) = 1 is desirable (∵ ρ, τ ≤ 1).
    Summary of RIBES
    ・A rank correlation coefficient is useful for evaluating bilingual translation.
    ・The Spearman score is almost equal to the Kendall score (ρ ≅ τ).
    ・It is necessary to tune the hyperparameters α, β.


  13. 2.3.6 Meta Evaluation of Automatic Evaluation
    A good automatic evaluation correlates with human evaluation.
    Human evaluation     : x_1 , x_2 , x_3 , x_4 , ⋯ , x_{S-1} , x_S
    Automatic evaluation : y_1 , y_2 , y_3 , y_4 , ⋯ , y_{S-1} , y_S
    Assuming that the score samples (x_s, y_s), s = 1, 2, ⋯, S are given,
    calculate the Pearson product-moment correlation coefficient:
    ρ(x, y) = \frac{\sum_s (x_s - \bar{x})(y_s - \bar{y})}{\sqrt{\sum_s (x_s - \bar{x})^2} \sqrt{\sum_s (y_s - \bar{y})^2}}    (2.19)
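A sketch of (2.19) over paired human/automatic scores; the numbers in the usage line are purely illustrative.

```python
def pearson(xs, ys):
    """Pearson product-moment correlation coefficient, as in (2.19)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical human scores x_s and automatic scores y_s for S systems
print(pearson([3.2, 2.5, 4.1, 3.8], [0.31, 0.22, 0.45, 0.37]))
```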


  14. 2.4 Statistical Testing
    How can we judge which evaluation is the best?
    ・The score may differ with a different system or different evaluators.
    ・Our test resources (data, human evaluators) are limited.
    → a statistical testing problem.
    Calculate a confidence interval:
    "You get a score outside the confidence interval only with probability p."


  15. Bootstrapping
    Make N test sets from the whole texts as below.
    Ex) From 200 texts, choose 100 texts randomly for each test set (1st, 2nd, ⋯, Nth).
    Run the statistical machine translation system on each test set and get the scores s_1, s_2, ⋯, s_N.
    After sorting the scores in ascending order, delete the extreme scores:
    drop N·p/2 scores from each end; the remaining range [ s_{N·p/2 + 1} , s_{N - N·p/2} ] is the confidence interval.
    Assuming p = 0.05 and N = 1000, delete 50 scores (25 from each end).
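A minimal sketch of the bootstrap confidence interval described above; `score_system` stands in for running the SMT system and scoring one test set (a hypothetical helper), and the test sets are drawn without replacement, following the slide's picture of choosing 100 of the 200 texts.

```python
import random

def bootstrap_interval(texts, score_system, n_sets=1000, set_size=100, p=0.05):
    """Bootstrap confidence interval: score N random test sets, sort,
    and drop N*p/2 scores from each end."""
    scores = []
    for _ in range(n_sets):
        test_set = random.sample(texts, set_size)   # choose texts randomly
        scores.append(score_system(test_set))
    scores.sort()
    k = int(n_sets * p / 2)                         # e.g. 25 for p = 0.05, N = 1000
    return scores[k], scores[-k - 1]
```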


  16. Comparing SMT systems using bootstrapping
    Ex) As before, from 200 texts choose 100 texts randomly for each of the N test sets (1st, 2nd, ⋯, Nth).
    Run both statistical machine translation systems on every test set and get the scores:
    system a : s_1^(a), s_2^(a), ⋯, s_N^(a)
    system b : s_1^(b), s_2^(b), ⋯, s_N^(b)
    Win rate of system a = (the number of test sets with s_k^(a) > s_k^(b)) / N
    If the win rate is over 95%, system a is better than b with p = 0.05.
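A sketch of the pairwise comparison: score both systems on the same bootstrap test sets and compute system a's win rate; `score_a` and `score_b` are hypothetical scoring functions for the two systems.

```python
import random

def win_rate(texts, score_a, score_b, n_sets=1000, set_size=100):
    """Fraction of bootstrap test sets on which system a beats system b."""
    wins = 0
    for _ in range(n_sets):
        test_set = random.sample(texts, set_size)   # same test set for both systems
        if score_a(test_set) > score_b(test_set):
            wins += 1
    return wins / n_sets

# If win_rate(...) > 0.95, system a is better than b at p = 0.05.
```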


  17. References
    [Lavie 2004] Lavie, Alon, Kenji Sagae, and Shyamsundar Jayaraman. "The significance of recall in automatic metrics for MT evaluation." Machine Translation: From Real Users to Research. Springer Berlin Heidelberg, 2004. 134-143.
    111) Isozaki, Hideki, et al. "Automatic evaluation of translation quality for distant language pairs." Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2010.
    134) Lavie, Alon, and Michael J. Denkowski. "The METEOR metric for automatic evaluation of machine translation." Machine Translation 23.2-3 (2009): 105-115.
