
Building a High-Performance Grammatical Error Correction Model Using Large-Scale Pseudo Data


Shun Kiyono

March 18, 2020

Transcript

  1. Building a High-Performance Grammatical Error Correction Model Using Large-Scale Pseudo Data
     Implementation and models are publicly available at https://github.com/butsugiri/gec-pseudodata
     Shun Kiyono, Jun Suzuki, Masato Mita, Tomoya Mizumoto, Kentaro Inui
     RIKEN / Tohoku University / Future Corporation


  2. Task: Grammatical Error Correction (GEC)
     • Input: a sentence containing grammatical errors
     • Output: a grammatically correct sentence
     • In recent years it is common to approach GEC within a (machine) translation framework
     Example: "I follows his advice" (input) → Model (e.g., Encoder-Decoder) → "I followed his advice" (output)
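     To make the translation framing above concrete, here is a minimal sketch (not from the talk or the repository) of how such errorful/corrected pairs are typically stored as two parallel files for a sequence-to-sequence toolkit; the file names are hypothetical.

```python
# One sentence per line: the errorful sentence is the "source" side and its
# correction is the "target" side, exactly as in machine translation.
pairs = [("I follows his advice", "I followed his advice")]

with open("train.src", "w") as f_src, open("train.tgt", "w") as f_tgt:  # hypothetical file names
    for source, target in pairs:
        f_src.write(source + "\n")  # input: sentence containing grammatical errors
        f_tgt.write(target + "\n")  # output: grammatically correct sentence
```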


  3. The problem with GEC: not enough data
     • Even the largest corpus (Lang-8) contains only ~2M sentence pairs
     • Increasing the amount of data is important
     • In recent years, "pseudo-data generation" for GEC has become very active
       • Most teams adopted it in the BEA-2019 Shared Task
     [Figure from Sennrich and Zhang (2019): the German→English learning curve, showing BLEU as a function of the amount of parallel training data for phrase-based SMT and NMT (baseline and optimized systems).]


  4. Training with "genuine" data
     [Diagram: genuine data (e.g., Lang-8), i.e., input/output sentence pairs, is used to train the model.]


  5. Incorporating pseudo data
     [Diagram: a seed corpus (e.g., Wikipedia), a collection of grammatical sentences, is fed to a generation method that produces pseudo data; the model is then trained on the pseudo data together with the genuine data.]
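     A rough sketch of the pipeline in this diagram, under stated assumptions: grammatical sentences from the seed corpus are corrupted by some generation method, and each corrupted sentence is paired with its original as a pseudo training example. `make_pseudo_pairs` and the toy corruption function are illustrative only, not the authors' implementation.

```python
def make_pseudo_pairs(seed_sentences, corrupt):
    """Pair each grammatical seed sentence with a corrupted copy of itself."""
    pairs = []
    for sentence in seed_sentences:
        noisy = corrupt(sentence)        # pseudo "errorful" source sentence
        pairs.append((noisy, sentence))  # the target is the original, grammatical sentence
    return pairs

# Toy corruption function for illustration only (real methods are introduced on slide 10).
pairs = make_pseudo_pairs(
    ["I followed his advice ."],
    corrupt=lambda s: s.replace("followed", "follows"),
)
print(pairs)  # [('I follows his advice .', 'I followed his advice .')]
```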


  6. Problem: how should pseudo data be used?
     • Concerns when using pseudo data
       • Currently, the settings differ from study to study
       • No comparative experiments exist
       • So in the end, what should we actually do?
     • Goal of this study: search for effective settings
       • Find the best setting for each of the following aspects
     Q1: Which pseudo-data generation method?
     Q2: Which type of seed corpus?
     Q3: How should the model be optimized?


  7. Contribution: a very strong GEC model
     1. Using very large-scale (~70M sentence pairs) pseudo data,
     2. we train an existing Encoder-Decoder model,
     3. achieving the world's best performance as of 2019
     • First system to exceed an F0.5 of 70 on the BEA leaderboard (TEAM RIKEN)
     • Trained models are released on GitHub (fairseq-based; you can use them today)
     https://github.com/butsugiri/gec-pseudodata
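     A hedged sketch of how one of the released checkpoints might be run with fairseq-interactive (the repository is fairseq-based). The data directory, checkpoint name, and preprocessing steps below are assumptions; follow the repository README at https://github.com/butsugiri/gec-pseudodata for the actual instructions.

```python
import subprocess

# Feed one (subword-tokenized) errorful sentence to a downloaded checkpoint.
result = subprocess.run(
    [
        "fairseq-interactive", "data-bin/",       # hypothetical binarized-dictionary directory
        "--path", "checkpoints/gec_model.pt",     # hypothetical checkpoint file name
        "--source-lang", "src", "--target-lang", "tgt",
        "--beam", "5",
    ],
    input="I follows his advice\n",
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)  # the corrected hypothesis appears on the "H-" lines
```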


  8. Problem: how should pseudo data be used? (recap)
     • Concerns when using pseudo data
       • Currently, the settings differ from study to study
       • No comparative experiments exist
       • So in the end, what should we actually do?
     • Goal of this study: search for effective settings
       • Find the best setting for each of the following aspects
     Q1: Which pseudo-data generation method?
     Q2: Which type of seed corpus?
     Q3: How should the model be optimized?


  9. Q1: Which pseudo-data generation method?
     [Diagram: the generation-method step of the pipeline (seed corpus → generation method → pseudo data) is highlighted.]


  10. Q1: Which pseudo-data generation method?
      • BACKTRANS (NOISY) [Xie+2018]
        • Generates pseudo data by noisy back-translation
      • DIRECTNOISE [Zhao+2019]
        • Injects random noise directly into a sentence
      • Please refer to the paper for the details of each method (a rough sketch of the DIRECTNOISE idea follows the examples below)
      [Figure 5 from the paper: examples of sentences generated by BACKTRANS (NOISY) and DIRECTNOISE. For example, Original: "The cli@@ p is mixed with images of Toronto streets during power failure ." → BACKTRANS (NOISY): "The cli@@ p is mix with images of Toronto streets during power failure ." → DIRECTNOISE: "The ⟨mask⟩ is mixed ⟨mask⟩ images si@@ of The ⟨mask⟩ streets large ⟨mask⟩ power R@@ failure place ⟨mask⟩".]
      [Figure 6 from the paper: examples generated by DIRECTNOISE when varying the mask probability µ_mask, from the original "He threw the sand@@ wi@@ ch at his wife ." (N/A) to increasingly corrupted outputs at µ_mask = 0.1, 0.3, 0.5, 0.7.]
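      As a rough illustration only: the sketch below corrupts a sentence token by token in the spirit of DIRECTNOISE. The set of operations and the probability values are assumptions made for this example, not the exact procedure of [Zhao+2019] or of this work.

```python
import random

def direct_noise(tokens, mu_mask=0.3, mu_delete=0.1, mu_insert=0.1,
                 filler_vocab=("the", "a", "is")):
    """Corrupt a token sequence with masking, deletion, and random insertion."""
    noised = []
    for tok in tokens:
        r = random.random()
        if r < mu_mask:
            noised.append("<mask>")                            # replace the token with a mask symbol
        elif r < mu_mask + mu_delete:
            continue                                           # drop the token entirely
        elif r < mu_mask + mu_delete + mu_insert:
            noised.extend([random.choice(filler_vocab), tok])  # insert a random token before it
        else:
            noised.append(tok)                                 # keep the token unchanged
    return noised

print(direct_noise("He threw the sandwich at his wife .".split()))
```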


  11. Q2: Which type of seed corpus?
      [Diagram: the seed-corpus step of the pipeline (e.g., Wikipedia, the collection of grammatical sentences) is highlighted.]


  12. Q2: Which type of seed corpus?
      • Many candidates exist:
        • Wikipedia, the 1-billion-word benchmark, BookCorpus, etc.
        • [Ge+2018]: Wikipedia
        • [Zhao+2019]: 1-billion-word LM benchmark
        • [Xie+2018]: NYT corpus
        • [Grundkiewicz+2019]: News Crawl
      • Which corpus is best suited to a GEC model?
      • This study compares the following corpora:
        • Simple Wikipedia
        • Wikipedia (same domain as Simple Wikipedia; only the grammatical complexity differs)
        • LDC Gigaword (newswire, so we expect little noise)


  13. Q3: How should the model be optimized with pseudo data?
      [Diagram: the training step of the pipeline (pseudo data + genuine data → model) is highlighted.]


  14. Q3: How should the model be optimized with pseudo data?
      • JOINT: train on the pseudo data and the genuine data simultaneously, in a single training run
      • PRETRAIN: first pre-train on the pseudo data, then fine-tune on the genuine data
      Which of the two performs better? (A sketch contrasting the two settings follows below.)
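      A minimal sketch contrasting the two settings, assuming a hypothetical `train` callable that performs one training run over a list of (source, target) pairs and returns the updated model.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]                              # (errorful source, corrected target)
TrainFn = Callable[[object, List[Pair]], object]    # hypothetical "run one training pass" helper

def joint(model, genuine: List[Pair], pseudo: List[Pair], train: TrainFn):
    """JOINT: a single training run over the genuine and pseudo pairs mixed together."""
    return train(model, genuine + pseudo)

def pretrain_then_finetune(model, genuine: List[Pair], pseudo: List[Pair], train: TrainFn):
    """PRETRAIN: optimize on the pseudo data first, then fine-tune on the genuine data."""
    model = train(model, pseudo)    # pre-training stage (pseudo data only)
    return train(model, genuine)    # fine-tuning stage (genuine data, e.g. BEA-2019 train)
```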


  15. Experimental settings and datasets
      • We adopt "standard" settings (illustrative training commands are sketched below)
        • Model: Transformer (Big) [Vaswani+2017]
        • Optimization: Adam (pre-training), Adafactor (fine-tuning)
      • Datasets
        • BEA-2019 dataset (train/valid/test) [Bryant+2019]
        • CoNLL-2014 (test) [Ng+2014]
        • JFLEG (test) [Napoles+2017]
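      Purely illustrative: how the PRETRAIN setting with the configuration on this slide (Transformer Big, Adam for pre-training, Adafactor for fine-tuning) might be launched with fairseq. The data paths and hyperparameter values are assumptions, not the paper's exact configuration.

```python
import subprocess

# Stage 1: pre-train on binarized pseudo data (hypothetical path data-bin/pseudo).
pretrain = [
    "fairseq-train", "data-bin/pseudo",
    "--arch", "transformer_vaswani_wmt_en_de_big",   # Transformer (Big)
    "--optimizer", "adam", "--adam-betas", "(0.9, 0.98)",
    "--lr", "3e-4", "--lr-scheduler", "inverse_sqrt", "--warmup-updates", "4000",
    "--max-tokens", "4096",
    "--save-dir", "checkpoints/pretrain",
]

# Stage 2: fine-tune on the genuine BEA-2019 training data (hypothetical path),
# restarting the optimizer with Adafactor from the pre-trained checkpoint.
finetune = [
    "fairseq-train", "data-bin/bea_train",
    "--arch", "transformer_vaswani_wmt_en_de_big",
    "--optimizer", "adafactor", "--lr", "3e-5",
    "--restore-file", "checkpoints/pretrain/checkpoint_best.pt",
    "--reset-optimizer", "--reset-dataloader", "--reset-meters",
    "--max-tokens", "4096",
    "--save-dir", "checkpoints/finetune",
]

subprocess.run(pretrain, check=True)
subprocess.run(finetune, check=True)
```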


  16. Experiment 1: choosing the seed corpus
      • Setting: JOINT
      • The seed corpus appears to have only a small effect on performance?
      • However, Gigaword is consistently better
        • This suggests the importance of using grammatical (clean) sentences
      Method               Seed Corpus T   Prec.   Rec.   F0.5
      Baseline             N/A             46.6    23.1   38.8
      BACKTRANS (NOISY)    Wikipedia       43.8    30.8   40.4
      BACKTRANS (NOISY)    SimpleWiki      42.5    31.3   39.7
      BACKTRANS (NOISY)    Gigaword        43.1    33.1   40.6
      DIRECTNOISE          Wikipedia       48.3    25.5   41.0
      DIRECTNOISE          SimpleWiki      48.9    25.7   41.4
      DIRECTNOISE          Gigaword        48.3    26.9   41.7
      Table 3: Performance on BEA-valid when changing the seed corpus T used for generating pseudo data (|Dp| = 1.4M). DIRECTNOISE with Gigaword achieves the best F0.5 among all configurations.
      [Table 4 from the paper, only partially visible on the slide: performance of different optimization settings with Wikipedia as the seed corpus.]
      (Slide annotations: within each method, the performance change across seed corpora is small.)


  17. Experiment 1: choosing the seed corpus
      • Setting: JOINT
      • The seed corpus appears to have only a small effect on performance?
      • However, Gigaword is consistently better
        • This suggests the importance of using grammatical (clean) sentences
      (Table 3 and the Table 4 excerpt from the previous slide are shown again, with the same annotations.)
      Answer 1: use Gigaword as the seed corpus


  18. Experiment 2: how to use the pseudo data
      • Setting: Wikipedia is used as the seed corpus
      • When the genuine data and the pseudo data are of roughly equal size
        → PRETRAIN and JOINT perform roughly the same
      [Bar charts: F0.5 scores for PRETRAIN (left) and JOINT (right), each with BACKTRANS (NOISY) and DIRECTNOISE, pseudo data = 1.4M sentence pairs.]


  19. Experiment 2: how to use the pseudo data
      • Setting: Wikipedia is used as the seed corpus
      • With PRETRAIN, increasing the amount of pseudo data improves performance markedly
      • With JOINT, on the other hand, no improvement is observed
        • Problem: in JOINT, the training signal from the pseudo data becomes dominant
      [Bar charts: F0.5 scores for PRETRAIN (left) and JOINT (right), each with BACKTRANS (NOISY) and DIRECTNOISE, comparing pseudo data = 1.4M vs. 14M sentence pairs.]


  20. Experiment 3: increasing the amount of pseudo data
      • BACKTRANS (NOISY) outperforms DIRECTNOISE
      [Plot: F0.5 score as a function of the amount of pseudo data |Dp| (in millions, log scale) for the Baseline, Backtrans (noisy), and DirectNoise.]


  21. Experiment 3: increasing the amount of pseudo data
      • BACKTRANS (NOISY) outperforms DIRECTNOISE
      [Same plot as on the previous slide: F0.5 as a function of the amount of pseudo data.]
      Answer 2: the PRETRAIN + BACKTRANS (NOISY) setting is effective


  22. Summary of the experiments
      • Answer 1: use Gigaword as the seed corpus
      • Answer 2: the PRETRAIN + BACKTRANS (NOISY) setting is effective
      • Together, these answers give the final configuration, LARGEPRETRAIN


  23. Performance comparison with existing work
      [Bar chart: F0.5 on CoNLL-2014 (axis 48–66) for LargePretrain+Ensemble+SSE+R2L, LargePretrain (Single Model), Grundkiewicz et al. (2019), Zhao et al. (2019), Lichtarge et al. (2019), Junczys-Dowmunt et al. (2018), and Chollampatt and Ng (2018).]


  24. [Same chart as on the previous slide: F0.5 on CoNLL-2014.]
      • Our model is already highly competitive as a single model
      • All of the existing systems shown are ensemble models
      • Our single model outperforms every ensemble model except [Grundkiewicz+2019]


  25. [Same chart as on the previous slides: F0.5 on CoNLL-2014.]
      • The additional techniques (ensemble, SSE, R2L) improve performance further
      • Our model achieves the world's best performance



  26. Summary
      • We examined the following aspects of pseudo data for GEC:
        Q1: Which pseudo-data generation method?
        Q2: Which type of seed corpus?
        Q3: How should the model be optimized with pseudo data?
      • Found the setting best suited to GEC models (LARGEPRETRAIN)
      • Set a new state of the art on existing benchmark data
      • Implementation and trained models are publicly available
        https://github.com/butsugiri/gec-pseudodata
