
Zhou et al. - arXiv - Improving Grammatical Error Correction with Machine Translation Pairs

wkwkgg
November 27, 2019


Transcript

  1. Improving Grammatical Error Correction with
    Machine Translation Pairs
    Wangchunshu Zhou, Tao Ge, Chang Mu, Ke Xu,
    Furu Wei, Ming Zhou
    arXiv:1911.02825v1
    2019/11/27 Paper reading group
    Presenter : B4 高橋 悠進


  2. In short
    • Generate source and target sentences using machine translation models to
    improve the GEC task
    Source and target of each pair are SMT and NMT outputs, respectively
    • Performance can be improved by manually decreasing language model weight in
    SMT
    • Shown to be more effective than synthetic data generated by random
    corruption


  3. Introduction
    • Synthetic error-corrected data is helpful for improving GEC models
    • An issue of existing data synthesis approaches
    Pre-defined rule sets : limited error types
    Back-translation : limited by the seed error-corrected training data
    • Proposed method
    Employs two MT models of different qualities (SMT and NMT)
    Both MT models translate the same bridge-language sentence into English
    The two outputs are paired as a pseudo error-corrected sentence pair
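The pairing step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `build_pseudo_pairs`, `toy_smt`, and `toy_nmt` are hypothetical names, and the two toy translators are dictionary lookups standing in for real SMT and NMT systems.

```python
from typing import Callable, Iterable, List, Tuple


def build_pseudo_pairs(
    bridge_sentences: Iterable[str],
    beginner_translate: Callable[[str], str],
    advanced_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Translate each bridge-language sentence twice and pair the outputs.

    The lower-quality output plays the role of the erroneous source,
    the higher-quality output plays the role of the corrected target.
    """
    pairs = []
    for sentence in bridge_sentences:
        source = beginner_translate(sentence)  # stand-in for the SMT output
        target = advanced_translate(sentence)  # stand-in for the NMT output
        pairs.append((source, target))
    return pairs


# Hypothetical toy translators (assumption: real systems would be used here).
TOY_SMT = {"我喜欢猫": "I like cat"}
TOY_NMT = {"我喜欢猫": "I like cats"}


def toy_smt(sentence: str) -> str:
    return TOY_SMT[sentence]


def toy_nmt(sentence: str) -> str:
    return TOY_NMT[sentence]


pseudo_pairs = build_pseudo_pairs(["我喜欢猫"], toy_smt, toy_nmt)
```

Here `pseudo_pairs` holds one (erroneous source, corrected target) pair per bridge sentence, which is the shape of data a GEC model is pre-trained on.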


  4. Related Work - 1
    Rule-based Monolingual Corpora Corruption [Zhao et al., 2019, NAACL]
    • Corrupt monolingual corpora with pre-defined rules
    • Pros : Very simple and efficient to generate parallel data
    • Cons : Limited and only cover a small portion of grammatical error types
    Back-translation based Error Generation [Ge et al., 2018, ACL]
    • Training an error generation model by using the error-corrected corpora in
    opposite direction
    • Pros : Able to cover more diverse error types
    • Cons : Requires a large amount of annotated error corrected data
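For comparison, rule-based corruption of a clean sentence might look like the sketch below. It assumes only two operations, token deletion and token repetition; the function name, the probabilities, and the sampling scheme are hypothetical and are not the rule set of Zhao et al.

```python
import random
from typing import List


def randomly_corrupt(
    tokens: List[str],
    p_delete: float,
    p_repeat: float,
    rng: random.Random,
) -> List[str]:
    """Return a corrupted copy of `tokens`.

    Each token is dropped with probability `p_delete`, or emitted twice
    with probability `p_repeat`; otherwise it is kept unchanged.
    """
    corrupted = []
    for token in tokens:
        roll = rng.random()  # uniform in [0, 1)
        if roll < p_delete:
            continue  # deletion error
        corrupted.append(token)
        if roll < p_delete + p_repeat:
            corrupted.append(token)  # repetition error
    return corrupted


clean = "I like cats very much".split()
noisy = randomly_corrupt(clean, p_delete=0.1, p_repeat=0.1, rng=random.Random(0))
```

The pair (`noisy`, `clean`) then serves as synthetic training data. Every error produced this way is one of the two hard-coded types, which is the coverage limitation listed under Cons.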


  5. Related Work - 2
    Data Generation from Round-trip Translations [Lichtarge et al., 2019, arXiv:1904.05780]
    • This approach uses two MT models
    • One from English to a bridge language and the other from the bridge to English
    • Pros : Easy to generate error-corrected pairs ?
    • Cons :
    Good MT model : output is quite clean and the coverage over error types is limited
    Poor MT model : output is more paraphrase-like or loses information
    Data Generation from Wikipedia Revision Histories
    • Extract revision histories from Wikipedia
    • Pros : Resemble real error-corrected data
    • Cons :
    Majority of extracted revisions are not grammatical error corrections
    Domain of revision history is limited and different from the target GEC
    domain


  6. Method - 1 : Beginner Translator (SMT)
    Beginner Translator (SMT)
    • Meaning-preserving with respect to the input sentences
    • Output has low fluency and contains many grammatical errors
    • Translation output resembles text written by non-native speakers
    • Previous study [Qiu and Park, 2019] : GEC works more effectively when
    source sentences are of lower fluency
    • Manually reduce the weight of language model in the tuned SMT
    Google translate : Anyway, I am very satisfied with everyone’s performance.


  7. Method - 2 : Advanced Translator (NMT)
    Advanced Translator (NMT)
    • “valid translation” (meaning-preserving, fluent and grammatically correct)
    • Available parallel corpora for MT (generally large and cheaper)
    • Easily convert parallel corpora into GEC training data


  8. Evaluation
    Evaluation data
    • BEA 19 shared task on GEC
    • CoNLL-2014 test set
    Settings
    • Primary goal : explore and analyze the effect of pre-training with synthetic
    parallel data in the proposed approach
    • Without extensive tricks
    iterative decoding
    model ensembling
    edit-weighted MLE objective
    right-to-left reranking
    external spell checker, etc.


  9. Models
    Beginner translator model (SMT)
    • Moses [Koehn et al., 2007]
    • word-aligning : MGIZA++
    • language model : KenLM
    • tune the weights : MERT to optimize the system’s BLEU
    • create two replicas of the tuned model by manually increasing or decreasing
    the LM weight (total : 3 models : SMT_high, SMT_tuned, SMT_low)
    Advanced Translator model (NMT)
    • Transformer-based NMT model (transformer big)
    • Chinese sentence : segmented into word-level
    • English word tokens : split into subword (BPE)
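Why the LM weight matters can be illustrated with a toy log-linear scorer. A phrase-based SMT system such as Moses ranks candidate translations by a weighted sum of feature scores, one of which is the language model score. The sketch below is a simplification with only two hypothetical features (`tm` and `lm`) and made-up log-scores and weights; it does not use the Moses API or configuration format.

```python
from typing import Dict, List


def loglinear_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of feature log-scores, as in a log-linear SMT model."""
    return sum(weights[name] * value for name, value in features.items())


def best_candidate(candidates: List[dict], weights: Dict[str, float]) -> str:
    """Return the text of the highest-scoring candidate under `weights`."""
    best = max(candidates, key=lambda c: loglinear_score(c["features"], weights))
    return best["text"]


# Made-up candidates: the first is more fluent (better LM score),
# the second is a more literal rendering (better translation-model score).
candidates = [
    {"text": "I am very satisfied with everyone's performance .",
     "features": {"tm": -4.0, "lm": -6.0}},
    {"text": "I very satisfied with everyone performance .",
     "features": {"tm": -3.0, "lm": -9.0}},
]

weights_tuned = {"tm": 1.0, "lm": 0.5}  # hypothetical tuned weights
weights_low = {"tm": 1.0, "lm": 0.1}    # same model with the LM weight reduced

choice_tuned = best_candidate(candidates, weights_tuned)
choice_low = best_candidate(candidates, weights_low)
```

With the tuned weights the fluent candidate wins (-7.0 against -7.5). With the reduced LM weight the literal, ungrammatical candidate wins (-3.9 against -4.6), which is the behaviour wanted from the low-LM-weight model.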


  10. Dataset
    Chinese-English parallel data (for training translation models)
    • UN Corpus [Ziemski et al., 2016]
    • 15M parallel sentence pairs with around 400M tokens
    Monolingual Chinese corpora (for synthesizing GEC data)
    • news2016zh [Xu, 2019]
    • news corpus containing 2.5M Chinese news articles
    In experiments,
    • 10M pseudo-parallel data (SMT-NMT)
    • 10M sentence pairs (SMT-gold)
    • To compare : NewsCrawl dataset + random corruption (40M sents) [Zhao et al., 2019]
    • Filter the generated corpora based on the fluency [Ge et al., 2018]
    • Discard pairs in which the fluency of the target sentence is lower than that of
    the source sentence
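The fluency filter can be sketched as below. This assumes fluency is scored as 1 / (1 + H(x)), where H(x) is the per-token cross-entropy of the sentence under a language model, which is how Ge et al. (2018) define it as far as I recall; check the paper before relying on it. The unigram table `TOY_PROBS` and the function names are hypothetical stand-ins for a real language model.

```python
import math
from typing import Callable, List, Tuple

# Hypothetical unigram probabilities standing in for a real language model.
TOY_PROBS = {"i": 0.2, "like": 0.2, "cats": 0.2, "cat": 0.01}


def toy_logprob(tokens: List[str], i: int) -> float:
    """Log-probability of token i (the toy model ignores the context)."""
    return math.log(TOY_PROBS.get(tokens[i].lower(), 0.001))


def fluency(tokens: List[str], token_logprob: Callable[[List[str], int], float]) -> float:
    """Fluency = 1 / (1 + H), with H the average negative log-probability."""
    if not tokens:
        return 0.0
    cross_entropy = -sum(token_logprob(tokens, i) for i in range(len(tokens))) / len(tokens)
    return 1.0 / (1.0 + cross_entropy)


def filter_pairs(
    pairs: List[Tuple[str, str]],
    token_logprob: Callable[[List[str], int], float],
) -> List[Tuple[str, str]]:
    """Keep only pairs whose target is at least as fluent as the source."""
    kept = []
    for source, target in pairs:
        if fluency(target.split(), token_logprob) >= fluency(source.split(), token_logprob):
            kept.append((source, target))
    return kept


pairs = [("I like cat", "I like cats"), ("I like cats", "I like cat")]
kept_pairs = filter_pairs(pairs, toy_logprob)
```

Only the first pair survives: its target scores higher than its source, while the second pair has a target that is less fluent than its source and is discarded.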


  11. Performance of translation model
    BLEU score of SMT and NMT
    • newstest17 Chinese-English translation test set
    • NMT is much better than all SMT models
    • manually decreasing the language model weight in the SMT results in a
    worse BLEU score
    • SMT_low : may indicate more grammatical errors in translated sentences ?


  12. Results on unsupervised GEC training
    Unsupervised GEC training
    • Ours : proposed method
    • Corruption : random corruption
    20M/40M sentence pairs
    • proposed method outperforms the rule-based corruption
    • It may contain more realistic errors
    compared with pre-defined rules
    Influence of the LM weight
    • decreasing the LM weight : better result
    Katsumata and Komachi, 2019, BEA


  13. Fine-tuning Results
    Dataset
    • Lang-8 + NUCLE
    Result
    • Combining both synthetic data sources
    yields consistent improvements
    • decreasing LM weight in SMT may help
    training GEC models


  14. Qualitative Analysis
    Advanced translator (NMT)
    • Translations are generally of good quality and are very similar to the ground-
    truth translation
    → Target sentences are generally grammatically correct
    Comparing erroneous sentences
    • Random corruption : limited to artificial errors (repetition and deletion of tokens)
    • Proposed method is able to introduce much more realistic errors which
    resemble those made by ESL learners
    Comparing LM weights
    • SMT_high : Tends to be more fluent with fewer grammatical errors
    • SMT_low : Meaning-preserving but contains massive grammatical errors
    → decreasing LM weight yields better performance


  15. Examples of translation


  16. Ablation Study
    Analyze each component in this method
    • Both pairs contribute to GEC model
    • SMT-gold sentence pairs are slightly
    more effective than SMT-NMT
    • MT parallel corpora are limited in both
    size and domain
    → SMT-NMT : general and flexible


  17. Summary
    • Using MT pairs as the source and target sentences is effective for improving
    performance in the GEC task
    • Performance can be improved by manually decreasing the LM weight in the SMT
    • This approach may contain more realistic errors compared with pre-defined rules
