Zhou et al. - arXiv - Improving Grammatical Error Correction with Machine Translation Pairs

Improving Grammatical Error Correction with Machine Translation Pairs
Wangchunshu Zhou, Tao Ge, Chang Mu, Ke Xu, Furu Wei, Ming Zhou
arXiv:1911.02825v1
2019/11/27
Paper reading group, presenter: B4 ߴڮ ༔ਐ

In short
• Generate source and target sentences with machine translation models to improve the GEC task
  - The source and target of each pair are SMT output and NMT output, respectively
• Performance can be improved by manually decreasing the language model weight in the SMT system
• Shown to be more effective than synthetic data generated by random corruption

Introduction
• Synthetic error-corrected data is helpful for improving GEC models
• Issues with existing data synthesis approaches
  - Pre-defined rule sets: limited error types
  - Back-translation: limited by the seed error-corrected training data
• Proposed method
  - Employs two MT models of different quality (SMT and NMT)
  - Both models translate the same bridge-language sentence into English
  - The two outputs are paired as a pseudo error-corrected sentence pair

Related Work - 1
Rule-based Monolingual Corpora Corruption [Zhao et al., 2019, NAACL]
• Corrupts monolingual corpora with pre-defined rules
• Pros: very simple and efficient way to generate parallel data
• Cons: limited, covers only a small portion of grammatical error types
Back-translation based Error Generation [Ge et al., 2018, ACL]
• Trains an error generation model by using the error-corrected corpora in the opposite direction
• Pros: able to cover more diverse error types
• Cons: requires a large amount of annotated error-corrected data
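The rule-based corruption idea can be sketched in a few lines. This is a minimal illustration, not the rule set of Zhao et al.: the three rules (delete, repeat, swap adjacent tokens), the function names `corrupt` and `make_pair`, and the default probabilities are all assumptions chosen for the example.

```python
import random


def corrupt(tokens, p_delete=0.1, p_repeat=0.1, p_swap=0.1, rng=None):
    """Return a noisy copy of `tokens` using three hypothetical rules."""
    if rng is None:
        rng = random.Random()
    out = []
    for tok in tokens:
        r = rng.random()
        if r < p_delete:
            continue                 # rule 1: drop the token
        out.append(tok)
        if r < p_delete + p_repeat:
            out.append(tok)          # rule 2: repeat the token
    i = 0
    while i < len(out) - 1:          # rule 3: swap adjacent tokens
        if rng.random() < p_swap:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2                   # do not re-swap the moved token
        else:
            i += 1
    return out


def make_pair(sentence, rng=None, **rule_probs):
    """Build a (corrupted source, clean target) pair from a clean sentence."""
    noisy = corrupt(sentence.split(), rng=rng, **rule_probs)
    return " ".join(noisy), sentence


if __name__ == "__main__":
    print(make_pair("she goes to school every day", rng=random.Random(0)))
```

Every error such a generator can produce has to be written down as a rule in advance, which is the coverage limitation listed under Cons.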

Related Work - 2
Data Generation from Round-trip Translations [Lichtarge et al., 2019, arXiv:1904.05780]
• This approach uses two MT models
• One translates from English to a bridge language, the other from the bridge language back to English
• Pros: easy to generate error-corrected pairs?
• Cons
  - Good MT model: output is quite clean, so the coverage of error types is limited
  - Poor MT model: output is more paraphrase-like or loses information
Data Generation from Wikipedia Revision Histories
• Extracts revision histories from Wikipedia
• Pros: resembles real error-corrected data
• Cons
  - The majority of extracted revisions are not grammatical error corrections
  - The domain of the revision history is limited and differs from the target GEC domain

Method - 1 : Beginner Translator (SMT)
• Meaning-preserving with respect to the input sentences
• Low fluency, containing many grammatical errors
• Translation output resembles text written by non-native speakers
• A previous study [Qiu and Park, 2019] found that GEC training is more effective when source sentences are of lower fluency
• Manually reduce the weight of the language model in the tuned SMT system
Google Translate: Anyway, I am very satisfied with everyone’s performance.

Method - 2 : Advanced Translator (NMT)
• Produces a “valid translation” (meaning-preserving, fluent and grammatically correct)
• Parallel corpora for MT are available (generally large and cheaper)
• Parallel corpora can easily be converted into GEC training data
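The two ways of building pairs described in the method slides can be sketched as follows. This is an illustration under stated assumptions: `smt_nmt_pairs`, `smt_gold_pairs` and the `Translator` type are hypothetical names, the translators are passed in as plain callables, and the stub translators in the usage part are toy lookups rather than real MT systems.

```python
from typing import Callable, Iterable, List, Tuple

# Any function mapping a bridge-language sentence to an English sentence.
Translator = Callable[[str], str]


def smt_nmt_pairs(bridge_sentences: Iterable[str],
                  beginner: Translator,
                  advanced: Translator) -> List[Tuple[str, str]]:
    """Pair beginner (SMT) output as source with advanced (NMT) output as target."""
    return [(beginner(s), advanced(s)) for s in bridge_sentences]


def smt_gold_pairs(parallel: Iterable[Tuple[str, str]],
                   beginner: Translator) -> List[Tuple[str, str]]:
    """Convert an MT parallel corpus: SMT output as source, gold English as target."""
    return [(beginner(bridge), gold) for bridge, gold in parallel]


if __name__ == "__main__":
    # Toy stand-ins for the two translators (made-up outputs).
    smt = {"zh-1": "I very satisfied to everyone performance ."}.get
    nmt = {"zh-1": "I am very satisfied with everyone's performance ."}.get
    print(smt_nmt_pairs(["zh-1"], smt, nmt))
    print(smt_gold_pairs([("zh-1", "I am pleased with how everyone performed .")], smt))
```

Only monolingual bridge-language text is needed for the first function, while the second needs an existing parallel corpus.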

Evaluation
Evaluation data
• BEA 19 shared task on GEC
• CoNLL-2014 test set
Settings
• Primary goal: explore and analyze the effect of pre-training with synthetic parallel data in the proposed approach
• No extensive tricks are used:
  - iterative decoding
  - model ensembling
  - edit-weighted MLE objective
  - right-to-left reranking
  - external spell checker, etc.

Models
Beginner translator model (SMT)
• Moses [Koehn et al., 2007]
• Word alignment: MGIZA++
• Language model: KenLM
• Weight tuning: MERT, optimizing the system’s BLEU
• Two replicas of the tuned model are created by manually increasing or decreasing the language model weight (total: 3 models, SMT_high, SMT_tuned, SMT_low)
Advanced translator model (NMT)
• Transformer-based NMT model (transformer big)
• Chinese sentences: segmented at word level
• English word tokens: split into subwords (BPE)
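Why the language model weight matters can be seen from the log-linear form of an SMT system, where a hypothesis is scored as a weighted sum of feature values. The sketch below is a toy illustration only: the two candidate translations, their feature values, the feature names `tm` and `lm`, and the weight settings are all made up, and a real Moses system uses more features than these two.

```python
def score(features, weights):
    """Log-linear score: weighted sum of feature values (log-probabilities)."""
    return sum(weights[name] * value for name, value in features.items())


def best(candidates, weights):
    """Return the text of the highest-scoring candidate translation."""
    return max(candidates, key=lambda c: score(c["features"], weights))["text"]


# Hypothetical candidates: the first is more fluent (better LM score),
# the second is more literal (better translation-model score).
CANDIDATES = [
    {"text": "I am very satisfied with everyone's performance .",
     "features": {"tm": -6.0, "lm": -10.0}},
    {"text": "I very satisfied to everyone performance .",
     "features": {"tm": -4.0, "lm": -14.0}},
]

TUNED = {"tm": 1.0, "lm": 1.0}    # stand-in for the MERT-tuned weights
LOW = {"tm": 1.0, "lm": 0.25}     # same weights with the LM weight reduced

if __name__ == "__main__":
    print("tuned :", best(CANDIDATES, TUNED))
    print("low LM:", best(CANDIDATES, LOW))
```

With the tuned weights the fluent candidate wins (-16.0 against -18.0); with the reduced LM weight the literal, less grammatical candidate wins (-7.5 against -8.5), which is the behavior the beginner translator is meant to have.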

Dataset
Chinese-English parallel data (for training translation models)
• UN Corpus [Ziemski et al., 2016]
• 15M parallel sentence pairs with around 400M tokens
Monolingual Chinese corpora (for synthesizing GEC data)
• news2016zh [Xu, 2019]
• News corpus containing 2.5M Chinese news articles
In experiments
• 10M pseudo-parallel sentence pairs (SMT-NMT)
• 10M sentence pairs (SMT-gold)
• For comparison: NewsCrawl dataset + random corruption (40M sentences) [Zhao et al., 2019]
• The generated corpora are filtered by fluency [Ge et al., 2018]
• Discard a pair when the fluency of the target sentence is lower than that of the source sentence
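The fluency filter can be sketched as below. It assumes the fluency score f(x) = 1 / (1 + H(x)), with H(x) the per-token cross-entropy of the sentence under a language model, which is our reading of the definition in Ge et al. (2018). The add-one-smoothed unigram model, the class name `UnigramLM` and the toy corpus are stand-ins for illustration; a real setup would use a much stronger language model.

```python
import math
from collections import Counter


class UnigramLM:
    """Toy add-one-smoothed unigram model (a stand-in for a real LM)."""

    def __init__(self, corpus):
        self.counts = Counter(tok for sent in corpus for tok in sent.split())
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # +1 slot for unseen tokens

    def logprob(self, token):
        return math.log((self.counts[token] + 1) / (self.total + self.vocab))


def fluency(sentence, lm):
    """f(x) = 1 / (1 + H(x)), H(x) = average negative log-probability per token."""
    tokens = sentence.split()
    if not tokens:
        return 0.0
    h = -sum(lm.logprob(t) for t in tokens) / len(tokens)
    return 1.0 / (1.0 + h)


def filter_pairs(pairs, lm):
    """Discard pairs whose target is less fluent than its source."""
    return [(src, tgt) for src, tgt in pairs if fluency(tgt, lm) >= fluency(src, lm)]


if __name__ == "__main__":
    lm = UnigramLM(["the cat sat on the mat", "the dog sat on the rug"])
    pairs = [("cat the zzz", "the cat sat"), ("the cat sat", "cat the zzz")]
    print(filter_pairs(pairs, lm))  # only the first pair survives
```

A unigram model ignores word order, so this toy version only reacts to rare or unseen tokens; the filtering rule itself is independent of which language model supplies the probabilities.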

Performance of translation model
BLEU score of SMT and NMT
• Test set: newstest17 Chinese-English translation
• NMT is much better than all SMT variants
• Manually decreasing the language model weight in the SMT system results in a worse BLEU score
• SMT_low: may indicate more grammatical errors in the translated sentences?

Results on unsupervised GEC training
Unsupervised GEC training
• Ours: proposed method
• Corruption: random corruption with 20M/40M sentence pairs
• The proposed method outperforms both rule-based corruption settings
• It may contain more realistic errors than data built with pre-defined rules
Influence of the LM weight
• Decreasing the LM weight gives a better result
Reference: [Katsumata and Komachi, 2019, BEA]

Fine-tuning Results
Dataset
• Lang-8 + NUCLE
Result
• Combining both synthetic data sources yields consistent improvements
• Decreasing the LM weight in the SMT system may help training GEC models

Qualitative Analysis
Advanced translator (NMT)
• Translations are generally of good quality and very similar to the ground-truth translation
  → Target sentences are generally grammatically correct
Comparing erroneous sentences
• Random corruption produces only limited, artificial errors (repetition and deletion of tokens)
• The proposed method introduces much more realistic errors, which resemble those made by ESL learners
Comparing LM weights
• SMT_high: tends to be more fluent with fewer grammatical errors
• SMT_low: meaning-preserving but contains a large number of grammatical errors
  → Decreasing the LM weight yields better performance

Examples of translation

Ablation Study
Analyze each component of the method
• Both kinds of pairs contribute to the GEC model
• SMT-gold sentence pairs are slightly more effective than SMT-NMT pairs
• MT parallel corpora are limited in both size and domain
  → SMT-NMT is more general and flexible

Summary
• Using MT output pairs as source and target sentences is effective for improving performance on the GEC task
• Performance can be improved by manually decreasing the LM weight in the SMT system
• This approach may contain more realistic errors than data built with pre-defined rules
