
Ive, Madhyastha, Specia_2019_EMNLP_Deep Copycat Networks for Text-to-Text Generation

tosho
February 03, 2020

Transcript

  1. Deep Copycat Networks for Text-to-Text Generation
     Julia Ive, Pranava Madhyastha, Lucia Specia
     EMNLP 2019
     Presenter: Tosho Hirasawa (TMU, Komachi Lab, M1)
     3 February 2020 @ Komachi Lab
  2. 0 Overview
     • Proposed a flexible pointer-network framework for text-to-text generation tasks
       ◦ copycat network based on the Transformer
       ◦ Dual-source copycat network
       ◦ Dual-source double-attention copycat network
     • Two tasks
       ◦ Summarisation
       ◦ Automatic Post-Editing (APE)
     • Competitive results for summarisation
     • Better novel/repetitive n-gram rates for APE
  3. 1 Introduction
     • Seq-to-seq models are "over-creative"
     • copycat framework
       ◦ Based on the Transformer network
       ◦ Does not use a coverage penalty
     • Better performance w.r.t.
       ◦ novel n-gram generation rate (higher is better)
       ◦ repetition rate (lower is better)
     • Dual-source setting: automatic post-editing task
       ◦ Conditioned on two inputs, copying from either of them
         ▪ Source-language sentence
         ▪ Original machine translation
  4. 2 Related work
     • Abstractive summarisation
       ◦ Pointer network [See et al., 2017]
       ◦ Reinforcement learning
         ▪ ROUGE and maximum likelihood [Paulus et al., 2017]
         ▪ Global quality estimator [Li et al., 2018]
       ◦ Content selection [Gehrmann et al., 2018]
     • Automatic Post-Editing (APE)
       ◦ Machine translation [Simard and Foster, 2013]
       ◦ Multi-source Transformer architectures [Junczys-Dowmunt and Grundkiewicz, 2018]
       ◦ Requires large-scale datasets: unrealistic scenarios
       ◦ SOTA APE models modify only ~30% of inputs [Chatterjee et al., 2018]
         ▪ Only 50% of the modifications are positive changes
  5. 3.1 Transformer Architecture
     • Based on Vaswani et al., 2017
       ◦ 6 layers each for encoder/decoder
       ◦ model dimension of 512
     • Encoder
       ◦ Sub-layers: multi-head self-attention, position-wise feed-forward neural network
       ◦ Output: (equation sketched below)
     • Decoder
       ◦ Sub-layers: multi-head self-attention, multi-head cross-attention, position-wise FFNN
       ◦ Output: (equation sketched below)
     • Other details
       ◦ Vocabulary: Byte Pair Encoding (BPE) applied
       ◦ Training: cross-entropy loss
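
The two "Output" equations were images on the slide and did not transcribe. As a reference, here is a minimal reconstruction of the standard Transformer sub-layer outputs from Vaswani et al., 2017, in our own notation (not necessarily the notation used in the paper):

```latex
% Every sub-layer is wrapped in a residual connection followed by layer normalisation
\mathrm{SubLayerOut}(x) = \mathrm{LayerNorm}\bigl(x + \mathrm{SubLayer}(x)\bigr)

% Encoder layer l: self-attention, then a position-wise FFN; s is the final encoder output
h^{(l)} = \mathrm{FFN}\bigl(\mathrm{SelfAttn}(h^{(l-1)})\bigr), \qquad s = h^{(6)}

% Decoder layer l: self-attention, cross-attention over s, then the FFN;
% the last layer's states feed a softmax over the (BPE) vocabulary
z^{(l)} = \mathrm{FFN}\bigl(\mathrm{CrossAttn}(\mathrm{SelfAttn}(z^{(l-1)}),\, s)\bigr), \qquad
P_{\mathrm{vocab}}(y_t \mid y_{<t}, x) = \mathrm{softmax}\bigl(W_o\, z^{(6)}_t\bigr)
```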
  6. 3.2 Pointer-generator Network
     • Based on See et al., 2017
     • Attention over the encoder
       ◦ Dot-product attention
       ◦ Computed from the states of the last decoder layer, just before the normalisation layer
     • Generation probability
       ◦ Sigmoid function
       ◦ Computed from the states of the last decoder layer, just before the normalisation layer
       ◦ Final distribution: interpolation of the decoder's probability and the copy probability (sketched below)
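
A minimal sketch of the interpolation described on this slide, following the pointer-generator of See et al., 2017 with the decoder states mentioned above. Tensor shapes, the gate parameter `w_gen` and the function name are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn.functional as F

def pointer_generator_step(dec_state, enc_states, src_ids, vocab_logits, w_gen, vocab_size):
    """One decoding step of a pointer-generator mixture (sketch, not the authors' code).

    dec_state:    (batch, d)          last-decoder-layer state (taken just before LayerNorm)
    enc_states:   (batch, src_len, d) encoder outputs
    src_ids:      (batch, src_len)    source token ids in the shared BPE vocabulary
    vocab_logits: (batch, vocab_size) decoder output logits over the vocabulary
    w_gen:        (d,)                assumed parameter vector for the generation gate
    """
    # Dot-product attention of the decoder state over the encoder states -> copy distribution
    scores = torch.einsum("bd,bsd->bs", dec_state, enc_states)       # (batch, src_len)
    copy_attn = F.softmax(scores, dim=-1)

    # Generation probability p_gen from a sigmoid over the decoder state
    p_gen = torch.sigmoid(dec_state @ w_gen).unsqueeze(-1)           # (batch, 1)

    # Scatter the copy attention onto the vocabulary positions of the source tokens
    copy_dist = torch.zeros(dec_state.size(0), vocab_size)
    copy_dist.scatter_add_(1, src_ids, copy_attn)

    # Interpolate the decoder's vocabulary distribution with the copy distribution
    p_vocab = F.softmax(vocab_logits, dim=-1)
    return p_gen * p_vocab + (1.0 - p_gen) * copy_dist
```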
  7. 3.3 Dual-source extension 1/2
     • Standard dual-source copycat framework
       ◦ Second encoder for the machine-translated output
       ◦ Stacked decoder sub-layer (sketched below)
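
The encoder and decoder equations on this slide were also images. The sketch below assumes that the "stacked" dual-source variant adds a second cross-attention sub-layer over the MT encoder after the usual one over the source encoder, as in multi-source Transformers [Junczys-Dowmunt and Grundkiewicz, 2018]; module and dimension names are ours:

```python
import torch.nn as nn

class StackedDualSourceDecoderLayer(nn.Module):
    """Dual-source decoder layer with stacked cross-attentions (sketch, not the authors' code)."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.src_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # over the source encoder
        self.mt_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)   # over the MT encoder
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, y, src_enc, mt_enc):
        # Self-attention over the target prefix (causal mask omitted for brevity)
        y = self.norms[0](y + self.self_attn(y, y, y, need_weights=False)[0])
        # Cross-attention over the source-language encoder states
        y = self.norms[1](y + self.src_attn(y, src_enc, src_enc, need_weights=False)[0])
        # Stacked second cross-attention over the machine-translation encoder states
        y = self.norms[2](y + self.mt_attn(y, mt_enc, mt_enc, need_weights=False)[0])
        # Position-wise feed-forward network
        return self.norms[3](y + self.ffn(y))
```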
  8. 3.3 Dual-source extension 2/2
     • Dual-source with double attention
       ◦ Second encoder for the machine-translated output
       ◦ Stacked decoder sub-layer
       ◦ Attention over the encoders
       ◦ Pointer-generation with multiple sources
       ◦ Switching function (sketched below)
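
The switching-function equation is missing from the transcription; a plausible reconstruction, assuming a softmax switch over the three actions (generate from the vocabulary, copy from the source x, copy from the MT output ỹ), written in our notation rather than necessarily the paper's exact formulation:

```latex
% Switch over three actions at decoding step t, computed from the decoder state z_t
[\lambda^{\mathrm{gen}}_t,\ \lambda^{\mathrm{src}}_t,\ \lambda^{\mathrm{mt}}_t]
  = \mathrm{softmax}(W_s z_t + b_s)

% Final distribution: interpolation of the vocabulary distribution with the
% attention (copy) distributions over the two inputs x (source) and \tilde{y} (MT output)
P(y_t = w) = \lambda^{\mathrm{gen}}_t\, P_{\mathrm{vocab}}(w)
           + \lambda^{\mathrm{src}}_t \sum_{i:\, x_i = w} \alpha^{\mathrm{src}}_{t,i}
           + \lambda^{\mathrm{mt}}_t \sum_{j:\, \tilde{y}_j = w} \alpha^{\mathrm{mt}}_{t,j}
```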
  9. 4 Data and Settings: Data
     • Abstractive summarisation: CNN/Daily Mail dataset [Hermann et al., 2015; Nallapati et al., 2016]
       ◦ Online news articles (800 tokens on average) and their abstracts (60 tokens on average)
       ◦ 300k for training, 13k for validation, 11k for test
       ◦ BPE with a 32k vocabulary; articles truncated to 400 subwords (see the sketch below)
     • APE: two variants
       ◦ English-German (En-De): WMT18/WMT19 APE shared task (IT domain)
         ▪ Translated by an NMT system (45.8 BLEU)
         ▪ 13k for training, 1k for validation, 1k for test
       ◦ English-Latvian (En-Lv): life-sciences domain [Specia et al., 2018]
         ▪ Translated by an NMT system (38.4 BLEU)
         ▪ 13k for training, 1k for validation, 1k for test
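
A small preprocessing sketch for the summarisation inputs. The paper does not name the subword tool, so sentencepiece (and the model filename) is an assumption here; only the 32k vocabulary and the 400-subword truncation come from the slide:

```python
import sentencepiece as spm

# Hypothetical 32k BPE model; the actual tool/model used in the paper is not specified.
sp = spm.SentencePieceProcessor(model_file="bpe32k.model")

def preprocess_article(article: str, max_subwords: int = 400) -> list:
    """Apply BPE and truncate the article to 400 subwords, as in the summarisation setup."""
    pieces = sp.encode(article, out_type=str)
    return pieces[:max_subwords]
```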
  10. 4 Data and Settings: Metrics and hyperparameters
      • Evaluation metrics for APE
        ◦ HTER [Snover et al., 2009]
          ▪ The minimum number of edits (substitution, insertion, deletion, shift) needed to turn the MT output into its human post-edit (formula below)
          ▪ 0.15 HTER for En-De, 0.29 HTER for En-Lv on average
        ◦ BLEU [Papineni et al., 2002]
          ▪ n-gram precision between MT hypotheses and post-edits
      • Hyperparameters
        ◦ Embeddings shared between encoder and decoder
        ◦ Beam search of size 5 (summarisation), 10 (APE)
        ◦ Adam optimizer
          ▪ initial learning rate of 0.044, 4k warmup steps (summarisation)
          ▪ initial learning rate of 0.050, 8k warmup steps (APE)
        ◦ Batch size of 50
        ◦ Early stopping with a patience of 10 epochs, based on ROUGE-L (summarisation) and BLEU (APE)
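
For reference, HTER is an edit rate rather than a raw edit count: the number of edits needed to turn the MT hypothesis into its human post-edit, normalised by the length of the post-edit (standard definition; notation ours):

```latex
\mathrm{HTER}(\hat{y}, y^{\mathrm{pe}}) =
  \frac{\#\,\text{edits (insertions, deletions, substitutions, shifts)}}
       {\#\,\text{tokens in the human post-edit } y^{\mathrm{pe}}}

% Example: a 20-token post-edit reached with 3 edits gives HTER = 3/20 = 0.15,
% the average reported above for the En-De data.
```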
  11. 5 Results: repetitive/novel n-grams
      • Less effective for higher-order n-grams than See et al., 2017 (~0 for 4-grams)
      • Generated more unique quadrigrams than See et al., 2017 (~10%)
  12. 5 Results: APE
      • BASE: the raw MT output
      • PNT-TRG: copycat with attention over the MT output
      • PNT-TRG+SRC: copycat with double attention
      • Pre-training the model is mandatory
        ◦ Pre-train on synthetic data built by randomly deleting, inserting, shifting and substituting tokens on the target side of a parallel corpus, matching the HTER level of the APE dataset (see the sketch below)
        ◦ eSCAPE corpus (En-De)
        ◦ Europarl and EMEA corpora (En-Lv)
        ◦ Fine-tuning with a learning rate of 0.001
      • The performance of copycat systems depends on the difficulty of the task
        ◦ Minor improvement for German
      • Major post-edits
        ◦ punctuation (e.g., deletion of a comma)
        ◦ auxiliary words (e.g., insertion/deletion of prepositions or articles)
      eSCAPE: a Large-scale Synthetic Corpus for Automatic Post-Editing
      EMEA: European Medicines Agency
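
A minimal sketch of the kind of synthetic-data generation described above (our illustration, not the authors' script): random deletions, insertions, substitutions and shifts are applied to the target side of a parallel corpus until an edit rate close to the HTER level of the real APE data is reached; the corrupted sentence then plays the role of the MT output and the original sentence the role of the post-edit during pre-training.

```python
import random

def add_synthetic_edits(tokens, vocab, target_hter=0.15, seed=None):
    """Corrupt a target-side sentence with random edits to imitate MT errors (sketch).

    tokens:      list of target-side tokens from a parallel corpus
    vocab:       list of tokens to sample random insertions/substitutions from
    target_hter: desired edit rate, e.g. the average HTER of the real APE data
    """
    rng = random.Random(seed)
    out = list(tokens)
    n_edits = max(1, round(target_hter * len(tokens)))
    for _ in range(n_edits):
        op = rng.choice(["delete", "insert", "substitute", "shift"])
        if op == "delete" and len(out) > 1:
            out.pop(rng.randrange(len(out)))
        elif op == "insert":
            out.insert(rng.randrange(len(out) + 1), rng.choice(vocab))
        elif op == "substitute":
            out[rng.randrange(len(out))] = rng.choice(vocab)
        elif op == "shift" and len(out) > 1:
            tok = out.pop(rng.randrange(len(out)))
            out.insert(rng.randrange(len(out) + 1), tok)
    return out

# Usage: (noisy, original) pairs form the synthetic "MT output" / "post-edit" training data.
noisy = add_synthetic_edits("das ist ein kleiner Test".split(),
                            vocab=["der", "die", "das", "nicht", "Test"],
                            target_hter=0.15)
```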