Slide 1

Slide 1 text

A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction
Shamil Chollampatt and Hwee Tou Ng
Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018
Presented 2018-11-13

Slide 2

Slide 2 text

• A convolutional encoder-decoder model applied to GEC (grammatical error correction)
• Compared against RNN-based models
• Uses pre-trained word embeddings

Slide 3

Slide 3 text

A Multilayer Convolutional Encoder-Decoder NN
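A minimal PyTorch sketch (not the authors' implementation) of the encoder side of such a multilayer convolutional model, using the hyperparameters listed on slide 19 (7 layers, convolution window width 3, 1024-dimensional layer outputs, 500-dimensional embeddings); position embeddings and the attention-equipped decoder are omitted for brevity.

```python
# Sketch of a multilayer convolutional encoder: token embeddings followed by
# stacked convolutions with GLU activations and residual connections.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvEncoder(nn.Module):
    def __init__(self, vocab_size=30_000, emb_dim=500, hid_dim=1024,
                 n_layers=7, kernel_size=3, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.emb2hid = nn.Linear(emb_dim, hid_dim)
        # Each convolution outputs 2*hid_dim channels so that the GLU
        # (gated linear unit) can halve them back to hid_dim.
        self.convs = nn.ModuleList([
            nn.Conv1d(hid_dim, 2 * hid_dim, kernel_size,
                      padding=kernel_size // 2)
            for _ in range(n_layers)
        ])
        self.dropout = nn.Dropout(dropout)

    def forward(self, src_tokens):
        # src_tokens: (batch, src_len) integer token ids
        x = self.embed(src_tokens)           # (batch, src_len, emb_dim)
        x = self.emb2hid(x)                  # (batch, src_len, hid_dim)
        x = x.transpose(1, 2)                # (batch, hid_dim, src_len)
        for conv in self.convs:
            residual = x
            x = self.dropout(x)
            x = F.glu(conv(x), dim=1)        # gated convolution
            x = x + residual                 # residual connection
        return x.transpose(1, 2)             # (batch, src_len, hid_dim)

# Example: encode a batch of two source sentences of length 10.
encoder = ConvEncoder()
out = encoder(torch.randint(0, 30_000, (2, 10)))
print(out.shape)  # torch.Size([2, 10, 1024])
```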

Slide 4

Slide 4 text

Pre-trained word embeddings
• fastText word embeddings
• Trained on English (Wikipedia)
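A small sketch of how such embeddings could be pre-trained with the fastText library, using the settings listed on slide 19 (skip-gram, window size 5, character n-grams of size 3 to 6, 500 dimensions). The input path "wiki.txt" is a placeholder for a plain-text Wikipedia dump, not a file from the paper.

```python
# Hedged sketch: pre-train subword-aware word embeddings with fastText.
import fasttext

model = fasttext.train_unsupervised(
    "wiki.txt",          # one tokenized sentence per line (assumed path)
    model="skipgram",    # skip-gram objective
    dim=500,             # embedding dimensionality
    ws=5,                # context window size
    minn=3, maxn=6,      # character n-gram range
)
model.save_model("wiki.fasttext.bin")
print(model.get_word_vector("grammar").shape)  # (500,)
```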

Slide 5

Slide 5 text

Rescore
• Edit Operation (EO) features
• Language model (LM) features
• 5-gram LM
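A minimal sketch of this rescoring step: each n-best candidate from the beam is re-scored with a weighted combination of the decoder score, edit operation counts, and a 5-gram LM score. The Candidate class, the feature weights, and the example values below are illustrative assumptions, not the paper's actual feature set or tuned weights.

```python
# Rerank beam-search candidates with extra EO and LM features.
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    decoder_score: float   # log-probability from the encoder-decoder
    eo_count: float        # number of edit operations w.r.t. the source
    lm_score: float        # log-probability under a 5-gram LM

def rescore(candidates, w_eo=-0.1, w_lm=0.5):
    """Return candidates sorted by the combined (rescored) score."""
    def combined(c):
        return c.decoder_score + w_eo * c.eo_count + w_lm * c.lm_score
    return sorted(candidates, key=combined, reverse=True)

best = rescore([
    Candidate("He goes to school .", -1.2, 1, -10.3),
    Candidate("He go to school .",   -1.0, 0, -14.8),
])[0]
print(best.text)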

Slide 6

Slide 6 text

Dataset
• Training: Lang-8 + NUCLE (1.3M sentence pairs)
• Development: NUCLE (5.4K sentence pairs)
• Pre-training word embeddings: Wikipedia (1.78B words)
• Training language model: Common Crawl corpus (94B words)

Slide 7

Slide 7 text

Result

Slide 8

Slide 8 text

Result: effect of pre-trained embeddings

Slide 9

Slide 9 text

Result: effect of ensembling

Slide 10

Slide 10 text

Result: adding rescoring (+Rescore)

Slide 11

Slide 11 text

Result: adding spell checking (+SpellCheck)

Slide 12

Slide 12 text

Result: state of the art (SoTA)

Slide 13

Slide 13 text

RNN vs CNN

Slide 14

Slide 14 text

RNN vs CNN

Slide 15

Slide 15 text

RNN vs CNN
• RNN: … → Precision
• CNN: … → Recall
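For context, GEC results like these are usually summarized with F0.5, which weights precision twice as much as recall. Below is a small helper that turns edit counts into precision, recall, and F0.5; the counts in the examples are made-up illustrations, not numbers from the paper.

```python
# Precision, recall, and F0.5 from true-positive, false-positive,
# and false-negative edit counts.
def f_beta(tp, fp, fn, beta=0.5):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    if p == 0.0 and r == 0.0:
        return p, r, 0.0
    b2 = beta * beta
    return p, r, (1 + b2) * p * r / (b2 * p + r)

print(f_beta(tp=60, fp=30, fn=80))  # higher precision, lower recall
print(f_beta(tp=80, fp=60, fn=60))  # lower precision, higher recall
```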

Slide 16

Slide 16 text

Embedding Initialization

Slide 17

Slide 17 text

Conclusion
• A convolutional encoder-decoder model for GEC
• CNN compared with RNN
• Achieves SoTA results
• Pre-trained word embeddings
• Rescoring with language model and edit operation features

Slide 18

Slide 18 text


Slide 19

Slide 19 text

Model and Training Details
• Source and target embeddings: 500 dimensions
• Source and target vocabularies: 30K (BPE)
• Pre-trained word embeddings
  • Using fastText
  • On the Wikipedia corpus
  • Using a skip-gram model with a window size of 5
  • Character n-gram sequences of size between 3 and 6
• Encoder-decoder
  • 7 convolutional layers
  • With a convolution window width of 3
  • Output of each encoder and decoder layer: 1024 dimensions
• Dropout: 0.2
• Batch size: 32
• Learning rate: 0.25 with a learning rate annealing factor of 0.1
• Momentum value: 0.99
• Beam width: 12
• Training a single model takes around 18 hours
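A hedged restatement of these settings as a plain Python dict, convenient for driving a training script; the key names are my own, only the values come from this slide.

```python
# Hyperparameters from the "Model and Training Details" slide.
config = {
    "emb_dim": 500,            # source and target embedding size
    "vocab_size": 30_000,      # BPE vocabulary (source and target)
    "encoder_layers": 7,       # convolutional layers in the encoder
    "decoder_layers": 7,       # convolutional layers in the decoder
    "kernel_width": 3,         # convolution window width
    "hidden_dim": 1024,        # output size of each encoder/decoder layer
    "dropout": 0.2,
    "batch_size": 32,
    "learning_rate": 0.25,
    "lr_annealing_factor": 0.1,
    "momentum": 0.99,
    "beam_width": 12,          # beam size at decoding time
}
```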

Slide 20

Slide 20 text

Other Results

Slide 21

Slide 21 text

Analysis

Slide 22

Slide 22 text

Code: https://github.com/nusnlp/mlconvgec2018