Slide 1

Slide 1 text

A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction
Shamil Chollampatt and Hwee Tou Ng
Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018
Presented 2018-11-13

Slide 2

Slide 2 text

• A convolutional encoder-decoder model applied to GEC (grammatical error correction)
• Compared against RNN-based models
• Uses pre-trained word embeddings

Slide 3

Slide 3 text

A Multilayer Convolutional Encoder-Decoder NN
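A minimal PyTorch sketch (not the authors' implementation) of the encoder side of such a multilayer convolutional model, using the hyperparameters listed on slide 19 (7 layers, convolution window width 3, 1024-dimensional layer outputs, 500-dimensional embeddings); position embeddings and the attention-equipped decoder are omitted for brevity.

```python
# Sketch of a multilayer convolutional encoder: token embeddings followed by
# stacked convolutions with GLU activations and residual connections.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvEncoder(nn.Module):
    def __init__(self, vocab_size=30_000, emb_dim=500, hid_dim=1024,
                 n_layers=7, kernel_size=3, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.emb2hid = nn.Linear(emb_dim, hid_dim)
        # Each convolution outputs 2*hid_dim channels so that the GLU
        # (gated linear unit) can halve them back to hid_dim.
        self.convs = nn.ModuleList([
            nn.Conv1d(hid_dim, 2 * hid_dim, kernel_size,
                      padding=kernel_size // 2)
            for _ in range(n_layers)
        ])
        self.dropout = nn.Dropout(dropout)

    def forward(self, src_tokens):
        # src_tokens: (batch, src_len) integer token ids
        x = self.embed(src_tokens)           # (batch, src_len, emb_dim)
        x = self.emb2hid(x)                  # (batch, src_len, hid_dim)
        x = x.transpose(1, 2)                # (batch, hid_dim, src_len)
        for conv in self.convs:
            residual = x
            x = self.dropout(x)
            x = F.glu(conv(x), dim=1)        # gated convolution
            x = x + residual                 # residual connection
        return x.transpose(1, 2)             # (batch, src_len, hid_dim)

# Example: encode a batch of two source sentences of length 10.
encoder = ConvEncoder()
out = encoder(torch.randint(0, 30_000, (2, 10)))
print(out.shape)  # torch.Size([2, 10, 1024])
```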

Slide 4

Slide 4 text

Pre-trained word embeddings
• fastText word embeddings
• Trained on English (Wikipedia)
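A small sketch of how such embeddings could be pre-trained with the fastText library, using the settings listed on slide 19 (skip-gram, window size 5, character n-grams of size 3 to 6, 500 dimensions). The input path "wiki.txt" is a placeholder for a plain-text Wikipedia dump, not a file from the paper.

```python
# Hedged sketch: pre-train subword-aware word embeddings with fastText.
import fasttext

model = fasttext.train_unsupervised(
    "wiki.txt",          # one tokenized sentence per line (assumed path)
    model="skipgram",    # skip-gram objective
    dim=500,             # embedding dimensionality
    ws=5,                # context window size
    minn=3, maxn=6,      # character n-gram range
)
model.save_model("wiki.fasttext.bin")
print(model.get_word_vector("grammar").shape)  # (500,)
```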

Slide 5

Slide 5 text

Rescore
• Edit Operation (EO) features
• Language model (LM) features
• 5-gram LM
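A minimal sketch of this rescoring step: each n-best candidate from the beam is re-scored with a weighted combination of the decoder score, edit operation counts, and a 5-gram LM score. The Candidate class, the feature weights, and the example values below are illustrative assumptions, not the paper's actual feature set or tuned weights.

```python
# Rerank beam-search candidates with extra EO and LM features.
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    decoder_score: float   # log-probability from the encoder-decoder
    eo_count: float        # number of edit operations w.r.t. the source
    lm_score: float        # log-probability under a 5-gram LM

def rescore(candidates, w_eo=-0.1, w_lm=0.5):
    """Return candidates sorted by the combined (rescored) score."""
    def combined(c):
        return c.decoder_score + w_eo * c.eo_count + w_lm * c.lm_score
    return sorted(candidates, key=combined, reverse=True)

best = rescore([
    Candidate("He goes to school .", -1.2, 1, -10.3),
    Candidate("He go to school .",   -1.0, 0, -14.8),
])[0]
print(best.text)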

Slide 6

Slide 6 text

Dataset
• Training: Lang-8 + NUCLE (1.3M sentence pairs)
• Development: NUCLE (5.4K sentence pairs)
• Pre-training word embeddings: Wikipedia (1.78B words)
• Training language model: Common Crawl corpus (94B words)

Slide 7

Slide 7 text

Result

Slide 8

Slide 8 text

Result: effect of pre-trained embeddings

Slide 9

Slide 9 text

Result: effect of ensembling

Slide 10

Slide 10 text

Result: adding rescoring (+Rescore)

Slide 11

Slide 11 text

Result: adding spell checking (+SpellCheck)

Slide 12

Slide 12 text

Result: state of the art (SoTA)

Slide 13

Slide 13 text

RNN vs CNN

Slide 14

Slide 14 text

RNN vs CNN

Slide 15

Slide 15 text

RNN vs CNN
• RNN: … → Precision
• CNN: … → Recall
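For context, GEC results like these are usually summarized with F0.5, which weights precision twice as much as recall. Below is a small helper that turns edit counts into precision, recall, and F0.5; the counts in the examples are made-up illustrations, not numbers from the paper.

```python
# Precision, recall, and F0.5 from true-positive, false-positive,
# and false-negative edit counts.
def f_beta(tp, fp, fn, beta=0.5):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    if p == 0.0 and r == 0.0:
        return p, r, 0.0
    b2 = beta * beta
    return p, r, (1 + b2) * p * r / (b2 * p + r)

print(f_beta(tp=60, fp=30, fn=80))  # higher precision, lower recall
print(f_beta(tp=80, fp=60, fn=60))  # lower precision, higher recall
```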

Slide 16

Slide 16 text

Embedding Initialization

Slide 17

Slide 17 text

Conclusion
• A convolutional encoder-decoder model for GEC
• CNN compared with RNN
• Achieves SoTA results
• Pre-trained word embeddings
• Rescoring with language model and edit operation features

Slide 18

Slide 18 text


Slide 19

Slide 19 text

Model and Training Details
• Source and target embeddings: 500 dimensions
• Source and target vocabularies: 30K (BPE)
• Pre-trained word embeddings
  • Using fastText
  • On the Wikipedia corpus
  • Using a skip-gram model with a window size of 5
  • Character n-gram sequences of size between 3 and 6
• Encoder-decoder
  • 7 convolutional layers
  • With a convolution window width of 3
  • Output of each encoder and decoder layer: 1024 dimensions
• Dropout: 0.2
• Batch size: 32
• Learning rate: 0.25 with a learning rate annealing factor of 0.1
• Momentum value: 0.99
• Beam width: 12
• Training a single model takes around 18 hours
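A hedged restatement of these settings as a plain Python dict, convenient for driving a training script; the key names are my own, only the values come from this slide.

```python
# Hyperparameters from the "Model and Training Details" slide.
config = {
    "emb_dim": 500,            # source and target embedding size
    "vocab_size": 30_000,      # BPE vocabulary (source and target)
    "encoder_layers": 7,       # convolutional layers in the encoder
    "decoder_layers": 7,       # convolutional layers in the decoder
    "kernel_width": 3,         # convolution window width
    "hidden_dim": 1024,        # output size of each encoder/decoder layer
    "dropout": 0.2,
    "batch_size": 32,
    "learning_rate": 0.25,
    "lr_annealing_factor": 0.1,
    "momentum": 0.99,
    "beam_width": 12,          # beam size at decoding time
}
```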

Slide 20

Slide 20 text

Other Results

Slide 21

Slide 21 text

Analysis

Slide 22

Slide 22 text

Code: https://github.com/nusnlp/mlconvgec2018