Penn Treebank (PTB)
• Standard benchmark corpus for language modeling
  – Wall Street Journal portion of the Penn Treebank
  – Preprocessed as in [Mikolov+ 11]
• Vocabulary limited to 10,000 words
  – Out-of-vocabulary words replaced with <unk>
  – Numbers replaced with the token N
• Example of the number replacement (see the sketch below)
  – "10 million" → "N million"
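To make the two replacement rules concrete, here is a minimal sketch, assuming plain Python, an already-tokenized sentence, and a toy stand-in for the 10,000-word vocabulary; it is not the exact pipeline of [Mikolov+ 11], which includes further steps.

```python
import re

NUM = re.compile(r"^\d[\d,.]*$")   # crude number pattern, for illustration only

def preprocess(tokens, vocab):
    """PTB-style preprocessing sketch: numbers -> N,
    out-of-vocabulary words -> <unk> (vocabulary limited to 10,000 words)."""
    out = []
    for tok in tokens:
        if NUM.match(tok):
            out.append("N")          # e.g. "10 million" -> "N million"
        elif tok in vocab:
            out.append(tok)
        else:
            out.append("<unk>")
    return out

vocab = {"million", "the", "company", "said"}   # stand-in for the 10k vocabulary
print(preprocess("the company earned 10 million".split(), vocab))
# ['the', 'company', '<unk>', 'N', 'million']
```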
Data size
• Training data: 887,521 tokens
  – Roughly 1/1000 of the 1 billion word corpus
  – A rather small corpus
• Much larger corpora are also used
  – 1 billion word corpus, English Wikipedia, …
(Figure) Language modeling vs. masked language modeling
• Standard language models (LSTM, ELMo, Transformer, GPT)
  – Predict each word from the preceding context (see the sketch below)
  – "I have a dream" → P(have | I), P(a | I have), P(dream | I have a)
• Masked language model (BERT)
  – Predicts a masked word from the context on both sides
  – "I [MASK] a" → predict "have" at the [MASK] position
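A minimal sketch of the left-to-right (chain-rule) factorization follows; the `cond_prob` function and the toy probability values are hypothetical stand-ins for a trained language model, used only to show how the conditionals combine.

```python
import math

def sentence_logprob(tokens, cond_prob):
    """Chain-rule factorization used by left-to-right language models:
    log P(w_1..w_n) = sum_i log P(w_i | w_1..w_{i-1})."""
    total = 0.0
    for i, word in enumerate(tokens):
        context = tokens[:i]
        total += math.log(cond_prob(word, context))  # P(w_i | left context)
    return total

# Hypothetical conditional probabilities, for illustration only.
toy_probs = {
    ("I", ()): 0.05,
    ("have", ("I",)): 0.10,
    ("a", ("I", "have")): 0.20,
    ("dream", ("I", "have", "a")): 0.01,
}
cond_prob = lambda w, ctx: toy_probs[(w, tuple(ctx))]
print(sentence_logprob(["I", "have", "a", "dream"], cond_prob))
```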
word2vec
• Word embeddings obtained from simplified neural (RNN) language models [Mikolov+ 13]
• Capture regularities between words, e.g. the analogy man : woman :: king : queen
• Skip-gram and CBOW: simplified versions of the RNN language model [Mikolov+ 13] (see the sketch below)
• Trained on large corpora such as the 1 billion word corpus
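A minimal sketch of the skip-gram training signal, assuming plain Python, a toy sentence, and window size 2; it only shows how (center, context) pairs are extracted, not the actual word2vec training code.

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs as used by skip-gram:
    each center word is trained to predict the words around it."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["I", "have", "a", "dream"]))
# [('I', 'have'), ('I', 'a'), ('have', 'I'), ('have', 'a'), ...]
```

CBOW reverses the direction: the center word is predicted from its surrounding context words.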
Contextualized representations from LSTM language models
• Hidden states of a pre-trained LSTM language model used as features for downstream tasks [Peters+ 17]
• ELMo [Peters+ 18]: forward and backward LSTM language models trained on the 1 billion word corpus; their hidden states give contextualized word embeddings (see the sketch below)
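A minimal sketch of the underlying idea, assuming PyTorch and arbitrary toy sizes; the real ELMo additionally uses character-level convolutions, multiple LSTM layers, and a learned weighting of layers, none of which are shown here.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the real ELMo): embed tokens, run a bidirectional
# LSTM, and treat the per-token hidden states as contextualized features.
vocab_size, emb_dim, hidden_dim = 1000, 64, 128  # toy sizes, chosen arbitrarily

embed = nn.Embedding(vocab_size, emb_dim)
bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

token_ids = torch.randint(0, vocab_size, (1, 5))   # one sentence of 5 token ids
states, _ = bilstm(embed(token_ids))               # (1, 5, 2 * hidden_dim)

# states[:, i] concatenates forward (left-context) and backward (right-context)
# representations of token i -> usable as features for a downstream task.
print(states.shape)  # torch.Size([1, 5, 256])
```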
BERT [Devlin+ 19]
• Replaces the LSTM with a Transformer (LSTM → Transformer)
• Pre-trained with a masked language modeling objective (see the sketch below)
(Figure: example sentence "I have a dream that")
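A minimal sketch of the masked-language-modeling objective, assuming PyTorch and arbitrary toy sizes; the real BERT masks about 15% of tokens, adds positional and segment embeddings, and uses an additional next-sentence objective, none of which are shown here.

```python
import torch
import torch.nn as nn

# Minimal masked-LM sketch (not the real BERT): mask one position in a
# token sequence and train a Transformer encoder to recover it.
vocab_size, d_model, mask_id = 1000, 64, 0   # toy sizes / ids, chosen arbitrarily

embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=2,
)
to_vocab = nn.Linear(d_model, vocab_size)    # predict the original token id

tokens = torch.randint(1, vocab_size, (1, 5))     # e.g. "I have a dream that"
target = tokens[0, 2].clone()                     # remember the word to recover
masked = tokens.clone()
masked[0, 2] = mask_id                            # "I have [MASK] dream that"

logits = to_vocab(encoder(embed(masked)))         # (1, 5, vocab_size)
loss = nn.functional.cross_entropy(logits[0, 2:3], target.unsqueeze(0))
loss.backward()                                   # one pre-training gradient step
print(loss.item())
```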
References 1/5
• Mikolov et al., Empirical Evaluation and Combination of Advanced Language Modeling Techniques. INTERSPEECH 2011.
• Mikolov et al., Context Dependent Recurrent Neural Network Language Model. SLT 2012.
• Zaremba et al., Recurrent Neural Network Regularization. 2014.
• Gal et al., A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. NIPS 2016.
• Zilly et al., Recurrent Highway Networks. ICML 2017.
• Inan et al., Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling. ICLR 2017.
• Takase et al., Input-to-Output Gate to Improve RNN Language Models. IJCNLP 2017.
References 2/5
• Zoph et al., Neural Architecture Search with Reinforcement Learning. ICLR 2017.
• Lei et al., Simple Recurrent Units for Highly Parallelizable Recurrence. EMNLP 2018.
• Melis et al., On the State of the Art of Evaluation in Neural Language Models. ICLR 2018.
• Merity et al., Regularizing and Optimizing LSTM Language Models. ICLR 2018.
• Yang et al., Breaking the Softmax Bottleneck: A High-Rank RNN Language Model. ICLR 2018.
• Takase et al., Direct Output Connection for a High-Rank Language Model. EMNLP 2018.
References 3/5
• Wan et al., Regularization of Neural Networks using DropConnect. ICML 2013.
• Press et al., Using the Output Embedding to Improve Language Models. EACL 2017.
• Polyak et al., Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization 1992.
• Vaswani et al., Attention Is All You Need. NIPS 2017.
• Popel et al., Training Tips for the Transformer Model. PBML 2018.
References 4/5
• Zołna et al., Fraternal Dropout. ICLR 2018.
• Gong et al., FRAGE: Frequency-Agnostic Word Representation. NIPS 2018.
• Liu et al., Deep Residual Output Layers for Neural Language Generation. ICML 2019.
• Kanai et al., Sigsoftmax: Reanalysis of the Softmax Bottleneck. NIPS 2018.
• Krause et al., Dynamic Evaluation of Neural Sequence Models. 2017.
• Kuhn et al., A Cache-Based Natural Language Model for Speech Recognition. PAMI 1990.
• Grave et al., Improving Neural Language Models with a Continuous Cache. ICLR 2017.
References 5/5
• Radford et al., Language Models are Unsupervised Multitask Learners. 2019.
• Mikolov et al., Linguistic Regularities in Continuous Space Word Representations. NAACL 2013.
• Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality. NIPS 2013.
• Peters et al., Semi-supervised Sequence Tagging with Bidirectional Language Models. ACL 2017.
• Peters et al., Deep Contextualized Word Representations. NAACL 2018.
• Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.