Language model: assigns a probability to a word sequence
• P(I have a dream) > P(a have I dream) > P(fuga spam hoge)
• Evaluation: perplexity
• RNN Encoder-Decoder
• The probability of a sentence is a product of conditional probabilities:
  – P(I have a dream) = P(I) P(have | I) P(a | I have) P(dream | I have a)
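The chain-rule factorization and perplexity mentioned on this slide can be sketched in a few lines of Python. This is only an illustration: the conditional probabilities in `cond_probs` are made-up numbers, and the helper names `sentence_log_prob` and `perplexity` are hypothetical, not taken from the slides. Perplexity is computed here as the exponential of the average negative log-probability per word.

```python
import math

# Hypothetical conditional probabilities P(word | context),
# invented purely for illustration (not from any trained model).
cond_probs = {
    ("I", ()): 0.2,
    ("have", ("I",)): 0.1,
    ("a", ("I", "have")): 0.3,
    ("dream", ("I", "have", "a")): 0.05,
}


def sentence_log_prob(words, cond_probs):
    """Chain rule: sum of log P(w_t | w_1 .. w_{t-1}) over the sentence."""
    total = 0.0
    for t, word in enumerate(words):
        context = tuple(words[:t])
        total += math.log(cond_probs[(word, context)])
    return total


def perplexity(words, cond_probs):
    """exp of the average negative log-probability per word."""
    return math.exp(-sentence_log_prob(words, cond_probs) / len(words))


sentence = ["I", "have", "a", "dream"]
prob = math.exp(sentence_log_prob(sentence, cond_probs))  # P(I have a dream)
ppl = perplexity(sentence, cond_probs)
print(prob, ppl)
```

Working in log space avoids numerical underflow on long sentences; a lower perplexity means the model assigns higher probability to the text.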
Training data: … corpus, English Wikipedia, …
• Predict the next word from the preceding words (LSTM, ELMo, Transformer, GPT)
  – I have a …: P(have | I), P(a | I have), P(dream | I have a)
• Predict a masked word from the surrounding words (BERT)
  – I [MASK] a … → have
Pretrained representations
• skip-gram, CBOW [Mikolov+ 13]
• [Peters+ 17]: LSTM language model, 1 billion word corpus
• ELMo [Peters+ 18]: LSTM language model, 1 billion word corpus
• BERT [Devlin+ 19]: LSTM → Transformer
Examples: man, woman, king, queen; I have a dream that …
Advanced Language Modeling Techniques. INTERSPEECH 2011.
• Mikolov et al., Context Dependent Recurrent Neural Network Language Model. SLT 2012.
• Zaremba et al., Recurrent Neural Network Regularization. 2014.
• Gal et al., A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. NIPS 2016.
• Zilly et al., Recurrent Highway Networks. ICML 2017.
• Inan et al., Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling. ICLR 2017.
• Takase et al., Input-to-output gate to improve RNN language models. IJCNLP 2017.
Learning. ICLR 2017.
• Lei et al., Simple Recurrent Units for Highly Parallelizable Recurrence. EMNLP 2018.
• Melis et al., On the state of the art of evaluation in neural language models. ICLR 2018.
• Merity et al., Regularizing and Optimizing LSTM Language Models. ICLR 2018.
• Yang et al., Breaking the softmax bottleneck: A high-rank RNN language model. ICLR 2018.
• Takase et al., Direct Output Connection for a High-Rank Language Model. EMNLP 2018.
DropConnect. ICML 2013.
• Press et al., Using the Output Embedding to Improve Language Models. EACL 2017.
• Polyak et al., Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization 1992.
• Vaswani et al., Attention Is All You Need. NIPS 2017.
• Popel et al., Training Tips for the Transformer Model. PBML 2018.
• Gong et al., FRAGE: Frequency-Agnostic Word Representation. NIPS 2018.
• Liu et al., Deep Residual Output Layers for Neural Language Generation. ICML 2019.
• Kanai et al., Sigsoftmax: Reanalysis of the Softmax Bottleneck. NIPS 2018.
• Krause et al., Dynamic Evaluation of Neural Sequence Models. 2017.
• Kuhn et al., A cache-based natural language model for speech recognition. PAMI 1990.
• Grave et al., Improving Neural Language Models with a Continuous Cache. ICLR 2017.
Learners. 2019.
• Mikolov et al., Linguistic Regularities in Continuous Space Word Representations. NAACL 2013.
• Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality. NIPS 2013.
• Peters et al., Semi-supervised sequence tagging with bidirectional language models. ACL 2017.
• Peters et al., Deep Contextualized Word Representations. NAACL 2018.
• Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.