
Research Trends in Neural Language Models (Invited Talk at the NL Study Group)

Sho Takase

June 13, 2019
Transcript



  1. Research Trends in Neural Language Models
    2019/6/13
    Sho Takase (Tokyo Institute of Technology)
    https://takase.github.io/

  2. About this talk
    [slide text garbled in this transcript; mentions the NL study group and AIP]

  3. About me
    • 2008 - 2017: Tohoku University (undergraduate through Ph.D.)
    • 2017 - 2018: NTT Communication Science Laboratories
    • 2018 - : Tokyo Institute of Technology

    Research topics
    • Language modeling [IJCNLP 17, EMNLP 18, AAAI 19]
    • Headline generation [EMNLP 16, NAACL 19]
    • Representations of relational patterns [ACL 16]

  4. Outline
    • Progress on the standard benchmark (PTB)
      – results up to around October 2018
    • More recent results
      – newly proposed techniques and their reproducibility
    • Discussion: relation to large-scale pretraining

  5. Outline (same slide, shown again)

  6. What is a language model?
    • Assigns a probability to a word sequence, reflecting its naturalness
      – P(I have a dream) > P(a have I dream) > P(fuga spam hoge …)
      – used to rank candidate outputs in generation, speech recognition, etc.
    • Evaluation metric: perplexity
      – computed on held-out data; lower is better
    • The joint probability is decomposed with the chain rule, and each
      conditional is predicted, e.g., by an RNN:
      P(I have a dream)
      = P(I)P(have | I)P(a | I have)P(dream | I have a)
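A minimal illustration of the chain-rule decomposition and perplexity above (my sketch, not from the slides; `cond_prob(word, history)` is a hypothetical function returning P(word | history)):

    import math

    def sentence_log_prob(words, cond_prob):
        # Chain rule: log P(w1..wn) = sum_i log P(w_i | w_1, ..., w_{i-1})
        return sum(math.log(cond_prob(w, words[:i])) for i, w in enumerate(words))

    def perplexity(corpus, cond_prob):
        # exp(-average log-likelihood per token); lower is better.
        log_prob = sum(sentence_log_prob(s, cond_prob) for s in corpus)
        n_tokens = sum(len(s) for s in corpus)
        return math.exp(-log_prob / n_tokens)

    # A uniform model over a 10,000-word vocabulary (PTB-sized) has
    # perplexity exactly 10,000; any useful model should do far better.
    uniform = lambda w, h: 1.0 / 10000
    print(perplexity([["i", "have", "a", "dream"]], uniform))  # -> 10000.0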

  7. Where language models are used
    • Noisy channel model
      – P(T) serves as the prior over the output text T
      – speech recognition, machine translation, spelling correction, …
      – also used to rescore candidate outputs
    • Pretraining of (contextual) word representations
      – Skip-gram, ELMo, BERT
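The noisy channel factorization behind the first bullet is just Bayes' rule (standard background, not specific to these slides):

    \hat{T} = \operatorname*{argmax}_T P(T \mid S)
            = \operatorname*{argmax}_T \frac{P(S \mid T)\, P(T)}{P(S)}
            = \operatorname*{argmax}_T \underbrace{P(S \mid T)}_{\text{channel model}}\;\underbrace{P(T)}_{\text{language model}}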

  8. Two experimental regimes (by data size)
    Small data
    – Penn Treebank (PTB), WikiText-2
    – cheap enough that many methods can be trained and compared
    – the main arena for regularization and architecture studies
    Large data
    – WikiText-103, 1 billion word corpus
    – the regime of large-scale pretraining (ELMo, BERT)
    – training is expensive, so thorough comparisons are rare
    → this talk mainly follows the small-data (PTB) line of work

  9. Two experimental regimes (by data size) (same slide, shown again)

  10. Dataset: Penn Treebank (PTB)
    • The standard small benchmark
      – the Wall Street Journal portion of the Penn Treebank
      – the preprocessed version of [Mikolov+ 11] is the de facto standard
    • Vocabulary: 10,000 words
      – lowercased; out-of-vocabulary words replaced with <unk>
      – numbers replaced with N
        • 10 million → N million
    • Training set: 887,521 tokens
      – roughly 1/1000 of the 1 billion word corpus
      – small enough to train and compare many models
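A rough sketch of the preprocessing conventions above (my illustration, not Mikolov's actual script):

    import re
    from collections import Counter

    def ptb_normalize(tokens):
        # Lowercase, and replace numeric tokens with the placeholder "N".
        return ["N" if re.fullmatch(r"[\d.,]+", t) else t.lower() for t in tokens]

    def build_vocab(corpus, size=10000):
        # Keep the most frequent words; the rest will map to <unk>.
        counts = Counter(t for sent in corpus for t in sent)
        return {w for w, _ in counts.most_common(size - 1)} | {"<unk>"}

    def apply_vocab(tokens, vocab):
        return [t if t in vocab else "<unk>" for t in tokens]

    print(ptb_normalize("The fund holds 10 million shares".split()))
    # ['the', 'fund', 'holds', 'N', 'million', 'shares']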

  11. Perplexity on PTB over the years
    Year  PPL    Model
    2012  141.2  Kneser-Ney smoothed 5-gram
          124.7  RNN language model [Mikolov+ 12]
    2014   78.4  LSTM language model [Zaremba+ 14]
    2016   75.0  LSTM language model + variational dropout [Gal+ 16]
    2017   68.5  Recurrent Highway Network [Zilly+ 17]
           66.1  Weight tying of input/output embeddings [Inan+ 17]
           64.4  Input-to-output gate [Takase+ 17]
           60.3  Simple Recurrent Unit [Lei+ 17]
           57.3  LSTM LM + regularization and careful tuning (AWD-LSTM) [Merity+ 18]
           54.4  Mixture of Softmaxes [Yang+ 18]
    2018   52.4  Mixture of Softmaxes from multiple layers (DOC) [Takase+ 18]
           47.2  + Ensemble [Takase+ 18]
    Search-based results, for comparison:
    2017   62.4  Architecture search with massive compute [Zoph+ 17]
    2018   58.3  LSTM LM with extensive hyperparameter search [Melis+ 18]
    → a well-regularized, well-tuned LSTM LM outperforms architecture search

  12. Perplexity on PTB over the years (same table, shown again; the
    emphasized takeaway: dropout variants and tuning, rather than new
    architectures, drove most of the gains)

  13. ASGD Weight-Dropped LSTM (AWD-LSTM) [Merity+ 18]
    • A plain LSTM LM plus several regularization techniques
      – Variational dropout [Gal+ 16]
      – DropConnect [Wan+ 13]
      – Weight tying [Inan+ 17, Press+ 17]
      – Averaged SGD [Polyak+ 92]
    • Plus careful hyperparameter choices
      – state of the art on PTB at the time

    Hyperparameter     LSTM LM [Zaremba+ 14]   AWD-LSTM
    Learning rate      1                       30
    Epochs             ~40                     500 ~ 1000
    Gradient clipping  5                       0.25

  14. (AWD-LSTM overview repeated; the following slides walk through each
    component in turn.)

  15. LSTM language model [Zaremba+ 14]
    • An LSTM LM clearly outperforms n-gram models
      – an n-gram model conditions on only the previous N-1 words
      – perplexity: 141.2 (Kneser-Ney 5-gram) → 78.4
    • Stacked LSTM layers, each computing h_t = f(x_t, h_{t-1})
    [Figure: stacked LSTMs over "I have …" producing P(I), P(have | I),
     P(a | I have)]
    • Dropout on the non-recurrent (vertical) connections only
      – units are dropped with probability p and the survivors scaled by
        1/(1 - p)
      – larger models tolerate higher dropout rates → better performance
      – no dropout on the recurrent connections
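A minimal PyTorch sketch of this setup (my illustration; the sizes are the paper's medium model, and `nn.LSTM`'s `dropout` argument happens to implement exactly this between-layer, non-recurrent dropout):

    import torch
    import torch.nn as nn

    class ZarembaLM(nn.Module):
        # [Zaremba+ 14]-style LM: dropout only on the non-recurrent
        # ("vertical") connections, never on the h_{t-1} -> h_t path.
        def __init__(self, vocab=10000, emb=650, hidden=650, layers=2, p=0.5):
            super().__init__()
            self.embed = nn.Embedding(vocab, emb)
            self.drop = nn.Dropout(p)  # scales surviving units by 1/(1-p)
            # nn.LSTM's `dropout` is applied between stacked layers only.
            self.lstm = nn.LSTM(emb, hidden, layers, dropout=p, batch_first=True)
            self.out = nn.Linear(hidden, vocab)

        def forward(self, tokens, state=None):
            x = self.drop(self.embed(tokens))
            h, state = self.lstm(x, state)
            return self.out(self.drop(h)), state

    logits, _ = ZarembaLM()(torch.randint(0, 10000, (1, 4)))
    print(logits.shape)  # torch.Size([1, 4, 10000])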

  16. (AWD-LSTM overview repeated; next component: Variational dropout.)

  17. Variational dropout [Gal+ 16]
    • Sample a dropout mask once per sequence and reuse it at every time
      step, instead of resampling at each step
      – dropout is also applied to the recurrent connections
      – dropout on recurrent connections had been explored before, e.g.
        [Moon+ 15]
    • Backed by a Bayesian interpretation of dropout (next slides)
    [Figure: a standard-dropout LSTM LM vs. a variational-dropout LSTM LM
     over "I have …"; in the latter, the same mask d is reused across all
     time steps]
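A sketch of the mask-reuse pattern (my illustration built around `nn.LSTMCell`; real implementations also mask the gate inputs more carefully):

    import torch
    import torch.nn as nn

    def variational_mask(batch, dim, p):
        # One Bernoulli mask per sequence, reused at every time step,
        # scaled by 1/(1-p) so expectations are unchanged.
        return torch.bernoulli(torch.full((batch, dim), 1 - p)) / (1 - p)

    class VariationalDropoutRNN(nn.Module):
        # [Gal+ 16]-style dropout: input and recurrent masks are sampled
        # once per sequence, not once per time step.
        def __init__(self, emb=650, hidden=650, p=0.5):
            super().__init__()
            self.cell = nn.LSTMCell(emb, hidden)
            self.p = p

        def forward(self, x):                  # x: (batch, time, emb)
            b, t, _ = x.shape
            h = x.new_zeros(b, self.cell.hidden_size)
            c = x.new_zeros(b, self.cell.hidden_size)
            mx = variational_mask(b, x.size(-1), self.p)  # input mask
            mh = variational_mask(b, h.size(-1), self.p)  # recurrent mask
            outs = []
            for i in range(t):                 # same masks at every step
                h, c = self.cell(x[:, i] * mx, (h * mh, c))
                outs.append(h)
            return torch.stack(outs, dim=1)

    print(VariationalDropoutRNN()(torch.randn(2, 5, 650)).shape)
    # torch.Size([2, 5, 650])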

  18. Why does fixing the mask work? (1/3)
    • Motivation
      – dropout for RNNs had been largely heuristic; [Gal+ 16] derives it
        from approximate Bayesian inference
      – resampling the dropout mask at every time step hurts recurrent
        networks
    • Bayesian treatment of the parameters
      – treat the weights θ as random variables with posterior p(θ | X, Y)
      – the exact posterior is intractable
      – introduce an approximating distribution q(θ) and minimize
        KL( q(θ) || p(θ | X, Y) )

  19. Why does fixing the mask work? (2/3)
    • Minimizing KL( q(θ) || p(θ | X, Y) ) is equivalent to maximizing the
      variational lower bound
    • The network computes y = f_θ(x) with θ sampled from q(θ)
      – for each input x_i, draw one sample θ̂_i ~ q(θ) and evaluate the
        likelihood
      – one sample per sequence, not per time step
    • With a suitable q(θ), this procedure is exactly training with dropout

  20. Why does fixing the mask work? (3/3)
    • Choose q(θ) so that sampling from it amounts to dropout
      – factorize q(θ) over rows θ_k of each weight matrix
      – q(θ_k): a mixture of two Gaussians, one centered at a learned mean
        m_k and one at 0
      – with small variance, a sample is (approximately) m_k times a
        Bernoulli variable, i.e., a dropout mask on the weights
    • The mask corresponds to the sampled θ̂, drawn once per sequence
      → the same mask must be kept across all time steps
      → the recurrent weights (and embeddings) get dropout too
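In symbols (a sketch of the construction in [Gal+ 16]; the notation is mine):

    q(\theta_k) = p\,\mathcal{N}(\theta_k;\, 0,\, \sigma^2 I)
                + (1-p)\,\mathcal{N}(\theta_k;\, m_k,\, \sigma^2 I)

so that for small \sigma, drawing \theta \sim q(\theta) multiplies the learned weights by a Bernoulli mask, and the variational objective becomes

    \mathcal{L} = \sum_i \log p\big(y_i \mid f_{\hat{\theta}_i}(x_i)\big)
                - \mathrm{KL}\big(q(\theta)\,\|\,p(\theta)\big),
    \qquad \hat{\theta}_i \sim q(\theta),

with one sample \hat{\theta}_i per sequence, i.e., one dropout mask reused across all time steps.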

  21. Variational dropout: results
    • Strictly, predictions should average over q(θ) (Monte Carlo dropout)
      – [Gal+ 16] averages 1000 sampled masks at test time
    • Perplexity improves by 1-2 points
      – 79.7 → 78.6 (medium model), 75.2 → 73.4 (large model)
    • Now a standard ingredient in RNN models
      – simple to implement on top of existing code

  22. (AWD-LSTM overview repeated; next component: DropConnect.)

  23. DropConnect [Wan+ 13]
    • A variant of dropout at the weight level
      – dropout zeroes each unit's output with probability p
      – DropConnect zeroes each individual weight with probability p
        (finer-grained masking)
    • AWD-LSTM applies DropConnect to the LSTM's hidden-to-hidden
      (recurrent) weight matrices
      – the mask is applied to the weights before the forward pass, so a
        fast black-box LSTM implementation can be used unmodified
    [Figure: a = (M ⊙ W) v, with mask M ~ Bernoulli(1 - p) of the same
     shape as W]
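A simplified sketch of the weight-drop idea on a single linear map (my illustration, not the reference implementation):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DropConnectLinear(nn.Module):
        # DropConnect [Wan+ 13]: mask individual *weights* (not unit
        # outputs) with probability p, rescaled so expectations match.
        def __init__(self, d_in, d_out, p=0.5):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.01)
            self.p = p

        def forward(self, v):
            mask = torch.bernoulli(torch.full_like(self.weight, 1 - self.p))
            return F.linear(v, self.weight * mask / (1 - self.p))

    # AWD-LSTM masks the hidden-to-hidden matrix this way once per forward
    # pass, leaving the LSTM kernel itself untouched.
    out = DropConnectLinear(1150, 4 * 1150)(torch.randn(2, 1150))
    print(out.shape)  # torch.Size([2, 4600])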

  24. (AWD-LSTM overview repeated; next component: Weight tying.)

  25. Weight tying [Inan+ 17, Press+ 17]
    • Share the input word-embedding matrix and the output (softmax)
      weight matrix: E = Wᵀ
      – fewer parameters, better perplexity
    • [Inan+ 17] also motivates the tying theoretically
      – via an augmented loss relating output distributions to embedding
        similarity
    [Figure: 1-hot x_t → embedding E → n-layer LSTM f_n → output
     projection W → softmax; E and W are the same matrix]
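A minimal sketch of the tying itself (my illustration; note it requires the hidden size to equal the embedding size):

    import torch
    import torch.nn as nn

    class TiedLM(nn.Module):
        # Weight tying: the output projection reuses the input embedding
        # matrix (E = W^T), as in [Inan+ 17, Press+ 17].
        def __init__(self, vocab=10000, dim=650):
            super().__init__()
            self.embed = nn.Embedding(vocab, dim)
            self.lstm = nn.LSTM(dim, dim, batch_first=True)  # hidden == emb
            self.out = nn.Linear(dim, vocab, bias=False)
            self.out.weight = self.embed.weight  # <- the tying: one matrix

        def forward(self, tokens):
            h, _ = self.lstm(self.embed(tokens))
            return self.out(h)

    model = TiedLM()
    assert model.out.weight is model.embed.weight  # shared storage
    print(model(torch.randint(0, 10000, (1, 4))).shape)  # (1, 4, 10000)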

  26. (AWD-LSTM overview repeated; next component: Averaged SGD.)

  27. Averaged SGD (ASGD) [Polyak+ 92]
    • Return the average of the parameters visited during optimization
      – plain SGD: return the final iterate θ_t
      – ASGD: return the average of θ_t over the trailing iterations
    • In AWD-LSTM, training starts with plain SGD and switches to ASGD
      once the validation perplexity stops improving
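A sketch of both pieces (my illustration of the switching rule described in [Merity+ 18]; in practice PyTorch's built-in torch.optim.ASGD can play the averaging role):

    import torch

    def asgd_average(checkpoints):
        # ASGD returns the average of the parameter snapshots visited
        # during optimization instead of the final iterate.
        return [torch.stack(ws).mean(dim=0) for ws in zip(*checkpoints)]

    def should_switch(val_ppls, n=5):
        # Non-monotonic trigger (sketch): switch to averaging once the
        # validation perplexity has not improved for n consecutive checks.
        best_so_far = min(val_ppls[:-n]) if len(val_ppls) > n else float("inf")
        return len(val_ppls) > n and min(val_ppls[-n:]) > best_so_far

    ckpts = [[torch.tensor([1.0, 2.0])], [torch.tensor([3.0, 4.0])]]
    print(asgd_average(ckpts))  # [tensor([2., 3.])]
    print(should_switch([60, 58, 57, 57.5, 57.4, 57.6, 57.9, 58.0, 57.8, 57.7]))
    # True: no improvement over 57 in the last five checks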

  28. Perplexity on PTB over the years (the table from slide 11, shown
    again before turning to the output layer: Mixture of Softmaxes.)

  29. Mixture of Softmaxes
    • A single softmax on top of the LSTM limits expressiveness
      – viewed as matrix factorization, a single softmax caps the rank of
        the log-probability matrix: the softmax bottleneck [Yang+ 18]
    • MoS [Yang+ 18]: compute several softmax distributions P1, P2, P3, …
      from the final LSTM layer and output their weighted mixture
      – 15 distributions
    • DOC [Takase+ 18]: also draw distributions from intermediate layers
      – 20 distributions in total
    [Figure: a single softmax P vs. a weighted mixture of P1, P2, P3
     computed from the (final and intermediate) LSTM layers]
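A compact sketch of the MoS output layer (my illustration of the structure in [Yang+ 18], with the paper's K = 15):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MixtureOfSoftmaxes(nn.Module):
        # K softmaxes over the vocabulary, mixed with input-dependent
        # weights; the mixture is full-rank even though each component
        # alone is rank-limited.
        def __init__(self, hidden=650, vocab=10000, k=15):
            super().__init__()
            self.k = k
            self.prior = nn.Linear(hidden, k)            # mixture weights
            self.latent = nn.Linear(hidden, k * hidden)  # per-component context
            self.decoder = nn.Linear(hidden, vocab)      # shared projection

        def forward(self, h):                            # h: (batch, hidden)
            pi = F.softmax(self.prior(h), dim=-1)        # (batch, k)
            hk = torch.tanh(self.latent(h)).view(-1, self.k, h.size(-1))
            probs = F.softmax(self.decoder(hk), dim=-1)  # (batch, k, vocab)
            return (pi.unsqueeze(-1) * probs).sum(dim=1) # (batch, vocab)

    p = MixtureOfSoftmaxes()(torch.randn(2, 650))
    print(p.shape, float(p.sum(dim=-1)[0]))  # (2, 10000), ~1.0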

  30. Mixture of Softmaxes FAQ
    • Isn't computing multiple softmaxes slow?
      – Yes: K softmaxes are computed instead of one
      – [Yang+ 18] uses K = 15
      – training time grows on the order of 5 ~ 10%
    • Do the parameters increase?
      – Yes: the mixture weights and per-component projections are added
      – negligible at PTB scale
    • Isn't it just an ensemble?
      – Yes, in spirit: an ensemble of output layers over a shared LSTM,
        which is far cheaper than an ensemble of full models

  31. Do these techniques help on other tasks?
    • Variational dropout
      – widely available; e.g., implemented in OpenNMT (lua)
    • DropConnect
      – likewise available in common toolkits
    • Weight tying
      – standard in the Transformer [Vaswani+ 17]
    • ASGD
      – checkpoint averaging, as used when training the Transformer
        [Vaswani+ 17], plays a similar role
      – about +0.2 BLEU on En-Cz [Popel+ 18]
    • MoS
      – added to a multi-layer LSTM encoder-decoder in OpenNMT:
        +1.7 BLEU on En-Fr (IWSLT 2016) [Takase+ 18]




  32. Summary of this part
    • Steady perplexity improvements on PTB
      – though recent gains have become smaller …
    • LSTM + strong regularization is the backbone
      – representative model: AWD-LSTM [Merity+ 18]
        • Variational dropout, DropConnect, Weight tying, ASGD
      – hyperparameter choices matter as much as the architecture
    • The same techniques transfer to other tasks such as translation

  33. Outline (shown again; next: more recent results)

  34. PTB results after AWD-LSTM
    PTB test perplexity (lower is better)
    57.3  AWD-LSTM [Merity+ 18]
    56.8  AWD-LSTM + Fraternal Dropout [Zołna+ 18]
    56.5  AWD-LSTM (rerun) [Merity+ 18]
    56.1  AWD-LSTM + FRAGE [Gong+ 18]
    55.7  DARTS (architecture search) [Liu+ 19]
    54.4  AWD-LSTM-MoS [Yang+ 18]
    53.3  AWD-LSTM-MoS + FRAGE [Gong+ 18]
    52.4  AWD-LSTM-DOC [Takase+ 18]
    48.4  AWD-LSTM-MoS + Dynamic evaluation (rerun) [Yang+ 18]
    47.7  AWD-LSTM-MoS + Dynamic evaluation [Yang+ 18]
    47.7  AWD-LSTM-MoS (SigSoftmax) + Dynamic evaluation [Kanai+ 18]
    47.7  AWD-LSTM-MoS + FRAGE + Dynamic evaluation (rerun) [Gong+ 18]
    47.2  AWD-LSTM-DOC (Ensemble) [Takase+ 18]
    46.5  AWD-LSTM-MoS + FRAGE + Dynamic evaluation [Gong+ 18]

  35. Dynamic evaluation [Krause+ 17]
    • Keep updating the model while evaluating
      – score a segment of the test data, then update the parameters on it
        by gradient descent before scoring the next segment
    • Similar in spirit to cache models [Kuhn+ 90, Grave+ 17]
      – adapts the model to the recent context
    • Large perplexity gains, but the model is being fit to the evaluation
      data as it goes
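A sketch of the evaluation loop (my illustration with a plain SGD update; [Krause+ 17] uses a more elaborate update rule with decay toward the original parameters, and `model(...)` is assumed to return (batch, time, vocab) logits):

    import torch
    import torch.nn.functional as F

    def dynamic_eval(model, segments, lr=1e-4):
        # Score each test segment, THEN take a gradient step on it before
        # scoring the next, so the model adapts to the recent context.
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        total_nll, total_tokens = 0.0, 0
        for inp, tgt in segments:              # inp, tgt: (batch, time)
            logits = model(inp)
            loss = F.cross_entropy(logits.flatten(0, 1), tgt.flatten())
            total_nll += loss.item() * tgt.numel()   # scored before update
            total_tokens += tgt.numel()
            opt.zero_grad()
            loss.backward()                    # update on *test* data ...
            opt.step()                         # ... before the next segment
        return torch.exp(torch.tensor(total_nll / total_tokens))  # perplexity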

  36. Caveats on dynamic evaluation
    • Reported numbers are hard to reproduce exactly
      – the update rule is sensitive to its hyperparameters
      – reproduction attempts:
        • AWD-LSTM-MoS: 47.7 (reported) → 48.4 (reproduction)
        • AWD-LSTM-MoS + FRAGE: 46.5 (reported) → 47.7 (reproduction)
      – differences of about 1 perplexity point should not be over-read
    • Conceptual caveats
      – the model is updated on the evaluation data itself
      – methods with and without dynamic evaluation should be compared
        separately

  37. (Same results table as slide 34, shown again.)

  38. Fraternal Dropout [Zołna+ 18]
    • Feed the same input through the network under two different dropout
      masks
    • Add a term penalizing the difference between the two predictions
      – encourages predictions that do not depend on the particular
        dropout mask
    • Result: perplexity 57.3 → 56.8 on PTB
    • Notes
      – built on top of AWD-LSTM …
      – still above AWD-LSTM-DOC
    [p1, p2: the two dropout masks; the squared difference between the two
     predictions is added to the loss]
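A sketch of the combined loss (my illustration; `model` is assumed to return (batch, time, vocab) logits with dropout active, and `kappa` is the regularization strength):

    import torch
    import torch.nn.functional as F

    def fraternal_dropout_loss(model, inp, tgt, kappa=0.1):
        # Run the same input twice; independent dropout masks are sampled
        # on each call, and the prediction gap is penalized [Zołna+ 18].
        logits1 = model(inp)   # mask 1
        logits2 = model(inp)   # mask 2 (resampled)
        nll = 0.5 * (F.cross_entropy(logits1.flatten(0, 1), tgt.flatten())
                     + F.cross_entropy(logits2.flatten(0, 1), tgt.flatten()))
        consistency = (logits1 - logits2).pow(2).mean()
        return nll + kappa * consistency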

  39. FRAGE [Gong+ 18]
    • Learned word embeddings encode word frequency: frequent and rare
      words occupy different regions of the space
    • Adversarial training removes this: a discriminator tries to classify
      each embedding as frequent or rare, and the embeddings are trained
      to fool it (frequency-agnostic embeddings)
    • Result: perplexity 57.3 → 56.1
      – the AWD-LSTM rerun gives 56.5
    [Figure: task loss L_T plus discriminator loss L_D, trained jointly as
     L_T + L_D; θ_T: task model, θ_D: discriminator]
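A sketch of the adversarial objective (my illustration; `disc`, `frage_losses`, and `lam` are names I introduce for exposition):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # A discriminator tries to tell frequent-word embeddings from
    # rare-word ones; the embeddings are trained on the task loss AND to
    # fool the discriminator, removing frequency from the embedding space.
    disc = nn.Linear(650, 1)  # hypothetical discriminator over embeddings

    def frage_losses(task_loss, emb_frequent, emb_rare, lam=0.1):
        logits = torch.cat([disc(emb_frequent), disc(emb_rare)]).squeeze(-1)
        labels = torch.cat([torch.ones(len(emb_frequent)),
                            torch.zeros(len(emb_rare))])
        l_disc = F.binary_cross_entropy_with_logits(logits, labels)
        # discriminator minimizes l_disc;
        # task model + embeddings minimize task_loss - lam * l_disc
        return l_disc, task_loss - lam * l_disc

    l_disc, l_task = frage_losses(torch.tensor(1.0),
                                  torch.randn(4, 650), torch.randn(4, 650))
    print(float(l_disc), float(l_task))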

  40. AWD-LSTM-MoS + FRAGE
    • The gain is hard to reproduce here as well
      – perplexity: 53.3 (reported) → 53.8 (reproduction)
    • A comparable gain over AWD-LSTM-MoS comes more simply
      – adding one more dropout already gives 53.6
    • Still does not reach AWD-LSTM-DOC (52.4)


  41. Summary of recent results
    • Perplexity on PTB is still improving
      – but the steps are small
      – and several results are hard to reproduce
    • Improvements concentrate on regularization and the output layer
      rather than on new architectures
    • Evaluation practice matters: results with and without dynamic
      evaluation should be reported separately

  42. Aside …
    [slide text largely garbled in this transcript]
    • Mentions very large language models such as GPT-2 [Radford+ 19] and
      the fluency of their generated text
    • …

  43. Outline (shown again; next: relation to large-scale pretraining)

  44. Pretraining on large corpora
    • Train a language model on billions of words
      – 1 billion word corpus, English Wikipedia, …
    • Two common objectives:
      Left-to-right LM: predict the next word
        I have a → P(have | I) P(a | I have) P(dream | I have a)
        (LSTM: ELMo; Transformer: GPT)
      Masked LM: predict a masked word from both directions
        I [MASK] a → have
        (BERT)
    • The learned representations transfer to downstream tasks …
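A sketch contrasting the two objectives (my illustration; `lm` and `masked_lm` are stand-ins assumed to return (batch, time, vocab) logits):

    import torch
    import torch.nn.functional as F

    def next_word_loss(lm, tokens):
        # Left-to-right LM (ELMo/GPT style): predict token t+1 from <= t.
        logits = lm(tokens[:, :-1])
        return F.cross_entropy(logits.flatten(0, 1), tokens[:, 1:].flatten())

    def masked_lm_loss(masked_lm, tokens, mask_id, p=0.15):
        # Masked LM (BERT style): hide ~15% of tokens and predict them
        # from both left and right context.
        mask = torch.rand(tokens.shape) < p
        corrupted = tokens.masked_fill(mask, mask_id)
        logits = masked_lm(corrupted)
        return F.cross_entropy(logits[mask], tokens[mask])

    toy = lambda toks: torch.randn(toks.size(0), toks.size(1), 100)  # stand-in
    print(next_word_loss(toy, torch.randint(0, 100, (2, 8))))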

  45. From word2vec to pretrained language models
    • The input embeddings of an RNN LM capture linguistic regularities
      [Mikolov+ 13]
      – man : woman :: king : queen
    • skip-gram / CBOW [Mikolov+ 13]: strip away the RNN to scale up
      – trained on the 1 billion word corpus
    • LSTM LM hidden states as features for sequence tagging [Peters+ 17]
    • ELMo [Peters+ 18]
      – bidirectional LSTM LMs trained on the 1 billion word corpus
      – internal states serve as contextualized word embeddings
      – e.g., "I have a dream that …": each word is embedded in context
    • BERT [Devlin+ 19]
      – objective: masked word prediction; encoder: LSTM → Transformer

  46. Do the PTB techniques matter for pretraining?
    • Pretrained models (ELMo, BERT) are direct descendants of language
      modeling research
      – word2vec itself grew out of RNN language models
    • But techniques tuned on PTB do not automatically transfer
      – PTB methods mainly fight overfitting on small data
        • e.g., AWD-LSTM-DOC [Takase+ 18] improves over AWD-LSTM
          [Merity+ 18] at extra computational cost
        • heavy regularization such as variational dropout matters less
          when data is abundant
      – with large data, training cost becomes the bottleneck
    • Recent large-scale LMs are Transformer-based
      – Transformer-XL [Dai+ 19]: PTB 54.44 (vs. 57.3 for AWD-LSTM)
    • …


  47. Summary
    • Surveyed research trends in neural language modeling
      – LSTM + strong regularization: AWD-LSTM [Merity+ 18]
      – output-layer improvements: MoS / DOC [Yang+ 18, Takase+ 18]
    • Recent gains on PTB are small
      – and reproducibility deserves attention
    • Connection to large-scale pretraining (ELMo, BERT)
      – transferring the small-data techniques is an open question

  48. References (1/5)
    • Mikolov et al., Empirical Evaluation and Combination of
    Advanced Language Modeling Techniques. INTERSPEECH
    2011.
    • Mikolov et al., Context Dependent Recurrent Neural Network
    Language Model. SLT 2012.
    • Zaremba et al., Recurrent Neural Network Regularization.
    2014.
    • Gal et al., A Theoretically Grounded Application of Dropout in
    Recurrent Neural Networks. NIPS 2016.
    • Zilly et al., Recurrent Highway Networks. ICML 2017.
    • Inan et al., Tying Word Vectors and Word Classifiers: A Loss
    Framework for Language Modeling. ICLR 2017.
• Takase et al., Input-to-Output Gate to Improve RNN
    Language Models. IJCNLP 2017.

  49. References (2/5)
    • Zoph et al., Neural Architecture Search with
    Reinforcement Learning. ICLR 2017.
    • Lei et al., Simple Recurrent Units for Highly
    Parallelizable Recurrence. EMNLP 2018.
    • Melis et al., On the state of the art of evaluation in
    neural language models. ICLR 2018.
    • Merity et al., Regularizing and Optimizing LSTM
    Language Models. ICLR 2018.
    • Yang et al., Breaking the softmax bottleneck: A
    high-rank RNN language model. ICLR 2018.
    • Takase et al., Direct Output Connection for a High-
    Rank Language Model. EMNLP 2018.

  50. References (3/5)
    • Wan et al., Regularization of Neural Networks
    using DropConnect. ICML 2013.
    • Press et al., Using the Output Embedding to
    Improve Language Models. EACL 2017.
    • Polyak et al., Acceleration of Stochastic
    Approximation by Averaging. SIAM Journal
    on Control and Optimization 1992.
    • Vaswani et al., Attention Is All You Need.
    NIPS 2017.
    • Popel et al., Training Tips for the Transformer
    Model. PBML 2018.

  51. References (4/5)
    • Zołna et al., Fraternal dropout. ICLR 2018.
    • Gong et al., FRAGE: Frequency-Agnostic Word
    Representation. NIPS 2018.
• Liu et al., DARTS: Differentiable Architecture
    Search. ICLR 2019.
    • Kanai et al., Sigsoftmax: Reanalysis of the Softmax
    Bottleneck. NIPS 2018.
    • Krause et al., Dynamic Evaluation of Neural Sequence
    Models. 2017.
    • Kuhn et al., A cache-based natural language model for
    speech recognition. PAMI 1990.
    • Grave et al., Improving Neural Language Models with
    a Continuous Cache. ICLR 2017.

  52. References (5/5)
    • Radford et al., Language Models are Unsupervised
    Multitask Learners. 2019.
    • Mikolov et al., Linguistic Regularities in Continuous
    Space Word Representations. NAACL 2013.
    • Mikolov et al., Distributed Representations of Words
    and Phrases and their Compositionality. NIPS 2013.
• Peters et al., Semi-supervised sequence tagging with
    bidirectional language models. ACL 2017.
    • Peters et al., Deep Contextualized Word
    Representations. NAACL 2018.
    • Devlin et al., BERT: Pre-training of Deep Bidirectional
    Transformers for Language Understanding. NAACL
    2019.