
Research Trends in Neural Language Models (invited talk slides for NL研)

Sho Takase

June 13, 2019

Transcript

1. Research Trends in Neural Language Models
2019/6/13
Sho Takase
https://takase.github.io/
2. About the speaker
• 2008 - 2017: university (bachelor's through PhD)
• 2017 - 2018: NTT CS Labs (research associate)
• 2018 - : current position
• Main topics: language modeling [IJCNLP 17, EMNLP 18, AAAI 19], text generation [EMNLP 16, NAACL 19], and representation learning [ACL 16]
3. Outline
• Progress on standard language-modeling benchmarks (up to around October 2018)
• The latest results on those benchmarks and how to read them
• Language models trained on large corpora for pretraining

4. (outline slide, repeated)
5. What is a language model?
• A model that assigns a probability to a word sequence
  – P(I have a dream) > P(a have I dream) > P(fuga spam hoge …)
  – natural word sequences should receive higher probability
• Evaluated with perplexity
  – roughly, the number of candidate words the model is still hesitating between at each position
  – lower perplexity → better language model
• An RNN language model (the same architecture as the decoder of an RNN encoder-decoder) factorizes the probability with the chain rule:
  P(I have a dream) = P(I) P(have | I) P(a | I have) P(dream | I have a)
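To make these two quantities concrete, here is a minimal sketch in plain Python; the per-token probabilities are invented numbers, not the output of a real model.

```python
import math

# Hypothetical conditional probabilities for "I have a dream":
# P(I), P(have | I), P(a | I have), P(dream | I have a)
token_probs = [0.02, 0.10, 0.30, 0.05]

# Chain rule: the sentence probability is the product of the conditionals.
sentence_prob = math.prod(token_probs)

# Perplexity: exponentiated average negative log-probability per token.
# Lower is better; a model that guesses uniformly over a 10,000-word
# vocabulary has perplexity 10,000.
perplexity = math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

print(sentence_prob, perplexity)
```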
6. Why language models matter
• Noisy channel model
  – the language-model prior P(T) scores how fluent a candidate output text is
  – used in, e.g., speech recognition and machine translation pipelines
• Pretraining of text representations
  – Skip-gram, ELMo, and BERT are all trained with language-model-style objectives
7. Datasets used in language-modeling research
• Small corpora
  – Penn Treebank (PTB), WikiText-2
  – cheap to experiment on, so most modeling and regularization techniques are developed and compared here
• Large corpora
  – WikiText-103, 1 billion word corpus
  – used for large-scale training such as ELMo and BERT
  – experiments are expensive, so fewer modeling techniques are compared on them

8. (dataset overview slide, repeated)
9. The standard benchmark: Penn Treebank (PTB)
• The most widely used language-modeling dataset
  – the Wall Street Journal portion of the Penn Treebank
  – with the preprocessing of [Mikolov+ 11]
• Vocabulary size of 10,000
  – number tokens are replaced with N (e.g. 10 million → N million)
  – infrequent words are replaced with <unk>
• Training set of 887,521 words
  – roughly 1/1000 the size of the 1 billion word corpus
  – small enough for quick experiments
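The preprocessing above can be approximated with a short script. This is only a rough sketch of the PTB conventions (the helper name and the regular expression are mine, not from [Mikolov+ 11]): lowercase the text, map number tokens to N, keep the 10,000 most frequent words, and replace everything else with <unk>.

```python
import re
from collections import Counter

def ptb_preprocess(sentences, vocab_size=10_000):
    """Rough sketch of PTB-style preprocessing: lowercase, number tokens -> N,
    keep the most frequent words, replace the rest with <unk>."""
    tokenized = [
        ["N" if re.fullmatch(r"\d[\d.,]*", w) else w for w in s.lower().split()]
        for s in sentences
    ]
    counts = Counter(w for tokens in tokenized for w in tokens)
    vocab = {w for w, _ in counts.most_common(vocab_size)}
    return [[w if w in vocab else "<unk>" for w in tokens] for tokens in tokenized]

print(ptb_preprocess(["The index rose 5.2 % to 2643.65 yesterday"]))
```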
10. Perplexity on PTB over the years
2012  141.2  Kneser-Ney smoothed 5-gram
2012  124.7  RNN language model [Mikolov+ 12]
2014   78.4  LSTM language model [Zaremba+ 14]
2016   75.0  LSTM language model with variational dropout [Gal+ 16]
2017   68.5  Recurrent Highway Network [Zilly+ 17]
2017   66.1  tying the input and output embeddings [Inan+ 17]
2017   64.4  input-to-output gate [Takase+ 17]
2017   60.3  Simple Recurrent Unit [Lei+ 17]
2017   57.3  heavily regularized and tuned LSTM language model [Merity+ 18]
2017   54.4  Mixture of Softmaxes [Yang+ 18]
2018   52.4  Mixture of Softmaxes with direct output connections [Takase+ 18]
2018   47.2  ensemble [Takase+ 18]
For comparison: 62.4 with architecture search by reinforcement learning [Zoph+ 17]; 58.3 with an LSTM language model under large-scale hyperparameter search [Melis+ 18]
→ A well-regularized, well-tuned LSTM language model remains stronger than architecture search.

11. (perplexity table, repeated)
12. ASGD Weight-Dropped LSTM (AWD-LSTM) [Merity+ 18]
• Combines existing regularization and optimization techniques
  – Variational dropout [Gal+ 16]
  – DropConnect [Wan+ 13]
  – Weight tying [Inan+ 17, Press+ 17]
  – Averaged SGD [Polyak+ 92]
• Also revisits the training hyperparameters
• Main differences from the standard LSTM language model [Zaremba+ 14] on PTB:
  – learning rate: 1 → 30
  – training epochs: about 40 → 500 ~ 1000
  – gradient clipping: 5 → 0.25

13. (AWD-LSTM overview slide, repeated)
14. LSTM language model [Zaremba+ 14]
• An LSTM encodes the history and predicts the next word
  – unlike an N-gram model, the context is not limited to the previous N − 1 words
  – perplexity on PTB: 141.2 (Kneser-Ney 5-gram) → 78.4
• Dropout is applied only to the non-recurrent (layer-to-layer) connections
  – a fresh mask is sampled at every application, and kept activations are scaled by 1/(1 − p)
  – the recurrent (time-step to time-step) connections are left untouched
(figure: unrolled LSTM computing P(I), P(have | I), P(a | I have) from <BOS> I have, with the recurrence h_t = f(x_t, h_{t-1}))
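A minimal PyTorch sketch of this recipe (layer sizes and the dropout rate are placeholders, not the exact values of [Zaremba+ 14]): an embedding, stacked LSTMs, and dropout applied only to the non-recurrent, layer-to-layer connections.

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, emb_dim=650, hidden_dim=650, num_layers=2, dropout=0.5):
        super().__init__()
        self.drop = nn.Dropout(dropout)                 # non-recurrent connections only
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # nn.LSTM's `dropout` argument is applied between layers, not on the
        # recurrent (time-step to time-step) connections, matching [Zaremba+ 14].
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers,
                            dropout=dropout, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, hidden=None):
        x = self.drop(self.embedding(tokens))
        h, hidden = self.lstm(x, hidden)
        return self.out(self.drop(h)), hidden

model = LSTMLanguageModel(vocab_size=10_000)
logits, _ = model(torch.randint(0, 10_000, (8, 35)))   # (batch, seq_len, vocab)
```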
15. (AWD-LSTM overview slide, repeated)
16. Variational dropout [Gal+ 16]
• Reuse the same dropout mask at every time step instead of resampling it
  – the mask is sampled once per sequence and applied identically at each step
  – dropout inside recurrent connections was also studied by [Moon+ 15]
• Justified by a Bayesian (variational) interpretation of dropout
(figure: ordinary dropout samples a new mask at every time step; variational dropout applies one fixed mask of dimension d across the whole sequence)
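A sketch of the mechanism as a reusable module (this mirrors the "locked dropout" found in later AWD-LSTM implementations, but the class below is my own simplification): sample one mask per sequence and reuse it at every time step.

```python
import torch
import torch.nn as nn

class LockedDropout(nn.Module):
    """Variational dropout over a (batch, seq_len, dim) tensor:
    one mask per example, shared across all time steps."""
    def forward(self, x, p=0.5):
        if not self.training or p == 0:
            return x
        # Sample the mask once for the whole sequence (note the size-1 time axis).
        mask = x.new_empty(x.size(0), 1, x.size(2)).bernoulli_(1 - p) / (1 - p)
        return x * mask   # broadcast the same mask to every time step

x = torch.randn(8, 35, 650)
drop = LockedDropout()
drop.train()
y = drop(x, p=0.5)
```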
17. Background: a Bayesian view of neural networks
• Treat the parameters θ as random variables
  – infer the posterior p(θ | X, Y) from the training data (X, Y)
  – this gives a handle on model uncertainty
• The exact posterior is intractable
  – approximate it with a simpler distribution q(θ)
  – fit q(θ) by minimizing KL( q(θ) || p(θ | X, Y) )

18. Background: minimizing KL( q(θ) || p(θ | X, Y) )
• For a network fθ with predictions y = fθ(x), the objective is estimated by sampling
  – sample one set of parameters θ from q(θ) per training input xi
  – use that single sample for the whole sequence xi
  – average the resulting log-likelihood terms over the training data
• Sampling parameters per input is exactly what dropout masks do
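Written out with standard variational-inference identities (the equation below is reconstructed from the KL objective on these slides, not copied from them), minimizing the KL divergence to the posterior is the same as maximizing the evidence lower bound, whose first term is what sampled dropout masks estimate:

```latex
\min_{q}\;\mathrm{KL}\!\left(q(\theta)\,\middle\|\,p(\theta \mid X, Y)\right)
\quad\Longleftrightarrow\quad
\max_{q}\;\mathbb{E}_{q(\theta)}\!\left[\log p(Y \mid X, \theta)\right]
\;-\;\mathrm{KL}\!\left(q(\theta)\,\middle\|\,p(\theta)\right)
```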
19. Background: dropout as the approximating distribution
• Choose q(θ) so that sampling from it is the same as applying a dropout mask
  – rows of each weight matrix θk are randomly replaced by zero with probability p, otherwise kept at their learned values mk
  – drawing one θ ~ q(θ) for an input therefore corresponds to drawing one dropout mask for that input
• Consequence: within a single input sequence, one sample of θ, i.e. one dropout mask, must be used at every time step
20. Effect of variational dropout
• Dropout can now also be applied to the recurrent connections and to the word embeddings
• [Gal+ 16] additionally average predictions over sampled dropout masks at test time (1000 samples)
• Perplexity improves: 79.7 → 78.6 and 75.2 → 73.4
• The idea applies to recurrent networks in general, not only to this particular LSTM setup
21. (AWD-LSTM overview slide, repeated)
22. DropConnect [Wan+ 13]
• A variant of dropout
• Ordinary dropout zeroes each activation with probability p
• DropConnect instead zeroes each weight (connection) with probability p
  – a mask M sampled from Bernoulli(1 − p) is applied to the weight matrix
• AWD-LSTM applies DropConnect to the LSTM's hidden-to-hidden (recurrent) weight matrices
  – this regularizes the recurrent connections without touching the fast LSTM implementation itself
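A small sketch of the difference (tensor shapes are placeholders; the scaling by 1/(1 − p) follows the usual inverted-dropout convention rather than the original DropConnect inference scheme): DropConnect masks entries of the weight matrix, while ordinary dropout masks the activations.

```python
import torch
import torch.nn as nn

p = 0.5   # drop probability

# DropConnect: zero entries of the weight matrix itself
# (in AWD-LSTM this is applied to the LSTM's hidden-to-hidden matrices).
weight_raw = nn.Parameter(torch.randn(650, 650))
weight_mask = torch.bernoulli(torch.full_like(weight_raw, 1 - p))
weight_dropped = weight_raw * weight_mask / (1 - p)   # use this weight in the forward pass

# Ordinary dropout, for contrast: zero entries of the activations.
h = torch.randn(8, 650)
h_mask = torch.bernoulli(torch.full_like(h, 1 - p))
h_dropped = h * h_mask / (1 - p)
```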
23. (AWD-LSTM overview slide, repeated)
24. Weight tying [Inan+ 17, Press+ 17]
• Share the parameters of the input word embedding and the output (softmax) projection: E = Wᵀ
• Fewer parameters and a regularization effect, with equal or better perplexity
• [Inan+ 17] also give a theoretical justification and an additional loss term on the output distribution
(figure: the 1-hot input xt is embedded with E in an n-layer LSTM fn, and the output layer reuses the same matrix, E = Wᵀ)
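A PyTorch sketch of weight tying (sizes are placeholders; tying requires the embedding and hidden dimensions to match, or an extra projection): the output layer's weight is literally the same Parameter as the input embedding.

```python
import torch
import torch.nn as nn

class TiedLM(nn.Module):
    def __init__(self, vocab_size=10_000, dim=650):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size, bias=False)
        self.out.weight = self.embedding.weight   # weight tying: one shared matrix

    def forward(self, tokens):
        h, _ = self.lstm(self.embedding(tokens))
        return self.out(h)

model = TiedLM()
logits = model(torch.randint(0, 10_000, (8, 35)))
```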
25. (AWD-LSTM overview slide, repeated)
26. Averaged SGD (ASGD) [Polyak+ 92]
• Plain SGD uses the parameters of the final iterate θt
• ASGD instead uses the average of the parameters over the SGD iterates
• AWD-LSTM trains with plain SGD and switches to ASGD once the validation perplexity stops improving
• The averaged parameters give better perplexity than the raw SGD iterates
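A sketch of the idea in a plain training loop (the interface is hypothetical: model(x) is assumed to return logits directly, and averaging starts from the first step for brevity, whereas AWD-LSTM only switches to averaging after a non-monotonic validation trigger):

```python
import copy
import torch

def asgd_train(model, loss_fn, data_iter, lr=30.0, steps=1000):
    """Averaged SGD sketch: run plain SGD, keep a running average of the
    iterates, and return a copy of the model that uses the averaged weights."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    averaged = [p.detach().clone() for p in model.parameters()]
    for t, (x, y) in zip(range(1, steps + 1), data_iter):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
        with torch.no_grad():
            # Running average of the iterates: avg <- avg + (theta_t - avg) / t
            for avg, p in zip(averaged, model.parameters()):
                avg += (p - avg) / t
    eval_model = copy.deepcopy(model)
    with torch.no_grad():
        for p, avg in zip(eval_model.parameters(), averaged):
            p.copy_(avg)
    return eval_model
```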
27. (perplexity-on-PTB table, repeated)
28. Mixture of Softmaxes [Yang+ 18]
• Compute several softmax distributions from the LSTM state and mix them, instead of a single softmax
  – a single softmax limits the expressiveness of the output distribution (the softmax bottleneck)
  – the mixture weights are themselves predicted from the hidden state
• [Takase+ 18] extend this by computing softmaxes from different layers and connecting them directly to the output (direct output connections)
(figure: component distributions P1, P2, P3 computed from the LSTM and combined into the final P)
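A PyTorch sketch of a mixture-of-softmaxes output layer (dimensions, the number of components, and the tanh projection are placeholders in the spirit of [Yang+ 18], not their exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSoftmaxes(nn.Module):
    def __init__(self, hidden_dim=650, vocab_size=10_000, n_components=3):
        super().__init__()
        self.n = n_components
        self.prior = nn.Linear(hidden_dim, n_components)             # mixture weights pi_k
        self.latent = nn.Linear(hidden_dim, n_components * hidden_dim)
        self.decoder = nn.Linear(hidden_dim, vocab_size)              # shared output projection

    def forward(self, h):                                             # h: (batch, hidden_dim)
        pi = F.softmax(self.prior(h), dim=-1)                         # (batch, K)
        z = torch.tanh(self.latent(h)).view(-1, self.n, h.size(-1))   # (batch, K, hidden_dim)
        comp = F.softmax(self.decoder(z), dim=-1)                     # K softmaxes over the vocab
        return (pi.unsqueeze(-1) * comp).sum(dim=1)                   # mixed probabilities

mos = MixtureOfSoftmaxes()
probs = mos(torch.randn(8, 650))   # (batch, vocab); each row sums to 1
```

Because the module returns probabilities rather than logits, training would use the negative log of these values directly.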
29. Mixture of Softmaxes: FAQ
• Are multiple softmaxes really necessary?
  – Yes: a single softmax does not give the improvement
  – [Yang+ 18] analyze how the number of softmax components affects the result
• How much slower is it?
  – the overhead is on the order of 5 ~ 10%
• Does memory usage grow?
  – Yes, the softmax computation grows with the number of components, but on PTB-sized vocabularies this is manageable
• Does it help other models and tasks?
  – Yes (machine-translation results are on the next slide)
30. Do these techniques carry over to other tasks?
• Variational dropout
  – applicable to encoder-decoder models
  – implemented in, e.g., OpenNMT (the lua version)
• DropConnect
  – likewise applicable to encoder-decoder models
• Weight tying
  – already used in the Transformer [Vaswani+ 17]
• ASGD
  – the Transformer [Vaswani+ 17] uses the closely related checkpoint averaging
  – checkpoint averaging gives around +0.2 BLEU on En-Cz [Popel+ 18]
• Mixture of Softmaxes
  – with an LSTM encoder-decoder (OpenNMT), about +1.7 BLEU on En-Fr (IWSLT 2016) [Takase+ 18]
31. Interim summary
• Perplexity on the standard benchmarks has improved dramatically
  – although much of the gain comes from careful engineering rather than new architectures …
• A well-regularized, well-tuned LSTM language model is still the backbone
  – extensive hyperparameter search over LSTM language models [Melis+ 18]
  – regularization: AWD-LSTM [Merity+ 18]
    • Variational dropout, DropConnect, Weight tying, ASGD
  – richer output layers on top of it [Yang+ 18, Takase+ 18]
• Several of these techniques also transfer to other tasks such as machine translation
32. (outline slide, repeated)
33. Recent results on PTB (perplexity; "reproduced" marks re-run numbers)
57.3  AWD-LSTM [Merity+ 18]
56.8  AWD-LSTM + Fraternal Dropout [Zołna+ 18]
56.5  AWD-LSTM, reproduced [Merity+ 18]
56.1  AWD-LSTM + FRAGE [Gong+ 18]
55.7  DARTS [Liu+ 19]
54.4  AWD-LSTM-MoS [Yang+ 18]
53.3  AWD-LSTM-MoS + FRAGE [Gong+ 18]
52.4  AWD-LSTM-DOC [Takase+ 18]
48.4  AWD-LSTM-MoS + Dynamic evaluation, reproduced [Yang+ 18]
47.7  AWD-LSTM-MoS + Dynamic evaluation [Yang+ 18]
47.7  AWD-LSTM-MoS (SigSoftmax) + Dynamic evaluation [Kanai+ 18]
47.7  AWD-LSTM-MoS + FRAGE + Dynamic evaluation, reproduced [Gong+ 18]
47.2  AWD-LSTM-DOC (Ensemble) [Takase+ 18]
46.5  AWD-LSTM-MoS + FRAGE + Dynamic evaluation [Gong+ 18]
34. A strong add-on: Dynamic evaluation [Krause+ 17]
• Keep updating the parameters during evaluation, using the portions of the test data that have already been scored
• A neural counterpart of cache language models [Kuhn+ 90, Grave+ 17]
  – recently seen words and topics tend to reappear, so adapting to the observed prefix helps
• Gives large perplexity improvements even on top of strong models
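A simplified sketch of the procedure (it assumes the model interface from the earlier LSTM sketch, uses plain SGD instead of the paper's RMS-style update, and ignores carrying the hidden state across segments): score each test segment, then take a gradient step on it before scoring the next one.

```python
import math
import torch
import torch.nn.functional as F

def dynamic_evaluation(model, segments, lr=1e-4):
    """Score the test stream segment by segment, adapting the parameters
    on each segment after it has been scored."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    total_nll, total_tokens = 0.0, 0
    for inputs, targets in segments:           # consecutive segments of the test stream
        logits, _ = model(inputs)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        total_nll += loss.item() * targets.numel()
        total_tokens += targets.numel()
        opt.zero_grad()
        loss.backward()                        # adapt on data that has already been scored
        opt.step()
    return math.exp(total_nll / total_tokens)  # test perplexity after adaptation
```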
35. A caveat: the reported numbers are hard to reproduce
• The recent improvements are small
  – results are sensitive to implementation details, hyperparameters, and random seeds
• Examples of reported vs. re-run numbers
  – AWD-LSTM-MoS: 47.7 reported → 48.4 reproduced
  – AWD-LSTM-MoS + FRAGE: 46.5 reported → 47.7 reproduced
• Perplexity differences of about 1 point should therefore be read with caution
• When comparing methods, re-run the released code under the same conditions
36. (recent PTB results table, repeated)
37. Fraternal Dropout [Zołna+ 18]
• A model's output depends on which dropout mask happens to be sampled
• Feed the same input through the network twice with different dropout masks
  – add a penalty that pulls the two predictions together
  – this also narrows the gap between training (with dropout) and inference (without)
• Perplexity on PTB: 57.3 → 56.8
  – note that a re-run AWD-LSTM baseline already reaches 56.5
(figure: two predictions obtained with dropout masks p1 and p2; a regularization term makes the two outputs agree)
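A sketch of the training loss (κ is a placeholder coefficient, the penalty is written on the logits, and the model interface again follows the earlier LSTM sketch): two passes with independently sampled dropout masks, the usual language-model loss on both, plus a term that penalizes their disagreement.

```python
import torch
import torch.nn.functional as F

def fraternal_dropout_loss(model, inputs, targets, kappa=0.1):
    """Two forward passes with different dropout masks; penalize disagreement
    between their predictions in addition to the usual LM loss."""
    logits1, _ = model(inputs)   # dropout masks are sampled independently per call
    logits2, _ = model(inputs)
    nll = 0.5 * (
        F.cross_entropy(logits1.reshape(-1, logits1.size(-1)), targets.reshape(-1))
        + F.cross_entropy(logits2.reshape(-1, logits2.size(-1)), targets.reshape(-1))
    )
    consistency = (logits1 - logits2).pow(2).mean()   # make the two passes agree
    return nll + kappa * consistency
```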
38. FRAGE [Gong+ 18]
• Embeddings of frequent and rare words end up in different regions of the embedding space
• Add a discriminator that tries to tell, from an embedding, whether the word is frequent or rare
  – the embeddings are trained adversarially so that the discriminator cannot tell (frequency-agnostic embeddings)
  – the task parameters θT are trained on the task loss LT together with the adversarial term, while the discriminator parameters θD are trained on LD
• Perplexity on PTB: 57.3 → 56.1
  – again, a re-run AWD-LSTM baseline reaches 56.5
(figure: the LSTM language model with losses LT and LD attached to the embeddings)
39. AWD-LSTM-MoS + FRAGE in practice
• Re-running the reported configuration
  – perplexity 53.3 reported → 53.8 when reproduced
• Re-tuning AWD-LSTM-MoS itself
  – adjusting the dropout rates alone reaches 53.6
• The comparison against AWD-LSTM-DOC is similarly sensitive to tuning
40. How should we read these differences?
• The gaps between recent methods on PTB are small
  – much of a reported gain can come from tuning and random variation rather than from the proposed technique
  – strong, carefully tuned baselines are essential
• Conclusions should be drawn from comparisons under matched conditions, not from single reported numbers
41. Meanwhile …
• Squeezing out the last few perplexity points on small benchmarks is not the whole story
• Language models trained on very large corpora are attracting attention
  – e.g. GPT-2 [Radford+ 19], which generates strikingly fluent text
• The focus is shifting from benchmark perplexity to what language models can be used for …
42. (outline slide, repeated)
43. Language models as pretraining
• Train a language model on a large raw corpus
  – 1 billion word corpus, English Wikipedia, …
• Left-to-right language models predict each next word from its prefix
  – LSTM-based (e.g. ELMo) and Transformer-based (e.g. GPT)
  – P(have | I), P(a | I have), P(dream | I have a)
• BERT instead trains a masked language model
  – some tokens are replaced with [MASK] (e.g. I [MASK] a, with the answer have) and the model predicts the originals
• The representations learned this way are transferred to downstream tasks
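A toy illustration of the two training signals in plain Python (token strings only, no model): a left-to-right language model predicts each token from its prefix, while a BERT-style masked language model hides a token and predicts it from both sides.

```python
import random

tokens = ["I", "have", "a", "dream"]

# Left-to-right LM (ELMo / GPT style): predict each token from its prefix.
lm_examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Masked LM (BERT style): hide a random token and predict it from both sides.
i = random.randrange(len(tokens))
masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
mlm_example = (masked, tokens[i])

print(lm_examples)
print(mlm_example)
```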
44. From word2vec to BERT
• [Mikolov+ 13] observe that the word vectors of an RNN language model capture analogies such as man : woman ≈ king : queen
• word2vec (skip-gram and CBOW) [Mikolov+ 13] simplifies the model so that it can be trained on large corpora such as the 1 billion word corpus
• Pretrained LSTM language models improve sequence tagging [Peters+ 17]
• ELMo [Peters+ 18]: forward and backward LSTM language models trained on the 1 billion word corpus provide contextualized word representations
• BERT [Devlin+ 19]: replaces the LSTMs with a Transformer and uses masked language modeling, so each prediction can use both the left and the right context (e.g. the word between "I have" and "dream that")
45. How does this relate to the first half of the talk?
• Pretrained models such as ELMo and BERT are trained on far larger corpora than PTB
  – with that much data, heavy regularization matters less
• The two lines of work are nevertheless connected
  – word2vec itself grew out of RNN language-model research
  – regularizers such as variational dropout become relevant again when the training data is limited
• On small benchmarks, the carefully regularized LSTMs are still strong
  – e.g. AWD-LSTM-DOC [Takase+ 18] and AWD-LSTM [Merity+ 18] on PTB
• Transformer language models are catching up quickly
  – Transformer-XL [Dai+ 19] reaches 54.44 on PTB (vs. 57.3 for AWD-LSTM)
46. Summary
• Neural language models have improved dramatically on the standard benchmarks
  – regularized and carefully tuned LSTMs: AWD-LSTM [Merity+ 18]
  – richer output layers on top of them [Yang+ 18, Takase+ 18]
• The most recent gains are small
  – check reproducibility and baseline strength before trusting them
• Language models trained on large corpora (ELMo, BERT, GPT-2) are now the main way these models reach other tasks
47. References 1/5
• Mikolov et al., Empirical Evaluation and Combination of Advanced Language Modeling Techniques. INTERSPEECH 2011.
• Mikolov et al., Context Dependent Recurrent Neural Network Language Model. SLT 2012.
• Zaremba et al., Recurrent Neural Network Regularization. 2014.
• Gal et al., A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. NIPS 2016.
• Zilly et al., Recurrent Highway Networks. ICML 2017.
• Inan et al., Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling. ICLR 2017.
• Takase et al., Input-to-Output Gate to Improve RNN Language Models. IJCNLP 2017.
48. References 2/5
• Zoph et al., Neural Architecture Search with Reinforcement Learning. ICLR 2017.
• Lei et al., Simple Recurrent Units for Highly Parallelizable Recurrence. EMNLP 2018.
• Melis et al., On the State of the Art of Evaluation in Neural Language Models. ICLR 2018.
• Merity et al., Regularizing and Optimizing LSTM Language Models. ICLR 2018.
• Yang et al., Breaking the Softmax Bottleneck: A High-Rank RNN Language Model. ICLR 2018.
• Takase et al., Direct Output Connection for a High-Rank Language Model. EMNLP 2018.
49. References 3/5
• Wan et al., Regularization of Neural Networks using DropConnect. ICML 2013.
• Press et al., Using the Output Embedding to Improve Language Models. EACL 2017.
• Polyak et al., Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization 1992.
• Vaswani et al., Attention Is All You Need. NIPS 2017.
• Popel et al., Training Tips for the Transformer Model. PBML 2018.
50. References 4/5
• Zołna et al., Fraternal Dropout. ICLR 2018.
• Gong et al., FRAGE: Frequency-Agnostic Word Representation. NIPS 2018.
• Liu et al., Deep Residual Output Layers for Neural Language Generation. ICML 2019.
• Kanai et al., Sigsoftmax: Reanalysis of the Softmax Bottleneck. NIPS 2018.
• Krause et al., Dynamic Evaluation of Neural Sequence Models. 2017.
• Kuhn et al., A Cache-Based Natural Language Model for Speech Recognition. PAMI 1990.
• Grave et al., Improving Neural Language Models with a Continuous Cache. ICLR 2017.
51. References 5/5
• Radford et al., Language Models are Unsupervised Multitask Learners. 2019.
• Mikolov et al., Linguistic Regularities in Continuous Space Word Representations. NAACL 2013.
• Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality. NIPS 2013.
• Peters et al., Semi-supervised Sequence Tagging with Bidirectional Language Models. ACL 2017.
• Peters et al., Deep Contextualized Word Representations. NAACL 2018.
• Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.