
Research Trends in Neural Language Models (Invited Talk at the NL Study Group)

Sho Takase

June 13, 2019
Transcript



  1. Research Trends in Neural Language Models
    2019/6/13
    Sho Takase (Tokyo Institute of Technology)
    https://takase.github.io/

  2. About this talk
    [slide text garbled in this transcript; mentions the NL study group and AIP]

  3. About me
    • 2008 - 2017: Tohoku University (undergraduate through Ph.D.)
    • 2017 - 2018: NTT Communication Science Laboratories
    • 2018 - : Tokyo Institute of Technology

    Research topics
    • Language modeling [IJCNLP 17, EMNLP 18, AAAI 19]
    • Headline generation [EMNLP 16, NAACL 19]
    • Representations of relational patterns [ACL 16]

  4. Outline
    • Progress on the standard benchmark (PTB)
      – results up to around October 2018
    • More recent results
      – newly proposed techniques and their reproducibility
    • Discussion: relation to large-scale pretraining

  5. Outline (same slide, shown again)

  6. What is a language model?
    • Assigns a probability to a word sequence, reflecting its naturalness
      – P(I have a dream) > P(a have I dream) > P(fuga spam hoge …)
      – used to rank candidate outputs in generation, speech recognition, etc.
    • Evaluation metric: perplexity
      – computed on held-out data; lower is better
    • The joint probability is decomposed with the chain rule, and each
      conditional is predicted, e.g., by an RNN:
      P(I have a dream)
      = P(I)P(have | I)P(a | I have)P(dream | I have a)
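A minimal illustration of the chain-rule decomposition and perplexity above (my sketch, not from the slides; `cond_prob(word, history)` is a hypothetical function returning P(word | history)):

    import math

    def sentence_log_prob(words, cond_prob):
        # Chain rule: log P(w1..wn) = sum_i log P(w_i | w_1, ..., w_{i-1})
        return sum(math.log(cond_prob(w, words[:i])) for i, w in enumerate(words))

    def perplexity(corpus, cond_prob):
        # exp(-average log-likelihood per token); lower is better.
        log_prob = sum(sentence_log_prob(s, cond_prob) for s in corpus)
        n_tokens = sum(len(s) for s in corpus)
        return math.exp(-log_prob / n_tokens)

    # A uniform model over a 10,000-word vocabulary (PTB-sized) has
    # perplexity exactly 10,000; any useful model should do far better.
    uniform = lambda w, h: 1.0 / 10000
    print(perplexity([["i", "have", "a", "dream"]], uniform))  # -> 10000.0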

  7. Where language models are used
    • Noisy channel model
      – P(T) serves as the prior over the output text T
      – speech recognition, machine translation, spelling correction, …
      – also used to rescore candidate outputs
    • Pretraining of (contextual) word representations
      – Skip-gram, ELMo, BERT
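The noisy channel factorization behind the first bullet is just Bayes' rule (standard background, not specific to these slides):

    \hat{T} = \operatorname*{argmax}_T P(T \mid S)
            = \operatorname*{argmax}_T \frac{P(S \mid T)\, P(T)}{P(S)}
            = \operatorname*{argmax}_T \underbrace{P(S \mid T)}_{\text{channel model}}\;\underbrace{P(T)}_{\text{language model}}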

  8. Two experimental regimes (by data size)
    Small data
    – Penn Treebank (PTB), WikiText-2
    – cheap enough that many methods can be trained and compared
    – the main arena for regularization and architecture studies
    Large data
    – WikiText-103, 1 billion word corpus
    – the regime of large-scale pretraining (ELMo, BERT)
    – training is expensive, so thorough comparisons are rare
    → this talk mainly follows the small-data (PTB) line of work

  9. Two experimental regimes (by data size) (same slide, shown again)

  10. Dataset: Penn Treebank (PTB)
    • The standard small benchmark
      – the Wall Street Journal portion of the Penn Treebank
      – the preprocessed version of [Mikolov+ 11] is the de facto standard
    • Vocabulary: 10,000 words
      – lowercased; out-of-vocabulary words replaced with <unk>
      – numbers replaced with N
        • 10 million → N million
    • Training set: 887,521 tokens
      – roughly 1/1000 of the 1 billion word corpus
      – small enough to train and compare many models
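A rough sketch of the preprocessing conventions above (my illustration, not Mikolov's actual script):

    import re
    from collections import Counter

    def ptb_normalize(tokens):
        # Lowercase, and replace numeric tokens with the placeholder "N".
        return ["N" if re.fullmatch(r"[\d.,]+", t) else t.lower() for t in tokens]

    def build_vocab(corpus, size=10000):
        # Keep the most frequent words; the rest will map to <unk>.
        counts = Counter(t for sent in corpus for t in sent)
        return {w for w, _ in counts.most_common(size - 1)} | {"<unk>"}

    def apply_vocab(tokens, vocab):
        return [t if t in vocab else "<unk>" for t in tokens]

    print(ptb_normalize("The fund holds 10 million shares".split()))
    # ['the', 'fund', 'holds', 'N', 'million', 'shares']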

  11. Perplexity on PTB over the years
    Year  PPL    Model
    2012  141.2  Kneser-Ney smoothed 5-gram
          124.7  RNN language model [Mikolov+ 12]
    2014   78.4  LSTM language model [Zaremba+ 14]
    2016   75.0  LSTM language model + variational dropout [Gal+ 16]
    2017   68.5  Recurrent Highway Network [Zilly+ 17]
           66.1  Weight tying of input/output embeddings [Inan+ 17]
           64.4  Input-to-output gate [Takase+ 17]
           60.3  Simple Recurrent Unit [Lei+ 17]
           57.3  LSTM LM + regularization and careful tuning (AWD-LSTM) [Merity+ 18]
           54.4  Mixture of Softmaxes [Yang+ 18]
    2018   52.4  Mixture of Softmaxes from multiple layers (DOC) [Takase+ 18]
           47.2  + Ensemble [Takase+ 18]
    Search-based results, for comparison:
    2017   62.4  Architecture search with massive compute [Zoph+ 17]
    2018   58.3  LSTM LM with extensive hyperparameter search [Melis+ 18]
    → a well-regularized, well-tuned LSTM LM outperforms architecture search

  12. Perplexity on PTB over the years (same table, shown again; the
    emphasized takeaway: dropout variants and tuning, rather than new
    architectures, drove most of the gains)

  13. ASGD Weight-Dropped LSTM (AWD-LSTM) [Merity+ 18]
    • A plain LSTM LM plus several regularization techniques
      – Variational dropout [Gal+ 16]
      – DropConnect [Wan+ 13]
      – Weight tying [Inan+ 17, Press+ 17]
      – Averaged SGD [Polyak+ 92]
    • Plus careful hyperparameter choices
      – state of the art on PTB at the time

    Hyperparameter     LSTM LM [Zaremba+ 14]   AWD-LSTM
    Learning rate      1                       30
    Epochs             ~40                     500 ~ 1000
    Gradient clipping  5                       0.25

  14. (AWD-LSTM overview repeated; the following slides walk through each
    component in turn.)

  15. LSTM language model [Zaremba+ 14]
    • An LSTM LM clearly outperforms n-gram models
      – an n-gram model conditions on only the previous N-1 words
      – perplexity: 141.2 (Kneser-Ney 5-gram) → 78.4
    • Stacked LSTM layers, each computing h_t = f(x_t, h_{t-1})
    [Figure: stacked LSTMs over "I have …" producing P(I), P(have | I),
     P(a | I have)]
    • Dropout on the non-recurrent (vertical) connections only
      – units are dropped with probability p and the survivors scaled by
        1/(1 - p)
      – larger models tolerate higher dropout rates → better performance
      – no dropout on the recurrent connections
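A minimal PyTorch sketch of this setup (my illustration; the sizes are the paper's medium model, and `nn.LSTM`'s `dropout` argument happens to implement exactly this between-layer, non-recurrent dropout):

    import torch
    import torch.nn as nn

    class ZarembaLM(nn.Module):
        # [Zaremba+ 14]-style LM: dropout only on the non-recurrent
        # ("vertical") connections, never on the h_{t-1} -> h_t path.
        def __init__(self, vocab=10000, emb=650, hidden=650, layers=2, p=0.5):
            super().__init__()
            self.embed = nn.Embedding(vocab, emb)
            self.drop = nn.Dropout(p)  # scales surviving units by 1/(1-p)
            # nn.LSTM's `dropout` is applied between stacked layers only.
            self.lstm = nn.LSTM(emb, hidden, layers, dropout=p, batch_first=True)
            self.out = nn.Linear(hidden, vocab)

        def forward(self, tokens, state=None):
            x = self.drop(self.embed(tokens))
            h, state = self.lstm(x, state)
            return self.out(self.drop(h)), state

    logits, _ = ZarembaLM()(torch.randint(0, 10000, (1, 4)))
    print(logits.shape)  # torch.Size([1, 4, 10000])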

  16. (AWD-LSTM overview repeated; next component: Variational dropout.)

  17. Variational dropout [Gal+ 16]
    • Sample a dropout mask once per sequence and reuse it at every time
      step, instead of resampling at each step
      – dropout is also applied to the recurrent connections
      – dropout on recurrent connections had been explored before, e.g.
        [Moon+ 15]
    • Backed by a Bayesian interpretation of dropout (next slides)
    [Figure: a standard-dropout LSTM LM vs. a variational-dropout LSTM LM
     over "I have …"; in the latter, the same mask d is reused across all
     time steps]
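A sketch of the mask-reuse pattern (my illustration built around `nn.LSTMCell`; real implementations also mask the gate inputs more carefully):

    import torch
    import torch.nn as nn

    def variational_mask(batch, dim, p):
        # One Bernoulli mask per sequence, reused at every time step,
        # scaled by 1/(1-p) so expectations are unchanged.
        return torch.bernoulli(torch.full((batch, dim), 1 - p)) / (1 - p)

    class VariationalDropoutRNN(nn.Module):
        # [Gal+ 16]-style dropout: input and recurrent masks are sampled
        # once per sequence, not once per time step.
        def __init__(self, emb=650, hidden=650, p=0.5):
            super().__init__()
            self.cell = nn.LSTMCell(emb, hidden)
            self.p = p

        def forward(self, x):                  # x: (batch, time, emb)
            b, t, _ = x.shape
            h = x.new_zeros(b, self.cell.hidden_size)
            c = x.new_zeros(b, self.cell.hidden_size)
            mx = variational_mask(b, x.size(-1), self.p)  # input mask
            mh = variational_mask(b, h.size(-1), self.p)  # recurrent mask
            outs = []
            for i in range(t):                 # same masks at every step
                h, c = self.cell(x[:, i] * mx, (h * mh, c))
                outs.append(h)
            return torch.stack(outs, dim=1)

    print(VariationalDropoutRNN()(torch.randn(2, 5, 650)).shape)
    # torch.Size([2, 5, 650])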

  18. Why does fixing the mask work? (1/3)
    • Motivation
      – dropout for RNNs had been largely heuristic; [Gal+ 16] derives it
        from approximate Bayesian inference
      – resampling the dropout mask at every time step hurts recurrent
        networks
    • Bayesian treatment of the parameters
      – treat the weights θ as random variables with posterior p(θ | X, Y)
      – the exact posterior is intractable
      – introduce an approximating distribution q(θ) and minimize
        KL( q(θ) || p(θ | X, Y) )

  19. Why does fixing the mask work? (2/3)
    • Minimizing KL( q(θ) || p(θ | X, Y) ) is equivalent to maximizing the
      variational lower bound
    • The network computes y = f_θ(x) with θ sampled from q(θ)
      – for each input x_i, draw one sample θ̂_i ~ q(θ) and evaluate the
        likelihood
      – one sample per sequence, not per time step
    • With a suitable q(θ), this procedure is exactly training with dropout

  20. Why does fixing the mask work? (3/3)
    • Choose q(θ) so that sampling from it amounts to dropout
      – factorize q(θ) over rows θ_k of each weight matrix
      – q(θ_k): a mixture of two Gaussians, one centered at a learned mean
        m_k and one at 0
      – with small variance, a sample is (approximately) m_k times a
        Bernoulli variable, i.e., a dropout mask on the weights
    • The mask corresponds to the sampled θ̂, drawn once per sequence
      → the same mask must be kept across all time steps
      → the recurrent weights (and embeddings) get dropout too
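In symbols (a sketch of the construction in [Gal+ 16]; the notation is mine):

    q(\theta_k) = p\,\mathcal{N}(\theta_k;\, 0,\, \sigma^2 I)
                + (1-p)\,\mathcal{N}(\theta_k;\, m_k,\, \sigma^2 I)

so that for small \sigma, drawing \theta \sim q(\theta) multiplies the learned weights by a Bernoulli mask, and the variational objective becomes

    \mathcal{L} = \sum_i \log p\big(y_i \mid f_{\hat{\theta}_i}(x_i)\big)
                - \mathrm{KL}\big(q(\theta)\,\|\,p(\theta)\big),
    \qquad \hat{\theta}_i \sim q(\theta),

with one sample \hat{\theta}_i per sequence, i.e., one dropout mask reused across all time steps.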

  21. Variational dropout: results
    • Strictly, predictions should average over q(θ) (Monte Carlo dropout)
      – [Gal+ 16] averages 1000 sampled masks at test time
    • Perplexity improves by 1-2 points
      – 79.7 → 78.6 (medium model), 75.2 → 73.4 (large model)
    • Now a standard ingredient in RNN models
      – simple to implement on top of existing code

  22. (AWD-LSTM overview repeated; next component: DropConnect.)

  23. DropConnect [Wan+ 13]
    • A variant of dropout at the weight level
      – dropout zeroes each unit's output with probability p
      – DropConnect zeroes each individual weight with probability p
        (finer-grained masking)
    • AWD-LSTM applies DropConnect to the LSTM's hidden-to-hidden
      (recurrent) weight matrices
      – the mask is applied to the weights before the forward pass, so a
        fast black-box LSTM implementation can be used unmodified
    [Figure: a = (M ⊙ W) v, with mask M ~ Bernoulli(1 - p) of the same
     shape as W]
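A simplified sketch of the weight-drop idea on a single linear map (my illustration, not the reference implementation):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DropConnectLinear(nn.Module):
        # DropConnect [Wan+ 13]: mask individual *weights* (not unit
        # outputs) with probability p, rescaled so expectations match.
        def __init__(self, d_in, d_out, p=0.5):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.01)
            self.p = p

        def forward(self, v):
            mask = torch.bernoulli(torch.full_like(self.weight, 1 - self.p))
            return F.linear(v, self.weight * mask / (1 - self.p))

    # AWD-LSTM masks the hidden-to-hidden matrix this way once per forward
    # pass, leaving the LSTM kernel itself untouched.
    out = DropConnectLinear(1150, 4 * 1150)(torch.randn(2, 1150))
    print(out.shape)  # torch.Size([2, 4600])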

  24. (AWD-LSTM overview repeated; next component: Weight tying.)

  25. Weight tying [Inan+ 17, Press+ 17]
    • Share the input word-embedding matrix and the output (softmax)
      weight matrix: E = Wᵀ
      – fewer parameters, better perplexity
    • [Inan+ 17] also motivates the tying theoretically
      – via an augmented loss relating output distributions to embedding
        similarity
    [Figure: 1-hot x_t → embedding E → n-layer LSTM f_n → output
     projection W → softmax; E and W are the same matrix]
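A minimal sketch of the tying itself (my illustration; note it requires the hidden size to equal the embedding size):

    import torch
    import torch.nn as nn

    class TiedLM(nn.Module):
        # Weight tying: the output projection reuses the input embedding
        # matrix (E = W^T), as in [Inan+ 17, Press+ 17].
        def __init__(self, vocab=10000, dim=650):
            super().__init__()
            self.embed = nn.Embedding(vocab, dim)
            self.lstm = nn.LSTM(dim, dim, batch_first=True)  # hidden == emb
            self.out = nn.Linear(dim, vocab, bias=False)
            self.out.weight = self.embed.weight  # <- the tying: one matrix

        def forward(self, tokens):
            h, _ = self.lstm(self.embed(tokens))
            return self.out(h)

    model = TiedLM()
    assert model.out.weight is model.embed.weight  # shared storage
    print(model(torch.randint(0, 10000, (1, 4))).shape)  # (1, 4, 10000)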

  26. (AWD-LSTM overview repeated; next component: Averaged SGD.)

  27. Averaged SGD (ASGD) [Polyak+ 92]
    • Return the average of the parameters visited during optimization
      – plain SGD: return the final iterate θ_t
      – ASGD: return the average of θ_t over the trailing iterations
    • In AWD-LSTM, training starts with plain SGD and switches to ASGD
      once the validation perplexity stops improving
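A sketch of both pieces (my illustration of the switching rule described in [Merity+ 18]; in practice PyTorch's built-in torch.optim.ASGD can play the averaging role):

    import torch

    def asgd_average(checkpoints):
        # ASGD returns the average of the parameter snapshots visited
        # during optimization instead of the final iterate.
        return [torch.stack(ws).mean(dim=0) for ws in zip(*checkpoints)]

    def should_switch(val_ppls, n=5):
        # Non-monotonic trigger (sketch): switch to averaging once the
        # validation perplexity has not improved for n consecutive checks.
        best_so_far = min(val_ppls[:-n]) if len(val_ppls) > n else float("inf")
        return len(val_ppls) > n and min(val_ppls[-n:]) > best_so_far

    ckpts = [[torch.tensor([1.0, 2.0])], [torch.tensor([3.0, 4.0])]]
    print(asgd_average(ckpts))  # [tensor([2., 3.])]
    print(should_switch([60, 58, 57, 57.5, 57.4, 57.6, 57.9, 58.0, 57.8, 57.7]))
    # True: no improvement over 57 in the last five checks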

  28. Perplexity on PTB over the years (the table from slide 11, shown
    again before turning to the output layer: Mixture of Softmaxes.)

  29. Mixture of Softmaxes
    • A single softmax on top of the LSTM limits expressiveness
      – viewed as matrix factorization, a single softmax caps the rank of
        the log-probability matrix: the softmax bottleneck [Yang+ 18]
    • MoS [Yang+ 18]: compute several softmax distributions P1, P2, P3, …
      from the final LSTM layer and output their weighted mixture
      – 15 distributions
    • DOC [Takase+ 18]: also draw distributions from intermediate layers
      – 20 distributions in total
    [Figure: a single softmax P vs. a weighted mixture of P1, P2, P3
     computed from the (final and intermediate) LSTM layers]
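A compact sketch of the MoS output layer (my illustration of the structure in [Yang+ 18], with the paper's K = 15):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MixtureOfSoftmaxes(nn.Module):
        # K softmaxes over the vocabulary, mixed with input-dependent
        # weights; the mixture is full-rank even though each component
        # alone is rank-limited.
        def __init__(self, hidden=650, vocab=10000, k=15):
            super().__init__()
            self.k = k
            self.prior = nn.Linear(hidden, k)            # mixture weights
            self.latent = nn.Linear(hidden, k * hidden)  # per-component context
            self.decoder = nn.Linear(hidden, vocab)      # shared projection

        def forward(self, h):                            # h: (batch, hidden)
            pi = F.softmax(self.prior(h), dim=-1)        # (batch, k)
            hk = torch.tanh(self.latent(h)).view(-1, self.k, h.size(-1))
            probs = F.softmax(self.decoder(hk), dim=-1)  # (batch, k, vocab)
            return (pi.unsqueeze(-1) * probs).sum(dim=1) # (batch, vocab)

    p = MixtureOfSoftmaxes()(torch.randn(2, 650))
    print(p.shape, float(p.sum(dim=-1)[0]))  # (2, 10000), ~1.0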

  30. Mixture of Softmaxes FAQ
    • Isn't computing multiple softmaxes slow?
      – Yes: K softmaxes are computed instead of one
      – [Yang+ 18] uses K = 15
      – training time grows on the order of 5 ~ 10%
    • Do the parameters increase?
      – Yes: the mixture weights and per-component projections are added
      – negligible at PTB scale
    • Isn't it just an ensemble?
      – Yes, in spirit: an ensemble of output layers over a shared LSTM,
        which is far cheaper than an ensemble of full models

  31. Do these techniques help on other tasks?
    • Variational dropout
      – widely available; e.g., implemented in OpenNMT (lua)
    • DropConnect
      – likewise available in common toolkits
    • Weight tying
      – standard in the Transformer [Vaswani+ 17]
    • ASGD
      – checkpoint averaging, as used when training the Transformer
        [Vaswani+ 17], plays a similar role
      – about +0.2 BLEU on En-Cz [Popel+ 18]
    • MoS
      – added to a multi-layer LSTM encoder-decoder in OpenNMT:
        +1.7 BLEU on En-Fr (IWSLT 2016) [Takase+ 18]




  32. Summary of this part
    • Steady perplexity improvements on PTB
      – though recent gains have become smaller …
    • LSTM + strong regularization is the backbone
      – representative model: AWD-LSTM [Merity+ 18]
        • Variational dropout, DropConnect, Weight tying, ASGD
      – hyperparameter choices matter as much as the architecture
    • The same techniques transfer to other tasks such as translation

  33. Outline (shown again; next: more recent results)

  34. PTB results after AWD-LSTM
    PTB test perplexity (lower is better)
    57.3  AWD-LSTM [Merity+ 18]
    56.8  AWD-LSTM + Fraternal Dropout [Zołna+ 18]
    56.5  AWD-LSTM (rerun) [Merity+ 18]
    56.1  AWD-LSTM + FRAGE [Gong+ 18]
    55.7  DARTS (architecture search) [Liu+ 19]
    54.4  AWD-LSTM-MoS [Yang+ 18]
    53.3  AWD-LSTM-MoS + FRAGE [Gong+ 18]
    52.4  AWD-LSTM-DOC [Takase+ 18]
    48.4  AWD-LSTM-MoS + Dynamic evaluation (rerun) [Yang+ 18]
    47.7  AWD-LSTM-MoS + Dynamic evaluation [Yang+ 18]
    47.7  AWD-LSTM-MoS (SigSoftmax) + Dynamic evaluation [Kanai+ 18]
    47.7  AWD-LSTM-MoS + FRAGE + Dynamic evaluation (rerun) [Gong+ 18]
    47.2  AWD-LSTM-DOC (Ensemble) [Takase+ 18]
    46.5  AWD-LSTM-MoS + FRAGE + Dynamic evaluation [Gong+ 18]

  35. Dynamic evaluation [Krause+ 17]
    • Keep updating the model while evaluating
      – score a segment of the test data, then update the parameters on it
        by gradient descent before scoring the next segment
    • Similar in spirit to cache models [Kuhn+ 90, Grave+ 17]
      – adapts the model to the recent context
    • Large perplexity gains, but the model is being fit to the evaluation
      data as it goes
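A sketch of the evaluation loop (my illustration with a plain SGD update; [Krause+ 17] uses a more elaborate update rule with decay toward the original parameters, and `model(...)` is assumed to return (batch, time, vocab) logits):

    import torch
    import torch.nn.functional as F

    def dynamic_eval(model, segments, lr=1e-4):
        # Score each test segment, THEN take a gradient step on it before
        # scoring the next, so the model adapts to the recent context.
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        total_nll, total_tokens = 0.0, 0
        for inp, tgt in segments:              # inp, tgt: (batch, time)
            logits = model(inp)
            loss = F.cross_entropy(logits.flatten(0, 1), tgt.flatten())
            total_nll += loss.item() * tgt.numel()   # scored before update
            total_tokens += tgt.numel()
            opt.zero_grad()
            loss.backward()                    # update on *test* data ...
            opt.step()                         # ... before the next segment
        return torch.exp(torch.tensor(total_nll / total_tokens))  # perplexity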

  36. Caveats on dynamic evaluation
    • Reported numbers are hard to reproduce exactly
      – the update rule is sensitive to its hyperparameters
      – reproduction attempts:
        • AWD-LSTM-MoS: 47.7 (reported) → 48.4 (reproduction)
        • AWD-LSTM-MoS + FRAGE: 46.5 (reported) → 47.7 (reproduction)
      – differences of about 1 perplexity point should not be over-read
    • Conceptual caveats
      – the model is updated on the evaluation data itself
      – methods with and without dynamic evaluation should be compared
        separately

  37. (Same results table as slide 34, shown again.)

  38. Fraternal Dropout [Zołna+ 18]
    • Feed the same input through the network under two different dropout
      masks
    • Add a term penalizing the difference between the two predictions
      – encourages predictions that do not depend on the particular
        dropout mask
    • Result: perplexity 57.3 → 56.8 on PTB
    • Notes
      – built on top of AWD-LSTM …
      – still above AWD-LSTM-DOC
    [p1, p2: the two dropout masks; the squared difference between the two
     predictions is added to the loss]
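A sketch of the combined loss (my illustration; `model` is assumed to return (batch, time, vocab) logits with dropout active, and `kappa` is the regularization strength):

    import torch
    import torch.nn.functional as F

    def fraternal_dropout_loss(model, inp, tgt, kappa=0.1):
        # Run the same input twice; independent dropout masks are sampled
        # on each call, and the prediction gap is penalized [Zołna+ 18].
        logits1 = model(inp)   # mask 1
        logits2 = model(inp)   # mask 2 (resampled)
        nll = 0.5 * (F.cross_entropy(logits1.flatten(0, 1), tgt.flatten())
                     + F.cross_entropy(logits2.flatten(0, 1), tgt.flatten()))
        consistency = (logits1 - logits2).pow(2).mean()
        return nll + kappa * consistency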

  39. FRAGE [Gong+ 18]
    • Learned word embeddings encode word frequency: frequent and rare
      words occupy different regions of the space
    • Adversarial training removes this: a discriminator tries to classify
      each embedding as frequent or rare, and the embeddings are trained
      to fool it (frequency-agnostic embeddings)
    • Result: perplexity 57.3 → 56.1
      – the AWD-LSTM rerun gives 56.5
    [Figure: task loss L_T plus discriminator loss L_D, trained jointly as
     L_T + L_D; θ_T: task model, θ_D: discriminator]
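A sketch of the adversarial objective (my illustration; `disc`, `frage_losses`, and `lam` are names I introduce for exposition):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # A discriminator tries to tell frequent-word embeddings from
    # rare-word ones; the embeddings are trained on the task loss AND to
    # fool the discriminator, removing frequency from the embedding space.
    disc = nn.Linear(650, 1)  # hypothetical discriminator over embeddings

    def frage_losses(task_loss, emb_frequent, emb_rare, lam=0.1):
        logits = torch.cat([disc(emb_frequent), disc(emb_rare)]).squeeze(-1)
        labels = torch.cat([torch.ones(len(emb_frequent)),
                            torch.zeros(len(emb_rare))])
        l_disc = F.binary_cross_entropy_with_logits(logits, labels)
        # discriminator minimizes l_disc;
        # task model + embeddings minimize task_loss - lam * l_disc
        return l_disc, task_loss - lam * l_disc

    l_disc, l_task = frage_losses(torch.tensor(1.0),
                                  torch.randn(4, 650), torch.randn(4, 650))
    print(float(l_disc), float(l_task))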

  40. AWD-LSTM-MoS + FRAGE
    • The gain is hard to reproduce here as well
      – perplexity: 53.3 (reported) → 53.8 (reproduction)
    • A comparable gain over AWD-LSTM-MoS comes more simply
      – adding one more dropout already gives 53.6
    • Still does not reach AWD-LSTM-DOC (52.4)


  41. Summary of recent results
    • Perplexity on PTB is still improving
      – but the steps are small
      – and several results are hard to reproduce
    • Improvements concentrate on regularization and the output layer
      rather than on new architectures
    • Evaluation practice matters: results with and without dynamic
      evaluation should be reported separately

  42. Aside …
    [slide text largely garbled in this transcript]
    • Mentions very large language models such as GPT-2 [Radford+ 19] and
      the fluency of their generated text
    • …

  43. Outline (shown again; next: relation to large-scale pretraining)

  44. Pretraining on large corpora
    • Train a language model on billions of words
      – 1 billion word corpus, English Wikipedia, …
    • Two common objectives:
      Left-to-right LM: predict the next word
        I have a → P(have | I) P(a | I have) P(dream | I have a)
        (LSTM: ELMo; Transformer: GPT)
      Masked LM: predict a masked word from both directions
        I [MASK] a → have
        (BERT)
    • The learned representations transfer to downstream tasks …
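A sketch contrasting the two objectives (my illustration; `lm` and `masked_lm` are stand-ins assumed to return (batch, time, vocab) logits):

    import torch
    import torch.nn.functional as F

    def next_word_loss(lm, tokens):
        # Left-to-right LM (ELMo/GPT style): predict token t+1 from <= t.
        logits = lm(tokens[:, :-1])
        return F.cross_entropy(logits.flatten(0, 1), tokens[:, 1:].flatten())

    def masked_lm_loss(masked_lm, tokens, mask_id, p=0.15):
        # Masked LM (BERT style): hide ~15% of tokens and predict them
        # from both left and right context.
        mask = torch.rand(tokens.shape) < p
        corrupted = tokens.masked_fill(mask, mask_id)
        logits = masked_lm(corrupted)
        return F.cross_entropy(logits[mask], tokens[mask])

    toy = lambda toks: torch.randn(toks.size(0), toks.size(1), 100)  # stand-in
    print(next_word_loss(toy, torch.randint(0, 100, (2, 8))))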

  45. From word2vec to pretrained language models
    • The input embeddings of an RNN LM capture linguistic regularities
      [Mikolov+ 13]
      – man : woman :: king : queen
    • skip-gram / CBOW [Mikolov+ 13]: strip away the RNN to scale up
      – trained on the 1 billion word corpus
    • LSTM LM hidden states as features for sequence tagging [Peters+ 17]
    • ELMo [Peters+ 18]
      – bidirectional LSTM LMs trained on the 1 billion word corpus
      – internal states serve as contextualized word embeddings
      – e.g., "I have a dream that …": each word is embedded in context
    • BERT [Devlin+ 19]
      – objective: masked word prediction; encoder: LSTM → Transformer

  46. Do the PTB techniques matter for pretraining?
    • Pretrained models (ELMo, BERT) are direct descendants of language
      modeling research
      – word2vec itself grew out of RNN language models
    • But techniques tuned on PTB do not automatically transfer
      – PTB methods mainly fight overfitting on small data
        • e.g., AWD-LSTM-DOC [Takase+ 18] improves over AWD-LSTM
          [Merity+ 18] at extra computational cost
        • heavy regularization such as variational dropout matters less
          when data is abundant
      – with large data, training cost becomes the bottleneck
    • Recent large-scale LMs are Transformer-based
      – Transformer-XL [Dai+ 19]: PTB 54.44 (vs. 57.3 for AWD-LSTM)
    • …


  47. Summary
    • Surveyed research trends in neural language modeling
      – LSTM + strong regularization: AWD-LSTM [Merity+ 18]
      – output-layer improvements: MoS / DOC [Yang+ 18, Takase+ 18]
    • Recent gains on PTB are small
      – and reproducibility deserves attention
    • Connection to large-scale pretraining (ELMo, BERT)
      – transferring the small-data techniques is an open question

  48. References (1/5)
    • Mikolov et al., Empirical Evaluation and Combination of
    Advanced Language Modeling Techniques. INTERSPEECH
    2011.
    • Mikolov et al., Context Dependent Recurrent Neural Network
    Language Model. SLT 2012.
    • Zaremba et al., Recurrent Neural Network Regularization.
    2014.
    • Gal et al., A Theoretically Grounded Application of Dropout in
    Recurrent Neural Networks. NIPS 2016.
    • Zilly et al., Recurrent Highway Networks. ICML 2017.
    • Inan et al., Tying Word Vectors and Word Classifiers: A Loss
    Framework for Language Modeling. ICLR 2017.
• Takase et al., Input-to-Output Gate to Improve RNN
    Language Models. IJCNLP 2017.

  49. References (2/5)
    • Zoph et al., Neural Architecture Search with
    Reinforcement Learning. ICLR 2017.
    • Lei et al., Simple Recurrent Units for Highly
    Parallelizable Recurrence. EMNLP 2018.
    • Melis et al., On the state of the art of evaluation in
    neural language models. ICLR 2018.
    • Merity et al., Regularizing and Optimizing LSTM
    Language Models. ICLR 2018.
    • Yang et al., Breaking the softmax bottleneck: A
    high-rank RNN language model. ICLR 2018.
    • Takase et al., Direct Output Connection for a High-
    Rank Language Model. EMNLP 2018.

  50. References (3/5)
    • Wan et al., Regularization of Neural Networks
    using DropConnect. ICML 2013.
    • Press et al., Using the Output Embedding to
    Improve Language Models. EACL 2017.
    • Polyak et al., Acceleration of Stochastic
    Approximation by Averaging. SIAM Journal
    on Control and Optimization 1992.
    • Vaswani et al., Attention Is All You Need.
    NIPS 2017.
    • Popel et al., Training Tips for the Transformer
    Model. PBML 2018.

  51. References (4/5)
    • Zołna et al., Fraternal dropout. ICLR 2018.
    • Gong et al., FRAGE: Frequency-Agnostic Word
    Representation. NIPS 2018.
• Liu et al., DARTS: Differentiable Architecture
    Search. ICLR 2019.
    • Kanai et al., Sigsoftmax: Reanalysis of the Softmax
    Bottleneck. NIPS 2018.
    • Krause et al., Dynamic Evaluation of Neural Sequence
    Models. 2017.
    • Kuhn et al., A cache-based natural language model for
    speech recognition. PAMI 1990.
    • Grave et al., Improving Neural Language Models with
    a Continuous Cache. ICLR 2017.

  52. References (5/5)
    • Radford et al., Language Models are Unsupervised
    Multitask Learners. 2019.
    • Mikolov et al., Linguistic Regularities in Continuous
    Space Word Representations. NAACL 2013.
    • Mikolov et al., Distributed Representations of Words
    and Phrases and their Compositionality. NIPS 2013.
• Peters et al., Semi-supervised sequence tagging with
    bidirectional language models. ACL 2017.
    • Peters et al., Deep Contextualized Word
    Representations. NAACL 2018.
    • Devlin et al., BERT: Pre-training of Deep Bidirectional
    Transformers for Language Understanding. NAACL
    2019.