• “I went to Nepal in 2009” vs. “In 2009, I went to Nepal”: RNNs and parameter sharing. Sharing parameters across time steps lets the model extract the same piece of information (here, the year) regardless of where it appears in the sentence.
[Figure: unfolded state-transition graphs s(0), s(1), ..., s(5) illustrating parameter sharing across time steps]
[Figure: the unfolded computational graph of a recurrent network (inputs x, hidden states h, outputs o, losses L, targets y; weights U, V, W shared across all time steps)]
With teacher forcing, each time step can be trained in isolation, without first computing the output of the previous time step, because the training set provides the ideal value of that output.
[Figure 10.4: teacher forcing; at train time the ground-truth y(t-1) is fed into the hidden state h(t), while at test time the model's own output o(t-1) is fed back]
http://www.deeplearningbook.org/lecture_slides.html
The unfolded graph illustrates the idea of information flow forward in time (computing outputs and losses) and backward in time (computing gradients) by explicitly showing the path along which this information flows.

10.2 Recurrent Neural Networks
Armed with the graph unrolling and parameter sharing ideas of section 10.1, we can design a wide variety of recurrent neural networks.

[Figure 10.3: The computational graph to compute the training loss of a recurrent network that maps an input sequence of x values to a corresponding sequence of output o values.]
http://www.deeplearningbook.org/lecture_slides.html

Forward propagation of the basic RNN:
$$a^{(t)} = b + W h^{(t-1)} + U x^{(t)}$$
$$h^{(t)} = \tanh\!\left(a^{(t)}\right)$$
$$o^{(t)} = c + V h^{(t)}$$
$$\hat{y}^{(t)} = \mathrm{softmax}\!\left(o^{(t)}\right)$$
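To make the forward equations concrete, here is a minimal NumPy sketch (not from the slides; the function name `rnn_forward` and the list-of-vectors interface are assumptions for illustration):

```python
import numpy as np

def rnn_forward(x_seq, h0, U, W, V, b, c):
    """Run the basic RNN forward pass over a list of input vectors x(1)..x(T)."""
    h_prev, hs, yhats = h0, [], []
    for x_t in x_seq:
        a_t = b + W @ h_prev + U @ x_t      # a(t) = b + W h(t-1) + U x(t)
        h_t = np.tanh(a_t)                   # h(t) = tanh(a(t))
        o_t = c + V @ h_t                    # o(t) = c + V h(t)
        e = np.exp(o_t - o_t.max())          # softmax(o(t)), shifted for numerical stability
        yhat_t = e / e.sum()
        hs.append(h_t)
        yhats.append(yhat_t)
        h_prev = h_t
    return hs, yhats
```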
Gradient of the loss with respect to the output o(t). With the cross-entropy loss
$$L^{(t)} = -\sum_c y^{(t)}_c \log \hat{y}^{(t)}_c, \qquad \hat{y}^{(t)}_i = \mathrm{softmax}\!\left(o^{(t)}\right)_i = \frac{\exp\left(o^{(t)}_i\right)}{\sum_c \exp\left(o^{(t)}_c\right)},$$
we have $\partial L^{(t)}/\partial \hat{y}^{(t)}_c = -y^{(t)}_c/\hat{y}^{(t)}_c$, and $\partial \exp(o^{(t)}_i)/\partial o^{(t)}_k = \exp(o^{(t)}_i)$ if $i = k$ and $0$ otherwise. By the chain rule,
$$\frac{\partial L^{(t)}}{\partial o^{(t)}_i} = \sum_c \frac{\partial L^{(t)}}{\partial \hat{y}^{(t)}_c}\frac{\partial \hat{y}^{(t)}_c}{\partial o^{(t)}_i} = \frac{\partial L^{(t)}}{\partial \hat{y}^{(t)}_i}\frac{\partial \hat{y}^{(t)}_i}{\partial o^{(t)}_i} + \sum_{c\neq i} \frac{\partial L^{(t)}}{\partial \hat{y}^{(t)}_c}\frac{\partial \hat{y}^{(t)}_c}{\partial o^{(t)}_i},$$
so we split the sum into the cases $c = i$ and $c \neq i$.

Case $c = i$, using the quotient rule $\left(\frac{f}{g}\right)' = \frac{f'g - fg'}{g^2}$:
$$\frac{\partial L^{(t)}}{\partial \hat{y}^{(t)}_i}\frac{\partial \hat{y}^{(t)}_i}{\partial o^{(t)}_i} = -\frac{y^{(t)}_i}{\hat{y}^{(t)}_i}\cdot\frac{\exp\left(o^{(t)}_i\right)\sum_c \exp\left(o^{(t)}_c\right) - \exp\left(o^{(t)}_i\right)\exp\left(o^{(t)}_i\right)}{\left(\sum_c \exp\left(o^{(t)}_c\right)\right)^2} = -\frac{y^{(t)}_i}{\hat{y}^{(t)}_i}\cdot\hat{y}^{(t)}_i\left(1-\hat{y}^{(t)}_i\right) = -y^{(t)}_i\left(1-\hat{y}^{(t)}_i\right) = -y^{(t)}_i + \hat{y}^{(t)}_i y^{(t)}_i.$$

Case $c \neq i$:
$$\frac{\partial L^{(t)}}{\partial \hat{y}^{(t)}_c}\frac{\partial \hat{y}^{(t)}_c}{\partial o^{(t)}_i} = -\frac{y^{(t)}_c}{\hat{y}^{(t)}_c}\cdot\frac{-\exp\left(o^{(t)}_i\right)\exp\left(o^{(t)}_c\right)}{\left(\sum_c \exp\left(o^{(t)}_c\right)\right)^2} = -\frac{y^{(t)}_c}{\hat{y}^{(t)}_c}\cdot\left(-\hat{y}^{(t)}_i \hat{y}^{(t)}_c\right) = y^{(t)}_c\,\hat{y}^{(t)}_i.$$

Combining the two cases:
$$\frac{\partial L^{(t)}}{\partial o^{(t)}_i} = -y^{(t)}_i + \hat{y}^{(t)}_i y^{(t)}_i + \sum_{c\neq i} y^{(t)}_c\,\hat{y}^{(t)}_i = -y^{(t)}_i + \hat{y}^{(t)}_i\sum_c y^{(t)}_c = \hat{y}^{(t)}_i - y^{(t)}_i \qquad \left(\because \sum_c y^{(t)}_c = 1\right),$$
and since $L = \sum_t L^{(t)}$ gives $\partial L/\partial L^{(t)} = 1$,
$$\frac{\partial L}{\partial o^{(t)}_i} = \frac{\partial L}{\partial L^{(t)}}\cdot\frac{\partial L^{(t)}}{\partial o^{(t)}_i} = \hat{y}^{(t)}_i - y^{(t)}_i.$$
In vector form, writing the one-hot target with the indicator function, $(\nabla_{o^{(t)}} L)_i = \hat{y}^{(t)}_i - \mathbf{1}_{i,\,y^{(t)}}$, which matches the textbook.
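A quick finite-difference check of the result $\partial L^{(t)}/\partial o^{(t)} = \hat{y}^{(t)} - y^{(t)}$; this is a standalone sketch, and the dimension 5 and the random seed are arbitrary assumptions:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def cross_entropy(o, y):
    return -np.sum(y * np.log(softmax(o)))   # L(t) = -sum_c y_c log yhat_c

rng = np.random.default_rng(0)
o = rng.normal(size=5)
y = np.eye(5)[2]                             # one-hot target
analytic = softmax(o) - y                    # yhat - y from the derivation above
numeric = np.zeros_like(o)
eps = 1e-6
for i in range(len(o)):
    d = np.zeros_like(o)
    d[i] = eps
    numeric[i] = (cross_entropy(o + d, y) - cross_entropy(o - d, y)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))    # should be on the order of 1e-9 or smaller
```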
Gradient of the loss with respect to the hidden state h(t). For an internal time step t, L depends on h(t) both through h(t+1) and through o(t), so
$$\frac{\partial L}{\partial h^{(t)}_i} = \sum_u \frac{\partial L}{\partial h^{(t+1)}_u}\frac{\partial h^{(t+1)}_u}{\partial h^{(t)}_i} + \sum_c \frac{\partial L}{\partial o^{(t)}_c}\frac{\partial o^{(t)}_c}{\partial h^{(t)}_i}.$$
From $h^{(t+1)}_u = \tanh\!\left(a^{(t+1)}_u\right)$ with $a^{(t+1)}_u = b_u + \sum_k W_{uk} h^{(t)}_k + \sum_k U_{uk} x^{(t+1)}_k$, and from $o^{(t)}_c = c_c + \sum_k V_{ck} h^{(t)}_k$,
$$\frac{\partial h^{(t+1)}_u}{\partial h^{(t)}_i} = \left(1-\tanh^2\!\left(a^{(t+1)}_u\right)\right) W_{ui} = \left(1-\left(h^{(t+1)}_u\right)^2\right) W_{ui}, \qquad \frac{\partial o^{(t)}_c}{\partial h^{(t)}_i} = V_{ci}.$$
Substituting,
$$\frac{\partial L}{\partial h^{(t)}_i} = \sum_u W_{ui}\left(1-\left(h^{(t+1)}_u\right)^2\right)\left(\nabla_{h^{(t+1)}} L\right)_u + \sum_c V_{ci}\left(\nabla_{o^{(t)}} L\right)_c,$$
which in matrix form gives the backward recursion
$$\nabla_{h^{(t)}} L = W^{\top}\,\mathrm{diag}\!\left(1-\left(h^{(t+1)}\right)^2\right)\nabla_{h^{(t+1)}} L + V^{\top}\nabla_{o^{(t)}} L.$$
At the final time step $\tau$ only the output term is present: $\nabla_{h^{(\tau)}} L = V^{\top}\nabla_{o^{(\tau)}} L$.
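The recursion can be run backward over the stored hidden states. Below is a sketch building on the hypothetical `rnn_forward` outputs above; all names (`hs`, `yhats`, `ys`) are assumptions:

```python
import numpy as np

def backward_hidden(hs, yhats, ys, W, V):
    """Compute grad_o(t) = yhat(t) - y(t) and the backward recursion for grad_h(t)."""
    T = len(hs)
    grad_o = [yhats[t] - ys[t] for t in range(T)]
    grad_h = [None] * T
    grad_h[T - 1] = V.T @ grad_o[T - 1]                              # last step: only the output path
    for t in range(T - 2, -1, -1):
        grad_h[t] = (W.T @ ((1 - hs[t + 1] ** 2) * grad_h[t + 1])    # W^T diag(1-h(t+1)^2) grad_h(t+1)
                     + V.T @ grad_o[t])                              # + V^T grad_o(t)
    return grad_o, grad_h
```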
Gradients with respect to the biases c and b. For the output bias,
$$\frac{\partial L}{\partial c_i} = \sum_t\sum_k \frac{\partial L^{(t)}}{\partial o^{(t)}_k}\frac{\partial o^{(t)}_k}{\partial c_i} = \sum_t \frac{\partial L^{(t)}}{\partial o^{(t)}_i} \qquad \left(\because\ k \neq i \Rightarrow \frac{\partial o^{(t)}_k}{\partial c_i} = 0\right)$$
$$\Rightarrow\ \nabla_c L = \sum_t \nabla_{o^{(t)}} L.$$
For the hidden bias,
$$\frac{\partial L}{\partial b_i} = \sum_t\sum_c \frac{\partial L}{\partial h^{(t)}_c}\sum_k \frac{\partial h^{(t)}_c}{\partial a^{(t)}_k}\frac{\partial a^{(t)}_k}{\partial b_i} = \sum_t \left(\nabla_{h^{(t)}} L\right)_i\left(1-\tanh^2\!\left(a^{(t)}_i\right)\right) \qquad \left(\because\ k \neq i \Rightarrow \frac{\partial a^{(t)}_k}{\partial b_i} = 0\right),$$
and since $1-\tanh^2\!\left(a^{(t)}_i\right) = 1-\left(h^{(t)}_i\right)^2$,
$$\Rightarrow\ \nabla_b L = \sum_t \mathrm{diag}\!\left(1-\left(h^{(t)}\right)^2\right)\nabla_{h^{(t)}} L.$$
Gradient with respect to V:
$$\frac{\partial L}{\partial V_{ij}} = \sum_t\sum_c \frac{\partial L}{\partial L^{(t)}}\frac{\partial L^{(t)}}{\partial o^{(t)}_c}\frac{\partial o^{(t)}_c}{\partial V_{ij}} = \sum_t \frac{\partial L^{(t)}}{\partial o^{(t)}_i}\, h^{(t)}_j = \sum_t \frac{\partial L}{\partial o^{(t)}_i}\, h^{(t)}_j,$$
since
$$\frac{\partial o^{(t)}_c}{\partial V_{ij}} = \frac{\partial\left(c_c + \left(Vh^{(t)}\right)_c\right)}{\partial V_{ij}} = \begin{cases} h^{(t)}_j & (c = i)\\ 0 & (c \neq i)\end{cases} \qquad \left(\because\ \left(Vh^{(t)}\right)_c = \sum_k V_{ck} h^{(t)}_k\right).$$
$$\Rightarrow\ \nabla_V L = \sum_t \left(\nabla_{o^{(t)}} L\right) h^{(t)\top}.$$
Gradient with respect to W:
$$\frac{\partial L}{\partial W_{ij}} = \sum_t\sum_u \frac{\partial L}{\partial h^{(t)}_u}\frac{\partial h^{(t)}_u}{\partial W_{ij}} = \sum_t \frac{\partial L}{\partial h^{(t)}_i}\frac{\partial h^{(t)}_i}{\partial W_{ij}},$$
where, because $h^{(t)}_i = \tanh\!\left(a^{(t)}_i\right)$,
$$\frac{\partial h^{(t)}_i}{\partial W_{ij}} = \sum_c \frac{\partial h^{(t)}_i}{\partial a^{(t)}_c}\frac{\partial a^{(t)}_c}{\partial W_{ij}} = \frac{\partial h^{(t)}_i}{\partial a^{(t)}_i}\frac{\partial a^{(t)}_i}{\partial W_{ij}} = \left(1-\tanh^2\!\left(a^{(t)}_i\right)\right)\frac{\partial a^{(t)}_i}{\partial W_{ij}} = \left(1-\left(h^{(t)}_i\right)^2\right) h^{(t-1)}_j$$
$$\left(\because\ \left(Wh^{(t-1)}\right)_i = \sum_k W_{ik} h^{(t-1)}_k \ \Rightarrow\ \frac{\partial \left(Wh^{(t-1)}\right)_i}{\partial W_{ij}} = h^{(t-1)}_j\right).$$
$$\Rightarrow\ \nabla_W L = \sum_t \mathrm{diag}\!\left(1-\left(h^{(t)}\right)^2\right)\left(\nabla_{h^{(t)}} L\right) h^{(t-1)\top}.$$
Gradient with respect to U:
$$\frac{\partial L}{\partial U_{ij}} = \sum_t\sum_u \frac{\partial L}{\partial h^{(t)}_u}\frac{\partial h^{(t)}_u}{\partial U_{ij}} = \sum_t \frac{\partial L}{\partial h^{(t)}_i}\sum_c \frac{\partial h^{(t)}_i}{\partial a^{(t)}_c}\frac{\partial a^{(t)}_c}{\partial U_{ij}} = \sum_t \frac{\partial L}{\partial h^{(t)}_i}\frac{\partial h^{(t)}_i}{\partial a^{(t)}_i}\frac{\partial a^{(t)}_i}{\partial U_{ij}} = \sum_t \frac{\partial L}{\partial h^{(t)}_i}\left(1-\tanh^2\!\left(a^{(t)}_i\right)\right) x^{(t)}_j$$
$$\left(\because\ \left(Ux^{(t)}\right)_i = \sum_k U_{ik} x^{(t)}_k \ \Rightarrow\ \frac{\partial \left(Ux^{(t)}\right)_i}{\partial U_{ij}} = x^{(t)}_j\right)$$
$$\Rightarrow\ \nabla_U L = \sum_t \mathrm{diag}\!\left(1-\left(h^{(t)}\right)^2\right)\left(\nabla_{h^{(t)}} L\right) x^{(t)\top}.$$
Collecting all the parameter gradients:
$$\nabla_c L = \sum_t \nabla_{o^{(t)}} L$$
$$\nabla_b L = \sum_t \mathrm{diag}\!\left(1-\left(h^{(t)}\right)^2\right)\nabla_{h^{(t)}} L$$
$$\nabla_V L = \sum_t \left(\nabla_{o^{(t)}} L\right) h^{(t)\top}$$
$$\nabla_W L = \sum_t \mathrm{diag}\!\left(1-\left(h^{(t)}\right)^2\right)\left(\nabla_{h^{(t)}} L\right) h^{(t-1)\top}$$
$$\nabla_U L = \sum_t \mathrm{diag}\!\left(1-\left(h^{(t)}\right)^2\right)\left(\nabla_{h^{(t)}} L\right) x^{(t)\top}$$
With these, the network can be trained!
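The five sums translate directly into an accumulation loop. Here is a sketch that combines them with the hypothetical `rnn_forward` / `backward_hidden` helpers from the earlier snippets; the dictionary interface and all names are assumptions:

```python
import numpy as np

def bptt_param_grads(x_seq, h0, hs, grad_o, grad_h):
    """Accumulate the parameter gradients of the basic RNN over all time steps."""
    grads = {k: 0.0 for k in ("b", "c", "U", "V", "W")}
    h_prevs = [h0] + hs[:-1]                      # h(t-1) for each time step t
    for x_t, h_prev, h_t, go_t, gh_t in zip(x_seq, h_prevs, hs, grad_o, grad_h):
        dh_raw = (1 - h_t ** 2) * gh_t            # diag(1 - h(t)^2) grad_h(t)
        grads["c"] += go_t                        # sum_t grad_o(t)
        grads["b"] += dh_raw                      # sum_t diag(1-h^2) grad_h(t)
        grads["V"] += np.outer(go_t, h_t)         # sum_t grad_o(t) h(t)^T
        grads["W"] += np.outer(dh_raw, h_prev)    # sum_t diag(1-h^2) grad_h(t) h(t-1)^T
        grads["U"] += np.outer(dh_raw, x_t)       # sum_t diag(1-h^2) grad_h(t) x(t)^T
    return grads
```

These gradients can be validated against finite differences in the same way as the softmax check above, and then fed to any gradient-based optimizer.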
We have seen in figure 10.9 how an RNN can map a fixed-size vector to a sequence. We have seen in figures 10.3, 10.4, 10.10 and 10.11 how an RNN can map an input sequence to an output sequence of the same length. The encoder-decoder (sequence-to-sequence) architecture maps an input sequence to an output sequence that need not have the same length.
[Figure 10.12: an encoder RNN reads the input sequence x(1), ..., x(nx) and produces a context C; a decoder RNN generates the output sequence y(1), ..., y(ny) conditioned on C]
http://www.deeplearningbook.org/lecture_slides.html
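A minimal skeleton of the encoder-decoder idea in the figure; this is purely illustrative, and `encode_step` / `decode_step` are hypothetical single-step functions, e.g. built in the style of `rnn_forward` above:

```python
def encoder_decoder(x_seq, n_y, h0, encode_step, decode_step):
    """Encode x(1)..x(nx) into a fixed-size context C, then decode y(1)..y(ny) from it."""
    h = h0
    for x_t in x_seq:            # encoder reads the whole input sequence
        h = encode_step(h, x_t)
    context = h                  # fixed-size summary C of the input
    h_dec, y_seq = context, []
    for _ in range(n_y):         # decoder generates the output sequence from C
        h_dec, y_t = decode_step(h_dec)
        y_seq.append(y_t)
    return y_seq
```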
10.6 Recursive Neural Networks
• RNN (recursive neural network): handles data with a tree structure.
[Figure 10.14: A recursive network has a computational graph that generalizes that of the recurrent network from a chain to a tree. A variable-size sequence x(1), x(2), ..., x(t) can be mapped to a fixed-size representation (the output o), with a fixed set of parameters.]
http://www.deeplearningbook.org/lecture_slides.html
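As a rough illustration of the chain-to-tree generalization, here is a tiny sketch of a recursive net over a binary tree. The parameterization below (projecting leaves with U and combining children with W_l, W_r, b) is an assumption for illustration and not necessarily the exact wiring of figure 10.14:

```python
import numpy as np

def recursive_forward(node, U, W_l, W_r, b):
    """Bottom-up pass: leaves are input vectors, internal nodes are (left, right) pairs."""
    if isinstance(node, np.ndarray):              # leaf: embed the input vector (assumed role of U)
        return np.tanh(U @ node)
    left, right = node
    h_l = recursive_forward(left, U, W_l, W_r, b)
    h_r = recursive_forward(right, U, W_l, W_r, b)
    return np.tanh(W_l @ h_l + W_r @ h_r + b)     # the same parameters are reused at every internal node
```

The root's hidden state plays the role of the fixed-size representation in the caption: a variable-size input is mapped to a fixed-size vector with a fixed set of parameters.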