
DNN for Structural Data

Recurrent Neural Networks (RNNs), Gradient vanishing and exploding, Long Short-Term Memory (LSTM), Gated Recurrent Units (GRUs), Recursive Neural Network, Tree-structured LSTM, Convolutional Neural Networks (CNNs)

Naoaki Okazaki

August 07, 2020

Transcript

  1. DNN for Structural Data Naoaki Okazaki School of Computing, Tokyo

    Institute of Technology [email protected] PowerPoint template designed by https://ppt.design4u.jp/template/
  2. Embeddings for phrases and sentences
    • Word embeddings represent words with real-valued vectors
    • Is it possible to consider embeddings for phrases and sentences (e.g., "John loves Mary")?
  3. The baseline: additive composition
    Example with two-dimensional word embeddings: John = (0.25, -0.25), loves = (1, 0), Mary = (0, 1),
    so the sum "John loves Mary" = (1.25, 0.75). This approach surprisingly works well in practice, but
    cannot distinguish different word orders ("John loves Mary" vs. "Mary loves John").
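A minimal sketch of additive composition with the toy two-dimensional embeddings from the slide (illustrative values, not trained embeddings):

    import torch

    # Toy 2-dimensional word embeddings from the slide
    emb = {
        "John":  torch.tensor([0.25, -0.25]),
        "loves": torch.tensor([1.0, 0.0]),
        "Mary":  torch.tensor([0.0, 1.0]),
    }
    # Additive composition: sum the word vectors of the sentence
    v = sum(emb[w] for w in ["John", "loves", "Mary"])   # tensor([1.25, 0.75])
    # Word order is lost: "Mary loves John" yields exactly the same vector
    v2 = sum(emb[w] for w in ["Mary", "loves", "John"])
    assert torch.allclose(v, v2)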
  4. Summary
    • Various NN architectures that can leverage structures
      • Recurrent Neural Networks (RNNs)
      • Long Short-Term Memories (LSTMs)
      • Gated Recurrent Units (GRUs)
      • Recursive Neural Networks (Recursive NNs)
      • Convolutional Neural Networks (CNNs)
  5. Recurrent Neural Networks (RNNs) (Sutskever+ 2011)
    Example: predict a label for the sentence "John loves Mary much" (four words, hence h_4) with softmax.
    • Word embeddings: represent each word with a vector x_t ∈ R^d
    • Recurrent computation: compose a hidden vector h_t from the input word x_t and the hidden vector
      h_{t-1} at the previous timestep, h_t = g(W_{xh} x_t + W_{hh} h_{t-1}), with h_0 = 0
    • Fully-connected layer for a task: make a prediction from the hidden vector h_4, which is composed
      from all the words in the sentence, by using a fully-connected layer (W_{hy}) and softmax
    • The parameters W_{xh}, W_{hh}, W_{hy} are shared over the entire sequence; they are trained from
      the supervision signal using backpropagation (through timesteps 1, ..., 4)
    I Sutskever, J Martens, G Hinton. 2011. Generating text with recurrent neural networks. Proc. of ICML, pp. 1017–1024.
  6. RNN in math
    h_t = RNN(x_t, h_{t-1}) = g(W_{xh} x_t + W_{hh} h_{t-1}),
    where x_t ∈ R^d, h_t ∈ R^m, W_{xh} ∈ R^{m×d}, W_{hh} ∈ R^{m×m}.
    Typical activation functions g are tanh and ReLU.
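A minimal sketch of this recurrence in PyTorch (the names W_xh, W_hh and the dimensions are illustrative; tanh is used as the activation g):

    import torch

    def rnn_step(x_t, h_prev, W_xh, W_hh):
        # h_t = tanh(W_xh x_t + W_hh h_{t-1})
        return torch.tanh(W_xh @ x_t + W_hh @ h_prev)

    d, m = 4, 3                      # input and hidden dimensions
    W_xh = torch.randn(m, d)
    W_hh = torch.randn(m, m)
    h = torch.zeros(m)               # h_0 = 0
    for x_t in torch.randn(5, d):    # a sequence of length 5
        h = rnn_step(x_t, h, W_xh, W_hh)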
  7. Multi-layer RNNs
    Stack RNN layers: the hidden vectors of layer 1 become the inputs of layer 2,
    h_t^(1) = RNN^(1)(x_t, h_{t-1}^(1)),  h_t^(2) = RNN^(2)(h_t^(1), h_{t-1}^(2)),
    i.e., the input to layer 2 at timestep t is x_t^(2) = h_t^(1).
  8. Forward and backward RNNs
    • Forward RNNs:  h_t = RNN_f(x_t, h_{t-1}) = g(W_{xh} x_t + W_{hh} h_{t-1})
    • Backward RNNs: h_{t-1} = RNN_b(x_{t-1}, h_t) = g(W_{xh} x_{t-1} + W_{hh} h_t)
  9. Bidirectional RNNs (Graves+ 2013)
    Run a forward RNN and a backward RNN over the input (e.g., "John loves Mary much") and concatenate
    the last hidden vectors of both directions; the fully-connected layer for the task is the same as in
    unidirectional RNNs.
    A Graves, A Mohamed and G Hinton. 2013. Speech Recognition with Deep Recurrent Neural Networks. Proc. of ICASSP, pp. 6645-6649.
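A minimal sketch with torch.nn.RNN: setting bidirectional=True runs a forward and a backward RNN and concatenates their hidden vectors (the sizes below are illustrative):

    import torch
    import torch.nn as nn

    rnn = nn.RNN(input_size=8, hidden_size=16, bidirectional=True, batch_first=True)
    x = torch.randn(2, 5, 8)                       # (batch, time, input_size)
    out, h_n = rnn(x)                              # out: (2, 5, 32), forward and backward states concatenated
    # h_n: (num_directions, batch, hidden_size); concatenate the last states of both directions
    feature = torch.cat([h_n[0], h_n[1]], dim=-1)  # (2, 32), input to the fully-connected layer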
  10. Unfolded Recurrent Neural Network
    • Process a sequence x_1, x_2, ..., x_T of length T
    • Include interactions from the past
    • The neural network is deep in the time direction
    • Share the parameters W_{xh} and W_{hh} over the sequence
    • Trained by backpropagation on the unfolded graph; this is called backpropagation through time (BPTT)
  11. Example: RNN for nationality prediction
    Feed a name character by character (e.g., "G", "o", "t", "o") and predict the nationality with a
    softmax over the final hidden vector h_4:
    • Input x_t ∈ R^55 is a one-hot vector of a letter (55 = |[A-Za-z .,;']|)
    • Hidden vectors h_t ∈ R^128, with h_0 = 0
    • Output y ∈ R^18 (one of 18 nationalities)
  12. Preprocess the data
    [
      [ "Nguyen", "Vietnamese" ],
      [ "Tron", "Vietnamese" ],
      [ "Le", "Vietnamese" ],
      ......
    ]
    https://github.com/chokkan/deeplearning/blob/master/notebook/rnn.ipynb
  13. Convert the data into numerical data
    [
      [[16, 35, 49, 53, 33, 42], 17],
      [[22, 46, 43, 42], 17],
      [[14, 33], 17],
      ......
    ]
    • Find the alphabet (X) and the set of country names (Y)
    • Build an associative array to map a letter/country to an integer ID
    • Convert letters and countries into integer IDs by using the associative arrays
    https://github.com/chokkan/deeplearning/blob/master/notebook/rnn.ipynb
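A minimal sketch of this conversion step, assuming [name, country] pairs like those on the previous slide (the dictionary names are illustrative, and the IDs will differ from the slide's because those were built from the full dataset):

    # data: list of [name, country] pairs
    data = [["Nguyen", "Vietnamese"], ["Tron", "Vietnamese"], ["Le", "Vietnamese"]]
    # Build associative arrays that map a letter / a country to an integer ID
    letter_to_id, country_to_id = {}, {}
    for name, country in data:
        for ch in name:
            letter_to_id.setdefault(ch, len(letter_to_id))
        country_to_id.setdefault(country, len(country_to_id))
    # Convert letters and countries into integer IDs
    encoded = [[[letter_to_id[ch] for ch in name], country_to_id[country]]
               for name, country in data]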
  14. Bare implementation of RNN states
    h_t = g(W_{xh} x_t + W_{hh} h_{t-1}) = g(W_h [x_t; h_{t-1}])
    https://github.com/chokkan/deeplearning/blob/master/notebook/rnn.ipynb
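A minimal sketch of this formulation, computing the new state from the concatenation [x_t; h_{t-1}] with a single weight matrix W_h (names and the tanh choice are illustrative, not the notebook's exact code):

    import torch

    def rnn_state(x_t, h_prev, W_h):
        # h_t = tanh(W_h [x_t; h_{t-1}])
        return torch.tanh(W_h @ torch.cat([x_t, h_prev]))

    d, m = 55, 128                   # one-hot input and hidden sizes from the example
    W_h = torch.randn(m, d + m)
    h = torch.zeros(m)
    x = torch.zeros(d); x[16] = 1.0  # a one-hot letter vector
    h = rnn_state(x, h, W_h)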
  15. Long-term dependency
    • Consider a simplified RNN (without an input and activation function): h_t = W h_{t-1}
    • After t steps, this is equivalent to multiplying W^t:  h_t = W^t h_0
    • When W has an eigenvalue decomposition, W = Q diag(λ) Q^{-1}, we can compute W^t as
      W^t = (Q diag(λ) Q^{-1})^t = Q diag(λ)^t Q^{-1}
    • The eigenvalues are multiplied t times:
      • when |λ_i| < 1, λ_i^t → 0 (gradient vanishing)
      • when |λ_i| > 1, λ_i^t → ∞ (gradient exploding)
    • Computing h_t in this way is similar to the power method: h_t will be close to the eigenvector for
      the largest eigenvalue of W, regardless of the initial vector h_0
    I Goodfellow, Y Bengio, A Courville. 2016. Deep Learning, page 286, MIT Press.
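A small numerical illustration of the argument above, using a hypothetical 2×2 matrix (not from the slides):

    import numpy as np

    W = np.diag([0.5, 1.5])          # eigenvalues 0.5 and 1.5
    h0 = np.array([1.0, 1.0])
    for t in (1, 10, 50):
        print(t, np.linalg.matrix_power(W, t) @ h0)
    # The 0.5-component shrinks toward 0 (vanishing) and the 1.5-component blows up
    # (exploding); the direction of W^t h0 approaches the eigenvector of the largest eigenvalue.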
  16. Gradient vanishing/exploding problem
    • Gradients vanish or explode over time as they propagate through the unfolded RNN
    • More detailed explanations:
      • Why are deep neural networks hard to train? http://neuralnetworksanddeeplearning.com/chap5.html
      • Japanese translation ("Why is it hard to train neural networks?"): https://nnadl-ja.github.io/nnadl_site_ja/chap5.html
      • Recurrent Neural Networks, LSTMs and Vanishing & Exploding Gradients - Fun and Easy Machine Learning https://www.youtube.com/watch?v=2GNbIKTKCfE
  17. Addressing gradient vanishing/exploding
    • Gradient exploding
      • Gradient clipping (Pascanu+ 2013): when the norm of the gradients is above a threshold, scale down the gradients
    • Gradient vanishing
      • Activation function: change tanh to ReLU
      • Long Short-Term Memory (LSTM)
      • Gated Recurrent Unit (GRU)
      • Residual Networks
    R Pascanu, T Mikolov, Y Bengio. 2013. On the difficulty of training recurrent neural networks. Proc. of ICML, pp. 1310-1318.
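A minimal sketch of gradient clipping inside a PyTorch training step; torch.nn.utils.clip_grad_norm_ rescales the gradients when their total norm exceeds max_norm (the model, data, and optimizer here are placeholders):

    import torch

    model = torch.nn.Linear(10, 2)                 # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(4, 10), torch.randint(0, 2, (4,))
    loss = torch.nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    # If the gradient norm is above the threshold, scale the gradients down
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()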
  18. Long Short-Term Memory (Hochreiter+ 1997)
    • Consists of (∗ denotes the elementwise product):
      • Hidden state: h_t = o_t ∗ tanh(c_t)
      • Memory cell: c_t = f_t ∗ c_{t-1} + i_t ∗ tanh(W_{xg} x_t + W_{hg} h_{t-1})
      • Input gate:  i_t = σ(W_{xi} x_t + W_{hi} h_{t-1})
      • Output gate: o_t = σ(W_{xo} x_t + W_{ho} h_{t-1})
      • Forget gate: f_t = σ(W_{xf} x_t + W_{hf} h_{t-1})
    • The architecture looks complicated, but LSTMs are also a neural network
    • LSTMs can also be trained by the standard procedure of backpropagation
    S Hochreiter, J Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
  19. LSTM in math and diagram
    f_t = σ(W_{xf} x_t + W_{hf} h_{t-1})     (forget gate)
    i_t = σ(W_{xi} x_t + W_{hi} h_{t-1})     (input gate)
    o_t = σ(W_{xo} x_t + W_{ho} h_{t-1})     (output gate)
    g_t = tanh(W_{xg} x_t + W_{hg} h_{t-1})
    c_t = f_t ∗ c_{t-1} + i_t ∗ g_t          (memory cell)
    h_t = o_t ∗ tanh(c_t)                    (hidden state)
  20. Implementation of LSTM cell in pytorch
    def LSTMCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
        h_prev, c_prev = hidden
        # Compute all gate pre-activations at once
        gates = F.linear(x, w_x, b_x) + F.linear(h_prev, w_h, b_h)
        ingate, forgetgate, cellgate, outgate = gates.chunk(4, 1)
        ingate = F.sigmoid(ingate)
        forgetgate = F.sigmoid(forgetgate)
        cellgate = F.tanh(cellgate)
        outgate = F.sigmoid(outgate)
        ct = (forgetgate * c_prev) + (ingate * cellgate)   # c_t = f ∗ c_{t-1} + i ∗ g
        ht = outgate * F.tanh(ct)                          # h_t = o ∗ tanh(c_t)
        return ht, ct
  21. Implementation of LSTM cell in pytorch
    w_x is the concatenation of the four input-to-gate weight matrices, w_x = [W_{xf}; W_{xi}; W_{xg}; W_{xo}],
    so a single F.linear computes all gate pre-activations from the input x:
    def LSTMCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
        h_prev, c_prev = hidden
        gates = F.linear(x, w_x, b_x)
  22. Implementation of LSTM cell in pytorch
    w_h stacks the hidden-to-gate weight matrices in the same way, w_h = [W_{hf}; W_{hi}; W_{hg}; W_{ho}],
    and its contribution from h_{t-1} (h_prev) is added to the gate pre-activations:
    def LSTMCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
        h_prev, c_prev = hidden
        gates = F.linear(x, w_x, b_x) + F.linear(h_prev, w_h, b_h)
  23. Implementation of LSTM cell in pytorch
    Split the pre-activations into the four gates and apply the nonlinearities (sigmoid for the gates, tanh for the cell input):
    gates = F.linear(x, w_x, b_x) + F.linear(h_prev, w_h, b_h)
    ingate, forgetgate, cellgate, outgate = gates.chunk(4, 1)
    ingate = F.sigmoid(ingate)
    forgetgate = F.sigmoid(forgetgate)
    cellgate = F.tanh(cellgate)
    outgate = F.sigmoid(outgate)
  24. Implementation of LSTM cell in pytorch
    Update the memory cell and the hidden state:
    outgate = F.sigmoid(outgate)
    ct = (forgetgate * c_prev) + (ingate * cellgate)
    ht = outgate * F.tanh(ct)
    return ht, ct
  25. LSTM remedies vanishing gradients
    • Memory cells provide shortcuts among states
    • Memory cells do not suffer from zero gradients caused by activation functions (tanh and ReLU),
      because memory cells are connected without activation functions
    • Information in c_{t-1} can flow unchanged when the forget gate is wide open (f_t = 1)
    • The input at each state (i_t ∗ g_t) has no effect on the gradient ∂c_t/∂c_{t-1}
  26. Gated Recurrent Unit (GRU) (Cho+ 2014)
    • Consists of (∗ denotes the elementwise product):
      • Hidden state: h_t = z_t ∗ h_{t-1} + (1 − z_t) ∗ n_t
      • New hidden state: n_t = tanh(W_{xh} x_t + W_{hh}(r_t ∗ h_{t-1}))
      • Reset gate:  r_t = σ(W_{xr} x_t + W_{hr} h_{t-1})
      • Update gate: z_t = σ(W_{xz} x_t + W_{hz} h_{t-1})
    • Motivated by the LSTM unit, but much simpler to compute and implement
    K Cho, B van Merrienboer, C Gulcehre, D Bahdanau, F Bougares, H Schwenk, Y Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. Proc. of EMNLP, pp. 1724–1734.
  27. GRU in math and diagram
    r_t = σ(W_{xr} x_t + W_{hr} h_{t-1})        (reset gate)
    z_t = σ(W_{xz} x_t + W_{hz} h_{t-1})        (update gate)
    n_t = tanh(W_{xh} x_t + W_{hh}(r_t ∗ h_{t-1}))
    h_t = z_t ∗ h_{t-1} + (1 − z_t) ∗ n_t
  28. Sequential GRU in PyTorch
    • Replace torch.nn.RNN with torch.nn.GRU
    • The shape of the initial state is unchanged
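A minimal sketch of this drop-in replacement (the sizes are illustrative, taken from the nationality-prediction example):

    import torch
    import torch.nn as nn

    # rnn = nn.RNN(input_size=55, hidden_size=128, batch_first=True)
    rnn = nn.GRU(input_size=55, hidden_size=128, batch_first=True)   # only this line changes
    x = torch.randn(1, 6, 55)         # (batch, time, input_size)
    h0 = torch.zeros(1, 1, 128)       # same initial-state shape as nn.RNN
    out, h_n = rnn(x, h0)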
  29. Implementation of GRU cell in PyTorch
    r_t = σ(W_{xr} x_t + W_{hr} h_{t-1})        (reset gate)
    z_t = σ(W_{xz} x_t + W_{hz} h_{t-1})        (update gate)
    n_t = tanh(W_{xh} x_t + W_{hh}(r_t ∗ h_{t-1})) = tanh(W_{xh} x_t + r_t ∗ (W_{hh} h_{t-1}))
          (the latter form is more computationally efficient)
    h_t = z_t ∗ h_{t-1} + (1 − z_t) ∗ n_t = n_t + z_t ∗ (h_{t-1} − n_t)
  30. Implementation of GRU cell in PyTorch
    def GRUCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
        gx = F.linear(x, w_x, b_x)
        gh = F.linear(hidden, w_h, b_h)
        x_r, x_i, x_n = gx.chunk(3, 1)
        h_r, h_i, h_n = gh.chunk(3, 1)
        resetgate = F.sigmoid(x_r + h_r)
        inputgate = F.sigmoid(x_i + h_i)
        newgate = F.tanh(x_n + resetgate * h_n)
        hy = newgate + inputgate * (hidden - newgate)
        return hy
  31. Implementation of GRU cell in PyTorch
    Compute the input-side pre-activations of all gates at once with a single F.linear over w_x:
    def GRUCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
        gx = F.linear(x, w_x, b_x)
  32. Implementation of GRU cell in PyTorch
    Compute the hidden-side pre-activations from h_{t-1} (hidden) in the same way:
    def GRUCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
        gx = F.linear(x, w_x, b_x)
        gh = F.linear(hidden, w_h, b_h)
  33. Implementation of GRU cell in PyTorch
    Split the pre-activations into the reset (x_r, h_r), update (x_i, h_i), and new-state (x_n, h_n) parts:
    def GRUCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
        gx = F.linear(x, w_x, b_x)
        gh = F.linear(hidden, w_h, b_h)
        x_r, x_i, x_n = gx.chunk(3, 1)
        h_r, h_i, h_n = gh.chunk(3, 1)
  34. Implementation of GRU cell in PyTorch
    resetgate = F.sigmoid(x_r + h_r)    # r_t = σ(W_{xr} x_t + W_{hr} h_{t-1})
    inputgate = F.sigmoid(x_i + h_i)    # z_t = σ(W_{xz} x_t + W_{hz} h_{t-1})
  35. Implementation of GRU cell in PyTorch
    newgate = F.tanh(x_n + resetgate * h_n)          # n_t = tanh(W_{xh} x_t + r_t ∗ (W_{hh} h_{t-1}))
    hy = newgate + inputgate * (hidden - newgate)    # h_t = n_t + z_t ∗ (h_{t-1} − n_t)
  36. Comparison of RNNs (Karpathy+ 2016)
    • Task: character-level language modeling (predicting subsequent characters)
    • LSTMs and GRUs significantly outperform plain RNNs
    • RNNs seem to learn different embeddings from those of LSTMs and GRUs
    A Karpathy, J Johnson, and L Fei-Fei. 2016. Visualizing and Understanding Recurrent Networks. Proc. of ICLR Workshop 2016.
  37. Observing LSTM cells (Karpathy+ 2016)
    A Karpathy, J Johnson, and L Fei-Fei. 2016. Visualizing and Understanding Recurrent Networks. Proc. of ICLR Workshop 2016.
  38. Recursive Neural Network (Socher+ 2011)
    • Compose a phrase vector p = f(W[c1; c2])
      • c1, c2 ∈ R^d: constituent vectors
      • p ∈ R^d: phrase vector
      • W ∈ R^{d×2d}: parameter
      • f: activation function
    • Recursively compose vectors along the phrase structure (parse tree) of a sentence,
      e.g., "very" + "good" → "very good", then "very good" + "movie" → "very good movie"
    R Socher, J Pennington, E Huang, A Ng, and C Manning. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. Proc. of EMNLP, pp. 151-161.
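A minimal sketch of the composition p = f(W[c1; c2]) applied along the tiny parse tree ((very good) movie); the random vectors, weight, and tanh activation are illustrative:

    import torch

    d = 4
    W = torch.randn(d, 2 * d)          # composition parameter W ∈ R^{d×2d}
    f = torch.tanh                     # activation function

    def compose(c1, c2):
        # p = f(W [c1; c2])
        return f(W @ torch.cat([c1, c2]))

    very, good, movie = torch.randn(d), torch.randn(d), torch.randn(d)
    very_good = compose(very, good)              # phrase vector for "very good"
    very_good_movie = compose(very_good, movie)  # phrase vector for "very good movie"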
  39. Matrix-Vector Recursive Neural Network (MV-RNN) (Socher+ 2012)
    • Each word has a semantic vector and a composition matrix
    • Compose a phrase vector and matrix recursively: given constituents (a, A) and (b, B),
      p = f(W[Ba; Ab]) and P = W_M[A; B]
    R Socher, B Huval, C Manning and A Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. Proc. of EMNLP, pp. 1201-1211.
  40. Recursive Neural Tensor Network (Socher+ 2013)
    • MV-RNN has too many parameters to train because it assigns every word a composition matrix
    • Instead, transform a word vector into a composition matrix by using a tensor
    R Socher, A Perelygin, J Wu, J Chuang, C Manning, A Ng and C Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. Proc. of EMNLP, pp. 1631-1642.
  41. Tree-structured LSTM (Tai+ 2015)
    https://pdfs.semanticscholar.org/bd19/c394931257c1901a940ba8388366c35a3e33.pdf
    K S Tai, R Socher, C D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. Proc. of ACL-IJCNLP, pp. 1556–1566.
  42. Stanford Sentiment Treebank (Socher+ 2013) 48 Movie reviews are parsed

    into phrase structures. Each node in a parse tree has a sentiment value (--, -, 0, +, ++) assigned by three annotators. R Socher, A Perelygin, J Wu, J Chuang, C Manning, A Ng and C Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. Proc. of EMNLP, pp. 1631-1642.
  43. Comparison on Stanford Sentiment Treebank (Tai+ 2015)
    K S Tai, R Socher, C D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. Proc. of ACL-IJCNLP, pp. 1556–1566.
  44. Convolutional Neural Network (CNN) (Kim 2014)
    Slide a convolution window of width k over the word embeddings of a sentence (e.g., "It is a very
    good movie indeed"), producing feature values s_{i,j}, and apply max pooling over time:
    c_j = max_{1≤i≤T−k+1} s_{i,j}
    (max pooling: each dimension j is the maximum of the values s_{i,j} over timesteps). The pooled
    vector is fed to a fully-connected layer with softmax for the task.
    Y Kim. 2014. Convolutional neural networks for sentence classification. Proc. of EMNLP, pp. 1746-1751.
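A minimal sketch of convolution over word embeddings followed by max-over-time pooling, in the spirit of Kim (2014); the dimensions, single filter width, and layers are illustrative, not the paper's exact configuration:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    T, d, n_filters, k = 7, 50, 100, 3          # sentence length, embedding dim, filters, window width
    x = torch.randn(1, d, T)                    # (batch, channels = embedding dim, time)
    conv = nn.Conv1d(d, n_filters, kernel_size=k)
    s = torch.relu(conv(x))                     # (1, n_filters, T - k + 1)
    c, _ = s.max(dim=2)                         # max pooling over timesteps -> (1, n_filters)
    logits = nn.Linear(n_filters, 2)(c)         # fully-connected layer for the task
    probs = F.softmax(logits, dim=1)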
  45. Various pooling operations (Kalchbrenner+ 2014)
    • Max pooling: c_j = max_{1≤i≤T−k+1} s_{i,j}
    • Average pooling: c_j = (1 / (T − k + 1)) Σ_{i=1}^{T−k+1} s_{i,j}
    • k-max pooling: take the k largest values (instead of the single maximum)
    • Dynamic k-max pooling: change the value of k adaptively based on the length T of the input
    N Kalchbrenner, E Grefenstette, P Blunsom. 2014. A convolutional neural network for modelling sentences. Proc. of ACL, pp. 655-665.
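A minimal sketch of k-max pooling with torch.topk, keeping the k largest values per feature map instead of only the maximum (k and the shapes are illustrative; Kalchbrenner+ 2014 keep the selected values in their original time order, which the gather below reproduces):

    import torch

    s = torch.randn(1, 100, 5)          # (batch, feature maps, timesteps)
    k = 2
    _, idx = s.topk(k, dim=2)           # positions of the k largest values per feature map
    sorted_idx, _ = idx.sort(dim=2)     # restore original time order
    kmax = s.gather(2, sorted_idx)      # (1, 100, k)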
  46. Hierarchical CNN includes Recursive NN
    (Figure: the sentence "The movie was the best of all" processed by convolution layers (1)–(6),
    annotated with the parse-tree constituents PP, NP, VP, NP, S.)
  47. Hierarchical CNN (AdaSent) (Zhao+ 2015)
    (Figure: the sentence "The movie was the best of all" processed by convolution layers (1)–(4) with
    max pooling at each level.) Use these vectors (e.g., their concatenation) as the input to the
    fully-connected layer for classification.
    H Zhao, Z Lu, P Poupart. 2015. Self-Adaptive Hierarchical Sentence Model. Proc. of IJCAI, pp. 4069-4076.
  48. Summary
    • Various NN architectures that can leverage structures
      • Recurrent Neural Networks (RNNs)
      • Long Short-Term Memories (LSTMs)
      • Gated Recurrent Units (GRUs)
      • Recursive Neural Networks (Recursive NNs)
      • Convolutional Neural Networks (CNNs)
    • Next question: can we generate a sentence from neural networks?