Slide 1

Slide 1 text

DNN for Structural Data
Naoaki Okazaki
School of Computing, Tokyo Institute of Technology
[email protected]
PowerPoint template designed by https://ppt.design4u.jp/template/

Slide 2

Slide 2 text

Embeddings for phrases and sentences

- Word embeddings represent words with real-valued vectors
- Is it possible to consider embeddings for phrases and sentences?

(Figure: the words "John", "loves", "Mary" composed into a single embedding for the sentence "John loves Mary".)

Slide 3

Slide 3 text

The baseline: additive composition

(Example with two-dimensional word embeddings)
John = (0.25, -0.25), loves = (1, 0), Mary = (0, 1)
"John loves Mary" = John + loves + Mary = (1.25, 0.75)

This approach works surprisingly well in practice, but cannot distinguish different word orders ("John loves Mary" vs. "Mary loves John").
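
A minimal sketch of additive composition with the toy two-dimensional vectors above (NumPy assumed; the vectors are only illustrative):

import numpy as np

# Toy two-dimensional word embeddings from the example above
emb = {
    "John":  np.array([0.25, -0.25]),
    "loves": np.array([1.0, 0.0]),
    "Mary":  np.array([0.0, 1.0]),
}

# Additive composition: the sentence vector is the sum of the word vectors
sentence = emb["John"] + emb["loves"] + emb["Mary"]
print(sentence)  # [1.25  0.75]
# Word order is lost: "Mary loves John" yields exactly the same vector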

Slide 4

Slide 4 text

Summary

- Various NN architectures that can leverage structures:
  - Recurrent Neural Networks (RNNs)
  - Long Short-Term Memories (LSTMs)
  - Gated Recurrent Units (GRUs)
  - Recursive Neural Networks (Recursive NNs)
  - Convolutional Neural Networks (CNNs)

Slide 5

Slide 5 text

Recurrent Neural Networks (RNNs)

Slide 6

Slide 6 text

Recurrent Neural Networks (RNNs) (Sutskever+ 2011)

- Word embeddings: represent a word x_t with a vector ∈ R^d
- Recurrent computation: compose a hidden vector h_t from the input word x_t and the hidden vector h_{t-1} at the previous timestep: h_t = g(W_xh x_t + W_hh h_{t-1}), with h_0 = 0
- Fully-connected layer for a task: make a prediction from the hidden vector h_4, which is composed from all words in the sentence, by using a fully-connected layer and softmax
- The parameters W_xh, W_hh, W_hy are shared over the entire sequence; they are trained from the supervision signal for the input (x_1, ..., x_4) using backpropagation

(Figure: "John loves Mary much" is read word by word into hidden vectors h_1, ..., h_4; a softmax on h_4 predicts the label ☺.)

I Sutskever, J Martens, G Hinton. 2011. Generating text with recurrent neural networks. Proc. of ICML, pp. 1017–1024.

Slide 7

Slide 7 text

RNN in math

h_t = RNN(x_t, h_{t-1}) = g(W_xh x_t + W_hh h_{t-1})

where x_t ∈ R^d, h_t ∈ R^m, W_xh ∈ R^{m×d}, W_hh ∈ R^{m×m}.
Typical activation functions g are tanh and ReLU.

Slide 8

Slide 8 text

Multi-layer RNNs

Stack RNN layers so that the hidden states of the lower layer become the inputs of the upper layer:

h_t^(1) = RNN^(1)(h_t^(0), h_{t-1}^(1)), where h_t^(0) = x_t
h_t^(2) = RNN^(2)(h_t^(1), h_{t-1}^(2))

Slide 9

Slide 9 text

Forward and backward RNNs

Forward RNNs: h_t = RNN(x_t, h_{t-1}) = g(W_xh x_t + W_hh h_{t-1})
Backward RNNs: h_{t-1} = RNN(x_{t-1}, h_t) = g(W_xh x_{t-1} + W_hh h_t)

Slide 10

Slide 10 text

Bidirectional RNNs (Graves+ 2013)

Run a forward RNN and a backward RNN over the input ("John loves Mary much") and concatenate the last hidden vectors of both directions. The fully-connected layer for a task is the same as for unidirectional RNNs.

A Graves, A Mohamed and G Hinton. 2013. Speech Recognition with Deep Recurrent Neural Networks. Proc. of ICASSP, pp. 6645-6649.
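
A hedged sketch of a bidirectional RNN in PyTorch; the hidden sizes, embedding size, and number of output classes are assumptions for illustration, not the slide's configuration:

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=300, hidden_size=128, bidirectional=True, batch_first=True)
x = torch.randn(1, 4, 300)        # batch of one 4-word sentence ("John loves Mary much")
output, h_n = rnn(x)              # output: (1, 4, 2*128), h_n: (2, 1, 128)

# Concatenate the last hidden vectors of both directions
feature = torch.cat([h_n[0], h_n[1]], dim=-1)   # shape (1, 256)
logits = nn.Linear(256, 2)(feature)             # fully-connected layer for the task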

Slide 11

Slide 11 text

Unfolded Recurrent Neural Network

- Process a sequence x_1, x_2, ..., x_T of length T
- Include interactions from the past
- The neural network is deep in the time direction
- The parameters W_xh and W_hh are shared over the sequence
- Trained by backpropagation on the unfolded graph; this is called backpropagation through time (BPTT)

(Figure: the recurrent loop unfolded into a chain RNN(x_1) → RNN(x_2) → ... → RNN(x_T).)

Slide 12

Slide 12 text

Example: RNN for nationality prediction

Feed the letters of a name (e.g., "G", "o", "t", "o") into the RNN one by one, and predict the nationality from the last hidden vector h_4 with a fully-connected layer and softmax:
- Input x_t ∈ R^55: one-hot vector of a letter (55 = |[A-Za-z .,;']|)
- Hidden vector h_t ∈ R^128, with h_0 = 0
- Output y ∈ R^18: scores over the 18 nationality classes

Slide 13

Slide 13 text

Preprocess the data

[
  [ "Nguyen", "Vietnamese" ],
  [ "Tron", "Vietnamese" ],
  [ "Le", "Vietnamese" ],
  ......
]

https://github.com/chokkan/deeplearning/blob/master/notebook/rnn.ipynb

Slide 14

Slide 14 text

Convert the data into numerical data

[
  [[16, 35, 49, 53, 33, 42], 17],
  [[22, 46, 43, 42], 17],
  [[14, 33], 17],
  ......
]

- Find the alphabet (X) and the set of country names (Y)
- Build associative arrays that map a letter/country to an integer ID
- Convert letters and countries into integer IDs by using the associative arrays

https://github.com/chokkan/deeplearning/blob/master/notebook/rnn.ipynb
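
A minimal sketch of this conversion in plain Python; the variable names and the sorting-based ID assignment are assumptions, and the actual code is in the linked notebook:

data = [["Nguyen", "Vietnamese"], ["Tron", "Vietnamese"], ["Le", "Vietnamese"]]

# Build associative arrays mapping letters and countries to integer IDs
letters = sorted({ch for name, _ in data for ch in name})
countries = sorted({country for _, country in data})
letter_to_id = {ch: i for i, ch in enumerate(letters)}
country_to_id = {c: i for i, c in enumerate(countries)}

# Convert every (name, country) pair into lists of integer IDs
encoded = [[[letter_to_id[ch] for ch in name], country_to_id[country]]
           for name, country in data]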

Slide 15

Slide 15 text

Bare implementation of RNN states

h_t = g(W_xh x_t + W_hh h_{t-1}) = g(W_h [x_t; h_{t-1}])

(W_h concatenates W_xh and W_hh, so a single matrix multiplication on the concatenated vector [x_t; h_{t-1}] covers both terms.)

https://github.com/chokkan/deeplearning/blob/master/notebook/rnn.ipynb
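
A bare sketch of the recurrence with raw tensor operations, assuming the dimensions of the nationality example (the notebook's actual implementation may differ):

import torch

d, m = 55, 128                        # input and hidden dimensions (from the example above)
W_xh = torch.randn(m, d) * 0.01
W_hh = torch.randn(m, m) * 0.01

def rnn_states(xs):
    """xs: tensor of shape (T, d); returns the hidden states of shape (T, m)."""
    h = torch.zeros(m)                # h_0 = 0
    hs = []
    for x in xs:
        h = torch.tanh(W_xh @ x + W_hh @ h)   # h_t = g(W_xh x_t + W_hh h_{t-1})
        hs.append(h)
    return torch.stack(hs)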

Slide 16

Slide 16 text

Sequential RNN module

https://github.com/chokkan/deeplearning/blob/master/notebook/rnn.ipynb
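
A hedged sketch of the same model written with torch.nn.RNN; the class name NationalityRNN and the default sizes are assumptions based on the example above, not the notebook's exact code:

import torch
import torch.nn as nn

class NationalityRNN(nn.Module):
    def __init__(self, num_letters=55, hidden=128, num_countries=18):
        super().__init__()
        self.rnn = nn.RNN(num_letters, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_countries)

    def forward(self, x):             # x: (batch, T, num_letters) one-hot inputs
        _, h_n = self.rnn(x)          # h_n: (1, batch, hidden), the last hidden state
        return self.fc(h_n.squeeze(0))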

Slide 17

Slide 17 text

Mini-batch RNN

https://github.com/chokkan/deeplearning/blob/master/notebook/rnn.ipynb
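
Mini-batching variable-length names usually involves padding and packing. A hedged sketch with pad_sequence and pack_padded_sequence; for brevity an nn.Embedding replaces the one-hot encoding, and the IDs are illustrative (the notebook's actual approach may differ):

import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Sequences of letter IDs with different lengths
seqs = [torch.tensor([16, 35, 49, 53, 33, 42]), torch.tensor([14, 33])]
lengths = torch.tensor([len(s) for s in seqs])

padded = pad_sequence(seqs, batch_first=True)       # (batch, max_T)
embedded = torch.nn.Embedding(55, 128)(padded)      # (batch, max_T, 128)
packed = pack_padded_sequence(embedded, lengths, batch_first=True, enforce_sorted=False)

rnn = torch.nn.RNN(128, 128, batch_first=True)
output, h_n = rnn(packed)            # padding positions are skipped in the recurrence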

Slide 18

Slide 18 text

Long-term dependency

- Consider a simplified RNN (without an input and activation function): h_t = W h_{t-1}
- After t steps, this is equivalent to multiplying W t times: h_t = W^t h_0
- When W has an eigenvalue decomposition W = Q diag(λ) Q^{-1}, we can compute W^t as: W^t = (Q diag(λ) Q^{-1})^t = Q diag(λ)^t Q^{-1}
- The eigenvalues are multiplied t times:
  - When |λ_i| < 1, λ_i^t → 0 (gradient vanishing)
  - When |λ_i| > 1, λ_i^t → ∞ (gradient exploding)
- Computing h_t in this way is similar to the power method: h_t will be close to the eigenvector for the largest eigenvalue of W, regardless of the initial vector h_0

I Goodfellow, Y Bengio, A Courville. 2016. Deep Learning, page 286, MIT Press.
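
A tiny numerical illustration of this argument; the matrix and its eigenvalues are chosen only for illustration:

import torch

W = torch.tensor([[0.9, 0.0],
                  [0.0, 1.1]])        # eigenvalues 0.9 and 1.1
h = torch.tensor([1.0, 1.0])

for t in range(50):
    h = W @ h                          # h_t = W h_{t-1}

print(h)   # first component 0.9^50 ≈ 0.005 (vanishes), second 1.1^50 ≈ 117 (explodes)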

Slide 19

Slide 19 text

Gradient vanishing/exploding problem

- Gradients vanish or explode over time
- More detailed explanations:
  - Why are deep neural networks hard to train? http://neuralnetworksanddeeplearning.com/chap5.html
  - Why is it difficult to train neural networks? (Japanese translation of the above) https://nnadl-ja.github.io/nnadl_site_ja/chap5.html
  - Recurrent Neural Networks, LSTMs and Vanishing & Exploding Gradients - Fun and Easy Machine Learning https://www.youtube.com/watch?v=2GNbIKTKCfE

Slide 20

Slide 20 text

Addressing gradient vanishing/exploding

- Gradient exploding:
  - Gradient clipping (Pascanu+ 2013): when the norm of the gradients is above a threshold, scale down the gradients (see the sketch below)
- Gradient vanishing:
  - Activation function: tanh to ReLU
  - Long Short-Term Memory (LSTM)
  - Gated Recurrent Unit (GRU)
  - Residual Networks

R Pascanu, T Mikolov, Y Bengio. 2013. On the difficulty of training recurrent neural networks. Proc. of ICML, pp. 1310-1318.
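
Gradient clipping by norm is available in PyTorch as torch.nn.utils.clip_grad_norm_. A minimal sketch of its use inside a training step; the model, loss, and threshold are placeholders for illustration:

import torch

model = torch.nn.RNN(55, 128)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(6, 1, 55)             # (T, batch, input_size)
output, h_n = model(x)
loss = output.sum()                    # placeholder loss for illustration

optimizer.zero_grad()
loss.backward()
# Rescale the gradients so that their global norm does not exceed the threshold
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()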

Slide 21

Slide 21 text

Long Short-Term Memory (LSTM)

Slide 22

Slide 22 text

Long Short-Term Memory (Hochreiter+ 1997)

- Consists of (∗ denotes elementwise product):
  - Hidden state: h_t = o_t ∗ tanh(c_t)
  - Memory cell: c_t = f_t ∗ c_{t-1} + i_t ∗ tanh(W_gx x_t + W_gh h_{t-1})
  - Input gate: i_t = σ(W_ix x_t + W_ih h_{t-1})
  - Output gate: o_t = σ(W_ox x_t + W_oh h_{t-1})
  - Forget gate: f_t = σ(W_fx x_t + W_fh h_{t-1})
- The architecture looks complicated, but LSTMs are also a neural network
- LSTMs can also be trained by the standard procedure of backpropagation

S Hochreiter, J Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Slide 23

Slide 23 text

LSTM in math and diagram

f_t = σ(W_fx x_t + W_fh h_{t-1})    (forget gate)
i_t = σ(W_ix x_t + W_ih h_{t-1})    (input gate)
o_t = σ(W_ox x_t + W_oh h_{t-1})    (output gate)
g_t = tanh(W_gx x_t + W_gh h_{t-1})
c_t = f_t ∗ c_{t-1} + i_t ∗ g_t     (memory cell)
h_t = o_t ∗ tanh(c_t)               (hidden state)

(Diagram: the three gates and the tanh block combine x_t, h_{t-1}, and c_{t-1} into c_t and h_t.)

Slide 24

Slide 24 text

Sequential LSTM in pytorch

- Replace torch.nn.RNN with torch.nn.LSTM
- Change the shape of the initial state (an LSTM carries a pair of states, h_0 and the memory cell c_0)
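
A hedged sketch of this change; the sizes are illustrative only:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=55, hidden_size=128, batch_first=True)
x = torch.randn(8, 6, 55)                 # (batch, T, input_size)

# Unlike nn.RNN, the initial state is a pair (h_0, c_0)
h0 = torch.zeros(1, 8, 128)
c0 = torch.zeros(1, 8, 128)
output, (h_n, c_n) = lstm(x, (h0, c0))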

Slide 25

Slide 25 text

Implementation of LSTM cell in pytorch

import torch.nn.functional as F   # assumed import; the gates below rely on F.linear, F.sigmoid, F.tanh

def LSTMCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
    h_prev, c_prev = hidden
    # One linear map for the input and one for the previous hidden state;
    # w_x and w_h stack the parameters of all four gates
    gates = F.linear(x, w_x, b_x) + F.linear(h_prev, w_h, b_h)
    ingate, forgetgate, cellgate, outgate = gates.chunk(4, 1)
    ingate = F.sigmoid(ingate)          # input gate i_t
    forgetgate = F.sigmoid(forgetgate)  # forget gate f_t
    cellgate = F.tanh(cellgate)         # candidate g_t
    outgate = F.sigmoid(outgate)        # output gate o_t
    ct = (forgetgate * c_prev) + (ingate * cellgate)   # memory cell c_t
    ht = outgate * F.tanh(ct)                          # hidden state h_t
    return ht, ct

Slide 26

Slide 26 text

Implementation of LSTM cell in pytorch

The parameter w_x stacks the input-side weights of the four gates, w_x = [W_fx; W_ix; W_gx; W_ox], so a single linear map computes all of their input terms:

def LSTMCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
    h_prev, c_prev = hidden
    gates = F.linear(x, w_x, b_x)

Slide 27

Slide 27 text

Implementation of LSTM cell in pytorch

Likewise, w_h = [W_fh; W_ih; W_gh; W_oh] stacks the hidden-side weights, and the two linear maps over x_t and h_{t-1} (h_prev) are summed into gates:

def LSTMCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
    h_prev, c_prev = hidden
    gates = F.linear(x, w_x, b_x) + F.linear(h_prev, w_h, b_h)

Slide 28

Slide 28 text

Implementation of LSTM cell in pytorch

The stacked pre-activations are split into the four gates, and the activation functions (sigmoid for the gates, tanh for the cell candidate) are applied:

    gates = F.linear(x, w_x, b_x) + F.linear(h_prev, w_h, b_h)
    ingate, forgetgate, cellgate, outgate = gates.chunk(4, 1)
    ingate = F.sigmoid(ingate)
    forgetgate = F.sigmoid(forgetgate)
    cellgate = F.tanh(cellgate)
    outgate = F.sigmoid(outgate)

Slide 29

Slide 29 text

Implementation of LSTM cell in pytorch

Finally, the new memory cell c_t and hidden state h_t are computed from the gates and the previous cell c_{t-1} (c_prev):

    outgate = F.sigmoid(outgate)
    ct = (forgetgate * c_prev) + (ingate * cellgate)
    ht = outgate * F.tanh(ct)
    return ht, ct

Slide 30

Slide 30 text

LSTM remedies vanishing gradients

- Memory cells provide shortcuts among states
- Memory cells do not suffer from zero gradients caused by activation functions (tanh and ReLU), because memory cells are connected without activation functions
- Information in c_{t-1} can flow to c_t when the forget gate is wide open (f_t = 1)
- The input at each state (i_t ∗ g_t) is added in, so it has no effect on the gradient ∂c_t/∂c_{t-1}

(Figure: the chain of memory cells c_{t-1} → c_t → c_{t+1}, connected by the forget gates and the additive inputs i_t ∗ g_t.)

Slide 31

Slide 31 text

Gated Recurrent Units (GRUs)

Slide 32

Slide 32 text

Gated Recurrent Unit (GRU) (Cho+ 2014)

- Consists of (∗ denotes elementwise product):
  - Hidden state: h_t = z_t ∗ h_{t-1} + (1 - z_t) ∗ n_t
  - New hidden state: n_t = tanh(W_nx x_t + W_nh (r_t ∗ h_{t-1}))
  - Reset gate: r_t = σ(W_rx x_t + W_rh h_{t-1})
  - Update gate: z_t = σ(W_zx x_t + W_zh h_{t-1})
- Motivated by the LSTM unit, but much simpler to compute and implement

K Cho, B van Merrienboer, C Gulcehre, D Bahdanau, F Bougares, H Schwenk, Y Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. Proc. of EMNLP, pp. 1724–1734.

Slide 33

Slide 33 text

GRU in math and diagram

r_t = σ(W_rx x_t + W_rh h_{t-1})    (reset gate)
z_t = σ(W_zx x_t + W_zh h_{t-1})    (update gate)
n_t = tanh(W_nx x_t + W_nh (r_t ∗ h_{t-1}))
h_t = z_t ∗ h_{t-1} + (1 - z_t) ∗ n_t

(Diagram: the reset and update gates combine x_t and h_{t-1} into the new hidden state h_t.)

Slide 34

Slide 34 text

Sequential GRU in PyTorch

- Replace torch.nn.RNN with torch.nn.GRU
- The shape of the initial state is unchanged
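
A matching sketch for the GRU; the sizes are illustrative only:

import torch
import torch.nn as nn

gru = nn.GRU(input_size=55, hidden_size=128, batch_first=True)
x = torch.randn(8, 6, 55)                 # (batch, T, input_size)

# As with nn.RNN, the initial state is a single tensor h_0 (no memory cell)
h0 = torch.zeros(1, 8, 128)
output, h_n = gru(x, h0)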

Slide 35

Slide 35 text

Implementation of GRU cell in PyTorch

The candidate state is computed with the reset gate applied after the matrix multiplication (the form used by the PyTorch implementation), which is more computationally efficient:

n_t = tanh(W_nx x_t + W_nh (r_t ∗ h_{t-1}))  →  n_t = tanh(W_nx x_t + r_t ∗ (W_nh h_{t-1}))

Likewise, the hidden-state update is rewritten as:

h_t = z_t ∗ h_{t-1} + (1 - z_t) ∗ n_t  →  h_t = n_t + z_t ∗ (h_{t-1} - n_t)

Slide 36

Slide 36 text

Implementation of GRU cell in PyTorch

def GRUCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
    gx = F.linear(x, w_x, b_x)            # input-side terms of the three components
    gh = F.linear(hidden, w_h, b_h)       # hidden-side terms
    x_r, x_i, x_n = gx.chunk(3, 1)
    h_r, h_i, h_n = gh.chunk(3, 1)
    resetgate = F.sigmoid(x_r + h_r)      # r_t
    inputgate = F.sigmoid(x_i + h_i)      # z_t
    newgate = F.tanh(x_n + resetgate * h_n)          # n_t
    hy = newgate + inputgate * (hidden - newgate)    # h_t
    return hy

Slide 37

Slide 37 text

Implementation of GRU cell in PyTorch

The parameter w_x stacks the input-side weights of the reset gate, the update gate, and the candidate state, so a single linear map computes all of their input terms:

def GRUCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
    gx = F.linear(x, w_x, b_x)

Slide 38

Slide 38 text

Implementation of GRU cell in PyTorch

Likewise, w_h stacks the hidden-side weights, and a second linear map is applied to h_{t-1} (hidden):

def GRUCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
    gx = F.linear(x, w_x, b_x)
    gh = F.linear(hidden, w_h, b_h)

Slide 39

Slide 39 text

Implementation of GRU cell in PyTorch

The two stacked pre-activations are split into the parts for the reset gate (x_r, h_r), the update gate (x_i, h_i), and the candidate state (x_n, h_n):

def GRUCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
    gx = F.linear(x, w_x, b_x)
    gh = F.linear(hidden, w_h, b_h)
    x_r, x_i, x_n = gx.chunk(3, 1)
    h_r, h_i, h_n = gh.chunk(3, 1)

Slide 40

Slide 40 text

Implementation of GRU cell in PyTorch

The reset and update gates are computed with the sigmoid function:

    resetgate = F.sigmoid(x_r + h_r)     # r_t = σ(W_rx x_t + W_rh h_{t-1})
    inputgate = F.sigmoid(x_i + h_i)     # z_t = σ(W_zx x_t + W_zh h_{t-1})

Slide 41

Slide 41 text

Implementation of GRU cell in PyTorch

The candidate state and the new hidden state complete the cell:

    newgate = F.tanh(x_n + resetgate * h_n)          # n_t = tanh(W_nx x_t + r_t ∗ (W_nh h_{t-1}))
    hy = newgate + inputgate * (hidden - newgate)    # h_t = n_t + z_t ∗ (h_{t-1} - n_t)

Slide 42

Slide 42 text

Comparison of RNNs (Karpathy+ 2016)

- Task: character-level language modeling (predicting subsequent characters)
- LSTMs and GRUs significantly outperform RNNs
- RNNs seem to learn different embeddings from those of LSTMs and GRUs

A Karpathy, J Johnson, and L Fei-Fei. 2016. Visualizing and Understanding Recurrent Networks. Proc. of ICLR Workshop 2016.

Slide 43

Slide 43 text

Observing LSTM cells (Karpathy+ 2016)

(Figure from the paper: visualizations of individual LSTM cell activations over text.)

A Karpathy, J Johnson, and L Fei-Fei. 2016. Visualizing and Understanding Recurrent Networks. Proc. of ICLR Workshop 2016.

Slide 44

Slide 44 text

RNNs over trees

Slide 45

Slide 45 text

Recursive Neural Network (Socher+ 2011)

- Compose a phrase vector from its two constituents: p = g(W[c_1; c_2])
  - c_1, c_2 ∈ R^d: constituent vectors
  - p ∈ R^d: phrase vector
  - W ∈ R^{d×2d}: parameter
  - g: activation function
- Recursively compose vectors along the phrase structure (parse tree) of a sentence, e.g., "very good" from "very" and "good", then "very good movie" from "very good" and "movie"

R Socher, J Pennington, E Huang, A Ng, and C Manning. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. Proc. of EMNLP, pp. 151-161.
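
A minimal sketch of the composition p = g(W[c_1; c_2]) applied along a small parse; the dimension and the random vectors are illustrative only:

import torch

d = 100
W = torch.randn(d, 2 * d) * 0.01       # composition parameter

def compose(c1, c2):
    """Compose two constituent vectors into a phrase vector."""
    return torch.tanh(W @ torch.cat([c1, c2]))

# "very good movie" following the parse ((very good) movie)
very, good, movie = torch.randn(d), torch.randn(d), torch.randn(d)
very_good = compose(very, good)
phrase = compose(very_good, movie)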

Slide 46

Slide 46 text

Matrix-Vector Recursive Neural Network (MV-RNN) (Socher+ 2012)

- Each word has a semantic vector and a composition matrix
- Compose a phrase vector and a phrase matrix recursively:
  - Phrase vector: p = g(W[Ba; Ab]), from the word vectors a, b and matrices A, B
  - Phrase matrix: P = W_M[A; B]

R Socher, B Huval, C Manning and A Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. Proc. of EMNLP, pp. 1201-1211.

Slide 47

Slide 47 text

Recursive Neural Tensor Network (Socher+ 2013)

- MV-RNN has too many parameters to train, since it assigns a composition matrix to every word
- Transform a word vector into a composition matrix by using a tensor

R Socher, A Perelygin, J Wu, J Chuang, C Manning, A Ng and C Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. Proc. of EMNLP, pp. 1631-1642.

Slide 48

Slide 48 text

Tree-structured LSTM (Tai+ 2015)

https://pdfs.semanticscholar.org/bd19/c394931257c1901a940ba8388366c35a3e33.pdf

K S Tai, R Socher, C D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. Proc. of ACL-IJCNLP, pp. 1556–1566.

Slide 49

Slide 49 text

Stanford Sentiment Treebank (Socher+ 2013)

Movie reviews are parsed into phrase structures. Each node in a parse tree has a sentiment value (--, -, 0, +, ++) assigned by three annotators.

R Socher, A Perelygin, J Wu, J Chuang, C Manning, A Ng and C Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. Proc. of EMNLP, pp. 1631-1642.

Slide 50

Slide 50 text

Comparison on Stanford Sentiment Treebank (Tai+ 2015)

K S Tai, R Socher, C D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. Proc. of ACL-IJCNLP, pp. 1556–1566.

Slide 51

Slide 51 text

Convolutional Neural Networks (CNNs) for Text

Slide 52

Slide 52 text

Convolutional Neural Network (CNN) (Kim 2014)

- Slide convolution filters of width n over the word embeddings of the input sentence ("It is a very good movie indeed"), producing values s_{t,j} at every position t for every filter j
- Max pooling over time: each dimension c_j of the pooled vector is the maximum of the values s_{t,j} over timesteps, c_j = max_{1≤t≤T−n+1} s_{t,j}
- A fully-connected layer and softmax make the prediction (e.g., positive :+) from the pooled vector

Y Kim. 2014. Convolutional neural networks for sentence classification. Proc. of EMNLP, pp. 1746-1751.
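
A hedged sketch of convolution over word embeddings followed by max pooling over time; the filter count, window size, and embedding dimension are illustrative, not Kim's exact configuration:

import torch
import torch.nn as nn

emb_dim, num_filters, window = 300, 100, 3
conv = nn.Conv1d(emb_dim, num_filters, kernel_size=window)

x = torch.randn(1, 7, emb_dim)        # "It is a very good movie indeed": (batch, T, emb_dim)
s = conv(x.transpose(1, 2))           # (batch, num_filters, T - window + 1)
c, _ = s.max(dim=2)                   # max pooling over time: (batch, num_filters)
logits = nn.Linear(num_filters, 2)(c) # fully-connected layer (+ softmax) for the task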

Slide 53

Slide 53 text

Various pooling operations (Kalchbrenner+ 2014)

- Max pooling: c_j = max_{1≤t≤T−n+1} s_{t,j}
- Average pooling: c_j = (1 / (T−n+1)) Σ_{t=1}^{T−n+1} s_{t,j}
- k-max pooling: take the k largest values (instead of the single maximum)
- Dynamic k-max pooling: change the value of k adaptively based on the length T of the input

N Kalchbrenner, E Grefenstette, P Blunsom. 2014. A convolutional neural network for modelling sentences. Proc. of ACL, pp. 655-665.
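
k-max pooling can be sketched with torch.topk; this is an illustrative sketch, not the paper's implementation, and the tensor shapes follow the CNN sketch above:

import torch

s = torch.randn(1, 100, 5)            # (batch, num_filters, T - n + 1)
k = 3

# Average pooling: mean over timesteps
avg = s.mean(dim=2)                   # (batch, num_filters)

# k-max pooling: keep the k largest values per filter, preserving their order in time
topk_vals, topk_idx = s.topk(k, dim=2)
order = topk_idx.argsort(dim=2)       # reorder the selected values by their time index
c = topk_vals.gather(2, order)        # (batch, num_filters, k)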

Slide 54

Slide 54 text

Hierarchical CNN includes Recursive NN

(Figure: the words "The movie was the best of all" combined bottom-up through stacked convolution levels (1)–(6).)

Slide 55

Slide 55 text

Hierarchical CNN includes Recursive NN

(Figure: the same hierarchy over "The movie was the best of all", aligned with the phrase structure of the sentence: PP, NP, VP, NP, S.)

Slide 56

Slide 56 text

Hierarchical CNN (AdaSent) (Zhao+ 2015)

Apply max pooling to the representations at every level (1)–(4) of the hierarchy over "The movie was the best of all", and use these vectors (e.g., their concatenation) as the input to the fully-connected layer for classification.

H Zhao, Z Lu, P Poupart. 2015. Self-Adaptive Hierarchical Sentence Model. Proc. of IJCAI, pp. 4069-4076.

Slide 57

Slide 57 text

Summary

- Various NN architectures that can leverage structures:
  - Recurrent Neural Networks (RNNs)
  - Long Short-Term Memories (LSTMs)
  - Gated Recurrent Units (GRUs)
  - Recursive Neural Networks (Recursive NNs)
  - Convolutional Neural Networks (CNNs)
- Next question: can we generate a sentence from neural networks?