Composing a sentence vector by adding word vectors (a toy example with two-dimensional vectors): loves = (1, 0), Mary = (0, 1), John = (0.25, -0.25), so "John loves Mary" = (1.25, 0.75). This approach surprisingly works well in practice, but cannot distinguish different word orders ("John loves Mary" vs. "Mary loves John").
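A tiny sketch of this additive composition, using the two-dimensional toy embeddings from the slide (the dictionary and function names are made up for illustration):

```python
# Bag-of-words composition: a sentence vector is the sum of its word vectors.
import numpy as np

emb = {
    "John": np.array([0.25, -0.25]),
    "loves": np.array([1.0, 0.0]),
    "Mary": np.array([0.0, 1.0]),
}

def sentence_vector(words):
    # Simply sum the word vectors; word order is discarded.
    return sum(emb[w] for w in words)

print(sentence_vector(["John", "loves", "Mary"]))   # [1.25, 0.75]
print(sentence_vector(["Mary", "loves", "John"]))   # identical: word order is lost
```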
I Sutskever, J Martens, G Hinton. 2011. Generating text with recurrent neural networks. Proc. of ICML, pp. 1017-1024.
(Figure: an RNN reading the words of "John loves Mary ..." one by one, with hidden vectors h_1, ..., h_4 and a softmax output at the end.)
• Word embeddings: represent each word x_t with a vector
• Recurrent computation: compose a hidden vector h_t from the input word x_t and the hidden vector h_{t-1} at the previous timestep, h_t = g(W_xh x_t + W_hh h_{t-1}), with h_0 = 0
• Fully-connected layer for a task: make a prediction from the last hidden vector h_4, which is composed from all words in the sentence, by using a fully-connected layer and softmax
• The parameters W_xh, W_hh, W_hy are shared over the entire sequence; they are trained against the supervision signal using backpropagation
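A minimal numpy sketch of this forward computation. The parameter names (W_xh, W_hh, W_hy, g = tanh) follow the slide; the sizes and random values are assumptions made for illustration:

```python
import numpy as np

d_emb, d_hid, n_class = 4, 6, 2
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(d_hid, d_emb))
W_hh = rng.normal(size=(d_hid, d_hid))
W_hy = rng.normal(size=(n_class, d_hid))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Embeddings for a toy 4-word sentence (x_1, ..., x_4)
xs = [rng.normal(size=d_emb) for _ in range(4)]

h = np.zeros(d_hid)                   # h_0 = 0
for x_t in xs:                        # recurrent computation, parameters shared over timesteps
    h = np.tanh(W_xh @ x_t + W_hh @ h)

y_hat = softmax(W_hy @ h)             # prediction from the last hidden vector h_4
print(y_hat)
```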
Bidirectional RNN:
• Run a forward RNN and a backward RNN over the input sequence
• Concatenate the last hidden vectors of the two directions
• Fully-connected layer for a task: the same as unidirectional RNNs
A Graves, A Mohamed and G Hinton. 2013. Speech Recognition with Deep Recurrent Neural Networks. Proc. of ICASSP, pp. 6645-6649.
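A minimal numpy sketch of the bidirectional encoder described above: one RNN reads the sequence left to right, another right to left, and the two last hidden vectors are concatenated (all sizes and parameter names are assumptions):

```python
import numpy as np

def rnn(xs, W_xh, W_hh):
    h = np.zeros(W_hh.shape[0])
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h)
    return h                                   # last hidden vector

d_emb, d_hid = 4, 6
rng = np.random.default_rng(0)
fwd = (rng.normal(size=(d_hid, d_emb)), rng.normal(size=(d_hid, d_hid)))
bwd = (rng.normal(size=(d_hid, d_emb)), rng.normal(size=(d_hid, d_hid)))

xs = [rng.normal(size=d_emb) for _ in range(5)]
h_fwd = rnn(xs, *fwd)                          # reads x_1 ... x_T
h_bwd = rnn(xs[::-1], *bwd)                    # reads x_T ... x_1
sentence = np.concatenate([h_fwd, h_bwd])      # input to the fully-connected layer
print(sentence.shape)                          # (12,)
```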
RNN over an input sequence x_1, x_2, …, x_T of length T:
• Includes interactions from the past
• The neural network is deep in the time direction
• The parameters W_xh and W_hh are shared over the sequence
• Trained by backpropagation on the unfolded computation graph; this is called backpropagation through time (BPTT), as sketched below
(Figure: the RNN unfolded over timesteps 1, 2, …, T, with inputs x_1, x_2, … and hidden states h_1, h_2, ….)
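A minimal sketch of BPTT. PyTorch is used here only for its automatic differentiation (an assumption for illustration, not necessarily the framework of the course materials); all sizes and parameter names are made up:

```python
import torch
import torch.nn.functional as F

T, d_in, d_hid, n_class = 5, 8, 16, 3           # hypothetical sizes
W_xh = torch.randn(d_hid, d_in, requires_grad=True)
W_hh = torch.randn(d_hid, d_hid, requires_grad=True)
W_hy = torch.randn(n_class, d_hid, requires_grad=True)

x = torch.randn(T, d_in)                        # one input sequence
y = torch.tensor(1)                             # gold label for the sequence

h = torch.zeros(d_hid)                          # h_0 = 0
for t in range(T):                              # unfold the RNN in the time direction
    h = torch.tanh(W_xh @ x[t] + W_hh @ h)

logits = W_hy @ h                               # fully-connected layer on the last hidden vector
loss = F.cross_entropy(logits.unsqueeze(0), y.unsqueeze(0))
loss.backward()                                 # BPTT: gradients flow back through all timesteps
print(W_hh.grad.shape)                          # gradient accumulated over the whole sequence
```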
Data preparation (the training data becomes a list of ([letter IDs], country ID) pairs, e.g., …49, 53, 33, 42], 17], [[22, 46, 43, 42], 17], [[14, 33], 17], …):
• Find the alphabet (X) and the set of country names (Y)
• Build an associative array to map a letter/country into an integer ID
• Convert letters and countries into integer IDs by using the associative arrays (see the sketch below)
https://github.com/chokkan/deeplearning/blob/master/notebook/rnn.ipynb
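A minimal sketch of this preprocessing (not the notebook's actual code), assuming the raw data is a list of (string, country) pairs; the example strings are hypothetical:

```python
# Build associative arrays mapping letters and country names to integer IDs,
# then convert every example into ([letter IDs], country ID).
data = [("tokyo", "japan"), ("paris", "france"), ("lyon", "france")]  # hypothetical raw data

letters = sorted({ch for text, _ in data for ch in text})       # alphabet X
countries = sorted({country for _, country in data})            # set of country names Y

letter_to_id = {ch: i for i, ch in enumerate(letters)}
country_to_id = {c: i for i, c in enumerate(countries)}

encoded = [([letter_to_id[ch] for ch in text], country_to_id[c]) for text, c in data]
print(encoded)
```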
(Ignoring the input and the activation function) h_t = W h_{t-1}
• After t steps, this is equivalent to multiplying by W^t: h_t = W^t h_0
• When W has an eigenvalue decomposition, W = Q diag(λ) Q^{-1}
• We can compute W^t as W^t = (Q diag(λ) Q^{-1})^t = Q diag(λ)^t Q^{-1}
• The eigenvalues are multiplied t times (raised to the power t)
• When |λ| < 1, λ^t → 0 (gradient vanishing)
• When |λ| > 1, λ^t → ∞ (gradient exploding)
• Computing h_t in this way is similar to the power method
• h_t will be close to the eigenvector for the largest eigenvalue of W, regardless of the initial vector h_0
I Goodfellow, Y Bengio, A Courville. 2016. Deep Learning, page 286, MIT Press.
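A small numpy demonstration of this argument: repeated multiplication by W raises its eigenvalues to the power t, so the norm of h_t shrinks or blows up, and its direction converges toward the dominant eigenvector as in the power method (matrix size and scale are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)) * 0.4        # the scale controls the largest eigenvalue
h = rng.normal(size=4)                   # arbitrary h_0

for t in range(1, 51):
    h = W @ h                            # h_t = W h_{t-1} (input and activation ignored)
    if t % 10 == 0:
        print(t, np.linalg.norm(h))      # decays or explodes depending on |lambda_max|

print("largest |eigenvalue|:", np.abs(np.linalg.eigvals(W)).max())
```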
Remedies:
• Gradient exploding: gradient clipping (Pascanu+ 2013). When the norm of the gradients is above the threshold, scale down the gradients (sketched below)
• Gradient vanishing:
  • Activation function: tanh to ReLU
  • Long Short-Term Memory (LSTM)
  • Gated Recurrent Unit (GRU)
  • Residual Networks
R Pascanu, T Mikolov, Y Bengio. 2013. On the difficulty of training recurrent neural networks. Proc. of ICML, pp. 1310-1318.
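A minimal sketch of gradient clipping by norm; the function name and threshold value are arbitrary choices for illustration:

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    # If the gradient norm exceeds the threshold, rescale it to the threshold,
    # keeping the direction unchanged.
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([30.0, -40.0])                # norm = 50 > threshold
print(clip_gradient(g))                    # [3.0, -4.0], norm = 5
```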
Memory cells provide short-cuts among states:
• Memory cells do not suffer from the zero gradients caused by activation functions (tanh and ReLU)
• Memory cells are connected without activation functions: c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t (see the sketch below)
• Information in c_{t-1} can flow to c_t when the forget gate is wide open (f_t = 1)
• The input term at each state (i_t ⊙ c̃_t) has no effect on how the information in c_{t-1} reaches c_t
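A minimal numpy sketch of one LSTM step that makes the memory-cell path explicit: c_{t-1} reaches c_t only through the elementwise product with the forget gate, with no activation function on that route. Parameter names and sizes are assumptions, and biases are omitted:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    f = sigmoid(p["W_f"] @ x + p["U_f"] @ h_prev)        # forget gate
    i = sigmoid(p["W_i"] @ x + p["U_i"] @ h_prev)        # input gate
    o = sigmoid(p["W_o"] @ x + p["U_o"] @ h_prev)        # output gate
    c_tilde = np.tanh(p["W_c"] @ x + p["U_c"] @ h_prev)  # candidate cell
    c = f * c_prev + i * c_tilde                         # memory cell: linear short-cut
    h = o * np.tanh(c)                                   # hidden state
    return h, c

d_in, d_hid = 4, 3
rng = np.random.default_rng(0)
p = {k: rng.normal(size=(d_hid, d_in if k.startswith("W") else d_hid))
     for k in ["W_f", "U_f", "W_i", "U_i", "W_o", "U_o", "W_c", "U_c"]}
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_hid), np.zeros(d_hid), p)
print(h, c)
```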
Gated Recurrent Unit (⊙ denotes elementwise product):
• Hidden state: h_t = z_t ⊙ h_{t-1} + (1 - z_t) ⊙ h̃_t
• New (candidate) hidden state: h̃_t = tanh(W x_t + U (r_t ⊙ h_{t-1}))
• Reset gate: r_t = σ(W_r x_t + U_r h_{t-1})
• Update gate: z_t = σ(W_z x_t + U_z h_{t-1})
• Motivated by the LSTM unit, but much simpler to compute and implement (see the sketch below)
K Cho, B van Merrienboer, C Gulcehre, D Bahdanau, F Bougares, H Schwenk, Y Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. Proc. of EMNLP, pp. 1724-1734.
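A minimal numpy sketch of one GRU step following the equations above (biases omitted; sizes and parameter names are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h_prev)           # reset gate
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h_prev)           # update gate
    h_tilde = np.tanh(p["W"] @ x + p["U"] @ (r * h_prev))   # candidate hidden state
    return z * h_prev + (1.0 - z) * h_tilde                 # new hidden state h_t

d_in, d_hid = 4, 3
rng = np.random.default_rng(0)
p = {k: rng.normal(size=(d_hid, d_in if k.startswith("W") else d_hid))
     for k in ["W_r", "U_r", "W_z", "U_z", "W", "U"]}
h = gru_step(rng.normal(size=d_in), np.zeros(d_hid), p)
print(h)
```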
Experiments on character-level language modeling (predicting subsequent characters):
• LSTMs and GRUs significantly outperform plain RNNs
• RNNs seem to learn different embeddings from those of LSTMs and GRUs
A Karpathy, J Johnson, and L Fei-Fei. 2016. Visualizing and Understanding Recurrent Networks. Proc. of ICLR Workshop 2016.
R Socher, J Pennington, E Huang, A Ng, and C Manning. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. Proc. of EMNLP, pp. 151-161.
(Figure: composing "very" and "good" into "very good", then "very good" and "movie" into "very good movie".)
• Compose a phrase vector p = g(W [c1; c2])
• c1, c2 ∈ R^d: constituent vectors
• p ∈ R^d: phrase vector
• W ∈ R^{d×2d}: parameter matrix
• g: activation function
• Recursively compose vectors along the phrase structure (parse tree) of a sentence (see the sketch below)
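A minimal numpy sketch of recursive composition along a parse tree, applying p = g(W [c1; c2]) bottom-up for ((very good) movie); embeddings, W, and g = tanh are arbitrary choices for illustration:

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)
W = rng.normal(size=(d, 2 * d))
emb = {w: rng.normal(size=d) for w in ["very", "good", "movie"]}

def compose(c1, c2):
    return np.tanh(W @ np.concatenate([c1, c2]))   # p = g(W [c1; c2])

p_vg = compose(emb["very"], emb["good"])           # "very good"
p_vgm = compose(p_vg, emb["movie"])                # "very good movie"
print(p_vgm)
```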
Each word has a semantic vector and a composition matrix
• Compose a phrase vector and a phrase matrix recursively:
  p = g(W [B a; A b]),  P = W_M [A; B]
  (a, b: constituent vectors; A, B: constituent matrices; see the sketch below)
R Socher, B Huval, C Manning and A Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. Proc. of EMNLP, pp. 1201-1211.
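A hedged numpy sketch of this matrix-vector composition: each constituent carries a vector (a, b) and a matrix (A, B), and the parent receives p = g(W [B a; A b]) and P = W_M [A; B]. Sizes and initialization here are assumptions for illustration only:

```python
import numpy as np

d = 3
rng = np.random.default_rng(0)
W = rng.normal(size=(d, 2 * d))       # composes the parent vector
W_M = rng.normal(size=(d, 2 * d))     # composes the parent matrix

def compose(a, A, b, B):
    p = np.tanh(W @ np.concatenate([B @ a, A @ b]))
    P = W_M @ np.vstack([A, B])       # stack the two d x d matrices -> 2d x d
    return p, P

a, b = rng.normal(size=d), rng.normal(size=d)
A, B = np.eye(d), np.eye(d)           # word matrices, initialized as identity
p, P = compose(a, A, b, B)
print(p.shape, P.shape)               # (3,), (3, 3)
```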
• The matrix-vector model has too many parameters to train, because it assigns every word a composition matrix
• Instead, transform a word vector into a composition matrix by using a tensor
R Socher, A Perelygin, J Wu, J Chuang, C Manning, A Ng and C Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. Proc. of EMNLP, pp. 1631-1642.
K S Tai, R Socher, C D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. Proc. of ACL-IJCNLP, pp. 1556-1566.
Stanford Sentiment Treebank: sentences are parsed into phrase structures, and each node in a parse tree has a sentiment value (--, -, 0, +, ++) assigned by three annotators.
R Socher, A Perelygin, J Wu, J Chuang, C Manning, A Ng and C Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. Proc. of EMNLP, pp. 1631-1642.
K S Tai, R Socher, C D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. Proc. of ACL-IJCNLP, pp. 1556-1566.
Y Kim. 2014. Convolutional neural networks for sentence classification. Proc. of EMNLP, pp. 1746-1751.
(Figure: convolution over the sentence "It is a very good movie indeed", labeled positive.)
• Max pooling: each dimension of the pooled vector is the maximum of the values c_{i,j} over timesteps, c_j = max_{1 ≤ i ≤ n-k+1} c_{i,j}
• The pooled vector is fed to a fully-connected layer with softmax (a sketch of the convolution and pooling follows below)
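A minimal numpy sketch of convolution over a sentence followed by max pooling: slide a window of k word vectors, apply a filter bank, then take the maximum of each feature over all window positions. All sizes and parameters are assumptions:

```python
import numpy as np

d, k, n_filters, n_words = 4, 3, 5, 7
rng = np.random.default_rng(0)
X = rng.normal(size=(n_words, d))            # word vectors of a 7-word sentence
W = rng.normal(size=(n_filters, k * d))      # convolution filters over k-grams

# C[i, j]: response of filter j on the window starting at position i
C = np.stack([np.tanh(W @ X[i:i + k].reshape(-1)) for i in range(n_words - k + 1)])
c = C.max(axis=0)                            # max pooling: c_j = max_i C[i, j]
print(c.shape)                               # (n_filters,) -> input to the softmax layer
```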
N Kalchbrenner, E Grefenstette, P Blunsom. 2014. A convolutional neural network for modelling sentences. Proc. of ACL, pp. 655-665.
Pooling variants:
• Max pooling: c_j = max_{1 ≤ i ≤ n-k+1} c_{i,j}
• Average pooling: c_j = (1 / (n-k+1)) Σ_{i=1}^{n-k+1} c_{i,j}
• k-max pooling: take the k largest values (instead of the 1-max value), as sketched below
• Dynamic k-max pooling: change the value of k adaptively based on the length of the input
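A minimal numpy sketch of k-max pooling: for each feature, keep the k largest responses over time while preserving their original order (the function name is made up; 1-max pooling is the special case k = 1):

```python
import numpy as np

def k_max_pooling(C, k):
    # C: (timesteps, n_features); returns (k, n_features)
    idx = np.sort(np.argsort(C, axis=0)[-k:], axis=0)   # positions of the k largest, in order
    return np.take_along_axis(C, idx, axis=0)

C = np.array([[0.1, 0.9],
              [0.7, 0.2],
              [0.5, 0.8],
              [0.3, 0.4]])
print(k_max_pooling(C, 2))    # per column: the two largest values, in sequence order
```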
(Figure: representations of "best of all" at levels (1)-(4), each followed by max pooling.)
Use these vectors (e.g., the concatenation of these vectors) as the input to the fully-connected layer for classification (a sketch follows below).
H Zhao, Z Lu, P Poupart. 2015. Self-Adaptive Hierarchical Sentence Model. Proc. of IJCAI, pp. 4069-4076.
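A hedged numpy sketch of this final classification step only: max-pool the representation at each level of a hierarchy and concatenate the pooled vectors as the input to a fully-connected layer. The hierarchical model itself is not reproduced here; the level shapes are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
# Hypothetical representations at levels (1)-(4): one matrix of vectors per level.
levels = [rng.normal(size=(n, d)) for n in (4, 3, 2, 1)]

pooled = [level.max(axis=0) for level in levels]     # max pooling within each level
features = np.concatenate(pooled)                    # concatenation of the pooled vectors
print(features.shape)                                # (4 * d,) -> input to the fully-connected layer
```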