
# DNN for Structural Data

Recurrent Neural Networks (RNNs), Gradient vanishing and exploding, Long Short-Term Memory (LSTM), Gated Recurrent Units (GRUs), Recursive Neural Network, Tree-structured LSTM, Convolutional Neural Networks (CNNs) August 07, 2020

## Transcript

1. DNN for Structural Data
Naoaki Okazaki
School of Computing,
Tokyo Institute of Technology
[email protected]
PowerPoint template designed by https://ppt.design4u.jp/template/

2. Embeddings for phrases and sentences
- Word embeddings represent words with real-valued vectors
- Is it possible to consider embeddings for phrases and sentences?

(Figure: an example with two-dimensional word embeddings, where the word vectors loves = (1, 0), Mary = (0, 1), and John = (0.25, -0.25) are summed into the sentence vector "John loves Mary" = (1.25, 0.75).)

This approach surprisingly works well in practice, but cannot distinguish different word orders ("John loves Mary" vs "Mary loves John").
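
A minimal numpy sketch of the example above (the embedding values are the ones on the slide; everything else is illustrative):

```python
import numpy as np

# Toy two-dimensional word embeddings from the slide.
emb = {
    "John":  np.array([0.25, -0.25]),
    "loves": np.array([1.0, 0.0]),
    "Mary":  np.array([0.0, 1.0]),
}

def sentence_vector(words):
    # Represent a sentence as the sum of its word embeddings.
    return sum(emb[w] for w in words)

print(sentence_vector(["John", "loves", "Mary"]))  # [1.25 0.75]
print(sentence_vector(["Mary", "loves", "John"]))  # [1.25 0.75] -- same vector: word order is lost
```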

4. Summary
- Various NN architectures that can leverage structures
  - Recurrent Neural Networks (RNNs)
  - Long Short-Term Memories (LSTMs)
  - Gated Recurrent Units (GRUs)
  - Recursive Neural Networks (Recursive NNs)
  - Convolutional Neural Networks (CNNs)

5. Recurrent Neural Networks (RNNs)

6. Recurrent Neural Networks (RNNs) (Sutskever+ 2011)
I Sutskever, J Martens, G Hinton. 2011. Generating text with recurrent neural networks. Proc. of ICML, pp. 1017–1024.
(Figure: an RNN unrolled over the input "John loves Mary much", with word embeddings $x_1, \dots, x_4$, hidden vectors $h_1, \dots, h_4$, and a softmax output $y$.)

- Word embeddings: represent a word with a vector $x_t$
- Recurrent computation: compose a hidden vector $h_t$ from an input word $x_t$ and the hidden vector $h_{t-1}$ at the previous timestep, $h_t = f(W_{xh} x_t + W_{hh} h_{t-1})$, with $h_0 = 0$
- Make a prediction from the hidden vector $h_4$, which is composed from all words in the sentence, by using a fully-connected layer and softmax
- The parameters $W_{xh}$, $W_{hh}$, $W_{hy}$ are shared over the entire sequence; they are trained by the supervision signals $y_1, \dots, y_4$ using backpropagation

7. RNN in math
$$h_t = \mathrm{RNN}(x_t, h_{t-1}) = f(W_{xh} x_t + W_{hh} h_{t-1})$$

where $x_t \in \mathbb{R}^d$, $h_t \in \mathbb{R}^m$, $W_{xh} \in \mathbb{R}^{m \times d}$, and $W_{hh} \in \mathbb{R}^{m \times m}$. Typical activation functions $f$ are tanh and ReLU.

(Figure: two consecutive RNN cells, each computing $\tanh(W_{xh} x_t + W_{hh} h_{t-1})$ with the shared parameters.)

8. Multi-layer RNNs
(Figure: a two-layer RNN. The first layer computes $h_t^{(1)} = \tanh(W_{xh}^{(1)} h_t^{(0)} + W_{hh}^{(1)} h_{t-1}^{(1)})$ from the inputs $h_t^{(0)} = x_t$; the second layer computes $h_t^{(2)} = \tanh(W_{xh}^{(2)} h_t^{(1)} + W_{hh}^{(2)} h_{t-1}^{(2)})$ from the hidden states of the first layer.)

9. Forward and backward RNNs
(Figure: forward and backward RNN cells.)

- Forward RNNs read the input from left to right: $\overrightarrow{h}_t = \mathrm{RNN}(x_t, \overrightarrow{h}_{t-1}) = f(W_{xh} x_t + W_{hh} \overrightarrow{h}_{t-1})$
- Backward RNNs read the input from right to left: $\overleftarrow{h}_{t-1} = \mathrm{RNN}(x_{t-1}, \overleftarrow{h}_t) = f(W_{xh} x_{t-1} + W_{hh} \overleftarrow{h}_t)$
10. Bidirectional RNNs (Graves+ 2013)
(Figure: a forward RNN and a backward RNN run over "John loves Mary much"; the last hidden vectors of both directions are concatenated and fed to a softmax layer. Everything else is the same as in unidirectional RNNs.)
A Graves, A Mohamed and G Hinton. 2013. Speech Recognition with Deep Recurrent Neural Networks. Proc. of ICASSP, pp. 6645-6649.
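
A minimal PyTorch sketch of this idea (the sizes and the final classification layer are illustrative, not from the slides):

```python
import torch
import torch.nn as nn

# Bidirectional RNN over a batch of word embeddings.
# Illustrative shapes: sequence length 4 ("John loves Mary much"),
# batch size 1, embedding size 8, hidden size 16.
rnn = nn.RNN(input_size=8, hidden_size=16, bidirectional=True, batch_first=True)
fc = nn.Linear(2 * 16, 5)                 # fully-connected layer before softmax (5 classes, illustrative)

x = torch.randn(1, 4, 8)
output, h_n = rnn(x)                      # h_n: (2, 1, 16) -- last state of each direction

# Concatenate the last hidden vectors of both directions, as in the slide.
sentence_vec = torch.cat([h_n[0], h_n[1]], dim=-1)   # (1, 32)
logits = fc(sentence_vec)                             # apply softmax / cross entropy outside
```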

11. Unfolded Recurrent Neural Network
- Process a sequence $x_1, x_2, \dots, x_T$ of length $T$
- Include interactions from the past
- The neural network is deep in the time direction
- Share the parameters $W_{xh}$ and $W_{hh}$ over the sequence
- Trained by backpropagation on the unfolded graph
  - This is called backpropagation through time (BPTT)

(Figure: the recurrent cell is unfolded into a chain of cells, one per timestep, mapping $x_1, x_2, \dots, x_T$ to $h_1, h_2, \dots, h_T$.)

12. Example: RNN for nationality prediction
(Figure: a character-level RNN reads the letters "G", "o", "t", "o" one at a time and predicts the nationality of the name with a fully-connected layer and softmax on the last hidden vector $h_4$.)

- Input: each letter $x_t \in \mathbb{R}^{55}$ is a one-hot vector (55 = |[A-Za-z .,;']|)
- Hidden state: $h_t \in \mathbb{R}^{128}$, with $h_0 = 0$
- Output: $y \in \mathbb{R}^{18}$ (one score per nationality)

13. Preprocess the data
[
  ["Nguyen", "Vietnamese"],
  ["Tron", "Vietnamese"],
  ["Le", "Vietnamese"],
  ...
]
https://github.com/chokkan/deeplearning/blob/master/notebook/rnn.ipynb

14. Convert the data into numerical data
[
  [[16, 35, 49, 53, 33, 42], 17],
  [[22, 46, 43, 42], 17],
  [[14, 33], 17],
  ...
]

- Find the alphabet (X) and the set of country names (Y)
- Build an associative array to map a letter/country into an integer ID
- Convert letters and countries into integer IDs by using the associative arrays (see the sketch below)
https://github.com/chokkan/deeplearning/blob/master/notebook/rnn.ipynb
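
A rough sketch of this preprocessing under the toy data above (variable names are mine; the actual code is in the linked notebook):

```python
# Toy input in the format of the previous slide.
data = [["Nguyen", "Vietnamese"], ["Tron", "Vietnamese"], ["Le", "Vietnamese"]]

# Find the alphabet (X) and the set of country names (Y),
# and build associative arrays mapping them to integer IDs.
letters = sorted({ch for name, _ in data for ch in name})
countries = sorted({country for _, country in data})
letter_to_id = {ch: i for i, ch in enumerate(letters)}
country_to_id = {c: i for i, c in enumerate(countries)}

# Convert letters and countries into integer IDs.
encoded = [[[letter_to_id[ch] for ch in name], country_to_id[country]]
           for name, country in data]
print(encoded[0])   # [[1, 4, 8, 9, 3, 5], 0] -- IDs differ from the slide, which uses the full dataset
```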

15. Bare implementation of RNN states
$$h_t = f(W_{xh} x_t + W_{hh} h_{t-1}) = f(W_h [x_t; h_{t-1}])$$

(The two weight matrices can be merged into a single matrix $W_h$ applied to the concatenation of $x_t$ and $h_{t-1}$.)
https://github.com/chokkan/deeplearning/blob/master/notebook/rnn.ipynb
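
A minimal sketch of this bare recurrence in PyTorch (the class and sizes are mine, chosen to match the nationality example; the notebook linked above has the actual code):

```python
import torch
import torch.nn as nn

class BareRNN(nn.Module):
    """h_t = tanh(W_h [x_t; h_{t-1}]): a single linear layer on the concatenation."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.w = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, xs):
        # xs: (sequence_length, input_size); h_0 = 0
        h = xs.new_zeros(self.hidden_size)
        for x in xs:
            h = torch.tanh(self.w(torch.cat([x, h])))
        return h                                  # last hidden vector, used for the prediction

rnn = BareRNN(55, 128)                            # one-hot letters -> 128-dimensional states
h_last = rnn(torch.randn(6, 55))                  # a name of six letters
```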

16. Sequential RNN module
https://github.com/chokkan/deeplearning/blob/master/notebook/rnn.ipynb
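
The notebook linked above contains the actual module; a rough sketch of what it looks like with torch.nn.RNN (names and details are mine):

```python
import torch
import torch.nn as nn

class NameClassifier(nn.Module):
    """torch.nn.RNN consumes the whole letter sequence at once."""
    def __init__(self, num_letters=55, hidden_size=128, num_classes=18):
        super().__init__()
        self.rnn = nn.RNN(num_letters, hidden_size)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (sequence_length, batch_size, num_letters) one-hot vectors
        output, h_n = self.rnn(x)                 # h_n: (1, batch_size, hidden_size)
        return self.fc(h_n[-1])                   # logits; softmax / cross entropy applied outside

model = NameClassifier()
logits = model(torch.randn(6, 1, 55))             # one name of six letters, batch size 1
```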

17. Mini-batch RNN
https://github.com/chokkan/deeplearning/blob/master/notebook/rnn.ipynb
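
A sketch of one common way to batch variable-length names, using padding and packed sequences (details may differ from the notebook):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

rnn = nn.RNN(input_size=55, hidden_size=128, batch_first=True)

# Three names of different lengths (random stand-ins for one-hot letter vectors).
names = [torch.randn(6, 55), torch.randn(4, 55), torch.randn(2, 55)]
lengths = torch.tensor([len(n) for n in names])

padded = pad_sequence(names, batch_first=True)    # (3, 6, 55), zero-padded to the longest name
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=True)
output, h_n = rnn(packed)                         # h_n: (1, 3, 128), last state of each name
```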

18. Long-term dependency
- Consider a simplified RNN (without an input and activation function): $h_t = W h_{t-1}$
- After $t$ steps, this is equivalent to multiplying $W^t$: $h_t = W^t h_0$
- When $W$ has an eigenvalue decomposition $W = Q \, \mathrm{diag}(\lambda) \, Q^{-1}$, we can compute $W^t$ as $W^t = (Q \, \mathrm{diag}(\lambda) \, Q^{-1})^t = Q \, \mathrm{diag}(\lambda)^t \, Q^{-1}$
- The eigenvalues are multiplied $t$ times:
  - When $|\lambda_i| < 1$, the corresponding component decays towards zero
  - When $|\lambda_i| > 1$, the corresponding component grows without bound
- Computing $h_t$ in this way is similar to the power method: $h_t$ will be close to the eigenvector for the largest eigenvalue of $W$, regardless of the vector $h_0$
I Goodfellow, Y Bengio, A Courville. 2016. Deep Learning, page 286, MIT Press.
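
A small numpy illustration of the eigenvalue argument (the matrix is the simplest possible case, a scaled identity, so the eigenvalue is explicit):

```python
import numpy as np

def norm_after(eigenvalue, t, dim=4):
    # Repeatedly apply W = eigenvalue * I to h_0 and measure the norm of h_t.
    W = eigenvalue * np.eye(dim)
    h = np.ones(dim)
    for _ in range(t):
        h = W @ h
    return np.linalg.norm(h)

print(norm_after(0.9, 50))   # ~0.01 -- vanishes when |eigenvalue| < 1
print(norm_after(1.1, 50))   # ~235  -- explodes when |eigenvalue| > 1
```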

- Gradients vanish or explode over time
- More detailed explanations:
  - Why are deep neural networks hard to train? http://neuralnetworksanddeeplearning.com/chap5.html
  - Why is it difficult to train neural networks? (Japanese translation of the above)
  - Recurrent Neural Networks, LSTMs and Vanishing & Exploding Gradients - Fun and Easy Machine Learning

(Figure: an RNN unfolded over timesteps $1, 2, \dots, T-1$; gradients must flow back through every cell in this chain.)

Remedies for vanishing/exploding gradients:

- Activation function: tanh to ReLU
- Long Short-Term Memory (LSTM)
- Gated Recurrent Unit (GRU)
- Residual Networks
- Gradient clipping (Pascanu+ 2013): when the norm of the gradient is above the threshold, scale the gradient down so that its norm equals the threshold (see the sketch below)
R Pascanu, T Mikolov, Y Bengio. 2013. On the difficulty of training recurrent neural networks. Proc. of ICML, pp. 1310-1318.
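
A minimal sketch of gradient clipping in PyTorch (the model, data, and threshold value are illustrative; torch.nn.utils.clip_grad_norm_ implements the rescaling):

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16)          # illustrative model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
threshold = 5.0

x = torch.randn(4, 1, 8)                              # toy input: length 4, batch 1
output, h_n = model(x)
loss = output.sum()                                   # toy objective
loss.backward()
# If the total gradient norm exceeds the threshold, rescale it to the threshold.
torch.nn.utils.clip_grad_norm_(model.parameters(), threshold)
optimizer.step()
```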

21. Long Short-Term Memory (LSTM)

22. Long Short-Term Memory (Hochreiter+ 1997)
- Consists of (∗ denotes the elementwise product):
  - Hidden state: $h_t = o_t * \tanh(c_t)$
  - Memory cell: $c_t = f_t * c_{t-1} + i_t * \tanh(W_{xg} x_t + W_{hg} h_{t-1})$
  - Input gate: $i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1})$
  - Output gate: $o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1})$
  - Forget gate: $f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1})$
- The architecture looks complicated, but LSTMs are also a neural network
- LSTMs can also be trained by the standard procedure of backpropagation
S Hochreiter, J Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

23. LSTM in math and diagram
(Figure: the LSTM cell, showing how $x_t$ and $h_{t-1}$ feed the forget, input, and output gates, and how the memory cell $c_{t-1}$ is updated into $c_t$ and the hidden state $h_t$.)

- Forget gate: $f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1})$
- Input gate: $i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1})$
- Output gate: $o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1})$
- Cell input: $g_t = \tanh(W_{xg} x_t + W_{hg} h_{t-1})$
- Memory cell: $c_t = f_t * c_{t-1} + i_t * g_t$
- Hidden state: $h_t = o_t * \tanh(c_t)$

24. Sequential LSTM in pytorch
- Replace torch.nn.RNN with torch.nn.LSTM
- Change the shape of the initial state: an LSTM state is a pair of the hidden state and the memory cell (see the sketch below)

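A minimal sketch of the change (sizes match the nationality example; the details are illustrative):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=55, hidden_size=128)
x = torch.randn(6, 1, 55)                   # (sequence_length, batch_size, input_size)

h0 = torch.zeros(1, 1, 128)
c0 = torch.zeros(1, 1, 128)
# The initial (and final) state is a pair (h, c), not a single tensor as with nn.RNN.
output, (h_n, c_n) = lstm(x, (h0, c0))
```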
25. Implementation of LSTM cell in pytorch
(Figure: the LSTM cell diagram again, with the memory cell, hidden state, and forget/input/output gates labeled as in the code below.)

import torch.nn.functional as F

def LSTMCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
    h_prev, c_prev = hidden
    # One linear map from x and one from h_{t-1} compute all four gate pre-activations at once.
    gates = F.linear(x, w_x, b_x) + F.linear(h_prev, w_h, b_h)
    ingate, forgetgate, cellgate, outgate = gates.chunk(4, 1)
    ingate = F.sigmoid(ingate)
    forgetgate = F.sigmoid(forgetgate)
    cellgate = F.tanh(cellgate)
    outgate = F.sigmoid(outgate)
    ct = (forgetgate * c_prev) + (ingate * cellgate)
    ht = outgate * F.tanh(ct)
    return ht, ct

26. Implementation of LSTM cell in pytorch
(Figure: the input-side weight matrix w_x stacks the four gate matrices $W_{xf}$, $W_{xi}$, $W_{xg}$, $W_{xo}$, so a single F.linear call computes all gate pre-activations from x.)

def LSTMCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
    h_prev, c_prev = hidden
    gates = F.linear(x, w_x, b_x)

27. Implementation of LSTM cell in pytorch
(Figure: the hidden-side weight matrix w_h likewise stacks the four matrices $W_{hf}$, $W_{hi}$, $W_{hg}$, $W_{ho}$; the contributions from x and from h_{t-1} are summed into gates.)

def LSTMCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
    h_prev, c_prev = hidden
    gates = F.linear(x, w_x, b_x) + F.linear(h_prev, w_h, b_h)

28. Implementation of LSTM cell in pytorch
(Figure: gates is split into ingate, forgetgate, cellgate, and outgate; the three gates go through a sigmoid and the cell input through a tanh.)

gates = F.linear(x, w_x, b_x) + F.linear(h_prev, w_h, b_h)
ingate, forgetgate, cellgate, outgate = gates.chunk(4, 1)
ingate = F.sigmoid(ingate)
forgetgate = F.sigmoid(forgetgate)
cellgate = F.tanh(cellgate)
outgate = F.sigmoid(outgate)

29. Implementation of LSTM cell in pytorch
(Figure: the previous memory cell c_prev is scaled by the forget gate, the gated cell input is added, and the hidden state is the output gate times tanh of the new cell.)

outgate = F.sigmoid(outgate)
ct = (forgetgate * c_prev) + (ingate * cellgate)
ht = outgate * F.tanh(ct)
return ht, ct

- Memory cells provide short cuts among states
- Memory cells do not suffer from zero gradients caused by activation functions (tanh and ReLU)
  - Memory cells are connected without activation functions
- Information in $c_{t-1}$ can flow when a forget gate is wide open ($f_t = 1$)
- The input from each state ($i_t * g_t$) is only added to the cell, so it does not block the flow along the memory cell

(Figure: the chain of memory cells $c_{t-1} \rightarrow c_t \rightarrow c_{t+1}$, connected only through elementwise products with the forget gates and additions of the gated inputs $i_t * g_t$, $i_{t+1} * g_{t+1}$.)

31. Gated Recurrent Units (GRUs)

32. Gated Recurrent Unit (GRU) (Cho+ 2014)
- Consists of (∗ denotes the elementwise product):
  - Hidden state: $h_t = z_t * h_{t-1} + (1 - z_t) * g_t$
  - New hidden state: $g_t = \tanh(W_{xh} x_t + W_{hh} (r_t * h_{t-1}))$
  - Reset gate: $r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1})$
  - Update gate: $z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1})$
- Motivated by the LSTM unit
  - But much simpler to compute and implement

K Cho, B van Merrienboer, C Gulcehre, D Bahdanau, F Bougares, H Schwenk, Y Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. Proc. of EMNLP, pp. 1724–1734.

33. GRU in math and diagram
(Figure: the GRU cell, showing how $x_t$ and $h_{t-1}$ feed the reset and update gates, and how the hidden state interpolates between $h_{t-1}$ and the new hidden state.)

- Reset gate: $r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1})$
- Update gate: $z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1})$
- New hidden state: $g_t = \tanh(W_{xh} x_t + W_{hh} (r_t * h_{t-1}))$
- Hidden state: $h_t = z_t * h_{t-1} + (1 - z_t) * g_t$

34. Sequential GRU in PyTorch
- Replace torch.nn.RNN with torch.nn.GRU
- The shape of the initial state is unchanged (a single tensor; see the sketch below)

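A minimal sketch of the change (sizes are again those of the nationality example):

```python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=55, hidden_size=128)
x = torch.randn(6, 1, 55)

# Unlike nn.LSTM, the initial state is a single tensor, exactly as with nn.RNN.
h0 = torch.zeros(1, 1, 128)
output, h_n = gru(x, h0)
```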
35. Implementation of GRU cell in PyTorch
(Figure: the GRU cell diagram, with the reset and update gates labeled as they appear in the code below.)

- Reset gate: $r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1})$
- Update gate: $z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1})$

The implementation uses a rearranged form that is more computationally efficient:

$$g_t = \tanh(W_{xh} x_t + W_{hh} (r_t * h_{t-1})) \;\rightarrow\; \tanh(W_{xh} x_t + r_t * (W_{hh} h_{t-1}))$$

$$h_t = z_t * h_{t-1} + (1 - z_t) * g_t = g_t + z_t * (h_{t-1} - g_t)$$

36. Implementation of GRU cell in PyTorch
(Figure: the full GRU cell diagram next to the cell implementation.)

import torch.nn.functional as F

def GRUCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
    # Gate pre-activations from the input and from the previous hidden state.
    gx = F.linear(x, w_x, b_x)
    gh = F.linear(hidden, w_h, b_h)
    x_r, x_i, x_n = gx.chunk(3, 1)
    h_r, h_i, h_n = gh.chunk(3, 1)
    resetgate = F.sigmoid(x_r + h_r)
    inputgate = F.sigmoid(x_i + h_i)
    newgate = F.tanh(x_n + resetgate * h_n)
    hy = newgate + inputgate * (hidden - newgate)
    return hy

37. Implementation of GRU cell in PyTorch
(Figure: the first step highlighted in the diagram: the input x is multiplied by the stacked input-side weight matrix w_x.)

def GRUCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
    gx = F.linear(x, w_x, b_x)

38. Implementation of GRU cell in PyTorch
(Figure: the previous hidden state h_{t-1} is likewise multiplied by the stacked hidden-side weight matrix w_h.)

def GRUCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
    gx = F.linear(x, w_x, b_x)
    gh = F.linear(hidden, w_h, b_h)

39. Implementation of GRU cell in PyTorch
(Figure: gx is split into x_r, x_i, x_n and gh into h_r, h_i, h_n, the input-side and hidden-side pre-activations of the reset gate, the update gate, and the new hidden state.)

def GRUCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
    gx = F.linear(x, w_x, b_x)
    gh = F.linear(hidden, w_h, b_h)
    x_r, x_i, x_n = gx.chunk(3, 1)
    h_r, h_i, h_n = gh.chunk(3, 1)

40. Implementation of GRU cell in PyTorch
(Figure: the reset and update gates highlighted in the diagram.)

resetgate = F.sigmoid(x_r + h_r)
inputgate = F.sigmoid(x_i + h_i)

$$r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1}), \qquad z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1})$$

41. Implementation of GRU cell in PyTorch
(Figure: the new hidden state (newgate) and the final interpolation (hy) highlighted in the diagram.)

newgate = F.tanh(x_n + resetgate * h_n)
hy = newgate + inputgate * (hidden - newgate)

42. Comparison of RNNs (Karpathy+ 2016)
(Figure from Karpathy+ 2016.)

- Task: character-level language modeling (predicting subsequent characters)
- LSTMs and GRUs significantly outperform RNNs
- RNNs seem to learn different embeddings from those of LSTMs and GRUs
A Karpathy, J Johnson, and L Fei-Fei. 2016. Visualizing and Understanding Recurrent Networks. Proc. of ICLR Workshop 2016.

43. Observing LSTM cells (Karpathy+ 2016)
(Figure from Karpathy+ 2016.)
A Karpathy, J Johnson, and L Fei-Fei. 2016. Visualizing and Understanding Recurrent Networks. Proc. of ICLR Workshop 2016.

44. RNNs over tree

45. Recursive Neural Network (Socher+ 2011)
R Socher, J Pennington, E Huang, A Ng, and C Manning. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. Proc. of EMNLP, pp. 151-161.

(Figure: the vectors of "very" and "good" are composed into a vector for "very good", which is then composed with "movie" into a vector for "very good movie".)

- Compose a phrase vector: $p = f(W [c_1; c_2])$
  - $c_1, c_2 \in \mathbb{R}^d$: constituent vectors
  - $p \in \mathbb{R}^d$: phrase vector
  - $W \in \mathbb{R}^{d \times 2d}$: parameter
  - $f$: activation function
- Recursively compose vectors along the phrase structure (parse tree) of a sentence
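
A minimal sketch of this recursive composition in PyTorch (the tree format, nested tuples of word strings, and all sizes are my own illustration):

```python
import torch
import torch.nn as nn

d = 2                                              # dimension of word/phrase vectors
W = nn.Linear(2 * d, d, bias=False)                # composition parameter W
emb = {"very": torch.randn(d), "good": torch.randn(d), "movie": torch.randn(d)}

def compose(node):
    if isinstance(node, str):                      # leaf: return the word vector
        return emb[node]
    left, right = node
    c = torch.cat([compose(left), compose(right)])
    return torch.tanh(W(c))                        # phrase vector p = f(W [c1; c2])

# Phrase structure ((very good) movie)
p = compose((("very", "good"), "movie"))
```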

46. Matrix-Vector Recursive Neural Network (MV-RNN) (Socher+ 2012)
- Each word has a semantic vector and a composition matrix
- Compose a phrase vector and matrix recursively, where $(a, A)$ and $(b, B)$ are the vector and matrix of the two constituents:

$$p = f\left(W \begin{bmatrix} B a \\ A b \end{bmatrix}\right), \qquad P = W_M \begin{bmatrix} A \\ B \end{bmatrix}$$
R Socher, B Huval, C Manning and A Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. Proc. of EMNLP, pp. 1201-1211.

47. Recursive Neural Tensor Network (Socher+ 2013)
- MV-RNN has too many parameters to train, since it assigns every word a composition matrix
- Instead, transform a word vector into a composition matrix by using a tensor

R Socher, A Perelygin, J Wu, J Chuang, C Manning, A Ng and C Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. Proc. of EMNLP, pp. 1631-1642.

48. Tree-structured LSTM (Tai+ 2015)
https://pdfs.semanticscholar.org/bd19/c394931257c1901a940ba8388366c35a3e33.pdf

K S Tai, R Socher, C D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. Proc. of ACL-IJCNLP, pp. 1556–1566.

49. Stanford Sentiment Treebank (Socher+ 2013)
Movie reviews are parsed into phrase structures. Each node in a parse tree has a sentiment value (--, -, 0, +, ++) assigned by three annotators.

R Socher, A Perelygin, J Wu, J Chuang, C Manning, A Ng and C Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. Proc. of EMNLP, pp. 1631-1642.

50. Comparison on Stanford Sentiment Treebank (Tai+ 2015)
K S Tai, R Socher, C D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. Proc. of ACL-IJCNLP, pp. 1556–1566.

51. Convolutional Neural Networks (CNNs) for Text

52. Convolutional Neural Network (CNN) (Kim 2014)
Y Kim. 2014. Convolutional neural networks for sentence classification. Proc. of EMNLP, pp. 1746-1751.

(Figure: a CNN over the sentence "It is a very good movie indeed"; word n-grams are convolved into feature values $s_{t,i}$, max pooling keeps one value per dimension, and a fully-connected layer with softmax makes the prediction.)

- Max pooling: each dimension $c_i$ is the maximum of the values $s_{t,i}$ over timesteps, $c_i = \max_{1 \le t \le T-d+1} s_{t,i}$
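
A rough sketch of a CNN for sentence classification in this spirit (all sizes, the single filter width, and the module name are illustrative, not from Kim's paper):

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, emb_dim=8, num_filters=16, window=3, num_classes=2):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, num_filters, kernel_size=window)   # convolve word n-grams
        self.fc = nn.Linear(num_filters, num_classes)

    def forward(self, x):
        # x: (batch, sequence_length, emb_dim) word embeddings
        s = torch.relu(self.conv(x.transpose(1, 2)))   # (batch, num_filters, T - window + 1)
        c, _ = s.max(dim=2)                            # max pooling over timesteps
        return self.fc(c)                              # logits before softmax

model = TextCNN()
logits = model(torch.randn(1, 7, 8))   # "It is a very good movie indeed": 7 words
```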

53. Various pooling operations (Kalchbrenner+ 2014)
N Kalchbrenner, E Grefenstette, P Blunsom. 2014. A convolutional neural network for modelling sentences. Proc. of ACL, pp. 655-665.

- Max pooling: $c_i = \max_{1 \le t \le T-d+1} s_{t,i}$
- Average pooling: $c_i = \frac{1}{T-d+1} \sum_{t=1}^{T-d+1} s_{t,i}$
- $k$-max pooling (see the sketch below)
  - Taking the $k$-max values (instead of 1-max)
- Dynamic $k$-max pooling
  - Change the value of $k$ adaptively based on the length ($T$) of an input
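
A small sketch of $k$-max pooling (the function and its argument layout are my own; the selected values are kept in their original temporal order):

```python
import torch

def k_max_pooling(s, k):
    # s: (num_features, T) convolution outputs over timesteps
    _, idx = s.topk(k, dim=1)        # positions of the k largest values per feature
    idx, _ = idx.sort(dim=1)         # keep the original temporal order
    return s.gather(1, idx)          # (num_features, k)

s = torch.tensor([[0.1, 0.9, 0.3, 0.7, 0.2]])
print(k_max_pooling(s, 2))           # tensor([[0.9000, 0.7000]])
```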

54. Hierarchical CNN includes Recursive NN
(Figure: convolution layers $h^{(1)}, \dots, h^{(6)}$ stacked hierarchically over the sentence "The movie was the best of all".)

55. Hierarchical CNN includes Recursive NN
(Figure: the same hierarchy with the phrase-structure labels PP, NP, VP, NP, S overlaid; nodes of the parse tree correspond to vectors in the hierarchical CNN.)

56. Hierarchical CNN (AdaSent) (Zhao+ 2015)
(Figure: convolution layers $h^{(1)}, \dots, h^{(4)}$ over "The movie was the best of all", each followed by max pooling.)

Use these vectors (e.g., their concatenation) as the input to the fully-connected layer for classification.
H Zhao, Z Lu, P Poupart. 2015. Self-Adaptive Hierarchical Sentence Model. Proc. of IJCAI, pp. 4069-4076.

57. Summary
- Various NN architectures that can leverage structures
  - Recurrent Neural Networks (RNNs)
  - Long Short-Term Memories (LSTMs)
  - Gated Recurrent Units (GRUs)
  - Recursive Neural Networks (Recursive NNs)
  - Convolutional Neural Networks (CNNs)
- Next question: can we generate a sentence from neural networks?