
DNN for Structural Data

Recurrent Neural Networks (RNNs), Gradient vanishing and exploding, Long Short-Term Memory (LSTM), Gated Recurrent Units (GRUs), Recursive Neural Network, Tree-structured LSTM, Convolutional Neural Networks (CNNs)

Naoaki Okazaki

August 07, 2020

Transcript

  1. DNN for Structural Data
    Naoaki Okazaki
    School of Computing,
    Tokyo Institute of Technology
    [email protected]
    PowerPoint template designed by https://ppt.design4u.jp/template/


  2. Embeddings for phrases and sentences
     Word embeddings represent words with real-valued vectors
     Is it possible to consider embeddings for phrases and sentences?
    (Figure: the word vectors of “John”, “loves”, and “Mary” are composed into a single vector for “John loves Mary”.)

  3. The baseline: additive composition
    2
    (Example with word embeddings
    of two-dimensional vectors)
    This approach surprisingly works well in practice,
    but cannot distinguish different word orders
    (“John loves Mary” vs “Mary loves John”)
    loves
    (1,0)
    Mary
    (0,1)
    John
    (0.25,-0.25)
    John loves Mary
    (1.25, 0.75)

    View Slide
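    A minimal sketch of additive composition, using the illustrative two-dimensional vectors from the slide rather than trained embeddings:

    import torch

    embeddings = {
        "John":  torch.tensor([0.25, -0.25]),
        "loves": torch.tensor([1.0, 0.0]),
        "Mary":  torch.tensor([0.0, 1.0]),
    }

    def additive_composition(words):
        # Sum the word vectors; word order is lost by design.
        return torch.stack([embeddings[w] for w in words]).sum(dim=0)

    print(additive_composition(["John", "loves", "Mary"]))  # tensor([1.2500, 0.7500])
    print(additive_composition(["Mary", "loves", "John"]))  # the same vector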

  4. Summary
     Various NN architectures that can leverage structures
     Recurrent Neural Networks (RNNs)
     Long Short-Term Memories (LSTMs)
     Gated Recurrent Units (GRUs)
     Recursive Neural Networks (Recursive NNs)
     Convolutional Neural Networks (CNNs)

  5. Recurrent Neural Networks (RNNs)

  6. Recurrent Neural Networks (RNNs) (Sutskever+ 2011)
    I Sutskever, J Martens, G Hinton. 2011. Generating text with recurrent neural networks. Proc. of ICML, pp. 1017–1024.
    (Figure: an RNN reading “John loves Mary much”, with word embeddings x_1, …, x_4, hidden vectors h_1, …, h_4, and a softmax output computed from h_4.)
    Word embeddings: represent each word with a vector x_t.
    Recurrent computation: compose a hidden vector h_t from the input word x_t and the hidden vector h_{t-1} at the previous timestep,
    h_t = g(W^{xh} x_t + W^{hh} h_{t-1}),   h_0 = 0
    Fully-connected layer for the task: make a prediction from the hidden vector h_4, which is composed from all words in the sentence, by using a fully-connected layer and softmax.
    The parameters W^{xh}, W^{hh}, W^{hy} are shared over the entire sequence.
    They are trained from the supervision signal, using backpropagation.

  7. RNN in math
    (Diagram: two consecutive RNN cells; each applies tanh to W^{xh} x_t + W^{hh} h_{t-1} to produce h_t, which is passed to the next timestep.)
    h_t = RNN(x_t, h_{t-1}) = g(W^{xh} x_t + W^{hh} h_{t-1})
    x_t ∈ ℝ^d, h_t ∈ ℝ^{d_h}, W^{xh} ∈ ℝ^{d_h × d}, W^{hh} ∈ ℝ^{d_h × d_h}
    Typical activation functions g are tanh and ReLU
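    A minimal sketch of this recurrence; the dimensions and random parameters below are illustrative, not from the lecture:

    import torch

    d, d_h = 4, 3
    W_xh = torch.randn(d_h, d)
    W_hh = torch.randn(d_h, d_h)

    def rnn_step(x_t, h_prev):
        # h_t = tanh(W^{xh} x_t + W^{hh} h_{t-1})
        return torch.tanh(W_xh @ x_t + W_hh @ h_prev)

    h = torch.zeros(d_h)                 # h_0 = 0
    for x_t in torch.randn(5, d):        # a toy sequence of length 5
        h = rnn_step(x_t, h)
    print(h.shape)                       # torch.Size([3])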

  8. Multi-layer RNNs
    (Diagram: two stacked RNN layers; the first layer reads the input embeddings and the second layer reads the hidden states of the first.)
    h_t^{(1)} = tanh(W^{xh(1)} h_t^{(0)} + W^{hh(1)} h_{t-1}^{(1)}),   h_t^{(0)} = x_t
    h_t^{(2)} = tanh(W^{xh(2)} h_t^{(1)} + W^{hh(2)} h_{t-1}^{(2)})

  9. Forward and backward RNNs
    (Diagram: a forward RNN cell and a backward RNN cell.)
    Forward RNN:  h_t = RNN(x_t, h_{t-1}) = g(W^{xh} x_t + W^{hh} h_{t-1})
    Backward RNN (running right to left): h_{t-1} = RNN(x_{t-1}, h_t) = g(W^{xh} x_{t-1} + W^{hh} h_t)

  10. Bidirectional RNNs (Graves+ 2013)
    (Figure: a forward RNN and a backward RNN over “John loves Mary much”; their last hidden vectors are concatenated and fed to a softmax layer.)
    Concatenate the last hidden vectors of both directions
    Fully-connected layer for the task: the same as in unidirectional RNNs
    A Graves, A Mohamed and G Hinton. 2013. Speech Recognition with Deep Recurrent Neural Networks. Proc. of ICASSP, pp. 6645-6649.
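    Both stacking and bidirectionality are available directly in torch.nn.RNN; a minimal sketch with illustrative sizes, not code from the lecture:

    import torch
    import torch.nn as nn

    rnn = nn.RNN(input_size=4, hidden_size=3, num_layers=2, bidirectional=True, batch_first=True)
    x = torch.randn(1, 5, 4)       # (batch, sequence length, input size)
    output, h_n = rnn(x)
    print(output.shape)            # (1, 5, 6): forward and backward hidden states, concatenated
    print(h_n.shape)               # (4, 1, 3): num_layers * num_directions final hidden vectors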

  11. Unfolded Recurrent Neural Network
     Process a sequence x_1, x_2, …, x_T of length T
     Include interactions from the past
     The neural network is deep in the time direction
     Share the parameters W^{xh} and W^{hh} over the sequence
     Trained by backpropagation on the unfolded graph
     This is called backpropagation through time (BPTT)
    (Diagram: the recurrent cell applied to x_1, x_2, …, x_T, unfolded into a chain of cells producing h_1, h_2, …, h_T.)

  12. Example: RNN for nationality prediction
    (Figure: a character-level RNN reading “G”, “o”, “t”, “o”; the final hidden vector h_4 is fed to a softmax layer that predicts the nationality.)
    Input: x_t ∈ ℝ^55, a one-hot vector over the characters (55 = |[A-Za-z .,;']|)
    Hidden state: h_t ∈ ℝ^128, with h_0 = 0
    Output: y ∈ ℝ^18, a distribution over the nationalities

  13. Preprocess the data
    [
      ["Nguyen", "Vietnamese"],
      ["Tron", "Vietnamese"],
      ["Le", "Vietnamese"],
      ……
    ]
    https://github.com/chokkan/deeplearning/blob/master/notebook/rnn.ipynb

  14. Convert the data into numerical data
    [
      [[16, 35, 49, 53, 33, 42], 17],
      [[22, 46, 43, 42], 17],
      [[14, 33], 17],
      ……
    ]
    Find the alphabet (X) and the set of country names (Y)
    Build associative arrays that map a letter/country to an integer ID
    Convert letters and countries into integer IDs by using the associative arrays (see the sketch below)
    https://github.com/chokkan/deeplearning/blob/master/notebook/rnn.ipynb
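    A minimal sketch of this preprocessing; the variable names are mine, not necessarily those of the linked notebook:

    data = [["Nguyen", "Vietnamese"], ["Tron", "Vietnamese"], ["Le", "Vietnamese"]]

    # build the letter and country vocabularies and their integer IDs
    letters = sorted({ch for name, _ in data for ch in name})
    countries = sorted({country for _, country in data})
    letter_to_id = {ch: i for i, ch in enumerate(letters)}
    country_to_id = {c: i for i, c in enumerate(countries)}

    # convert every (name, country) pair into integer IDs
    encoded = [[[letter_to_id[ch] for ch in name], country_to_id[c]] for name, c in data]
    print(encoded)   # the exact IDs depend on the vocabulary built here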

  15. Bare implementation of RNN states
    h_t = g(W^{xh} x_t + W^{hh} h_{t-1}) = g(W^{h} [x_t; h_{t-1}])
    (The two weight matrices can be fused into one matrix acting on the concatenation of x_t and h_{t-1}; see the sketch below.)
    https://github.com/chokkan/deeplearning/blob/master/notebook/rnn.ipynb
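    A minimal sketch of the fused form; the linked notebook has the author's version, this one only illustrates the concatenation trick:

    import torch

    d, d_h = 4, 3
    W_h = torch.randn(d_h, d + d_h)      # acts on the concatenation [x_t; h_{t-1}]

    def rnn_step(x_t, h_prev):
        return torch.tanh(W_h @ torch.cat([x_t, h_prev]))

    h = torch.zeros(d_h)
    for x_t in torch.randn(6, d):
        h = rnn_step(x_t, h)
    print(h)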

  16. Sequential RNN module
    https://github.com/chokkan/deeplearning/blob/master/notebook/rnn.ipynb
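    A sketch of what a sequential RNN module for this task might look like with torch.nn.Embedding and torch.nn.RNN; the sizes follow the example above, and the exact code is in the linked notebook:

    import torch
    import torch.nn as nn

    class NationalityRNN(nn.Module):
        def __init__(self, num_letters=55, hidden_size=128, num_classes=18):
            super().__init__()
            self.embed = nn.Embedding(num_letters, num_letters)
            self.rnn = nn.RNN(num_letters, hidden_size, batch_first=True)
            self.out = nn.Linear(hidden_size, num_classes)

        def forward(self, letter_ids):              # (batch, name length)
            x = self.embed(letter_ids)
            _, h_n = self.rnn(x)                    # h_n: (1, batch, hidden_size)
            return self.out(h_n.squeeze(0))         # unnormalized class scores

    model = NationalityRNN()
    scores = model(torch.randint(0, 55, (2, 6)))    # two names of length 6
    print(scores.shape)                             # torch.Size([2, 18])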

  17. Mini-batch RNN
    https://github.com/chokkan/deeplearning/blob/master/notebook/rnn.ipynb
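    A sketch of one common way to mini-batch variable-length names with padding and packing; the linked notebook shows the author's version:

    import torch
    import torch.nn as nn
    from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

    names = [torch.tensor([16, 35, 49, 53, 33, 42]), torch.tensor([14, 33])]
    lengths = torch.tensor([len(n) for n in names])
    padded = pad_sequence(names, batch_first=True)          # (2, 6), padded with 0

    embed = nn.Embedding(55, 16)
    rnn = nn.RNN(16, 128, batch_first=True)
    packed = pack_padded_sequence(embed(padded), lengths, batch_first=True, enforce_sorted=False)
    _, h_n = rnn(packed)                                    # padded positions are skipped
    print(h_n.shape)                                        # torch.Size([1, 2, 128])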

  18. Long-term dependency
     Consider a simplified RNN (without an input and activation function),
    h_t = W^{hh} h_{t-1}
     After t steps, this is equivalent to multiplying (W^{hh})^t,
    h_t = (W^{hh})^t h_0
     When W^{hh} has an eigenvalue decomposition, W^{hh} = Q diag(λ) Q^{-1}
     We can compute (W^{hh})^t as,
    (W^{hh})^t = (Q diag(λ) Q^{-1})^t = Q diag(λ)^t Q^{-1}
     The eigenvalues are multiplied t times
     When |λ_i| < 1, λ_i^t → 0 (gradient vanishing)
     When |λ_i| > 1, λ_i^t → ∞ (gradient exploding)
     Computing h_t in this way is similar to the power method:
    h_t will be close to the eigenvector for the largest eigenvalue of W^{hh}, regardless of the initial vector h_0
    I Goodfellow, Y Bengio, A Courville. 2016. Deep Learning, page 286, MIT Press.
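    A small numerical illustration of this effect (my own toy example, not from the slides):

    import torch

    def propagate(W, steps=50):
        h = torch.ones(W.shape[0])
        for _ in range(steps):
            h = W @ h
        return h.norm()

    W_small = 0.5 * torch.eye(3)     # all eigenvalues 0.5 -> the state vanishes
    W_large = 1.5 * torch.eye(3)     # all eigenvalues 1.5 -> the state explodes
    print(propagate(W_small))        # roughly 0.5**50, effectively zero
    print(propagate(W_large))        # roughly 1.5**50, astronomically large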

  19. Gradient vanishing/exploding problem
     Gradients vanish or explode over time
     More detailed explanations:
     Why are deep neural networks hard to train?
    http://neuralnetworksanddeeplearning.com/chap5.html
     Why are deep neural networks hard to train? (Japanese translation)
    https://nnadl-ja.github.io/nnadl_site_ja/chap5.html
     Recurrent Neural Networks, LSTMs and Vanishing & Exploding Gradients - Fun and Easy Machine Learning
    https://www.youtube.com/watch?v=2GNbIKTKCfE
    (Diagram: an unfolded RNN chain from x_1 to x_T; gradients must flow backwards through every timestep.)

  20. Addressing gradient vanishing/exploding
     Gradient exploding
     Gradient clipping (Pascanu+ 2013): when the norm of the gradients exceeds a threshold, scale the gradients down (see the sketch below)
     Gradient vanishing
     Activation function: tanh → ReLU
     Long Short-Term Memory (LSTM)
     Gated Recurrent Unit (GRU)
     Residual Networks
    R Pascanu, T Mikolov, Y Bengio. 2013. On the difficulty of training recurrent neural networks. Proc. of ICML, pp. 1310-1318.
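    A minimal sketch of gradient clipping in PyTorch; the clipping utility is standard PyTorch, while the model, loss, and threshold of 1.0 are arbitrary choices for illustration:

    import torch
    import torch.nn as nn

    model = nn.RNN(4, 8)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    output, _ = model(torch.randn(5, 1, 4))
    loss = output.pow(2).mean()                                       # a dummy loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale if the gradient norm exceeds 1
    optimizer.step()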

  21. Long Short-Term Memory (LSTM)

  22. Long Short-Term Memory (Hochreiter+ 1997)
     Consists of (∗ denotes the elementwise product):
     Hidden state:  h_t = o_t ∗ tanh(c_t)
     Memory cell:   c_t = f_t ∗ c_{t-1} + i_t ∗ tanh(W^{xg} x_t + W^{hg} h_{t-1})
     Input gate:    i_t = σ(W^{xi} x_t + W^{hi} h_{t-1})
     Output gate:   o_t = σ(W^{xo} x_t + W^{ho} h_{t-1})
     Forget gate:   f_t = σ(W^{xf} x_t + W^{hf} h_{t-1})
     The architecture looks complicated, but an LSTM is still a neural network
     LSTMs can also be trained by the standard procedure of backpropagation
    S Hochreiter, J Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

  23. LSTM in math and diagram
    (Diagram: inside the LSTM cell, x_t and h_{t-1} feed four linear transformations followed by σ, σ, tanh, σ, giving the forget, input, candidate, and output activations that gate the memory cell c_t and the hidden state h_t.)
    Forget gate:      f_t = σ(W^{xf} x_t + W^{hf} h_{t-1})
    Input gate:       i_t = σ(W^{xi} x_t + W^{hi} h_{t-1})
    Output gate:      o_t = σ(W^{xo} x_t + W^{ho} h_{t-1})
    Candidate update: g_t = tanh(W^{xg} x_t + W^{hg} h_{t-1})
    Memory cell:      c_t = f_t ∗ c_{t-1} + i_t ∗ g_t
    Hidden state:     h_t = o_t ∗ tanh(c_t)

  24. Sequential LSTM in pytorch
     Replace torch.nn.RNN with torch.nn.LSTM
     Change the shape of the initial state (the LSTM takes a pair of hidden state and memory cell; see the sketch below)
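    A minimal sketch with illustrative sizes, not the notebook code: torch.nn.LSTM is a drop-in replacement for torch.nn.RNN, except that the initial and final states are (hidden, cell) pairs:

    import torch
    import torch.nn as nn

    lstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
    x = torch.randn(2, 6, 4)                    # (batch, sequence length, input size)
    h0 = torch.zeros(1, 2, 8)
    c0 = torch.zeros(1, 2, 8)
    output, (h_n, c_n) = lstm(x, (h0, c0))      # the state is now a (hidden, cell) pair
    print(output.shape, h_n.shape, c_n.shape)   # (2, 6, 8) (1, 2, 8) (1, 2, 8)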

  25. Implementation of LSTM cell in pytorch
    (Diagram: the LSTM cell annotated with the variables used in the code below.)
    def LSTMCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
        h_prev, c_prev = hidden                              # previous hidden state and memory cell
        # one linear transformation per input computes the pre-activations of all four gates
        gates = F.linear(x, w_x, b_x) + F.linear(h_prev, w_h, b_h)
        ingate, forgetgate, cellgate, outgate = gates.chunk(4, 1)
        ingate = F.sigmoid(ingate)                           # input gate i_t
        forgetgate = F.sigmoid(forgetgate)                   # forget gate f_t
        cellgate = F.tanh(cellgate)                          # candidate update g_t
        outgate = F.sigmoid(outgate)                         # output gate o_t
        ct = (forgetgate * c_prev) + (ingate * cellgate)     # memory cell c_t
        ht = outgate * F.tanh(ct)                            # hidden state h_t
        return ht, ct

  26. Implementation of LSTM cell in pytorch
    w_x stacks the four input-to-gate weight matrices (W^{xi}, W^{xf}, W^{xg}, W^{xo}), so a single F.linear call computes the input contributions to all four gates at once:
    def LSTMCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
        h_prev, c_prev = hidden
        gates = F.linear(x, w_x, b_x)

  27. Implementation of LSTM cell in pytorch
    w_h stacks the corresponding hidden-to-gate weight matrices (W^{hi}, W^{hf}, W^{hg}, W^{ho}); adding F.linear(h_prev, w_h, b_h) gives the pre-activation values of all four gates:
    def LSTMCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
        h_prev, c_prev = hidden
        gates = F.linear(x, w_x, b_x) + F.linear(h_prev, w_h, b_h)

  28. Implementation of LSTM cell in pytorch
    gates.chunk(4, 1) splits the pre-activations into the four gates, which are then passed through their activation functions (σ for the gates, tanh for the candidate update):
    gates = F.linear(x, w_x, b_x) + F.linear(h_prev, w_h, b_h)
    ingate, forgetgate, cellgate, outgate = gates.chunk(4, 1)
    ingate = F.sigmoid(ingate)
    forgetgate = F.sigmoid(forgetgate)
    cellgate = F.tanh(cellgate)
    outgate = F.sigmoid(outgate)

  29. Implementation of LSTM cell in pytorch
    The gate activations then update the memory cell and the hidden state, c_t = f_t ∗ c_{t-1} + i_t ∗ g_t and h_t = o_t ∗ tanh(c_t):
    outgate = F.sigmoid(outgate)
    ct = (forgetgate * c_prev) + (ingate * cellgate)
    ht = outgate * F.tanh(ct)
    return ht, ct

  30. LSTM remedies vanishing gradients
     Memory cells provide shortcuts between states
     Memory cells do not suffer from zero gradients caused by activation functions (tanh and ReLU)
     Memory cells are connected without activation functions
     Information in c_{t-1} can flow through when the forget gate is wide open (f_t = 1)
     The input at each state (i_t ∗ g_t) has no effect on the gradient flowing from c_t to c_{t-1}
    (Diagram: the chain of memory cells c_{t-1} → c_t → c_{t+1}, connected only by elementwise multiplication with the forget gates and addition of i_t ∗ g_t.)

  31. Gated Recurrent Units (GRUs)

  32. Gated Recurrent Unit (GRU) (Cho+ 2014)
     Consists of (∗ denotes the elementwise product):
     Hidden state:      h_t = z_t ∗ h_{t-1} + (1 − z_t) ∗ h̃_t
     New hidden state:  h̃_t = tanh(W^{xh} x_t + W^{hh} (r_t ∗ h_{t-1}))
     Reset gate:        r_t = σ(W^{xr} x_t + W^{hr} h_{t-1})
     Update gate:       z_t = σ(W^{xz} x_t + W^{hz} h_{t-1})
     Motivated by the LSTM unit
     But much simpler to compute and implement
    K Cho, B van Merrienboer, C Gulcehre, D Bahdanau, F Bougares, H Schwenk, Y Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. Proc. of EMNLP, pp. 1724–1734.

  33. GRU in math and diagram
    (Diagram: inside the GRU cell, x_t and h_{t-1} feed the reset and update gates; the reset gate masks h_{t-1} before the tanh that produces the candidate state h̃_t, and the update gate interpolates between h_{t-1} and h̃_t.)
    Reset gate:        r_t = σ(W^{xr} x_t + W^{hr} h_{t-1})
    Update gate:       z_t = σ(W^{xz} x_t + W^{hz} h_{t-1})
    New hidden state:  h̃_t = tanh(W^{xh} x_t + W^{hh} (r_t ∗ h_{t-1}))
    Hidden state:      h_t = z_t ∗ h_{t-1} + (1 − z_t) ∗ h̃_t

  34. Sequential GRU in PyTorch
     Replace torch.nn.RNN with torch.nn.GRU
     The shape of the initial state is unchanged (a single tensor, unlike the LSTM's pair; see the sketch below)
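    A minimal sketch with illustrative sizes, not the notebook code: torch.nn.GRU keeps the same interface as torch.nn.RNN, including a single hidden-state tensor:

    import torch
    import torch.nn as nn

    gru = nn.GRU(input_size=4, hidden_size=8, batch_first=True)
    x = torch.randn(2, 6, 4)
    h0 = torch.zeros(1, 2, 8)          # same shape as for nn.RNN
    output, h_n = gru(x, h0)
    print(output.shape, h_n.shape)     # (2, 6, 8) (1, 2, 8)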

  35. Implementation of GRU cell in PyTorch
    (Diagram: the GRU cell annotated with the variables used in the code on the next slide.)
    Reset gate:   r_t = σ(W^{xr} x_t + W^{hr} h_{t-1})
    Update gate:  z_t = σ(W^{xz} x_t + W^{hz} h_{t-1})
    The implementation uses a more computationally efficient form:
    h̃_t = tanh(W^{xh} x_t + W^{hh} (r_t ∗ h_{t-1}))  →  h̃_t = tanh(W^{xh} x_t + r_t ∗ (W^{hh} h_{t-1}))
    h_t = z_t ∗ h_{t-1} + (1 − z_t) ∗ h̃_t = h̃_t + z_t ∗ (h_{t-1} − h̃_t)

  36. Implementation of GRU cell in PyTorch
    (Diagram: the GRU cell as on the previous slide, annotated with the code variables.)
    def GRUCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
        gx = F.linear(x, w_x, b_x)                      # input contributions to r, z, and the candidate state
        gh = F.linear(hidden, w_h, b_h)                 # hidden-state contributions
        x_r, x_i, x_n = gx.chunk(3, 1)
        h_r, h_i, h_n = gh.chunk(3, 1)
        resetgate = F.sigmoid(x_r + h_r)                # reset gate r_t
        inputgate = F.sigmoid(x_i + h_i)                # update gate z_t
        newgate = F.tanh(x_n + resetgate * h_n)         # candidate state h̃_t
        hy = newgate + inputgate * (hidden - newgate)   # h_t = h̃_t + z_t ∗ (h_{t-1} − h̃_t)
        return hy

  37. Implementation of GRU cell in PyTorch
    F.linear(x, w_x, b_x) computes the input contributions to all three parts at once, with w_x stacking the input weight matrices (W^{xr}, W^{xz}, W^{xh}):
    def GRUCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
        gx = F.linear(x, w_x, b_x)

  38. Implementation of GRU cell in PyTorch
    F.linear(hidden, w_h, b_h) computes the hidden-state contributions, with w_h stacking the hidden weight matrices (W^{hr}, W^{hz}, W^{hh}):
    def GRUCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
        gx = F.linear(x, w_x, b_x)
        gh = F.linear(hidden, w_h, b_h)

  39. Implementation of GRU cell in PyTorch
    chunk(3, 1) splits the contributions into the parts for the reset gate (x_r, h_r), the update gate (x_i, h_i), and the candidate state (x_n, h_n):
    def GRUCell(x, hidden, w_x, w_h, b_x=None, b_h=None):
        gx = F.linear(x, w_x, b_x)
        gh = F.linear(hidden, w_h, b_h)
        x_r, x_i, x_n = gx.chunk(3, 1)
        h_r, h_i, h_n = gh.chunk(3, 1)

  40. Implementation of GRU cell in PyTorch
    The reset and update gates apply the sigmoid to the summed contributions, r_t = σ(W^{xr} x_t + W^{hr} h_{t-1}) and z_t = σ(W^{xz} x_t + W^{hz} h_{t-1}):
    resetgate = F.sigmoid(x_r + h_r)
    inputgate = F.sigmoid(x_i + h_i)

  41. Implementation of GRU cell in PyTorch
    The candidate state and the new hidden state follow the efficient form, h̃_t = tanh(W^{xh} x_t + r_t ∗ (W^{hh} h_{t-1})) and h_t = h̃_t + z_t ∗ (h_{t-1} − h̃_t):
    newgate = F.tanh(x_n + resetgate * h_n)
    hy = newgate + inputgate * (hidden - newgate)

  42. Comparison of RNNs (Karpathy+ 2016)
     Task: character-level language modeling (predicting subsequent characters)
     LSTMs and GRUs significantly outperform RNNs
     RNNs seem to learn different embeddings from those of LSTMs and GRUs
    A Karpathy, J Johnson, and L Fei-Fei. 2016. Visualizing and Understanding Recurrent Networks. Proc. of ICLR Workshop 2016.


  43. Observing LSTM cells (Karpathy+ 2016)
    (Figure from Karpathy+ 2016: visualization of individual LSTM cell activations over text.)
    A Karpathy, J Johnson, and L Fei-Fei. 2016. Visualizing and Understanding Recurrent Networks. Proc. of ICLR Workshop 2016.

  44. RNNs over tree

  45. Recursive Neural Network (Socher+ 2011)
    R Socher, J Pennington, E Huang, A Ng, and C Manning. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. Proc. of EMNLP, pp. 151-161.
    (Figure: the vectors of “very” and “good” are composed into a vector for “very good”, which is then composed with “movie” into a vector for “very good movie”, following the parse tree; the parameter has shape d × 2d.)
     Compose a phrase vector
    p = f(c_1, c_2) = g(W [c_1; c_2])
     c_1, c_2 ∈ ℝ^d: constituent vectors
     p ∈ ℝ^d: phrase vector
     W ∈ ℝ^{d×2d}: parameter
     g: activation function
     Recursively compose vectors along the phrase structure (parse tree) of a sentence
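    A minimal sketch of this recursive composition over the bracketing ((very good) movie); W and the word vectors are random, just to show the recursion:

    import torch

    d = 4
    W = torch.randn(d, 2 * d)
    vec = {w: torch.randn(d) for w in ["very", "good", "movie"]}

    def compose(c1, c2):
        # p = g(W [c1; c2]) with g = tanh
        return torch.tanh(W @ torch.cat([c1, c2]))

    very_good = compose(vec["very"], vec["good"])
    very_good_movie = compose(very_good, vec["movie"])
    print(very_good_movie.shape)       # torch.Size([4])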

  46. Matrix-Vector Recursive Neural Network (MV-RNN) (Socher+ 2012)
     Each word has a semantic vector and a composition matrix
     Compose a phrase vector and a phrase matrix recursively (a: vector of the first constituent, A: its matrix; b, B likewise):
    p = f(Ba, Ab) = g(W [Ba; Ab])
    P = f_M(A, B) = W_M [A; B]
    R Socher, B Huval, C Manning and A Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. Proc. of EMNLP, pp. 1201-1211.

  47. Recursive Neural Tensor Network (Socher+ 2013)
     MV-RNN has too many parameters to train, since it assigns every word its own composition matrix
     Instead, transform a word vector into a composition matrix by using a tensor
    R Socher, A Perelygin, J Wu, J Chuang, C Manning, A Ng and C Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. Proc. of EMNLP, pp. 1631-1642.

  48. Tree-structured LSTM (Tai+ 2015)
    (Figure from Tai+ 2015: a Tree-LSTM composes hidden states and memory cells along a tree structure instead of a linear chain.)
    https://pdfs.semanticscholar.org/bd19/c394931257c1901a940ba8388366c35a3e33.pdf
    K S Tai, R Socher, C D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. Proc. of ACL-IJCNLP, pp. 1556–1566.

  49. Stanford Sentiment Treebank (Socher+ 2013)
    Movie reviews are parsed into phrase structures. Each node in a parse tree has a sentiment value (--, -, 0, +, ++) assigned by three annotators.
    R Socher, A Perelygin, J Wu, J Chuang, C Manning, A Ng and C Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. Proc. of EMNLP, pp. 1631-1642.

  50. Comparison on Stanford Sentiment Treebank (Tai+ 2015)
    (Table from Tai+ 2015: accuracy of sequential and tree-structured models on the Stanford Sentiment Treebank.)
    K S Tai, R Socher, C D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. Proc. of ACL-IJCNLP, pp. 1556–1566.

  51. Convolutional Neural Networks (CNNs) for Text

  52. Convolutional Neural Network (CNN) (Kim 2014)
    Y Kim. 2014. Convolutional neural networks for sentence classification. Proc. of EMNLP, pp. 1746-1751.
    (Figure: convolution filters slide over the word embeddings of “It is a very good movie indeed”; max pooling over time yields a fixed-size vector, which a softmax layer with parameter W^{(y)} maps to the label y, here “+”.)
    Max pooling: each dimension d of the pooled vector is the maximum of the feature values c_{t,d} over timesteps,
    p_d = max_{1 ≤ t ≤ T−n+1} c_{t,d}
    (n: width of the convolution window, T: length of the sentence)
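    A minimal sketch in the spirit of this architecture: a 1-D convolution over word embeddings, max pooling over time, and a linear (softmax) classifier. All sizes are illustrative, not those of the paper:

    import torch
    import torch.nn as nn

    vocab_size, emb_dim, num_filters, window, num_classes = 100, 8, 16, 3, 2
    embed = nn.Embedding(vocab_size, emb_dim)
    conv = nn.Conv1d(emb_dim, num_filters, kernel_size=window)
    fc = nn.Linear(num_filters, num_classes)

    tokens = torch.randint(0, vocab_size, (1, 7))   # e.g., “It is a very good movie indeed”
    x = embed(tokens).transpose(1, 2)               # (batch, emb_dim, T)
    features = torch.relu(conv(x))                  # (batch, num_filters, T - window + 1)
    pooled = features.max(dim=2).values             # max pooling over time
    print(fc(pooled).shape)                         # (1, 2): unnormalized class scores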

  53. Various pooling operations (Kalchbrenner+ 2014)
    N Kalchbrenner, E Grefenstette, P Blunsom. 2014. A convolutional neural network for modelling sentences. Proc. of ACL, pp. 655-665.
     Max pooling
    p_d = max_{1 ≤ t ≤ T−n+1} c_{t,d}
     Average pooling
    p_d = (1 / (T−n+1)) Σ_{t=1}^{T−n+1} c_{t,d}
     k-max pooling
     Take the k largest values (instead of the single maximum)
     Dynamic k-max pooling
     Change the value of k adaptively based on the length (T) of the input
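    A minimal sketch of these pooling variants on a feature map of shape (batch, filters, time); k-max pooling keeps the k largest values per filter, in their original order:

    import torch

    features = torch.randn(1, 16, 5)             # (batch, filters, T - n + 1)
    max_pool = features.max(dim=2).values        # max pooling
    avg_pool = features.mean(dim=2)              # average pooling

    def k_max_pooling(x, k):
        # indices of the top-k values along time, re-sorted to preserve word order
        idx = x.topk(k, dim=2).indices.sort(dim=2).values
        return x.gather(2, idx)

    print(k_max_pooling(features, k=3).shape)    # torch.Size([1, 16, 3])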

  54. Hierarchical CNN includes Recursive NN
    (Figure: a hierarchy of convolution layers (1)–(6) over “The movie was the best of all”; stacking convolutions builds progressively larger phrase representations.)

  55. Hierarchical CNN includes Recursive NN
    (Figure: the same hierarchy, with the layers (1)–(6) aligned to the parse tree of the sentence (NP, VP, PP, S), illustrating that a hierarchical CNN includes a Recursive NN.)

  56. Hierarchical CNN (AdaSent) (Zhao+ 2015)
    (Figure: hierarchical convolution layers (1)–(4) over “The movie was the best of all”, followed by max pooling over the levels.)
    Use these vectors (e.g., their concatenation) as the input to the fully-connected layer for classification.
    H Zhao, Z Lu, P Poupart. 2015. Self-Adaptive Hierarchical Sentence Model. Proc. of IJCAI, pp. 4069-4076.

  57. Summary
     Various NN architectures that can leverage structures
     Recurrent Neural Networks (RNNs)
     Long Short-Term Memories (LSTMs)
     Gated Recurrent Units (GRUs)
     Recursive Neural Networks (Recursive NNs)
     Convolutional Neural Networks (CNNs)
     Next question:
    Can we generate sentences with neural networks?