Parsing The Web with Deep Learning

Sorami Hisamoto
September 04, 2013

@ Rakuten Institute of Technology, New York.

Transcript

  1. Parsing the Web with Deep Learning
    - a shallow introduction.
    Sorami Hisamoto
    September 4, 2013
    @ Rakuten Institute of Technology, New York

  2. Abstract
    ‣ Deep Learning
      What it’s all about, and why we should care.
    ‣ Word Embedding
      How to represent language for Deep Learning.
    ‣ Dependency Parsing
      How to incorporate Deep Learning into NLP. (Our work!)

  4. Deep Learning / Word Embedding / Dependency Parsing

  6. How to make a good predictor?
    ‣ Building good predictors on complex domains means learning complicated functions.
    ‣ One approach: composition of several layers of non-linearity.
    ➡ Deep Architecture
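
    A minimal sketch (toy data; all names, shapes and values are my own, not from the talk) of what “composition of several layers of non-linearity” means in code: each layer is an affine map followed by a non-linear function, and the predictor is their composition.

        import numpy as np

        def layer(x, W, b):
            # one layer: affine transform followed by a non-linearity (tanh)
            return np.tanh(x @ W + b)

        rng = np.random.default_rng(0)
        x = rng.normal(size=(1, 10))                   # a toy input vector
        W1, b1 = rng.normal(size=(10, 8)), np.zeros(8)
        W2, b2 = rng.normal(size=(8, 4)), np.zeros(4)
        W3, b3 = rng.normal(size=(4, 1)), np.zeros(1)

        # the "deep architecture": f(x) = readout(layer2(layer1(x)))
        h1 = layer(x, W1, b1)
        h2 = layer(h1, W2, b2)
        y = h2 @ W3 + b3                               # final linear read-out
        print(y)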

  8. Why “Deep”?
    ‣ Parts of the human brain (e.g. the visual cortex) have a deep architecture.
    ‣ Given the same number of units, a deeper architecture is more expressive than a shallow one [Bishop 1995].
    ‣ ... and other theoretical arguments in its favor.
    Figure from [Bishop 1995] "Neural Networks for Pattern Recognition"

  9. But learning a deep architecture is difficult ...
    ‣ The previous approach doesn’t work well [Rumelhart+ 1986]:
      initialize at random, then run stochastic gradient descent (SGD).
    ‣ Poor results, very slow. Vanishing gradient problem.
    ‣ High representation power ↔ difficult to learn.
    ➡ ... hence the recent breakthrough of “Deep Learning”.
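
    A small numeric illustration (toy numbers, not from the slides) of the vanishing gradient problem: backpropagating through many sigmoid layers multiplies the gradient by sigmoid derivatives (at most 0.25), so it shrinks roughly geometrically with depth.

        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        rng = np.random.default_rng(0)
        depth = 20
        grad = 1.0                        # gradient arriving at the top layer
        a = 0.0                           # pre-activation at each layer (toy value)
        for _ in range(depth):
            s = sigmoid(a)
            grad *= s * (1 - s) * rng.normal()   # chain rule: activation derivative * weight
        print(abs(grad))                  # typically a vanishingly small number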

  11. Deep Learning in action.
    ‣ Shown to be very effective in various areas (especially vision & sound).
    ‣ #1 at recent competitions in Image Recognition, Sound Recognition, Molecular Activity Prediction, etc.
    ‣ Google/Microsoft/Baidu’s voice recognition systems, and Apple’s Siri, use deep learning algorithms.
    ‣ Used in Google Street View to recognize human faces.
    ‣ etc.

  12. ... and it’s HOT!
    ‣ Workshops at NIPS, ICML, ACL, CVPR, etc.
    ‣ Baidu opened a Deep Learning research lab in Silicon Valley.
    ‣ Google acquired the company of Geoffrey Hinton (one of the originators of deep learning).
    ‣ Google’s Jeff Dean is now working on Deep Learning.
    ‣ etc.

  13. WIRED UK, June 26, 2012

  15. MIT Technology Review, April 23, 2013

  16. The New Yorker, November 25, 2012

  17. The New York Times, November 23, 2012
    Too much HYPE! (similar to “Big Data”...)

  19. So, what’s “Deep Learning” anyway?
    ‣ Two different meanings:
    ‣ Traditional: a model with many layers (e.g. a neural network), trained in a layer-wise way.
    ‣ New: unsupervised feature representation learning, at successively higher levels.
    ‣ First conference on deep learning: the “International Conference on Learning Representations” (ICLR 2013).

  21. Figure from [Bengio 2009] "Learning Deep Architectures for AI"
    Figure from Andrew Ng's presentation at Microsoft Research Faculty Summit 2013
    research.microsoft.com/en-us/events/fs2013/andrew-ng_machinelearning.pdf

  22. ... and how does it work?
    ‣ One key point is a good layer-by-layer initialization (pre-training) with a local unsupervised criterion.
    ‣ Two main approaches:
    ‣ Deep Belief Nets (DBN) [Hinton+ 2006]
      Initialize by stacking Restricted Boltzmann Machines, fine-tune with the up-down algorithm.
    ‣ Stacked Autoencoders (SAE) [Bengio+ 2007] [Ranzato+ 2007]
      Initialize by stacking autoencoders, fine-tune with gradient descent.
    ‣ SAE is simpler, but DBN performs better.

  24. Stacked (denoising) autoencoders.
    ‣ Autoencoder: basically a PCA with non-linearity.
    Figure from Pascal Vincent's presentation "Deep Learning with Denoising Autoencoders"
    http://www.iro.umontreal.ca/~lisa/seminaires/25-03-2008-2.pdf
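
    A minimal sketch (toy data; names and dimensions are my own) of one denoising-autoencoder step: corrupt the input, encode it through a non-linearity, decode it back, and measure reconstruction error against the clean input. Training minimizes this error by gradient descent, and stacking means feeding the learned hidden codes into the next autoencoder.

        import numpy as np

        rng = np.random.default_rng(0)
        x = rng.random(size=(1, 20))                 # one toy input example

        # corrupt: randomly zero out ~30% of the input (masking noise)
        mask = rng.random(x.shape) > 0.3
        x_tilde = x * mask

        # encoder / decoder parameters (randomly initialized here)
        W = rng.normal(scale=0.1, size=(20, 10))
        b, c = np.zeros(10), np.zeros(20)

        h = np.tanh(x_tilde @ W + b)                 # encode the corrupted input
        x_hat = h @ W.T + c                          # decode (tied weights, a common choice)

        loss = np.mean((x_hat - x) ** 2)             # reconstruct the *clean* input
        print(loss)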

  25. Why does unsupervised pre-training help?
    ‣ Hypotheses (shown empirically) [Erhan+ 2010]:
    1. Regularization: the parameters don’t move much.
    2. Good initialization: so the parameters don’t get stuck in poor local optima!

  28. Improving, non-stop.
    ‣ “The area is developing extremely fast, and what we knew last year may no longer be ‘true’ this year.” - Kevin Duh
    ‣ Group theory [Mallat].
    ‣ Supervised approach: Stacked SVM [Vinyals+].
    ‣ Novel applications: 3-D object classification [Socher+], text-image modeling [Srivastava & Salakhutdinov], video [Zou+], computational biology [Di Lena+], etc.
    ‣ etc.

  29. Deep Learning and NLP.
    ‣ Still not much deep learning in NLP (... yet?).
    ‣ e.g. Sentiment classification [Glorot+ 2011]:
      classification with term features, without much of a linguistic aspect.
    ‣ Success in vision (pixels) and sound (digital waves).
      NLP: how do we represent the text?
    ‣ Many NLP tasks go beyond classification: structured prediction.

  32. Deep Learning / Word Embedding / Dependency Parsing

  34. Mathematical representation of words.
    ‣ Vector Space Model [Turney & Pantel 2010].
    ‣ LSA [Deerwester+ 1990], LDA [Blei+ 2003].
    ➡ Word Embeddings with a Neural Language Model (2008~).

  36. Word embeddings, a.k.a. distributed representations
    ‣ Real-valued, dense, low-dimensional (e.g. 50) word vectors. Background also in cognitive science.
    ‣ Task independent, ideal for language modeling. Shown effective in POS tagging, NER, chunking, SRL.
    ‣ Often induced from a neural language model (next slide).

  37. Neural Language Model (NLM) [Bengio+ 2001]
    Figure from [Bengio+ 2001] "A Neural Probabilistic Language Model"
    (Annotated: hidden layers on top of the word embeddings.)

  40. Training embeddings with an NLM [Collobert+ 2011]
    ‣ Idea of “implicit negative evidence” [Smith & Eisner 2005].
    ‣ Corrupt the original phrase by replacing a word with a random one.
    ‣ The model learns by making score(original) > score(corrupted). Unsupervised!
    Figure from Richard Socher et al.'s deep learning tutorial at ACL 2012
    http://nlp.stanford.edu/~socherr/SocherBengioManning-DeepLearning-ACL2012-20120707-NoMargin.pdf
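
    A rough sketch of this ranking idea (toy scorer and data, not the authors’ code): score a window of word vectors with a simple scorer, score the same window with the centre word replaced at random, and take a hinge loss that pushes score(original) above score(corrupted) by a margin.

        import numpy as np

        rng = np.random.default_rng(0)
        vocab_size, dim, window = 1000, 50, 5
        E = rng.normal(scale=0.1, size=(vocab_size, dim))    # word embedding table
        W = rng.normal(scale=0.1, size=(window * dim, 1))    # toy linear scorer

        def score(word_ids):
            x = E[word_ids].reshape(1, -1)                   # concatenate the window's vectors
            return (x @ W).item()

        original = [12, 7, 391, 44, 8]                       # a toy window of word ids
        corrupted = list(original)
        corrupted[window // 2] = rng.integers(vocab_size)    # replace the centre word at random

        # hinge (ranking) loss: zero once the original outscores the corrupted one by a margin of 1
        loss = max(0.0, 1.0 - score(original) + score(corrupted))
        print(loss)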

  42. ‣ Similar to a normal NN, but we learn x (the word vectors) as well.
    ‣ Word vectors are initialized randomly, then learned by SGD, BFGS, etc.
    Figures by Masashi Tsubaki

  46. Progress in NLMs: some examples
    ‣ Multiple Word Prototypes [Huang+ 2012]:
      multiple vectors for words with multiple meanings.
    ‣ “Modeling and Learning Semantic Co-Compositionality through Prototype Projections and Neural Networks” [Tsubaki+ 2013]. To appear at EMNLP 2013.
    Figure from socher.org/index.php/Main/ImprovingWordRepresentationsViaGlobalContextAndMultipleWordPrototypes

  50. King - Man + Woman = Queen? [Mikolov+ 2013]
    ‣ Word vectors capture many linguistic regularities. Examples:
    ‣ vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome').
    ‣ vector('king') - vector('man') + vector('woman') is close to vector('queen').
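
    A small sketch of how such analogy queries are answered (toy, randomly initialized vectors here; the regularities only appear with embeddings trained on large corpora): form vector('king') - vector('man') + vector('woman') and return the vocabulary word whose vector has the highest cosine similarity to it.

        import numpy as np

        rng = np.random.default_rng(0)
        vocab = ["king", "queen", "man", "woman", "paris", "france", "italy", "rome"]
        E = rng.normal(size=(len(vocab), 50))          # toy embedding matrix, one row per word
        E /= np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows

        def analogy(a, b, c):
            # word closest to vector(b) - vector(a) + vector(c), excluding the query words
            v = E[vocab.index(b)] - E[vocab.index(a)] + E[vocab.index(c)]
            v /= np.linalg.norm(v)
            sims = E @ v                               # cosine similarity (rows are unit vectors)
            for i in np.argsort(-sims):
                if vocab[i] not in (a, b, c):
                    return vocab[i]

        print(analogy("man", "king", "woman"))         # with trained embeddings: "queen"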

  54. Publicly available resources
    ‣ Joseph Turian provides open-source NLM code and some pre-trained embeddings.
    ‣ In August 2013, Google released a tool to train word embeddings: word2vec.

  55. Deep Learning / Word Embedding / Dependency Parsing

  57. Dependency parsing
    ‣ Determine each word’s head in a sentence.
    ‣ Example: “John hit the ball with the bat” (“hit” is the head of “John”).
    ‣ A core process for applications such as Information Retrieval, Machine Translation, etc.

  63. Graph-based approach [McDonald+ 2005]
    “Non-Projective Dependency Parsing using Spanning Tree Algorithms”
    ‣ Input sentence: “John saw Mary.”
    ‣ Step 1: enumerate all candidate dependency edges with their costs.
    ‣ Step 2: find the spanning tree with the highest total cost, using a graph algorithm (Chu-Liu-Edmonds).
    ➡ How do we calculate those costs?
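
    A simplified sketch of the graph-based setup (toy edge costs, my own simplification): build a cost for every candidate head → modifier edge, then pick a head for each word. For brevity this picks the best head per word greedily; McDonald's parser instead runs Chu-Liu-Edmonds to guarantee the result is a well-formed spanning tree.

        import numpy as np

        words = ["ROOT", "John", "saw", "Mary"]
        rng = np.random.default_rng(0)

        # cost[h][m]: cost of an edge from head h to modifier m (random stand-ins here for
        # the learned feature-weight scores; self-loops and edges into ROOT are invalid)
        cost = rng.random((len(words), len(words)))
        np.fill_diagonal(cost, -np.inf)
        cost[:, 0] = -np.inf

        heads = {}
        for m in range(1, len(words)):                 # every word except ROOT needs a head
            h = int(np.argmax(cost[:, m]))             # greedy: highest-cost incoming edge
            heads[words[m]] = words[h]
        print(heads)                                   # e.g. {'John': 'saw', 'saw': 'ROOT', ...}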

  68. Calculating edge scores
    ‣ Input: features of a word pair. Output: the score of the edge between them.
    ‣ Pipeline: Word pair (John, saw) → Features (p-word: “John”, p-pos: NN, c-word: “saw”, c-pos: VB, ...) → Perceptron / MIRA → Score of the edge “John → saw”.
    ‣ Parse: get the scores of all edges, find the max. spanning tree. Update: adjust the weights.
    ➡ Can we make use of unlabeled data?
      → Add extra features from unlabeled data (e.g. cluster information).
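
    A compact sketch of the perceptron update behind this loop (limited to first-order edge features; the feature names follow the slide, everything else is a toy of my own): score an edge as the dot product of its features with a weight vector, and after parsing a sentence, add the features of the gold edges and subtract the features of the predicted edges.

        from collections import defaultdict

        weights = defaultdict(float)

        def edge_features(head, mod):
            # first-order features of a (head, modifier) pair, as on the slide
            return [f"p-word={head[0]}", f"p-pos={head[1]}",
                    f"c-word={mod[0]}", f"c-pos={mod[1]}",
                    f"pos-pair={head[1]}|{mod[1]}"]

        def edge_score(head, mod):
            return sum(weights[f] for f in edge_features(head, mod))

        def perceptron_update(gold_edges, predicted_edges):
            # standard structured-perceptron step: reward gold edges, penalize predicted ones
            for head, mod in gold_edges:
                for f in edge_features(head, mod):
                    weights[f] += 1.0
            for head, mod in predicted_edges:
                for f in edge_features(head, mod):
                    weights[f] -= 1.0

        # toy usage: words as (form, POS) pairs
        saw, john = ("saw", "VB"), ("John", "NN")
        perceptron_update(gold_edges=[(saw, john)], predicted_edges=[(john, saw)])
        print(edge_score(saw, john))   # the gold edge now scores higher than before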

  72. Word Representations for Dependency Parsing
    Our work, presented at NLP2013.

  73. Semi-supervised learning with word representations
    ‣ [Koo+ 2008]
      Brown clusters as extra features for dependency parsing. → Simple, yet effective.
    ‣ [Turian+ 2010]
      Brown clusters & embeddings as extra features for chunking & NER. → Effective, but Brown better than embeddings.
    ‣ [Collobert+ 2011]
      Multitask learning (POS, chunking, NER, SRL, etc.) with word embeddings and minimal feature engineering. → State-of-the-art results.

  76. Word representations for parsing the web
    ‣ Unsupervised word representation features for dependency parsing on web text.
    ‣ Web domain: lots of unlabeled data.
    ‣ Preliminary investigation for incorporating deep structure into parsing.

                        Dependency parsing   NER & Chunking    Dependency parsing (web text)
    Brown clustering    [Koo+ 2008]          [Turian+ 2010]    our work
    Word embedding      -                    [Turian+ 2010]    our work

  78. Do they help parsing the web? [Hisamoto+ 2013]
    ‣ Word embeddings & Brown clusters as extra features for dependency parsing.
    ‣ Data: Google Web Treebank.
    [Chart: per-domain results (Answers, Emails, Newsgroups, Reviews, Weblogs) for Brown (50 cl.) and C&W embeddings (50 / 1000 cl.), each with gold and predicted POS.]
    ‣ Word representations helped with predicted POS data, but not with gold POS data.
    ‣ POS tags are very strong information for dependency parsing (e.g. verbs are normally the head).

  82. Dependency Parsing with Deep Structures
    Our next ideas.

  83. Calculating edge scores, revisited.
    ‣ “John saw Mary.” → Word pair (John, saw) → Features (p-word: “John”, p-pos: NN, c-word: “saw”, c-pos: VB, ...) → Perceptron / MIRA → Score of the edge “John → saw”.
    ‣ Parse: get the max. spanning tree. Update: weights.
    ➡ Incorporate deep structures here.

  86. Preliminary idea: use multiple perceptrons
    ‣ 1st layer: train multiple different perceptrons (P1, P2, P3, ...), each on a permuted copy of the training data, and let each output a score.
    ‣ Because the data is shuffled, each perceptron is different. We can see them as “samples from a true perceptron” → better generalization (important for web text).
    ‣ 2nd layer: after the 1st-layer training is done, train a perceptron P that takes the 1st-layer scores as input and produces the final score (layer-by-layer training).
    ‣ Basically ensemble learning; similar to a Bayes Point Machine.
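
    A toy sketch of this two-layer scheme (binary toy data and a plain perceptron of my own; the talk applies the idea to edge scoring): train several perceptrons on shuffled copies of the data, then train a second-layer perceptron on the scores they produce.

        import numpy as np

        rng = np.random.default_rng(0)

        def train_perceptron(X, y, epochs=10):
            w = np.zeros(X.shape[1])
            for _ in range(epochs):
                for xi, yi in zip(X, y):
                    if yi * (xi @ w) <= 0:         # misclassified: perceptron update
                        w += yi * xi
            return w

        # toy linearly separable data, labels in {-1, +1}
        X = rng.normal(size=(200, 5))
        y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]) + 0.1)

        # 1st layer: perceptrons trained on permuted copies of the data
        first_layer = []
        for _ in range(5):
            idx = rng.permutation(len(X))
            first_layer.append(train_perceptron(X[idx], y[idx]))

        # 2nd layer: a perceptron over the 1st-layer scores (layer-by-layer training)
        scores = np.stack([X @ w for w in first_layer], axis=1)
        w2 = train_perceptron(scores, y)

        final_score = scores @ w2                  # final score for each example
        print(np.mean(np.sign(final_score) == y))  # training accuracy of the ensemble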

  100. Results
    [Chart: UAS relative to baseline (%) vs. number of 1st-layer perceptrons (2, 3, 5, 10, 20), per domain (Answers, Emails, Newsgroups, Reviews, Weblogs).]
    ‣ “Multiple perceptrons” improve the result.
    ‣ General trend: the more perceptrons, the better.

  103. Make it deeper!
    ‣ Beyond 2 layers ... Failed!

  105. Recursive SVM with random projection [Vinyals+ 2012]
    ‣ SVM 1 takes the input d; SVM 2 takes a new input x_2 that combines the initial input with a random projection of the first SVM’s prediction o_1.
    ‣ Scales like linear SVMs, and exhibits better generalization ability than kernel-based SVMs.
    ‣ Intuitively, the random projection aims to push data from different classes towards different directions.
    → Let’s incorporate this idea into dependency parsing. Failed!
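
    A rough sketch of the recursive-SVM idea (my own simplification on toy data, using scikit-learn's LinearSVC; the exact way [Vinyals+ 2012] combine the input with the projected prediction may differ): train a linear SVM, randomly project its decision scores, add them to the original features, and train a second SVM on the result.

        import numpy as np
        from sklearn.svm import LinearSVC

        rng = np.random.default_rng(0)

        # toy binary classification data (slightly non-linear)
        X = rng.normal(size=(300, 10))
        y = (X[:, 0] + 0.5 * X[:, 1] ** 2 > 0.3).astype(int)

        svm1 = LinearSVC().fit(X, y)
        o1 = svm1.decision_function(X).reshape(-1, 1)   # first-layer prediction scores

        P = rng.normal(size=(1, X.shape[1]))            # random projection matrix
        beta = 0.5                                      # mixing weight (an arbitrary choice here)
        X2 = X + beta * (o1 @ P)                        # new input: features + projected scores

        svm2 = LinearSVC().fit(X2, y)
        print(svm1.score(X, y), svm2.score(X2, y))      # the second layer usually does at least as well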

  109. Next: train embeddings & parse, together.
    ‣ Previous: Word pair (John, saw) → Features (p-word: “John”, p-pos: NN, c-word: “saw”, c-pos: VB, ...) → Perceptron / MIRA → edge score → Parse (max. spanning tree) → Update weights.
    ‣ Our idea: Word pair (John, saw) → Embeddings → Perceptron / MIRA → edge score → Parse (max. spanning tree) → Update weights & embeddings.
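
    A toy sketch of what “update weights & embeddings” could look like (my own illustration, not the presented system): score an edge from the concatenated head/modifier embeddings, and on a parsing error push both the weight vector and the embedding rows in the right direction.

        import numpy as np

        rng = np.random.default_rng(0)
        vocab = {"John": 0, "saw": 1, "Mary": 2}
        dim = 8
        E = rng.normal(scale=0.1, size=(len(vocab), dim))   # word embeddings (learned too)
        w = np.zeros(2 * dim)                               # edge-scoring weights

        def edge_vec(head, mod):
            return np.concatenate([E[vocab[head]], E[vocab[mod]]])

        def edge_score(head, mod):
            return w @ edge_vec(head, mod)

        def joint_update(gold, predicted, lr=0.1):
            # perceptron-style step that also adjusts the embeddings of the words involved
            global w
            for sign, (head, mod) in [(+1, gold), (-1, predicted)]:
                v = edge_vec(head, mod)
                E[vocab[head]] += sign * lr * w[:dim]        # score gradient w.r.t. the head embedding
                E[vocab[mod]] += sign * lr * w[dim:]         # ... and w.r.t. the modifier embedding
                w += sign * lr * v                           # usual weight update

        joint_update(gold=("saw", "John"), predicted=("John", "saw"))
        print(edge_score("saw", "John") - edge_score("John", "saw"))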

  112. Deep Learning / Word Embedding / Dependency Parsing

  113. To summarize ...
    ‣ Deep Learning is exciting!
      ‣ Learning deep structures.
      ‣ Unsupervised representation learning.
    ‣ Deep Learning in NLP.
      ‣ How to represent language? → Word embeddings.
      ‣ Structured prediction is the challenge.