Parsing The Web with Deep Learning

Sorami Hisamoto
September 04, 2013

@ Rakuten Institute of Technology, New York

Transcript

  1. Parsing the Web with Deep Learning - a shallow introduction.

    Sorami Hisamoto September 4, 2013 @ Rakuten Institute of Technology, New York
  2. Abstract ‣ Deep Learning: what it’s all about, and why we should care. ‣ Word Embedding: how to represent language for Deep Learning. ‣ Dependency Parsing: how to incorporate Deep Learning into NLP. (Our work!)
  4. Deep Learning · Word Embedding · Dependency Parsing
  6. How to make a good predictor? ‣ Building good predictors on complex domains means learning complicated functions. ‣ One approach: compose several layers of non-linearity. ➡ Deep Architecture (see the sketch below).
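
A deep architecture is literally a composition of several simple non-linear layers. A minimal NumPy sketch of such a composition (layer sizes and random weights are illustrative only; no training is shown):

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, w, b):
    """One layer: an affine transform followed by a non-linearity (tanh)."""
    return np.tanh(x @ w + b)

# Hypothetical sizes: 100-dim input, two hidden layers, one output score.
sizes = [100, 50, 50, 1]
params = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def deep_predictor(x):
    """Compose the layers: f(x) = f3(f2(f1(x)))."""
    for w, b in params:
        x = layer(x, w, b)
    return x

x = rng.normal(size=(1, 100))   # one dummy input vector
print(deep_predictor(x).shape)  # (1, 1): a single score
```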
  8. Why “Deep”? ‣ Parts of the human brain (e.g. the visual cortex) have a deep architecture. ‣ Given the same number of units, a deeper architecture is more expressive than a shallow one [Bishop 1995]. ‣ ... and there are other theoretical arguments in its favor. Figure from [Bishop 1995] "Neural Networks for Pattern Recognition".
  9. But learning a deep architecture is difficult ... ‣ The previous approach does not work well [Rumelhart+ 1986]: initialize at random, then run stochastic gradient descent (SGD). ‣ Poor results, very slow; the vanishing gradient problem. ‣ High representation power ↔ difficult to learn. ➡ ... hence the recent breakthrough of “Deep Learning”.
  11. Deep Learning in action. ‣ Shown to be very effective in various areas (especially vision & sound). ‣ #1 in recent competitions in image recognition, sound recognition, molecular activity prediction, etc. ‣ Google/Microsoft/Baidu’s voice recognition systems and Apple’s Siri use deep learning algorithms. ‣ Used in Google Street View to recognize human faces. ‣ etc.
  12. ... and it’s HOT! ‣ Workshops at NIPS, ICML, ACL, CVPR, etc. ‣ Baidu opened a Deep Learning research lab in Silicon Valley. ‣ Google acquired the company of Geoffrey Hinton (one of the originators of deep learning). ‣ Google’s Jeff Dean is now working on Deep Learning. ‣ etc.
  13. WIRED UK, June 26, 2012

  15. MIT Technology Review, April 23, 2013

  16. The New Yorker, November 25, 2012

  17. The New York Times, November 23, 2012 ... Too much HYPE! (similar to “Big Data” ...)
  19. So, what’s “Deep Learning” anyway? ‣ Two different meanings: ‣ Traditional: a model with many layers (e.g. a neural network), trained in a layer-wise way. ‣ New: unsupervised feature representation learning, at successively higher levels. ‣ The first conference on deep learning: the “International Conference on Learning Representations” (ICLR 2013).
  21. Figures from [Bengio 2009] "Learning Deep Architectures for AI" and from Andrew Ng's presentation at the Microsoft Research Faculty Summit 2013: research.microsoft.com/en-us/events/fs2013/andrew-ng_machinelearning.pdf
  22. ... and how does it work? ‣ One key point is a good layer-by-layer initialization (pre-training) with a local unsupervised criterion. ‣ Two main approaches: ‣ Deep Belief Nets (DBN) [Hinton+ 2006]: initialize by stacking Restricted Boltzmann Machines, fine-tune with the up-down algorithm. ‣ Stacked Autoencoders (SAE) [Bengio+ 2007] [Ranzato+ 2007]: initialize by stacking autoencoders, fine-tune with gradient descent. ‣ SAE is simpler, but DBN performs better.
  24. Stacked (denoising) autoencoders ‣ Autoencoder: basically PCA with a non-linearity (a toy example follows below). Figure from Pascal Vincent's presentation "Deep Learning with Denoising Autoencoders": http://www.iro.umontreal.ca/~lisa/seminaires/25-03-2008-2.pdf
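
A minimal sketch of a denoising autoencoder in that spirit: encode a corrupted input with a non-linearity, decode, and do SGD on the reconstruction error against the clean input. All sizes, the noise level, and the toy data are placeholders, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))          # toy data: 500 samples, 20 features

d_in, d_hid, lr = 20, 5, 0.01
W1 = rng.normal(scale=0.1, size=(d_in, d_hid)); b1 = np.zeros(d_hid)
W2 = rng.normal(scale=0.1, size=(d_hid, d_in)); b2 = np.zeros(d_in)

for epoch in range(50):
    for x in X:
        x_noisy = x + rng.normal(scale=0.3, size=d_in)   # corrupt the input
        h = np.tanh(x_noisy @ W1 + b1)                   # encode (non-linear "PCA")
        x_hat = h @ W2 + b2                              # decode
        err = x_hat - x                                  # reconstruct the CLEAN input
        # Backpropagate the squared error through the two layers.
        gW2 = np.outer(h, err);       gb2 = err
        dh = (err @ W2.T) * (1 - h ** 2)
        gW1 = np.outer(x_noisy, dh);  gb1 = dh
        W2 -= lr * gW2; b2 -= lr * gb2
        W1 -= lr * gW1; b1 -= lr * gb1

recon = np.tanh(X @ W1 + b1) @ W2 + b2
print("mean squared reconstruction error:", np.mean((recon - X) ** 2))
```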
  25. Why does unsupervised pre-training help? ‣ Hypotheses (shown empirically) [Erhan+ 2010]: 1. Regularization: the parameters don’t move much. 2. Good initialization: the parameters don’t get stuck in poor local optima.
  28. Improving, non-stop. ‣ “The area is developing extremely fast, and what we knew last year may no longer be ‘true’ this year.” - Kevin Duh ‣ Group theory [Mallat]. ‣ Supervised approach: stacked SVMs [Vinyals+]. ‣ Novel applications: 3-D object classification [Socher+], text-image modeling [Srivastava & Salakhutdinov], video [Zou+], computational biology [Di Lena+], etc.
  29. Deep Learning and NLP ‣ Still not much deep learning in NLP (... yet?). ‣ e.g. sentiment classification [Glorot+ 2011]: classification with term features, without much of a linguistic aspect. ‣ Success in vision (pixels) and sound (digital waves); in NLP, how do we represent the text? ‣ Many NLP tasks go beyond classification: structured prediction.
  32. Deep Learning · Word Embedding · Dependency Parsing
  34. Mathematical representation of words ‣ Vector Space Model [Turney & Pantel 2010]. ‣ LSA [Deerwester+ 1990], LDA [Blei+ 2003]. ➡ Word embeddings with a Neural Language Model (2008~).
  36. Word embeddings, a.k.a. distributed representations ‣ Real-valued, dense, low-dimensional (e.g. 50-dimensional) word vectors; the background also comes from cognitive science. ‣ Task independent, ideal for language modeling; shown effective in POS tagging, NER, chunking, and SRL. ‣ Often induced from a neural language model (next slide).
  37. Neural Language Model (NLM) [Bengio+ 2001]. Figure from [Bengio+ 2001] "A Neural Probabilistic Language Model"; annotation: beneath the NLM's hidden layers sit the word embeddings (the lookup layer).
  40. Training embeddings with an NLM [Collobert+ 2011] ‣ Idea of “implicit negative evidence” [Smith & Eisner 2005]. ‣ Corrupt the original phrase by replacing a word with a random one. ‣ Learn the model by making score(original) > score(corrupted) (see the sketch below). ‣ Unsupervised! Figure from Richard Socher et al.'s ACL 2012 tutorial on deep learning for NLP: http://nlp.stanford.edu/~socherr/SocherBengioManning-DeepLearning-ACL2012-20120707-NoMargin.pdf
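
The criterion behind this is a pairwise ranking (hinge) loss: an observed window of text should score higher, by a margin, than the same window with its centre word replaced by a random word. A simplified sketch with a linear scorer over concatenated embeddings (the actual [Collobert+ 2011] model adds a hidden layer; the toy corpus, vocabulary, and sizes below are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["john", "hit", "the", "ball", "with", "bat", "."]
word2id = {w: i for i, w in enumerate(vocab)}
corpus = ["john hit the ball with the bat .".split()]

dim, win, lr = 10, 3, 0.05
E = rng.normal(scale=0.1, size=(len(vocab), dim))   # word embeddings (learned)
w = rng.normal(scale=0.1, size=win * dim)           # linear scorer (learned)

def score(ids):
    return w @ E[ids].reshape(-1)

for epoch in range(100):
    for sent in corpus:
        ids = [word2id[t] for t in sent]
        for i in range(len(ids) - win + 1):
            orig = ids[i:i + win]
            corrupt = list(orig)
            corrupt[win // 2] = rng.integers(len(vocab))   # random centre word
            loss = 1.0 - score(orig) + score(corrupt)      # hinge: want s(orig) > s(corrupt) + 1
            if loss > 0:
                # Gradient step on the scorer and on the embeddings of both windows.
                w += lr * (E[orig].reshape(-1) - E[corrupt].reshape(-1))
                for j, (o, c) in enumerate(zip(orig, corrupt)):
                    E[o] += lr * w[j * dim:(j + 1) * dim]
                    E[c] -= lr * w[j * dim:(j + 1) * dim]
```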
  42. (Figures by Masashi Tsubaki) ‣ Similar to a normal NN, but we learn the input x (the word vectors) as well. ‣ Word vectors are initialized randomly and learned by SGD, BFGS, etc.
  46. Progress in NLMs: some examples ‣ Multiple Word Prototypes [Huang+ 2012]: multiple vectors for words with multiple meanings. ‣ “Modeling and Learning Semantic Co-Compositionality through Prototype Projections and Neural Networks” [Tsubaki+ 2013], to appear in EMNLP 2013. Figure from socher.org/index.php/Main/ImprovingWordRepresentationsViaGlobalContextAndMultipleWordPrototypes
  50. King - Man + Woman = Queen? [Mikolov+ 2013] ‣ Word vectors capture many linguistic regularities. For example: ‣ vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome'). ‣ vector('king') - vector('man') + vector('woman') is close to vector('queen'). (A small example of this arithmetic follows below.)
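
Checking such a regularity is just vector arithmetic plus a nearest-neighbour search under cosine similarity. A sketch assuming an `emb` dict mapping words to NumPy vectors (the random vectors below exist only so the snippet runs; real embeddings come from a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
words = ["king", "queen", "man", "woman", "paris", "france", "italy", "rome"]
emb = {w: rng.normal(size=50) for w in words}   # placeholder vectors

def analogy(a, b, c, emb):
    """Return the word whose vector is closest (cosine) to vec(a) - vec(b) + vec(c)."""
    target = emb[a] - emb[b] + emb[c]
    target /= np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for w, v in emb.items():
        if w in (a, b, c):          # exclude the query words themselves
            continue
        sim = (v @ target) / np.linalg.norm(v)
        if sim > best_sim:
            best, best_sim = w, sim
    return best

print(analogy("king", "man", "woman", emb))      # ideally "queen" with real embeddings
print(analogy("paris", "france", "italy", emb))  # ideally "rome"
```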
  54. Publicly available resources ‣ Joseph Turian provides open-source NLM code and some pre-trained embeddings (a loader sketch follows below). ‣ In August 2013, Google released word2vec, a tool to train word embeddings.
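
Both Turian's pre-trained embeddings and word2vec's text output are commonly distributed as plain text, one word per line followed by its vector components. A minimal loader under that assumption (the file name is hypothetical):

```python
import numpy as np

def load_embeddings(path):
    """Read 'word v1 v2 ... vd' lines into a {word: np.ndarray} dict."""
    emb = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) < 2:
                continue          # skip blank lines or a possible header line
            emb[parts[0]] = np.asarray(parts[1:], dtype=float)
    return emb

# emb = load_embeddings("turian-embeddings-50d.txt")  # hypothetical file name
```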
  55. Deep Learning · Word Embedding · Dependency Parsing
  57. Dependency parsing ‣ Determine each word’s head in a sentence. ‣ Example: “John hit the ball with the bat” (“hit” is the head of “John”). ‣ A core process for applications such as Information Retrieval, Machine Translation, etc.
  63. Graph-based approach [McDonald+ 2005] “Non-Projective Dependency Parsing using Spanning Tree Algorithms” ‣ Input sentence: “John saw Mary.” ‣ Step 1: enumerate all dependency edges with their costs. ‣ Step 2: find the spanning tree with the highest total cost, using a graph algorithm (Chu-Liu-Edmonds). ‣ How do we calculate those scores? (A decoding sketch follows below.)
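
A small sketch of the two steps for “John saw Mary.”: score every candidate head → dependent edge (with made-up scores standing in for the learned model) and keep the highest-scoring spanning tree rooted at an artificial ROOT, here using networkx's Edmonds/Chu-Liu-Edmonds implementation for the decoding:

```python
import networkx as nx

tokens = ["John", "saw", "Mary", "."]

def edge_score(head, dep):
    """Stand-in for the learned scorer: a hand-made score table for illustration."""
    table = {("saw", "John"): 9, ("saw", "Mary"): 8, ("saw", "."): 5,
             ("ROOT", "saw"): 10}
    return table.get((head, dep), 1)   # every other edge gets a low default score

# Step 1: a complete directed graph of candidate edges with their scores.
G = nx.DiGraph()
for d in tokens:
    G.add_edge("ROOT", d, weight=edge_score("ROOT", d))
    for h in tokens:
        if h != d:
            G.add_edge(h, d, weight=edge_score(h, d))

# Step 2: the highest-scoring spanning tree (Chu-Liu-Edmonds / Edmonds' algorithm).
tree = nx.maximum_spanning_arborescence(G, attr="weight")
print(sorted(tree.edges()))   # expected: ROOT->saw, saw->John, saw->Mary, saw->.
```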
  68. Calculating edge scores ‣ Input: features of a word pair, e.g. for (John, saw): p-word: “John”, p-pos: NN, c-word: “saw”, c-pos: VB, ... ‣ Output: the score of the edge “John → saw”, from a Perceptron / MIRA. ‣ Training loop: get the scores of all edges → parse (get the max. spanning tree) → update the weights (see the sketch below). ➡ Can we make use of unlabeled data? Add extra features derived from unlabeled data (e.g. cluster information).
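
The update step in this loop is, in the perceptron case, a structured-perceptron step: move the weights toward the gold tree's features and away from the predicted tree's. A schematic sketch with dictionary-based sparse features; the feature templates are simplified, and `decode_tree` is a placeholder for the spanning-tree decoder:

```python
from collections import defaultdict

def edge_features(sent, head, dep):
    """A tiny feature template for one candidate edge (p- = parent, c- = child)."""
    hw, hp = sent[head] if head >= 0 else ("ROOT", "ROOT")
    dw, dp = sent[dep]
    return [f"p-word={hw}", f"p-pos={hp}", f"c-word={dw}", f"c-pos={dp}",
            f"p-pos|c-pos={hp}|{dp}"]

def tree_features(sent, heads):
    feats = defaultdict(float)
    for dep, head in enumerate(heads):
        for f in edge_features(sent, head, dep):
            feats[f] += 1.0
    return feats

def perceptron_update(w, sent, gold_heads, decode_tree, lr=1.0):
    """One structured-perceptron step: w += f(gold tree) - f(predicted tree)."""
    pred_heads = decode_tree(sent, w)            # e.g. max spanning tree under w
    if pred_heads == gold_heads:
        return
    gold, pred = tree_features(sent, gold_heads), tree_features(sent, pred_heads)
    for f in set(gold) | set(pred):
        w[f] += lr * (gold.get(f, 0.0) - pred.get(f, 0.0))

# sent: list of (word, POS); heads: head index per token, -1 for ROOT.
sent = [("John", "NN"), ("saw", "VB"), ("Mary", "NN"), (".", ".")]
gold = [1, -1, 1, 1]
w = defaultdict(float)
perceptron_update(w, sent, gold, decode_tree=lambda s, w: [1, -1, 1, 0])  # dummy decoder
```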
  72. Word Representations for Dependency Parsing Our work, presented at NLP2013.

  73. Semi-supervised learning with word representations ‣ [Koo+ 2008]: Brown clusters as extra features for dependency parsing. → Simple, yet effective. ‣ [Turian+ 2010]: Brown clusters & embeddings as extra features for chunking & NER. → Effective, but Brown clusters work better than embeddings. ‣ [Collobert+ 2011]: multitask learning (POS, chunking, NER, SRL, etc.) with word embeddings and minimal feature engineering. → State-of-the-art results.
  76. Word representations for parsing the web ‣ Unsupervised word representation features for dependency parsing on web text. ‣ Web domain: lots of unlabeled data. ‣ A preliminary investigation toward incorporating deep structure into parsing.

                        Dependency parsing   NER & Chunking   Dependency parsing (web text)
      Brown clustering  [Koo+ 2008]          [Turian+ 2010]   our work
      Word embedding                         [Turian+ 2010]   our work
  78. Do they help parsing the web? [Hisamoto+ 2013] ‣ Word embeddings & Brown clusters as extra features for dependency parsing. ‣ Data: Google Web Treebank. [Chart: scores relative to baseline on Answers / Emails / Newsgroups / Reviews / Weblogs for six settings: Brown 50 cl., C&W 50 cl., C&W 1000 cl., each with Gold POS and with Predicted POS.] ‣ Word representations helped with Predicted POS data, but not with Gold POS data. ‣ POS tags are very strong information for dependency parsing (e.g. verbs are normally the head).
  82. Dependency Parsing with Deep Structures Our next ideas.

  83. Calculating edge scores, revisited ‣ “John saw Mary.” → word pair (John, saw) → features (p-word: “John”, p-pos: NN, c-word: “saw”, c-pos: VB, ...) → Perceptron / MIRA → score of the edge “John → saw” → parse (get the max. spanning tree) → update the weights. ➡ Idea: incorporate deep structures here.
  86. Preliminary idea: use multiple perceptrons ‣ 1st layer: train multiple different perceptrons P1, P2, P3, ... on shuffled (permuted) copies of the training data. Because the data is shuffled, each perceptron ends up different; they can be seen as “samples from a true perceptron”, giving better generalization (important for web text). ‣ 2nd layer: after the 1st-layer training is done, train a perceptron P in the 2nd layer that takes the 1st-layer scores as input and produces the final score (layer-by-layer training). ‣ Basically ensemble learning; similar to a Bayes Point Machine. (A toy sketch of this stacking follows below.)
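
Structurally this is stacking: several first-layer perceptrons trained on differently shuffled copies of the data, then a second-layer perceptron trained on their scores. A toy binary-classification sketch of that structure (the talk applies it to edge scoring inside the parser, not to plain classification; data and sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_perceptron(X, y, epochs=20):
    """Plain perceptron; y in {-1, +1}. Returns a weight vector (with bias)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:
                w += yi * xi
    return w

def scores(w, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ w

# Toy data.
X = rng.normal(size=(200, 10))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200))

# 1st layer: k perceptrons, each trained on a differently shuffled copy of the data.
k = 5
first_layer = []
for _ in range(k):
    idx = rng.permutation(len(X))
    first_layer.append(train_perceptron(X[idx], y[idx]))

# 2nd layer: a perceptron over the k first-layer scores (layer-by-layer training).
S = np.column_stack([scores(w, X) for w in first_layer])
w_top = train_perceptron(S, y)

final_scores = scores(w_top, S)
print("training accuracy:", np.mean(np.sign(final_scores) == y))
```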
  100. Results [Chart: UAS relative to baseline (%) vs. the number of perceptrons in the 1st layer (2, 3, 5, 10, 20), for Answers / Emails / Newsgroups / Reviews / Weblogs.] ‣ “Multiple perceptrons” improve the result. ‣ General trend: the more perceptrons, the better.
  103. Make it deeper! ‣ Beyond 2 layers ... → Failed!
  105. Recursive SVM with random projection [Vinyals+ 2012] ‣ Architecture: the initial input d goes into SVM 1; its prediction o_1 is passed through a random projection matrix and combined with the initial input to form the new input x_2 for SVM 2, and so on. ‣ Scales like linear SVMs, yet exhibits better generalization ability than kernel-based SVMs. ‣ Intuitively, the random projection aims to push data from different classes towards different directions. → Let’s incorporate this idea into dependency parsing. → Failed! (A rough sketch of the recursion follows below.)
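
A rough sketch of the recursion as read off the slide: train a linear SVM, squash and randomly project its decision values, mix them back into the original input, and train the next SVM on that enriched input. The squashing and the additive combination are this sketch's assumptions; the exact rule in [Vinyals+ 2012] may differ, and scikit-learn's LinearSVC plus the toy data are only for illustration:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = (X[:, 0] * X[:, 1] > 0).astype(int)          # a toy, non-linearly separable problem

# Layer 1: a plain linear SVM on the initial input d.
svm1 = LinearSVC().fit(X, y)
o1 = np.tanh(svm1.decision_function(X)).reshape(-1, 1)   # squashed SVM-1 output (assumption)

# Random projection of o1, added back to the initial input to form x_2.
W = rng.normal(scale=0.5, size=(X.shape[1], o1.shape[1]))  # random projection matrix
X2 = X + o1 @ W.T

# Layer 2: a linear SVM on the enriched input.
svm2 = LinearSVC().fit(X2, y)
print("layer 1 accuracy:", svm1.score(X, y))
print("layer 2 accuracy:", svm2.score(X2, y))
```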
  109. Next: train embeddings & parse, together ‣ Previous: word pair (John, saw) → features (p-word: “John”, p-pos: NN, c-word: “saw”, c-pos: VB, ...) → Perceptron / MIRA → edge score → parse (max. spanning tree) → update the weights. ‣ Our idea: word pair (John, saw) → embeddings → Perceptron / MIRA → edge score → parse (max. spanning tree) → update the weights & the embeddings. (A schematic sketch follows below.)
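
One possible reading of "update weights & embeddings" (a schematic illustration, not the authors' implementation): let the edge score be linear in the head and dependent embeddings, and at each perceptron step move both the weight vector and the embeddings of the words appearing in the gold and predicted edges:

```python
import numpy as np

dim = 50
rng = np.random.default_rng(0)
E = {w: rng.normal(scale=0.1, size=dim) for w in ["ROOT", "John", "saw", "Mary", "."]}
u = np.zeros(2 * dim)   # scoring weights over [head embedding; dependent embedding]

def edge_score(head, dep, E, u):
    return u @ np.concatenate([E[head], E[dep]])

def joint_update(gold_edges, pred_edges, E, u, lr=0.1):
    """Perceptron-style step that moves the weights u AND the embeddings in E."""
    for edges, sign in ((gold_edges, +1.0), (pred_edges, -1.0)):
        for head, dep in edges:
            phi = np.concatenate([E[head], E[dep]])
            u_head, u_dep = u[:dim].copy(), u[dim:].copy()
            # score = u . [E[head]; E[dep]], so:
            #   d score / d u       = phi
            #   d score / d E[head] = u_head,   d score / d E[dep] = u_dep
            u += sign * lr * phi
            E[head] += sign * lr * u_head
            E[dep] += sign * lr * u_dep

gold = [("ROOT", "saw"), ("saw", "John"), ("saw", "Mary"), ("saw", ".")]
pred = [("ROOT", "saw"), ("saw", "John"), ("Mary", "saw"), ("saw", ".")]   # one wrong edge
joint_update(gold, pred, E, u)
```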
  112. Deep Learning · Word Embedding · Dependency Parsing
  113. To summarize ... ‣ Deep Learning is exciting! ‣ Learning deep structures. ‣ Unsupervised representation learning. ‣ Deep Learning in NLP: ‣ How to represent language? → Word embeddings. ‣ Structured prediction is the challenge.