and why we should care. ‣ Word Embedding ‣ How to represent language for Deep Learning. ‣ Dependency Parsing ‣ How to incorporate Deep Learning into NLP. ← Our work!
predictors on complex domains mean learning complicated functions. ‣ One approach: a composition of several layers of non-linearity. ➡ Deep Architecture
visual cortex) has a deep architecture. ‣ Given the same number of units, a deeper architecture is more expressive than a shallow one [Bishop 1995]. ‣ ... and other theoretical arguments in its favor. Figure from [Bishop 1995] "Neural Networks for Pattern Recognition"
‣ The previous approach, initialize at random then run stochastic gradient descent (SGD) [Rumelhart+ 1986], doesn’t work well for deep networks. ‣ Poor results, very slow training: the vanishing gradient problem. ‣ High representation power ↔ difficult to learn. ➡ ... hence the recent breakthrough of “Deep Learning”.
various areas (especially in vision & sound). ‣ #1 at recent competitions in image recognition, sound recognition, molecular activity prediction, etc. ‣ Google’s, Microsoft’s, and Baidu’s voice recognition systems, and Apple’s Siri, use deep learning algorithms. ‣ Used in Google Street View to recognize human faces. ‣ etc.
ACL, CVPR, etc. ‣ Baidu opened a Deep Learning research lab in Silicon Valley. ‣ Google acquired the company of Geoffrey Hinton (one of the originators of deep learning). ‣ Google’s Jeff Dean is now working on Deep Learning. ‣ etc.
‣ Traditional: a model with many layers (e.g. a neural network), trained in a layer-wise way. ‣ New: unsupervised feature representation learning, at successively higher levels. ‣ First conference on deep learning: the “International Conference on Learning Representations” (ICLR 2013).
AI" Figure from Andrew Ng's presentation at Microsoft Research Faculty Summit 2013 research.microsoft.com/en-us/events/fs2013/andrew-ng_machinelearning.pdf
point is a good layer-by-layer initialization (pre-training) with a local unsupervised criterion. ‣ Two main approaches: ‣ Deep Belief Nets (DBN) [Hinton+ 2006]: initialize by stacking Restricted Boltzmann Machines, fine-tune with the up-down algorithm. ‣ Stacked Autoencoders (SAE) [Bengio+ 2007] [Ranzato+ 2007]: initialize by stacking autoencoders, fine-tune with gradient descent (a rough sketch follows below). ‣ SAE is simpler, but DBN performs better.
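To make the layer-wise recipe concrete, here is a minimal sketch of SAE-style greedy pre-training in PyTorch. The toy data, layer widths, learning rate, and sigmoid activations are all assumptions made for illustration, not the settings of [Bengio+ 2007] or of this work.

```python
import torch
import torch.nn as nn

# Greedy layer-wise pre-training with autoencoders (toy data, made-up sizes).
torch.manual_seed(0)
X = torch.randn(256, 100)        # unlabeled data
sizes = [100, 64, 32]            # hypothetical layer widths

encoders = []
inp = X
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    enc, dec = nn.Linear(d_in, d_out), nn.Linear(d_out, d_in)
    opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=0.1)
    for _ in range(100):         # train this autoencoder to reconstruct its input
        opt.zero_grad()
        recon = dec(torch.sigmoid(enc(inp)))
        nn.functional.mse_loss(recon, inp).backward()
        opt.step()
    encoders.append(enc)
    inp = torch.sigmoid(enc(inp)).detach()   # codes become the next layer's input

# Stack the pre-trained encoders; fine-tuning on the supervised task (omitted here)
# would then adjust the whole network with gradient descent.
deep_net = nn.Sequential(*[nn.Sequential(e, nn.Sigmoid()) for e in encoders])
```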
and what we knew last year may no longer be "true" this year.” - Kevin Duh ‣ Group theory [Mallat]. ‣ Supervised approach: stacked SVM [Vinyals+]. ‣ Novel applications: 3-D object classification [Socher+], text-image modeling [Srivastava & Salakhutdinov], video [Zou+], computational biology [Di Lena+], etc. ‣ etc.
learning in NLP (...yet?). ‣ e.g. sentiment classification [Glorot+ 2011]: classification with term features, without much of a linguistic aspect. ‣ Success in vision (pixels) and sound (digital waveforms). In NLP: how do we represent the text? ‣ Many NLP tasks go beyond classification: structured prediction.
(e.g. 50-dimensional) word vectors. Background also from cognitive science. ‣ Task-independent, ideal for language modeling. Shown effective in POS tagging, NER, chunking, and SRL. ‣ Often induced from a neural language model (next slide).
“Implicit Negative Evidence” [Smith&Eisner 2005]. ‣ Corrupt the original phrase by replacing a word with a random one. ‣ The model learns by making score(original) > score(corrupted). → Unsupervised! Figure from Richard Socher et al.'s tutorial "A Neural Probabilistic Language Model" at ACL 2012: http://nlp.stanford.edu/~socherr/SocherBengioManning-DeepLearning-ACL2012-20120707-NoMargin.pdf
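A rough sketch of this corrupt-and-rank criterion with a margin (hinge) loss, in PyTorch. The vocabulary size, window width, network shape, and the randomly generated "corpus" batches are placeholders; a real setup would draw context windows from unlabeled text, as in [Collobert+ 2011].

```python
import torch
import torch.nn as nn

# Score phrases with a small network over concatenated word embeddings,
# and train so that original windows outscore corrupted ones.
torch.manual_seed(0)
vocab, dim, window = 1000, 50, 5
emb = nn.Embedding(vocab, dim)
scorer = nn.Sequential(nn.Linear(window * dim, 100), nn.Tanh(), nn.Linear(100, 1))
opt = torch.optim.SGD(list(emb.parameters()) + list(scorer.parameters()), lr=0.01)

def score(word_ids):                 # word_ids: (batch, window)
    return scorer(emb(word_ids).view(word_ids.size(0), -1))

for step in range(100):
    orig = torch.randint(0, vocab, (32, window))              # stand-in for corpus windows
    corrupt = orig.clone()
    corrupt[:, window // 2] = torch.randint(0, vocab, (32,))  # replace the centre word at random
    # Hinge loss: push score(original) above score(corrupted) by a margin of 1.
    loss = torch.clamp(1 - score(orig) + score(corrupt), min=0).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```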
for words with multiple meanings. ‣ “Modeling and Learning Semantic Co-Compositionality through Prototype Projections and Neural Networks” [Tsubaki+ 2013]. To appear in EMNLP 2013. Progress in NLMs: some examples. Figure from socher.org/index.php/Main/ImprovingWordRepresentationsViaGlobalContextAndMultipleWordPrototypes
‣ Word vectors capture many linguistic regularities. ‣ For example: ‣ vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome'). ‣ vector('king') - vector('man') + vector('woman') is close to vector('queen').
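The arithmetic itself is easy to demonstrate; the sketch below uses hand-crafted 2-dimensional toy vectors chosen so the analogy works out, purely to show the nearest-neighbour computation. Real word vectors are learned from data and have far more dimensions.

```python
import numpy as np

# Toy, hand-crafted 2-d "embeddings" chosen so the analogy works; real vectors are learned.
emb = {
    "Paris":  np.array([1.0, 1.0]),
    "France": np.array([0.0, 1.0]),
    "Italy":  np.array([0.0, 2.0]),
    "Rome":   np.array([1.0, 2.0]),
    "saw":    np.array([0.3, -0.7]),
}

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest(query, exclude):
    # Return the word whose vector is most cosine-similar to the query vector.
    return max((w for w in emb if w not in exclude), key=lambda w: cos(emb[w], query))

query = emb["Paris"] - emb["France"] + emb["Italy"]
print(nearest(query, exclude={"Paris", "France", "Italy"}))   # -> Rome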
Spanning Tree Algorithms. Input sentence: “John saw Mary.” ‣ Step 1: enumerate all dependency edges with their costs. ‣ Step 2: find the spanning tree with the highest cost, by a graph algorithm (Chu-Liu-Edmonds). How to calculate those scores? (A toy sketch of Step 2 follows.)
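A toy illustration of Step 2, assuming networkx's `maximum_spanning_arborescence` (a Chu-Liu-Edmonds-style routine) is available; the edge scores below are invented for "John saw Mary." and would come from the trained scorer in practice.

```python
import networkx as nx

# Invented edge scores (head -> dependent) for "John saw Mary."
scores = {
    ("ROOT", "saw"): 10.0, ("ROOT", "John"): 2.0, ("ROOT", "Mary"): 1.0,
    ("saw", "John"): 9.0,  ("saw", "Mary"): 8.0,
    ("John", "saw"): 3.0,  ("John", "Mary"): 1.0,
    ("Mary", "saw"): 2.0,  ("Mary", "John"): 1.0,
}

G = nx.DiGraph()
for (head, dep), s in scores.items():
    G.add_edge(head, dep, weight=s)

# Step 2: highest-scoring spanning tree (arborescence) rooted at ROOT.
tree = nx.maximum_spanning_arborescence(G, attr="weight")
print(sorted(tree.edges()))   # expected: ROOT->saw, saw->John, saw->Mary
```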
word pair. ‣ Output: edge score between the word pair. ‣ Pipeline: word pair (John, saw) → features (p-word: “John”, p-pos: NN, c-word: “saw”, c-pos: VB, ...) → Perceptron / MIRA → score of the edge “John → saw”; get the scores of all edges → parse (get the max. spanning tree) → update weights. ➡ Can we make use of unlabeled data? → Add extra features from unlabeled data (e.g. cluster info.). (A minimal training-loop sketch follows.)
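A minimal, self-contained sketch of that training loop as a plain structured perceptron over edge-factored features, decoding with networkx's `maximum_spanning_arborescence`. The single toy sentence, the feature templates, and the number of epochs are assumptions; the actual system uses MIRA and much richer features.

```python
from collections import defaultdict
import networkx as nx

def feats(sent, pos, h, c):
    # p-word / p-pos / c-word / c-pos style features for a candidate head -> child edge.
    return [f"p-word={sent[h]}", f"p-pos={pos[h]}", f"c-word={sent[c]}", f"c-pos={pos[c]}",
            f"p-pos,c-pos={pos[h]},{pos[c]}"]

def decode(sent, pos, w):
    # Score every candidate edge, then take the maximum spanning tree (Chu-Liu-Edmonds style).
    G = nx.DiGraph()
    for h in range(len(sent)):
        for c in range(1, len(sent)):                 # index 0 is the artificial ROOT
            if h != c:
                G.add_edge(h, c, weight=sum(w[f] for f in feats(sent, pos, h, c)))
    tree = nx.maximum_spanning_arborescence(G, attr="weight")
    return {c: h for h, c in tree.edges()}            # child -> head

# One toy training example.
sent = ["<ROOT>", "John", "saw", "Mary", "."]
pos  = ["<ROOT>", "NNP", "VBD", "NNP", "."]
gold = {1: 2, 2: 0, 3: 2, 4: 2}                       # child -> head

w = defaultdict(float)
for epoch in range(10):
    pred = decode(sent, pos, w)
    for c, h_gold in gold.items():                    # perceptron update on wrong edges
        if pred[c] != h_gold:
            for f in feats(sent, pos, h_gold, c):
                w[f] += 1.0
            for f in feats(sent, pos, pred[c], c):
                w[f] -= 1.0

print(decode(sent, pos, w))                           # should recover the gold heads
```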
clusters as extra features for dependency parsing. → Simple, yet effective. ‣ [Turian+ 2010] Brown clusters & embeddings as extra features for chunking & NER. → Effective, but Brown clusters work better than embeddings. ‣ [Collobert+ 2011] Multitask learning (POS, chunking, NER, SRL, etc.) with word embeddings and minimal feature engineering. → State-of-the-art results.
representation features for dependency parsing on web text. ‣ Web domain: lots of unlabeled data. ‣ Brown clustering: dependency parsing [Koo+ 2008]; NER & chunking [Turian+ 2010]; dependency parsing on web text → our work. ‣ Word embedding: NER & chunking [Turian+ 2010]; dependency parsing on web text → our work. ‣ A preliminary investigation into incorporating deep structure into parsing.
Input: “John saw Mary.” → features (p-word: “John”, p-pos: NN, c-word: “saw”, c-pos: VB, ...) → Perceptron / MIRA → score of the edge “John → saw” → parse (get the max. spanning tree) → update weights. ➡ Incorporate deep structures into the edge scorer.
‣ 1st layer: train multiple different perceptrons (P1, P2, P3, ...) on permuted (shuffled) copies of the training data. Because the data is shuffled, each perceptron is different; they can be seen as “samples from a true perceptron”, giving better generalization (important for web text). ‣ Once 1st-layer training is done, train the perceptron P in the 2nd layer, taking the 1st-layer scores as input and producing the final score (layer-by-layer training). ‣ Basically ensemble learning; similar to the Bayes Point Machine. (A toy sketch of the two-layer scheme follows.)
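The sketch below illustrates the two-layer, layer-by-layer idea on an ordinary classification task with scikit-learn perceptrons, which keeps it self-contained; the dataset, the number of first-layer perceptrons, and the use of `decision_function` scores as second-layer features are assumptions standing in for the parser's edge-scoring perceptrons.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

# Toy stand-in for the parser's training data.
X, y = make_classification(n_samples=3000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.RandomState(0)
first_layer = []
for k in range(10):                          # 1st layer: perceptrons on permuted data
    order = rng.permutation(len(X_tr))       # each one sees the data in a different order
    first_layer.append(Perceptron(shuffle=False).fit(X_tr[order], y_tr[order]))

def layer1_scores(X):
    # Stack the 1st-layer scores; these become the 2nd-layer input.
    return np.column_stack([p.decision_function(X) for p in first_layer])

# 2nd layer: a single perceptron trained on the frozen 1st-layer scores.
second = Perceptron().fit(layer1_scores(X_tr), y_tr)
print("2nd-layer accuracy:", second.score(layer1_scores(X_te), y_te))
```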
[Chart: UAS relative to baseline (%) vs. number of perceptrons in the 1st layer (5, 10, 20), for the Answers, Emails, Newsgroups, Reviews, and Weblogs domains.] “Multiple perceptrons” improve the result. General trend: the more perceptrons, the better.
[Figure: SVM 1 takes the initial d-dimensional input and outputs prediction o_1; a new input x_2 is built from the initial input and a random projection (random projection matrix) of o_1, and is fed to SVM 2.] ‣ Scales like linear SVMs, yet exhibits better generalization ability than kernel-based SVMs. ‣ Intuitively, the random projection aims to push data from different classes towards different directions. → Let's incorporate this idea into dependency parsing. → Failed! (A toy sketch of the stacked-SVM idea follows.)
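A toy sketch of the stacked linear-SVM idea with scikit-learn. The dataset, the specific combination rule (adding a random projection of the first SVM's score to the original input), and all sizes are assumptions based on the description above, not the exact recipe of [Vinyals+].

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Layer 1: a plain linear SVM; o_1 is its (signed) prediction score.
svm1 = LinearSVC(C=1.0, max_iter=5000).fit(X_tr, y_tr)
o1_tr = svm1.decision_function(X_tr).reshape(-1, 1)
o1_te = svm1.decision_function(X_te).reshape(-1, 1)

# Layer 2: new input x_2 = initial input + random projection of o_1.
W = rng.randn(1, X.shape[1])          # random projection matrix
X2_tr = X_tr + o1_tr @ W
X2_te = X_te + o1_te @ W
svm2 = LinearSVC(C=1.0, max_iter=5000).fit(X2_tr, y_tr)

print("layer-1 accuracy:", svm1.score(X_te, y_te))
print("layer-2 accuracy:", svm2.score(X2_te, y_te))
```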
‣ Previous: word pair (John, saw) → features (p-word: “John”, p-pos: NN, c-word: “saw”, c-pos: VB, ...) → Perceptron / MIRA → score of the edge “John → saw” → parse (get the max. spanning tree) → update weights. ‣ Our idea: word pair (John, saw) → embeddings → Perceptron / MIRA → score of the edge “John → saw” → parse (get the max. spanning tree) → update weights & embeddings.
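A minimal sketch of what "update weights & embeddings" could look like for a single edge, with a perceptron-style rule: the edge score is a dot product between the weights and the concatenated head/child embeddings, so on an error both the weights and the involved embedding vectors are moved. The vocabulary, dimensionality, learning rate, and feature form here are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)
dim = 50
emb = {w: rng.randn(dim) * 0.1 for w in ["<ROOT>", "John", "saw", "Mary"]}
w = np.zeros(2 * dim)                       # weights over [head_embedding; child_embedding]

def edge_features(head, child):
    return np.concatenate([emb[head], emb[child]])

def edge_score(head, child):
    return w @ edge_features(head, child)

def update(gold_edge, pred_edge, lr=0.1):
    """Move the weights AND the involved embeddings so the gold edge scores higher."""
    global w
    if gold_edge == pred_edge:
        return
    w += lr * (edge_features(*gold_edge) - edge_features(*pred_edge))
    # The gradient of the score w.r.t. an embedding is the matching slice of w.
    for (head, child), sign in ((gold_edge, +1.0), (pred_edge, -1.0)):
        emb[head]  += sign * lr * w[:dim]
        emb[child] += sign * lr * w[dim:]

update(gold_edge=("saw", "John"), pred_edge=("Mary", "John"))
print(edge_score("saw", "John") > edge_score("Mary", "John"))   # gold edge now scores higher
```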
Learning deep structures. ‣ Unsupervised representation learning. ‣ Deep Learning in NLP. ‣ How to represent language? → Word embeddings. ‣ Structured prediction is the challenge.