Slide 1

Slide 1 text

Parsing the Web with Deep Learning - a shallow introduction. Sorami Hisamoto September 4, 2013 @ Rakuten Institute of Technology, New York

Slide 2

Slide 2 text

/48 Abstract ‣ Deep Learning ‣ What’s it all about, and why we should care. ‣ Word Embedding ‣ How to represent language for Deep Learning. ‣ Dependency Parsing ‣ How to incorporate Deep Learning into NLP. 2

Slide 3

Slide 3 text

/48 Abstract ‣ Deep Learning ‣ What’s it all about, and why we should care. ‣ Word Embedding ‣ How to represent language for Deep Learning. ‣ Dependency Parsing ‣ How to incorporate Deep Learning into NLP. 2 Our work!

Slide 4

Slide 4 text

/48 3 Deep Learning Word Embedding Dependency Parsing

Slide 5

Slide 5 text

/48 4 Deep Learning Word Embedding Dependency Parsing

Slide 6

Slide 6 text

/48 How to make a good predictor? ‣ Building good predictors on complex domains means learning complicated functions. ‣ One approach: composition of several layers of non-linearity. ➡ Deep Architecture 5

Slide 7

Slide 7 text

/48 How to make a good predictor? ‣ Building good predictors on complex domains means learning complicated functions. ‣ One approach: composition of several layers of non-linearity. ➡ Deep Architecture 5
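A minimal NumPy sketch of what “composition of several layers of non-linearity” means in practice; the layer sizes, the tanh non-linearity, and the random weights are illustrative choices of this write-up, not details from the slides.

```python
import numpy as np

rng = np.random.RandomState(0)

def layer(x, W, b):
    # One layer: an affine transform followed by a non-linearity.
    return np.tanh(x @ W + b)

# A "deep architecture" is simply the composition of several such layers.
sizes = [10, 8, 8, 8, 2]   # input -> three hidden layers -> output
params = [(rng.randn(m, n) * 0.1, np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def deep_predictor(x):
    for W, b in params:
        x = layer(x, W, b)
    return x

print(deep_predictor(rng.randn(5, 10)).shape)  # -> (5, 2)
```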

Slide 8

Slide 8 text

/48 Why “Deep”? ‣ Parts of the human brain (e.g. the visual cortex) have a deep architecture. ‣ Given the same number of units, a deeper architecture is more expressive than a shallow one [Bishop 1995]. ‣ ... and other theoretical arguments in its favor. 6 Figure from [Bishop 1995] "Neural Networks for Pattern Recognition"

Slide 9

Slide 9 text

/48 But learning a deep architecture is difficult ... ‣ Previous approach doesn’t work well [Rumelhart+ 1986]. Initialize at random, then stochastic gradient descent (SGD). ‣ Poor result, very slow. Vanishing gradient problem. ‣ High representation power ↔ Difficult to learn. ➡ ... recent breakthrough of “Deep Learning”. 7

Slide 10

Slide 10 text

/48 But learning a deep architecture is difficult ... ‣ Previous approach doesn’t work well [Rumelhart+ 1986]. Initialize at random, then stochastic gradient descent (SGD). ‣ Poor result, very slow. Vanishing gradient problem. ‣ High representation power ↔ Difficult to learn. ➡ ... recent breakthrough of “Deep Learning”. 7
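A hedged toy illustration of the vanishing gradient problem mentioned above: with sigmoid units, every layer multiplies the backpropagated error by a derivative of at most 0.25, so in a randomly initialized deep net the gradient reaching the lower layers shrinks roughly geometrically. The depth, width, and weight scale below are arbitrary; only the shrinking trend matters.

```python
import numpy as np

rng = np.random.RandomState(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_layers, width = 10, 50
Ws = [rng.randn(width, width) * 0.1 for _ in range(n_layers)]

# Forward pass, remembering each layer's activations.
h, acts = rng.randn(width), []
for W in Ws:
    h = sigmoid(W @ h)
    acts.append(h)

# Backward pass with a unit error at the top: the norm keeps shrinking.
grad = np.ones(width)
for W, a in zip(reversed(Ws), reversed(acts)):
    grad = W.T @ (grad * a * (1 - a))   # chain rule through sigmoid and weights
    print(f"gradient norm: {np.linalg.norm(grad):.2e}")
```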

Slide 11

Slide 11 text

/48 Deep Learning in action. ‣ Shown very effective in various areas (especially in vision & sound). ‣ #1 @ recent competitions in Image Recognition, Sound Recognition, Molecular Activity Prediction, etc. ‣ Google/Microsoft/Baidu’s voice recognition systems, and Apple’s Siri use deep learning algorithms. ‣ Used in Google Street View to recognize human faces. ‣ etc. 8

Slide 12

Slide 12 text

/48 ... and it’s HOT! ‣ Workshops at NIPS, ICML, ACL, CVPR, etc. ‣ Baidu opened a Deep Learning research lab in Silicon Valley. ‣ Google acquired the company of Geoffrey Hinton (one of the originators of deep learning). ‣ Google’s Jeff Dean is now working on Deep Learning. ‣ etc. 9

Slide 13

Slide 13 text

/48 WIRED UK June 26, 2012

Slide 14

Slide 14 text

/48 WIRED UK June 26, 2012

Slide 15

Slide 15 text

/48 MIT Technology Review April 23, 2013

Slide 16

Slide 16 text

/48 12 The New Yorker November 25, 2012

Slide 17

Slide 17 text

/48 The New York Times November 23, 2012

Slide 18

Slide 18 text

/48 The New York Times November 23, 2012 Too much HYPE! (similar to “Big Data”...)

Slide 19

Slide 19 text

/48 So, what’s “Deep Learning” anyway? ‣ Two different meanings: ‣ Traditional: A model with many layers (e.g. a neural network), trained in a layer-wise way. ‣ New: Unsupervised feature representation learning, at successively higher levels. ‣ First conference on deep learning: “International Conference on Learning Representations” (ICLR 2013). 14

Slide 20

Slide 20 text

/48 So, what’s “Deep Learning” anyway? ‣ Two different meanings: ‣ Traditional: A model with many layers (e.g. a neural network), trained in a layer-wise way. ‣ New: Unsupervised feature representation learning, at successively higher levels. ‣ First conference on deep learning: “International Conference on Learning Representations” (ICLR 2013). 14

Slide 21

Slide 21 text

/48 15 Figure from [Bengio 2009] "Learning Deep Architectures for AI" Figure from Andrew Ng's presentation at Microsoft Research Faculty Summit 2013 research.microsoft.com/en-us/events/fs2013/andrew-ng_machinelearning.pdf

Slide 22

Slide 22 text

/48 ... and how does it work? ‣ One key point is a good layer-by-layer initialization (pre-training) with a local unsupervised criterion. ‣ Two main approaches: ‣ Deep Belief Nets (DBN) [Hinton+ 2006] Initialize by stacking Restricted Boltzmann Machines, fine-tune with the up-down algorithm. ‣ Stacked Autoencoders (SAE) [Bengio+ 2007] [Ranzato+ 2007] Initialize by stacking autoencoders, fine-tune with gradient descent. ‣ SAE is simpler, but DBN performs better. 16

Slide 23

Slide 23 text

/48 ... and how does it work? ‣ One key point is a good layer-by-layer initialization (pre-training) with a local unsupervised criterion. ‣ Two main approaches: ‣ Deep Belief Nets (DBN) [Hinton+ 2006] Initialize by stacking Restricted Boltzmann Machines, fine-tune with the up-down algorithm. ‣ Stacked Autoencoders (SAE) [Bengio+ 2007] [Ranzato+ 2007] Initialize by stacking autoencoders, fine-tune with gradient descent. ‣ SAE is simpler, but DBN performs better. 16

Slide 24

Slide 24 text

/48 Stacked (denoising) autoencoders. ‣ Autoencoder: basically PCA with non-linearity. 17 Figure from Pascal Vincent's presentation "Deep Learning with Denoising Autoencoders" http://www.iro.umontreal.ca/~lisa/seminaires/25-03-2008-2.pdf
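To make the autoencoder idea concrete, here is a minimal single-layer denoising autoencoder in NumPy: corrupt the input, encode, decode, and reduce the reconstruction error of the clean input. The toy data, tied weights, and hyperparameters are assumptions for illustration; stacking repeats the same procedure on the hidden codes of the previous layer.

```python
import numpy as np

rng = np.random.RandomState(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

X = (rng.rand(500, 20) > 0.5).astype(float)      # toy binary data
n_in, n_hid, lr, noise = 20, 8, 0.1, 0.3

W = rng.randn(n_in, n_hid) * 0.1                 # tied weights: decoder uses W.T
b_h, b_o = np.zeros(n_hid), np.zeros(n_in)

for epoch in range(50):
    x_tilde = X * (rng.rand(*X.shape) > noise)   # masking corruption
    h = sigmoid(x_tilde @ W + b_h)               # encode
    x_hat = sigmoid(h @ W.T + b_o)               # decode
    # Gradients of the squared reconstruction error w.r.t. the parameters.
    d_out = (x_hat - X) * x_hat * (1 - x_hat)
    d_hid = (d_out @ W) * h * (1 - h)
    W -= lr * (x_tilde.T @ d_hid + d_out.T @ h) / len(X)
    b_h -= lr * d_hid.mean(axis=0)
    b_o -= lr * d_out.mean(axis=0)

print("mean reconstruction error:", np.mean((x_hat - X) ** 2))
```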

Slide 25

Slide 25 text

/48 Why does unsupervised pre-training help? ‣ Hypotheses (shown empirically) [Erhan+ 2010]: 1. Regularization: parameters don’t move much. 2. Good initialization: the parameters don’t get stuck in local optima! 18

Slide 26

Slide 26 text

/48 Why does unsupervised pre-training help? ‣ Hypotheses (shown empirically) [Erhan+ 2010]: 1. Regularization: parameters don’t move much. 2. Good initialization: the parameters don’t get stuck in local optima! 18

Slide 27

Slide 27 text

/48 Why does unsupervised pre-training help? ‣ Hypotheses (shown empirically) [Erhan+ 2010]: 1. Regularization: parameters don’t move much. 2. Good initialization: the parameters don’t get stuck in local optima! 18

Slide 28

Slide 28 text

/48 Improving, non-stop. ‣ “The area is developing extremely fast, and what we knew last year may no longer be "true" this year.” - Kevin Duh ‣ Group theory [Mallat]. ‣ Supervised approach: Stacked SVM [Vinyals+]. ‣ Novel applications: 3-D object classification [Socher+], text-image modeling [Srivastava&Salakhutdinov], video [Zou+], computational biology [Di Lena+], etc... ‣ etc. 19

Slide 29

Slide 29 text

/48 Deep Learning and NLP. ‣ Still not much deep learning in NLP (...yet?). ‣ e.g. Sentiment classification [Glorot+ 2011]: classification with term features, not much of a linguistic aspect. ‣ Success in vision (pixels) and sound (digital waves). NLP: how do we represent the text? ‣ Many NLP tasks go beyond classification: structured prediction. 20

Slide 30

Slide 30 text

/48 Deep Learning and NLP. ‣ Still not much deep learning in NLP (...yet?). ‣ e.g. Sentiment classification [Glorot+ 2011]: classification with term features, not much of a linguistic aspect. ‣ Success in vision (pixels) and sound (digital waves). NLP: how do we represent the text? ‣ Many NLP tasks go beyond classification: structured prediction. 20

Slide 31

Slide 31 text

/48 Deep Learning and NLP. ‣ Still not much deep learning in NLP (...yet?). ‣ e.g. Sentiment classification [Glorot+ 2011]: classification with term features, not much of a linguistic aspect. ‣ Success in vision (pixels) and sound (digital waves). NLP: how do we represent the text? ‣ Many NLP tasks go beyond classification: structured prediction. 20

Slide 32

Slide 32 text

/48 21 Deep Learning Word Embedding Dependency Parsing

Slide 33

Slide 33 text

/48 22 Deep Learning Word Embedding Dependency Parsing

Slide 34

Slide 34 text

/48 Mathematical representation of words. ‣ Vector Space Model [Turney&Pantel 2010]. ‣ LSA [Deerwester+ 1990], LDA [Blei+ 2003]. ➡ Word Embeddings with Neural Language Models (2008 onwards). 23

Slide 35

Slide 35 text

/48 Mathematical representation of words. ‣ Vector Space Model [Turney&Pantel 2010]. ‣ LSA [Deerwester+ 1990], LDA [Blei+ 2003]. ➡ Word Embeddings with Neural Language Models (2008 onwards). 23

Slide 36

Slide 36 text

/48 Word embeddings a.k.a. distributed representations ‣ Real-valued, dense, low-dimensional (e.g. 50-dimensional) word vectors. Background also from cognitive science. ‣ Task-independent, ideal for language modeling. Shown effective in POS tagging, NER, chunking, SRL. ‣ Often induced from a neural language model (next slide). 24

Slide 37

Slide 37 text

/48 Neural Language Model (NLM) [Bengio+ 2001] 25 Figure from [Bengio+ 2001] "A Neural Probabilistic Language Model"

Slide 38

Slide 38 text

/48 Neural Language Model (NLM) [Bengio+ 2001] 25

Slide 39

Slide 39 text

/48 Neural Language Model (NLM) [Bengio+ 2001] 25 Neural Language Model Hidden Layers ↓ Word Embeddings

Slide 40

Slide 40 text

/48 Training embeddings with NLM [Collobert+ 2011] ‣ Idea of “Implicit Negative Evidence” [Smith&Eisner 2005]. ‣ Corrupt the original phrase by replacing a word with a random one. ‣ Train the model so that score(original) > score(corrupted). 26 Figure from Richard Socher et al.'s ACL 2012 tutorial "Deep Learning for NLP (without Magic)" http://nlp.stanford.edu/~socherr/SocherBengioManning-DeepLearning-ACL2012-20120707-NoMargin.pdf

Slide 41

Slide 41 text

/48 Training embeddings with NLM [Collobert+ 2011] ‣ Idea of “Implicit Negative Evidence” [Smith&Eisner 2005]. ‣ Corrupt the original phrase by replacing a word with a random one. ‣ Train the model so that score(original) > score(corrupted). 26 Unsupervised!
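A sketch of this corruption-based training with the scorer stripped down to a dot product (in [Collobert+ 2011] the scorer is a neural network over a window of embeddings); the toy vocabulary, window size, and learning rate are assumptions. The point is the hinge loss enforcing score(original) > score(corrupted) + 1 while updating both the scorer and the embeddings.

```python
import numpy as np

rng = np.random.RandomState(0)
vocab = ["the", "cat", "sat", "on", "mat", "banana"]
V, dim, win = len(vocab), 10, 3
E = rng.randn(V, dim) * 0.1          # word embeddings (learned)
w = rng.randn(win * dim) * 0.1       # scorer weights (stand-in for a neural net)

def score(word_ids):
    return w @ E[word_ids].reshape(-1)

windows = [[0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 0, 4]]   # toy windows of word ids
lr = 0.05
for epoch in range(100):
    for window in windows:
        corrupted = list(window)
        corrupted[win // 2] = rng.randint(V)             # replace the centre word
        if 1.0 - score(window) + score(corrupted) > 0:   # hinge margin violated
            x_good = E[window].reshape(-1)
            x_bad = E[corrupted].reshape(-1)
            w += lr * (x_good - x_bad)                       # push good above bad
            np.add.at(E, window, lr * w.reshape(win, dim))   # raise original's score
            np.add.at(E, corrupted, -lr * w.reshape(win, dim))  # lower corruption's score
```

No labels are needed because the “negative” examples are generated automatically, which is the sense in which the slide calls this unsupervised.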

Slide 42

Slide 42 text

/48 27 Figures by Masashi Tsubaki

Slide 43

Slide 43 text

/48 27

Slide 44

Slide 44 text

/48 27 Figures by Masashi Tsubaki

Slide 45

Slide 45 text

/48 27 ‣ Similar to a normal NN, but we learn the x (the word vectors) as well. ‣ Word vectors are initialized randomly and learned by SGD, BFGS, etc.

Slide 46

Slide 46 text

/48 ‣ Multiple Word Prototypes [Huang+ 2012]. ‣ multiple vectors for words with multiple meanings. ‣ “Modeling and Learning Semantic Co-Compositionality through Prototype Projections and Neural Networks” [Tsubaki+ 2013]. To appear in EMNLP2013. 28 Progress in NLMs: some examples

Slide 47

Slide 47 text

/48 ‣ Multiple Word Prototypes [Huang+ 2012]. ‣ multiple vectors for words with multiple meanings. ‣ “Modeling and Learning Semantic Co-Compositionality through Prototype Projections and Neural Networks” [Tsubaki+ 2013]. To appear in EMNLP2013. 28 Progress in NLMs: some examples Figure from socher.org/index.php/Main/ImprovingWordRepresentationsViaGlobalContextAndMultipleWordPrototypes

Slide 48

Slide 48 text

/48 ‣ Multiple Word Prototypes [Huang+ 2012]. ‣ multiple vectors for words with multiple meanings. ‣ “Modeling and Learning Semantic Co-Compositionality through Prototype Projections and Neural Networks” [Tsubaki+ 2013]. To appear in EMNLP2013. 28 Progress in NLMs: some examples

Slide 49

Slide 49 text

/48 ‣ Multiple Word Prototypes [Huang+ 2012]. ‣ multiple vectors for words with multiple meanings. ‣ “Modeling and Learning Semantic Co-Compositionality through Prototype Projections and Neural Networks” [Tsubaki+ 2013]. To appear in EMNLP2013. 28 Progress in NLMs: some examples

Slide 50

Slide 50 text

/48 King - Man + Woman = Queen? [Mikolov+ 2013] ‣ Word vectors capture many linguistic regularities; ‣ One example; ‣ vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome'). ‣ vector('king') - vector('man') + vector('woman') is close to vector('queen'). 29

Slide 51

Slide 51 text

/48 King - Man + Woman = Queen? [Mikolov+ 2013] ‣ Word vectors capture many linguistic regularities; ‣ One example; ‣ vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome'). ‣ vector('king') - vector('man') + vector('woman') is close to vector('queen'). 29

Slide 52

Slide 52 text

/48 King - Man + Woman = Queen? [Mikolov+ 2013] ‣ Word vectors capture many linguistic regularities; ‣ One example; ‣ vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome'). ‣ vector('king') - vector('man') + vector('woman') is close to vector('queen'). 29

Slide 53

Slide 53 text

/48 King - Man + Woman = Queen? [Mikolov+ 2013] ‣ Word vectors capture many linguistic regularities; ‣ One example; ‣ vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome'). ‣ vector('king') - vector('man') + vector('woman') is close to vector('queen'). 29
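The analogy examples above amount to vector arithmetic followed by a nearest-neighbour search under cosine similarity. A sketch with made-up vectors (real experiments use embeddings trained on large corpora, so the output here is meaningless; only the procedure is the point):

```python
import numpy as np

rng = np.random.RandomState(0)
words = ["king", "man", "woman", "queen", "paris", "france", "italy", "rome"]
emb = {w: rng.randn(50) for w in words}   # stand-in for trained embeddings

def analogy(a, b, c):
    """Word whose vector is closest (cosine) to vector(a) - vector(b) + vector(c)."""
    target = emb[a] - emb[b] + emb[c]
    best, best_sim = None, -np.inf
    for word, v in emb.items():
        if word in (a, b, c):
            continue
        sim = (v @ target) / (np.linalg.norm(v) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

print(analogy("king", "man", "woman"))      # "queen" with real embeddings
print(analogy("paris", "france", "italy"))  # "rome" with real embeddings
```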

Slide 54

Slide 54 text

/48 Publicly available resources ‣ Joseph Turian provides open-source NLM code and some pre-trained embeddings. ‣ In August 2013, Google released word2vec, a tool to train word embeddings. 30
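Both kinds of resources are typically distributed as plain-text files with one word per line followed by its vector components; the exact format of any particular release should be checked, so treat this small loader (and the file name) as an assumption-laden sketch.

```python
import numpy as np

def load_embeddings(path):
    """Read lines of the form 'word v1 v2 ... vd' into a {word: vector} dict."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) < 2:
                continue   # skip blank lines or a possible "<vocab> <dim>" header
            table[parts[0]] = np.asarray(parts[1:], dtype=float)
    return table

# embeddings = load_embeddings("vectors.txt")   # hypothetical file name
```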

Slide 55

Slide 55 text

/48 31 Deep Learning Word Embedding Dependency Parsing

Slide 56

Slide 56 text

/48 32 Deep Learning Word Embedding Dependency Parsing

Slide 57

Slide 57 text

/48 Dependency parsing 33 John hit the ball with the bat

Slide 58

Slide 58 text

/48 Dependency parsing 33 Determine each word’s head in a sentence. John hit the ball with the bat

Slide 59

Slide 59 text

/48 Dependency parsing 33 Determine each word’s head in a sentence. John hit the ball with the bat

Slide 60

Slide 60 text

/48 Dependency parsing 33 “hit” is head of “John” Determine each word’s head in a sentence. John hit the ball with the bat

Slide 61

Slide 61 text

/48 Dependency parsing 33 “hit” is head of “John” Determine each word’s head in a sentence. John hit the ball with the bat

Slide 62

Slide 62 text

/48 Dependency parsing 33 “hit” is head of “John” Determine each word’s head in a sentence. John hit the ball with the bat Core process for applications such as Information Retrieval, Machine Translation, etc.

Slide 63

Slide 63 text

/48 Graph-based approach [McDonald+ 2005] 34 Non-Projective Dependency Parsing using Spanning Tree Algorithms “John saw Mary.” Input sentence

Slide 64

Slide 64 text

/48 Graph-based approach [McDonald+ 2005] 34 Non-Projective Dependency Parsing using Spanning Tree Algorithms “John saw Mary.” Input sentence All dependency edges with costs Step 1

Slide 65

Slide 65 text

/48 Graph-based approach [McDonald+ 2005] 34 Non-Projective Dependency Parsing using Spanning Tree Algorithms “John saw Mary.” Input sentence All dependency edges with costs Spanning tree with highest costs Step 1 Step 2

Slide 66

Slide 66 text

/48 Graph-based approach [McDonald+ 2005] 34 Non-Projective Dependency Parsing using Spanning Tree Algorithms “John saw Mary.” Input sentence All dependency edges with costs Spanning tree with highest costs By a graph algorithm (Chu-Liu-Edmonds) Step 1 Step 2

Slide 67

Slide 67 text

/48 Graph-based approach [McDonald+ 2005] 34 Non-Projective Dependency Parsing using Spanning Tree Algorithms “John saw Mary.” Input sentence All dependency edges with costs Spanning tree with highest costs By a graph algorithm (Chu-Liu-Edmonds) How to calculate those scores? Step 1 Step 2

Slide 68

Slide 68 text

/48 Calculating edge scores 35 ‣ Input: Features of a word pair. ‣ Output: Edge score between word pair. ➡ Can we make use of unlabeled data? Word Pair (John, saw) Features p-word: “John”, p-pos: NN, c-word: “saw”, c-pos: VB, ... Perceptron / MIRA Score of the edge “John → saw” Parse get max. spanning tree Update Weights get scores of all edges

Slide 69

Slide 69 text

/48 Calculating edge scores 35 ‣ Input: Features of a word pair. ‣ Output: Edge score between word pair. ➡ Can we make use of unlabeled data? Word Pair (John, saw) Features p-word: “John”, p-pos: NN, c-word: “saw”, c-pos: VB, ... Perceptron / MIRA Score of the edge “John → saw” Parse get max. spanning tree Update Weights get scores of all edges

Slide 70

Slide 70 text

/48 Calculating edge scores 35 ‣ Input: Features of a word pair. ‣ Output: Edge score between word pair. ➡ Can we make use of unlabeled data? Word Pair (John, saw) Features p-word: “John”, p-pos: NN, c-word: “saw”, c-pos: VB, ... Perceptron / MIRA Score of the edge “John → saw” Parse get max. spanning tree Update Weights get scores of all edges Add extra features from unlabeled data (e.g. cluster info.)

Slide 71

Slide 71 text

/48 Calculating edge scores 35 ‣ Input: Features of a word pair. ‣ Output: Edge score between word pair. ➡ Can we make use of unlabeled data? Word Pair (John, saw) Features p-word: “John”, p-pos: NN, c-word: “saw”, c-pos: VB, ... Perceptron / MIRA Score of the edge “John → saw” Parse get max. spanning tree Update Weights get scores of all edges Add extra features from unlabeled data (e.g. cluster info.)
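A sketch of the arc-factored scoring described above: each head-modifier pair is mapped to binary features, the features are scored by a weight vector (which a perceptron or MIRA would learn), and decoding then searches for the highest-scoring tree with Chu-Liu-Edmonds. The feature templates and the hand-set weight below are simplified illustrations, not the full [McDonald+ 2005] feature set, and the final tree search is only indicated.

```python
import numpy as np
from collections import defaultdict

sentence = [("ROOT", "ROOT"), ("John", "NN"), ("saw", "VB"), ("Mary", "NN")]

def edge_features(head_i, mod_i):
    hw, hp = sentence[head_i]
    mw, mp = sentence[mod_i]
    # Simplified templates over the head (parent) and modifier (child).
    return [f"p-word={hw}", f"p-pos={hp}", f"c-word={mw}", f"c-pos={mp}",
            f"p-pos,c-pos={hp},{mp}", f"dist={head_i - mod_i}"]

weights = defaultdict(float)          # learned by (averaged) perceptron or MIRA
weights["p-pos,c-pos=VB,NN"] = 1.0    # pretend training preferred verb -> noun arcs

def edge_score(head_i, mod_i):
    return sum(weights[f] for f in edge_features(head_i, mod_i))

n = len(sentence)
scores = np.full((n, n), -np.inf)
for h in range(n):
    for m in range(1, n):             # ROOT (index 0) is never a modifier
        if h != m:
            scores[h, m] = edge_score(h, m)

# Decoding should find the maximum spanning tree over `scores` rooted at ROOT
# (Chu-Liu-Edmonds). Greedily taking the best head per word, as below, is only
# a rough stand-in that ignores the tree constraint.
print({m: int(np.argmax(scores[:, m])) for m in range(1, n)})
```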

Slide 72

Slide 72 text

Word Representations for Dependency Parsing Our work, presented at NLP2013.

Slide 73

Slide 73 text

/48 Semi-supervised learning with word representations ‣ [Koo+ 2008] Brown clusters as extra features for dependency parsing. → Simple, yet effective. ‣ [Turian+ 2010] Brown clusters & embeddings as extra features for chunking & NER. → Effective, but Brown clusters work better than embeddings. ‣ [Collobert+ 2011] Multitask learning (POS, chunking, NER, SRL, etc.) with word embeddings. Minimal feature engineering. → State-of-the-art results. 37

Slide 74

Slide 74 text

/48 Semi-supervised learning with word representations ‣ [Koo+ 2008] Brown clusters as extra features for dependency parsing. → Simple, yet effective. ‣ [Turian+ 2010] Brown clusters & embeddings as extra features for chunking & NER. → Effective, but Brown clusters work better than embeddings. ‣ [Collobert+ 2011] Multitask learning (POS, chunking, NER, SRL, etc.) with word embeddings. Minimal feature engineering. → State-of-the-art results. 37

Slide 75

Slide 75 text

/48 Semi-supervised learning with word representations ‣ [Koo+ 2008] Brown clusters as extra features for dependency parsing. → Simple, yet effective. ‣ [Turian+ 2010] Brown clusters & embeddings as extra features for chunking & NER. → Effective, but Brown clusters work better than embeddings. ‣ [Collobert+ 2011] Multitask learning (POS, chunking, NER, SRL, etc.) with word embeddings. Minimal feature engineering. → State-of-the-art results. 37

Slide 76

Slide 76 text

/48 Word representations for parsing the web ‣ Unsupervised word representation features for dependency parsing on web text. ‣ Web domain: lots of unlabeled data. 38
                 | Dependency parsing | NER & Chunking | Dependency parsing (web text)
Brown clustering | [Koo+ 2008]        | [Turian+ 2010] | our work
Word embedding   | -                  | [Turian+ 2010] | our work

Slide 77

Slide 77 text

/48 Word representations for parsing the web ‣ Unsupervised word representation features for dependency parsing on web text. ‣ Web domain: lots of unlabeled data. 38
                 | Dependency parsing | NER & Chunking | Dependency parsing (web text)
Brown clustering | [Koo+ 2008]        | [Turian+ 2010] | our work
Word embedding   | -                  | [Turian+ 2010] | our work
Preliminary investigation for incorporating deep structure into parsing.
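In code, “unsupervised word representation features” just means extending the edge feature function with features read off the word representations; the cluster tables, prefix lengths, and feature names below are illustrative assumptions, and the rest of the parser is unchanged.

```python
# Hypothetical lookups induced from unlabeled web text.
brown_cluster = {"John": "1110", "saw": "0101", "Mary": "1111"}   # bit-string paths
embedding_cluster = {"John": 7, "saw": 42, "Mary": 7}             # k-means over embeddings

def word_rep_features(head_word, mod_word):
    feats = []
    for side, word in (("p", head_word), ("c", mod_word)):
        bits = brown_cluster.get(word, "")
        for k in (2, 4):                         # cluster-path prefixes of two lengths
            if len(bits) >= k:
                feats.append(f"{side}-brown{k}={bits[:k]}")
        if word in embedding_cluster:
            feats.append(f"{side}-embcluster={embedding_cluster[word]}")
    return feats

# Appended to the baseline templates before scoring, e.g.:
# edge_features(h, m) + word_rep_features(sentence[h][0], sentence[m][0])
print(word_rep_features("John", "saw"))
```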

Slide 78

Slide 78 text

/48 Do they help parsing the web? [Hisamoto+ 2013] 39 ‣ Word embeddings & Brown clusters as extra features for dependency parsing. ‣ Data: Google Web Treebank.

Slide 79

Slide 79 text

/48 Do they help parsing the web? [Hisamoto+ 2013] 39 [Chart: results for Brown, 50 cl.; C&W, 50 cl.; and C&W, 1000 cl., each with Gold POS and Predicted POS, across the Answers, Emails, Newsgroups, Reviews, and Weblogs domains.] ‣ Word embeddings & Brown clusters as extra features for dependency parsing. ‣ Data: Google Web Treebank.

Slide 80

Slide 80 text

/48 Do they help parsing the web? [Hisamoto+ 2013] 39 [Chart: results for Brown, 50 cl.; C&W, 50 cl.; and C&W, 1000 cl., each with Gold POS and Predicted POS, across the Answers, Emails, Newsgroups, Reviews, and Weblogs domains.] ‣ Word embeddings & Brown clusters as extra features for dependency parsing. ‣ Data: Google Web Treebank. Word reps. helped with Predicted POS data, but not with Gold POS data.

Slide 81

Slide 81 text

/48 Do they help parsing the web? [Hisamoto+ 2013] 39 [Chart: results for Brown, 50 cl.; C&W, 50 cl.; and C&W, 1000 cl., each with Gold POS and Predicted POS, across the Answers, Emails, Newsgroups, Reviews, and Weblogs domains.] POS tags are very strong information for dependency parsing; e.g., the verb is normally the head. ‣ Word embeddings & Brown clusters as extra features for dependency parsing. ‣ Data: Google Web Treebank. Word reps. helped with Predicted POS data, but not with Gold POS data.

Slide 82

Slide 82 text

Dependency Parsing with Deep Structures Our next ideas.

Slide 83

Slide 83 text

/48 Calculating edge scores, revisited. 41 Word Pair (John, saw) Features p-word: “John”, p-pos: NN, c-word: “saw”, c-pos: VB, ... Perceptron / MIRA Score of the edge “John → saw” Parse get max. spanning tree Update Weights “John saw Mary.”

Slide 84

Slide 84 text

/48 Calculating edge scores, revisited. 41 Word Pair (John, saw) Features p-word: “John”, p-pos: NN, c-word: “saw”, c-pos: VB, ... Perceptron / MIRA Score of the edge “John → saw” Parse get max. spanning tree Update Weights “John saw Mary.”

Slide 85

Slide 85 text

/48 Calculating edge scores, revisited. 41 Word Pair (John, saw) Features p-word: “John”, p-pos: NN, c-word: “saw”, c-pos: VB, ... Perceptron / MIRA Score of the edge “John → saw” Incorporate deep structures Parse get max. spanning tree Update Weights “John saw Mary.”

Slide 86

Slide 86 text

/48 Preliminary idea: use multiple perceptrons 42 P1 Training Data

Slide 87

Slide 87 text

/48 Preliminary idea: use multiple perceptrons 42 P1 Training Data Train perceptron “P1”

Slide 88

Slide 88 text

/48 Preliminary idea: use multiple perceptrons 42 P1 Training Data

Slide 89

Slide 89 text

/48 Preliminary idea: use multiple perceptrons 42 P1 P2 Training Data permuted data

Slide 90

Slide 90 text

/48 Preliminary idea: use multiple perceptrons 42 P1 P2 P3 Training Data permuted data permuted data

Slide 91

Slide 91 text

/48 Preliminary idea: use multiple perceptrons 42 P1 P2 P3 ... Training Data permuted data permuted data

Slide 92

Slide 92 text

/48 Preliminary idea: use multiple perceptrons 42 1st layer: train multiple different perceptrons with shuffled data. P1 P2 P3 ... Training Data permuted data permuted data

Slide 93

Slide 93 text

/48 Preliminary idea: use multiple perceptrons 42 1st layer: train multiple different perceptrons with shuffled data. P1 P2 P3 ... Training Data permuted data permuted data Data is shuffled, so each perceptron is different.

Slide 94

Slide 94 text

/48 Preliminary idea: use multiple perceptrons 42 1st layer: train multiple different perceptrons with shuffled data. P1 P2 P3 ... Training Data permuted data permuted data Data is shuffled, so each perceptron is different. Can see them as “samples from a true perceptron”. Better generalization (important for web text)

Slide 95

Slide 95 text

/48 Preliminary idea: use multiple perceptrons 42 1st layer: train multiple different perceptrons with shuffled data. P1 P2 P3 ... Training Data score permuted data score score permuted data Data is shuffled, so each perceptron is different. Can see them as “samples from a true perceptron”. Better generalization (important for web text)

Slide 96

Slide 96 text

/48 Preliminary idea: use multiple perceptrons 42 1st layer: train multiple different perceptrons with shuffled data. P1 P2 P3 ... P Training Data score permuted data score score permuted data Data is shuffled, so each perceptron is different. Can see them as “samples from a true perceptron”. Better generalization (important for web text)

Slide 97

Slide 97 text

/48 Preliminary idea: use multiple perceptrons 42 1st layer: train multiple different perceptrons with shuffled data. P1 P2 P3 ... P Training Data score Final Score permuted data score score permuted data Data is shuffled, so each perceptron is different. Can see them as “samples from a true perceptron”. Better generalization (important for web text)

Slide 98

Slide 98 text

/48 Preliminary idea: use multiple perceptrons 42 2nd layer: after 1st-layer training is done, train a perceptron in the 2nd layer (layer-by-layer training), taking the scores from the 1st layer as input. 1st layer: train multiple different perceptrons with shuffled data. P1 P2 P3 ... P Training Data score Final Score permuted data score score permuted data Data is shuffled, so each perceptron is different. Can see them as “samples from a true perceptron”. Better generalization (important for web text)

Slide 99

Slide 99 text

/48 Preliminary idea: use multiple perceptrons 42 2nd layer: after 1st-layer training is done, train a perceptron in the 2nd layer (layer-by-layer training), taking the scores from the 1st layer as input. 1st layer: train multiple different perceptrons with shuffled data. P1 P2 P3 ... P Training Data score Final Score permuted data score score permuted data Data is shuffled, so each perceptron is different. Can see them as “samples from a true perceptron”. Better generalization (important for web text) Basically Ensemble Learning. Similar to a Bayes Point Machine.
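A minimal sketch of the two-layer idea, with plain binary perceptrons on toy vectors standing in for the structured parsing perceptrons (so it shows the shape of the architecture, not the parser): the first layer is several perceptrons trained on differently shuffled passes over the data; the second layer is one perceptron trained afterwards, layer by layer, on the first layer's scores.

```python
import numpy as np

rng = np.random.RandomState(0)

def train_perceptron(X, y, epochs=10):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):      # shuffling makes each run different
            if y[i] * (w @ X[i]) <= 0:
                w += y[i] * X[i]
    return w

# Toy, roughly linearly separable data.
X = rng.randn(200, 5)
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]) + 0.1 * rng.randn(200))

# 1st layer: several perceptrons, each seeing the data in a different order.
first_layer = [train_perceptron(X, y) for _ in range(5)]
scores = np.stack([X @ w for w in first_layer], axis=1)   # scores become features

# 2nd layer: a single perceptron over the 1st-layer scores (layer-by-layer training).
w2 = train_perceptron(scores, y)
print("training accuracy:", (np.sign(scores @ w2) == y).mean())
```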

Slide 100

Slide 100 text

/48 Results 43 [Chart: UAS relative to baseline (%) vs. number of perceptrons in the 1st layer (2, 3, 5, 10, 20), for the Answers, Emails, Newsgroups, Reviews, and Weblogs domains.]

Slide 101

Slide 101 text

/48 Results 43 [Chart: UAS relative to baseline (%) vs. number of perceptrons in the 1st layer (2, 3, 5, 10, 20), for the Answers, Emails, Newsgroups, Reviews, and Weblogs domains.] “Multiple perceptrons” improve the result.

Slide 102

Slide 102 text

/48 Results 43 [Chart: UAS relative to baseline (%) vs. number of perceptrons in the 1st layer (2, 3, 5, 10, 20), for the Answers, Emails, Newsgroups, Reviews, and Weblogs domains.] “Multiple perceptrons” improve the result. General trend: the more perceptrons, the better.

Slide 103

Slide 103 text

/48 Make it deeper! ‣ Beyond 2 layers ... 44

Slide 104

Slide 104 text

/48 Make it deeper! ‣ Beyond 2 layers ... 44 Failed!

Slide 105

Slide 105 text

/48 Recursive SVM with random projection [Vinyals+ 2012] 45 [Diagram: input d → SVM 1 → prediction o_1; new input x_2 → SVM 2.] Scales like linear SVMs, and exhibits better generalization ability than kernel-based SVMs. Intuitively, the random projection aims to push data from different classes towards different directions.

Slide 106

Slide 106 text

/48 Recursive SVM with random projection [Vinyals+ 2012] 45 [Diagram: input d → SVM 1 → prediction o_1; new input x_2, built from the initial input and a random projection matrix applied to o_1 (the output of the first SVM) → SVM 2.] Scales like linear SVMs, and exhibits better generalization ability than kernel-based SVMs. Intuitively, the random projection aims to push data from different classes towards different directions.

Slide 107

Slide 107 text

/48 Recursive SVM with random projection [Vinyals+ 2012] 45 [Diagram: input d → SVM 1 → prediction o_1; new input x_2, built from the initial input and a random projection matrix applied to o_1 (the output of the first SVM) → SVM 2.] Scales like linear SVMs, and exhibits better generalization ability than kernel-based SVMs. Intuitively, the random projection aims to push data from different classes towards different directions. → Let’s incorporate this idea into dependency parsing.

Slide 108

Slide 108 text

/48 Recursive SVM with random projection [Vinyals+ 2012] 45 [Diagram: input d → SVM 1 → prediction o_1; new input x_2, built from the initial input and a random projection matrix applied to o_1 (the output of the first SVM) → SVM 2.] Scales like linear SVMs, and exhibits better generalization ability than kernel-based SVMs. Intuitively, the random projection aims to push data from different classes towards different directions. → Let’s incorporate this idea into dependency parsing. Failed!
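A sketch of the recursive construction, with scikit-learn's LinearSVC standing in for the SVMs; exactly how [Vinyals+ 2012] combine the original input with the random projection of the previous prediction (addition vs. concatenation, scaling) should be taken from the paper, so the combination rule and constants below are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)

# Toy binary classification data with a non-linear decision boundary.
X = rng.randn(300, 10)
y = (np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.3 * rng.randn(300) > 1.0).astype(int)

n_layers, beta = 3, 0.5
x_layer = X
for layer in range(n_layers):
    svm = LinearSVC(C=1.0, max_iter=5000).fit(x_layer, y)
    print(f"layer {layer + 1} training accuracy: {svm.score(x_layer, y):.3f}")
    o = svm.decision_function(x_layer)[:, None]   # this layer's prediction scores
    R = rng.randn(1, X.shape[1])                  # random projection matrix
    # New input for the next layer: the original input plus a random projection
    # of the previous layer's prediction (pushing the classes apart).
    x_layer = X + beta * (o @ R)
```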

Slide 109

Slide 109 text

/48 Next: Train embeddings & parse, together. 46 Word Pair (John, saw) Features p-word: “John”, p-pos: NN, c-word: “saw”, c-pos: VB, ... Perceptron / MIRA Score of the edge “John → saw” Parse get max. spanning tree Previous: Update weights. Update

Slide 110

Slide 110 text

/48 Next: Train embeddings & parse, together. 46 Word Pair (John, saw) Features p-word: “John”, p-pos: NN, c-word: “saw”, c-pos: VB, ... Perceptron / MIRA Score of the edge “John → saw” Parse get max. spanning tree Previous: Update weights. Our idea: Update weights & embeddings. Update

Slide 111

Slide 111 text

/48 Next: Train embeddings & parse, together. 46 Word Pair (John, saw) Embeddings Perceptron / MIRA Score of the edge “John → saw” Parse get max. spanning tree Update Word Pair (John, saw) Features p-word: “John”, p-pos: NN, c-word: “saw”, c-pos: VB, ... Perceptron / MIRA Score of the edge “John → saw” Parse get max. spanning tree Previous: Update weights. Our idea: Update weights & embeddings. Update
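A sketch of the "update weights & embeddings" idea: after decoding (the max-spanning-tree step), the same perceptron-style update that adjusts the feature weights also moves the embedding vectors of the words on the gold and predicted edges. The scoring form w·[e_head; e_mod], the toy vocabulary, and the learning rate are illustrative assumptions; the point is only where the extra embedding update slots into the training loop.

```python
import numpy as np

rng = np.random.RandomState(0)
dim = 4
emb = {word: rng.randn(dim) * 0.1 for word in ["ROOT", "John", "saw", "Mary"]}
w = rng.randn(2 * dim) * 0.1                     # edge-scoring weights

def edge_score(head, mod):
    # Score an edge from its head and modifier embeddings.
    return w @ np.concatenate([emb[head], emb[mod]])

def update(mod, gold_head, pred_head, lr=0.1):
    """Perceptron-style update for one edge after decoding is done."""
    global w
    if gold_head == pred_head:
        return
    phi_gold = np.concatenate([emb[gold_head], emb[mod]])
    phi_pred = np.concatenate([emb[pred_head], emb[mod]])
    w += lr * (phi_gold - phi_pred)              # usual weight update
    # Extra step proposed above: also move the embeddings so that the gold edge
    # scores higher and the wrongly predicted edge scores lower.
    emb[gold_head] += lr * w[:dim]
    emb[pred_head] -= lr * w[:dim]
    # (With richer features the modifier embedding would receive a gradient too;
    #  in this stripped-down scorer its contribution cancels between the edges.)

before = edge_score("saw", "John")
update("John", gold_head="saw", pred_head="Mary")
print(before, "->", edge_score("saw", "John"))
```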

Slide 112

Slide 112 text

/48 47 Deep Learning Word Embedding Dependency Parsing

Slide 113

Slide 113 text

/48 To summarize ... ‣ Deep Learning is exciting! ‣ Learning deep structures. ‣ Unsupervised representation learning. ‣ Deep Learning in NLP. ‣ How to represent language? → Word embeddings. ‣ Structured prediction is the challenge. 48